Today we wet our whistle by taking a deep dive into the architecture underlying a data lake! Jordan attempts to paint a picture of what a data lake consists of! Stay tuned!!!
Also, be sure to check out the new Data Couture YouTube page for all new content on Tuesdays and Thursdays at
To keep up with the podcast, be sure to visit our website at http://www.datacouture.org, follow us on Twitter @datacouturepod, and on Instagram @datacouturepodcast. And, if you’d like to help support future episodes, then consider becoming a patron at patreon.com/datacouture!
Welcome to Data Couture, the podcast covering data culture at work, at home, and on the go. I’m your host, Jordan Bohall. If you’d like to stay up to date with all things data and Data Couture, then head over to our social media pages. If you’d like to help support the show, then check out our Patreon at patreon.com/datacouture. Now, on to the show.
Welcome to Data Couture. On today’s Tech Talk, we’re going to be diving deep into what a data lake is, the architecture surrounding it, and why you need one for your analytics capabilities. Before we get to that, let me remind all of my faithful listeners that we have a new YouTube series out. I’m calling it Data Driven, and it comes out on Tuesdays and Thursdays. There’s already one loaded up on YouTube now, and another one will be coming out tomorrow. So that means that you guys get five full days of Data Couture content: Monday, Wednesday, and Friday for the podcast, and Tuesdays and Thursdays for the vlog series. So check that out. Be sure to like and subscribe, and hit that bell notification when you do. That way you’ll stay up to date on every vlog that I put out. All right, now on to the show.
Okay, welcome to the first part of the show. So today, like I mentioned, it’s going to be a deep dive into data lakes. On Monday’s show, we talked a little bit about OLTP types of databases, relational databases, and also a little about data lakes, but I really want to get up close and personal with what a data lake is, specifically the architecture of a data lake and why it is so beneficial to big data analytics. So first, let’s talk about what a data lake is. You can think of a data lake as a type of storage space, a storage repository that can store large amounts of data. And this data can be structured, it can be unstructured, it can be semi-structured. The point is, a data lake is a place to store every type of data. And then, importantly, you can store all of those different types of data in their native format, in their raw format. So you don’t need to do all sorts of processing before you can put it into some sort of hierarchical structure like you normally would in a relational data warehouse. And so, you know, these data lakes, they’re called data lakes because they’re very much like real lakes out in the world, IRL, so to speak. That’s what the cool... Anyways.
So just like a normal lake, you know, in a data lake you have all sorts of source systems that are flowing in. The source systems provide all sorts of data: structured, unstructured, and semi-structured types of data. And it can flow directly into your data lake without too much hassle, right? And so in this way, a data lake can potentially democratize data, because all sorts of data, in all of its wonderful native raw format, can be available to anyone across your organization, which means that your business owners, your business users, and your data scientists alike can draw all sorts of conclusions from the data, because they have access to literally everything across your organization. And then similarly,
you know, given how cheap it is to actually stand up some of the underlying technologies to do data warehousing, and also the relative cheapness of the storage and processing solutions that exist, data lakes don’t have to cost a lot of money to initially stand up. And you can, like I said, get all sorts of different data that you couldn’t otherwise in an OLTP or a relational-style data warehouse. And then finally, data lakes are characterized by their architecture, which we’ll get into more in a second. But it is a flat architecture. This means that every single data element in the data lake gets its own unique identifier. And then it’s tagged with some metadata information, and then it just sits there. Compare this to a relational data warehouse, where your architecture is very much tiered, and you have different levels of aggregation, different levels of granularity. That’s not the case in a data lake. You get the lowest level of granularity possible. It’s tagged and bagged, so to speak, and that’s where the data rests.
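To make the flat layout a bit more concrete, here is a minimal Python sketch. It is not how Hadoop or any particular product stores data; the `FlatLake` class, its methods, and the metadata fields (`source`, `format`, `ingested_at`, `size_bytes`) are hypothetical names chosen for illustration. It only shows the idea described above: every raw element gets its own unique identifier, is tagged with some metadata, and then just sits there in its native format.

```python
import uuid
from datetime import datetime, timezone


class FlatLake:
    """Toy in-memory stand-in for a flat data lake layout (hypothetical)."""

    def __init__(self):
        self._objects = {}   # object_id -> raw bytes, stored as-is
        self._metadata = {}  # object_id -> metadata tags

    def put(self, raw: bytes, source: str, fmt: str) -> str:
        """Store raw data untouched and tag it; return its unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = raw
        self._metadata[object_id] = {
            "source": source,
            "format": fmt,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "size_bytes": len(raw),
        }
        return object_id

    def get(self, object_id: str):
        """Return the raw bytes and the metadata for one element."""
        return self._objects[object_id], self._metadata[object_id]

    def find(self, **tags):
        """Return the ids of all elements whose metadata matches every tag."""
        return [
            object_id
            for object_id, meta in self._metadata.items()
            if all(meta.get(key) == value for key, value in tags.items())
        ]


lake = FlatLake()
# Invented sample data: a structured CSV extract and a semi-structured JSON event.
csv_id = lake.put(b"id,amount\n1,9.99\n", source="billing_db", fmt="csv")
json_id = lake.put(b'{"event": "call_started"}', source="phone_system", fmt="json")
```

There are no tiers or aggregation levels here: a lookup is either by identifier (`lake.get(csv_id)`) or by metadata tag (`lake.find(format="json")`), which is roughly what "tagged and bagged" amounts to.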
Now, let’s talk about why you would want a data lake. Well, one of the big reasons is that, given tools and storage engines like Hadoop, you can store all this disparate information in a very easy and, I don’t know, acceptable format. So you don’t need to model data into some sort of, I don’t know, enterprise-wide data schema. With the data lake, you can just prop up Hadoop and throw the data in there, effectively. And then with the increase in data volume, velocity, and variety, the characteristics of big data, you can store all this information, which means that your analytics capabilities are going to increase. Next, if you want to do very complex things in machine learning or artificial intelligence, your data lake is the source for you, because you can have lots and lots and lots and lots of data. You can truly have big data around, which means that you can do sophisticated artificial intelligence or higher-end machine learning analyses on that data. So these are a few of the reasons why you might choose a data lake for your organization. So let’s pause for a second, and in the next section, we will get into what an actual architecture looks like for a data lake. So stay tuned.
All right, welcome to the second section. Now, before we get into the architecture discussion, let’s talk about some key concepts that are absolutely necessary before we can really start to understand a data lake architecture. Yeah, I’m totally cliffhanging you guys. I’m not sure what the right descriptor would be. Anyways, doing a cliffhanger. So here are some key concepts for data lakes that will help us to understand the architecture piece that comes shortly. So the first is the notion of data ingestion. Data ingestion is the way that we connect the data lake to a source system, to get the data from the source and load it into the data lake. And so we need to be sure that data ingestion supports all types of data: your structured, your unstructured, and your semi-structured data. Similarly, you should be able to handle multiple ingestion methods. There’s real time, which is awesome if you’re doing something that requires real-time data processing, say, using sensors on some sort of machine to determine if it’s working or not, or if it’s about to break down, that kind of thing. There are, of course, batch data ingestion techniques, where you just take large amounts of data, maybe once a day or once a week or once a month, and push it all into your data lake. And there are methods called micro-batching, where maybe you pick up the data every 15 minutes, every 30 minutes, every hour, whatever it is. You’re not doing the full day’s batch, but you are doing smaller batches across the day.
You know, we do this in my own organization when it comes to our phone records, right? We pick them up every 15 minutes. That’s a technique called micro-batching. In any case, however you ingest the data, you should be able to support lots of different data sources, like all sorts of different databases, maybe web or email servers, Internet of Things devices, File Transfer Protocol servers, web data like social media data, any sort of video or audio data, anything like this. And so this type of ingestion is key for data lakes.
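As a rough illustration of micro-batching, the sketch below groups timestamped records into fixed 15-minute windows, so that each window can be loaded as one small batch. The function name `micro_batches`, the record shape (a timestamp plus a payload), and the sample call records are all assumptions made for the example; a real pipeline would normally lean on a scheduler or a streaming framework rather than hand-rolled code like this.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def micro_batches(records, window_minutes=15):
    """Group (timestamp, payload) pairs into fixed windows, keyed by window start."""
    window = timedelta(minutes=window_minutes)
    batches = defaultdict(list)
    for timestamp, payload in records:
        # Floor the timestamp to the start of its window, counted from midnight.
        midnight = timestamp.replace(hour=0, minute=0, second=0, microsecond=0)
        window_index = (timestamp - midnight) // window
        batches[midnight + window_index * window].append(payload)
    # Return the windows in chronological order.
    return dict(sorted(batches.items()))


# Invented phone-record events for illustration.
calls = [
    (datetime(2020, 1, 6, 9, 2), "call-a"),
    (datetime(2020, 1, 6, 9, 14), "call-b"),
    (datetime(2020, 1, 6, 9, 16), "call-c"),
]
batches = micro_batches(calls)
for window_start, payloads in batches.items():
    print(window_start.strftime("%H:%M"), payloads)
```

Changing `window_minutes` to 30 or 60 gives the other pick-up intervals mentioned above, and a full daily batch is just the same idea with one window per day.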
The second key term is that of data storage. That’s a lot of what we’re talking about. The data storage in a data lake should be scalable. You should be able to go from zero to hero in a cost-effective way that allows very quick access to the data and to data exploration, and which, again, supports all the different types of data formats. Right. The next key term, I guess it’s more of a notion or an umbrella term, is that of data governance. And one way to think of data governance is as a process of managing the availability, the usability, the security, and the integrity of the data used across an organization. Of course, data governance has quite a bit more involved than that, and we’ll get into that in another episode. But the next key term is that of security. For data lakes, security needs to be implemented at every single layer, and we’ll talk about the different layers in just a minute when we get into the architecture discussion. But security needs to be implemented across every level of the data lake. It starts with the import process itself, the ingestion part, and then it goes into the storage piece, data at rest, then data being manipulated into something more usable, all the way to the consumption part, where you are actually actively querying the data and using it for analytical processes. Things like authentication, accounting, authorization, and data protection are some of the very, very, very important features of data security. And if you work in any industry that is highly regulated, you better secure your customers’ data. It’s of the utmost importance. One, it’s the ethical thing to do. That’s probably the most important. But two, it’s the legal thing to do.
And three, it’s what your customers have entrusted to you when it comes to handling their data. Now the next piece for a data lake is the notion of data quality. Data quality is extraordinarily difficult, especially if you have a ton of legacy data. But data quality is an essential component of a data lake architecture. When you are attempting to extract any kind of business value from your data, well, let’s just say if you have poor data quality, you’re going to get poor insights from that data. Like we say, shit in, shit out. You’ve got shit data going in, you’re going to get shit insights coming out of it, right? So you should do everything you can to ensure the quality of that raw, native data that you are ingesting into your data lake.
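One simple way to keep an eye on quality at ingestion time is to count how many incoming records are missing the fields you care about. The sketch below is only that, a minimal check under stated assumptions: the `quality_report` function, the field names, and the sample customer records are invented for the example, and real data quality work covers far more than missing values.

```python
def quality_report(records, required_fields):
    """Count clean records and how often each required field is missing or empty."""
    report = {"total": 0, "clean": 0, "problems": {}}
    for record in records:
        report["total"] += 1
        # A field counts as missing if it is absent, None, or an empty string.
        missing = [field for field in required_fields if record.get(field) in (None, "")]
        if missing:
            for field in missing:
                report["problems"][field] = report["problems"].get(field, 0) + 1
        else:
            report["clean"] += 1
    return report


# Invented legacy customer records for illustration.
customers = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": ""},
    {"customer_id": None, "email": "c@example.com"},
]
report = quality_report(customers, ["customer_id", "email"])
print(report)
```

The raw records still land in the lake unchanged; a report like this just tells you how much to trust them before anyone draws conclusions from them.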
Gosh, what else do I want to talk about? You know, I think that’ll be enough to get us into the architecture discussion. So let’s hop over to that piece. Stay tuned.
Okay, welcome back. Welcome to the data lake architecture section. No more beating around the bush. And so when I talk about architectures, whether that’s of data warehouses, actual houses, engine diagrams, whatever it is, I like to think about them going from left to right, just like we read books, right? On the left are usually the more basic components, and on the right is usually where it ends up. So in this case, with the data lake architecture, on the left side we’re going to talk about all the source systems. And on the far right side, we’re going to talk about the type of insights and interactions that we’re able to get out of the data lake. Now, squashed between the left, your source side, and the right, your insight layer, is going to be your middle layer, and the middle layer is where all the data lake architecture is really going to sit. And when I think about that middle layer, I like to think from bottom to top, the bottom being the most fundamental layer, all the way to the top, which is the highest level of thinking about a data lake. So you get the left side, your sources coming in; the middle, your little squished area, the bottom of which is going to be your most basic layer, all the way up to your highest level; and then to the right is the insight layer, where you gain insights from the data.
So let’s start on the left side. The left side is your ingestion layer, or your ingestion tier, just depending on how you want to think about it. And these are all of your sources. And your sources, like I said, can be any sort of real-time sensor data, for example. It can be micro-batch data, like every 15 minutes, every 30 minutes, every hour, what have you. It can be full-on batch ingestion, where you’re getting, you know, one big batch of data maybe once a day, once a week, once a month, what have you. And so we have this kind of natural division between types of data ingestion, from the real time, to the every-once-in-a-while micro-batching, to the more spread-out batching of various types of data. And again, remember that it can be structured, unstructured, or semi-structured data, which means it can be everything from audio and video to, you know, standard Excel-type sheets. It doesn’t matter, anything in between there, right?
All right, so now let’s move into that middle layer, that squishy layer, the layer of the actual architecture of your data lake. Now, data lakes can be built upon a variety of different technologies. You can use Hadoop, for example. That’s a classic technology on top of which you build your data lake. Or you can use Microsoft’s offering, which they have in Azure, or Azure, bonjour. Anyways, they have an Azure Data Lake. They also have an
Apache data lake through Microsoft, or you can just use Apache Spark, which can run on top of Hadoop. Nevertheless, there are lots of different technologies that we can use for this middle squishy layer, the architecture of the data lake. I want to talk about a Hadoop architecture. At the bottom, the most fundamental part of your data lake architecture is your storage. Now, one very good way of storing data in a data lake is a technology called HDFS. HDFS stands for Hadoop Distributed File System. And a Hadoop Distributed File System is, surprise, a type of distributed file system that can run on fairly low-cost hardware. And it’s also highly fault tolerant, so it’s built to keep working even when individual machines fail, right? And you can use cheap servers, it doesn’t really matter. The point of an HDFS-type file system for your most basic layer, for your storage, is that HDFS can provide very high-throughput access to the data, which means that when you have very, very large data sets, this type of storage allows very, very quick access and then further processing, right? So when we think about our architecture, then, we have the ingestion layer. And now we’re in the middle layer, which is the architecture of the lake. And at the very bottom of that is your HDFS. Now the next thing that sits on top of that type of storage is called the distillation layer. And the distillation layer, or the distillation tier, takes the data from your HDFS storage and converts it into structured data for easier analysis. So, you know, you might be thinking I’m pulling a fast one on you, because I’m like, oh yeah, you don’t really need structured data to have a data lake and do all sorts of cool analyses. Well, at the end of the day, for using R or Python,
you’re going to have to bring in a matrix or a data frame. And to do that, you need some structured data, right? And so sitting on top of your HDFS storage is going to be this distillation piece, where you’re going to be using massively parallel processing techniques, or you can use some in-memory techniques, to transform whatever’s floating around in your data lake into something that’s far more structured, far more usable. Of course, that’s not always the case when you’re dealing with truly unstructured data like image or video or chat data, right? But in that case, there are other techniques that we can layer in, like some statistical techniques, and then we can jump up to the next tier, which is processing. But for the most part, we try to make the data into something more structured so that we can more easily analyze it, right? So now in our architecture, we have this Hadoop Distributed File System storage. On top of that, we have the
distillation tier, where we’re trying to convert the data into something structured. The next piece is the processing layer. So this is layer three within the architecture. Think of a cake, right? On the third layer of the cake, we’re going to be running all of our statistical algorithms, and we’re going to be running user queries, because we’ve already converted some of this data lake data into something that’s more structured, more usable. And of course, the data that we’re using for either our analytical efforts or our querying could be real time, it could be batch data, it could be micro-batch data, right? It can be all these sorts of things. But this is where we do a lot of the processing, so that we don’t have to rely on our later tier, where we’re gaining insights, to do all the work for us. We try to do as much as possible in our data lake. Now finally, the very top layer of the cake, maybe the icing, is the operations tier. This is the highest level. This is the way it looks all shiny and glossy, so you don’t see the mess underneath, if you’re me and try to cover your mistakes with nice cake batter.
The operations tier is where you deal with the system management and the monitoring of your data lake. And the monitoring includes things like auditing, proficiency, data management, workflow management, systems management, these kinds of things, and they more or less allow you to govern all the layers below, which maybe are messy and have some mistakes you want to hide, so you put some nice icing on, right? And that’s all before you move to the far right of our structure. And so again: on the far left, all of our sources that are going to get ingested. In your middle layer, that’s where all the heavy-duty machinery comes in. At the bottom you have your HDFS storage. Then you have the distillation tier, where you’re converting the data into something more structured. Then you have your processing tier. I feel like I’m saying it wrong, but I’m not. The processing tier is where you’re actually doing your analyses or running your queries. And then on top of that, wrapping it all together, you have your operations tier. Now we’re moving to the far right side, which is the insights tier. Well, big surprise: it’s the research side, it’s the insights side, it’s the usability side, where, using languages like SQL or NoSQL queries, or any sort of visualization software like Tableau or Power BI, hell, even Excel, you can pull that data out of your lake, out of these tiers, and then start gaining insights from it. And that, my dear friends, is a data lake architecture. Start with sources, get the data ingested, put it into some storage that offers scalable and quick access to the data, turn that data into something that’s a bit more structured, start doing processing on that somewhat structured data, make sure it’s governed appropriately, then pull it out and do a bit more analytics, do a bit more visualization, do what you have to, and gain insight from it. And that is all I have for now when it comes to data lake architectures. Of course, there are far more technical things to be said, but I don’t think that’s going to fit in the next 30 seconds of this podcast. So until next time, keep getting down and dirty with data.
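For readers following along in text, the left-to-right flow recapped above can be sketched in a few lines of Python. This is a toy under loud assumptions, not a Hadoop setup: a plain list stands in for raw storage, an in-memory SQLite table stands in for the structured output of the distillation tier, and the sensor readings, the `distill` function, and the `readings` table are all invented for the example.

```python
import json
import sqlite3

# Storage: raw, semi-structured records kept exactly as they arrived.
raw_zone = [
    '{"device": "pump-1", "temp_c": 71.5}',
    '{"device": "pump-2", "temp_c": 64.0, "note": "after service"}',
    '{"device": "pump-1", "temp_c": 73.1}',
]


def distill(raw_lines):
    """Distillation: turn raw JSON lines into uniform (device, temp_c) rows."""
    for line in raw_lines:
        record = json.loads(line)
        yield record["device"], float(record["temp_c"])


# Load the distilled rows into a structured table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device TEXT, temp_c REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", distill(raw_zone))

# Processing and insights: an ordinary SQL query over the structured data.
rows = conn.execute(
    "SELECT device, COUNT(*), MAX(temp_c) FROM readings "
    "GROUP BY device ORDER BY device"
).fetchall()
print(rows)
```

The raw lines are never modified; the structured table is derived from them, which is why the extra `note` field on one record causes no trouble and is still there in the raw zone if someone needs it later.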
That’s it for the show. Thank you for listening. Be sure to follow us on any of our many social media pages, and be sure to like and subscribe down below so that you get the latest from Data Couture. Finally, if you’d like to help support the show, then consider heading over to our Patreon at patreon.com/datacouture. Writing, editing, and production of this podcast is done by your host, Jordan Bohall. So until next time, keep getting down and dirty with data.