Episode 62: (Implement This!) Have you Considered This Before Standing up your Data Warehouse?

Today Jordan considers the many areas that you need to think about as you attempt the journey toward implementing a Data Warehouse at your organization!

If you take seriously the many aspect involved in standing up a data warehouse, then you will be well down the road to advanced analytics capabilities at your company!

Also, be sure to check out the new Data Couture YouTube page for all new content on Tuesdays and Thursdays at


To keep up with the podcast be sure to visit our website at www.datacouture.org, follow us on twitter @datacouturepod, and on instagram @datacouturepodcast. And, if you’d like to help support future episodes, then consider becoming a patron at patreon.com/datacouture!


Welcome to data tour the podcast covering data culture at work at home and on the go. I’m your host, Jordan Bohall. If you’d like to stay up to date with all things, data and data couture, then head over to our social media pages. If you’d like to help support the show, and check out our Patreon page@patreon.com, forward slash data couture, know under the show.
Welcome to data couture. I’m your host Jordan. And on today’s episode of implement this, we’re going to be talking about how to implement a data warehouse in your organization. Now we’re going to talk about two major areas, namely, the IT side of it, the IT systems side, but then also a lot of the considerations that you need to take into account once you have those server resources available. So that’s what we’re going to be getting at today. But I would like to remind all of my wonderful listeners and like to give a shout out to my listeners in Sydney, Australia for keeping this alive in Australia. Thank you guys so much. And I promise not to do some sort of hack Australian accent because that’s horrible. And any case, beyond that, I have the YouTube blog going strong, titled data driven. And unfortunately, no one came up with data driven, but I happen to like it. So I’m giving myself stickers and a button. In any case, be sure to check it out. It’s available on Tuesdays and Thursdays, which means you get five full days of data to work content. So now on to the show.
Okay, welcome back. So the first part I want to talk about the first half is that of the IT systems requirements, I’m going to talk about five different requirements that you should consider before standing up your data warehouse whatsoever. So the first is the notion of workload separation. And what I mean by this well, when you’re going to do your ETL process, the process of extracting data from source systems, transforming it so that you get the right types of data and to the right types of fields and unloading it into your data warehouse. Well, the ATL process takes up a ton of CPU usage. In fact, you should be preparing for 100% of compute usage. For whichever server however you decide to stand up your data warehouse. And then on top of that, you also have the analysts workload. So one thing you can think of is the ATL process is pretty regular. Once you identify source systems, you can set up your email process to run overnight. For example, if you’re doing batch uploads, while you let your analysts do their workload, or your engineers do their workload during the day, say you have streaming data, though, frankly, I’m talking about relational data warehouses in this instance. But say you have a data lake setup and you need all sorts of streaming data. Well, you can still partition the processes according to who’s using what or if you’re using some sort of parallel processing, processing situation, then you can have, say, a couple servers pulling in all this data and then pushing it over to the servers or to the partition areas where your engineers are in data scientists and analysts are actually using it, thereby helping you to determine where you should separate the workload, which machines, you should be separating the workload. The second is the notion of maximizing your data loading throughput. And so, you know, sometimes, again, if you’re using data lakes, or if you’re, if you’re dealing with truly big data, you need to rapidly do your ATL process, which means that you need to maximize the throughput that is the complete or the full, or the whole, the total amount of work completed for your email process or your extraction process. And that should be
it should be chosen above any sort of individual query by your data scientists or your analysts. And so again, you should go for that parallel processing stream so that your compute power your CPU usage can be 100% in one area, where it’s not another, which is challenging, because in order to set up these NPP, these massively parallel processing systems, you’re gonna have to buy lots of machinery, but at the end of the day, it allows you to maximize this data loading throughput. And then the next idea is that of maximizing speed of your queries while minimizing latency. And so you’re going to have users and these users are going to want to access their dashboards, they’re going to want to get their queries as quickly as possible, they’re not going to care at all about the ETL process or any sort of batch loading, all they want is a very quick speed, a low latency on their dashboarding queries. So one way to do this is by having multiple levels of caching, and you can group these caching, that way, you can deliver the sub like, you know, the instantaneous the level, you know, snap of a finger level performance on dashboards, while also again, segmenting using various techniques of workload, so that performance isn’t going to be degraded whenever you have a data scientist or new sort of bi engineer or data analyst running other queries, while your end users are trying to access their dashboards, right. Another idea on the IT side of the house is and not necessarily just in the IT side, but certainly both the IT side and the data engineer site is having a very quick time to value. And so what I mean by that is go through of the purchasing perhaps of hardware, you set up the NPP the massively parallel processing, you do all of this sort of segmentation of workload? Well, now what do you do now you have all this data, should you just dump it all in at once and then sorted out until you get something that’s usable, also called the admin method, or in mind method I made famous by Bill and mon, or should you go the route of the Kimball methodology, made famous by Ralph gimbal, where you take just what you need put in the warehouse and build it up over time, my position and my thought on the matter is that you should go the gimbal methodology. That way, you can put only in the data warehouse, what you need for the users at the end of the day. And so you can take stock of what your business requirements are put that data in there, then you have end users who get the data much more quickly than having all this data coming in and taking up all of your compute resources, and then trying to figure out what to do with all this data now that it’s in there.
And then finally, is
all covered today, necessarily, is the notion of sizing. And so you know, different people, different users have different needs for the size of compute power, for example. And so you’re at all processes are going to take up a lot of your compute power, your dashboard users, if you’re cashing it correctly, will take very, very little compute power, your bi engineers or analysts will take somewhere in the mid range for compute power, and then your data scientists are going to be your heavy users just like your UTM processing. And they’re going to take a ton of data. Well, the ideal solution for this, namely, by sizing independently, is to run your retail processes on very, very powerful machines, right, like the real workhorses of your server set, for example. And then having separate ones where you can quarantine your bi engineers or analysts or your dashboard users. And the same goes for the data scientists have give them the sort of really beefy, robust, sort of compute applications, right. And so while you do this paralyzed parallelization, that’s a hard word to say. You can then effectively size independently the needs of each one of your business users while also guaranteeing the sort of speed that you need. Right. So now let’s move on to the other side of the house. Namely, when you’re actually creating the data warehouse, right. And like I said, I very much along the lines and very much a fanboy of the Kimball methodology. And to do that, to just put in the data that you need for the business users for the actual business objective? Well, first, you need to actually determine what you’re trying to achieve with your data warehouse. So first step, you need to talk to a lot of people, you need to build rapport, you need to figure out how your business users are going to use the data that you have intend to put in there. And so the first step is to determine overall business objectives, which will lead to the second step, namely, collecting that data, discovering where the data lives, and analyzing what sort of fields, what tables, what sort of schema you need, what sort of architecture and I’m a big fan again, Alliance aligned with Kimball with Ralph Campbell, of the star schema, or the snowflake architecture, which I’ll be talking about. didn’t talk about that earlier. Anyways, it’s a very nice way to create high performing tables and query structures, right. And so that’ll allow you to determine which data you need to collect to put into your data warehouse. And along with that as identifying your business processes across your organization. Because again, once you start creating the star schema and these, the snowflake architecture, well, then you can start creating data marts and data marts allow you to do a number of things. One is you can put all and only those stars only that part of the snowflake architecture that a particular business unit needs or a business vertical into a data mart, which allows them to query the data much much more quickly. But also you can put lots of security wrappers around them so that around these data marks right and so that the business users only can see the data mart that they are allowed to have access to, while more or less making a pic anything else in the data warehouse, which can then ensure a level of security and privacy
as is needed, right.
The fourth is then creating the conceptual data model, which I’ve already jumped the gun with the star schema, snowflake architecture with your data marts. But of course, you might go into a cube structure. But really all that matters is that it aligns with your business objectives right. Now, the next piece is making sure that the transformation that you create, namely, in your UTM, process, extract, transform, and load, make sure that you’re transforming the data to ensure consistency, accuracy, cleanliness, all these sorts of things so that when you do Place your data into the tables, you can maintain this reliable, consistent bulletproof architecture for your data warehouse. And I should say the final set is, or the final idea on the side of the house is to determine how long you’re going to be retaining data. So a relational data warehouse is almost defined by having lots of historical data. And this is very important whenever you’re trying to do any sort of trending if you’re trying to build any sort of models, because God knows if we didn’t have historical data, or machine learning models would never operate, because it’s just some sort of brute force that lives off of lots of historical data, right, that’s all machine learning models truly do at the end of the day. Well, you need to determine how much data you need to keep for how long and so whether that’s with backups, or whether that’s with maybe only need three years of historical data, you should determine a best practice given how much storage space you have, how much compute power you have been given or that you’re planning to get right. All these things matter for this tracking of the data. Also, you need to have lots of metadata, wrapping everything that you have in there that will you know, exactly the data lineage for where the data is coming from just in case you need to track back, like say, your entire data engineering team gets run over by a train, which hopefully that never happens, because that’s horrible. But nevertheless, say the whole team gets run over by a train. Well, you need to be able to track that data lineage to the next group of data engineers coming in can see where everything’s coming from, see what the stored proxy look like, see what the ATL processes built on, see how the transformations work, see how the load processing works and see how all the automation optimization has happened over the course of the building of your data warehouse. And so all of these things, both the the the IT system side of the house, the architecture of the IT systems, as well as the architecture, and all of the really business decisions that go into building a data warehouse, you need to consider both sides of the house before you set off on your data warehouse journey. But once you do, you’re gonna have a data warehouse, that one is probably going to be fairly agnostic as to which source systems get put into your relational data warehouse, and you’re going to be able to set it up for any sort of failure, you’re going to create something that’s robust, it’s secure, that is user friendly, as well as can handle all these load issues that come up. And so it’s, it’s very good to consider both sides. So that’s all I have when it comes to implementing a data warehouse solution, at least a relational data warehouse solution at your organization. So next, until next time, let’s get down and dirty with data. That’s it for the show. Thank you for listening. Be sure to follow us on any of our many social media pages and be sure to like and subscribe down below so that you get the latest from data couture. Finally, if you’d like to help support the show, then consider heading over to our Patreon page@patreon.com forward slash data could tour writing, editing and production of this podcast is done by your host Jordan Bohall. So until next time, keep getting down and dirty with data

Liked it? Take a second to support Data Couture on Patreon!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.