Episode 60: To Data Warehouse or not to Data Warehouse

Today we take on the topic that I consider the most important for determining the success or failure of an advance analytics initiative.

Standing up the right Data Warehouse for your application is tantamount to ensuring that you will have the best foundations for machine learning, artificial intelligence, predictive analytics, visualizations, or any other statistical operation.

Find out which one you need today!

Also, be sure to check out the new Data Couture YouTube page for all new content on Tuesdays and Thursdays at

https://www.youtube.com/channel/UCU7tZYon0GIv4SND1FILP4g

To keep up with the podcast be sure to visit our website at datacouture.org, follow us on twitter @datacouturepod, and on instagram @datacouturepodcast. And, if you’d like to help support future episodes, then consider becoming a patron at patreon.com/datacouture!

Transcripts:

Welcome to data tour the podcast covering data culture at work at home and on the go. I’m your host, Jordan Bohall. If you’d like to stay up to date with all things, data and data couture then head over to our social media pages. If you’d like to help support the show, and check out our
Patreon page@patreon.com forward slash data couture know under the show.
Welcome to data cutter. I’m your host Jordan and on today’s episode, we’re going to ask that age old literary question to data warehouse or not to data warehouse. That is should you be standing up or implementing a data warehouse solution at your organization? Why or why not? And one last thing before we get to the episode is
Tomorrow we launch our very first real YouTube vlog. I had some, you know, episodes up there with some looping video before. But now you get to see my smiling mug talking about what’s on my mind when it comes to the data profession. And in any case, stay tuned for that. So now let’s get on to the show.
Okay, so welcome to the first part of the show, and I want to kick it off by giving a bit of a rant frankly, yeah, I guess I’m not prone to rant so much, but it’s one thing that has really been bothering me lately and that is in two parts. The first is something that I recently read and that is that
something like 37 or 39% of all business owners or CEOs or executives find having your employees be digitally literate
Data literate is priority. And so they’re investing in it. But regardless what the 30 ish percent mark is, that means that at least 60% of all business owners and all executives don’t find digital literacy or
data driven employees particularly important, they don’t find the need to prioritize or to otherwise invest and doing that for their staff. And I find that exceedingly bizarre and short sighted. I mean, I don’t know. Like, I’ll ask you guys, am I just super excited about what my industry is producing? Or is this really going to be the wave of the future? And I think all signs point to Yes, this is going to be where all business all you know functions are heading is in some sort of digital format, some sort of online format, what have you. And if you want to have a successful company, well, you better have people that are digitally literate data literate. What have you
anywhere in between because your company is going to go down in flames 60% of people who don’t find it necessary to do that for their employees, just a word of warning. So that’s the first part of it. The second part of it is
the financial industries reticence to adopt new technology more quickly. And I don’t necessarily mean your local bank or credit union, or even trading desk. Those people tend to be right up on the verge of what’s going on. In the data profession. They You know, there’s there are a lot of conferences in the life that are really espousing for the sort of power and
I don’t know business saving tools that the data profession can offer right now. And so like the your typical credit union or bank, they’re right on board with it. I’m talking about the vendors to these financial institutions, they’re out there Hawking their software solutions that are built on
Very legacy technology, and they don’t seem to have a keen eye for the way things are going. So for example, take your standard financial core. So for those of you that don’t know, financial core, is responsible for keeping track of things like your balances your transactions, any sort of like marketing, whether or not you need to, you know, want marketing to come to you, or if you don’t want marketing, it also keeps track of addresses and personally identifiable information. It keeps all sorts of records about your loans and which products you have. And it allows you when you go through your mobile device or your online banking to access exactly what type of money you have, or what your financial portfolio looks like, or any of that kind of thing, right? And so it’s the absolute core. It’s the heart of any financial institution, and it’s exceptionally important, right? Great. Well, there are a bunch of offerings out there.
Financial core solutions providers. And you know, they they offer different flavors of kind of the same basic thing.
Of course, I haven’t fully described what a financial core is because there are lots of other parts to it that go into making a financial core. But one thing that really stands out to me are these financial core software solution providers that have their system built on an OLTP database or data warehouse. So an OLTP data base, which I’ll be talking about a little bit, or more should say, in the next section, online transactional, our online transaction processing database. And these are excellent these do a very good job at keeping track of every little micro transaction or any other sort of transaction fees, changes and balances, applying interest rates, right? All that kind of thing. However, OLTP databases are OLTP data warehouses are not great.
For doing any sort of analytics, it’s not great for doing anything higher than simply taking a transaction and putting it into the ledger of a particular customer. That’s, that’s kind of the long and short of what an old TP did database or data warehouse does. And so these these sales people, these companies are out there Hawking their financial cores, which again, is the heartbeat, it’s absolutely necessary can make or break any financial institution. But then the part that drives me crazy, and this relates back to the first point where people are investing in data literacy, or the digital literacy of their employees, and that’s that they actively and in kind of, I don’t know how to put it, I think the very arrogant way, whenever they whenever the suggestion is raised that Oh, well. Surely I’ll need a data warehouse so that I can do all these really interesting advanced analytics practices. They immediately
Shut it down because they equate advanced analytics with report writing and you know, and OLTP database data warehouse can definitely support basic basic report writing. But what happens if you want to do some sort of logistic regression or any sort of statistical modeling? On the data that you have in your financial core? Well, you can’t do that and don’t know well TP database or data warehouse? Hell, that’s basic stuff, right? Like what if you want to get into true machine learning where you’re doing random forest modeling or you’re running naive Bayesian algorithms against your your chat, your chat logs that your contact center have been very faithfully storing right? Well, guess what all TP database can do that. Especially like now if we want to go to true artificial intelligence and do things like deep learning or put together some sort of neural network, a old CPE data warehouse
database cannot handle that kind of thing. And so I find it highly naive and highly just, I don’t know a legacy way of thinking. And it ties to that. That piece where employers don’t want to invest and improving the data literacy or the
digital literacy of their employees. These kinds of things go part and parcel because, well, I’m seeing a bunch of sales people that don’t know any reason why anyone would want more power than the simple financial Core Data Warehouse that they’re offering. Right? drives me absolutely crazy.
Okay, sorry about that rant guys. rant is officially over, like what six minutes in Jesus
to the section. All right, let’s get to the more interesting thing that is data warehouse or not to data warehouse and I can guess that you guys can guess where I’m going to go with this? Yes, you should be standing up a data warehouse. You should stand up multiple types of data warehouses
So let’s get into what those are in the next section talk to you in a second.
Okay, welcome to the second part of the show. And so we’re going to be talking about whether or not you should stand up a data warehouse at your organization. And based off the rant that I just went through, I think you can guess what I’m going to say. Namely, yeah, you need a data warehouse in order to do the really cool, exciting, advanced analytics techniques and methods that will propel your business or your life to that next level, right. And so today I’m going to talk about three different types of data warehouse, the three kind of major ones and use today of course, there are lots of different flavors and variants of what I’m about to talk about, but this all these could have started on Wednesday. I’ll do a deep dive on I think data lakes some I’m very excited about data lakes, but nevertheless, the three types
Are your OLTP database or data warehouse like your financial core that stands for online transactional processing database or data warehouse. Then you have the relational data warehouse, which is classic form, we’ll talk about something called a star schema or snowflake architecture when we get into that part of it. And then finally, there is the data lake type of data warehouse. And that is very exciting for your big data or any sort of truly advanced analytical processes and methods that you care to do at your company. So let’s start with the most basic the OLTP the online transactional processing database or data warehouse. Like I said, these sorts of things have very, very simple sequel underlying any sort of process, bringing data end so for example, say you are going to pay for something with your debit card at the grocery store, the gas station, maybe you need some energy drink.
Like I do on a regular basis, well, when you swipe that card, it goes through a transaction process, it gets routed to your bank. And that bit of data gets inserted into the transaction table for that particular database. Now these are very simple because they just have insert statements effectively where they say insert this transaction into a particular table and the oil TP database and so they they don’t have a lot of fancy code, there is any sort of statistical modeling typically, it’s it’s very rudimentary, but one very good thing about them are they are very robust. They don’t break down very easily. They can run forever and ever and ever, because they are just based on these very, very simple statements to get the data from one place and into the database, right. And so there’s not a lot going on with them.
A major downside, which, you know, I’ve already talked about how you can’t do any sort of real statistical model
With OLTP databases or data warehouses
is that they are massive data warehouses and it can take quite a bit of time to query large amounts of data from them. So if you’re wanting to run a report that you know, maybe has high level summary statistics, like how many,
I don’t know transactions that a particular customer do, okay, they can do that kind of thing. However, if you want to see the raw data for, let’s say, 300,000 customers, and you want it to have, I don’t know, let’s say 20, or 30 rows wide, so you want to see everything about their transactions, their PII information, the personally identifiable information, you want to see the date, they joined your organization, you know, blah, blah, blah, if you want to see 30 or 40 of these variables, it’s going to take a long time to query that kind of information, like 3040 minutes depending on how old your systems are, and how old the server that that particular LTP database is sitting on right. And so there
You know, they do. They’re very Rock Steady, these OL TPS, they have very simple code underlying them, which means that they are pretty much bulletproof as goes, data warehousing. But let’s talk about something that’s far more useful for any sort of advanced analytics. And that is a relational database. a relational database is somewhat similar to your old TP setup, except the data warehouse. The relational data warehouse is primarily focused on historical data. And so even though sometimes you can get some historical data out of an OLTP, the data warehouse is geared towards primarily having you know, however many years back you want to go, let’s say three years back, have complete records of everything that your customers are doing. And so with this relational database, each one of the tables within this data warehouse has something called a primary key. And then foreign keys and primary keys are great because like think of like a social security number at all.
allows you to match up every bit of data across all the different tables with one single key and a foreign key is similar, but it only relates to a few tables among the entire set of tables that you have. And so primary keys are far more robust, whereas your foreign keys are far more, you specific, right? And so now you have this setup. Now we call like, Well, one way of implementing a relational database or data warehouse is through something called a snowflake architecture or a star schema. And this kind of architecture
puts into groupings that looks like a star. So you have kind of a central unit called a fact table and then around this fact table, you have various dimensions of those facts. So for example, say you have a snow or a star flag Jesus a snowflake if I could speak this evening, and the snowflake is centered with say loan information like balances, pay updates, interest rates,
It’s that kind of thing. That would be your fact, it’s called a fact because it doesn’t change that frequently, necessarily. But then around it, you have things called dimensions dimensions could be the owner of that particular loan. So it could be your customer information, like their name, their address, their phone number, email address, that kind of thing. Then you can have another dimension that might be your the employee dimension where it’s which employee initiated the loan, which ones are managing and facilitating that loan, that kind of thing. And so you can see how you can build up all sorts of dimensions about a particular situation, in this case alone, right. And so when you have all these stars floating around in your data warehouse, these snowflakes, that’s when you get the the snowflake architecture. And with data warehouses, you can fully customize this, you have lots of really interesting sequel code that puts these bits of information into these various star schema.
And allows you to build something called a data mart. And the data mart is more or less portions of your overall relational data warehouse, broken up into these various stars within your snowflake architecture. And you can just focus, you know, let’s say 10 or 15 of these these stars, so that maybe your underwriting team can quickly access and query just that information without having to pick up any resources from the rest of the warehouse. Also, you can put security so that maybe your underwriting team only has access to that particular star schema or set of star schema. My
point is you can quickly attach any sort of
visualization tool or analytical tool to these data marts be able to query them quickly have only the information that you need, and be off and running so that you can then do the analysis that you need to propel your business forward. Right. The other great thing about relational data warehouses is that
Even though they’re not necessarily the most flexible type of data warehouse, they do allow for custom table development, which is fabulous when you want to try to build up some sort of model using machine learning techniques or even predictive analytic techniques. And your analysts can be off and running, doing this sort of analysis, that analysis that they need to do, as well as once they build their model, they can let it live within your server for your data warehouse, which means that your models are going to run much more quickly and just be much more fruitful when it comes to doing any sort of analytical processing. And so this is why your relational data warehouse is going to be far more important for your analytics team or even for the analytics of your organization then say your common or at least legacy OLTP data warehouse. Now one thing that you know the the relational data warehouse fells that is a fluidity it’s it’s not easily changed. Once you have these
These tables set up the star scheme and the snowflake architecture put together, you know, it takes a while takes manpower, at the end of the day costs money to make any sort of significant changes to the structure of the warehouse. And furthermore, you know, you can’t necessarily put lots of unstructured data into your relational data warehouse. Yeah, you can put some text in there. But then you’re just going to be doing keyword searches, which isn’t great when you’re trying to do sentiment analysis, for example, right? You’re really focused on structured data, the sort of data that fits neatly, and say, an Excel workbook. That’s what really lives in your relational data warehouse. So say you want to start, you know, mining the data from all the different social media feeds for your company, or maybe you want to do some really deep analysis on the sort of chat logs and chat information that you’re getting from your contact center. Or maybe you want to do some sort of audio
Modeling from again, contact centers or any other sort of source for audio information, or you want to like do some OCR scrubbing on PDFs or you know, so you’re going to have all this unstructured data that you can’t put into a relational data warehouse definitely can’t put into an old TP database. So what do you do? Well, the next and most exciting at the moment type of data warehouse structure is called a data lake. And the data lake is kind of what it sounds like. If you compare it to a data warehouse, a data warehouse, any sort of warehouse for that matter is structured, there are rows and things are neatly stacked and things are neatly organized. In a data lake. It’s much like you know, going out to the pond near your house, the data lake, it isn’t organized, you get only raw data, you’re going to have a ton of data in there. So it’s not very nice. Oh, and it’s going to be a lot of unstructured data. So it’s not going to be very nice for the user for
The business user to try to go in there and figure out what’s going on so that they can drive some sort of, you know, business answer some sort of analytical model for them. The data lake, however, allows for us to get into the territory of big data. And I’m going to have an episode on big data coming up. Because there’s lots of disagreement about what’s big data, what’s not big data. I don’t know. It kind of drives me crazy. But nevertheless, we’re going to be talking about it. But the data like getting back to the warehouse comments, allows you to get into this realm, it allows you to process a ton of data for when the data lakes in the hands of a data professional of a data scientist, because they can go through build out whatever they need to they can pull in all sorts of information that’s unstructured, let their models feed on it, and be able to produce something that is that next level, right? data lakes are very, very cool. However, the downside like I said, they’re not easily query bubble and so they’re not great if you want businesses.
users to have access to them. However, they are great for those advanced analytics techniques, those advanced analytics capabilities. And so the question is the question of this episode to data warehouse or not to Data Warehouse? Well, you know, you might choose one type of data warehouse for your needs. But frankly, all three of these things can live together. So for example, in my organization, we have LTP database. We have lots of other similar OLTP type databases coming from various other source systems. We have a data warehouse that we’ve built up using the snowflake architecture with the mini star schema, right.
And we’re dipping our toes so to speak into standing up a data lake now that we’re getting more and more prepared to start doing some of these more advanced artificial intelligence top and machine learning types of techniques and methods
and
All three can live together. Depending on what your your IT infrastructure looks like, you might choose one or the other. Depending on what your business needs are, you might choose one of the three. But I can assure you that all three can live together. All three have very different capabilities and very different use cases. And so really, it’s up to you Do you just need simple reports? Great. You know, LTP? Do you have the need for more advanced analytics capabilities? Not quite AI, but certainly machine learning. Great. Stand up a data warehouse? Are you looking to take the dive into the artificial intelligence realm and start doing deep learning and image recognition and speech recognition and sentiment analysis and these kinds of things? Fabulous, get a data lake or stand up a data lake right? Of course, businesses have needs across all three of these areas. So why not put all three in? That’s my suggestion. So until Wednesday, keep on getting down and dirty with data
That’s it for the show. Thank you for listening. Be sure to follow us on any of our many social media pages and be sure to like and subscribe down below so that you get the latest from data couture. Finally, if you’d like to help support the show, then consider heading over to our
Patreon page@patreon.com. forward slash data could tour writing, editing and production of this podcast is done by your host Jordan Bohall. So until next time, keep getting down and dirty with data

Liked it? Take a second to support Data Couture on Patreon!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.