"Big data" has been the topical buzzword in the tech scene for quite some time now and has an ever-important role to play in the way that technology continues to advance. If you're not already taking advantage of the plentiful insights that your customers' day-to-day transactions and interactions with your brand convey, then we hate to say it, but you're falling behind. Off-the-shelf analytics toolsets like Google Analytics and others offer great data insights to get started, but there comes a time when you need to supercharge your data analysis capabilities.
One of the sure-fire ways to get started is by collecting your data in a data warehouse, a data lake... or both! Our customers often utilize one or the other depending on their use cases, so we've decided to share a few pointers they've learned to help you make the right decision for your business. Data lakes and warehouses are similar concepts, but each has its own pros and cons when you're deciding where to house your data.
But first, let's dive into what data warehouses and data lakes are (at least, at this moment in mid-2020):
Pros and Cons of Data Lakes
A data lake stores raw, unmanipulated data for the purpose of maintaining maximum data flexibility. Data housed in a data lake typically will not be used for heavy data analysis until it is transformed into some other format that makes for easier, more effective analysis.
Data lakes are generally used when the purpose of having data is for infrequent analysis, and/or the purpose of the data is generally uncertain. It is somewhat difficult to query this data without a data scientist, but tools such as Amazon Athena have helped make data lakes more queryable. Data lakes are also great if you know you need to collect the data, but are not exactly sure what the use case will be yet. Since the data has limited structure, you can apply whatever structure you'd like later on in a data warehouse or another tool.
A data lake stores data like water in a lake, uncollected and untransformed. Its purpose has yet to be defined, but it is there when you need it. You can also jump in and explore, but it might take time and effort to find what you're looking for.
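To make the "apply structure later" idea concrete, here's a minimal schema-on-read sketch in Python. The event fields and file name are hypothetical: raw JSON events are appended to a file exactly as they arrived, and structure is only imposed at read time.

```python
import json

# Hypothetical raw events, kept exactly as they arrived -- no schema enforced.
raw_events = [
    '{"type": "page_view", "url": "/pricing", "ts": 1}',
    '{"type": "purchase", "amount": 49.99, "ts": 2}',
    '{"type": "page_view", "url": "/docs", "ts": 3}',
]

def write_to_lake(events, path="events.jsonl"):
    # A data-lake-style append: store every event untouched.
    with open(path, "a") as f:
        for event in events:
            f.write(event + "\n")

def read_with_structure(path="events.jsonl", event_type="page_view"):
    # Schema-on-read: decide what "structure" means only at query time.
    with open(path) as f:
        return [json.loads(line)["url"]
                for line in f
                if json.loads(line).get("type") == event_type]
```

The point of the sketch is that `write_to_lake` never rejects anything; whether an event has a `url` or an `amount` only matters once a reader asks.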
Pros:
- Malleable data: manipulate and/or transform the data into another format once it's stored.
- Easy to store data: as long as the data is in a queryable format (JSON, Avro, Parquet, etc.), a data lake will accept it.
- Inexpensive to load & store data: storing unstructured data tends to be cheaper than storing it in tools such as data warehouses.
Cons:
- Hard for many users to query & understand data: you'll most likely need an additional tool or two to query data housed in a data lake.
- Expensive to perform queries: those tools add costs of their own, often charging on a per-query basis.
- Slower to return query results: generally speaking, data lakes are not optimized to run quick analyses.
Pros and Cons of Data Warehouses
A data warehouse stores formatted data with the purpose of performing complex queries and data analysis at a regular interval. You need to know the purpose of the data, because you'll typically need to predefine the structure (schemas, tables, columns and data types) within your database before you can actually insert data into a warehouse. Because you formatted the data and made a conscious decision on how your data is formed, you can more easily analyze the data and manipulate it for your analysis needs.
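A minimal sketch of that schema-on-write workflow, using Python's built-in `sqlite3` as a stand-in for a real warehouse (the table and column names are hypothetical): the structure must exist before any row can be loaded, and in exchange, analysis becomes a one-line query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse

# Schema-on-write: the structure is predefined before any data is loaded.
conn.execute("""
    CREATE TABLE events (
        event_type TEXT NOT NULL,
        url        TEXT,
        amount     REAL,
        ts         INTEGER NOT NULL
    )
""")

# Inserts must match the predefined columns and types.
conn.executemany(
    "INSERT INTO events (event_type, url, amount, ts) VALUES (?, ?, ?, ?)",
    [("page_view", "/pricing", None, 1),
     ("purchase", None, 49.99, 2)],
)

# Because the data is structured, analysis is straightforward SQL.
total = conn.execute(
    "SELECT SUM(amount) FROM events WHERE event_type = 'purchase'"
).fetchone()[0]
```

Real warehouses differ in dialect and scale, but the contract is the same: decide the shape of your data up front, and querying it later is easy.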
Sticking with water analogies, pushing data into a warehouse is like collecting water from a source into barrels, transporting the barrels to a factory, packaging the water into water bottles, and storing them on pallets. The water will be used for one purpose—in this case, data queries to find useful data. The way the water is packaged up is very methodical, and a mess-up on the assembly line could cause production issues. Likewise, if data doesn't fit into a certain schema or has issues, it won't be loaded into the warehouse and will be dropped.
Pros:
- Easy to sift through data & query it: the structured nature of a warehouse makes it easier to find what you're looking for.
- Historical data analysis: again, because of the structured nature of the data, it's easy to visualize the data and gain insight from historical trends.
Cons:
- Potential for data errors & loss: data warehouses generally require that data adheres to a predetermined schema. If data does not fit into that schema, it cannot be loaded. A user must perform some cleanup of the data before it can be stored.
- Less flexible data: you'll have to transform data in warehouses in order to make it more universal. And, because the data was already fit into a predefined structure, some data may have been omitted when initially loaded in.
Storing Events in Lakes vs. Warehouses
Data lakes and warehouses are two of the most common types of event destinations used to store event data. Their strengths and weaknesses for event analysis generally align with the above, but once you decide to get serious about collecting consumer data, here is what you should know about sending event data to these tools:
For data lakes:
- Scripts and queries run on data housed in a data lake are costly. The data is cheap and easy to access, but pulling insights out of that data can break the bank if you run too much analysis.
- Events and their related fields will almost always be loaded into a data lake. The only way that an event will not be saved into a data lake is if it is missing some key piece of data, such as the write key that tells our system where to send the event.
- Raw data in a data lake means much more flexibility. If data is loaded directly into a destination or warehouse, that event is transformed to meet the specific spec required by the destination. This means that you no longer have access to the data in its raw form.
- Using a data lake with MetaRouter unlocks data replay capabilities. If an event is stored in S3 in a raw format, we can run that data through any server-side destination.
- Data lakes typically require more setup to query. AWS Athena, for example, requires you to use AWS Glue to create a schema from which you can query data stored in S3. You may also need to transform your data before you're able to query. Additionally, you have to concern yourself with how you partition data, as well as the data format. AVRO and Parquet, for example, are more efficient to query than JSON and improve query performance, thus saving on cost.
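The partitioning point above can be sketched in a few lines. This is a hypothetical, simplified layout (the `dt=` date field and file names are made up for illustration), but it mirrors the Hive-style partitioning that engines like Athena rely on: a query filtered to one day only needs to scan that day's files.

```python
import json
import os

def partition_path(event, root="lake"):
    # Hive-style partition layout (dt=YYYY-MM-DD). Query engines can "prune"
    # partitions, scanning only the directories a query's filter matches.
    return os.path.join(root, f"dt={event['date']}", "events.jsonl")

def write_partitioned(events, root="lake"):
    # Route each event into its date partition as it arrives.
    for event in events:
        path = partition_path(event, root)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "a") as f:
            f.write(json.dumps(event) + "\n")
```

In practice you'd also choose a columnar format like Parquet over JSON lines for the files themselves, which is the other half of the cost savings mentioned above.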
For data warehouses:
- Warehouses are the general go-to for event analysis. Once a schema is populated with data, it is very easy to search for data within the schema.
- It is likewise easy to run queries on this data and pull insights out of it. Your typical data analyst should be able to find value easily from data stored in a warehouse.
- Data loaded into a warehouse must conform to a defined schema. If data is not in the format that is expected, the event will be dropped unless the service used to load data can accommodate malformed data. Even if malformed data finds its way into your warehouse, you'll still have to perform manual cleanup of the data if you want to run your queries accurately.
- MetaRouter streams data into BigQuery, while we use a more typical batch loading style for Redshift. This might change soon. The key is that batch loading can be less expensive for users but introduces another step in the data load process. This step introduces longer wait times to see data (typically at an hour or day cadence for batch jobs), which also introduces the possibility of data loss due to inconsistent data. Streaming data is generally easier to manage and troubleshoot, but also introduces increased cost for some partners.
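The schema-conformance point above can be sketched as a tiny loader. The field names and expected schema here are hypothetical; the idea is simply that events are checked against the warehouse's expected shape before loading, and malformed ones are diverted for cleanup rather than silently lost.

```python
# Hypothetical warehouse schema: each field and the type it must have.
EXPECTED_SCHEMA = {"event_type": str, "ts": int}

def load_events(events):
    # Split incoming events into loadable rows and rejects that need
    # manual cleanup before they can be stored.
    loaded, rejected = [], []
    for event in events:
        ok = all(isinstance(event.get(field), ftype)
                 for field, ftype in EXPECTED_SCHEMA.items())
        (loaded if ok else rejected).append(event)
    return loaded, rejected
```

Keeping a reject pile (often called a dead-letter queue) is what separates "data cleanup" from "data loss": the malformed events are still there to be fixed and reloaded.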
Data lake or data warehouse?
It’s never quite that simple, is it? Often, our clients choose both. This allows for easy data analysis when necessary, but also unlocks the flexibility and data replay capability that we mentioned earlier.
However, if you must choose one or the other, look to the above pros and cons to decide exactly what the needs are for you and your organization.
And, as a word of caution, the current solutions that are available can differ by quite a bit in terms of functionality and cost. Do your homework, understand what the major players in the space have to offer and their price points, and you'll be well on your way towards realizing the benefits of Big Data insights!
As a follow-up to this article, we'll analyze a few of the solutions out there that we have experience with, share thoughts on choosing the right cloud, and offer our recommendations on which tools are best to accomplish the complex data science needs of 2020 and beyond.