The data lake has become a major buzzword in the world of big data. Data advocates initially pitched the idea of a data lake as a solution for all unstructured data – an alternative to the restrictions of a data warehouse. But recently, the concept of the data lake has begun to earn more detractors than supporters, leaving many questioning the validity and benefits of the data lake.
The fact is that a data lake can be very helpful, so long as it is used correctly. We’re taking a look at the benefits and challenges of using data lakes to store our large datasets, and how businesses can intelligently utilize them.
What Exactly is a Data Lake?
A data lake is an enormous, easily accessible, centralized repository that stores both structured and unstructured data from source systems. Data is not classified when it enters the data lake, so up-front data preparation is deferred; only when data is accessed is it classified, organized or analyzed.
Traditionally, data was stored in data warehouses: it was pulled from multiple sources, transformed and structured, and defined by very specific parameters. Data warehouses were useful because the data within them was regulated and trustworthy. However, by an often-cited estimate, about 80% of data doesn’t fit this model, because it is semi-structured or completely unstructured. This means it doesn’t have a pre-defined data model, or is not organized in a pre-defined way. So, instead of requiring all of this data to be integrated into one model before it can be stored, data lakes allow it to stay as is, to be dealt with at another time.
Why Are Data Lakes Useful?
Data lakes are able to store data in its native format, unintegrated at the point of entry. The data requirements and schema are not defined until the data is queried for analysis. In a data lake, these requirements can be prepared for specific analytical uses as needed, which is more flexible and efficient. In a data warehouse, by contrast, the schema is established when the data is entered, which limits analysis to the uses that schema anticipated.
Another benefit of data lakes is that they eliminate information silos. Instead of storing multitudes of independently managed datasets, the data lake collects all sources of data in one place. This consolidation encourages data sharing and increases the information available. Not only does this cut down on costs, but it also allows for increased insights due to the information sharing.
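To make the idea of defining a schema only at query time concrete, here is a minimal sketch in Python. The sample records, field names and the `read_with_schema` helper are all hypothetical, made up for illustration; a real data lake would use a query engine rather than a hand-written loop.

```python
import json

# Raw events kept exactly as they arrived (hypothetical sample data).
RAW_RECORDS = [
    '{"user": "ana", "amount": "19.99", "ts": "2016-03-01", "note": "gift"}',
    '{"user": "ben", "amount": "5.00", "ts": "2016-03-02"}',
    '{"user": "ana", "amount": "7.50", "ts": "2016-03-02", "coupon": "SPRING"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> converter) only when the data is read."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        # Keep only the fields this analysis needs; missing fields become None.
        rows.append({
            name: convert(record[name]) if name in record else None
            for name, convert in schema.items()
        })
    return rows


# Two analyses, two schemas, the same untouched raw data.
revenue_schema = {"user": str, "amount": float}
coupon_schema = {"ts": str, "coupon": str}

revenue_rows = read_with_schema(RAW_RECORDS, revenue_schema)
coupon_rows = read_with_schema(RAW_RECORDS, coupon_schema)

# Example query for the first analysis: total spend per user.
total_by_user = {}
for row in revenue_rows:
    total_by_user[row["user"]] = total_by_user.get(row["user"], 0.0) + row["amount"]
```

Neither schema was fixed when the records were stored, so a third analysis could later read the `note` field without reloading or restructuring anything.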
The Challenge with Data Lakes
One challenge with data lakes is that analysts cannot easily judge data quality, because the data was not thoroughly checked when it was loaded. Also, there is often no way to build on the insights of others who have worked with the data, as there is no record of the lineage of findings by previous analysts. Finally, one of the biggest risks of data lakes is security and access control. Data can be placed into a lake without any oversight, and some of that data may be subject to privacy and regulatory requirements that other data is not.
The bottom line is that a data lake can be very useful, and can make your data analysis more efficient and specialized. On the other hand, if your data lake is unregulated and unsupervised by trusted IT professionals, you run the risk of creating a data mess. It becomes a lake filled with data from unrecognized sources and of differing levels of quality: a swamp where there once was a lake. So, if your company is choosing the data lake route, keep those waters clean.
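One way to picture the oversight these challenges call for is a small catalog record kept alongside each dataset. The sketch below is only an illustration: the `CatalogEntry` class, its fields, and the `can_read` check are hypothetical names, not part of any particular data lake product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record kept for one dataset in the lake."""
    name: str
    source: str                      # where the data came from
    sensitivity: str = "public"      # e.g. "public" or "restricted"
    quality_checked: bool = False    # has anyone validated the data yet?
    lineage: list = field(default_factory=list)

    def record_use(self, analyst, note):
        # Keep a trail of who worked with the data and what they did.
        self.lineage.append((analyst, note))


def can_read(entry, clearances):
    """Allow access only if the caller holds the dataset's sensitivity level."""
    return entry.sensitivity in clearances


entry = CatalogEntry("web_clicks", source="cdn-logs", sensitivity="restricted")
entry.record_use("ana", "deduplicated by session id")
```

With a record like this, a later analyst can see that the data has not yet been quality checked, read what a previous analyst did to it, and be refused access if they lack the right clearance.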