Deconstructing the Lakehouse

February 22, 2021

Some suitable meditations before we unlock the Lakehouse:

“Do not collect weapons or practice with weapons beyond what is useful.” Miyamoto Musashi, Dokkodo

“Students of the Ichi school Way of Strategy should train from the start with the (normal) sword and the long sword in either hand. This is a truth: when you sacrifice your life, you must make fullest use of your weaponry. It is false not to do so, and to die with a weapon yet undrawn.” Miyamoto Musashi, The Book of Five Rings

“Absorb what is useful, discard what is useless and add what is specifically your own” Bruce Lee


The Lakehouse

Databricks introduced the Lakehouse to describe a unique set of principles that have emerged in the industry. At its core, it is a hybrid of two classic design patterns: the data warehouse and the data lake.

The data warehouse was “fathered” by Bill Inmon and further explored by Ralph Kimball. The purpose of a data warehouse is to store curated and focused data for reporting and analysis. The data lake, on the other hand, grew out of the availability of large amounts of very cheap storage. With the data lake came Hadoop, and with Hadoop’s demise, several successors came on the scene, most notably Apache Spark, Apache Flink, and Presto.

Each design pattern has its limitations, advantages, and use cases. A data warehouse can take time to deploy and have limited scope. Even when a data warehouse has been deployed, its usefulness beyond reporting and targeted analysis is limited by the architect’s foresight. Even the most skilled architect cannot predict every possible question asked of the data.

On the other hand, data lakes can store all raw data, so nobody has to go back to the source data store. Going to the source data store is usually a very hard process with limited results. Data lakes welcome all types of data without discrimination, but they come with significant issues centered around the lack of management and control.

The Lakehouse aims to address the limitations by combining the effectiveness of both patterns.

Openness

Being open is a fundamental principle in designing systems for long-term usage. Openness means not only that the data is stored in a universal standard, but also that it is open to all methods of data consumption. By storing raw data and making it available to anyone with permission, we set the groundwork for data democratization.

Diverse Data

What types of data should I be able to use?

  • Structured
  • Semi-structured
  • Geospatial
  • Graph
  • Audio
  • Video
  • Unstructured Text

Although many systems claim to be friendly to nontraditional data, the reality is sadly very different. At this time, hardly any relational databases or data warehouses support structured and semi-structured data equally, and I won’t even bother talking about graph and geospatial data. The issue appears when you read the fine print: they only support schema on read for some types of nontraditional data.

Schema on read alone is not good enough for an enterprise system meant to stand the test of time. One of the main contributors to data swamps is not supporting schema on write. A Lakehouse must enforce schema on both read and write for a reasonable range of nontraditional data types.
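To make the idea of schema on write concrete, here is a minimal sketch in plain Python. It is not how any particular Lakehouse engine implements enforcement; the `Table` class, the `SchemaError` exception, and the `EVENT_SCHEMA` columns are all hypothetical names invented for illustration. The point is only that a bad record is rejected at write time instead of being discovered later by a reader.

```python
from dataclasses import dataclass, field

# Hypothetical schema for illustration: column name -> required Python type.
EVENT_SCHEMA = {"user_id": int, "event": str, "amount": float}


class SchemaError(ValueError):
    """Raised when a record does not match the table schema."""


@dataclass
class Table:
    schema: dict
    rows: list = field(default_factory=list)

    def append(self, record: dict) -> None:
        # Schema on write: validate before the record is stored.
        if set(record) != set(self.schema):
            raise SchemaError(
                f"columns {sorted(record)} do not match {sorted(self.schema)}"
            )
        for column, expected in self.schema.items():
            if not isinstance(record[column], expected):
                raise SchemaError(
                    f"{column!r} expected {expected.__name__}, "
                    f"got {type(record[column]).__name__}"
                )
        self.rows.append(record)


table = Table(EVENT_SCHEMA)
table.append({"user_id": 1, "event": "purchase", "amount": 9.99})

try:
    # user_id is a string here, so the write is refused.
    table.append({"user_id": "two", "event": "purchase", "amount": 5.0})
    rejected = False
except SchemaError:
    rejected = True
```

With schema on read only, the second record would have landed in storage and the type problem would surface in whichever downstream job happened to touch it first.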

A common misconception I often hear is, “we are just going to make dimensional models out of the data; why bother with strong support for diverse data?” This perspective is limited and sees data warehousing as the only useful outcome for the data.

Diverse Workflow

What types of workflows should I be able to do on my data platform?

  • Data warehousing
  • Machine Learning / AI
  • Data analysis
  • Graph analysis
  • Geospatial analysis

What languages should I be able to use on my data platform?

  • SQL (structured and unstructured)
  • R
  • Python
  • JVM Languages (Scala, Java, etc.)
  • .Net

It’s impossible to support every possible option, but care must be given to avoid boxing yourself in.

Decoupling Storage from Processing

Decoupling opens the door to flexibility.

The storage of data should have no bearing on the processing of that data. This is a double-edged sword since moving data is often expensive. In the context of a cloud-based OLTP, this isn’t typically a concern. By decoupling your data platform, you open the door to an ecosystem of options:

  • A data scientist transforming source data and storing it in a feature store for Machine Learning / AI.
  • The marketing department asking for a data pipeline to a graph database for graph analysis.
  • A team performing some workloads on one processing engine like Flink or Spark while leveraging technologies like Presto or Amazon Athena for inexpensive, fast ad-hoc queries.
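A toy version of this decoupling, using only the Python standard library, might look like the sketch below. The JSON Lines file stands in for open storage, and the two functions stand in for two independent processing engines; the file name, the sample records, and the function names are assumptions made up for this example, and in-memory SQLite is only a stand-in for a real query engine.

```python
import json
import sqlite3
import tempfile
from collections import defaultdict
from pathlib import Path

# Storage layer: an open, engine-neutral file (JSON Lines here).
records = [
    {"region": "east", "sales": 100},
    {"region": "west", "sales": 250},
    {"region": "east", "sales": 50},
]
storage_path = Path(tempfile.mkdtemp()) / "sales.jsonl"
storage_path.write_text("\n".join(json.dumps(r) for r in records))


def read_storage(path):
    """Read every record from the shared storage file."""
    return [json.loads(line) for line in path.read_text().splitlines()]


def totals_with_python(path):
    """Engine 1: aggregate in plain Python."""
    totals = defaultdict(int)
    for row in read_storage(path):
        totals[row["region"]] += row["sales"]
    return dict(totals)


def totals_with_sql(path):
    """Engine 2: load the same file into in-memory SQLite and use SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, sales INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (:region, :sales)", read_storage(path)
    )
    query = "SELECT region, SUM(sales) FROM sales GROUP BY region"
    result = dict(conn.execute(query))
    conn.close()
    return result
```

Both functions read the same file and neither owns it, so either one could be swapped out without touching the stored data.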

Streaming data

Streaming is a fundamental service for many companies. Having a data platform that is friendly to streaming data isn’t optional. Many companies that are not dealing with streaming data today will soon require it.

Transactions

Data lakes have classically failed when dealing with concurrent processing. They frankly cannot achieve the same level of reliability as databases and data warehouses. A Lakehouse is built on an ACID, transaction-based architecture. When multiple processes work concurrently, transactions give each of them a consistent view of the data. Transactions also open the door to a full lineage of the table.
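One way to picture how transactions give both a consistent view and a lineage is an append-only commit log with optimistic concurrency. The sketch below is a simplified, in-memory illustration of that general idea, not the actual protocol of Delta Lake or any other product; `VersionedTable`, `ConflictError`, and the sample rows are hypothetical.

```python
import threading


class ConflictError(RuntimeError):
    """Raised when another writer committed first."""


class VersionedTable:
    def __init__(self):
        self._log = [[]]  # _log[v] holds the rows added by version v
        self._lock = threading.Lock()

    @property
    def version(self):
        return len(self._log) - 1

    def snapshot(self, version=None):
        """Return all rows visible at a version (defaults to the latest)."""
        v = self.version if version is None else version
        return [row for commit in self._log[: v + 1] for row in commit]

    def commit(self, rows, read_version):
        """Atomically append rows if nobody committed since read_version."""
        with self._lock:
            if read_version != self.version:
                raise ConflictError(
                    f"table moved from v{read_version} to v{self.version}"
                )
            self._log.append(list(rows))
            return self.version


table = VersionedTable()
v0 = table.version  # both writers start from version 0

table.commit([{"id": 1}], read_version=v0)  # writer A commits version 1

try:
    table.commit([{"id": 2}], read_version=v0)  # writer B is now stale
    conflict = False
except ConflictError:
    conflict = True
```

Because every commit is kept in the log, `table.snapshot(0)` still returns the table as it looked before writer A's commit, which is the sense in which the log doubles as a lineage of the table. A real writer would typically retry after a conflict by re-reading the latest version.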

TLDR

The Lakehouse addresses the main issues plaguing both the data lake and the data warehouse. The basic principles of the Lakehouse can be used individually or together. An example of using one principle in isolation is making your data platform open by creating raw Delta Lake tables before ingesting data into your data warehouse. You gain several benefits, including more robust interoperability and a more productive path for machine learning. Most notably, the Lakehouse is 100% compatible with data warehousing techniques.

Send me a message if you’re interested in the Lakehouse or best practices for your organization. Our data engineers at SingleStone can help you navigate this process and modernize your data storage for long-term use.

Brian Lipp

Sr. Data Engineer/Backend Dev
Brian has worked in the Data field for many years in many hybrid roles combining Data Engineering, Backend Software Engineering, and Machine Learning.
