How do you choose the cloud data platform that best suits your needs? There's a lot to sift through. In this series, I'm breaking down your different options, sharing details on must-know features, use cases, and so much more. Let's kick things off by reviewing Databricks.
Background on Databricks
Databricks was founded by the original developers of the Apache Spark project and remains dedicated to the open-source community in many ways. In addition to contributing to Apache Spark, Databricks has created several other open-source projects, such as MLflow and Delta Lake. To stay competitive, the company has also chosen to optimize and rewrite code and to offer proprietary features to clients. The Databricks platform is not tied to a single cloud vendor and has a very small footprint. Currently, all storage and infrastructure are located in the client's own cloud account.
The architecture Databricks promotes is the Medallion architecture, but Kappa and Lambda architectures work well too. The key feature of the Medallion architecture is its lakehouse design. The main characteristics of a lakehouse are decoupled storage and compute, ACID transactions, schema on read and write, data governance, streaming, support for unstructured and semi-structured data, and BI support. This philosophy isn't limited to a particular platform and is fully compatible with approaches like Data Mesh.
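To make the Medallion idea concrete, here is a minimal sketch in plain Python. It assumes the usual bronze (raw), silver (cleaned), and gold (aggregated) layering. The sample records and the function names `to_silver` and `to_gold` are hypothetical, and a real lakehouse would use Spark DataFrames and Delta tables rather than in-memory lists.

```python
from collections import defaultdict

# Bronze: raw records kept exactly as they arrived (hypothetical sample data).
bronze = [
    {"order_id": "1", "country": "us", "amount": "10.50"},
    {"order_id": "2", "country": "DE", "amount": "7.25"},
    {"order_id": "2", "country": "DE", "amount": "7.25"},  # duplicate delivery
    {"order_id": "3", "country": "us", "amount": "oops"},  # malformed amount
]


def to_silver(rows):
    """Validate, type, and de-duplicate bronze rows."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows that fail validation
        if row["order_id"] in seen:
            continue  # drop repeated deliveries of the same order
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "country": row["country"].upper(),
            "amount": amount,
        })
    return silver


def to_gold(rows):
    """Aggregate silver rows into a business-level summary per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Each layer is derived only from the one before it, so the raw data stays available for reprocessing if the cleaning rules change.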
- ANSI SQL
One of the unique aspects of Spark-based offerings like Databricks is the ability to use and mix a variety of languages. Data science and machine learning can run natively in best-of-breed languages and libraries. Yet on the same platform, you can build pipelines mostly in SQL, with strong support for tools like dbt. dbt is so well supported that even the workflow service can call it.
Unique data engineering features
Notebooks are a common tool for ad hoc queries, troubleshooting, and in some cases production usage. If notebooks are not to your liking, you can use Databricks jobs (each one a Spark application). Both options have a significant number of features, too many to list here.
There are also features that reduce the DevOps footprint by creating and managing infrastructure for you. Some examples are Auto Loader, Workflows, and Delta Live Tables. Auto Loader is a unique ingestion service that works in both stream and batch style. It automatically creates the metadata and queue-backed ingestion infrastructure in your cloud account and can run as a stream (micro-batch) or as a batch, which provides a wide range of flexibility.
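As a rough sketch of what an ingestion job of this kind can look like in PySpark, the function below reads new JSON files through the `cloudFiles` source and writes them to a table. The function name, paths, and table name are hypothetical. It also assumes a Databricks runtime, since the `cloudFiles` source is not part of open-source Spark, and enabling `cloudFiles.useNotifications` assumes the workspace has permission to set up the notification queue in your cloud account.

```python
def start_autoloader_ingest(spark, source_path, schema_path, checkpoint_path, target_table):
    """Incrementally load new JSON files from source_path into target_table."""
    stream = (
        spark.readStream
        .format("cloudFiles")                               # Auto Loader source
        .option("cloudFiles.format", "json")                # format of the incoming files
        .option("cloudFiles.schemaLocation", schema_path)   # where the inferred schema is tracked
        .option("cloudFiles.useNotifications", "true")      # queue-backed file discovery
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)      # lets a restart resume where it left off
        .trigger(availableNow=True)                         # process the backlog, then stop (batch style)
        .toTable(target_table)
    )
```

In a Databricks notebook the `spark` session is already defined, so a call would look like `start_autoloader_ingest(spark, "s3://my-bucket/landing", "s3://my-bucket/_schema", "s3://my-bucket/_checkpoint", "bronze.events")`. Removing the `trigger(availableNow=True)` line would leave the same code running as a continuous micro-batch stream.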
Workflows is a pipeline tool for data engineering, data science, and machine learning, designed to run a DAG-like set of tasks. A task within the pipeline can currently be a JVM JAR, Python package, notebook, dbt job, or even a Delta Live Tables pipeline. Workflows can fully mix data engineering tasks with machine learning tasks, all within the same minimalist infrastructure. There are drawbacks, though. Besides the infrastructure-as-code issues mentioned in the DevOps section, Workflows does not allow you to choose a unique cluster per task. This might seem trivial at first, but it can become suboptimal when tasks have different resource needs.
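To illustrate the DAG-like task structure, here is a sketch of a job definition in the general shape used by the Databricks Jobs API 2.1, together with a small helper that works out a valid run order. The job name, task keys, notebook path, and package name are hypothetical, and cluster settings are left out, so this is not a complete, submittable payload. The `execution_order` helper is my own illustration, not part of any Databricks library.

```python
# Hypothetical job definition in the shape used by the Jobs API 2.1.
# Cluster settings are omitted, so this is an incomplete payload.
job = {
    "name": "daily_sales_pipeline",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/demo/ingest"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "dbt_task": {"commands": ["dbt deps", "dbt run"]},
        },
        {
            "task_key": "train_model",
            "depends_on": [{"task_key": "transform"}],
            "python_wheel_task": {"package_name": "sales_ml", "entry_point": "train"},
        },
    ],
}


def execution_order(job_spec):
    """Return task keys in an order that respects depends_on (topological sort)."""
    deps = {
        task["task_key"]: {d["task_key"] for d in task.get("depends_on", [])}
        for task in job_spec["tasks"]
    }
    ordered = []
    while deps:
        # A task is ready once every task it depends on has been scheduled.
        ready = sorted(key for key, needs in deps.items() if needs <= set(ordered))
        if not ready:
            raise ValueError("cycle or unknown dependency in task graph")
        ordered.extend(ready)
        for key in ready:
            del deps[key]
    return ordered


print(execution_order(job))
```

The sketch mixes a notebook task, a dbt task, and a Python package task in one pipeline, which is the kind of data engineering and machine learning combination described above.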
Other pipeline tools, such as Apache Airflow, Azure Data Factory, and Dagster, are fully supported. If your pipeline tool is not supported, as in the case of AWS Step Functions, you can interact with the Databricks REST APIs and integrate with little effort.
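As a sketch of that REST integration, the snippet below builds a request that triggers an existing job through the Jobs API `run-now` endpoint, using only the Python standard library. The workspace URL, token, and job ID are placeholders, and the helper name `build_run_now_request` is hypothetical. An external orchestrator would send an equivalent HTTP call from one of its own steps.

```python
import json
import urllib.request


def build_run_now_request(host, token, job_id):
    """Build (but do not send) a request that triggers an existing Databricks job."""
    body = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/run-now",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # personal access token (placeholder)
            "Content-Type": "application/json",
        },
    )


# Placeholder workspace URL, token, and job ID.
request = build_run_now_request("https://example.cloud.databricks.com", "dapi-REDACTED", 123)
print(request.get_method(), request.full_url)

# To actually trigger the run, send it:
# urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the example safe to run anywhere and makes the call easy to unit test.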
Delta Live Tables is a declarative, code-based approach to creating data processing pipelines and managing data quality using Python. Delta Live Tables uses the Delta Lake open standard. Delta Lake is a layer on top of the typical Parquet-backed data lake table. What makes Delta Lake unique:
- ACID transactions
- audit log
- 0 infrastructure to manage
- schema evolution
- 100% schema on read/write
It's refreshing to see that, yes, the traditional relational data methodology is supported, but there's also strong support for semi-structured data (full schema on read and write), geospatial data, and even graph data.
Unique machine learning offerings
Databricks comes with a wide variety of data science and machine learning features designed by Databricks, not a third party.
Machine learning can be accomplished through several tools. Spark ML, a native Spark library, can always be used. Native libraries like scikit-learn, XGBoost, and Hyperopt can be parallelized and integrated into workflows. Databricks offers a glass-box style of AutoML, which gives the data scientist full visibility into the AutoML process. Databricks also offers MLOps features like MLflow and a feature store.
For analysts and BI developers, Databricks offers SQL Analytics, a SQL-only service for analysis and BI dashboarding. It may not have as many features as Tableau or Power BI, but it definitely gives them a run for their money when viewed as part of the overall ecosystem.
DevOps
Databricks offers several infrastructure-light features, such as Workflows, Delta Live Tables, and Auto Loader. The aim is to reduce infrastructure management: the infrastructure lives in your cloud account, but it is managed by Databricks. Databricks offers REST API access, which enables tools like Terraform, a popular infrastructure-as-code tool. Many Databricks features also integrate with JVM, Python, and R libraries produced by a CI/CD process. This means your pipeline can be fully vetted through tests, linters, and a pull-request process, just like any other software you would produce.
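To show what "vetted through tests" can mean in practice, here is a minimal sketch of pipeline logic written as a plain, importable function with a unit test beside it. The function name, column names, and discount rule are hypothetical. In a real project the function would live in a Python package built by CI/CD, and the test would run under a test runner such as pytest before the package is attached to a job.

```python
def add_net_amount(rows, discount_rate):
    """Return new rows with a net_amount column; the input rows are left untouched."""
    if not 0 <= discount_rate < 1:
        raise ValueError("discount_rate must be in [0, 1)")
    return [
        {**row, "net_amount": round(row["gross_amount"] * (1 - discount_rate), 2)}
        for row in rows
    ]


def test_add_net_amount():
    rows = [{"order_id": 1, "gross_amount": 100.0}]
    result = add_net_amount(rows, discount_rate=0.2)
    assert result == [{"order_id": 1, "gross_amount": 100.0, "net_amount": 80.0}]
    assert "net_amount" not in rows[0]  # original input is unchanged


test_add_net_amount()
```

Keeping the business logic free of cluster-specific code is what lets it run in an ordinary CI environment before it ever reaches Databricks.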
Sadly, there are large gaps in the capabilities offered through Terraform and the REST API. Those gaps translate to manual interaction with the web GUI, which works fine for small groups and some other use cases but will not scale and can't be automated.
A final word on Databricks
Databricks is a powerful data platform that brings a ton of unique features to a wide variety of workloads. Sure, there are some rough edges, but definitely not enough to stop you from using this platform.
We kicked off the series with Databricks, but this showdown is just getting started. I plan to feature AWS EMR/Glue, Snowflake, Azure Synapse, and something in Google Cloud. Message me your thoughts and stay tuned for more!