June 28, 2023

Do you recognize these data problems? Here’s how to solve them.

Chad Baily

Data as a Product. Data is produced in distinct business domains and is pulled into applications that are closely tailored to the needs of producers and consumers. Each product team manages the infrastructure needed to expose data as a domain-aligned product. Infrastructure can be centralized and monolithic, as in the hybrid approach, or decentralized into “microservices,” as in the canonical Data Mesh approach. Deployment of these applications is orchestrated by a self-serve data infrastructure platform.

In previous articles, my colleague Devin Strickland explored the differences and similarities between software and data architectures. He showed how an Enterprise Data Platform (EDP), built on the principles of Data Mesh, treats data as a product, thereby keeping data systems aligned with the business context and preserving meaning.

EDP is a response to the limitations of conventional warehouse or lake architectures. When a specialized data team is isolated from the business and solely responsible for collecting, processing, and serving data to the entire organization, data becomes difficult to find, access, and use. Value is lost. Time and money are wasted.

Here we examine some of these specific problems and briefly discuss how our EDP approach solves them. In subsequent articles, we will explore each in greater detail and share tutorials that demonstrate a practical approach aligned with Domain-Driven Design and Data Mesh principles.

We have data quality problems! These include invalid or missing values, unclear formatting, drift, etc. Our teams don’t want to publish numbers because they don’t think they’re credible.
Solved: Domain-oriented decentralized data ownership and architecture

Much of the software world has adopted a product mindset where cross-functional teams—including business and technical experts—own the entire development process. This creates a more holistic approach to implementation and reduces the errors associated with writing and communicating requirements.

Centralized, standalone data teams experience similar problems with context and communication. Because so much is lost between teams, consumers come to believe that they can’t really trust the data that is published.

Taking a cue from software, we can augment existing product teams with data owners and data engineers in a way that enables those teams to create data products. These teams can take responsibility for ensuring their data product meets their own definitions of business success. More importantly, nobody knows the data like the team that produced it, and they can be the ones to verify credibility before it reaches a warehouse.

My IT architecture is overly complex and out of date. I want to streamline and modernize while respecting organizational differences.
Solved: Self-serve data infrastructure as a platform

Building centralized data infrastructure is complicated and expensive, so decisions about how data is moved and transformed are made globally and driven by the infrastructure itself. Creating a unique platform for each team that wants to develop data models is not achievable at scale, so data engineers, analysts, and scientists are forced to conform to a single process: all the data is transformed, or none of it is. Our goal is to flip this script with an architecture that enables data processing based on business needs.

To create a data architecture that achieves and maintains alignment with business needs, our tooling must provide a few critical elements: automatable deployments, extensibility, and guardrails to guide development. These enable cross-functional teams to create clear and testable measures for their data models and transformations, allowing us to “shift governance left” (analogous to shifting testing left in product teams).
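As a minimal sketch of what “shifting governance left” can look like in practice, consider a quality check that the producing team versions alongside its code and runs in CI before anything is deployed. The column names and rules here are illustrative assumptions, not part of any specific platform:

```python
# A minimal, hypothetical "shift governance left" check: the producing team
# encodes its quality rules as tests that run in CI before deployment.
import pandas as pd

def check_orders(df: pd.DataFrame) -> list[str]:
    """Return human-readable quality violations; an empty list means pass."""
    violations = []
    if df["order_id"].isna().any():
        violations.append("order_id has missing values")
    if not df["order_id"].is_unique:
        violations.append("order_id has duplicates")
    if (df["amount"] < 0).any():
        violations.append("amount has negative values")
    return violations

sample = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, -5.0, 3.0]})
for problem in check_orders(sample):
    print(problem)  # flags the duplicate id and the negative amount
```

Because the check lives with the product team’s code, a failing rule blocks the pipeline before bad data reaches consumers, rather than being caught downstream by a central team.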

The microservices paradigm, along with continuous integration and continuous delivery (CI/CD), has brought about just such a change in software development. Having data models as code opens the door to data automation and model-driven orchestration: just like software applications, we can deploy data pipelines in a controlled and repeatable way.
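To make “data models as code” concrete, here is a minimal sketch of a pipeline declared as a plain, version-controlled data structure that a platform could render into orchestrator jobs. Every name, table, and function is a hypothetical illustration rather than any specific platform’s API:

```python
# A hypothetical pipeline spec: declared as code, it can be reviewed,
# tested, and deployed through CI/CD like any other application artifact.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PipelineSpec:
    name: str
    source_table: str
    destination_table: str
    transformation_sql: str
    quality_checks: list[str] = field(default_factory=list)

orders_pipeline = PipelineSpec(
    name="orders_daily",
    source_table="raw.orders",
    destination_table="analytics.orders_clean",
    transformation_sql="SELECT order_id, amount FROM raw.orders WHERE amount >= 0",
    quality_checks=["order_id_not_null", "order_id_unique"],
)

def deploy(spec: PipelineSpec) -> None:
    # Stand-in for the platform step that renders the spec into real jobs.
    print(f"deploying {spec.name}: {spec.source_table} -> {spec.destination_table}")

deploy(orders_pipeline)
```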

This creates the right separation of responsibilities between teams: developing data products is the focus of the stream-aligned producing teams. With minimal help from a platform team, they can write transformation code that is readily deployable. Freed from having to know the details of the business, the platform team can focus on eliminating friction in deploying data products to a robust data infrastructure.

We are drowning in data, but can’t figure out how to use it due to a wide variety of formats and lost context.
Solved: Federated computational governance

The people who know the data best are the ones closest to it, who work with it most often. Shifting governance responsibilities away from a central body and toward product teams allows data quality and validation to happen within the context in which the data is produced. This leads to higher-quality data, where teams can trust what is being produced and the validations it went through.

The move to a federated model does not remove the need for a central body that defines enterprise guidelines and standards. Data decays and loses value over time if not managed diligently, and not all data is equally valuable. Through tight partnership with consumers, producers understand which data is most critical to the business, so they know what to prioritize. Then the stewards who are closest to the data develop the specific rules to meet the enterprise guidelines.
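One way to picture this split, as a minimal sketch: the central body publishes a guideline as an interface, and each domain’s steward supplies the concrete rule for their own data. The class and field names here are illustrative assumptions:

```python
# A hypothetical federated-governance pattern: the enterprise defines the
# contract; domain stewards implement the domain-specific rule.
from abc import ABC, abstractmethod

class EnterpriseGuideline(ABC):
    """Central standard: every record must be validated before publishing."""
    @abstractmethod
    def validate(self, record: dict) -> bool: ...

class OrdersCompleteness(EnterpriseGuideline):
    """Concrete rule written by the orders domain's data steward."""
    REQUIRED_FIELDS = ("order_id", "customer_id", "amount")

    def validate(self, record: dict) -> bool:
        return all(record.get(f) is not None for f in self.REQUIRED_FIELDS)

rule = OrdersCompleteness()
print(rule.validate({"order_id": 1, "customer_id": 7, "amount": 10.0}))  # True
print(rule.validate({"order_id": 1, "amount": 10.0}))                    # False
```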

Our people don’t know what data is out there for them to use, and the producers are hesitant to make it available because they can’t be sure that it is being used responsibly.
Solved: Data catalog

It may seem as if moving away from a centralized source for data governance and a data dictionary would slow the adoption of governance policies; however, in a federated model each product team is responsible for meeting the enterprise standards. Publishing complete metadata to a data catalog is a critical component of any data solution. Without a reliable, easy-to-use, and robust way for data consumers, data owners, and data stewards to monitor and view data, the system falls apart.

Data catalogs address the trust problem by linking policy and process to the metadata of a dataset. This includes links to documentation about relevant policy and access guidelines. Some off-the-shelf solutions support automated handling of requests and approvals for access to information. The catalog becomes a one-stop shop for data discovery and data access.
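As a minimal sketch of what a product team might publish to such a catalog, here is a hypothetical entry tying a dataset to its owner, its policy documentation, and an access-request endpoint. The fields and URLs are illustrative assumptions; real catalogs define their own schemas:

```python
# A hypothetical catalog entry linking a dataset to its policy and
# access-request process; fields and URLs are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class CatalogEntry:
    dataset: str
    owner: str
    description: str
    policy_url: str          # documentation on the relevant access policy
    access_request_url: str  # where consumers request access

entry = CatalogEntry(
    dataset="analytics.orders_clean",
    owner="orders-domain-team",
    description="Deduplicated, validated daily orders.",
    policy_url="https://example.com/policies/orders",
    access_request_url="https://example.com/access/orders",
)
print(entry)
```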

Conclusion

While we highlight a few common data platform issues here, we know there are many more, and the challenges presented here will not evaporate overnight. A thoughtful approach focused on delivering business value is critical to any platform’s adoption. We believe that a modern EDP built on Data Mesh principles, treating data as a product and decentralizing governance, lets us solve these problems in a scalable way.

Based on our experience, these issues are not unique to any one organization. They are faced by organizations large and small, in both the private and public sectors. Keep an eye out for our next few articles, where we dive deeper into each of these problem statements and how we approach solving them!

Contributors

Chad Baily

Cloud Engineer