The exponential growth of data has coincided with an explosion of storage systems and distributed architectures: streaming with Kafka/Kinesis, data warehouses like Redshift and Snowflake, and relational, non-relational, and graph databases like MySQL, MongoDB, and Neo4j. Meanwhile, the rise of roles that apply numerous transformations to data (e.g., Data Scientists and Data Analysts) and of tools like dbt that make those transformations easier has led to increasingly sophisticated use of data. This growth, coupled with heightened awareness of data privacy, has also brought increased regulation and reputational risk.
As data pipelines have grown in complexity, it has become increasingly difficult to determine if the data can be trusted for use, where the data has come from, who has accessed it, how the data was transformed or created, and whether the data contains any sensitive information. Data lineage allows teams to stay compliant with increasing regulation and understand the data’s journey, including its origination, transformations, and quality.
With the rising importance of this technology in mind, our team convened industry experts to discuss open questions around data lineage. Our conversation yielded three key insights:
- Data lineage is only part of the picture, and is most impactful when integrated as part of an end-to-end data platform encompassing data discovery, data quality, and data access management.
- Even as part of a broader data platform, data lineage is most useful if it is tied to specific use cases and pain points.
- Data lineage plays a fundamental role in the democratization of data and the unification of non-technical business users with technical users and upstream data providers with downstream data consumers.
1. Data lineage is only part of the picture, and is most impactful when integrated as part of an end-to-end data platform encompassing data discovery, data quality, and data access management.
Because its effectiveness depends on an organization’s data maturity, a standalone lineage product on its own yields low returns. Lineage helps answer the critical questions posed by other data ops tooling and therefore requires broad integration across an organization’s entire data map, including artifacts such as queries, dashboards, machine learning models, and metrics. Notably, data lineage is most useful when incorporated into an all-in-one solution that encompasses other components of the data platform, such as:
- Data Discovery/Data Catalogs: Platforms like DataHub and Amundsen help users discover data and metadata across an organization and understand where it lives and what it is for. Catalogs let users debug lineage issues by looking in the right location or reaching the team responsible for the data.
- Data Quality: Solutions like Great Expectations or Monte Carlo validate the data being transformed and transferred through the pipeline into its destination system (a model, report, etc.), ensuring it conforms to the intended schema and values. Data lineage builds on data quality checks to help mitigate downstream impact and identify root causes upstream when debugging issues.
- Data Access Management: Appropriate RBAC and governance mechanisms are necessary to ensure that the individuals and systems that touch data throughout its lineage journey adhere to security protocols and permissions. Lineage can help keep organizations in compliance by showing where personally identifiable information (PII) is stored and who has accessed it (see the sketch following this list).
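To make that last point concrete, here is a minimal, hypothetical sketch in Python (the table and dashboard names are invented, and networkx stands in for whatever graph store a real lineage system would use) of how a lineage graph can propagate a PII tag downstream, so compliance teams can see every artifact that falls within scope:

```python
# Hypothetical lineage graph; edges point from upstream producers to
# downstream consumers. All names are illustrative only.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edge("raw.users", "staging.users_clean")
lineage.add_edge("staging.users_clean", "analytics.churn_model")
lineage.add_edge("staging.users_clean", "dashboards.revenue")
lineage.add_edge("raw.events", "dashboards.revenue")

# Datasets known to contain personally identifiable information.
pii_sources = {"raw.users"}

# Every artifact reachable from a PII source inherits the PII flag.
pii_scope = set()
for source in pii_sources:
    pii_scope |= {source} | nx.descendants(lineage, source)

print(sorted(pii_scope))
# ['analytics.churn_model', 'dashboards.revenue', 'raw.users', 'staging.users_clean']
```

Coupled with access logs, the same traversal would answer who has touched PII-derived artifacts, not just where the PII flows.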
As a result, few buyers are willing to purchase a standalone data lineage product without these ancillary systems in place, even though one attendee hailed data lineage as the “holy grail” of data usage. Any one of the above solutions might be the initial insertion point, but the common goal is to converge on an all-in-one platform that enables lineage as the end use case.
2. Even as part of a broader data platform, data lineage is most useful if it is tied to specific use cases and pain points.
As described above, the benefits of a standalone data lineage product are limited. To create a data lineage product with a chance of widespread adoption, one must first clearly articulate the problem statements a given role (e.g., a Data Analyst or Data Scientist) faces and then build an effective product for those specific use cases. These use cases vary by organization and industry, depending on each company’s requirements.
Example use cases discussed ranged from simple explainability scenarios to those requiring deep data understanding, including:
- Explainability: A financial institution might need to defend its risk model scores to regulators by tracing the lineage of specific data and how it was transformed to feed into an ML model.
- Sudden Business Changes: A company might see a sudden spike or decline in a core business KPI (e.g., churn) and use lineage to trace whether the change stems from valid business reasons or from technical data errors.
- Metrics Alignment: Since any given metric can have multiple definitions within an organization (e.g., several ways to calculate churn or net revenue retention across teams), companies can use data lineage to trace how a particular KPI was calculated and which sources fed it (see the sketch below).
Without a framework to investigate a specific problem, there is little value in understanding lineage. Instead, packaging data lineage with a pre-defined goal is most impactful. The key is to start with a set of business questions and then translate those into logical lineage solutions that leverage other elements of data ops.
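As an illustration of the KPI-tracing use case above, the sketch below (again with invented metric and table names) walks the same kind of lineage graph upstream from a churn metric to the raw sources that feed it, which is the natural first step in deciding whether a sudden move in the number reflects the business or a data defect:

```python
# Hypothetical sketch: tracing a dashboard KPI back to its upstream sources.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edge("raw.subscriptions", "staging.subscriptions_clean")
lineage.add_edge("raw.invoices", "staging.invoices_clean")
lineage.add_edge("staging.subscriptions_clean", "metrics.churn_rate")
lineage.add_edge("staging.invoices_clean", "metrics.net_revenue_retention")
lineage.add_edge("metrics.churn_rate", "dashboards.executive_kpis")
lineage.add_edge("metrics.net_revenue_retention", "dashboards.executive_kpis")

# Everything upstream of the churn metric: the candidate tables to inspect
# when the number on the dashboard looks wrong.
upstream = nx.ancestors(lineage, "metrics.churn_rate")
print(sorted(upstream))
# ['raw.subscriptions', 'staging.subscriptions_clean']
```

In practice the graph would be assembled automatically from query logs, dbt manifests, or orchestration metadata rather than by hand, but the traversal that answers the business question stays this simple.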
3. Data lineage plays a fundamental role in the democratization of data and the unification of non-technical business users with technical users and upstream data providers with downstream data consumers.
Data lineage is useful not only for explaining business decisions to external stakeholders for regulation and compliance, but also for enabling the democratization of data within the enterprise. Data is no longer leveraged solely by data scientists, but by all parts of the organization, including non-technical functions like sales, business development, and finance.
The fundamental problem a data lineage solution addresses is the disconnect between downstream consumers and upstream producers of data. For example, a business analyst might notice an issue with a churn metric in a Looker or Tableau dashboard. Solving that problem today and ensuring the data is “fresh” and up to date takes significant effort, involving numerous parties and putting pressure on the data team to debug. Debugging depends not only on the data team running SQL queries, but on walking the entire lineage of the data and relying on subject-matter expertise to understand what different fields mean and why certain transformations occurred. To highlight the extent of this pain today, one of our Roundtable participants mentioned speaking with various Heads of Data who each spent 100 hours per quarter just fixing reports.
A holistic data lineage system empowers anyone to use real-time data to make important decisions, and enables each person along the lineage, technical or otherwise, to either self-diagnose or collaborate with another part of the organization, often without going through the data engineering team. Great Expectations is one tool that is particularly powerful for bridging non-technical and technical roles: it converts its data quality checks into human-readable documentation, so anyone in the company can interpret the expected values for their data.
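To give a rough sense of how that bridging works, the sketch below uses Great Expectations’ classic pandas-flavored API (method names and return types vary by version, and the file and column names here are invented). Each expectation is both a machine-checkable test and, when rendered through the project’s Data Docs feature, a plain-language statement that a non-technical reader can interpret:

```python
import great_expectations as ge

# Load a (hypothetical) churn dataset with the classic pandas-flavored API.
df = ge.read_csv("churn_by_month.csv")

# Each expectation reads almost like documentation:
# "customer_id is never null", "churn_rate is between 0 and 1".
df.expect_column_values_to_not_be_null("customer_id")
result = df.expect_column_values_to_be_between("churn_rate", min_value=0, max_value=1)

# Each check returns a result with a success flag; in a fuller setup,
# Data Docs renders the same expectations as browsable, human-readable pages.
print(result.success)
```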
One of our core theses at Unusual is the end-to-end unification of business and technical users, and data lineage addresses exactly that disconnect. Ensuring smooth communication between the business executives who consume data and the technical teams who produce it is no longer a “nice to have”. Businesses must invest now in tools so that they can rely on data in real time to make critical decisions.
We were pleased to convene this group of leading experts around this important and growing field. The learnings from our discussion have strengthened our thesis around the democratization of data and we are excited to back the next generation of data lineage companies. If you’d like to chat more about this space, we would love to share perspectives.
Contributors
Abe Gong, Co-founder and CEO of Superconductive (company behind Great Expectations)
Atul Gupte, Product Manager for Facebook Reality Labs, former Product Manager for Uber's Data Products
Ananth Packkildurai, Principal Data Engineer at Zendesk
Amandeep Khurana, Founder of Okera Inc.
Timothy Chen, Partner at Essence Ventures
Sign up for our newsletter below to receive updates on future Unusual Roundtables.