October 23, 2025
AI/ML Infrastructure · Data Logging · Data Governance

ML Data Catalog: Lineage, Discovery, and Governance

Here's a problem you've probably faced: a model in production starts behaving strangely. The data scientist who built it left three months ago. You need to know exactly which datasets fed that model, which columns mattered most, and whether any of the underlying data sources changed. You dig through scattered documentation, Slack conversations, and deprecated notebooks. Six hours later, you've got partial answers and a headache.

This is the cost of data chaos - and it's preventable.

A modern ML data catalog solves this by creating a single source of truth for your data assets: their lineage (where they come from, how they transform), their discoverability (semantic search, automatic tagging), and their governance (who can access what, and why). When you tie this to your ML infrastructure - linking datasets to model training runs, feature definitions to raw columns - you turn data from a liability into your competitive advantage.

Let's build this together.

Table of Contents
  1. Why Your ML Team Needs a Data Catalog (Beyond Compliance)
  2. Why This Matters in Production
  3. OpenMetadata vs DataHub: Architecture and Philosophy
  4. OpenMetadata: Metadata as First-Class Entity
  5. DataHub: Graph-Based Metadata
  6. Column-Level Lineage: From Raw Data to Features
  7. Why It Matters for ML
  8. Instrumenting dbt for Column Lineage
  9. ML-Specific Metadata: Linking Datasets to Models
  10. Connecting MLflow to the Catalog
  11. Governance at Scale: Tag-Based Access Control
  12. Tag-Based Governance with DataHub + Ranger
  13. Self-Service Data Discovery
  14. Semantic Discovery in DataHub
  15. Self-Service Data Requests
  16. Common Implementation Pitfalls and How to Avoid Them
  17. Metadata Staleness and Sync Failures
  18. Governance Without Adoption
  19. Feature Store Integration Gaps
  20. Integration Patterns: Connecting Your Ecosystem
  21. Operationalizing the Catalog
  22. Measuring Adoption and Impact
  23. Cost and Scaling Considerations
  24. Storage and Compute Requirements
  25. Integration Complexity
  26. Summary
  27. Debugging Catalog Failures: Technical Troubleshooting
  28. The Adoption Journey: Common Mistakes and How to Avoid Them
  29. Building Governance That People Actually Follow
  30. The Catalog as Your ML Intelligence System
  31. Making the Investment Pay Off

Why Your ML Team Needs a Data Catalog (Beyond Compliance)

You might think "data catalog" sounds like a compliance checkbox. It's not. A catalog is operational infrastructure for speed.

Without one, your team spends energy on friction:

  • Discovery waste: Data scientists re-create datasets that already exist. You're computing the same aggregations twice.
  • Trust gaps: When lineage is unclear, teams hoard data and build local copies. Governance becomes impossible.
  • Incident blind spots: A data quality issue cascades through 20 models before anyone notices because there's no traceability.
  • Onboarding drag: New engineers can't find the datasets they need. They build their own or use outdated sources.

A catalog flips this:

  • Reuse accelerates: "Show me all datasets with customer_segment" returns results with full context (freshness, quality scores, who owns it).
  • Lineage prevents cascades: Change a source table schema? The catalog shows which 47 downstream assets break.
  • ML traceability: Link a model prediction to the exact training dataset, feature definitions, and even the MLflow run that built it.
  • Self-service governance: Tag a dataset as containing PII once. Access control and masking rules trigger automatically.

Think about the compounding value. After one month, your data team saves 2-3 hours per person per week on searching for data. After three months, you've caught data quality issues before they broke models. After six months, you're onboarding new engineers in days instead of weeks. The time value of this infrastructure is enormous.

Why This Matters in Production

When your model starts failing, you have two options: panic or investigate. A data catalog gives you the investigation path. You can trace every input feature back to its source, check when that source last changed, see who owns it, and decide whether to retrain. Without a catalog, you're guessing. You're re-creating entire datasets to understand what went wrong.

More importantly, a catalog is how you prevent failures. If you can see that your customer_segment feature depends on a raw demographic table, and that table's update frequency slowed down last week, you can flag it proactively. You can ask the data owner if the schema changed. You can test your model against the new data before it breaks production.
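That proactive check is easy to script once lineage and freshness live in one place. Here's a minimal sketch, with a hard-coded lineage map standing in for the catalog query (the table names, cadences, and timestamps are illustrative):

```python
from datetime import datetime, timedelta

# Hypothetical lineage map: feature -> upstream tables, each with an
# expected update cadence and the last update the catalog observed.
UPSTREAMS = {
    "customer_segment": [
        {"table": "raw.demographics",
         "expected_interval": timedelta(hours=24),
         "last_updated": datetime(2025, 10, 20, 3, 0)},
    ],
}

def stale_upstreams(feature, now):
    """Return upstream tables whose update cadence has slipped."""
    alerts = []
    for up in UPSTREAMS.get(feature, []):
        age = now - up["last_updated"]
        if age > up["expected_interval"]:
            alerts.append((up["table"], age))
    return alerts

now = datetime(2025, 10, 23, 9, 0)
for table, age in stale_upstreams("customer_segment", now):
    print(f"{table} last updated {age} ago - check with the owner")
```

Wire this to a scheduler and you flag the slowdown before the model degrades, not after.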

OpenMetadata vs DataHub: Architecture and Philosophy

Two platforms dominate the modern data catalog space: OpenMetadata and Acryl DataHub. Both solve the core problem, but with different philosophies.

OpenMetadata: Metadata as First-Class Entity

OpenMetadata treats metadata as a first-class citizen with its own schema, storage layer, and API contracts. This approach emphasizes structure, validation, and opinionated design.

Architecture:

  • Metadata backend: PostgreSQL or MySQL stores a normalized metadata model (connectors, databases, tables, columns, lineage edges, tests).
  • Search layer: Elasticsearch indexes all entities for fast discovery.
  • API design: REST-first, strongly typed OpenMetadata entities (Table, Database, User, DataAsset, etc.).
  • Lineage graph: Explicit lineage_details edges with transformation context.

Strengths for ML:

  • Native support for ML artifacts: MLflow models, feature stores, experiment metadata.
  • Column-level lineage through column_lineage entity.
  • JSON schema validation ensures consistency across integrations.

Why this matters: When your ML platform team designs the schema, they can add ML-specific fields directly. You're not retrofitting ML concepts onto a generic data system. This is important for large teams where data governance and ML governance need to work together seamlessly.

Example OpenMetadata entity:

json
{
  "id": "d1a2b3c4d5e6f7g8",
  "type": "Table",
  "name": "customer_features_v2",
  "databaseSchema": {
    "id": "schema-456",
    "name": "ml_features"
  },
  "columns": [
    {
      "name": "customer_id",
      "dataType": "BIGINT",
      "description": "Unique customer identifier"
    },
    {
      "name": "lifetime_value",
      "dataType": "FLOAT",
      "lineageDetails": {
        "upstream": ["raw.customers.total_spend"]
      }
    }
  ],
  "lineage": {
    "edges": [
      {
        "fromEntity": "raw.customers",
        "toEntity": "customer_features_v2",
        "lineageDetails": {
          "sqlQuery": "SELECT customer_id, SUM(order_value) as lifetime_value FROM raw.customers GROUP BY customer_id"
        }
      }
    ]
  }
}
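Reading such an entity back out goes through OpenMetadata's REST API. A hedged sketch of the lookup-by-name call (the host and fully qualified name are placeholders for your deployment):

```python
import json
from urllib.request import urlopen

def table_endpoint(host, fqn, fields="columns,lineage,owner"):
    """Build the OpenMetadata lookup-by-name URL for a table entity."""
    return f"{host}/api/v1/tables/name/{fqn}?fields={fields}"

def get_table(host, fqn):
    """Fetch a table entity, including columns and lineage, by FQN."""
    with urlopen(table_endpoint(host, fqn), timeout=10) as resp:
        return json.load(resp)

# Against a live instance (host and FQN are deployment-specific):
# t = get_table("http://openmetadata:8585",
#               "snowflake_prod.ml_features.customer_features_v2")
```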

DataHub: Graph-Based Metadata

DataHub (by Acryl) adopts a graph model where every asset is a node and relationships are edges in a knowledge graph. This approach emphasizes flexibility and graph-native queries.

Architecture:

  • Metadata store: Elasticsearch or Neo4j holds the graph. Every entity is a node with properties.
  • Aspect model: Metadata is stored as "aspects" - modular pieces attached to entities (OwnershipAspect, DataQualityAspect, SchemaAspect).
  • API design: GraphQL-first for traversing relationships.
  • Lineage: Stored as UpstreamAspect and DownstreamAspect with transformation metadata.

Strengths for ML:

  • Graph queries are natural for lineage: "What models depend on this dataset?" is one query.
  • Aspect-based flexibility: add custom metadata without schema migrations.
  • Python SDK for programmatic metadata emission from pipelines.

Why this matters: DataHub's graph model makes it easy to answer complex questions. "Show me all datasets that feed models with NDCG degradation in the last week" becomes a graph traversal. This flexibility comes in handy when your governance requirements evolve.

Example DataHub entity:

python
# Emit a training run's upstream dataset as lineage metadata
from datahub.emitter.mce_builder import make_lineage_mce
from datahub.emitter.rest_emitter import DatahubRestEmitter

training_run_urn = "urn:li:mlModel:(urn:li:dataPlatform:mlflow,customer_churn_v3,PROD)"
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,snowflake.ml_features.customer_features_v2,PROD)"

# make_lineage_mce builds a MetadataChangeEvent linking the downstream
# entity to its upstreams. (Training details like the 50/30/20
# train/val/test split can be logged on the MLflow run itself.)
lineage_mce = make_lineage_mce(
    upstream_urns=[dataset_urn],
    downstream_urn=training_run_urn,
)

DatahubRestEmitter("http://datahub:8080").emit_mce(lineage_mce)

Comparison table:

Aspect                  | OpenMetadata               | DataHub
------------------------|----------------------------|----------------------------
Model                   | Relational + Elasticsearch | Graph (Neo4j/ES)
Lineage                 | Explicit edges with SQL    | Upstream/Downstream aspects
Query                   | REST entities              | GraphQL graph traversals
ML integration          | Native MLflow entities     | Custom aspects via SDK
Flexibility             | Schema-first               | Aspect-based (looser)
Operator learning curve | Moderate                   | Steeper (graph thinking)

Pick OpenMetadata if: You want opinionated ML metadata out of the box and favor schema consistency.

Pick DataHub if: You have custom metadata needs and want programmable lineage from your ML pipeline code.


Column-Level Lineage: From Raw Data to Features

Table-level lineage is a start - it tells you which tables a dataset depends on. But ML features often derive from complex transformations across multiple columns. To truly understand model inputs, you need column-level lineage: tracing each output column back to its raw source columns.

Why It Matters for ML

A model trained on customer_churn_features depends on hundreds of columns across raw tables. When a raw column changes (schema update, quality drop), you need to know:

  1. Which feature columns depend on it?
  2. Which models use those features?
  3. Should we retrain?

Without column-level lineage, you're guessing. You'll make decisions based on incomplete information. With it, you answer these questions in minutes.

Consider a real scenario: your total_customer_spend column drops to zero for 10% of customers. With column-level lineage, you instantly know that five features depend on it, two models use those features, and both are at risk. You can trigger immediate retraining or issue an alert. Without lineage, you discover the problem when customers complain that predictions are wrong.
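That "what's at risk?" question is just a downstream walk over the lineage graph. A self-contained sketch, with a hard-coded edge map standing in for the catalog's lineage API (all names are illustrative):

```python
from collections import deque

# Hypothetical column-level lineage edges: asset -> its direct dependents
# (feature columns and models), as the catalog would return them.
DOWNSTREAM = {
    "raw.customers.total_customer_spend": [
        "features.lifetime_value", "features.avg_monthly_spend",
        "features.spend_trend", "features.spend_rank", "features.is_high_value",
    ],
    "features.lifetime_value": ["model:churn_v3"],
    "features.spend_trend": ["model:churn_v3", "model:upsell_v1"],
}

def blast_radius(column):
    """Walk downstream lineage and collect every affected asset."""
    affected, queue = set(), deque([column])
    while queue:
        node = queue.popleft()
        for dep in DOWNSTREAM.get(node, []):
            if dep not in affected:
                affected.add(dep)
                queue.append(dep)
    return affected

hit = blast_radius("raw.customers.total_customer_spend")
models = {a for a in hit if a.startswith("model:")}
print(f"{len(hit - models)} features and {len(models)} models at risk: {sorted(models)}")
# 5 features and 2 models at risk: ['model:churn_v3', 'model:upsell_v1']
```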

Instrumenting dbt for Column Lineage

dbt is a natural fit for feature building, and its declarative SQL transformations lend themselves to column lineage extraction: each model's dependencies and raw SQL land in the dbt manifest. Here's how to extract column-level lineage:

Step 1: Tag your feature models with metadata (dbt writes the manifest automatically on every run):

yaml
# dbt_project.yml
models:
  ml_models:
    churn_features:
      +meta:
        owner: data-science
        lineage_type: feature_engineering

Step 2: Extract lineage from dbt artifacts:

dbt generates a manifest.json after each run containing the dependency graph and each model's raw SQL. Here's how to parse it and emit column-level lineage to your catalog:

python
import json
import re
from collections import defaultdict

def extract_column_lineage(manifest_path, target_model):
    """
    Extract column-level lineage for a dbt model.
    Returns dict: {output_col -> [source_col info]}
    """
    with open(manifest_path) as f:
        manifest = json.load(f)

    # Find the model node
    model_key = f"model.my_project.{target_model}"
    if model_key not in manifest["nodes"]:
        raise ValueError(f"Model {target_model} not found")

    model_node = manifest["nodes"][model_key]
    depends_on = model_node["depends_on"]["nodes"]

    # dbt >= 1.3 calls this "raw_code"; older versions used "raw_sql"
    sql = model_node.get("raw_code") or model_node.get("raw_sql", "")

    lineage = defaultdict(list)

    # Simple regex over "source_col AS output_col" aliases.
    # This is deliberately simplified; use sqlglot for robust SQL parsing.
    for match in re.finditer(r'(\w+)\s+as\s+(\w+)', sql, re.IGNORECASE):
        source_col, output_col = match.group(1), match.group(2)
        lineage[output_col].append({
            "source_column": source_col,
            "upstream_models": [d.split(".")[-1] for d in depends_on if d.startswith("model.")]
        })

    return dict(lineage)

# Usage
lineage = extract_column_lineage("target/manifest.json", "churn_features")
print(lineage)
# Example output:
# {
#   'churn_risk_score': [{'source_column': 'risk_metric', 'upstream_models': ['feature_base']}],
#   'customer_segment': [{'source_column': 'segment_id', 'upstream_models': ['customers']}]
# }

Step 3: Emit to DataHub with column context:

python
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    FineGrainedLineageClass,
    FineGrainedLineageDownstreamTypeClass,
    FineGrainedLineageUpstreamTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

emitter = DatahubRestEmitter("http://datahub:8080")

raw_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,raw.customers,PROD)"
features_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,ml_features.customer_features_v2,PROD)"

# Column-level edges are "fine-grained lineage" in DataHub:
# schemaField URNs on both sides of the mapping
column_lineage = FineGrainedLineageClass(
    upstreamType=FineGrainedLineageUpstreamTypeClass.FIELD_SET,
    upstreams=[
        f"urn:li:schemaField:({raw_urn},customer_id)",
        f"urn:li:schemaField:({raw_urn},total_spend)",
    ],
    downstreamType=FineGrainedLineageDownstreamTypeClass.FIELD,
    downstreams=[f"urn:li:schemaField:({features_urn},customer_lifetime_value)"],
    transformOperation="SUM(total_spend) GROUP BY customer_id",
)

upstream = UpstreamLineageClass(
    upstreams=[UpstreamClass(dataset=raw_urn, type=DatasetLineageTypeClass.TRANSFORMED)],
    fineGrainedLineages=[column_lineage],
)

ML-Specific Metadata: Linking Datasets to Models

The killer feature of a modern catalog is ML traceability: you can ask "which training dataset built this model?" and get a complete answer with feature importance, train/val/test split ratios, hyperparameters, and the exact training timestamp.

Connecting MLflow to the Catalog

MLflow experiments track training runs. A modern catalog should link those runs to their input datasets. Here's how you build this bridge:

Step 1: Log dataset info in MLflow:

python
import mlflow

mlflow.set_experiment("churn_models")

with mlflow.start_run():
    # Log the training dataset version
    mlflow.set_tag("dataset_version", "customer_features_v2")
    mlflow.set_tag("dataset_rows", 450000)
    mlflow.set_tag("dataset_columns", 87)
    mlflow.set_tag("training_split_date", "2026-02-01")

    # Log feature importance (identifies which columns mattered)
    feature_importance = {
        "account_age_days": 0.285,
        "monthly_spend": 0.201,
        "support_tickets": 0.156,
        "contract_type_is_monthly": 0.124,
    }
    mlflow.log_dict(feature_importance, "feature_importance.json")

    # Log model metrics
    mlflow.log_metric("auc", 0.824)
    mlflow.log_metric("precision", 0.71)
    mlflow.log_metric("recall", 0.68)

    model = train_churn_model()  # your training function
    mlflow.sklearn.log_model(model, "model")

Step 2: Emit MLflow run as DataHub lineage:

python
import mlflow
import requests
from datetime import datetime
 
def emit_mlflow_run_to_datahub(mlflow_run_id, dataset_urn):
    """
    Create DataHub lineage from an MLflow run to its training dataset.
    """
    # Fetch the MLflow run
    mlflow_client = mlflow.tracking.MlflowClient()
    run = mlflow_client.get_run(mlflow_run_id)
 
    # Build the model URN
    model_name = run.data.tags.get("model_name", "unknown")
    model_urn = f"urn:li:mlModel:(urn:li:dataPlatform:mlflow,{model_name},PROD)"
 
    # Create lineage payload
    lineage_payload = {
        "entityUrn": model_urn,
        "aspect": {
            "aspectName": "upstreamLineage",
            "version": 0,
            "content": {
                "upstreams": [
                    {
                        "dataset": dataset_urn,
                        "type": "COPY",
                        "lineageDetails": {
                            "documentationUrl": f"https://mlflow-tracking.company.com/#/runs/{mlflow_run_id}",
                            "transformationDescription": f"Model trained {datetime.now().isoformat()}",
                            "customProperties": {
                                "feature_importance": run.data.params.get("feature_importance", "{}"),
                                "train_rows": run.data.tags.get("dataset_rows"),
                                "auc": str(run.data.metrics.get("auc", 0))
                            }
                        }
                    }
                ]
            }
        }
    }
 
    # Emit to DataHub REST API
    response = requests.post(
        "http://datahub:8080/api/entities?action=ingest",
        json=lineage_payload,
        headers={"Content-Type": "application/json"}
    )
 
    return response.status_code == 200

Governance at Scale: Tag-Based Access Control

A catalog with perfect lineage and discoverability is useless if access isn't governed. You need policies that:

  1. Automatically identify sensitive data (PII, financial, health info)
  2. Enforce access rules (who can see what)
  3. Trigger data transformations (masking, redaction)

Tag-Based Governance with DataHub + Ranger

Most modern catalogs support tagging. Here's how to build governance on top of tags:

Step 1: Define governance tags:

python
import requests

# In DataHub, create tags via the UI or API
tags = [
    {"name": "PII", "description": "Contains personally identifiable information"},
    {"name": "CONFIDENTIAL", "description": "Restricted access"},
    {"name": "PRODUCTION_CRITICAL", "description": "Core to production systems"},
    {"name": "ML_TRAINING_DATA", "description": "Used for model training"},
]

# Emit tags to DataHub (the endpoint path varies by DataHub version;
# this mirrors the legacy snapshot-based ingest shape)
for tag in tags:
    payload = {
        "proposedSnapshot": {
            "com.linkedin.pegasus2avro.tag.TagProperties": {
                "name": tag["name"],
                "description": tag["description"]
            }
        }
    }
    requests.post("http://datahub:8080/api/tags", json=payload)

Step 2: Auto-tag datasets with ML classifiers:

You can use a lightweight zero-shot classifier to detect PII, financial info, or other sensitive patterns:

python
import pandas as pd
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)

def classify_column(col_name, sample_values):
    """
    Use zero-shot classification to identify column sensitivity.
    """
    labels = ["PII", "Financial", "Normal", "Confidential"]
    text = f"{col_name}: {sample_values[:3]}"

    result = classifier(text, labels)
    return result["labels"][0], result["scores"][0]

# Tag columns automatically (`conn` is your DB connection;
# `emit_column_tag` wraps your catalog's tag API)
dataset = pd.read_sql("SELECT * FROM raw.customers LIMIT 100", conn)
for col in dataset.columns:
    label, conf = classify_column(col, dataset[col].astype(str).tolist())
    if conf > 0.7:
        emit_column_tag(f"raw.customers.{col}", label)

Step 3: Policy enforcement with Apache Ranger:

Ranger integrates with catalogs to enforce attribute-based access control (ABAC). Define policies (shown schematically here - Ranger's actual policy format is JSON, but the intent maps directly):

yaml
# Apache Ranger policy: Deny access to PII unless approved
PolicyName: "Block PII Access"
Conditions:
  - Resource: "tag=PII"
    Action: "READ"
    User: "*"
    Effect: "DENY"
 
Exception:
  - User: "data-governance-team"
  - User: "compliance-officer"
  - Effect: "ALLOW"
 
DataMasking:
  - Resource: "tag=PII"
    Action: "READ"
    Mask: "MASK_SHOW_LAST_4" # For IDs: show only last 4 digits

When a user queries a PII-tagged column, Ranger intercepts and applies masking.
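To make the masking concrete, here's a sketch of what MASK_SHOW_LAST_4 does to a value. The replacement character is an assumption - Ranger's masking characters are configurable - but the shape of the transformation is the point:

```python
def mask_show_last_4(value):
    """Emulate MASK_SHOW_LAST_4: hide all but the last four characters.
    ('X' is an assumed replacement char; Ranger's are configurable.)"""
    s = str(value)
    if len(s) <= 4:
        return s
    return "X" * (len(s) - 4) + s[-4:]

print(mask_show_last_4("4111111111111111"))  # XXXXXXXXXXXX1111
```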


Self-Service Data Discovery

Here's the payoff: your data science team can find datasets in seconds using semantic search.

Semantic Discovery in DataHub

DataHub's search indexes dataset names, descriptions, column names, and tags. You can query:

"customer-level features with revenue"
→ Returns: customer_features_v2, customer_ltv, customer_annual_spend

Building the discovery experience:

python
import requests

DATAHUB_GRAPHQL = "http://datahub:8080/api/graphql"

def semantic_search(query, resource_type="DATASET", limit=10):
    """
    Search for assets by semantic meaning via DataHub's GraphQL endpoint.
    (Query shape abbreviated; check your DataHub version's GraphQL schema.)
    """
    gql = """
    query search($input: SearchInput!) {
      search(input: $input) {
        total
        searchResults {
          entity {
            urn
            ... on Dataset { name description }
          }
        }
      }
    }
    """
    resp = requests.post(
        DATAHUB_GRAPHQL,
        json={
            "query": gql,
            "variables": {
                "input": {
                    "type": resource_type,
                    "query": query,
                    "start": 0,
                    "count": limit,
                }
            },
        },
    )
    resp.raise_for_status()
    return resp.json()

# Usage
results = semantic_search("customer churn features")
for result in results["data"]["search"]["searchResults"]:
    dataset = result["entity"]
    print(f"{dataset['name']}: {dataset['description']}")

Self-Service Data Requests

Sometimes data scientists can't find what they need. A catalog should support data request workflows:

python
import uuid
from datetime import datetime

def create_data_request(requester, description, required_columns, urgency="normal"):
    """
    Requester describes what data they need.
    Catalog routes to data owners for approval/provisioning.
    """
    request = {
        "id": str(uuid.uuid4()),
        "requester": requester,
        "description": description,
        "requested_columns": required_columns,
        "urgency": urgency,
        "status": "PENDING_APPROVAL",
        "created_at": datetime.now().isoformat(),
        "routing_rules": [
            {
                "if_dataset_owner_is": "data-eng",
                "route_to": "data-eng-oncall"
            },
            {
                "if_tagged": "CONFIDENTIAL",
                "route_to": "compliance-officer"
            }
        ]
    }

    # Save request, notify owners (persistence and notification
    # are left to your stack)
    save_data_request(request)
    notify_owners(request)

    return request["id"]

Common Implementation Pitfalls and How to Avoid Them

Building a data catalog looks simple until you're operating one. Here's what actually breaks:

Metadata Staleness and Sync Failures

Your catalog starts pristine but drifts. A table gets dropped in Snowflake, but the catalog still shows it. A new column appears in a dbt model, but the catalog hasn't seen the updated manifest. Teams trust stale metadata and build downstream dependencies on phantom data.

Root cause: Metadata ingestion is fire-and-forget. If a dbt run fails, no lineage gets emitted. If the DataHub ingest API is slow, you skip batches.

Solution: Implement idempotent, event-driven ingestion with retry logic:

python
import json
import time
from datetime import datetime

from datahub.emitter.rest_emitter import DatahubRestEmitter

class RobustMetadataEmitter:
    def __init__(self, datahub_host, max_retries=3):
        self.emitter = DatahubRestEmitter(datahub_host)
        self.max_retries = max_retries
        self.failure_log = []

    def emit_with_retry(self, entity, run_id):
        """Emit metadata with retry and failure tracking."""
        for attempt in range(self.max_retries):
            try:
                self.emitter.emit_mce_async(entity)
                return True
            except Exception as e:
                if attempt < self.max_retries - 1:
                    time.sleep(2 ** attempt)  # Exponential backoff
                else:
                    # Final failure: log for manual retry
                    self.failure_log.append({
                        'run_id': run_id,
                        'entity_urn': getattr(entity, 'urn', 'unknown'),
                        'error': str(e),
                        'timestamp': datetime.now().isoformat()
                    })
                    return False

    def export_failure_log(self, filepath):
        """Export failures for a retry job."""
        with open(filepath, 'w') as f:
            json.dump(self.failure_log, f)
        print(f"Exported {len(self.failure_log)} failures to {filepath}")

Governance Without Adoption

You implement airtight governance policies, but data scientists ignore them. They spin up shadow databases. They hoard data locally. Governance rules become theater.

Root cause: Policies are perceived as friction without clear payoff. "I have to fill out a data request form? I'll just ask Jenkins directly."

Solution: Make governance frictionless. Auto-approval for low-risk access. Fast-track for common requests:

python
class SmartAccessControl:
    def __init__(self):
        self.risk_scores = {
            'PII': 100,
            'CONFIDENTIAL': 80,
            'PRODUCTION_CRITICAL': 70,
            'normal': 10,
        }
        self.fast_track_users = set()  # usernames of data owners, analysts

    def evaluate_request(self, requester, dataset_urn, tags):
        """Decide: auto-approve, fast-track, or manual review.
        `requester` carries .username and .role."""
        risk = max((self.risk_scores.get(tag, 10) for tag in tags), default=10)

        if requester.username in self.fast_track_users:
            return 'auto_approve'  # Fast-track data owners

        if risk <= 20:
            return 'auto_approve'  # Low-risk data

        if risk <= 50 and requester.role == 'analyst':
            return 'fast_track'  # Send to analyst approval queue

        return 'manual_review'  # Compliance officer sees it

Feature Store Integration Gaps

Your data catalog tracks tables and datasets, but doesn't link to your feature store. Models depend on features, but you can't query "what models use this feature?" The lineage chain breaks.

Solution: Extend catalog metadata with feature-store-specific entities:

python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

class FeatureLineageEmitter:
    """Sketch: FeatureStoreClient stands in for your feature store's SDK,
    and the aspect choices are schematic (DataHub's native model uses
    mlFeature entities; adapt the URNs and aspects to your catalog)."""

    def __init__(self, catalog_host, feature_store_host):
        self.catalog_emitter = DatahubRestEmitter(catalog_host)
        self.fs_client = FeatureStoreClient(feature_store_host)

    def emit_feature_with_lineage(self, feature_name, feature_namespace, source_dataset_urn):
        """Link a feature to its source dataset."""
        feature_urn = f"urn:li:mlFeature:({feature_namespace},{feature_name})"

        lineage = UpstreamLineageClass(
            upstreams=[
                UpstreamClass(
                    dataset=source_dataset_urn,
                    type=DatasetLineageTypeClass.TRANSFORMED,
                )
            ]
        )
        self.catalog_emitter.emit(
            MetadataChangeProposalWrapper(entityUrn=feature_urn, aspect=lineage)
        )

    def backlink_models_to_features(self, model_urn, feature_names):
        """Emit reverse lineage: model depends on features."""
        for feature_name in feature_names:
            feature_urn = f"urn:li:mlFeature:(ml_features,{feature_name})"
            # Register the model -> feature dependency as an upstream edge
            self.catalog_emitter.emit(
                MetadataChangeProposalWrapper(
                    entityUrn=model_urn,
                    aspect=UpstreamLineageClass(
                        upstreams=[
                            UpstreamClass(
                                dataset=feature_urn,
                                type=DatasetLineageTypeClass.TRANSFORMED,
                            )
                        ]
                    ),
                )
            )

Integration Patterns: Connecting Your Ecosystem

A catalog doesn't exist in isolation. It connects to everything: your data warehouses, your ML platforms, your monitoring systems. Getting these integrations right determines whether your catalog becomes essential infrastructure or abandoned overhead.

The most critical integration is with your data warehouse. Whether you use Snowflake, Redshift, BigQuery, or Postgres, the catalog needs to understand your tables, columns, and schemas automatically. Automated schema crawlers should run nightly, discovering new tables and columns. Manual schema registration should be optional - your system should auto-discover whenever possible. This reduces operational burden and prevents metadata drift where documentation lags behind reality.
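A crawler of this kind reduces to diffing information_schema against what the catalog already knows. A warehouse-agnostic sketch over a DB-API connection - the registration step that would follow each discovery is omitted:

```python
def diff_tables(live, known):
    """Tables present in the warehouse but missing from the catalog."""
    return sorted(set(live) - set(known))

def crawl(conn, known):
    """Nightly job: list warehouse tables, report what the catalog lacks.
    `conn` is any DB-API connection to a warehouse with information_schema."""
    cur = conn.cursor()
    cur.execute(
        "SELECT table_schema, table_name FROM information_schema.tables "
        "WHERE table_type = 'BASE TABLE'"
    )
    live = [tuple(row) for row in cur.fetchall()]
    return diff_tables(live, known)  # next step: register each new table
```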

The second critical integration is with your pipeline orchestrator. Your dbt runs, Airflow DAGs, or Prefect workflows create and transform data. The catalog should be fed by these systems automatically. Emit metadata when dbt models complete. Log lineage when Airflow tasks finish. Send metrics when data quality tests run. These integrations should be built by your ML platform team and required for every pipeline. No exceptions.
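As a sketch of the orchestrator hook, here's the shape of an Airflow success callback that turns the task context into a lineage event. The event schema is hypothetical and `print` stands in for your emitter; wire `on_success` via `on_success_callback` on the DAG or operator:

```python
def lineage_from_context(context):
    """Map an Airflow task-success context to a minimal catalog event."""
    ti = context["task_instance"]  # Airflow passes this into callbacks
    return {
        "pipeline": ti.dag_id,
        "task": ti.task_id,
        "run": ti.run_id,
        "status": "success",
    }

def on_success(context):
    # Attach via e.g. PythonOperator(..., on_success_callback=on_success);
    # replace print with your catalog emitter (e.g. DataHub's REST emitter)
    print(lineage_from_context(context))
```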

Operationalizing the Catalog

A catalog is only valuable if it stays current. Plan for:

  1. Metadata freshness: Ingest from dbt/Spark every run (automatically via webhooks or cron). Monitor ingestion latency. Set SLAs (e.g., "dbt metadata appears in catalog within 5 minutes of job completion").

  2. Quality scoring: Calculate freshness, completeness, and anomaly scores per dataset. Flag datasets that haven't been accessed in 90 days. Warn on null rates > 10%.

  3. On-call rotations: Designate data owners who respond to requests and fix issues. Establish escalation paths (data owner → data lead → data platform team).

  4. Observability: Monitor metadata ingestion latency and API performance. Alert if ingestion falls behind. Track catalog uptime. Log all governance decisions (approved, rejected, escalated) for audit trails.
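The freshness SLA in point 1 is straightforward to check once you record both timestamps. A minimal sketch (the run log and timestamps are illustrative):

```python
from datetime import datetime, timedelta

SLA = timedelta(minutes=5)  # e.g. "catalog reflects a dbt run within 5 minutes"

def sla_breaches(runs):
    """runs: (model, job_finished_at, catalog_updated_at) tuples."""
    breaches = []
    for model, finished, cataloged in runs:
        lag = cataloged - finished
        if lag > SLA:
            breaches.append((model, lag))
    return breaches

# Illustrative run log: ltv_features took 11 minutes to show up
runs = [
    ("churn_features", datetime(2025, 10, 23, 8, 0), datetime(2025, 10, 23, 8, 2)),
    ("ltv_features",   datetime(2025, 10, 23, 8, 0), datetime(2025, 10, 23, 8, 11)),
]
print(sla_breaches(runs))
```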

Measuring Adoption and Impact

After three months, ask: Is the catalog actually being used?

python
def measure_catalog_health():
    """Quantify adoption and impact. (The count_* helpers are stand-ins
    that wrap your catalog's usage and metadata APIs.)"""
    from datetime import datetime, timedelta

    one_week_ago = datetime.now() - timedelta(days=7)

    # Adoption metrics
    active_users = count_users_who_searched_past(one_week_ago)
    total_searches = count_searches_past(one_week_ago)
    avg_searches_per_user = total_searches / active_users if active_users > 0 else 0

    # Efficiency metrics
    datasets_reused = count_datasets_with_multiple_downstream_users()
    custom_data_builds = count_local_dataset_copies()  # Shadow data

    # Compliance metrics
    total_datasets = count_total_datasets()
    datasets_with_owners = count_datasets_assigned_owner()
    datasets_with_dq_tests = count_datasets_with_quality_tests()

    return {
        'adoption': {
            'active_users': active_users,
            'searches_per_user': avg_searches_per_user,
        },
        'efficiency': {
            'datasets_reused': datasets_reused,
            'shadow_data_count': custom_data_builds,  # Lower is better
        },
        'governance': {
            'ownership_coverage': datasets_with_owners / total_datasets,
            'quality_test_coverage': datasets_with_dq_tests / total_datasets,
        }
    }

Cost and Scaling Considerations

Before you declare victory with your new catalog, understand the operational costs.

Storage and Compute Requirements

DataHub and OpenMetadata store metadata in databases and search indexes. A 1000-asset catalog with full lineage history can exceed 10GB in Elasticsearch. If you're tracking 500 datasets with 100+ columns each and daily ingestion, storage grows linearly with time.

Rough sizing:

  • 100 datasets, 50 columns average, 1 year history: ~2GB
  • 1000 datasets, 200 columns average, 5 years history: ~100GB+
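Those figures come from simple multiplication. A back-of-envelope estimator, assuming roughly 1 KB of index metadata per column per day (tune both knobs to your own measurements):

```python
def estimate_storage_gb(datasets, avg_columns, years,
                        bytes_per_column_version=1024, versions_per_day=1):
    """Back-of-envelope index size: one metadata record per column per day,
    at an assumed ~1 KB per record."""
    records = datasets * avg_columns * versions_per_day * 365 * years
    return records * bytes_per_column_version / 1e9

# 100 datasets, 50 columns, 1 year -> roughly the ~2GB figure above
print(round(estimate_storage_gb(100, 50, 1), 1))
```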

Monitor storage growth and implement retention policies:

python
from datetime import datetime, timedelta

def cleanup_old_metadata(catalog_client, retention_days=365):
    """Remove lineage records older than the retention period.
    (The catalog_client methods are stand-ins for your catalog's SDK.)"""
    cutoff = datetime.now() - timedelta(days=retention_days)

    datasets = catalog_client.search(resource_type="DATASET")
    for ds in datasets:
        lineage_records = catalog_client.get_lineage(ds['urn'])
        old_records = [
            lr for lr in lineage_records
            if datetime.fromisoformat(lr['timestamp']) < cutoff
        ]

        if old_records:
            catalog_client.delete_lineage_batch(old_records)
            print(f"Cleaned {len(old_records)} old records from {ds['name']}")

Integration Complexity

You'll integrate the catalog with 5+ systems: dbt, Airflow/Prefect, Spark, MLflow, your data warehouse. Each integration is a small ETL pipeline. Maintenance burden scales with system count.

Integration checklist:

  • dbt: Parse manifest, emit lineage daily
  • Airflow: Hook into task completion, emit task execution metadata
  • Spark: Hook into Spark listener for operation-level lineage
  • MLflow: Export training runs and model metadata
  • Data Warehouse: Query schema, table statistics, access logs

Consider leaning on the platforms' built-in ingestion frameworks - OpenMetadata's connectors or DataHub's metadata ingestion sources - to reduce custom code.
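For the dbt item on the checklist, most of the work is parsing `manifest.json`. A minimal sketch, assuming the standard manifest layout where model nodes carry `depends_on.nodes`; emitting the resulting edges is left to whichever catalog SDK you use:

```python
import json

def extract_table_lineage(manifest_path):
    """Parse a dbt manifest into (upstream, downstream) node-id pairs."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    edges = []
    for node_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue  # only models produce lineage edges here
        for upstream in node.get("depends_on", {}).get("nodes", []):
            edges.append((upstream, node_id))
    return edges
```

Run this after every `dbt run` (an on-run-end hook is a natural place) and push the edges to the catalog, and table-level lineage stays current without anyone thinking about it.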

Summary

Modern ML demands complete visibility into your data: where it comes from, how it flows through transformations, which models depend on it, and who can access it. A data catalog - whether OpenMetadata or DataHub - provides that visibility.

The path forward:

  1. Start with table-level lineage: Ingest from your dbt/Spark pipelines. Wire up DataHub or OpenMetadata in a weekend.
  2. Add column-level lineage: Extract from dbt manifests or Spark lineage APIs. Connect features to raw columns.
  3. Link to MLflow: Tag training datasets, emit runs as metadata.
  4. Implement governance: Auto-tag sensitive data, enforce access policies via Ranger.
  5. Enable discovery: Let data scientists search semantically for datasets they need.
  6. Operationalize with monitoring: Track metadata freshness, measure adoption, fix metadata drift.

Each step increases visibility and accelerates your team. The catalog becomes the nervous system of your ML infrastructure - the thing that keeps all the pieces informed and in sync.

Start small, measure impact, and grow. You'll find your teams move faster, trust the data more, and spend less time answering "where did this come from?" The effort compounds - after three months, your data infrastructure will be dramatically more transparent, compliant, and fast.


Debugging Catalog Failures: Technical Troubleshooting

When your catalog stops working as expected, the problems are rarely obvious. Data disappears from the search index. Lineage queries time out. Metadata ingestion silently fails for hours before anyone notices. Understanding how to debug these failures is critical for operating a catalog at scale.

The first common issue is search index lag. You ingest a dataset through your dbt pipeline. It appears in DataHub within five minutes. That's normal. But sometimes it appears forty minutes later. Or sometimes it doesn't appear at all. The root cause is usually one of three things: ingestion failed silently, the search index is behind on processing, or there's a schema mismatch preventing indexing. To debug this, check your ingestion logs first. Most modern platforms have a way to see which ingestion jobs succeeded and which failed. If ingestion succeeded but the asset isn't searchable, your search index backend might be lagging. Some platforms can clear and rebuild the search index, which fixes drift. If the asset still doesn't appear after rebuilding, the schema might be wrong - the metadata doesn't conform to what the indexer expects.

The second common issue is lineage propagation delays. You updated a source table schema in your data warehouse. The catalog should show which downstream datasets are affected. But the lineage query returns nothing. This happens because lineage indexing happens asynchronously and might be hours behind your actual pipeline changes. Your data warehouse updated at 2am, but the catalog's lineage crawler doesn't run until 3am. If your catalog is only queried at business hours, the delay is invisible. But if you're trying to make fast decisions based on catalog information, lag is a problem. Solution: enable real-time or near-real-time lineage updates via webhooks instead of batch crawlers. When a dbt run completes or Spark job finishes, emit the lineage immediately rather than waiting for the catalog to poll.
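A push-style hook is straightforward to sketch. The endpoint URL and payload fields below are hypothetical - DataHub and OpenMetadata each ship their own REST emitters, which you'd use in practice:

```python
import json
import urllib.request

def build_lineage_event(upstream, downstream, run_id=None):
    """Shape a push-style lineage event; field names are illustrative."""
    event = {"upstream": upstream, "downstream": downstream}
    if run_id is not None:
        event["run_id"] = run_id
    return json.dumps(event).encode()

def push_lineage(event, endpoint="http://catalog.internal/api/lineage"):
    """Fire the event from an on-success callback.

    Short timeout so the pipeline never blocks on a slow catalog --
    a missed lineage event is cheaper than a stuck DAG.
    """
    req = urllib.request.Request(
        endpoint, data=event,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

Wire `push_lineage` into an Airflow `on_success_callback` or a dbt on-run-end hook and the catalog reflects reality within seconds instead of waiting for the next crawl.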

The third common issue is null reference errors in lineage queries. Your lineage graph contains references to datasets that no longer exist. Maybe a data source was renamed. Maybe it was deleted. Maybe there's a typo in the upstream reference. Your catalog dutifully records "dataset A depends on nonexistent_table" and then subsequent queries for dataset A fail because the lineage validator chokes on the invalid reference. Solution: implement a lineage validation job that runs nightly, identifies dead references, and either fixes them or quarantines affected datasets. Mark them as "lineage incomplete" so humans know to investigate.
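The validation pass itself is a set-membership check over the lineage edges. A sketch, assuming edges are (upstream, downstream) pairs and you can list the catalog's known dataset identifiers:

```python
def find_dead_references(lineage_edges, known_datasets):
    """Return lineage edges pointing at datasets the catalog no longer knows.

    Run nightly; quarantine or fix whatever comes back, and mark the
    affected downstream datasets as 'lineage incomplete'.
    """
    known = set(known_datasets)
    return [
        (upstream, downstream)
        for upstream, downstream in lineage_edges
        if upstream not in known or downstream not in known
    ]
```

The output is the worklist: each dead edge either gets repointed at the renamed source or removed, so queries stop choking on references to tables that no longer exist.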

Another issue is governance rule complexity spiraling. You start with one simple rule: deny PII access. Then you add: deny access to PII for non-analysts. Then you add exceptions for research. Then you add time-based access. Pretty soon your governance rules are a Kafkaesque maze that nobody understands and everyone circumvents. Solution: keep governance rules simple. Start with binary decisions: you can access this or you cannot. Implement higher complexity through organizational structures and approval workflows, not rule complexity.

The Adoption Journey: Common Mistakes and How to Avoid Them

Getting a data catalog operational is one thing. Getting teams to actually use it is another. Most catalogs start strong - everything documented, pristine, exciting. Six months later, they're stale. Teams ignore them. The catalog becomes a compliance checkbox instead of a tool that speeds up work. Understanding why this happens is crucial for success.

The first mistake is deploying a catalog without strong enforcement. If using the catalog is optional, teams won't use it. They'll keep their shadow databases, their internal wikis, their Slack conversations. The catalog becomes less reliable than the shortcuts people have built. To prevent this, make the catalog the source of truth for data discovery. Route all data access requests through it. Make pulling data without catalog approval costly (require special permission, security review, etc.). Friction in one direction drives adoption in the other.

The second mistake is poor metadata quality. A catalog filled with incomplete descriptions, outdated lineage, and missing owners is worse than no catalog. Teams learn not to trust it. They maintain parallel documentation. The catalog becomes cargo cult infrastructure - it exists, but no one uses it. To prevent this, establish metadata SLAs. Dataset owners must keep descriptions current. Lineage must be updated within 24 hours of pipeline changes. If a dataset goes thirty days without update, flag it as stale. Enforce metadata quality the same way you enforce code quality.
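The thirty-day staleness check is easy to automate. A sketch, assuming each dataset record exposes an ISO-format `last_updated` field:

```python
from datetime import datetime, timedelta

def flag_stale_datasets(datasets, max_age_days=30, now=None):
    """Return names of datasets whose metadata hasn't been touched recently.

    `now` is injectable for testing; defaults to wall-clock time.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    return [
        ds["name"] for ds in datasets
        if datetime.fromisoformat(ds["last_updated"]) < cutoff
    ]
```

Schedule it daily and route the output to the dataset owners' channel - the SLA only works if violations land in front of someone accountable.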

The third mistake is deploying without integration into development workflows. If data scientists have to go to a separate UI to find data, adoption is low. If data discovery is built into their Jupyter environment, their IDE, their Python client libraries, adoption is high. Invest in SDK and automation to make data discovery frictionless. The easier you make it, the more people will use it.

The fourth mistake is treating governance as punishment. If your governance rules feel like barriers (requiring seventeen approvals to access a simple dataset), teams will find workarounds. Make governance transparent and proportional. Low-risk data access should be fast. High-risk data access should require appropriate review. Auto-approve safe requests. Fast-track common patterns. Only involve humans for genuinely risky decisions. Governance that feels fair drives compliance.
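Proportional governance can be expressed as a tiny routing function: safe requests never touch a human. The tag names and roles below are illustrative, not a prescribed policy:

```python
# Assumed sensitive classifications -- adjust to your own taxonomy
SENSITIVE_TAGS = {"pii", "financial", "production"}

def route_access_request(dataset_tags, requester_role):
    """Tiered routing: fast path for safe data, review only for real risk."""
    if not SENSITIVE_TAGS & set(dataset_tags):
        return "auto_approve"      # low-risk data: no human in the loop
    if requester_role in {"analyst", "data_engineer"}:
        return "human_review"      # risky data, plausible need: one review
    return "deny"                  # risky data, no plausible need
```

Three outcomes is usually enough to start; the seventeen-approval maze is what you get when this function grows a branch per exception instead of pushing exceptions into an approval workflow.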

Building Governance That People Actually Follow

The hardest part of operating a data catalog isn't the technical system - it's getting teams to follow governance rules. Governance that's too strict dies because people circumvent it. Governance that's too loose fails because it doesn't prevent problems. The sweet spot is governance that feels helpful rather than restrictive.

Start with simple rules. Don't try to govern everything at once. Pick the riskiest data type - maybe PII, maybe financial data, maybe production data - and govern that aggressively. Everything else gets a light touch. This focuses effort on what matters most while building confidence in the system. As adoption improves and teams see the benefit, expand governance gradually.

Second, automate everything you can. Manual approval processes die. They create backlogs. People get frustrated. Automation is the path to scalable governance. Use ML classifiers to detect PII automatically. Use policy engines to evaluate requests against rules. Only escalate to humans when the decision is genuinely ambiguous.
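A rule-based detector makes a reasonable first pass before you invest in a trained classifier. The patterns and column-name hints below are illustrative, not exhaustive:

```python
import re

# Simple pattern rules as a stand-in for an ML classifier --
# production systems typically layer trained models on top of these.
PII_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[a-z]{2,}", re.I),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def detect_pii_tags(column_name, sample_values):
    """Return PII tags for a column based on its name and sampled values."""
    tags = set()
    if any(hint in column_name.lower() for hint in ("email", "ssn", "phone")):
        tags.add("pii")
    for label, pattern in PII_PATTERNS.items():
        if any(pattern.search(str(v)) for v in sample_values):
            tags.update({"pii", label})
    return tags
```

Run it against a sample of each newly ingested column and write the resulting tags back to the catalog; the policy engine then enforces rules against tags, never against individual columns.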

Third, measure and communicate impact. Show teams how governance is protecting them. A data quality issue was caught before it broke models? Message the team. A dataset's lifecycle was tracked, preventing accidental deletion? Communicate that. Build stories around how governance saves the day. This builds buy-in.

Fourth, involve teams in governance design. Don't impose rules from on high. Work with data owners to understand their constraints and needs. Design policies together. Teams are far more likely to follow rules they helped create. This is a bit slower upfront but pays huge dividends in adoption and sustainability.

The Catalog as Your ML Intelligence System

A mature catalog does more than track lineage and govern access. It becomes intelligence infrastructure for your entire ML organization. It answers questions that only a few people used to know.

Can you find the best historical dataset for a new problem? The catalog's search interface lets you query by semantic meaning. "Show me all datasets related to customer churn" returns everything related, from raw customer data to feature tables to labeled examples.

Can you identify all the models that depend on a particular data source? A forward lineage query shows you instantly. This is invaluable for impact analysis. Change a source table's schema? You now know exactly which downstream assets break.
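Forward impact analysis is a breadth-first walk over the lineage graph. A sketch, assuming lineage is available as (upstream, downstream) pairs:

```python
from collections import defaultdict, deque

def downstream_impact(edges, changed_asset):
    """Return every asset reachable downstream of changed_asset via BFS."""
    children = defaultdict(list)
    for upstream, downstream in edges:
        children[upstream].append(downstream)

    impacted, queue = set(), deque([changed_asset])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted
```

Catalog APIs typically expose this as a lineage query, but the principle is the same: the answer to "what breaks if I change this table?" is one graph traversal, not a week of asking around.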

Can you understand feature usage across your organization? If your catalog integrates with your feature store, you can answer "which features are used by which models" instantly. This prevents duplicate feature engineering. It accelerates model development. It's organizational knowledge made searchable.

Can you track model performance degradation back to its root cause? If the catalog connects training datasets to models to production monitoring systems, you can trace backward from performance drop to identify whether the issue is stale data, data quality degradation, or something else entirely.

This is where the real value of a catalog emerges. It's not just a nice-to-have tool for data discovery. It's the nervous system of your ML infrastructure. It connects all the pieces: data sources, transformations, features, models, predictions, feedback. Nothing moves through your pipeline without the catalog seeing it.

Making the Investment Pay Off

Implementing a catalog requires investment. You need the tool (managed service or self-hosted). You need integration work to wire it into your systems. You need training for your teams. You need ongoing maintenance. For a team of fifty people, this is a 10-20 percent time investment across the data platform team for six months.

That's real cost. Does it pay off? The evidence says yes, overwhelmingly. A Gartner study found that organizations with mature data governance (which a catalog enables) have 20 percent fewer data breaches and are twice as fast at fixing data quality issues. An Accenture study found that organizations with good data governance see 10-15 percent improvement in model quality across their portfolio.

More concretely, here's what happens in practice. Your organization with a catalog can onboard new data scientists 2-3x faster. They don't need to rebuild tribal knowledge. The catalog is their guide. Your organization can respond to data quality issues 5x faster. When something breaks, lineage shows you the root cause instantly. Your organization experiences fewer incidents caused by data problems because governance prevents them in the first place. Your team spends less time in firefighting mode and more time building new models.

After one year of operating a mature catalog, most organizations find that it paid for itself many times over. The time saved across the data team, the data science team, and the ML platform team compounds. The reduction in incidents due to better governance is substantial. The acceleration in model development from better data discoverability is real.

This is why mature data organizations invest heavily in catalog infrastructure. They're not trying to build the perfect catalog - they're trying to build the infrastructure that makes their organization faster and less brittle. A catalog is that infrastructure.
