Using LLMs to Automate Data Cataloguing and Lineage Tracking
Manual data cataloguing does not scale. We explore how large language models can automatically infer schema semantics, tag sensitive columns, and generate lineage graphs, cutting catalogue maintenance effort by 75-80%.
Data catalogues are one of the most consistently under-maintained assets in the enterprise. Every data leader knows they need one. Most have bought one. And most will quietly admit that the catalogue they spent six months and considerable budget building is now 40% accurate at best, because the team that was supposed to maintain it moved on to other priorities the moment the initial sprint was over.
The fundamental problem is that manual cataloguing does not scale with the pace of data platform change. A mid-sized organisation might add dozens of new tables and pipelines per week. Expecting data stewards to write business descriptions, tag sensitivity levels, and trace lineage for each one by hand is unrealistic. The catalogue decays the moment you stop actively feeding it.
Large language models offer a genuinely different approach. Not as a replacement for human judgment, but as the engine that handles the high-volume, low-ambiguity classification work that currently bottlenecks catalogue maintenance — freeing human stewards to focus on the ambiguous cases that actually require domain expertise.
Inferring Schema Semantics Automatically
The most immediately practical application is automated business description generation. Given a table name, column names, and a sample of values, an LLM can produce a reasonable natural-language description of what the table contains, what each column means, and how it relates to business concepts — without any human input.
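As a rough sketch, the prompt construction can be as simple as the following. The llm_complete callable is a placeholder for whichever LLM client you use, and the JSON output contract is an illustrative assumption rather than a fixed interface.

```python
import json

def build_description_prompt(table_name, columns, sample_rows):
    """Assemble a prompt asking the model to describe a table and its columns."""
    schema = "\n".join(f"- {c['name']} ({c['type']})" for c in columns)
    samples = json.dumps(sample_rows[:5], default=str, indent=2)
    return (
        "You are documenting tables in an enterprise data catalogue.\n"
        f"Table: {table_name}\n"
        f"Columns:\n{schema}\n"
        f"Sample rows:\n{samples}\n\n"
        "Write a one-paragraph business description of the table, then a one-line "
        "description of each column. Return JSON with keys 'table_description' "
        "and 'column_descriptions'."
    )

def draft_descriptions(table_name, columns, sample_rows, llm_complete):
    """Generate a first-draft description for steward review.

    llm_complete stands in for your model client (hosted API or local model)."""
    raw = llm_complete(build_description_prompt(table_name, columns, sample_rows))
    return json.loads(raw)
```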
The quality is not always perfect, but it does not need to be. The goal is to produce a first draft that a data steward can review and approve in 30 seconds, rather than write from scratch in five minutes. At scale, that difference is the gap between a catalogue that gets maintained and one that does not.
In practice, we run this as a triggered job: whenever a new table is detected in the catalogue (via the metadata event stream from Databricks Unity Catalog, Snowflake, or BigQuery), the LLM generates a description draft and flags it for human review. The steward approves, edits, or rejects. Approved descriptions are fed back into the prompt context for future runs, so drafts improve over time as the model picks up your organisation's domain vocabulary.
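Wired to the metadata event stream, the triggered job is mostly glue code. This sketch assumes hypothetical catalogue and review_queue clients and reuses the draft_descriptions helper from the previous snippet; it is not tied to any vendor API.

```python
def handle_table_created(event, catalogue, review_queue, llm_complete):
    """React to a 'table created' metadata event by drafting a description for review."""
    table = catalogue.get_table(event["table_fqn"])  # name, columns, sample rows
    draft = draft_descriptions(
        table["name"], table["columns"], table["sample_rows"], llm_complete
    )
    review_queue.put({
        "table_fqn": event["table_fqn"],
        "draft": draft,
        "status": "pending_review",  # steward approves, edits, or rejects
    })
```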
Sensitive Column Detection
Identifying PII, financial data, and other sensitive columns is another high-volume task that LLMs handle well. A rule-based approach using regex patterns catches obvious cases — columns named email, ssn, credit_card_number — but misses the long tail: columns named subscriber_contact, transaction_ref, or legacy names that bear no resemblance to their content.
An LLM with access to column names, sample values (anonymised or synthetic if necessary), and the table description can infer sensitivity with high accuracy even for ambiguous names. We have found that combining LLM inference with a lightweight classifier trained on your organisation's historical sensitivity labels produces better results than either approach alone: the LLM handles novel cases, while the classifier calibrates confidence.
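One way to combine the two is sketched below, assuming historical labels exist as (column name, sensitivity) pairs: a character n-gram classifier (scikit-learn here) supplies the calibrated probability, the LLM supplies the judgement on novel names, and disagreement or low confidence routes to human review. The llm_classify helper and the 0.8 threshold are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_sensitivity_classifier(column_names, labels):
    """Lightweight classifier trained on historical (column name -> sensitivity) pairs."""
    clf = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(column_names, labels)
    return clf

def classify_column(column_name, sample_values, clf, llm_classify):
    """Combine LLM inference with classifier confidence; disagreement goes to review."""
    llm_label = llm_classify(column_name, sample_values)  # e.g. "PII", "FINANCIAL", "NONE"
    probs = clf.predict_proba([column_name])[0]
    clf_label = clf.classes_[probs.argmax()]
    if llm_label == clf_label and probs.max() > 0.8:
        return {"label": llm_label, "decision": "auto_tag"}
    return {"label": llm_label, "classifier_label": clf_label, "decision": "human_review"}
```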
The output feeds directly into your ABAC policy engine. A column tagged PII automatically gets masking policies applied wherever it appears, without any manual policy configuration.
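To make that concrete, here is a minimal sketch of the downstream automation. The Snowflake-style masking policy syntax and the PII_READER role are illustrative assumptions; your own ABAC engine will have its own policy primitives.

```python
def masking_statements(table_fqn, column, policy_name="pii_mask"):
    """Emit Snowflake-style statements attaching a masking policy to a PII-tagged column."""
    create_policy = (
        f"CREATE MASKING POLICY IF NOT EXISTS {policy_name} AS (val STRING) "
        "RETURNS STRING -> CASE WHEN CURRENT_ROLE() IN ('PII_READER') "
        "THEN val ELSE '***MASKED***' END"
    )
    attach_policy = (
        f"ALTER TABLE {table_fqn} MODIFY COLUMN {column} "
        f"SET MASKING POLICY {policy_name}"
    )
    return [create_policy, attach_policy]
```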
Lineage from SQL and Pipeline Code
Lineage tracking — knowing which tables were used to produce which other tables — is critical for impact analysis, debugging, and regulatory compliance. Most modern orchestration tools (dbt, Airflow, Databricks Jobs) emit some lineage metadata natively. The problem is the long tail of ad-hoc queries, legacy stored procedures, and custom Python scripts that run outside your primary orchestration layer and leave no lineage trace.
LLMs can parse SQL and Python to extract lineage. Given a SQL query or a Python script that uses pandas or PySpark, the model identifies source tables, transformations, and output destinations. The extracted lineage is then merged into the catalogue's lineage graph, filling in the gaps that native tooling cannot reach.
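A minimal sketch of the extraction step, again assuming the placeholder llm_complete client and a JSON output contract; in practice you would validate the response and handle parse failures.

```python
import json

LINEAGE_PROMPT = """Extract data lineage from the following code.
Return JSON with keys "sources" (tables or files read), "targets" (tables or files written),
and "transformations" (a short description of each step).

Code:
{code}
"""

def extract_lineage(code, llm_complete):
    """Ask the model for sources and targets in a SQL query or pandas/PySpark script."""
    lineage = json.loads(llm_complete(LINEAGE_PROMPT.format(code=code)))
    # Merge into the catalogue's lineage graph as (source -> target) edges.
    return [(src, tgt) for src in lineage["sources"] for tgt in lineage["targets"]]
```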
The accuracy is high for well-structured SQL and reasonable for Python with recognisable patterns. For heavily dynamic or metaprogrammed code, you still need human annotation. But in most enterprises the Pareto principle works in your favour: roughly 80% of the unmapped lineage comes from a relatively small number of well-structured legacy queries, so automated extraction closes most of the gap and human annotation can be reserved for the genuinely opaque code.
Building the Feedback Loop
The most important architectural decision in an LLM-powered catalogue is the feedback loop. Every human correction — a description edit, a sensitivity tag override, a lineage correction — is a training signal. Capturing that signal and incorporating it into future inference runs is what separates a system that degrades over time from one that improves.
Practically, this means storing corrections in a structured format alongside the original LLM output, periodically refining prompts or fine-tuning the model based on correction patterns, and routing inferences the model gets systematically wrong to a human review queue rather than to automatic approval.
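In code terms, the correction store can be as simple as the sketch below: every steward edit becomes a record, and the most recent approved examples are folded back into the prompt as few-shot context. The field names are assumptions for illustration.

```python
import datetime
import json

def record_correction(store, table_fqn, field, llm_output, human_output, reviewer):
    """Persist each steward edit alongside the original LLM output as a training signal."""
    store.append({
        "table_fqn": table_fqn,
        "field": field,  # e.g. "description", "sensitivity", "lineage"
        "llm_output": llm_output,
        "human_output": human_output,
        "reviewer": reviewer,
        "timestamp": datetime.datetime.utcnow().isoformat(),
    })

def few_shot_context(store, field, limit=5):
    """Turn recent approved corrections into few-shot examples for the next inference run."""
    recent = [c for c in store if c["field"] == field][-limit:]
    return "\n\n".join(
        f"Model draft: {json.dumps(c['llm_output'])}\n"
        f"Approved version: {json.dumps(c['human_output'])}"
        for c in recent
    )
```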
The organisations that get the most from this approach are the ones that treat the LLM as a junior colleague who learns the business over time — not as an oracle to be trusted unconditionally. With the right feedback loop and governance structure, the catalogue maintenance burden drops dramatically. We have seen teams cut the time spent on catalogue upkeep by 75-80% within six months of deployment, with accuracy actually improving rather than degrading.