Enrich Your Data with IBM Knowledge Catalog’s Large Language Model Powered Metadata Enrichment

Corey Keyser
Jun 19, 2024


Authored with multiple collaborators in IBM Research and Product Management

In today’s data-driven business environment, organizations face many obstacles to unlocking the full potential of their data. Chief among them is data interpretability: you cannot gain value from your data if you do not know its business context. In other words, what is the data about, where has it come from, and what can it be used for? IBM Knowledge Catalog’s new large language model (LLM)-based metadata enrichment helps answer these questions and more. It uses trusted IBM LLMs to automate the unified application of business context to data in order to accelerate trusted enterprise data consumption.

Applying business context to raw data

In the modern data ecosystem, tabular data serves as the backbone of countless business operations, whether it is sales figures, customer information, or financial records. Despite the apparent simplicity of structured data, data analysts and decision-makers alike face a common challenge: deciphering the meaning behind cryptic, non-descriptive data elements such as table and column names.

In many organizations, tabular data is sourced from various systems and databases, each with its own naming conventions. As a result, when aggregated, data elements are often represented with unclear names that provide little insight into their actual content or purpose. This ambiguity creates challenges for searchability and data governance, and it hampers productivity through misinterpretation and errors in analysis. Without business context accurately applied to tables through metadata enrichment, data navigation is time-consuming and error-prone.

Consider a scenario where an analyst encounters a column named “ACT_END_DT” in a dataset. Without additional context, it is nearly impossible to decipher the meaning or significance of the column, which prevents data consumers from finding valuable insights in the data. With metadata enrichment, however, the analyst can see that the column contains data about “Actual Communication End Date.”

Generative AI and metadata generation

Traditionally, contextualized metadata for structured data has been produced through manual data annotation and basic automation techniques like linguistic matching. Recent advancements in LLMs have enabled the automation of metadata generation through highly accurate generative AI techniques, which turn cryptic tabular data into semantically rich datasets. These models unlock the ability to contextualize data at scale.
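To make the contrast concrete, here is a minimal sketch of what a basic linguistic-matching baseline might look like. It is not IBM’s implementation: the `ABBREVIATIONS` table and the `expand_column_name` helper are hypothetical names invented for this illustration, and a real system would rely on a much larger, organization-specific vocabulary.

```python
# Hypothetical abbreviation table, invented for this illustration only.
ABBREVIATIONS = {
    "ACT": "Actual",
    "DT": "Date",
    "CUST": "Customer",
    "NM": "Name",
    "AMT": "Amount",
}


def expand_column_name(column_name: str) -> str:
    """Expand a cryptic column name token by token.

    Tokens found in ABBREVIATIONS are replaced with their expansion;
    any other token is simply title-cased.
    """
    tokens = column_name.split("_")
    words = [ABBREVIATIONS.get(t.upper(), t.title()) for t in tokens if t]
    return " ".join(words)


if __name__ == "__main__":
    print(expand_column_name("ACT_END_DT"))  # prints: Actual End Date
```

A lookup like this yields “Actual End Date” for the earlier example, but it has no way to supply the word “Communication,” which has to come from context beyond the column name itself. That gap is the kind of thing generative approaches aim to close.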

LLM-based metadata enrichment from IBM

With the advanced metadata interpretation capabilities included in IBM Knowledge Catalog’s new LLM-based metadata enrichment, data elements such as table names and column names can be automatically contextualized. By analyzing the peripheral information surrounding each data element, metadata enrichment can now infer semantic significance and relationships with other elements, empowering users to navigate complex datasets with confidence and clarity.

These advanced metadata enrichment capabilities allow users to automatically generate table metadata such as table descriptions, tags, expanded column names, and column descriptions to enable better understanding and search. The enrichment process then goes a step further by mapping the generated metadata to concepts in an organization-specific glossary or ontology. The resulting metadata is more meaningful and tailored to the user’s specific context, enhancing downstream tasks such as table search and discovery, and advanced business analytics.
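The glossary-mapping step can be pictured with a small sketch. This is an assumption-laden illustration, not a description of how IBM Knowledge Catalog performs the mapping: the `GLOSSARY` entries and the `map_to_glossary` function are hypothetical, and simple bag-of-words cosine similarity stands in for whatever learned representation a production system would use.

```python
import math
import re
from collections import Counter

# Hypothetical business glossary (term -> definition), invented for illustration.
GLOSSARY = {
    "Communication End Date": "The date on which a customer communication ended.",
    "Customer Name": "The full legal name of a customer.",
    "Transaction Amount": "The monetary value of a single transaction.",
}


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_to_glossary(generated_text: str, glossary: dict = GLOSSARY) -> tuple:
    """Return the (term, score) pair whose term and definition best match the text."""
    query = tokenize(generated_text)
    scored = [
        (term, cosine(query, tokenize(f"{term} {definition}")))
        for term, definition in glossary.items()
    ]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    term, score = map_to_glossary("Actual Communication End Date")
    print(term, round(score, 2))
```

Here the generated column name “Actual Communication End Date” is matched to the hypothetical glossary term “Communication End Date” because they share the most vocabulary. Word overlap is a deliberately crude stand-in: it would miss synonyms and paraphrases that a semantic similarity measure could catch.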

IBM’s LLM-based metadata enrichment uses fine-tuned versions of IBM Research’s trusted Granite and Slate foundation models. The models are fine-tuned on carefully selected table metadata from the IBM Chief Data Office (CDO) as well as various global government agencies. These datasets were then carefully annotated by experts with a business glossary developed for the Cognitive Enterprise Data Platform. Granite is IBM’s flagship series of LLM foundation models, based on a decoder-only transformer architecture for generative tasks, whereas Slate is IBM’s embedding model, used for retrieval-augmented generation, semantic search, and document comparison tasks.

To ensure trust and transparency, every dataset that is used in training IBM models undergoes quality filters, data protection safeguards, and careful risk review based on IBM’s AI Ethics principles. This comprehensive model governance process will continue to evolve and improve as the generative AI landscape advances so that you can always count on the safety and trust of your AI-powered tools.

Enrich your data today

With the ever-growing importance of data, the ability to interpret and derive insights from tabular datasets is proving to be paramount for organizations seeking to make data-driven decisions. IBM’s LLM-based metadata enrichment represents a significant leap forward in addressing the challenge of enterprise data interpretability, empowering users to navigate complex datasets with confidence and clarity. With LLM-based metadata enrichment, understanding tabular data has never been easier.

Today IBM launched the new Knowledge Catalog Standard and Premium Cartridges to help organizations scale data governance with a modern, LLM-powered data intelligence solution. Customers will be able to access these LLM-based capabilities through the new Cartridges or through the Cloud Pak for Data SaaS platform on IBM Cloud. These cartridges provide customers with a modular and consumption-based delivery vehicle for self-hosted infrastructure that leverages our updated Resource Unit pricing.

To discover how you can unlock the full potential of your data with LLM-powered data governance, contact an IBM representative today for a demo.

Photo by Sincerely Media on Unsplash
