You Can’t Govern AI without Governing Data
Why Data Governance is Key to LLM Safety
Summary
AI models take in data, learn patterns from it, and use those patterns to predict or generate an output. Complex Large Language Models (LLMs) like ChatGPT are no different, and although the mechanism is fairly straightforward, it is clear the technology will transform many industries. Most of us working in this space are trying to make sure that the growth of AI goes well, and that requires governing how these models are trained and deployed. So far, however, much of the focus has been on managing and deploying the models themselves. Model governance is crucial, but the foundation of AI governance is governance of the data the AI is trained on. We need to focus first on managing, understanding, and protecting the data that is fed into AI models.
Problem
Large language models like ChatGPT can be trained on proprietary data and fine-tuned to fulfill specific enterprise use cases. For example, a software company could take ChatGPT and create a private model trained on the company’s CRM sales data. Deployed as a Slack chatbot, it would let sales teams ask things like “How many opportunities has product X won in the last year?”, “Update me on product Z’s opportunity with company Y”, or “How much revenue are we on track for in 2Q?” A minimal sketch of this pattern follows.
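To make the pattern concrete, here is a minimal sketch of a private CRM chatbot. The `call_llm` function and the CRM field names are hypothetical stand-ins for whatever hosted or fine-tuned model and schema a company would actually use; this is not a real API.

```python
# A minimal sketch of the private CRM chatbot pattern described above.
# `call_llm` and the CRM field names are hypothetical stand-ins, not a
# real API or schema.

from typing import Callable


def answer_sales_question(
    question: str,
    crm_rows: list[dict],
    call_llm: Callable[[str], str],
) -> str:
    """Serialize the relevant CRM records into the prompt and ask the model."""
    context = "\n".join(
        f"- opportunity={r['opportunity']}, product={r['product']}, "
        f"stage={r['stage']}, amount={r['amount']}"
        for r in crm_rows
    )
    prompt = (
        "You are a sales assistant. Answer using only the CRM records below.\n"
        f"CRM records:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)


# Example usage with a dummy model in place of a real LLM call:
rows = [
    {"opportunity": "ACME-42", "product": "X", "stage": "won", "amount": 120_000},
    {"opportunity": "GLOBEX-7", "product": "Z", "stage": "open", "amount": 80_000},
]
fake_llm = lambda prompt: "Product X has won 1 opportunity in the period shown."
print(answer_sales_question("How many opportunities has product X won?", rows, fake_llm))
```

Note that everything the model sees here comes straight from the CRM, which is exactly why the governance of that input data matters.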
These LLMs can be extremely useful for summarizing the large amounts of data they are trained on. You could easily imagine this being tuned for any number of customer service, HR, or marketing use cases. In time, we might even see LLMs heavily augmenting legal and medical advice, becoming primary diagnostic tools employed by healthcare providers.
The issue is that many of these use cases require training the LLMs on sensitive proprietary data. This is inherently risky. Here are a few of the primary risks:
- Privacy and Re-identification Risk — The point of AI models is that they learn from the data they are trained on. But what if that data is private or sensitive? A considerable amount of data can be used, directly or indirectly, to identify specific individuals. So if I train an LLM on proprietary data about an enterprise’s customers, consumption of that model could leak sensitive information.
- In-Model Learning Risk — Many simple AI models have a training phase followed by a deployment phase during which training is paused. LLMs are a bit different: they can take the context of your conversation with them, learn from it, and respond accordingly. This makes governing model input data far more complex, since I don’t just have to worry about the initial training data; I have to worry about every single time the model is queried. What do I do if I feed the model sensitive information during a conversation? Can I identify that sensitivity and prevent the model from using it in other contexts? (See the screening sketch after this list.)
- Security and Access Risk — The sensitivity of the training data largely determines the sensitivity of the model. And although we have well-established mechanisms for controlling access to data — monitoring who is accessing what data and dynamically masking data based on the situation — AI deployment security is still developing. Although there are interesting products popping up in this space, we still cannot fully control the sensitivity of model output based on the role of the person using the model (e.g., the model identifying that a particular output could be sensitive and then reliably redacting it based on who is querying the LLM). Because of this, these models can easily become leaks for any sensitive information involved in their training.
- Intellectual Property Risk — What happens when I train a model on every Michael Jackson song and then the model starts spitting out MJ ripoffs? Is the model infringing on MJ? Can you prove the model is somehow copying your work? Regulators are still working this out, but it could easily become a major issue for any form of generative AI that learns from artistic intellectual property. I expect this will lead to major lawsuits that will have to be mitigated by sufficiently monitoring the IP of any data used in training.
- Consent and DSAR Risk — One of the key ideas behind modern data privacy regulation is consent: customers must consent to companies using their data, and they must be able to request that their data be deleted. This poses a unique problem for AI usage. If you train an AI model on sensitive customer data, that model becomes a possible exposure source for that data. If a customer were to revoke a company’s right to use their data (a requirement under GDPR) and the company had already trained a model on it, the model would essentially need to be decommissioned and retrained without the revoked data.
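As a concrete illustration of the in-model learning and access risks above, here is a naive sketch of screening a prompt for obviously sensitive patterns before it ever reaches the model. The regexes and the redaction policy are invented for illustration; real deployments rely on purpose-built classifiers and policy engines rather than a handful of patterns.

```python
# A naive sketch of pre-prompt screening: scan user input for obviously
# sensitive patterns before it reaches the model. The regexes and policy
# here are illustrative only; real systems use purpose-built classifiers.

import re

SENSITIVE_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def screen_prompt(prompt: str) -> tuple[str, list[str]]:
    """Redact matches and return the cleaned prompt plus what was found."""
    findings = []
    cleaned = prompt
    for label, pattern in SENSITIVE_PATTERNS.items():
        if pattern.search(cleaned):
            findings.append(label)
            cleaned = pattern.sub(f"[{label.upper()} REDACTED]", cleaned)
    return cleaned, findings


cleaned, findings = screen_prompt(
    "Update the record for jane@example.com, SSN 123-45-6789."
)
print(findings)  # ['ssn', 'email']
print(cleaned)   # redacted text that is safer to log or send onward
```

The point is not the regexes; it is that every query, not just the original training set, needs this kind of gate in front of the model.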
To sum it up: the utility of a model depends on the data it is trained on, but that training data can also be a significant source of risk and bias. Making LLMs useful as enterprise software requires governing the training data so that companies can trust the safety of the data and have an audit trail for the LLM’s consumption of it.
Data Governance for LLMs
The best breakdown of LLM architecture I have seen so far is this article by a16z (image below). It is really well done, but as someone who spends all my time on data governance and privacy, I find the top-left section of “contextual data → data pipelines” pretty underdeveloped.
Regardless, the overall architecture still works, so I want to use it as a north star to improve on. If I were to build out the “contextual data” architecture, it would look a bit like this:
There is a lot built into making data “contextual”. Most of the time this just means identifying the type of data you have based on enterprise-specific definitions (do I have Customer ID, SSN, zip code, first name, last name, etc.?). This is well established for structured data and is covered by Gartner and Forrester in the Data Fabric, Data Governance, Data Catalog, and Metadata Management markets. A toy example of this kind of classification follows.
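Here is a toy sketch of structured-data classification: mapping column names and sample values to enterprise-specific definitions. The rules are invented for illustration; real catalogs combine name matching, value profiling, and ML-based classifiers.

```python
# A toy sketch of structured-data classification: map column names and
# sample values to enterprise-specific definitions. Rules are illustrative.

import re

RULES = [
    ("SSN", lambda col, sample: bool(re.fullmatch(r"\d{3}-\d{2}-\d{4}", sample))),
    ("Zip Code", lambda col, sample: "zip" in col.lower()),
    ("Customer ID", lambda col, sample: col.lower() in {"customer_id", "cust_id"}),
    ("First Name", lambda col, sample: "first" in col.lower() and "name" in col.lower()),
]


def classify_column(name: str, sample_value: str) -> str:
    """Return the first enterprise definition that matches this column."""
    for label, rule in RULES:
        if rule(name, sample_value):
            return label
    return "Unclassified"


print(classify_column("cust_id", "C-1042"))    # Customer ID
print(classify_column("ssn", "123-45-6789"))   # SSN
print(classify_column("zip", "97201"))         # Zip Code
```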
But you might be surprised how underdeveloped discovery and classification still are for unstructured data. Since unstructured data such as text documents makes up the bulk of the training data for LLMs, the classification, governance, and cataloging of unstructured data is a very fruitful area for enterprise software investment.
It is difficult to really know “what is in” data with regard to its purpose, sensitivity, ownership, and content. Yet you must know all of this in order to properly govern data as it is used downstream in model training.
Once you have discovered and classified data — thus giving it its context — you can catalog it so that you have an auditable inventory. Think of this as a searchable library of your data. It is useful both for the data consumer trying to find data and for the risk team or regulator that needs to know what data an enterprise even has. From there you might further transform and prepare the data to ensure you are only using high-quality data for training. A sketch of what a catalog entry might capture follows.
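As a rough illustration, here is what a minimal catalog entry might capture so the inventory is searchable and auditable. The field names and values are invented for this sketch, not a standard schema.

```python
# A minimal sketch of a catalog entry for an auditable data inventory.
# Field names are illustrative, not a standard schema.

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    dataset: str                # e.g. "crm.opportunities"
    owner: str                  # accountable team or person
    classifications: list[str]  # labels found during discovery
    sensitivity: str            # e.g. "restricted", "internal", "public"
    approved_uses: list[str]    # purposes the data may be used for
    registered_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


catalog = [
    CatalogEntry(
        dataset="crm.opportunities",
        owner="sales-ops",
        classifications=["Customer ID", "Email", "Revenue"],
        sensitivity="restricted",
        approved_uses=["sales-analytics", "llm-fine-tuning"],
    )
]

# A risk team (or a training pipeline) can now filter on these fields:
trainable = [e.dataset for e in catalog if "llm-fine-tuning" in e.approved_uses]
print(trainable)
```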
The last step is one that I think is easy to overlook: the implementation of Privacy Enhancing Techniques. In short, how are you taking the sensitive stuff out before feeding the data to AI? You can break this into three steps (a sketch follows the list):
- Identify the sensitive components of the data that need to be taken out (this is established during data discovery and is tied to the “context” of the data)
- Take out the sensitive data in a way that still allows the data to be used (referential integrity is maintained, statistical distributions stay roughly equivalent, etc.)
- Keep a log of what was identified in step 1 and removed in step 2 so that this information follows the data as it is consumed by models. That tracking is essential for auditability.
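Here is a hedged sketch of those three steps: detect the sensitive fields, replace them deterministically so joins still work (referential integrity), and log what was done so the lineage travels with the data. HMAC-based tokens are one common approach, used here purely for illustration; key management, format preservation, and re-identification controls are out of scope.

```python
# A sketch of the three steps above: detect sensitive fields, replace them
# deterministically so joins still work, and log the transformations.
# HMAC tokens are one common approach; this is illustrative only.

import hashlib
import hmac
import json

SECRET_KEY = b"rotate-me"  # in practice, pulled from a key management service

SENSITIVE_FIELDS = {"customer_name", "ssn"}  # step 1: established during discovery


def pseudonymize(value: str, field_name: str) -> str:
    """Same input always maps to the same token, preserving referential integrity."""
    digest = hmac.new(SECRET_KEY, f"{field_name}:{value}".encode(), hashlib.sha256)
    return f"{field_name}_{digest.hexdigest()[:10]}"


def transform_record(record: dict) -> tuple[dict, list[dict]]:
    """Step 2: remove sensitive values. Step 3: log what was done."""
    out, log = dict(record), []
    for f in SENSITIVE_FIELDS & record.keys():
        out[f] = pseudonymize(record[f], f)
        log.append({"field": f, "action": "pseudonymized", "method": "hmac-sha256"})
    return out, log


safe, log = transform_record(
    {"customer_name": "Jane Doe", "ssn": "123-45-6789", "deal_size": 120000}
)
print(json.dumps(safe))  # the same Jane Doe always maps to the same token
print(json.dumps(log))   # audit trail that follows the data into training
```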
The original software market for all of this was “test data management” (TDM), which was used to make de-identified copies of data at scale for application testing. The TDM market has slowly gone out of vogue and has been rebranded around “synthetic data”, with more of a focus on data preparation for analytics.
The TDM incumbents are companies like Protegrity, Delphix, and Optim. The newer synthetic data competitors include Tonic, Gretel, and Hazy. What connects them is the simple fact that they read from a database, make some privacy-enhancing transformations (change names, take out SSNs, etc.), and then create a new copy of the data. You then point your AI or analytics team to this new, safe copy of the data.
Of course, no one likes to create data copies. It increases cost and makes the extra copies hard to manage (When do we delete the copy? Who gets to see it? Where is it stored? For how long?). So on the other end of the privacy spectrum are the “policy-based access control” vendors like Immuta, Okera, and Privacera (often covered under the “data access governance” and “data security platform” markets).
This space is slowly being consolidated through acquisitions, but the basic premise is that you identify the unsafe data and then change it in real time as it is consumed, depending on the purpose of the consumption (who is using what data, for what purpose, and why). So rather than creating a copy of the data, I just train the AI on data that is cleansed dynamically as it is accessed. Immuta and Databricks are teaming up for this, and the latter is generally making interesting moves in this space. A toy illustration of the dynamic approach follows.
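To show the contrast with the copy-based approach, here is a toy illustration of policy-based, dynamic masking: the data is never copied, it is masked at read time based on who is asking and why. The policy table and field names are invented for this sketch; real products express such rules as declarative policies, not Python.

```python
# A toy illustration of policy-based (dynamic) access control: mask data
# at read time based on role and purpose. The policy table is invented
# for illustration; real products express this declaratively.

MASKING_POLICY = {
    # (role, purpose) -> fields that must be masked on read
    ("analyst", "llm-training"): {"ssn", "customer_name"},
    ("privacy-officer", "audit"): set(),
}


def read_record(record: dict, role: str, purpose: str) -> dict:
    """Return the record with fields masked according to the caller's role and purpose."""
    masked_fields = MASKING_POLICY.get((role, purpose), set(record))  # default: mask everything
    return {k: ("***" if k in masked_fields else v) for k, v in record.items()}


row = {"customer_name": "Jane Doe", "ssn": "123-45-6789", "deal_size": 120000}
print(read_record(row, "analyst", "llm-training"))   # PII masked before training use
print(read_record(row, "privacy-officer", "audit"))  # full record for an audit purpose
```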
There is a caveat, though: just like governance tooling, all of these Privacy Enhancing Technologies are focused on structured data. There are not yet many good options for protecting unstructured data at scale for AI use cases. There are, of course, many interesting and well-developed products in the unstructured data space; they are just focused more on things like eDiscovery than on this type of data governance.
Conclusion
AI models, particularly LLMs, will be among the most transformative technologies of the next decade. It is currently very difficult to manage and govern AI, and we are still figuring out the regulations and risk management. Governance of the data that goes into AI is an underrated factor in the management of these models.
**Views expressed in my posts are my own and do not represent those of my employer**