Data Anarchies and Vetocracies — The Challenges of Data Sharing in Large Organizations

Corey Keyser
5 min read · Oct 21, 2024


My first time working with data was in an undergraduate neuroscience lab. As in most labs, the grad students were overworked, making it difficult for undergrads to contribute meaningfully. As a result, most of my time was spent on loosely defined tasks related to research proposals. My first project was to create a computational model of how humans make continuous decisions. To get started, all I needed was a CSV file that I was told could be found “somewhere in the lab’s Dropbox.” I was left to browse through folders, hoping to eventually find it. However, I didn’t know what I was looking for, didn’t know the file name, and had no idea how the folders were organized. In the end, I only found the data after asking someone on the team.

For those with data science experience, this scenario might sound familiar. Organizations aim to be data-driven, so they create data science teams. However, these teams are often ineffective because they can’t find the right data — they don’t know where to look, what to search for, or whether the data they find can be trusted. In short, many organizations have yet to develop the data-sharing infrastructure necessary to unlock the full value of their data science efforts.

In my experience working as a product manager in the data management space, most enterprises live at one of two extremes:

  1. Data Anarchy — organizations that possess almost no formal structures for governing and organizing their data. Data access and governance are managed informally through the team’s folk knowledge: “oh yeah, Craig is the data engineer for marketing data, he can tell you where to find that table.”
  2. Data Vetocracy — organizations, often in highly regulated industries, that almost completely prevent data usage due to risk aversion, centralization, and data silos. Data governance is often centrally managed, but data is almost never shared.

It is easier to create structure than it is to take structure away. Because of this, Data Anarchies can be easier to fix, especially when you’re dealing with organizations that aren’t highly regulated, hold generally low-risk data, and have relatively small teams of people who need to use it.

In these scenarios, the fixes can be very simple. Organize the data well and work to consolidate data sources. Document acceptable-use policies for data teams. Document the general data domains and structures — marketing keeps its data in X, and it has data about A, B, C, etc. Put the data into a catalog to make it searchable, and you’re basically 90% of the way there.
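
That documentation step doesn’t require heavyweight tooling to start. Here’s a minimal sketch, assuming PostgreSQL; the schema and table names are hypothetical, not from any real organization. It attaches descriptions to the database’s own catalog and makes them queryable:

```sql
-- Minimal sketch, assuming PostgreSQL. Schema and table names are hypothetical.
-- Attach human-readable documentation directly to the data it describes.
COMMENT ON SCHEMA marketing IS
  'Owned by the marketing data team: campaign, spend, and lead data.';
COMMENT ON TABLE marketing.campaigns IS
  'One row per campaign: channel, start/end dates, budget.';

-- "What data does marketing keep?" becomes a query instead of a hallway question.
SELECT c.relname AS table_name,
       obj_description(c.oid, 'pg_class') AS description
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'marketing'
  AND c.relkind = 'r';
```

A dedicated data catalog does this better, but even plain table comments beat folk knowledge.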

We all need to put up guardrails to prevent the improper usage of data, but sometimes those guardrails don’t need to be engineered; they just need to be stated. We don’t need police at every intersection physically preventing us from running red lights. For the most part, we just need to be told not to run them.

Data Vetocracies, however, develop over a long period of time and lead to the creation of jobs and structures whose entire purpose is to limit risk at the expense of data usage. These organizations have data security teams, data privacy teams, data governance teams, data offices, AI governance teams, data fabrics, data meshes, and everything else. They tend to be large companies in regulated industries whose default is to prevent data usage as much as possible.

I’ve encountered multiple organizations where it takes data scientists 3+ months to get access to data for a project. In extreme cases, I’ve heard of it taking as long as 8 months.

The typical structure goes something like this:

  1. Data Silos — A large company has loads of siloed data. Marketing data is stored in AWS, and the marketing BI team uses Looker. Sales has their data in Salesforce. Finance has their data in… etc.
  2. Data Owners — Every large data domain has a data owner who is responsible for managing the data in that domain. They control who can access it, and they are held responsible if something goes wrong. For example, much of one company’s marketing data in the United States might be managed by a single data owner.
  3. Request Infrastructure — The company creates a central place where data consumers (data scientists, BI teams, data analysts, etc.) can request access to the data they need. They might not even see the data itself; it could be as simple as a request form that hopefully gets routed to the right person.
  4. Data Request — The data consumer creates a data request, which is sent to the data owner. For example, maybe I’m a data scientist in sales interested in exploring the impact of marketing spend on business development lead generation. I might then request access to marketing campaign information from the marketing data owner.
  5. Data Denial — A large portion of the time, this request will go unfulfilled. As banal as it sounds, it just comes down to incentives. The data owner typically has their job because they are supposed to own and extract value from their data fiefdom. They have almost no incentive to share this data, and, to make matters worse, they are responsible if someone misuses it. So they will either deny the request outright or make the requestor fill out more paperwork and resubmit. This repeats for a while, and eventually you find yourself in an organization where it takes 8 months to share data.
  6. Data Fulfillment — On the off chance that someone is actually granted access, the process typically looks like this: the DBA creates a view of the data needed by the requestor (CREATE VIEW Marketing_spend…), adds the requestor to a user group (user group Sales), creates a role-based access control rule giving that group access to the view (user group Sales can access view Marketing_spend), and hands the view to the requestor (see the SQL sketch after this list).
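
Concretely, that fulfillment flow usually amounts to a handful of statements run by hand. Here’s a minimal sketch, assuming a PostgreSQL-style database; every object name is illustrative:

```sql
-- Minimal sketch of the manual fulfillment flow, assuming PostgreSQL.
-- All object and user names here are illustrative.

-- 1. The DBA creates a view scoped to what the requestor needs.
CREATE VIEW marketing_spend AS
SELECT campaign_id, channel, spend_usd, spend_date
FROM marketing.campaigns;

-- 2. The DBA adds the requestor to a user group (a role).
CREATE ROLE sales_analysts;
GRANT sales_analysts TO jane_the_data_scientist;

-- 3. The RBAC rule: the group may read the view.
GRANT SELECT ON marketing_spend TO sales_analysts;
```

Every fulfilled request adds another view, another group, and another grant, and nothing ever removes them.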

There are two big reasons why this is broken.

First, the entire organization is incentivized against data sharing. Data owners are held responsible for misuse and, because the system is so slow, rarely benefit from accessing anyone else’s data in return. All they see is risk with no upside, so everyone defaults to denying requests.

Second, the technology stack makes DBAs a bottleneck. The process for distributing data is long, somewhat complicated (depending mostly on the complexity of the SQL required to generate the view), and entirely manual. It doesn’t scale for large organizations, and when you try to make it scale, you end up with thousands of mostly ungoverned, hard-to-audit rules mapping user access to each view (user group A has access to view B, user group C has access to view D, …).
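
Even auditing that sprawl is nontrivial: reconstructing who can see what means crawling the grant metadata itself. A sketch, again assuming PostgreSQL’s information_schema:

```sql
-- Minimal audit sketch, assuming PostgreSQL: list every privilege granted
-- on every view. In a mature data vetocracy this can return thousands of
-- rows, one per hand-written access rule.
SELECT p.grantee,
       p.table_schema,
       p.table_name,
       p.privilege_type
FROM information_schema.table_privileges AS p
JOIN information_schema.views AS v
  ON  v.table_schema = p.table_schema
  AND v.table_name   = p.table_name
ORDER BY p.grantee, p.table_name;
```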

Because of these issues, fixing a data vetocracy requires changes to both process and infrastructure. So, how do you fix it? I’ll try to outline that in part 2: Solving Data Sharing with the Data Exchange.
