Selling data for money

For almost a decade, tech and economics journalists have been saying that data is more valuable than oil. This is because data will fuel the information revolution like oil and coal fueled the industrial revolution.

But have you ever had to scrounge and steal for oil the way companies do for user data? When you need to make an AI model, you somehow need to own a lot of data to get started (which is tough), but if you want to drive your car, you just go and buy gasoline. You don’t need to refine the crude oil yourself.

That’s because oil has a sustainable marketplace where sellers are compensated. And, if we want an information revolution, data will need the same thing.

But the internet is huge. Surely there’s enough free data?

There isn’t.

I have discussed this before, in the context of LLMs, but even for traditional ML, there isn’t really that much data available, and certainly not enough to fulfill the sci-fi promise of ML/AI. We simply don’t have enough data, in volume and variety, to utilize these technologies to the fullest extent. And certainly not enough to get the information revolution going in earnest.

The confusion arises because we see what data-driven models in ML/AI have already achieved, and there are a lot of them doing a lot of amazing things. But that is only what was achievable with the data that is readily available: user activity on websites and apps, images of generic text or objects, or time-series measurements of processes that are already being monitored (e.g. revenue, temperature, or crop yields). These are important, but the reason they were tackled first is that they were “the quickest wins”. The business side dictated that these were the problems to solve, because sourcing the data for them was the cheapest (the data was often already being collected).

The things we don’t think about as often are those that haven’t been solved. These may be very important problems, but the data is very difficult to get, so we see some very ambitious flashes in the pan and no concrete solutions. Think about fully autonomous vehicles, an automated medical diagnostic engine, a robot farm, or an AI personal finance advisor. It’s not that these things aren’t worth solving; the data needed to solve them just isn’t available in the appropriate volume or condition.

For something like full self-driving, you need human drivers’ activity streams when they are behind the wheel, which are very difficult to get because that data is generated but never captured. When Tesla wanted to solve the problem, they first had to make a car that would record all that data, because it just wasn’t available before. Anyone else who wants to work on this can’t use Tesla’s data, because there is no reason for Tesla to provide it, so they have to solve the AI cold start problem all over again. Even then, we don’t have data on how every entity on the road reacted to the events it faced, because not all vehicles on the road carry data collection equipment. Even if they did, the equipment probably wouldn’t all come from the same company, so a complete picture still wouldn’t be formed.

It’s the same thing with the other examples. Medical timeline data is difficult to work with because of fragmentation and concerns about privacy. Agricultural data isn’t collected because farmers can’t put IoT sensors on every plant. And the transactions on my bank statement contain a tiny fraction of the information that was available to be captured.

Why can’t they get this data?

Because it’s just not worth it for people to give it up.

Let’s look at it like oil again. The reason Shell and BP go out there and drill for oil is that it’s worth it for them. The oil they extract is valuable, but they have to spend a lot to get it, and they need to be compensated for that spending if they are to keep drilling continuously, regularly, and sustainably. That compensation is provided by you when you pay at the gas pump. It’s a very civilized system.

For data, the companies that want to use it have to do uncivilized things like steal it from their users. That isn’t sustainable, because a transaction is only sustainable if all parties benefit from it. Capturing a lot of this data will require large investments in things like physical sensors or regulatory compliance, and that makes the data valuable. Those investments can’t be sustained by the barbaric stealing and scrounging that tech companies do to get data right now. They can only be sustained by monetary compensation that makes the effort of sourcing the data not only affordable, but also profitable.

How can that happen?

Well, the very first step is for tech companies to let go of the unreasonable expectation that they can get the single most valuable resource on earth for free. Companies making and providing ML/AI products have to be willing to pay for the data they use.

Once that basic principle of economic transactions is fulfilled, we need to come together to establish platforms for data commerce. If you have a little herb garden in your backyard, you should be able to sell that horticultural process’ data on an open market. None of this “we give you the data collection equipment for free and we get to keep your data” nonsense. The only thing this does is silo the data to a single consumer and not let its price reach a natural equilibrium. The data needs to be purchasable by anyone on an open market so that the supplier can earn as much as they need to, and the price per buyer is significantly lower.
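To make the pricing argument concrete, here is a minimal Python sketch with made-up numbers. It assumes the supplier only needs to hit a fixed yearly revenue target and that the target is split evenly among buyers; the function name `price_per_buyer` and every figure are hypothetical, not taken from any real marketplace.

```python
def price_per_buyer(supplier_target: float, num_buyers: int) -> float:
    """Price each buyer pays if the supplier's target revenue is split evenly.

    Assumption: selling the same data to one more buyer does not
    raise the supplier's costs.
    """
    if num_buyers < 1:
        raise ValueError("need at least one buyer")
    return supplier_target / num_buyers


# Hypothetical figures: the herb gardener wants 1200 (any currency) per year.
siloed = price_per_buyer(1200, 1)        # one exclusive buyer pays everything
open_market = price_per_buyer(1200, 20)  # twenty buyers share the same target

print(f"siloed: {siloed:.2f}, open market: {open_market:.2f}")
```

Under these assumptions the supplier earns the same amount either way, but each buyer on the open market pays a fraction of what the single siloed buyer would.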

Obviously this data needs to be collected before any sales can be made. This is a simple problem that can be solved by financing. You see, the data is not that expensive to maintain. It’s already generated as a byproduct of an independent process that needs to be maintained anyway. The only extra cost is collection. Once a bank sees that selling the data is a low-maintenance (and thus low-risk) revenue stream, it will be happy to offer financing for the initial setup costs.
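As a rough illustration of how a lender might look at this, the sketch below computes a simple payback period. It is a deliberately naive model that ignores interest and assumes steady monthly income; `payback_months` and all the numbers are hypothetical.

```python
import math


def payback_months(setup_cost: float, monthly_revenue: float, monthly_upkeep: float) -> int:
    """Whole months until cumulative net data revenue covers the setup cost.

    Assumptions: constant revenue and upkeep, no interest on the loan.
    """
    net = monthly_revenue - monthly_upkeep
    if net <= 0:
        raise ValueError("data sales never cover the setup cost")
    return math.ceil(setup_cost / net)


# Hypothetical figures: sensors cost 3000 up front, data sells for 150 a month,
# and keeping the collection running costs 25 a month.
months = payback_months(setup_cost=3000, monthly_revenue=150, monthly_upkeep=25)
print(months)
```

The lower the upkeep relative to revenue, the shorter the payback period, which is the sense in which a low-maintenance data stream looks low-risk to a financier.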

Scale might be a problem. The backyard herb garden from my earlier example might not produce data of the volume or standard that buyers need. This can be remedied with sellers’ cooperatives of some sort, where sellers with similar data-generating processes collect their data in a standardized way and pool it before selling it. A single herb garden’s data probably isn’t that valuable, but thousands of herb gardens all over the world can produce very valuable data.
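Here is a minimal sketch of what such a cooperative could do in practice: convert each member's rows into one shared schema, pool them, and split the sale price by contribution. The schema (`Reading`), the field names, and the pro-rata payout rule are all assumptions made for illustration; a real cooperative could choose very different ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """Hypothetical shared schema every member agrees to report in."""
    seller_id: str
    plant: str
    soil_moisture_pct: float


def pool(submissions: dict[str, list[dict]]) -> list[Reading]:
    """Convert each member's raw rows into the shared schema and merge them."""
    pooled = []
    for seller_id, rows in submissions.items():
        for row in rows:
            pooled.append(
                Reading(seller_id, row["plant"].lower(), float(row["soil_moisture_pct"]))
            )
    return pooled


def split_revenue(pooled: list[Reading], sale_price: float) -> dict[str, float]:
    """Pay each member in proportion to the number of rows they contributed."""
    counts: dict[str, int] = {}
    for reading in pooled:
        counts[reading.seller_id] = counts.get(reading.seller_id, 0) + 1
    total = len(pooled)
    return {seller: sale_price * n / total for seller, n in counts.items()}


# Two made-up gardens reporting in slightly different raw formats.
submissions = {
    "garden_a": [
        {"plant": "Basil", "soil_moisture_pct": "41.5"},
        {"plant": "mint", "soil_moisture_pct": 38},
        {"plant": "basil", "soil_moisture_pct": 40.0},
    ],
    "garden_b": [{"plant": "Thyme", "soil_moisture_pct": 29.0}],
}
pooled = pool(submissions)
payouts = split_revenue(pooled, sale_price=100.0)
print(payouts)
```

Paying by row count is the simplest possible rule; weighting by data quality or rarity would be an equally plausible design.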

The gist of it is to create a platform that facilitates the free-flowing and sustainable transfer of data. That's necessary for us to realize the full potential of the information revolution.

TL;DR

Data is the most valuable modern resource, like oil and iron before it
Its supply chains are non-existent, so everyone is scrounging and stealing
This limits AI research, development, and deployment
We need to set up proper supply chains for data
These need to be sustained with compensation flowing in the opposite direction
We need platforms where people can perform basic and sustainable data commerce