LM embeddings

Machine learning needs a lot of data to work. Previously the big problem was how to process all the data we have so that ML models can use it. But that was when supervised learning was the state of the art. Now that LLMs look like they could solve any task we throw at them, we have a different problem.

LLMs use unsupervised learning, need far more data to work, and operate across a general domain rather than on specific tasks. This means that our primary source of data, the internet, suddenly does not have the volume or diversity of data that this new evolution of ML demands.

What do you mean the internet doesn’t have enough data? It’s huge!

It is. And it used to be more than big enough. But not anymore.

The internet is like the ocean: interacting with it wrecks our ability to grasp and contextualize scale. When you’re floating in that sea of data, it keeps going in every direction with no boundary in sight, and you begin to think it’s infinite. That’s because you judge scale by what you yourself can use. Your house seems small because there’s no room for the home theater, or sun room, or library, or whatever else you saw pictures of online. But the ocean, or the internet? You couldn’t traverse that in a lifetime. You couldn’t read all the content on the internet any more than you could swim across the Pacific.

But consider this: it doesn’t take a cargo ship that long to cross from Asia to North America, and it doesn’t take that large an LLM to understand all of Wikipedia and answer questions about it.

It was big enough before. What changed?

The state of the art went from supervised learning to unsupervised learning.

Explaining what those are needs a very small lecture, so skip the next two paragraphs if you already know or just don’t care. Supervised learning is when an ML model is set to solve a task and learns to do so from a dataset structured into samples, each pairing an input description with an associated output prescription. That’s tough to grasp if you don’t already understand it, so here's an example. Imagine you need to teach someone how to identify birds, for birdwatching or whatever. Supervised learning would be giving them pictures of every bird in existence along with their names, and having them memorize the pairs. The pictures are the descriptions (albeit visual rather than textual), and the names are the prescriptions they must learn.
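
To make that concrete, here is a minimal sketch of supervised learning in code, using scikit-learn. The species names and numeric features are invented for illustration; the point is only that every training sample pairs a description (the features) with a prescription (the label).

```python
# Minimal supervised-learning sketch: every sample is (description, prescription).
# The features and species names here are invented purely for illustration.
from sklearn.ensemble import RandomForestClassifier

# Descriptions: [wingspan_cm, body_mass_g] for each observed bird.
X = [
    [20, 30],     # small songbird
    [23, 35],
    [180, 4500],  # large raptor
    [195, 5000],
]
# Prescriptions: the labels the model must learn to reproduce.
y = ["sparrow", "sparrow", "eagle", "eagle"]

model = RandomForestClassifier(random_state=0)
model.fit(X, y)  # learn the mapping from descriptions to prescriptions

print(model.predict([[22, 33]]))  # -> ['sparrow']
```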

This is opposed to unsupervised learning, where an ML model is not directly made to solve a task, but rather to understand the governing principles of the domain the task belongs to, after which it can solve the task. It learns to do this from an unstructured dataset that contains only descriptions; if any associated prescriptions exist, they are jumbled up in there. Extending the previous example: now you aren’t giving the person a list of bird pictures and names; you are giving them every textbook, research paper, and article ever written about birdwatching, and they are poring over them to learn everything they can. These works don’t hold prescriptions in a neat list; the person will have to read and absorb a great deal about birds. But when they do, they will be able to tell you not only the name of any bird you see, but almost anything else about its behavior or habitat.
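
In code, the key difference is that no labeled list exists anywhere: a language model manufactures its own training signal from raw text by predicting the next token, so any unstructured pile of documents is usable as-is. Here is a toy sketch of that idea (word-level for readability; real LLMs use subword tokenizers):

```python
# Self-supervised sketch: the "labels" are just the next token in raw text,
# so an unstructured corpus needs no manual annotation at all.
raw_text = (
    "The osprey hovers above the water before diving feet-first to catch fish. "
    "It builds large stick nests near lakes and coastlines."
)

tokens = raw_text.split()  # toy word-level tokenization
context_size = 4

# Every position in the text yields a (context, next-token) training pair.
pairs = [
    (tokens[i : i + context_size], tokens[i + context_size])
    for i in range(len(tokens) - context_size)
]

for context, target in pairs[:3]:
    print(context, "->", target)
```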

LLMs use unsupervised learning. And their success has convinced the world that this is what we need to apply machine learning to real-world problems and fulfill the promise of AI.

But note two things. Where supervised learning needed a narrow, structured dataset that took a long time to compile, unsupervised learning needs a far larger unstructured dataset that only has to be gathered together before it is ready to be fed to the model. And while the next task for supervised learning would have needed another dataset from the same domain (e.g. learn to classify bird habitats using this labeled map), unsupervised learning solves all tasks in its domain at once, and then needs all the data of another domain (teach the person to be a fisherman).

We can feed LLMs data a lot more cheaply, because we don’t have to spend immense time and effort labeling it, and we have to give them a lot more of it, because they directly learn the internal governing principles of the domain. And humanity’s AI ambitions rest on our ability to do so.

So what do LLMs need?

Well, first off, let me not undersell the effort needed to collect training data for unsupervised models. We don’t need to label datasets, but we still need to gather an exhaustive set of unstructured data. And while that’s easier, it still needs to be done, and it needs to become very easy to do, because everyone will have their own LLMs soon.
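
As a rough illustration of what “gather it in one place” means in practice, here is a sketch that sweeps a directory tree of plain-text documents into a single corpus file. The paths and file extension are hypothetical placeholders, and a real pipeline would also deduplicate, filter, and clean.

```python
# Sketch of corpus gathering: no labeling, just collecting unstructured text
# into one place. Paths and the .txt extension are hypothetical placeholders.
from pathlib import Path

def gather_corpus(source_dir: str, output_file: str) -> int:
    """Concatenate every .txt file under source_dir into one corpus file."""
    docs = 0
    with open(output_file, "w", encoding="utf-8") as out:
        for path in Path(source_dir).rglob("*.txt"):
            out.write(path.read_text(encoding="utf-8", errors="ignore"))
            out.write("\n\n")
            docs += 1
    return docs

# Example usage (hypothetical directory):
# print(gather_corpus("scraped_documents/", "training_corpus.txt"))
```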

Secondly, the data we need will be determined by the domains we wish to solve. Imagine you want to automate financial forensics: tasks like detecting fraud, waste, and compliance violations. You could try to do that with an LLM that, instead of reading natural language like GPT, reads financial transactions and understands the real-world processes being described in them. But let me ask you: do you have a representative dataset of all the different types of financial transactions that happen? Could you find one online?
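
To gesture at what “an LLM that reads transactions instead of sentences” might mean, here is a hedged sketch of serializing transaction records into flat token sequences that a sequence model could be trained on. The field names and format are entirely hypothetical.

```python
# Hypothetical sketch: turning structured transaction records into flat token
# sequences, so a sequence model could learn the "grammar" of transactions the
# way an LLM learns the grammar of text. Field names are invented.
def transaction_to_tokens(txn: dict) -> list:
    return [
        f"<date:{txn['date']}>",
        f"<from:{txn['payer']}>",
        f"<to:{txn['payee']}>",
        f"<amount:{txn['amount']:.2f}>",
        f"<type:{txn['type']}>",
        "<end>",
    ]

example = {
    "date": "2023-07-14",
    "payer": "ACME_CORP",
    "payee": "OFFICE_SUPPLIES_LLC",
    "amount": 1249.99,
    "type": "invoice_payment",
}

print(" ".join(transaction_to_tokens(example)))
```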

What about a robot doctor? Do you have a set of medical records, test results, prescriptions, symptom descriptions and all the other paperwork generated by hospitals? Is it large enough to be as representative of modern medicine as wikipedia is of general knowledge? Is it as easy to get as letting a web crawler loose on “www.wikipedia.org/”?

LLMs will automate almost every kind of simple intellectual labour humans perform. And for each domain we wish to do that in, they will need unstructured datasets that are representatively large and gathered in one place. The internet is big, but it just doesn’t have that.

TL;DR

  • The internet used to have more data than ML models could ever use
  • This was true for supervised learning, which needed structured and labeled data
  • But unsupervised learning can consume data cheaply, so it can consume more
  • We will need a lot more diverse data gathered in one place to leverage LLMs
  • The internet cannot provide this right now