To build a large language model (LLM), you need hundreds of terabytes (if not petabytes) of training data. But where do you, as a developer, get all this data? And once you've built your model, how can you be sure that you don't get hit with a lawsuit if it turns out that you've unknowingly used copyrighted or inaccurate data?
In some instances, AI developers have been found to have scraped hundreds of gigabytes of pirated ebooks, proprietary code, or personal data from online sources – without the consent of the authors or subjects involved. And because the standard for an LLM today is a model that can recite poetry, write Python, and explain quantum physics, companies face a competitive incentive to build the biggest models possible.
Not only does this make it more likely that companies will hoover up copyrighted training data in a race to hit a certain parameter count, but it also leads to greater environmental damage and less accurate results. What we need in many situations, instead of LLMs, are smart language models (SLMs). These would be models with a narrower knowledge base, trained on a sensible amount of ethically sourced data and tailored to solve a particular business problem.
Steering clear of copyrighted or illegal datasets
If you want to make sure that your AI model can weather the storm of AI regulation over the next few years, the simplest way is to make sure that you've researched and verified the source of all your training data. That is easier said than done.
The nature of the tech landscape makes it far easier for hyperscalers like Amazon or Microsoft to build and train their own models. They have mountains of user data collected from different arms of the business to feed their neural networks. For a start-up looking to find its niche in the market by training a new model, collecting a similar volume of data whilst dodging copyrighted material can feel like an impossible task.
To start with, follow the usual steps: make sure you’ve got the necessary permissions or licenses to access and use the datasets you’ve selected, and set up rules to govern your collection and storage of user data.
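One lightweight way to operationalise those steps is to keep a machine-readable provenance record for every dataset you ingest, and refuse to train on anything whose licence and consent haven't been verified. The sketch below is illustrative only – the record fields and the approved-licence list are assumptions, not a standard, and no hard-coded list replaces a proper legal review:

```python
from dataclasses import dataclass

# Licences assumed (for this sketch) to permit model training.
# Always confirm the actual terms with legal review.
APPROVED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "MIT", "Apache-2.0"}

@dataclass
class DatasetRecord:
    name: str
    source_url: str         # where the data was obtained
    license: str            # SPDX-style licence identifier
    consent_verified: bool  # did we confirm consent / terms of use?

def usable_for_training(record: DatasetRecord) -> bool:
    """A dataset is usable only if its licence is on the approved
    list AND its provenance has been manually verified."""
    return record.license in APPROVED_LICENSES and record.consent_verified

corpus = [
    DatasetRecord("open-papers", "https://example.org/papers", "CC-BY-4.0", True),
    DatasetRecord("scraped-ebooks", "unknown", "UNKNOWN", False),
]
# Only the verified, openly licensed dataset survives the filter.
training_set = [r for r in corpus if usable_for_training(r)]
```

The point is less the code than the discipline: if a dataset can't produce a complete record, it doesn't reach the training pipeline.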
Also, think about whether using a smaller dataset to train your model or fine-tuning an existing open-source alternative might be a more effective solution. This makes it easier for you to collect enough data and verify its origins, and while this model may have less broad applicability than a ChatGPT or Bard, you can take this as an opportunity to enhance its reliability for a specific domain or industry.
There is, of course, another alternative. Issues abound with organic training data around copyright, accuracy, and bias, which is why many in the AI community are proponents of synthetic training data. If we can synthesise data for a particular problem, we can train models to a far higher degree of accuracy whilst avoiding copyright issues entirely.
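As a toy illustration of the idea, synthetic examples for a narrow task can be expanded from templates and structured records you already own, so every training pair has a known, copyright-free origin. The catalogue and templates below are hypothetical stand-ins:

```python
import random

# Structured records we own outright, e.g. an internal product catalogue.
products = [
    {"name": "Widget A", "price": 19.99, "stock": 12},
    {"name": "Widget B", "price": 4.50, "stock": 0},
]

# Question/answer templates; each expansion is a labelled training pair.
templates = [
    ("How much does {name} cost?", "{name} costs ${price}."),
    ("Is {name} in stock?",
     lambda p: f"{'Yes' if p['stock'] > 0 else 'No'}, "
               f"{p['stock']} units are in stock."),
]

def synthesize(seed: int = 0) -> list[tuple[str, str]]:
    """Expand every template against every record to produce
    question/answer pairs with fully known provenance."""
    rng = random.Random(seed)
    pairs = []
    for p in products:
        for q, a in templates:
            question = q.format(**p)
            answer = a(p) if callable(a) else a.format(**p)
            pairs.append((question, answer))
    rng.shuffle(pairs)  # avoid ordering bias in the training set
    return pairs
```

Two records and two templates already yield four labelled pairs; real pipelines scale the same pattern up by orders of magnitude.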
This kind of resourceful thinking is vital. After all, for every model we call smart, its builder was even smarter in how they leveraged existing models, data points, and data analysis to prepare, scale, and manage their data.
Think about the specific pain point you want to solve (such as finding the right paper among vast amounts of scientific research), and then train your model on a focused, labeled dataset drawn from authoritative sources in that domain – in this case, open-access academic research.
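To make that example concrete, here is a deliberately tiny sketch of the retrieval side of the problem: ranking a handful of open-access abstracts against a query by shared terms. The corpus and the scoring are illustrative; a real system would use a trained ranking model over a properly licensed corpus:

```python
import re
from collections import Counter

# A stand-in corpus of open-access abstracts (illustrative content).
papers = {
    "perovskite-solar": "Perovskite solar cells achieve high efficiency ...",
    "graphene-batteries": "Graphene anodes improve lithium battery capacity ...",
    "protein-folding": "Deep learning predicts protein folding structures ...",
}

def tokenize(text: str) -> Counter:
    """Lowercase word counts; crude but enough for the sketch."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def rank_papers(query: str) -> list[str]:
    """Rank paper ids by shared term count with the query, a crude
    proxy for the focused, domain-specific model described above."""
    q = tokenize(query)
    scores = {pid: sum((tokenize(txt) & q).values())
              for pid, txt in papers.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The narrowness is the point: a model trained and evaluated only on this kind of domain data can be audited end to end, source by source.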
Again, a model’s quality is tied directly to how smart you can be as its developer. The level of care and resourcefulness you put into acquiring your data will determine how focused and high-quality you can expect the model to be.
Avoiding misinformation and inaccurate responses
Another benefit of curating a high-quality, vetted set of training data is that your users can rely on your models to produce accurate, informed responses – cutting down on the spread of misinformation and hallucinatory responses.
Every day, we read stories about models like ChatGPT or Bard generating inaccurate or downright false responses to questions. If you want to build an adaptable, efficient, and accurate model that stands the test of time, you've got to make factual validation a key part of your model's architecture.
We have an opportunity to alter the underlying mechanics of neural networks to prioritise accuracy and high-quality training. These models, to date, have been built to ingest vast amounts of information and then reproduce it in sequence, without any internal sense of whether the output lines up with the facts they were trained on.
We need to build models that are more selective in their unsupervised learning, have upgraded attention spans, and can focus with greater ease – using internal mechanisms to filter the data before feeding it into the training process.
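A first, crude layer of that filtering can happen before any training at all: dropping documents that are too short, mostly non-text, or duplicates of what has already been accepted. The thresholds below are placeholders, not recommendations:

```python
import hashlib

def looks_like_text(doc: str, min_len: int = 50,
                    min_alpha_ratio: float = 0.6) -> bool:
    """Reject fragments and binary-ish junk before they reach training."""
    if len(doc) < min_len:
        return False
    alpha = sum(c.isalpha() or c.isspace() for c in doc)
    return alpha / len(doc) >= min_alpha_ratio

def filter_corpus(docs: list[str]) -> list[str]:
    """Keep only plausible text, deduplicated by content hash."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen and looks_like_text(doc):
            seen.add(digest)
            kept.append(doc)
    return kept
```

Production pipelines layer far more sophisticated filters (near-duplicate detection, quality classifiers) on top, but the principle is the same: curate before you train, not after.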
A smarter way to build language models
Right now, LLMs built by hyperscalers consume the electricity and resources of a small city, and this is only increasing. Training GPT-3 alone took an estimated 355 years of single-processor computing time and consumed 284,000 kWh of energy – 10 times more than GPT-2. Quite aside from the harm this causes to our planet, it's hugely inefficient. By upgrading the training process and narrowing down our list of specific use cases, we can build future-proofed, sustainable models.
If you have a specific use case that AI can help you with (scanning new scientific patents for potential infringements, for example), why do you need the model to be able to recite Shakespeare? More data doesn't always lead to a better system, and in specialised, technical domains like material science or medical writing, quality is vastly more important than quantity.
There’s another idea that may help you bypass copyright issues related to training LLMs. Think about how you could use a swarm of smart language model agents, each with more autonomous self-direction in how it reaches its goals, to tackle multiple facets of a business problem – rather than contorting a single LLM out of shape to solve it all at once.
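One way to picture that swarm is a dispatcher that routes each facet of a problem to a specialised model instead of one monolith. Everything here – the agent names, the keyword routing – is a hypothetical sketch; a real system would use a planner model rather than string matching:

```python
from typing import Callable

# Hypothetical specialised agents: each would wrap a small, domain-tuned model.
def patent_agent(task: str) -> str:
    return f"[patent-slm] scanned claims in: {task}"

def legal_agent(task: str) -> str:
    return f"[legal-slm] checked licence terms in: {task}"

def summarizer_agent(task: str) -> str:
    return f"[summary-slm] summarised: {task}"

# Keyword routing stands in for a real planner/controller.
ROUTES: list[tuple[str, Callable[[str], str]]] = [
    ("patent", patent_agent),
    ("licence", legal_agent),
]

def dispatch(task: str) -> str:
    """Send each sub-task to the narrow model best suited to it."""
    for keyword, agent in ROUTES:
        if keyword in task.lower():
            return agent(task)
    return summarizer_agent(task)  # default facet

results = [dispatch(t) for t in
           ["Check this patent filing", "Review the licence text", "Brief the team"]]
```

Each agent needs only its own small, verifiable training set, which keeps the provenance question tractable for every member of the swarm.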
Industry leaders like Andrew Ng have called for the development of “data-centric AI,” which focuses on engineering the data required to construct a specific AI model. This movement aims to enhance the quality and labeling of data to match the efficiencies and methods of the newest algorithms.
If you want to build an AI model in a way that keeps you out of hot water, copyright-wise, make sure you stick to the basics and put quality before quantity. Research your sources, understand how much data you need to collect for a specific use case, and create factual validation mechanisms to ensure accuracy.
Let's work together to build smarter language models, not just larger ones.
Lead image: Dreamstudio