The last decade has seen tremendous improvements in AI, driven largely by larger models and increases in computation. If we could add more computation, there is little question that it would drive further improvements in model quality. What doubts remain about the next set of quality improvements stem from uncertainty about how we are going to get the next 1000X in computation.
Let’s start by looking at where we got the last 1000X to see if that gives us any clues.
Back in 2012, when GPU use for deep learning began, the default numeric format was fp32 (32-bit floating point). Over time we realized that fp16 (or, better yet, bf16, another 16-bit format) worked quite well with no loss in model quality, and more recently these models have been quantized, particularly for inference, to int8 precision. This change in number representation gave the biggest bang, roughly 16X: it combines going from 32 bits to 8 bits with going from floating point to integer (fixed-point) arithmetic.
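As a rough illustration of what int8 quantization does, here is a minimal sketch of symmetric per-tensor quantization in NumPy; production toolchains typically add per-channel scales, zero points, and calibration data.

```python
import numpy as np

def quantize_int8(w):
    """Map fp32 weights onto the int8 range [-127, 127] with one shared scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original fp32 values."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```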
The next set of improvements came from changes in hardware. GPUs were better than CPUs at deep learning because of the vector processing they supported (originally designed for graphics). TPUs were the first to introduce a dedicated matrix multiply unit (added to GPUs shortly after), giving another ~12.5X improvement.
The promise of Moore's law was already fading by 2012, and since then going from roughly a 28 nm process to 5 nm has given only about a 2.5X improvement.
The last 2X on GPUs comes from leveraging the sparsity in deep learning computations (lots of zeros) to skip or simplify the corresponding multiplies.
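Putting these factors together, using the approximate numbers quoted above, accounts for the roughly 1000X:

```python
# Rough bookkeeping for the last decade's ~1000X (approximate factors, not measurements).
precision = 16      # fp32 -> int8, float -> fixed point
matrix_unit = 12.5  # dedicated matrix multiply units (TPU MXU, then GPU tensor cores)
process = 2.5       # ~28 nm -> ~5 nm
sparsity = 2        # structured sparsity support

print(precision * matrix_unit * process * sparsity)  # -> 1000.0
```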
Of course, the 1000X we have talked about is for a single GPU. If we look at what is used for training, a single TPUv5p pod supports ~4 exaflops (that's 4×10¹⁸ floating-point operations per second), and stringing a few pods together (likely done for Gemini training) increases that by perhaps another 10X. For comparison, the largest model in 2012 (the "cat paper") had about 1 billion parameters, while recent LLMs appear to be in the few-trillion-parameter range, a factor of roughly 1000X as well.
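A quick side calculation of that parameter growth, with the "few trillion" pegged at 2 trillion purely for illustration (not a disclosed figure):

```python
params_2012 = 1e9   # ~1 billion parameters in the 2012 "cat paper"
params_now = 2e12   # assumed midpoint for "a few trillion" today
print(params_now / params_2012)  # ~2000, i.e. on the order of 1000X
```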
Can we continue on the same path for more improvements?
There are promising results with 4-bit models at lower precision, although the hardware doesn't yet take full advantage of them. Early research on 1-2 bit models shows that they are trainable, but they are not ready for prime time. Could we go all the way to 1 bit? That would be at least an 8X improvement over an 8-bit model, potentially more.
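For intuition, here is a minimal sketch of 1-bit weight quantization: each weight keeps only its sign, plus one shared scale per output row. The per-row mean-absolute-value scale is my own simplification for illustration, not a description of any particular paper.

```python
import numpy as np

def binarize(w):
    """Reduce fp32 weights to {-1, +1} plus one fp32 scale per output row."""
    scale = np.abs(w).mean(axis=1, keepdims=True)  # preserves each row's magnitude
    b = np.where(w >= 0, 1, -1).astype(np.int8)    # 1 bit of information per weight
    return b, scale

w = np.random.randn(4, 8).astype(np.float32)
b, scale = binarize(w)
print(b)  # only +/-1 remain: 8X less storage than int8, 32X less than fp32
```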
Is there room for hardware improvements? Beyond the dedicated MXU (matrix multiply unit), there has been little change in chip design itself. Cerebras offered a bigger and more interesting departure with their wafer-scale systems, but it hasn't moved the needle much, as there hasn't been much co-design between models and the hardware.
Going down in bits (particularly to 1 bit, since that replaces multiplications entirely) for either activations or weights will give some improvement (say 2X), and it is likely to push the bottleneck toward memory bandwidth rather than compute, thus requiring bigger changes to hardware design.
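To see why ±1 weights remove the multiplications, here is a sketch of a matrix-vector product where each weight only decides whether an activation is added or subtracted; the explicit loops are for clarity, and real kernels would be vectorized.

```python
import numpy as np

def binary_matvec(b, scale, x):
    """Compute y = (scale * b) @ x using only additions and subtractions per weight."""
    out = np.zeros(b.shape[0], dtype=np.float32)
    for i in range(b.shape[0]):
        acc = 0.0
        for j in range(b.shape[1]):
            acc += x[j] if b[i, j] > 0 else -x[j]  # no multiply per weight
        out[i] = scale[i] * acc                    # one multiply per output row
    return out

b = np.random.choice(np.array([-1, 1], dtype=np.int8), size=(4, 8))
scale = np.ones(4, dtype=np.float32)
x = np.random.randn(8).astype(np.float32)
print(np.allclose(binary_matvec(b, scale, x), (scale[:, None] * b) @ x))  # True
```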
Process improvements (aka Moore's law) are coming slowly and seem unlikely to provide much. The best we can hope for is another 2X over the next 5-10 years.
The gains from sparsity have been small so far (~2X), but is it time to leverage it more? With model sizes increasing drastically, it should be easier to have sparser layers. Current hardware, however, has mostly been designed for dense computation, and leveraging unstructured sparsity will not be easy.
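For reference, the ~2X that current GPUs do deliver comes from structured 2:4 sparsity, where at most two weights in every group of four are nonzero; below is a minimal sketch of pruning a matrix into that pattern. Unstructured sparsity, with zeros falling anywhere, has no comparable hardware fast path today.

```python
import numpy as np

def prune_2_to_4(w):
    """Keep the 2 largest-magnitude weights in every group of 4, zero the rest."""
    rows, cols = w.shape
    assert cols % 4 == 0
    groups = w.reshape(rows, cols // 4, 4)
    # indices of the two smallest-magnitude entries in each group of four
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=-1)
    return pruned.reshape(rows, cols)

w = np.random.randn(2, 8).astype(np.float32)
print(prune_2_to_4(w))  # exactly half the entries in each group of 4 are zero
```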
While it is easy to add more chips to the fleet, it is hard to leverage them for training a single model. Large-batch training is not very efficient (per example), and hence the size of the batch, coupled with the width of the model, limits the maximum parallelism. The large cost of experimenting with the biggest clusters makes it difficult to find approaches that work in this setting.
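A back-of-the-envelope way to see that ceiling, using purely hypothetical numbers: if per-example efficiency falls off beyond some critical global batch size, then the number of chips you can usefully apply to one model with data parallelism is roughly that batch size divided by the smallest batch each chip can run efficiently.

```python
# Hypothetical numbers for illustration only.
critical_batch = 16_000_000   # tokens per step beyond which returns diminish
min_per_chip = 4_000          # tokens per chip per step needed to keep a chip busy
max_useful_chips = critical_batch // min_per_chip
print(max_useful_chips)       # ~4000 chips of pure data parallelism
```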
Combining all these together seems to have potential for 64X, which is helpful but nowhere close to the 1000X we would like.
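Here is one way the per-item estimates above might multiply out to that 64X; the breakdown is my reading of the numbers in this post, not a precise roadmap.

```python
precision = 8   # 8-bit -> 1-bit, if low-bit training pans out
low_bit_hw = 2  # compute gains from bit-width-aware hardware (the "say 2X" above)
process = 2     # optimistic process gains over the next 5-10 years
sparsity = 2    # sparsity at today's ~2X level
total = precision * low_bit_hw * process * sparsity
print(total, round(1000 / total, 1))  # 64X, leaving a ~16X gap to reach 1000X
```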
Of course a completely new hardware approach could provide a significant bump and there are some crazier ideas including photonic chips, analog chips and neuromorphic designs, but they are still limited in what they have been able to deliver.
In the existing space of ideas, sparsity has the most room and could easily provide the additional 16X needed if it works, but it will need some support from hardware to scale up.
So what are the odds of us getting another 1000X in the next decade? The road to the next 1000X is harder than the last one, but the motivation is a lot higher now that AI has proven its value. The exact path may not be clear today, but the explorations underway across the industry as a whole are likely to yield strong results and take us to a better place.
Siri turns 13 this year. I look forward to a version more powerful than GPT-4 running entirely on my phone before it reaches adulthood.
I'm curious to understand in what way you think AI has "proven its value". Sure, it has helped with question answering and speech engines, but the impact still seems to be pretty low so far (OpenAI and Perplexity have both seen close to zero traffic growth this past quarter). Maybe Midjourney and Pika offer some benefits when it comes to generating non-copyrighted content, but the vast majority of people don't use these tools. Where do you see the proof of AI's value?