Zephyr, Mistral and LLMs in general

Exploring the role of foundation models and the emergence of smaller LLMs like Mistral 7B

There was big news in the world of LLMs at the beginning of the month: Mistral 7B. This new LLM stands out for its strong performance at a relatively small size, and it does not have to hide behind larger models such as Llama 2 13B. Most importantly, it belongs to the class of foundation models. A foundation model is a specific type of machine learning model, here in the field of natural language processing (NLP). These models are trained on a broad range of tasks and form the basis of any fine-tuned model. However, the training process is costly because it requires a lot of computing power. The classic foundation models are the GPT series, PaLM, and Llama. In particular, LLaMA and Llama 2 have become the standard for open models. So why are foundation models so important?

Foundation models

Foundation models play a very important role in the ecosystem of open models. They are the backbone of the whole open source LLM community. If the foundation of a house is not solid, the structure above it becomes unstable. In the same way, the quality and licence of a foundation model determine everything you can do with it later on. Only a handful of these models are available as open source, because of the amount of money needed to create them. But foundation models are not everything. At the beginning of this year, we saw how far fine-tuning can go. When OpenAI became the talk of the town with ChatGPT and the GPT-3.5 and GPT-4 models behind it, the open source community raced to catch up by creating similarly powerful models.

At the same time, there was a renewed focus on the importance of a good training dataset. With a rich pool of information, the performance of Meta's initially "weak" Llama models could be drastically improved. Llama 2 improved on this further, and its licence allowed commercial use for the first time. Meanwhile, the Falcon LLM series had already attempted to solve the licensing problem with its own, almost equally good models. But back to Mistral 7B. It does not necessarily reach new heights in LLM performance overall, but it plays in a completely different performance class than its size would suggest.

The price of openness

Open models have a price, and I mean that literally. In reality, OpenAI and Anthropic are closer to SaaS companies with a research division. They hide their models as magical black boxes behind a very convenient API with a small price tag. When you develop with these AI models, you are mostly consuming an API, just as you would with any other SaaS product. This may be a critical, if not so obvious, danger behind all the AI startups offering the most innovative solutions: their core business is making smart requests to OpenAI's API. If OpenAI changes something in its product, they are directly affected. A good example is Codex, which was shut down at short notice.

On the other hand, we have the open models. Here, developers have to provide the necessary resources themselves. What sounds simple is in fact a very big challenge. Large LLMs require huge amounts of computing power to run, and in practice there is no way around GPUs. As a result, many of the "convenience" features that the cloud industry has developed in recent years, such as serverless, are not available in the AI world.

Llama 2 70B vs Mistral 7B

Want an example? To use the most powerful model of the Llama 2 series (Llama 2 70B) on AWS, you should use an ml.p4d.24xlarge instance with 8 A100 GPUs (320 GB VRAM in total). Of course, you can halve this number by using quantization. However, we are still talking about an ml.g5.12xlarge instance with 4 A10G GPUs (96 GB VRAM in total). In this case, the cost is $7.09 per hour (on-demand pricing), or $5,274.96 per month of continuous operation. The instance can only handle 5 concurrent requests. Real applications require more instances at a much higher price. Who wants to pay that for a few hundred requests a month?
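
To see why the 70B model needs this class of hardware, a rough back-of-the-envelope estimate can help. The sketch below only counts the memory needed to hold the weights, assuming 2 bytes per parameter for fp16, 1 byte for 8-bit, and 0.5 bytes for 4-bit quantization. The function names are hypothetical, and real deployments need additional headroom for activations and the KV cache.

```python
# Rough estimate of the GPU memory needed just to hold a model's weights.
# Assumption: activations, KV cache, and framework overhead are ignored,
# so real deployments need noticeable headroom on top of these numbers.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight size in GB (1 GB = 10**9 bytes)."""
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions: float, precision: str, vram_gb: float) -> bool:
    """True if the weights alone fit into the given amount of VRAM."""
    return weight_memory_gb(params_billions, precision) <= vram_gb


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        size = weight_memory_gb(70, precision)
        print(f"70B @ {precision}: ~{size:.0f} GB, "
              f"weights fit in 96 GB: {fits(70, precision, 96)}")
```

Under these assumptions, the fp16 weights of a 70B model alone exceed 96 GB, while the 8-bit version leaves some room for everything else the model needs at inference time.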

Of course, you could scale such instances down to 0. However, they have very long startup times, which would make any waiting customer turn away in anger. The calculation looks completely different when we go smaller and look at the 7B models. Here, an ml.g5.2xlarge with 24 GB VRAM can be used for either the Llama 2 7B model or, with quantization, the 13B model. This instance only costs $1.52 per hour, or $1,130.88 per month. On top of that, an ml.g5.2xlarge has only one GPU per instance. At the same time, the memory requirements are starting to approach those of consumer graphics cards, as an NVIDIA RTX 4090 already has 24 GB of VRAM. It is therefore very pleasing to see that Mistral has paid more attention to this size range and made improvements.
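
The monthly figures quoted here follow directly from the hourly prices if you assume an instance that runs around the clock for a 31-day month (744 hours). The short sketch below reproduces that arithmetic and adds a cost-per-request view. The function names and the figure of 500 requests per month are illustrative assumptions, not measured values.

```python
# Reproduces the monthly cost arithmetic from the hourly on-demand prices.
# Assumption: the instance runs continuously for a 31-day month.

HOURS_PER_MONTH = 24 * 31  # 744 hours


def monthly_cost(hourly_usd: float, instances: int = 1) -> float:
    """Cost in USD of running `instances` machines for a full month."""
    return round(hourly_usd * HOURS_PER_MONTH * instances, 2)


def cost_per_request(hourly_usd: float, requests_per_month: int,
                     instances: int = 1) -> float:
    """Average cost in USD of a single request for an always-on deployment."""
    return monthly_cost(hourly_usd, instances) / requests_per_month


if __name__ == "__main__":
    # Hourly prices as quoted in the text; 500 requests is a made-up workload.
    for name, price in [("ml.g5.12xlarge", 7.09), ("ml.g5.2xlarge", 1.52)]:
        print(f"{name}: ${monthly_cost(price):,.2f} per month, "
              f"${cost_per_request(price, 500):.2f} per request at 500 requests")
```

For a low-traffic application, the fixed monthly price dominates, which is why the smaller instance changes the calculation so much.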

Conclusion

The best example of this trend is the new Zephyr 7B from Hugging Face. It builds heavily on the Mistral foundation model and is starting to compete with Llama 2 70B and GPT-3.5 at a fraction of the size. It uses a mix of public and synthetic datasets and employs Direct Preference Optimisation (DPO). More details are described in the model card.
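
For readers unfamiliar with DPO, the core idea can be sketched in a few lines. The function below computes the standard per-example DPO loss for one preference pair, given the log-probabilities that the trained model and a frozen reference model assign to the preferred and the rejected answer. This is a minimal illustration, not Zephyr's actual training code; the function name, the example log-probabilities, and beta = 0.1 are assumptions chosen for demonstration.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss for a single (chosen, rejected) answer pair.

    Each argument is the summed log-probability of a full answer under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy favours each answer than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    # Illustrative numbers only: the policy has moved towards the chosen answer.
    print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
    # No movement relative to the reference gives the baseline loss log(2).
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
```

The loss falls below log(2) when the model raises the preferred answer relative to the rejected one, and rises above it in the opposite case, so no separate reward model is needed during training.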

What does this mean for the future? Well, the open source community is now researching on two fronts at once. On the one hand, many are still trying to reach the sophistication of OpenAI's models. On the other hand, many are trying to get the most out of the existing parameter sizes. Very large models, such as GPT-3.5 with a reported 175 billion parameters or the rumoured GPT-4 architecture of 8 experts with 1.4 trillion parameters in total (8 x 175 billion), are very difficult to replicate in the world of open models. Not every company has the resources and infrastructure to maintain such large models, or a dedicated AI deployment team.

We are currently seeing such efforts being outsourced to OpenAI and other companies. But many tasks can be done much more efficiently. Not every project always needs a GPT-4 running in the background. Maybe your project starts with a big model from one of the big vendors for testing. Once you know your requirements, however, you may want to look at the smaller yet powerful open models to reduce your surging costs.