
Have you been hearing about the MoE architecture a lot lately? It's actually nothing new, but the term has only become a buzzword in recent months, as companies like Meta or OpenAI started using it in practice.
Mixture of Experts could solve one of the biggest problems of current AI - how to scale models without astronomical costs or a datacenter the size of a city.
So what is it, and how does it work? Let's go through it point by point.
What's that?
MoE (Mixture of Experts) is a type of architecture that activates only a small part of the model - specific "experts" - on each query.
You can think of it as a team of specialists. If you have a question about HR, you don't bother everyone in the company - you go directly to the expert on the subject. MoE works on the same principle.
The concept originated back in the 1990s, but only in recent years has it become practical to apply at a larger scale.
Why is it important now?
Today's AI models are getting bigger and bigger, and so are the costs of running them. The MoE architecture helps to address this.
Instead of activating the entire model on each query, MoE activates only a small portion of it - the specific experts best suited to the task at hand. This means less computing power and faster responses, which is key for real-world deployments in chatbots, mobile apps, or agent systems.
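To get a feel for the difference, here is a quick back-of-the-envelope calculation in Python. The numbers are made up for illustration (64 experts of 100M parameters each, 2 active per token) and don't describe any real model:

```python
# Hypothetical numbers: compare the parameters touched per token
# in a dense model vs. an MoE model of the same total size.

total_experts = 64          # experts in each MoE layer
active_experts = 2          # experts actually used per token (top-k routing)
params_per_expert = 100e6   # parameters in one expert (illustrative)

dense_params = total_experts * params_per_expert    # a dense layer of the same total size
active_params = active_experts * params_per_expert  # what the MoE actually computes per token

print(f"Total parameters:  {dense_params/1e9:.1f}B")   # 6.4B
print(f"Active per token:  {active_params/1e9:.1f}B")  # 0.2B
# The model "carries" 6.4B parameters' worth of knowledge,
# but each token only pays the compute cost of 0.2B.
```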
At the same time, the MoE architecture is much more scalable than traditional "monolithic" models - it can grow without costs increasing at the same rate. That's why it's appearing in more and more commercial systems: Meta uses it in Llama 4, Mistral introduced a pure MoE model with Mixtral, and OpenAI is reported to use a similar approach in GPT-4 Turbo.
Moreover, MoE is also suitable for specialized agents - each "expert" can be focused on something different, which increases the quality of answers and reduces the amount of computation performed.
Simply put, the MoE architecture is a way to have a powerful model that really only uses what is needed.
How does it work technically?
We have already mentioned that MoE selects only a few experts for each query - for example, 2 or 8 out of 64. But how does it decide which ones?
This is the job of the so-called routing mechanism.
For each input token, the router assigns a score to every expert and then selects only the few with the highest scores.
There are several popular ways to implement the routing mechanism. Some of the most common include top-k routing and expert choice routing. You can read more about the specific differences between them here.
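To make this concrete, here is a minimal sketch of top-k routing in PyTorch. The dimensions and the simple linear router are illustrative assumptions, not taken from any specific model:

```python
import torch
import torch.nn.functional as F

# A minimal top-k routing sketch. The router is just a linear layer
# that produces one score per expert for each input token.
hidden_dim, num_experts, top_k = 512, 8, 2
router = torch.nn.Linear(hidden_dim, num_experts)

tokens = torch.randn(4, hidden_dim)             # a batch of 4 token representations
scores = router(tokens)                         # shape (4, 8): one score per expert, per token
weights, expert_ids = torch.topk(scores, top_k, dim=-1)  # keep only the k best experts
weights = F.softmax(weights, dim=-1)            # normalize the kept scores into weights

print(expert_ids)  # which experts each token is sent to
print(weights)     # how much each chosen expert will contribute
```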
For efficiency, the load should be spread evenly across the experts - no single expert should end up handling all queries. As the model trains on many different kinds of prompts, the experts gradually specialize, each developing its own area of expertise.
The outputs of the activated experts are combined using a weighted sum, with the weights determined by a gating function based on the scores of each expert. Experts with higher scores have more influence on the final output. In the case of top-k routing, experts with lower scores may also contribute to the result, but their influence is smaller.
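Here is how that weighted combination might look in code, reusing the same kind of linear router as in the sketch above. All sizes are again illustrative, and the per-token loop is written for readability only - real implementations batch tokens by expert:

```python
import torch
import torch.nn.functional as F

# Illustrative MoE layer: route each token to its top-k experts,
# then combine the expert outputs with the routing weights.
hidden_dim, num_experts, top_k = 512, 8, 2
experts = torch.nn.ModuleList(
    [torch.nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts)]
)
router = torch.nn.Linear(hidden_dim, num_experts)

def moe_layer(tokens: torch.Tensor) -> torch.Tensor:
    scores = router(tokens)                                   # (batch, num_experts)
    weights, expert_ids = torch.topk(scores, top_k, dim=-1)   # (batch, top_k)
    weights = F.softmax(weights, dim=-1)

    output = torch.zeros_like(tokens)
    for token_idx in range(tokens.shape[0]):
        for slot in range(top_k):
            expert = experts[int(expert_ids[token_idx, slot])]
            # weighted sum: higher-scoring experts contribute more
            output[token_idx] += weights[token_idx, slot] * expert(tokens[token_idx])
    return output

print(moe_layer(torch.randn(4, hidden_dim)).shape)  # torch.Size([4, 512])
```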
Training an MoE model is a bit more challenging because it involves several key aspects. The model must not only learn the task itself, but also optimize the routing mechanism that decides which expert is best suited for a given input.
Another challenge is to ensure that all experts are used equally. Without additional measures, some experts may be favoured by the routing mechanism, leading to overload and overfitting, while other experts remain unused. To balance the load, auxiliary loss functions are often used to penalize an uneven distribution of inputs among experts.
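As an illustration, here is a sketch of one common auxiliary loss of this kind, similar in spirit to the load-balancing loss from the Switch Transformer paper. The exact coefficient and formulation vary between models, so treat this as an assumption-laden example rather than a recipe:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, expert_ids: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    # fraction of tokens actually routed to each expert (hard assignments)
    counts = torch.bincount(expert_ids.flatten(), minlength=num_experts).float()
    fraction_tokens = counts / counts.sum()

    # average routing probability assigned to each expert (soft assignments)
    fraction_probs = F.softmax(router_logits, dim=-1).mean(dim=0)

    # the dot product is smallest when both distributions are uniform,
    # so minimizing it pushes the router toward an even load
    return num_experts * torch.sum(fraction_tokens * fraction_probs)

# Example: router logits for 4 tokens over 8 experts, top-1 assignment
logits = torch.randn(4, 8)
assignments = logits.argmax(dim=-1)
print(load_balancing_loss(logits, assignments, num_experts=8))
```

This term is added to the main training loss with a small coefficient, so the router is nudged toward using all experts without the balancing objective overwhelming the task itself.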
Conclusion
Mixture of Experts is a type of architecture used by many of today's major language models. It makes it possible to achieve higher model performance without constantly increasing the size and cost of the models, because it activates only a certain part of the model - specific "experts" - on each query.
Models using the MoE architecture are efficient, scalable and ideal for practical deployments - from chatbots to dedicated agents.
Author
Bára Mrkáčková
People & Marketing Coordinator
I am in charge of keeping the employees at DXH happy. I manage all things related to recruitment, employer branding, and event planning. I also take care of our marketing.