
Have you been hearing about the MoE architecture a lot lately? It's actually nothing new; the term has only become a buzzword in recent months, as companies like Meta and OpenAI started using it in practice.
Mixture of Experts could solve one of the biggest problems of current AI - how to scale models without them costing a fortune or requiring a datacenter the size of a city.
So what is it and on what principles does it work? Let's go through it point by point.
What is it?
MoE (Mixture of Experts) is a type of architecture that activates only a small part of the model - specific "experts" - for each query.
You can think of it as a team of specialists. If you have a question about HR, you don't bother everyone in the company; you go straight to the person who specializes in it. MoE works on a similar principle.
The concept itself dates back to the early 1990s, but only in recent years has it become practically applicable at a larger scale.
Why is it important now?
Today's AI models are getting bigger and bigger, and so are the costs of running them. The MoE architecture helps to address this.
Instead of activating the entire model for each query, MoE activates only a small portion of it - the specific experts suited to the task at hand. This means less computing power and faster responses, which is key for real-world deployments in chatbots, mobile apps, or agent systems.
At the same time, the MoE architecture is much more scalable than traditional "monolithic" models - it can grow without its costs growing at the same rate. That's why it's appearing in more and more commercial systems: Meta uses it in Llama 4, Mistral introduced a pure MoE model with Mixtral, and OpenAI is reported to use a similar approach in GPT-4 Turbo.
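To get a feel for that scaling claim, here is a tiny back-of-the-envelope sketch in Python. All the numbers (hidden size, expert size, 8 experts with 2 active per token) are illustrative assumptions, not the specs of any particular model:

```python
# Rough illustration: total vs. active parameters in a sparse MoE feed-forward block.
# All numbers below are made-up illustrative values, not real model specs.
hidden_dim = 4096
ffn_dim = 16384
num_experts = 8          # experts stored in the model
active_experts = 2       # experts actually run per token

# Assume each expert is a simple up-projection + down-projection pair of weight matrices.
params_per_expert = 2 * hidden_dim * ffn_dim

total_params = num_experts * params_per_expert
active_params = active_experts * params_per_expert

print(f"stored per layer:    {total_params / 1e9:.2f}B parameters")
print(f"used per token:      {active_params / 1e9:.2f}B parameters")
# The model "owns" 4x more capacity than it pays for on any single token (8 vs. 2 experts).
```

Adding more experts grows the stored capacity, while the per-token compute stays tied to the small number of active experts.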
Moreover, MoE is also suitable for specialized agents - each "expert" can be focused on something different, which increases the quality of answers and reduces the amount of computation performed.
Simply put, the MoE architecture is a way to have a powerful model that only uses what it actually needs.
How does it work technically?
We have already mentioned that MoE selects from several experts for each question. For a given query, it may activate, say, only 2 or 8 out of 64 experts. But how does it decide which ones?
This is handled by the so-called routing mechanism - typically a small, learned gating network that assigns a score to each expert based on the input token's representation and then keeps only the few experts with the highest scores.
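To make the routing concrete, here is a minimal sketch of a top-k routed MoE layer in PyTorch. The layer sizes, the number of experts, and the simple feed-forward experts are illustrative assumptions rather than the internals of any specific model, and real implementations add extras such as load-balancing losses and batched expert dispatch:

```python
# Minimal sketch of top-k expert routing, assuming illustrative sizes
# (hidden_dim=128, 8 experts, 2 active per token).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, hidden_dim=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router is just a linear layer that scores every expert for a token.
        self.router = nn.Linear(hidden_dim, num_experts)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, 4 * hidden_dim),
                          nn.GELU(),
                          nn.Linear(4 * hidden_dim, hidden_dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (num_tokens, hidden_dim)
        scores = self.router(x)                 # (num_tokens, num_experts)
        # Keep only the top-k experts per token and renormalize their weights.
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = top_idx[:, slot] == e    # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

tokens = torch.randn(4, 128)                    # 4 tokens, hidden size 128
layer = TinyMoELayer()
print(layer(tokens).shape)                      # torch.Size([4, 128])
```

The key point is visible in the loop: each token only ever passes through its top-k experts, so the parameters of all the other experts sit idle for that token.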
Author

Bára Mrkáčková
People & Marketing Coordinator
I am in charge of keeping the employees at DXH happy. I manage all things related to recruitment, employer branding, and event planning. I also take care of our marketing.