Efficient Scaling of Large Language Models with Mixture of Experts and 3D Analog In-Memory Computing
Academic Background
In recent years, large language models (LLMs) have demonstrated remarkable capabilities in natural language processing, text generation, and other fields. However, as the scale of these models continues to grow, the costs of training and inference have also surged significantly, particularly in terms of memory footprint, computational latency, and energy consumption. This has become a major bottleneck hindering the widespread application of LLMs. In traditional von Neumann architectures, frequent data movement between memory and computational units exacerbates these challenges, leading to the so-called “von Neumann bottleneck.”
To address this issue, researchers have explored various technical pathways, one of which is the “Mixture of Experts” (MoE) architecture. In an MoE model, a routing (gating) network dynamically selects which expert sub-networks process each input, so only a fraction of the model’s parameters is activated per token, significantly reducing the computation required. However, MoE models are still deployed on traditional hardware architectures, so the parameter-access bottleneck is not fully resolved. Meanwhile, Analog In-Memory Computing (AIMC), an emerging technology, performs computations directly inside the memory that stores the weights, avoiding weight movement and offering much higher energy efficiency. Combining MoE with AIMC, particularly on 3D Non-Volatile Memory (3D NVM) technology, may provide a new pathway for scaling LLMs.
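To make the conditional-computation idea concrete, the sketch below shows a minimal top-k gated MoE layer in PyTorch. It is an illustrative toy rather than code from the paper; all module and parameter names (`TopKMoE`, `d_ff`, `k`, …) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k gated mixture-of-experts layer (illustrative only)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). The router scores every expert, but only the
        # top-k experts per token are actually evaluated (conditional computation).
        gate_logits = self.router(x)
        weights, idx = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Only k of the num_experts feed-forward experts run for each token, so per-token compute stays roughly constant while the total parameter count grows with the number of experts.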
This paper builds upon this context to explore the deployment of the MoE architecture on 3D AIMC hardware and evaluates its potential in reducing the inference cost of large-scale language models.
Source of the Paper
This paper was co-authored by Julian Büchel and Athanasios Vasilopoulos from IBM Research Europe, along with experts from IBM Almaden Research Center, Micron Technology, and other institutions. Published in January 2025 in the journal Nature Computational Science, the paper is titled “Efficient Scaling of Large Language Models with Mixture of Experts and 3D Analog In-Memory Computing.”
Research Process and Results
1. Research Objectives and Framework
The core objective of this paper is to explore the deployment of the MoE architecture on 3D AIMC hardware and to assess its potential for reducing the inference cost of large language models. The researchers first analyzed the limitations of traditional LLMs under the von Neumann architecture, emphasizing the bottlenecks caused by parameter access and data movement. They then proposed combining MoE with 3D AIMC, arguing that this pairing could relieve the parameter-access bottleneck and reduce energy consumption and latency during inference.
2. Simulating the 3D AIMC System
To evaluate the performance of MoE on 3D AIMC hardware, the researchers designed an abstract simulation framework for a 3D AIMC system. The system consists of multiple 3D memory units (tiles), each containing several stacked non-volatile memory arrays (tiers). The researchers mapped the parameters of MoE models onto these memory units and used the simulator to assess inference performance and energy consumption.
- Simulation Framework Design: The simulator, implemented in Python, used PyTorch and its torch.fx tracing utilities to capture the model architecture and data flow. The researchers developed customized simulation modules to support the mapping and execution of MoE models, and the simulator recorded inference time, energy consumption, and peak memory requirements.
- Model Mapping and Scheduling: The researchers mapped the different layers of the MoE model onto the 3D AIMC hardware and employed a greedy algorithm to optimize the mapping strategy (a minimal sketch follows this list). The simulations showed that the MoE model’s conditional computation lets it exploit the high-capacity memory of 3D AIMC while reducing contention among compute units.
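As referenced above, here is a minimal sketch of how a greedy, capacity-based mapping of weight matrices onto tiles and tiers might look. The tile and tier sizes, the first-fit policy, and all names are illustrative assumptions; the authors’ actual mapping and scheduling algorithm may differ.

```python
from dataclasses import dataclass, field

@dataclass
class Tile:
    """One 3D AIMC tile: several stacked NVM tiers sharing peripheral circuitry."""
    tiers: list[int]                                   # free cells per tier
    placements: list[tuple[str, int]] = field(default_factory=list)

def greedy_map(layers: dict[str, int], tiles: list[Tile]) -> None:
    """Greedily place each weight matrix on the first tier that still fits.

    `layers` maps a layer name to its parameter count; the largest layers are
    placed first so large expert matrices are not fragmented across the array.
    """
    for name, size in sorted(layers.items(), key=lambda kv: -kv[1]):
        for tile in tiles:
            for t, free in enumerate(tile.tiers):
                if free >= size:
                    tile.tiers[t] -= size
                    tile.placements.append((name, t))
                    break
            else:
                continue   # no tier on this tile fits; try the next tile
            break
        else:
            raise RuntimeError(f"no capacity left for layer {name}")

# Toy usage: 2 tiles x 4 tiers of 1M cells each; a few expert matrices.
tiles = [Tile(tiers=[1_000_000] * 4) for _ in range(2)]
layers = {f"expert_{i}.ffn": 400_000 for i in range(8)}
layers["router"] = 50_000
greedy_map(layers, tiles)
for i, tile in enumerate(tiles):
    print(f"tile {i}: {tile.placements}")
```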
3. Comparison of MoE and Dense Models
To evaluate the advantages of MoE, the researchers compared it with traditional dense models. The experimental results showed that as the number of parameters increased, the inference time of the MoE model remained almost unchanged, while that of the dense model increased significantly. This indicates that the MoE architecture can scale the model by increasing the number of experts without significantly increasing computational latency.
- Inference Performance: In the simulations, the inference time of the MoE model was much lower than that of the dense model, especially when the number of parameters reached tens of billions. The researchers also found that as the number of experts increased, the inference time of the MoE model grew slowly, demonstrating its superiority on 3D AIMC hardware.
- Energy Consumption and Memory Requirements: Since 3D AIMC hardware performs computations directly within memory, the energy consumption and memory requirements of the MoE model were significantly lower than those of the dense model. The researchers noted that the peak memory requirement of the MoE model was only around 1 MB, far below the tens of gigabytes required by the dense model.
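The near-flat inference time of MoE follows from simple parameter arithmetic: only the top-k experts are evaluated per token, so the per-token active parameter count (and hence the analog compute) stays roughly constant while the total parameter count grows with the number of experts. The dimensions below are illustrative and not figures from the paper.

```python
def active_params(d_model: int, d_ff: int, num_experts: int, k: int) -> tuple[int, int]:
    """Total vs. per-token-active parameters of one MoE feed-forward layer.

    Each expert holds two d_model x d_ff matrices; the router adds
    d_model x num_experts gating weights. Only k experts run per token.
    """
    per_expert = 2 * d_model * d_ff
    total = num_experts * per_expert + d_model * num_experts
    active = k * per_expert + d_model * num_experts
    return total, active

# Illustrative setting: growing the expert count 8 -> 64 multiplies the total
# parameters by ~8x while the per-token active parameters stay ~constant.
for num_experts in (8, 16, 32, 64):
    total, active = active_params(d_model=4096, d_ff=16384, num_experts=num_experts, k=2)
    print(f"{num_experts:3d} experts: total {total/1e9:.2f} B, active {active/1e6:.1f} M")
```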
4. Performance Comparison with GPUs
To further validate the advantages of 3D AIMC hardware, the researchers compared it with the NVIDIA A100 GPU. The experimental results showed that for larger MoE models, the throughput of the 3D AIMC hardware was six times higher than that of the GPU. Additionally, the energy efficiency of 3D AIMC hardware was three orders of magnitude higher than that of the GPU, highlighting its significant advantages in processing large-scale language models.
5. Robustness of MoE to Hardware Noise
To assess the robustness of the MoE model to noise in analog in-memory computing hardware, the researchers applied hardware-aware training, which exposes the model to hardware-like noise during training. The experimental results showed that the MoE model maintained floating-point-equivalent accuracy (iso-performance) even at a noise level of 6.3%, indicating its robustness on 3D AIMC hardware.
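A common recipe for hardware-aware training is to perturb the weights with random noise during the forward pass so the model learns to tolerate analog non-idealities. The sketch below uses multiplicative Gaussian weight noise as an assumption; the paper’s exact noise model and training procedure may differ.

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Linear):
    """Linear layer that perturbs its weights with relative Gaussian noise
    during training, mimicking conductance noise of analog in-memory hardware."""

    def __init__(self, in_features: int, out_features: int, noise_std: float = 0.063):
        super().__init__(in_features, out_features)
        self.noise_std = noise_std   # ~6% relative weight noise (illustrative value)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and self.noise_std > 0:
            # Fresh noise on every forward pass; gradients flow through the
            # perturbed weights, so the model adapts to the perturbation.
            noise = torch.randn_like(self.weight) * self.noise_std
            weight = self.weight * (1.0 + noise)
        else:
            weight = self.weight
        return nn.functional.linear(x, weight, self.bias)

# Used as a drop-in replacement for nn.Linear inside expert/attention blocks,
# followed by fine-tuning; at evaluation time the layer behaves like a standard
# Linear, or noise can be kept on to estimate accuracy under hardware noise.
```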
Conclusions and Significance
This study demonstrates that combining the MoE architecture with 3D AIMC hardware can significantly reduce the inference cost of large language models, particularly in terms of energy consumption and latency. Because only a subset of experts is active per token, the MoE model can exploit the high-capacity memory of 3D AIMC while keeping contention among compute units low. Compared with traditional dense models and with GPUs, the MoE-on-3D-AIMC combination shows significant advantages in throughput, energy efficiency, and area efficiency.
This research points to a new direction for scaling large language models, particularly with respect to hardware cost and energy-efficient computing. By combining the MoE architecture with 3D AIMC technology, researchers may develop more efficient and cost-effective large language models, facilitating their widespread deployment in practical applications.
Research Highlights
- Innovative Architecture Combination: This paper is the first to combine the MoE architecture with 3D AIMC hardware, proposing a novel approach to addressing the inference cost bottleneck of large-scale language models.
- Significant Cost Reduction: The results show that the combination of MoE and 3D AIMC can significantly reduce energy consumption and latency during inference, especially when the number of parameters reaches tens of billions.
- Hardware Robustness: Through hardware-aware training, the MoE model maintained high accuracy even under high noise levels, demonstrating its robustness in analog in-memory computing hardware.
Additional Valuable Information
The authors also open-sourced the simulator and the MoE model implementation code, making them available to other researchers and developers. This should help drive further research and applications in related fields.