Meta proposes new scalable memory layers that improve knowledge, reduce hallucinations


As enterprises continue to adopt large language models (LLMs) in various applications, one of the key challenges they face is improving the factual knowledge of models and reducing hallucinations. In a new paper, researchers at Meta AI propose “scalable memory layers,” which could be one of several possible solutions to this problem.

Scalable memory layers add more parameters to LLMs to increase their learning capacity without requiring additional compute resources. The architecture is useful for applications where you can spare extra memory for factual knowledge but also want the inference speed of nimbler models.

Dense and memory layers

Traditional language models use “dense layers” to encode vast amounts of information in their parameters. In dense layers, all parameters are used at their full capacity and are mostly activated at the same time during inference. Dense layers can learn complex functions, but increasing their size requires additional computational and energy resources.

In contrast, for simple factual knowledge, much simpler layers with associative memory architectures would be more efficient and interpretable. This is what memory layers do. They use simple sparse activations and key-value lookup mechanisms to encode and retrieve knowledge. Sparse layers take up more memory than dense layers but only use a small portion of the parameters at once, which makes them much more compute-efficient.
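For intuition, here is a deliberately simplified PyTorch sketch of the idea. It is not the researchers’ implementation, and the class name and sizes are made up: each token scores a table of learnable keys, only a handful of keys “fire,” and the values attached to those keys are combined. The actual design relies on product-key lookup so millions of keys can be searched without scoring every one, which this naive version does not attempt.

# Illustrative sketch of a sparse key-value memory layer (not Meta's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMemoryLayer(nn.Module):
    def __init__(self, dim, num_keys=65_536, top_k=4):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_keys, dim) * 0.02)  # learnable keys
        self.values = nn.Embedding(num_keys, dim)                    # learnable values
        self.top_k = top_k

    def forward(self, x):                              # x: (batch, seq, dim)
        scores = x @ self.keys.t()                     # score every key for every token
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)  # only k keys activate
        weights = F.softmax(top_scores, dim=-1)
        picked = self.values(top_idx)                  # (batch, seq, k, dim)
        return (weights.unsqueeze(-1) * picked).sum(dim=-2)    # weighted sum of values

Because only top_k value vectors are touched per token, the compute cost stays small even as the key-value table grows, which is the trade-off the article describes.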

Memory layers have existed for several years but are rarely used in modern deep learning architectures, largely because they are not optimized for current hardware accelerators.

Current frontier LLMs usually use some form of “mixture of experts” (MoE) architecture, which uses a mechanism vaguely similar to memory layers. MoE models are composed of many smaller expert components that specialize in specific tasks. At inference time, a routing mechanism determines which experts are activated based on the input sequence. PEER, an architecture recently developed by Google DeepMind, extends MoE to millions of experts, providing more granular control over which parameters are activated during inference.
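The routing idea can be shown with a toy example. The sketch below is hypothetical code, not PEER or any production MoE layer: a small router scores a handful of expert feed-forward networks for each token and only the top two are run.

# Bare-bones top-k expert routing, for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, dim, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)      # one score per expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                              # x: (num_tokens, dim)
        gate_scores, gate_idx = self.router(x).topk(self.top_k, dim=-1)
        gate_w = F.softmax(gate_scores, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                 # combine the chosen experts' outputs
            for e, expert in enumerate(self.experts):
                mask = gate_idx[:, slot] == e          # tokens routed to expert e
                if mask.any():
                    out[mask] += gate_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

Memory layers push this sparsity further: instead of a few large experts, each lookup touches only a few rows of a very large key-value table.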

Upgrading memory layers

Memory layers are light on compute but heavy on memory, which presents specific challenges for current hardware and software frameworks. In their paper, the Meta researchers propose several modifications that solve these challenges and make it possible to use them at scale.

Memory layers can store knowledge in parallel across several GPUs without slowing down the model (source: arXiv)

First, the researchers configured the memory layers for parallelization, distributing them across several GPUs to store millions of key-value pairs without changing other layers in the model. They also implemented a special CUDA kernel for handling high memory-bandwidth operations. And they developed a parameter-sharing mechanism that supports a single set of memory parameters across multiple memory layers within a model. This means that the keys and values used for lookups are shared across layers.
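In rough form, the weight-sharing idea looks something like the sketch below. This is an assumption about how it could be wired in PyTorch, not Meta’s code, and it leaves out the GPU sharding and the custom CUDA kernel entirely: a single pool of keys and values is handed to memory layers at several depths, so the added parameters do not multiply with the number of layers.

# Hypothetical sketch of sharing one key-value pool across several memory layers.
import torch
import torch.nn as nn

class SharedMemoryPool(nn.Module):
    """One pool of keys and values that several memory layers reuse."""
    def __init__(self, dim=512, num_keys=65_536):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_keys, dim) * 0.02)
        self.values = nn.Embedding(num_keys, dim)

class SharedMemoryLayer(nn.Module):
    def __init__(self, pool, top_k=4):
        super().__init__()
        self.pool = pool               # the same pool object is passed to every layer
        self.top_k = top_k

    def forward(self, x):              # x: (batch, seq, dim)
        scores = x @ self.pool.keys.t()
        w, idx = scores.topk(self.top_k, dim=-1)
        w = w.softmax(dim=-1)
        return (w.unsqueeze(-1) * self.pool.values(idx)).sum(dim=-2)

# One pool reused by memory layers at three different depths:
pool = SharedMemoryPool(dim=512)
layers = [SharedMemoryLayer(pool) for _ in range(3)]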

These modifications make it possible to implement memory layers within LLMs without slowing down the model.

“Memory layers with their sparse activations nicely complement dense networks, providing increased capacity for knowledge acquisition while being light on compute,” the researchers write. “They can be efficiently scaled, and provide practitioners with an attractive new direction to trade-off memory with compute.”

To test memory layers, the researchers modified Llama models by replacing one or more dense layers with a shared memory layer. They compared the memory-enhanced models against the dense LLMs as well as MoE and PEER models on several tasks, including factual question answering, scientific and common-sense world knowledge and coding.
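As a rough illustration of that swap (hypothetical code, not the Llama implementation), a standard transformer block can be given a memory layer in place of its dense feed-forward sublayer, reusing the SimpleMemoryLayer sketch from earlier:

# Hypothetical transformer block with a memory layer standing in for the dense FFN.
import torch.nn as nn

class TransformerBlockWithMemory(nn.Module):
    def __init__(self, dim, num_heads, memory_layer):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = memory_layer        # memory layer replaces the dense feed-forward sublayer

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))   # sparse key-value lookup instead of a dense MLP
        return x

block = TransformerBlockWithMemory(dim=512, num_heads=8,
                                    memory_layer=SimpleMemoryLayer(dim=512))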

Memory models vs. dense layers: a 1.3B memory model (solid line) trained on 1 trillion tokens approaches the performance of a 7B model (dashed line) on factual question-answering tasks as it is given more memory parameters (source: arXiv)

Their findings show that memory models improve significantly over dense baselines and compete with models that use 2X to 4X more compute. They also match the performance of MoE models that have the same compute budget and parameter count. The memory models’ performance is especially notable on tasks that require factual knowledge. For example, on factual question answering, a memory model with 1.3 billion parameters approaches the performance of Llama-2-7B, which was trained on twice as many tokens and with 10X more compute.

Moreover, the researchers found that the benefits of memory models remain consistent across model sizes as they scaled their experiments from 134 million to 8 billion parameters.

“Given these findings, we strongly advocate that memory layers should be integrated into all next generation AI architectures,” the researchers write, while adding that there is still a lot more room for improvement. “In particular, we hope that new learning methods can be developed to push the effectiveness of these layers even further, enabling less forgetting, fewer hallucinations and continual learning.”


