X-LSTM: Reviving and Extending LSTM to Compete with Transformers - An Interview with Maximilian Beck

In the rapidly evolving landscape of AI architectures, Transformer models have dominated the field of natural language processing since their introduction in 2017. But are they the only path forward? In this comprehensive interview, we speak with **Maximilian Beck**, joint first author (along with Korbinian Pöppel) of the groundbreaking X-LSTM paper, which presents a compelling alternative to the ubiquitous Transformer architecture.

X-LSTM (Extended Long Short-Term Memory) revives and enhances the classic LSTM architecture, addressing its limitations while maintaining its efficiency advantages. As you'll discover, this innovation is not merely academic—X-LSTM models demonstrate competitive or superior performance compared to Transformers in language modeling tasks.



The Rise and Fall of LSTMs

Before diving into X-LSTM, it's worth understanding the historical context. Max begins by taking us back to the pre-2017 era:

"Before 2017, LSTMs were state-of-the-art in language modeling," Max explains. "This was demonstrated in papers like 'Exploring the Limits of Language Modeling' from 2016, where the 'limits' at that time were 1 billion parameter LSTM models with only two layers."

The landscape changed dramatically in 2017 with the publication of "Attention Is All You Need," which introduced the Transformer architecture. This innovation quickly surpassed LSTM performance, leading to a wholesale shift in the field.

"Since then, we've seen the progression from GPT-2 to GPT-3, which scaled up to more than 175 billion parameters," Max notes. "These models demonstrated remarkable capabilities, learning from vast internet data in an unsupervised way and exhibiting few-shot learning abilities."

Today, Transformer-based architectures power virtually all leading large language models, including GPT-4, Claude, and Gemini. However, these models come with significant drawbacks:

1. Self-attention scales quadratically with sequence length during training
2. Inference requires substantial GPU memory due to the growing KV cache
3. Processing long contexts becomes increasingly inefficient
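
As a rough, back-of-the-envelope illustration of points 1 and 2, the sketch below estimates attention compute and KV-cache memory as the sequence grows. The model dimensions (`d_model=4096`, 32 layers, fp16 cache) are assumptions for illustration, not numbers from the interview or the paper.

```python
# Rough illustration of points 1 and 2 (hypothetical model dimensions, not
# taken from the paper): attention compute grows quadratically with sequence
# length, while the KV cache grows linearly, so long contexts get expensive
# on both axes.

def attention_flops(seq_len: int, d_model: int, n_layers: int) -> float:
    # ~2 matmuls (QK^T and AV), each costing ~2 * seq_len^2 * d_model per layer
    return 2 * 2 * seq_len**2 * d_model * n_layers

def kv_cache_bytes(seq_len: int, d_model: int, n_layers: int, bytes_per_val: int = 2) -> int:
    # one key and one value vector per token, per layer, stored in fp16
    return 2 * seq_len * d_model * n_layers * bytes_per_val

for seq_len in (2_048, 16_384, 65_536):
    flops = attention_flops(seq_len, d_model=4096, n_layers=32)
    cache = kv_cache_bytes(seq_len, d_model=4096, n_layers=32)
    print(f"seq_len={seq_len:>6}: attention ~{flops / 1e12:.1f} TFLOPs, KV cache ~{cache / 2**30:.1f} GiB")
```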



Limitations of Traditional LSTMs

Despite their earlier success, traditional LSTMs fell out of favor due to several key limitations that the X-LSTM paper addresses:



1. Inability to Revise Storage Decisions

Max illustrates this problem with a nearest neighbor search example:

"In this problem, the goal is to predict the price of the closest value to a search key. Given changing inputs, the model must continuously update its prediction based on which value is closest to the key. Transformers excel at this task because their KV cache can always look back at the full sequence. Traditional LSTMs, however, struggle because they must carefully manage what goes into their fixed memory and how to modify it."

2. Limited Storage Capacity

Unlike Transformers with their growing KV cache, LSTMs have a fixed state size consisting of scalar memory cells.

"This limitation becomes apparent in rare token experiments," Max explains. "When we analyze performance based on token frequency in the training data, traditional LSTMs struggle significantly with rare tokens that must be memorized rather than learned."


3. Efficiency and Parallelization Challenges

"The recurrent connections in LSTMs make them difficult to train in parallel, unlike Transformers," Max notes. "These connections—where the hidden state of the previous time step influences the current computation—hinder parallelization during training."


The X-LSTM Innovation

To address these limitations, the X-LSTM introduces several key innovations:


Exponential Gating

"The core of our new X-LSTM is the exponential gating mechanism," Max explains. "In the traditional LSTM, all gates use sigmoid activation functions, which limit their range between 0 and 1. We replaced the input gate with an exponential function."

This change allows the model to overwrite its memory more effectively when important new information arrives:

"With exponential gating, if we get a significant new input, it can override what's already in memory, even if the forget gate hasn't cleared that space. This helps solve the storage revision problem."

To make this exponential gating work, the team also added two components (sketched in code below):
- A normalizer state alongside the cell state
- A stabilization mechanism to prevent numerical overflow
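
Putting these pieces together, here is a simplified NumPy sketch of a single exponential-gating step with a normalizer state and a log-space stabilizer, based on the mechanism described above. The gate pre-activations are assumed to be given, and the weight matrices, output-gate computation, and memory mixing are omitted.

```python
import numpy as np

# Simplified sketch of one exponential-gating step with a normalizer state n and
# a stabilizer state m, following the mechanism described above. Gate
# pre-activations (i_pre, f_pre), the cell input z, and the output gate o are
# assumed to be given; weight matrices and memory mixing are omitted.
def exp_gated_step(c_prev, n_prev, m_prev, i_pre, f_pre, z, o):
    log_i = i_pre                            # input gate: exp(i_pre), kept in log space
    log_f = -np.logaddexp(0.0, -f_pre)       # forget gate: log(sigmoid(f_pre))

    # the stabilizer keeps every exp() argument <= 0, preventing numerical overflow
    m = np.maximum(log_f + m_prev, log_i)
    i_gate = np.exp(log_i - m)
    f_gate = np.exp(log_f + m_prev - m)

    c = f_gate * c_prev + i_gate * z         # cell state
    n = f_gate * n_prev + i_gate             # normalizer state
    h = o * (c / n)                          # normalized, output-gated hidden state
    return c, n, m, h
```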



Matrix Memory and Specialized Variants

X-LSTM introduces two variants:


1. sLSTM (Scalar LSTM)

- Similar to the traditional LSTM, with scalar cell states
- Introduces a new memory mixing approach
- Uses block-diagonal matrices for recurrent connections, similar to multi-head attention in Transformers (see the sketch below)
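
As a small illustration of the block-diagonal recurrent connections mentioned above (head count and head size below are made-up values), each head's cells receive recurrent input only from cells in the same head:

```python
import numpy as np

# Sketch of a block-diagonal recurrent weight matrix for head-wise memory
# mixing: cells are grouped into heads, and recurrence only mixes cells within
# the same head. Head count and head size are made-up values for illustration.
def block_diag_recurrent(n_heads: int = 4, head_dim: int = 8, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    d = n_heads * head_dim
    W_rec = np.zeros((d, d))
    for h in range(n_heads):
        block = slice(h * head_dim, (h + 1) * head_dim)
        W_rec[block, block] = rng.normal(scale=head_dim ** -0.5, size=(head_dim, head_dim))
    return W_rec  # zero outside the diagonal blocks: no mixing across heads
```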

2. mLSTM (Matrix LSTM)

- Uses a matrix cell state instead of scalar memory cells
- Employs a covariance update rule with outer products between values and keys (see the sketch below)
- Has no recurrent connections, enabling fully parallel training
- Maintains a fixed state size for efficient autoregressive inference
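
Here is a condensed sketch of the matrix-memory update described above, in the same spirit as the earlier gating sketch; the gate stabilization and input projections are omitted, and the shapes are chosen only for illustration.

```python
import numpy as np

# Condensed sketch of a matrix-memory (mLSTM-style) step as described above:
# the cell state C is a matrix updated with an outer product of value and key,
# and retrieval is a query against C, normalized by a separate vector state n.
# Gate stabilization and the input projections are omitted for brevity.
def matrix_memory_step(C_prev, n_prev, q, k, v, i_gate, f_gate, o):
    C = f_gate * C_prev + i_gate * np.outer(v, k)   # covariance-style update
    n = f_gate * n_prev + i_gate * k                # normalizer vector
    h_tilde = (C @ q) / max(abs(n @ q), 1.0)        # normalized retrieval with query q
    # note: the gates do not depend on the previous hidden state, which is what
    # allows a fully parallel formulation during training
    return C, n, o * h_tilde
```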

"Both variants are connected through exponential gating and both provide efficient autoregressive inference because they maintain a limited state size, unlike Transformers with their growing KV cache," Max summarizes.


Performance and Results

The team evaluated X-LSTM against Transformers and other alternative architectures such as Mamba and RWKV-4, with impressive results:



Small-Scale Comparisons

"We compared our models to different competitors on smaller datasets using 15 billion tokens from SlimPajama. All models used the GPT-2 tokenizer and had approximately 350-400 million parameters," Max explains.

In these tests, X-LSTM clearly outperformed Transformer models and other alternatives.



Scaling Behavior

The team also trained models on 300 billion tokens (matching the number of tokens GPT-3 was trained on) with parameter counts ranging from 125 million to 1.3 billion.

"The scaling behavior indicates that X-LSTM performs favorably compared to Mamba and Llama," Max reports. "As we increase model size, the performance gap grows, suggesting that larger X-LSTM models will be serious competitors to current large language models."



Length Extrapolation

One of the most impressive findings concerns X-LSTM's ability to handle sequences longer than those seen during training:

"We trained all models with a context length of 2048 tokens but evaluated them on much longer contexts. X-LSTM clearly outperformed Transformers, which struggle beyond their training length due to positional encoding limitations," Max notes.

"Even when we evaluated on a full 60K context length—30 times longer than the training context—X-LSTM maintained strong performance, demonstrating its autoregressive inductive bias is beneficial for length extrapolation."



Future Directions

The X-LSTM team has ambitious plans:

"We're working on building larger models—7 billion parameters and beyond," Max shares. "For that, we need to write fast and efficient kernels for our new LSTM variants."

They're also exploring additional application areas where X-LSTM might particularly excel.



Q&A: The Future of Recurrent Networks



Why focus on LSTM when everyone else is working on attention-based models?

"Transformers scale quadratically and are especially inefficient during text generation," Max explains. "The longer the context gets, the slower they generate new tokens. This is a real problem for applying these models, especially on edge devices."

"Recurrent networks like X-LSTM have a fixed state size, so each token takes the same generation time regardless of whether we've already processed 10 or 10,000 tokens. For Transformers, generation time scales linearly with context length."



Will X-LSTM see adoption similar to Mamba?

"I believe so. The inventor of the original LSTM has started a company called NX.AI, and we're working on scaling up X-LSTM there," Max reveals. "Our goal is to build products with X-LSTM and demonstrate that it's usable for companies."

"To convince people who are currently locked into Transformers, we need to show that X-LSTM can achieve the same or better performance more efficiently and with less compute. Once we do that, I think people will start switching."



Why would someone choose X-LSTM over Mamba?

"We show in our paper that X-LSTM performs better on language modeling than Mamba, which is a good reason to choose it," Max notes. "We also demonstrate stronger length extrapolation capabilities, which is crucial for handling long contexts efficiently even with a fixed state size."


Conclusion

X-LSTM represents a fascinating revival of recurrent network architectures, enhanced with modern techniques to address their historical limitations. By combining the efficiency advantages of recurrent models with competitive performance against Transformers, X-LSTM offers a promising alternative path for the future of language models—particularly for applications where computational efficiency and handling long contexts are crucial.

As the AI community continues to explore alternatives to the dominant Transformer paradigm, innovations like X-LSTM remind us that there are multiple architectural paths to achieving advanced language understanding and generation capabilities.
