Topic 8: What is LSTM and xLSTM?

we review what we know about LSTM networks and explore their new promising development ā€“ xLSTM

There is an opinion that the recent success of ChatGPT and transformer-based Large Language Models (LLMs) has sucked out the majority of resources from other deep learning areas, including Recurrent Neural Networks (RNNs). The impressive achievements and commercial potential of LLMs have heavily influenced research priorities, media coverage, and educational trends, leaving fields like RNNs, computer vision, reinforcement learning, and others with much less attention. Thatā€™s why the new paper introducing Extended Long Short-Term Memory (xLSTM) excited the ML community: LSTMs are not dead! RNNs are coming back!

According to Sepp Hochreiter, pioneer of deep learning and one of the authors of LSTM, xLSTM excels in time series prediction: "Our xLSTMTime model demonstrates excellent performance against state-of-the-art transformer-based models as well as other recently proposed time series models." Thatā€™s big! Letā€™s review what we know about LSTM networks and explore their new promising development ā€“ xLSTM.

In todayā€™s episode, we will cover:

  • LSTMs are not dead at all: current use-cases

  • The story of LSTM

    • What is the Vanishing Gradient Problem?

    • Popularization and success

    • LSTM limitations and their overshadowing by Transformers

  • Introducing xLSTM: Addressing the shortcomings

    • The Architecture of xLSTM

    • Evaluations

    • Applications across domains

  • Conclusion: The future of sequence modeling with xLSTM

  • Bonus: Resources

LSTMs are not dead at all: current use-cases

Just to give everything its proper credit, LSTMs are not dead at all. They might be overshadowed but are still in heavy use. Here are a few examples of how LSTMs have been used in our daily lives (for years!):

  • Traffic prediction in navigation apps: Apps like Google Maps or Waze employ LSTMs to predict traffic patterns. By analyzing historical traffic data, current conditions, and even factors like weather or local events, these models can forecast traffic congestion and suggest the fastest routes in real-time.

  • Music generation and recommendation: Streaming services like Spotify use LSTMs to analyze your listening history and generate personalized playlists. The LSTM can understand patterns in the types of music you enjoy and predict songs you might like, even accounting for how your tastes change over time.

  • Predictive text on smartphones: When you're typing a message, LSTMs help predict the next word you're likely to use based on the context of what you've already written. (ā€œPredicting the future of our planetā€ - thatā€™s the text LSTM just suggested to me).

The story of LSTM

In the early 1990s, researchers were excited about Recurrent Neural Networks (RNNs). These networks were designed to handle sequential data, making them useful for tasks like speech recognition and time-series prediction. But, RNNs had a significant flaw: the vanishing gradient problem.

What is the Vanishing Gradient Problem?

Be the first to get our detailed explanations and valuable collection of relevant resources straight to your inbox ā€“>

Think of trying to recall a series of events from a few days ago while still remembering what happened today. For RNNs, updating their weights to learn from data is similar. They use backpropagation through time (BPTT) to adjust weights based on prediction errors. As the error signal is propagated back through many time steps, it can become so small that the network's weights barely change. This vanishing gradient problem means the network struggles to learn long-term dependencies and forgets important information from earlier time steps. What is not a problem for a human is a big challenge to RNNs.

Two German researchers, JĆ¼rgen Schmidhuber and his doctoral student Sepp Hochreiter, were determined to find a solution. In 1997, they introduced an improved RNN architecture called Long Short-Term Memory (LSTM). LSTMs were designed with a memory cell that could maintain information over long periods. This memory cell was controlled by three gates:

  • the input gate

  • the forget gate

  • the output gate

These gates regulate the flow of information, allowing the network to keep important information for longer and forget what is no longer needed.

The initial reception of their work was lukewarm. But a few researchers kept building on the LSTM foundation. Schmidhuber himself didnā€™t plan to give up. In 2000, Felix Gers, JĆ¼rgen Schmidhuber, and Fred Cummins introduced peephole connections, which allowed the gates to access the cell state directly. This improvement helped LSTMs learn precise timings of events, enhancing their performance.

Popularization and success

Bidirectional LSTM (BiLSTM) (2005): Alex Graves and Jurgen Schmidhuber introduced BiLSTMs in 2005, comprising two LSTM layers running in opposite directions (forward and backward). This architecture captures both past and future context, improving performance in tasks like speech recognition and machine translation.

The rise of deep learning in the 2010s brought another wave of innovation. Researchers began stacking multiple LSTM layers, creating deep LSTM networks that could learn hierarchical features. This advancement made LSTMs even more powerful, enabling them to excel in a wide range of applications, from machine translation to speech recognition.

In 2014, Ilya Sutskever, Oriol Vinyals, and Quoc V. Le popularized LSTMs for machine translation with their sequence-to-sequence (Seq2Seq) models. These models used LSTMs for both encoding and decoding sequences, leading to notable improvements in translation quality.

In 2015, attention mechanisms (Bahdanau attention) were introduced by Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. These mechanisms allowed LSTMs to focus on specific parts of the input sequence, further boosting their performance in tasks like translation and summarization.

A lot changed in 2017 after Vaswani et al. introduced Transformer models in the paper "Attention Is All You Need." This marked a shift towards attention-based mechanisms. But why?

LSTM limitations and their overshadowing by Transformers

LSTM networks have been the backbone of many early successes in natural language processing (NLP). But despite their ability to capture long-range dependencies and solve the vanishing gradient problem, LSTMs have significant limitations that led to their gradual overshadowing by transformers.

Firstly, LSTMs struggle with parallelization. Their sequential nature means that each step depends on the previous one, making it challenging to process data efficiently on modern hardware. This results in slower training times and higher computational costs, which are less than ideal in an era where speed and efficiency are paramount.

Moreover, LSTMs often face difficulties in handling very long sequences. Although they are designed to remember information over long time periods, their effectiveness diminishes as the sequence length increases. This limitation is particularly problematic for tasks requiring the model to understand context from extensive input data, such as long-form text or complex temporal patterns.

Transformers addressed this limitations. Their efficiency, scalability, and superior performance have firmly established them as the new standard in the field, they outpaced LSTMs at scale.

Introducing xLSTM: Addressing the shortcomings

ā€œNever give upā€ is probably the motto of LSTM aficionados. Remember Schmidhuberā€™s doctoral student Sepp Hochreiter? He became the head of the Linz Institute of Technology (LIT) AI Lab, a founding director of the Institute of Advanced Research in Artificial Intelligence (IARAI), and was awarded the IEEE CIS Neural Networks Pioneer Prize in 2021 for his work on LSTM networks. He continued to work on them. And in May 2024, he, along with eight other researchers from Linz, Austria, introduced Extended Long Short-Term Memory: xLSTM.

The researchers asked themselves a question: ā€œHow far can we get in language modeling by scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating the known limitations of LSTMs?ā€

They also argued that Transformers are powerful but lack the linear scaling with sequence length that LSTMs have.

The Architecture of xLSTM

Building on the solid foundation of traditional LSTMs, the researchers introduced two key enhancements:

  1. Exponential Gating: This update equips LSTMs with a more flexible approach to managing information flow. It's like fine-tuning the mechanisms that control data processing, providing a more nuanced handling of inputs and memory.

  2. Novel Memory Structures: xLSTM enhances memory in two significant ways:

    • Scalar LSTM (sLSTM): This version improves how memory is mixed and updated, allowing for more accurate data retention and processing.

    • Matrix LSTM (mLSTM): By transforming memory cells into a matrix structure, mLSTM enhances the network's ability to handle operations in parallel, which significantly speeds up processing.

The matrix structure in mLSTM not only expands memory capacity but also enhances the efficiency of information retrieval and storage. This structural change allows for better handling of tasks with complex data structures or long-range dependencies.

These enhancements are integrated into resilient block architectures, which essentially means they are stacked in a way that builds upon each previous layer's knowledge, creating a deep and powerful network.

Image Credit: The official paper

With these improvements, xLSTM enhances its ability to handle complex sequence modeling tasks, which boosts its performance and applicability in real-world scenarios.

Evaluations

The researchers conducted a series of structured task evaluations, where xLTSM performed quite well. The models were also tested on the SlimPajama and PALOMA datasets, focusing on real-world scenarios. SlimPajama provided a platform to compare models like xLSTM, RWKV, and Llama, focusing on their performance metrics, such as perplexity, across extensive training sessions. These tests highlighted the influence of different model architectures on performance at scale.

The PALOMA dataset, comprising diverse data sources, allowed for testing across various natural language processing tasks, ranging from understanding internet slang to complex reasoning. This testing demonstrated how models like xLSTM perform in handling linguistic diversity, often showing lower perplexity, which indicates a stronger ability to manage varied language inputs.

These experiments on SlimPajama and PALOMA underline the practical strengths of xLSTM, showing its adaptability and potential for real-world applications in AI.

For all experiments, the researchers used Python 1 3.11 with PyTorch 2.2.0 2 , and CUDA 12.1 3 on NVIDIA A100 GPUs.

Applications across domains

Answering the question posed at the beginning, the researchers state that xLSTMs perform at least as well as current technologies like Transformers or State Space Models. xLSTM models demonstrate promising results in language modeling, comparable to advanced techniques like Transformers and State Space Models. Their scalability suggests they could effectively compete with major language models and potentially impact fields such as reinforcement learning, time series prediction, and physical system modeling.

Conclusion: The future of sequence modeling with xLSTM

xLSTM represents a significant evolution in the LSTM architecture, addressing past limitations and setting a new standard for sequence modeling. It opens up a whole new range of possibilities for tackling complex problems involving data over time. As this technology continues to mature, its integration into various domains promises to drive forward the capabilities of AI in processing complex, sequential data.

By leveraging modern deep learning innovations, xLSTM achieves scalability to billions of parameters, making it competitive with contemporary models like Transformers. However, xLSTM is unlikely to replace Transformers, which are superior in parallel processing and attention-based tasks. Instead, xLSTM will likely complement them by excelling in memory efficiency and long-sequence handling. Think of xLSTM as a new tool in the AI toolbox, one that can enhance our ability to understand and predict patterns in our world. It's especially exciting that research and breakthroughs are also happening in other areas, such as RNNs.

Thank you for reading, please feel free to share this article with your friends šŸ¤

Bonus: Resources

Implementation:

Research papers, mentioned in the article or related to the topic:

Good reads:

Twitter:

How did you like it?

Login or Subscribe to participate in polls.

Thank you for reading! Share this article with three friends and get a 1-month subscription free! šŸ¤

Reply

or to participate.