Google's Transformer T5
Introduction
In their paper titled “Attention Is All You Need,” Vaswani et al. (2017) proposed the transformer architecture, which is the foundation for many of today’s most advanced generative models (Foster, 2023). Approximately a year later, OpenAI introduced the Generative Pre-trained Transformer (GPT) model (Radford et al., 2018). The model introduced in that paper, now known as GPT-1, is a decoder-only model (Foster, 2023). Naturally, encoder-only models, such as BERT (Devlin et al., 2018), were also built on the transformer architecture. This paper, however, focuses on Google’s T5 model, which uses a third structure: the encoder-decoder (Raffel et al., 2019).
Model Overview
The T5 model introduced by Raffel et al. (2019) closely follows the architecture originally proposed by Vaswani et al. (2017); the main difference in implementation is the amount of data each model was trained on (Foster, 2023). The original transformer was trained on ~1.4 GB of data, while T5 was trained on ~750 GB of data, the Colossal Clean Crawled Corpus (C4) (Foster, 2023; Raffel et al., 2019).
At a high level, an encoder-decoder transformer is made up of two stacks of transformer blocks (Foster, 2023). As one might imagine, the first stack consists of encoder blocks, while the second consists of decoder blocks (Foster, 2023). The attention layers in the encoder do not need causal masking (Foster, 2023). The encoder also contains layer normalization steps and feed-forward layers (Foster, 2023). Each decoder block has an initial attention layer that is causally masked; the subsequent attention layer, however, does not need masking because it pulls its keys and values from the encoder output and only its queries from the decoder itself (Foster, 2023). This process is called cross-attention, also described as cross-referential attention (Foster, 2023). The decoder likewise has layer normalization steps and feed-forward layers (Foster, 2023). The output from the decoder is then passed through a final linear transformation and a softmax activation function to produce a probability distribution over the vocabulary (Vaswani et al., 2017).
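To make the flow of tensors through these two stacks concrete, the sketch below uses PyTorch’s built-in nn.Transformer module. This is an illustration of the generic encoder-decoder structure described above, not T5’s actual implementation (T5 differs in details such as its relative position biases), and all sizes are arbitrary toy values chosen for this example.

```python
import torch
import torch.nn as nn

# Arbitrary toy sizes for illustration only (not T5's real hyperparameters).
vocab_size, d_model, n_heads, n_layers = 1000, 64, 4, 2

embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(
    d_model=d_model,
    nhead=n_heads,
    num_encoder_layers=n_layers,
    num_decoder_layers=n_layers,
    batch_first=True,
)
lm_head = nn.Linear(d_model, vocab_size)  # final linear transformation

src = torch.randint(0, vocab_size, (1, 5))  # source tokens, e.g. "I have the book outside"
tgt = torch.randint(0, vocab_size, (1, 4))  # target tokens generated so far

# Causal mask so each decoder position only attends to earlier positions.
causal_mask = torch.triu(
    torch.full((tgt.size(1), tgt.size(1)), float("-inf")), diagonal=1
)

# Encoder self-attention runs unmasked; decoder self-attention uses the causal
# mask; the decoder's second attention layer (cross-attention) reads the
# encoder output internally.
hidden = transformer(embed(src), embed(tgt), tgt_mask=causal_mask)

# Linear transformation followed by a softmax over the vocabulary.
probs = torch.softmax(lm_head(hidden), dim=-1)
print(probs.shape)  # torch.Size([1, 4, 1000])
```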
Cross-Attention
If a T5 model were tasked with translating the phrase “I have the book outside” from English to Spanish, one would first observe each word attending to every other word in the sentence within the encoder. For example, “have” might pay more attention to “book” than it does to “outside.” Ultimately, the decoder should predict the Spanish translation “Tengo el libro afuera.” The decoder would first use masked self-attention, so that each position sees only the previously generated words. Next, the decoder would use cross-attention to look at the encoder’s representation of each English word. This cross-attention step would be especially important for translating words like “the,” where the Spanish article (“el” or “la”) depends on the gender of the noun that follows.
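The cross-attention step itself can be sketched in isolation. In this rough illustration, random tensors stand in for the real hidden states and the projection matrices are untrained; the point is only that the queries are derived from the decoder’s Spanish tokens while the keys and values are derived from the encoder’s English representations.

```python
import torch
import torch.nn.functional as F

d_model = 64
src_len = 5   # encoder tokens: "I have the book outside"
tgt_len = 3   # decoder tokens generated so far, e.g. "Tengo el libro"

encoder_output = torch.randn(src_len, d_model)   # stand-in encoder states
decoder_hidden = torch.randn(tgt_len, d_model)   # stand-in decoder states

# Learned projection matrices (random here purely for illustration).
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

# Queries come from the decoder; keys and values come from the encoder.
Q = decoder_hidden @ W_q
K = encoder_output @ W_k
V = encoder_output @ W_v

# Scaled dot-product attention; no causal mask is needed in this layer.
scores = Q @ K.T / (d_model ** 0.5)
weights = F.softmax(scores, dim=-1)   # shape: (tgt_len, src_len)
context = weights @ V                 # shape: (tgt_len, d_model)

# Each row of `weights` shows how much one Spanish position attends to each
# English word; the row for "el" would ideally place weight on "book".
print(weights.shape, context.shape)
```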
Video Generation
If one were tasked with generating a video from only a text prompt, this could be accomplished with an approach similar to the encoder-decoder architecture leveraged by the T5 model. The main change needed would be training data in which text is paired with some representation of visual data, so that the decoder learns to produce visual outputs rather than words. Many additional components, such as GANs or temporal encoders, could be added to make the architecture more robust.
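As a purely hypothetical sketch of that idea, the snippet below reuses an encoder-decoder transformer but swaps the vocabulary projection for a projection to per-frame latent vectors. Every name, size, and design choice here is invented for illustration and is not drawn from any published text-to-video system.

```python
import torch
import torch.nn as nn

# Hypothetical sketch only: a text encoder paired with a decoder whose outputs
# are interpreted as per-frame latent vectors instead of vocabulary logits.
vocab_size, d_model, frame_latent_dim, num_frames = 1000, 64, 256, 16

embed = nn.Embedding(vocab_size, d_model)
frame_queries = nn.Parameter(torch.randn(num_frames, d_model))  # one query per frame
transformer = nn.Transformer(
    d_model=d_model, nhead=4,
    num_encoder_layers=2, num_decoder_layers=2,
    batch_first=True,
)
to_frame_latent = nn.Linear(d_model, frame_latent_dim)

prompt = torch.randint(0, vocab_size, (1, 8))   # tokenized text prompt
queries = frame_queries.unsqueeze(0)            # (1, num_frames, d_model)

# The decoder cross-attends to the encoded prompt, one query per output frame.
hidden = transformer(embed(prompt), queries)
frame_latents = to_frame_latent(hidden)         # (1, num_frames, frame_latent_dim)

# A separate image decoder (e.g. a GAN generator or a VAE decoder) would turn
# each latent into pixels, and a temporal module could smooth adjacent frames.
print(frame_latents.shape)
```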
References
Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1810.04805
Foster, D. (2023). Generative Deep Learning: Teaching machines to paint, write, compose, and play (2nd ed.). O’Reilly Media, Inc.
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by Generative Pre-Training. OpenAI. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2019). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1910.10683
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1706.03762
© Trevor French.