Primer: attention, transformers and the new architecture of language

Natural language is one of the most important challenges in AI and the frontier has moved rapidly over the past few years. This primer will help you get to grips with the basic concepts in a (mostly) non-technical way. If you haven’t already read our primer on representing data, you might want to do that first as it explains how AI networks can be used to find patterns in data.

Language is unique. There are many shades of meaning for a given word that only emerge based on its situation in a sentence and its relationship with other words. Meaning happens as language is actually used. The challenge for AI is to capture these relationships mathematically.

Traditional approaches to natural language processing assume that word meaning is stable across sentences. This is far from the case, so researchers have been on a quest to invent techniques that capture more of the relevant information about a word.

A technique that was, until recently, state of the art is to use an AI version of “memory.” Humans read text in two dimensions, one word after the other. So one approach has been to use a particular style of neural network that is good at “remembering” things it has seen in the recent past.

This “remembering” is performed using a special type of neural network called a recurrent neural network (RNN). RNNs can take multiple inputs and deliver multiple outputs rather than simply a single classification. They operate on sequences: at each step the neuron “remembers” its previous activation, a kind of artificial short-term memory, and combines it with a fresh input.

In a plain-vanilla neural network there are no connections between the neurons in the same layer. Recurrent neural networks add a loop in time, so that each step’s output is fed back in alongside the next input. This is a clever way of giving a network “memory,” where it remembers the previous iteration and uses that to update its output.
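For readers who like to see the moving parts, the loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a real RNN: it uses a single neuron, and the function names and the weights `w_x`, `w_h` and `b` are made up for the example rather than learned from data.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One pass around the loop: mix the fresh input with the previous activation."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.9, b=0.0):
    """Feed a sequence through the loop, carrying the hidden state forward in time."""
    h = 0.0  # empty "memory" before the first input arrives
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# Illustrative input: one signal at the start, then silence.
states = run_rnn([1.0, 0.0, 0.0, 0.0])
for t, h in enumerate(states):
    print(f"step {t}: hidden state = {h:.3f}")
```

With a single 1.0 followed by zeros, the hidden state shrinks at every step. The network still “remembers” the first input, but a little less each time.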

An RNN processes text like a snow plow going down a road. Immediately behind the plow it’s clear, while the road ahead is untouched, so unknown. And way back it’s already getting a bit fuzzy. An RNN’s understanding of words it encounters late in the sentence depends on the words it has encountered earlier.

This is why language AI has been limited: long sentences are very difficult because RNNs “build up” a representation of context one word at a time. By the time they reach the end of the sentence, their understanding of the later words is strongly shaped by the words that came before. They put too much emphasis on words being close to each other (in a mathematical proximity sense), and they weight upstream context too heavily compared with downstream context.

Let’s get a little more technical.

This “snow plow” approach is the basis of what is called a sequence-to-sequence (seq2seq) model. The model is made up of RNNs configured as an encoder and a decoder. The encoder captures the context of the input sequence and sends it to the decoder, which then produces the output sequence.

Imagine the encoder and decoder as human translators who each speak two languages, though not the same two. Their first language is their native language (e.g. Spanish for the encoder and French for the decoder) and their second language is a made-up one that only a machine knows. Let’s call it 123. To translate Spanish into French, the encoder converts the Spanish sentence into 123. Since the decoder is able to read 123, it can now translate the text into French. Together, the model (consisting of encoder and decoder) can translate Spanish into French.

So, the encoder takes the sequence as an input and generates a final embedding at the end of the sequence. This embedding is a mathematical vector, which is sent to the decoder, which then uses it to predict the output sequence. After each successive prediction, the decoder uses its previous hidden state to predict the next element of the sequence.

In the case of long sequences, there is a high probability that the initial context has been lost by the end of the sequence, simply because the whole input has to be squeezed into that single final vector.
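The hand-off from encoder to decoder, and the way early context fades, can be sketched as follows. This is a toy under stated assumptions: `encode` and `decode` are hypothetical names, each uses a single neuron, and the weights are made-up values rather than trained ones. Real seq2seq models work with large vectors and learned weights.

```python
import math

def encode(inputs, w_x=0.5, w_h=0.9):
    """Encoder: read the whole input, keep only the final hidden state."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
    return h  # the single "123" summary handed to the decoder

def decode(context, steps, w_h=0.9, w_y=2.0):
    """Decoder: start from the encoder's summary and emit one output per step."""
    h = context
    outputs = []
    for _ in range(steps):
        h = math.tanh(w_h * h)   # next hidden state from the previous one
        outputs.append(w_y * h)  # toy "prediction" read off the hidden state
    return outputs

context = encode([1.0, -0.5, 0.25])
outputs = decode(context, steps=3)
print("context:", round(context, 3), "outputs:", [round(y, 3) for y in outputs])

# Two long inputs that differ only in their very first element.
long_a = [1.0] + [0.0] * 50
long_b = [-1.0] + [0.0] * 50
ctx_a, ctx_b = encode(long_a), encode(long_b)
print("difference between summaries:", abs(ctx_a - ctx_b))
```

The two long inputs start with opposite values, yet their final summaries end up almost identical. By the end of the sequence, the decoder has very little to go on to tell them apart.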

This is where “attention” comes in. The attention mechanism looks at an input sequence and decides at each step which other parts of the sequence are important. It essentially fixes the problem of too much context from earlier in the sentence by creating a matrix of two sentences, scoring each word in one against each word in the other. It’s a way of correlating meaning between two sentences. Here’s an English / French translation as an illustration.

What this does is eliminate noise: it connects two words together in a way that filters out the other words around them. It’s a way of establishing relevance across the totality of the sentence.
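The matrix idea can be sketched in plain Python. This is a minimal illustration of one common form of attention (scaled dot-product attention), assuming tiny hand-picked word vectors: the vectors, the example phrase and the function names are all hypothetical, chosen so that each French word points in the same direction as its English counterpart. In a real model these vectors are learned.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Score every query against every key, then blend the values by those weights."""
    scale = math.sqrt(len(keys[0]))
    weights = [softmax([dot(q, k) / scale for k in keys]) for q in queries]
    size = len(values[0])
    contexts = [[sum(w * v[i] for w, v in zip(row, values)) for i in range(size)]
                for row in weights]
    return weights, contexts

# Hypothetical word vectors (assumption: translations share a direction).
en_words = ["the", "black", "cat"]
en_vecs = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
fr_words = ["le", "chat", "noir"]
fr_vecs = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0], [0.0, 2.0, 0.0]]

# Rows are French words, columns are English words.
weights, contexts = attention(fr_vecs, en_vecs, en_vecs)
for word, row in zip(fr_words, weights):
    print(f"{word:>5}:", [round(w, 2) for w in row])
```

Each row shows how much one French word attends to each English word. Although the word order differs (“black cat” versus “chat noir”), the largest weight in each row lands on the matching word and the rest are pushed down.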

Language is this two-dimensional array that somehow manages to express relationships inherent in life over many dimensions (time, space, colors, causation), but it can only do so by creating syntactic bonds among words that are not immediately next to each other in a sentence. Attention allows you to travel through wormholes of syntax to identify relationships with other words that are far away — all the while ignoring other words that just don’t have much bearing on whatever word you’re trying to make a prediction about.

Chris Nicholson, Pathmind

As you read this, you focus on the word you are reading but at the same time your mind still holds the important keywords of the text in memory in order to provide context. Our mental model of sentences is one of a two-dimensional structure unfolding in time as we speak or read or listen. But in fact, language is far more complex and a better mental model might be one of a three-dimensional, complex structure such as a protein folded in on itself.

An attention mechanism is more like trying to understand a folded protein than trying to understand a line. For our example with the human encoder and decoder, imagine that instead of only writing down the translation of the sentence in the language of 123, the encoder also writes down keywords that are important to the semantics of the sentence, and gives them to the decoder in addition to the regular translation. Those new keywords make the translation much easier for the decoder because it knows which parts of the sentence are important and which key terms give the sentence context.

A transformer is a network that uses attention mechanisms to grab information about the relevant context of a word and then encode that context. Because any given word can have multiple meanings and relationships with other words, the transformer architecture can hold more than one set of associations for a particular word. This is called “multi-headed attention.” Because these associations are just numbers that can be mathematically manipulated, the attention mechanism can reshape and reform context as the network learns.
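To make “more than one set of associations” concrete, here is a small sketch of multi-headed attention over a single sentence. It is a simplification under stated assumptions: each head simply looks at a different slice of the word vector, whereas a real transformer first applies learned projections per head. The words, vectors and function names are hypothetical.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Every word attends to every word in the same sentence, itself included."""
    scale = math.sqrt(len(vectors[0]))
    weights = [softmax([sum(a * b for a, b in zip(q, k)) / scale for k in vectors])
               for q in vectors]
    size = len(vectors[0])
    outputs = [[sum(w * v[i] for w, v in zip(row, vectors)) for i in range(size)]
               for row in weights]
    return weights, outputs

def multi_head_attention(vectors, num_heads):
    """Split each vector into slices, attend within each slice, then concatenate."""
    size = len(vectors[0]) // num_heads
    head_weights = []
    combined = [[] for _ in vectors]
    for h in range(num_heads):
        piece = [v[h * size:(h + 1) * size] for v in vectors]
        weights, outputs = self_attention(piece)
        head_weights.append(weights)
        for row, out in zip(combined, outputs):
            row.extend(out)
    return head_weights, combined

# Hypothetical vectors: the first two numbers hint at a "geography" sense,
# the last two at a "finance" sense.
words = ["bank", "river", "money"]
vectors = [[2.0, 0.0, 2.0, 0.0], [2.0, 0.0, 0.0, 2.0], [0.0, 2.0, 2.0, 0.0]]

head_weights, combined = multi_head_attention(vectors, num_heads=2)
for h, weights in enumerate(head_weights):
    print(f"head {h}, weights for 'bank':", [round(w, 2) for w in weights[0]])
```

In this toy, the first head ties “bank” more strongly to “river” while the second ties it more strongly to “money”, so the same word carries two sets of associations at once.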

2020 is going to be an exciting year for natural language. Transformer architectures, built on attention mechanisms, form the basis of state-of-the-art models such as Google’s BERT and OpenAI’s GPT-2. Be sure to read our updates on language developments. Our first one for 2020 is here.

References and further reading:

https://pathmind.com/wiki/attention-mechanism-memory-network

https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04

https://towardsdatascience.com/day-1-2-attention-seq2seq-models-65df3f49e263
