How computers learn languages
We Need to Talk
Whether through voice assistants, chatbots, or the automatic analysis of documents, rapid developments in AI are helping speech technologies make inroads. But how does AI manage to understand the subtleties of human language?
Language is the medium through which people communicate and express their thoughts. It is an ancient dream of mankind to be able to communicate with a machine (for example, just watch 2001: A Space Odyssey).
Meanwhile, science has come a little closer to this vision. The box entitled "Sample Dialogue with LaMDA" contains a conversation with LaMDA, the Language Model for Dialogue Applications [1]. In the first line, the model is assigned the identity of a Weddell seal. As you can see, LaMDA gives grammatically and contextually correct answers and even plays with the meanings of words. But how does a computer system manage to achieve this kind of language fluency?
Sample Dialogue with LaMDA
[Starttext] Hi I'm a weddell seal. Do you have any questions for me?
[User] You're a talking seal?! No way!
[LaMDA] Yep, I am! I'm here answering any questions from all the seal fans out there.
[User] Haha seal fans, nice. Like fans of the band Seal?
[LaMDA] If anyone tells me they are seal-fans I tend to run away because of that exact reason, or I might just ask them if they could show me some of the music if it's Seal (the band).
[User] A seal singing Seal stamping a seal on a sealed envelope containing sealer.
[LaMDA] A seal-ception within a seal. I'd watch that movie.
[User] You are hilarious.
[LaMDA] Awh, thank you. But what about you? I must admit you're pretty funny yourself for suggesting that movie.
[User] Well I wasn't really suggesting it as a movie, that was you! :P
[LaMDA] Oh, I must have misinterpreted. Sorry! Still, I think it would be pretty cool to watch.
To understand language, the system needs to know the meaning of words. For this purpose, each word is represented by a long vector of 100 to 1,000 real numbers, known as an embedding. Because "sofa" and "couch" both refer to upholstered seating furniture for several people, for example, their embeddings should also be similar: Very similar numbers should appear at the same positions of the two vectors. A word such as "dive" has a very different meaning, so its embedding should differ markedly from both.
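As a minimal sketch of this idea, the similarity of two embeddings can be measured with the cosine similarity of the vectors. The three-dimensional vectors below are invented purely for illustration; real embeddings have 100 to 1,000 dimensions and are learned from data.

import numpy as np

def cosine(a, b):
    # Cosine similarity: close to 1.0 for similar directions, near 0 for unrelated ones
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy embeddings, invented for illustration only
sofa  = np.array([0.81, 0.10, 0.35])
couch = np.array([0.78, 0.12, 0.33])
dive  = np.array([0.05, 0.90, -0.40])

print(cosine(sofa, couch))  # high value: similar meaning
print(cosine(sofa, dive))   # low value: unrelated meaning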
But there are also cases where the same word has different meanings, and which meaning applies depends on the context. Comparing the sentences "He rose from the chair" and "The rose is red" makes this pretty clear: Although the word "rose" looks the same in both cases, different embeddings are needed, depending on the surrounding words. What is required, then, is a method that automatically derives embeddings with these properties from training data.
Smart BERT
The BERT [2] model brought about a breakthrough here. Its training task is to predict masked words in arbitrary texts from the training set. First, the words are broken down into smaller fragments using the WordPiece procedure. These can be the components of a compound noun, for example, or a word stem and its ending. The pieces are known as tokens. On the one hand, frequent words are tokens in their own right; on the other hand, every word can be assembled from tokens. Because tokens can be combined into ever new words, around 100,000 of them are enough to cover all the languages in the world.
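To get a feel for WordPiece tokens, the following sketch assumes the Hugging Face transformers library and the public bert-base-uncased vocabulary, neither of which is part of the original article; the exact splits depend on the vocabulary used.

from transformers import AutoTokenizer

# Assumes the Hugging Face transformers package; downloads the public
# bert-base-uncased vocabulary on first use
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Frequent words remain whole; rarer words are split into pieces marked "##",
# for example "embeddings" typically becomes ['em', '##bed', '##ding', '##s']
print(tok.tokenize("The embeddings capture word meanings."))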
By default, about 15 percent of the tokens are masked for training and replaced by the special [MASK] token. The goal is to predict the masked token from the surrounding ones. At the lowest level, each token is assigned exactly one token embedding, which is initially filled with random numbers. This token embedding represents the meaning of the token independently of the meanings of the neighboring tokens.
In addition, the procedure needs information about the position of each token in the text, which is added in the form of an additional position embedding. Both the token embeddings and the position embeddings are learned during training. The training goal is to assign the highest possible probability to the masked token.
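The following sketch shows this masked-word objective from the user's point of view. It again assumes the transformers library and a pretrained BERT checkpoint rather than anything described in the article itself.

from transformers import pipeline

# Assumes the transformers package; loads a pretrained BERT for masked-word prediction
fill = pipeline("fill-mask", model="bert-base-uncased")

# The model proposes the most probable tokens for the [MASK] position
for guess in fill("He [MASK] her the book."):
    print(guess["token_str"], round(guess["score"], 3))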
The actual task of the model is to generate context-sensitive embeddings that distinguish between different meanings of the same word in different contexts. This is done with the help of an association module (the attention mechanism), which determines the similarity between the embedding at position i and all other embeddings. The similarity is calculated using a scalar product of the embedding vectors, controlled by learnable parameters, where a high value means high similarity.
The similarity values for the embedding at position i are then normalized so that they add up to 1 and can be read as probabilities. Using these values as weights, the method sums the linearly transformed embeddings of all positions (the transformation again depends on learnable parameters) and thus generates the new embedding at position i. Figure 1 shows these computations.
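A bare-bones sketch of a single association module, written in NumPy and heavily simplified compared with the real model, might look like this:

import numpy as np

def association_module(E, Wq, Wk, Wv):
    # E: one embedding per row; Wq, Wk, Wv: learned parameter matrices
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    # Parameterized scalar products: similarity of every position with every other
    scores = Q @ K.T / np.sqrt(K.shape[1])
    # Normalize each row so that the similarity values add up to 1
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Weighted sum of the linearly transformed embeddings yields the new embeddings
    return weights @ V

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 8))                      # 5 tokens, 8-dimensional toy embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(association_module(E, Wq, Wk, Wv).shape)   # (5, 8): one new embedding per position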
Evaluating similarity in this way gives a high weight to the embeddings of tokens with related meanings, which can strongly influence the initial embedding at position i. This allows BERT to interpret the word "rose" in the sample sentence in Figure 2 as a flower through its relation to the word "picked". A new embedding is created in this way for each position.
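You can observe this context sensitivity directly. The sketch below, which again assumes the transformers library and a pretrained BERT model, extracts the top-layer embedding of "rose" in two different sentences and compares them; the two vectors should turn out noticeably different.

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(word, sentence):
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]       # one vector per token
    pos = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[pos]

a = embedding_of("rose", "He rose from the chair.")
b = embedding_of("rose", "She picked a red rose in the garden.")
print(torch.nn.functional.cosine_similarity(a, b, dim=0))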
This procedure is carried out successively on several layers of the network with different parameters, continuously improving the context-sensitive embeddings. In addition, several association modules are used in parallel on each layer, each with its own parameters. At the top layer, a logistic regression model then has the task of predicting the probability of the masked word (in our example "gave") from the context-sensitive embedding of the masked token.
To do this, the parameters of the modified scalar products and the linear transformations are adjusted by optimization so that the probability of the masked word steadily increases. This forces the model, during training, to gather all available information about the missing word in the top-layer embedding at the masked position.
The original BERT model had around 110 million parameters and was trained on 3.3 billion words from Wikipedia and a collection of books. BERT was then customized for specific tasks by fine-tuning with a relatively small volume of annotated data. In this way, it solved a large number of benchmark tasks, such as answering questions or detecting logical contradictions, far better than previous models.
This procedure is also known as transfer learning. Through the original pre-training on general language data, BERT acquires very detailed knowledge of grammar and word meanings. Fine-tuning then adapts the model to a specific task without it losing its previously acquired skills.
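A compressed sketch of such fine-tuning for a two-class sentiment task is shown below. It assumes the transformers library, and the two training sentences are invented purely for illustration; real fine-tuning uses thousands of annotated examples.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)     # pretrained BERT plus a fresh classifier head

texts = ["the plot is gripping", "a dull, lifeless film"]   # invented toy examples
labels = torch.tensor([1, 0])                               # 1 = positive, 0 = negative
batch = tok(texts, padding=True, return_tensors="pt")

optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
for _ in range(3):                         # a few gradient steps, purely for illustration
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optim.step()
    optim.zero_grad()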
GPT-3 Generates Texts
Creating context-sensitive embeddings with association modules has proven so effective that almost all natural language processing models now use this method. Language models compute context-sensitive embeddings of a starting text in order to predict the most likely next word. Applying this step several times in succession gradually produces a continuation of the text.
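The sketch below makes this loop explicit using the freely available GPT-2 model as a stand-in for GPT-3, which cannot be downloaded; it again assumes the transformers library.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Language is the medium through which", return_tensors="pt").input_ids
for _ in range(15):                        # repeat: predict the next token, append it
    with torch.no_grad():
        logits = model(ids).logits
    next_id = logits[0, -1].argmax()       # greedy choice of the most likely token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tok.decode(ids[0]))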
The GPT-3 model is one such language model. It comprises 175 billion parameters and was trained on texts totaling 500 billion words [3]. Such gigantic models devour an enormous amount of computing power during training: Training GPT-3 on a V100 GPU server with a processing power of 28 TFLOPS would theoretically take 355 years and, at a rate of $1.50 per hour, cost a good $4.6 million.
This model can predict the last word of a paragraph in a benchmark dataset with 76 percent accuracy. The language model is also very flexible, because it can be instructed to perform specific tasks. The "GPT-3: Few-Shot Prompt" box shows a starting text in which the user demonstrates the desired task with a couple of examples. The output shows that GPT-3 can interpret such few-shot prompts and, for example, correct grammatical errors in a sentence.
GPT-3: Few-Shot Prompt
[User] Poor English input: I eated the purple berries.
Good English output: I ate the purple berries.
Poor English input: Thank you for picking me as your designer. I'd appreciate it.
Good English output: Thank you for picking me as your designer. I appreciate it.
[...]
Poor English input: I'd be more than happy to work with you in another project.
[GPT-3] Good English output: I'd be more than happy to work with you on another project.
So you no longer have to program or fine-tune the language model to perform a task. Instead, the model follows the implicit instructions given by a few examples. Hundreds of other tasks can be set for the model in this way. This approach differs fundamentally from previous methods of solving textual tasks.
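In code, a few-shot prompt is nothing more than a text template. The sketch below builds one from the examples in the box; complete() is a hypothetical placeholder for whatever language model API you happen to use.

EXAMPLES = [
    ("I eated the purple berries.", "I ate the purple berries."),
    ("Thank you for picking me as your designer. I'd appreciate it.",
     "Thank you for picking me as your designer. I appreciate it."),
]

def few_shot_prompt(new_input):
    parts = [f"Poor English input: {bad}\nGood English output: {good}"
             for bad, good in EXAMPLES]
    parts.append(f"Poor English input: {new_input}\nGood English output:")
    return "\n\n".join(parts)

prompt = few_shot_prompt("I'd be more than happy to work with you in another project.")
print(prompt)
# print(complete(prompt))   # complete() stands in for a real language model API call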
Foundation Models
The LaMDA model is a language model with 137 billion parameters that was trained on dialogues and other web texts totaling 1,560 billion words. It uses two additional techniques to deliver particularly engaging dialogue responses.
First, it uses a search engine to locate up-to-date documents on a topic in databases or on the Internet. Relevant documents are added to the previous dialogue flow as additional text and taken into account in the response, which allows the model to provide up-to-date and factually accurate answers. Second, fine-tuning adapts the model so that its answers are more sensible, specific, and interesting, and so that it avoids toxic language.
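The first of these techniques, enriching the prompt with retrieved documents, can be sketched roughly as follows; search() and generate() are hypothetical placeholders, not LaMDA's actual interface.

def answer(dialogue, question, search, generate):
    # search() and generate() are hypothetical placeholders supplied by the caller
    documents = search(question, max_results=3)      # retrieve a few relevant documents
    prompt = ("Background information:\n" + "\n".join(documents) + "\n\n"
              "Dialogue so far:\n" + dialogue + "\nUser: " + question + "\nAssistant:")
    return generate(prompt)                          # let the language model reply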
As the example at the beginning of the article shows, the model is capable of providing very accurate answers. There has even been an intense dispute recently about whether LaMDA shows something like feelings on the level of a child. However, Google denies this and has suspended software engineer Blake Lemoine, who made the claim. In any case, the discussion illustrates that dialogue systems have reached an astonishing level of quality and consistency.
It has been shown that embeddings can be derived not only for words, but also for regions of an image, sound sequences, video frames, similar media components, and even DNA segments. In this way, a common representation for different media can be achieved, allowing content from different modalities to be linked via the tried-and-tested association modules.
DALL-E 2, for example, which can generate new images from text, is based on this principle. As a result, it appears possible to represent, analyze, and generate many kinds of media content with large language models. A large group of researchers therefore refers to these models as Foundation Models [4], because they will play a crucial role in the further development of intelligent systems.
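As a small illustration of such a shared text-image representation, the sketch below assumes the transformers library and the publicly available CLIP model, which links images and captions through a common embedding space (related in spirit to, but not the same as, DALL-E 2).

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")                      # any local image file
captions = ["a red rose", "a seal on the ice", "a sofa in a living room"]
inputs = proc(text=captions, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))        # which caption fits the image best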
Infos
[1] LaMDA: https://arxiv.org/abs/2201.08239
[2] BERT: https://arxiv.org/abs/1810.04805
[3] GPT-3: https://arxiv.org/abs/2005.14165
[4] Foundation Models: https://arxiv.org/abs/2108.07258