At the special session “Unity and diversity of Spanish. Tradition and the challenge of artificial intelligence”, chaired by His Majesty the King, Telefónica, together with Microsoft, Google, Amazon and Meta, has unveiled the progress made in the LEIA initiative. Its aim is to help machines speak correct Spanish and to ensure that AI tools that generate and understand the language respect the rules drawn up by the Royal Spanish Academy (RAE, by its acronym in Spanish).
Committed to the Spanish language
Ángel Vilá, Chief Operating Officer at Telefónica, gave an overview at the event of the advances Telefónica has made to promote the proper use of Spanish in home products and services, such as the RAE Living App on Movistar Plus+, for consulting definitions or learning more about the language, and the RAE game available on the Movistar Home device. He also presented something new: the prototype LEIA-X, an extension for Chrome browsers that uses artificial intelligence to improve the understanding of Spanish. This tool highlights the most appropriate meaning of a selected word according to its context. It uses an AI model that has been trained with more than 70,000 examples from various RAE dictionaries.
This functionality is especially useful for the more than 100 million non-native Spanish speakers. In addition, using automatic translation APIs, it is capable of providing a response in any language, always aimed at improving the user’s understanding of Spanish.
LEIA-X responds to the need to improve reading comprehension in a web browser, whether on a laptop, an e-book reader or simply a mobile phone. Today, all readers have access to a “look up” or “define” feature that allows them to select a word and automatically open a dictionary window with its corresponding entry. From there, as readers, we have to navigate through all the meanings to find the one that fits best, a task that distracts from reading, especially on small screens or devices that are not particularly fast. LEIA-X uses AI to show the definition that matches the word's context, which makes reading much easier.
How LEIA-X works
The extension is based on an AI model trained specifically on Spanish text (namely the BETO model, trained by the University of Chile) to solve a problem that does not require very large language models (LLMs) such as GPT-3 or GPT-4: the disambiguation of the meaning of a word.
The original model (BETO) was trained by the University of Chile on a task known as “fill the mask”: given a phrase, one word is masked and the model is asked to predict which word fits best. This method of machine learning is called “self-supervised” because the training signal comes from the text itself rather than from manual labels. By doing this a sufficient number of times, the model is able to extrapolate which words are related to the context in the phrase, what the sentiment of the phrase is, for example, or when a verb or noun is required. In short, the AI model learns to extract knowledge or correlations between the words that make up a phrase.
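To make the “fill the mask” idea concrete, here is a minimal sketch of how masked training pairs can be derived from a plain sentence without any manual labelling. It is an illustration only: the function name `make_masked_examples` is hypothetical, and the sketch masks whole words one at a time, whereas BERT-style models such as BETO typically mask a random subset of subword tokens.

```python
MASK = "[MASK]"

def make_masked_examples(sentence):
    """Return one (masked_sentence, hidden_word) pair per word position.

    Simplification: splits on whitespace and masks whole words,
    which is not how a real tokenizer for BETO would operate.
    """
    words = sentence.split()
    examples = []
    for i, word in enumerate(words):
        masked = words[:i] + [MASK] + words[i + 1:]
        examples.append((" ".join(masked), word))
    return examples

examples = make_masked_examples("He ido al banco a hacer un depósito")
for masked_sentence, hidden_word in examples:
    print(f"{masked_sentence}  ->  {hidden_word}")
```

Each pair gives the model a sentence with a gap and the word it should recover, so the text itself supplies the labels.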
To disambiguate a word in Spanish, you have to use the context where the word appears. To give an example, the Spanish word “banco” (“bank” or “bench” in English) takes on different meanings depending on the context:
“I have gone to the bank to make a deposit”
Or if we say:
“I’m sitting on a bench reading a book”
While people do this process automatically and almost unconsciously, it is really complex for an algorithm to know which of the definitions of the word “banco” is being referred to in each case. The only way to know this is to understand each of the words and how they relate to each other in a given context.
Based on the BETO model, LEIA-X has been trained with a corpus of positive and negative examples of words with their meanings in the following way: given a word and a phrase, e.g. the word “banco” (“bank” or “bench” in English) and the sentence:
“I have gone to the bank to make a deposit”
During training, the model takes as input the different definitions of the word “banco”, which, according to the RAE dictionary, include:
- Seat, with or without backrest, on which two or more people can sit.
- A company engaged in financing operations with money from its shareholders and customer deposits.
To build the LEIA-X training corpus, each sentence and target word has been automatically paired with a meaning and labelled: a positive example when the meaning is the correct one for that sentence, and a negative example when it is not.
The examples in the corpus will eventually take the following form:
- I have gone to the “bank” to make a deposit [SEP] where “bank” means: Seat, with or without backrest, on which two or more people can sit. [incorrect]
- I have gone to the “bank” to make a deposit [SEP] where “bank” means: A company engaged in financing operations with money from its shareholders and customer deposits. [correct]
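The construction of these examples can be sketched in a few lines. This is an illustrative reconstruction from the format shown above, not the code used for LEIA-X; the names `Example` and `build_examples` are hypothetical, and the label encoding (1 for correct, 0 for incorrect) is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Example:
    text: str
    label: int  # assumed encoding: 1 = correct meaning, 0 = incorrect meaning

def build_examples(sentence, word, definitions, correct_index):
    """Pair one sentence with every candidate definition of the target word."""
    # Naive marking: quotes the first occurrence of the word as a substring.
    marked = sentence.replace(word, f"“{word}”", 1)
    examples = []
    for i, definition in enumerate(definitions):
        text = f"{marked} [SEP] where “{word}” means: {definition}"
        examples.append(Example(text, int(i == correct_index)))
    return examples

definitions = [
    "Seat, with or without backrest, on which two or more people can sit.",
    "A company engaged in financing operations with money from its "
    "shareholders and customer deposits.",
]
corpus = build_examples(
    "I have gone to the bank to make a deposit", "bank", definitions, correct_index=1
)
for example in corpus:
    print(example.label, example.text)
```

One sentence with a known meaning therefore yields one positive example and as many negative examples as the word has other definitions.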
In this way, a corpus of more than 70,000 examples has been constructed based on various dictionaries provided by the RAE. In the Student’s Dictionary, each meaning or definition of an entry comes with an example of use, which serves as a positive example for that meaning. To complement this corpus, we have also taken advantage of the Spanish Language Dictionary (DLE, by its acronym in Spanish), in which approximately 15% of the meanings have examples of use. Training on this corpus has adapted the BETO model and given it disambiguation capabilities.
Once trained, the LEIA-X model is able to assign to each sentence-definition pair the confidence or probability that the given meaning is the correct one. In the case of the example with the Spanish word “banco”, the model would assign a probability close to 0% to the first pair (the seat definition) and a probability close to 100% to the second pair (the financial company definition), showing the latter as the most likely meaning. It has therefore succeeded in disambiguating the word.
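The final selection step can be sketched as follows. The raw scores below are invented values standing in for the model's output, and converting each score to a probability with a sigmoid is an assumption; the article does not specify how LEIA-X computes its confidence. The function name `pick_meaning` is hypothetical.

```python
import math

def sigmoid(x):
    """Map a raw score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def pick_meaning(definitions, scores):
    """Return the most probable definition and the probability of each one."""
    probabilities = [sigmoid(score) for score in scores]
    best_index = max(range(len(definitions)), key=lambda i: probabilities[i])
    return definitions[best_index], probabilities

definitions = [
    "Seat, with or without backrest, on which two or more people can sit.",
    "A company engaged in financing operations with money from its "
    "shareholders and customer deposits.",
]
# Invented scores for "I have gone to the bank to make a deposit".
scores = [-6.0, 6.0]

best, probabilities = pick_meaning(definitions, scores)
for definition, probability in zip(definitions, probabilities):
    print(f"{probability:.1%}  {definition}")
print("Selected:", best)
```

Scoring each pair independently means the same procedure works for words with any number of meanings.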