
Large Language Models in a Multilingual World

In conversation with … a computer

Years after my last longform article on the language of the Internet of Things (IoT), I find myself once again writing about language. If that article was about a common language for the IoT, this one is about Large Language Models in a multilingual world. Language is an endlessly fascinating subject. It structures communication through rules to describe the world around us. Idioms, proverbs, fables, and legends rooted in our languages often inform our values. We understand our place in this world through language. The same is true for computers. We taught computers to communicate with us through programming languages. The syntax of such a language is not particularly intelligible to layfolk, but the underlying logic is usually intuitive. Nevertheless, reading code in an unfamiliar language can feel like reading text in a foreign language.

Now, LLMs have seemingly eliminated this barrier between layfolk and the computer. Or have they? First, let’s understand what these are. LLM is short for Large Language Model: a system that uses a vast quantity of written content to learn to communicate in a human language. Essentially, LLMs facilitate an approximation of natural conversation between the user and the computer. Instead of typing keywords into Google such as “chocolate cake recipe no eggs” and then clicking through the links in the search results, you can theoretically instruct an LLM, “give me a recipe for chocolate cake that doesn’t require eggs.” Instead of gaming a search engine with keywords, you can theoretically have a conversation with an LLM about chocolate cake recipes.

This is no small feat. Communicating with computers through binary was difficult to begin with, and that difficulty led to the development of various computer languages. Now, computers have the tools to interpret the written word in common languages and respond in kind. Think about how long it took you to gain mastery over the language you speak. You probably read thousands of pages to be able to communicate as well as you do. The leading LLMs today have access to much more written content, possibly orders of magnitude more than what you have consumed. But an LLM doesn’t really understand words as we do. It learns how to sequence words by observing their usage in a dataset. Given a large enough dataset, it can do a good job of mimicking an understanding of language.
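The idea of learning to sequence words from observed usage can be sketched with a toy bigram model. This is an illustrative simplification with a made-up corpus; real LLMs use neural networks trained on vastly larger datasets, not raw counts:

```python
from collections import defaultdict, Counter

# Toy corpus; real models train on billions of documents.
corpus = "the cat sat on the mat . the cat sat by the door .".split()

# Count how often each word follows another (bigram statistics).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word(word):
    """Pick the continuation seen most often after `word` in the corpus."""
    counts = follows[word]
    return counts.most_common(1)[0][0] if counts else "."

# Generate a short sequence starting from "the".
word, out = "the", ["the"]
for _ in range(2):
    word = next_word(word)
    out.append(word)
print(" ".join(out))  # prints: the cat sat
```

The model has no notion of cats or mats; it simply reproduces the most frequent patterns in its data, which is why the size and makeup of that data matter so much.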

A biased trainer … or biased data?

Without a human-level understanding of words, an LLM can make outlandish statements. Just because the words make grammatical sense doesn’t necessarily mean the sentence is morally or logically correct. Again, given a large enough dataset, it could correlate words such that it recognizes patterns of causality. For example, it could detect the demographic impact of mutual declarations of war between two nations if it has access to the declarations and the census records of that time. It could even account for the human cost of war if it had access to content that describes the horrors of conflict.

A few observations:

  • Large Language Models mimic an understanding of language
  • They need vast quantities of data in that language to do so
  • Their interpretation of the data is dependent on their training

I think the most wondrous aspect of a large language model is its ability to mimic an understanding of a language. Stringing together a sequence of words with correct grammar and reasonably good vocabulary indicates a strong grasp of the mechanics of language by the developers. But an LLM’s performance will only be as good as the data fed to it and the guardrails it is instructed to respect.

The curation of data presented to an LLM, and any guardrails introduced in its training, create bias. Developers of leading LLMs claim to train these models on a large volume of written content available on the web. This is not so different from what search engine crawlers do. These crawlers go from website to website and index their locations and content. The engines then match these with a user’s search keywords to return a ranked list of results. An LLM draws on this information not just to find answers, but also as a reference for how to communicate in human languages. In other words, the more data in a given language, the better it is at mimicking communication in that language. And it stands to reason that the more data in a given language, the more likely it is to find relevant answers in that language.
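The crawler-and-index model described above can be sketched as a minimal inverted index. The pages and URLs below are hypothetical, and real engines add crawling, ranking, and much more:

```python
from collections import defaultdict

# Hypothetical pages a crawler might have fetched (made-up URLs).
pages = {
    "bakingfun.example/eggless": "chocolate cake recipe without eggs",
    "cooking.example/classic": "classic chocolate cake recipe",
    "news.example/story": "local bakery wins award",
}

# Build an inverted index: word -> set of page URLs containing it.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

def search(query):
    """Return pages containing every keyword in the query."""
    results = set(pages)
    for word in query.split():
        results &= index.get(word, set())
    return sorted(results)

print(search("chocolate cake recipe"))
```

Note that a query in a language with few indexed pages returns little or nothing, which mirrors the article’s point: the less content exists in a language, the fewer relevant answers there are to find.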

Language and culture

For a variety of reasons, English is the dominant language on the internet. A handful of nations can reasonably claim it as their national or first language, and it is intricately linked to their cultures. The experiences and opinions of people for whom English is or was a native language are likely to influence the outlook of their fellow speakers. A significant proportion of the rest of the world also communicates in English. Would it be reasonable to assume that uniquely English perspectives influenced these people when they learned this language? An LLM trained largely on English primary sources incorporates the biases inherent in that content.

Let’s briefly touch upon translations. An LLM can train on pre-translated text, in which case it will use that text both for language learning and as a possible match for keywords. An LLM can also translate text in real time. In either case, the translator, or the developer who designs the translation parameters, has a role to play in how that text is interpreted. It can take a lifetime to learn the nuances of one language, and a translator is often concerned with two or more. Do they understand each foreign language as well as their native one? Does their native language instill a bias of which they are conscious or unconscious?

Discussions of this nature can quickly take on a political hue. This article, however, is about recognizing that Large Language Models may have limitations when their users come from a wide variety of linguistic backgrounds. Can such models be made useful and resourceful for all users, irrespective of their linguistic backgrounds? And if yes, how?

Making yourself heard

I think this will require concerted advocacy at the social, cultural, and governmental levels.

One solution is to increase the quantity of written content in your language on the web. Maintaining websites, publishing content, and promoting the use of your language for communication on the web are some ways to increase its online presence. You have probably noticed that some languages have a strong online presence today. Incentives are a significant contributor. Societal and governmental support can act as a catalyst to promote content creation in languages that are not in wide use.

Another possible solution is to take an active interest in the translation of texts from your language into more widely spoken ones. This may ensure a substantial presence of your linguistic and cultural perspectives on the web. Native speakers can provide a great deal of insight into the nuances of their languages. They should consider engaging with the developers who build the translation services used by these LLMs, or with the developers of the LLMs themselves.

It’s also important to start engaging with LLM developers to help them incorporate more data from your language. Possibly, they have few, if any, people that understand it. There may be technical challenges hindering their efforts to incorporate such content. Introduce your language to them. Work with them to help them understand its cultural and historical context. Impress upon them how it could help them reach more users around the world.

A more radical solution is to engineer Large Language Models built primarily around your own language. This will both require and promote investment in talent and technology. It could also incentivize the building of cultural and industrial networks across linguistic and geographical barriers. Some nations have already begun to do so.

These solutions require collective action across social, cultural, and regional groups. In my opinion, widening cultural and political differences across the world will spur the creation of competing LLMs primarily differentiated on the basis of language. Amid all this, it’s important to remember that LLMs are tools built by people for specific purposes. LLMs are not born out of a vacuum, nor are they divine black boxes. It may be tempting to get swept up in the hype around ‘artificial general intelligence’, but these are ultimately just tools. If the tool and its purpose aren’t aligned, one or the other must change.

Let’s discuss

Are you working on a solution to this problem? Perhaps you already have one. Do you seek to communicate your ideas to your audience? I can help you do just that through a white paper. To learn more, please visit the Solutions page.