Natural Language Processing (NLP) is a field of Machine Learning that focuses on understanding human language. Understandably, such a field is highly data-based, needing copious amounts of training data to create models for language translation, speech to text, transliteration, and other popular NLP tasks. This is where low-resource languages come in. Contrary to some beliefs, NLP tasks are not equally available for all languages. This is expected as there are almost 7000 languages across the world, and some of them aren’t even written!
Just from looking at the data, the issue with low resource languages is evident. Languages with a #textcorpora of hundreds of million words perform the best on machine learning models, but only 20 of the 7000 languages worldwide meet these requirements. This page on the Translators without Borders website provides a visual graphic of the vast disparity in language technology.
Below is another visual representation of low resource languages, found in Dr.Sebastian Ruder’s article linked here:
Image Description: Language Resource Distribution Of Joshi Et Al. (2020).
On the bright side, research on low resource languages has been improving recently. Research journals specifically focusing on NLP for low resource languages, such as Natural Language Processing Frontiers, have been popping up.
So why is NLP on low resource languages important?
There are many reasons, but I will cover the 3 major reasons and my own take on it.
1. Language preservation
Many languages around the world are dwindling. Since 1950, the number of unique languages has actually significantly decreased. According to the Language Preservation website, “about 2,900 languages or 41% are endangered. At current rates, about 90% of all languages will become extinct in the next 100 years.” All of these languages that are at risk of extinction would be considered low-resource languages. By preserving them through NLP, we can keep the writing system intact for years to come. Many old languages such as Hebrew and Latin have been revived and studied in such a fashion as well!
2. Bringing Awareness To Those Who Speak The Languages
People who speak low-resource languages do not gain much coverage in the media. Bringing awareness and doing research on the languages they speak, empowers those individuals and brings greater visibility to them and their cultures. This article by Sciforce provides an example, “if we think that Africa has a population of over 1.2 bln. people, we’ll realize that it’s important to get closer to them.”
3. Educational Opportunities
This is personally the reason that strikes me the most for NLP on low resource languages. Translation of English content to “low-resource” languages is an especially relevant issue to immigrant and refugee communities who speak dialects of their native language (SciForce 2019). I personally experienced frustration due to the lack of technology available for low-resource languages when I tutored #refugee children. Many of them spoke specific dialects of languages that were not covered by conventional translation methods, making it very hard for them to learn English by themselves. With the staff at these refugee organizations already being overworked, #esl classes were in high demand with low supply. This is what prompted me to do research on machine learning and NLP, to create solutions to help the children who speak low resource languages. Currently, I am working on EduLang, a bilingual children’s library for kids who speak low-resource languages. I am partnering with Neural Space, a company that works on #nlp tasks for low resource languages at scale.