AI-powered language translation has become popular, and the need for data to train models for African languages is critical. Two thousand of the world’s 7,000 languages are spoken in Africa, and most of them are primarily spoken and don’t have a widely used writing system. This has made it impossible to build machine translation tools using standard techniques, which require large amounts of written text to train an AI model. As a result, researchers refer to most of our languages as “low-resourced.”
However, as an African, I know our languages are anything but low-resourced. On the contrary, African languages are rich in resources. This is directly visible in how our communities participate in formal and informal oral storytelling. People congregate, listen, and participate in accounts and stories of past deeds, beliefs, wisdom, counsel, morals, taboos, and myths. In our communities, accomplished storytellers are respected members who, after years of training, have mastered a complex verbal art of proverbs, musical parables, and memory skills. Oral arts and lessons are essential to children’s traditional education on their way to full humanness.
Language models today are trained using texts from sources like Wikipedia, news articles, scientific papers, and books. However, it’s time for us to consider oral stories as a way to gather the data necessary to accelerate natural language processing for more languages, especially African languages. Traditional and cultural practices of oral literature are existing libraries of information (texts).
Being “low-resourced” does not mean that resources for the language are unavailable. It simply means that existing methods of gathering those resources have not done enough to reach Africa. Unique problems require new, creative solutions.
The Jeli (Griot) project is an excellent example of an approach that draws on traditional and cultural practices of oral storytelling to gather the data necessary for natural language processing. In this project, which I led at Google, we connected with Bambara-speaking Griots (historians and oral storytellers) from West Africa. First, we recorded more than 30 hours of stories, teachings, and cultural practices. Then we worked with a local organization to transcribe all of the audio and translate it into French. Finally, using the stories, our research partners built an automatic speech recognition (ASR) model that understands Bambara speech and facilitates easy translation to other languages.
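The record, transcribe, translate, and train workflow implies a simple parallel data layout in which each stretch of audio is paired with its Bambara transcription and its French translation. The Python sketch below is only an illustration of that idea, not the project's actual tooling: the `StorySegment` class, its field names, and the `select_training_subset` helper are all hypothetical, and the text fields are meant to hold real transcriptions rather than anything shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StorySegment:
    """One aligned unit: an audio span, its transcription, and its translation.

    All field names are hypothetical and chosen for illustration only.
    """
    audio_path: str    # path to a recording (for example, a WAV file)
    start_sec: float   # where the segment begins within the recording
    end_sec: float     # where the segment ends within the recording
    bambara_text: str  # manual Bambara transcription of the segment
    french_text: str   # manual French translation of the transcription

    @property
    def duration_sec(self) -> float:
        return self.end_sec - self.start_sec


def total_hours(segments: List[StorySegment]) -> float:
    """Sum the durations of all segments and convert seconds to hours."""
    return sum(seg.duration_sec for seg in segments) / 3600.0


def select_training_subset(
    segments: List[StorySegment], budget_hours: float
) -> List[StorySegment]:
    """Take segments in order until adding the next one would exceed the budget."""
    budget_sec = budget_hours * 3600.0
    chosen: List[StorySegment] = []
    used_sec = 0.0
    for seg in segments:
        if used_sec + seg.duration_sec > budget_sec:
            break
        chosen.append(seg)
        used_sec += seg.duration_sec
    return chosen
```

Keeping the audio span, transcription, and translation in one record means the same collection can feed both a speech recognition model (audio paired with Bambara text) and a translation model (Bambara text paired with French text), and a duration budget makes it easy to set aside only part of the recordings for training.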
Many of our languages are underrepresented online, and there’s still much work to do to lower the barrier to digital participation for local communities across the continent. Using our existing oral libraries to build more datasets is one of many creative approaches we must adopt to digitize more African languages. Our traditional cultures and heritage are rich in resources that can enable significant advancements in AI-powered language translation. It is up to us to recognize them and build our own approaches, because we are in the best position to introduce the solutions needed to bring more African languages online.
Griots Interviews, Bambara Language WAV, 30 hours, Recorded 2022, Cultural and ASR Training Resource.
Source material for this project:
Addition to 200,000 Bambara-French clean synchronized corpus
Joint project with Google: recorded 30 hours of video interviews with Griots
30 hours manually transcribed and translated into French
10 hours were used in training the ASR system and MT transformer
100% open source
Cultural/Technical Exhibition to be hosted online and in the National Museum of Mali
Record, preserve, and share Malian culture with the world
Contribute to the science of low-resource language NLP
Reinforce the development of written Bambara
Enable Bambara to reach status as a “first-class internet language”
Building language models, one story at a time