Stanford, California - Every day, people around the globe visit one of the roughly 300 language editions of Wikipedia, searching through millions of articles written by tens of thousands of volunteer editors who build and maintain this free encyclopedia.
Most of the visitors look for articles written in English or one of the other widely spoken languages that account for the vast majority of Wikipedia’s 36 million entries. But with more than half the world’s population monolingual, gaps in knowledge exist from one local language version to another.
To help editors in different linguistic communities find these gaps, computer scientists at Stanford and the Wikimedia Foundation have created a recommendation tool that identifies the most important articles not yet available in a given language. Multilingual editors can then use these recommendations to find the article in a second language they know and get help translating it for local Wikipedia readers.
For example, the system might identify an editor in Madagascar who is interested in climatology and literate in Malagasy and French, and then recommend that the editor work on an article about El Niño, which is absent from the Malagasy Wikipedia. The editor could then create an article explaining to readers in this island country how El Niño may influence rainfall, which in turn affects agriculture and flooding.
“As university researchers, we look for projects with real-world impact,” said Jure Leskovec, an assistant professor of computer science at Stanford. “What could have more impact than democratizing access to knowledge?”
Wikimedia Foundation research scientists Ellery Wulczyn and Leila Zia and Stanford graduate student Robert West rounded out the team of collaborators who will report on their efforts this week at the International World Wide Web Conference in Montreal.
“Wikipedia has huge amounts of data about articles in different languages and the relationships between them,” said West, a doctoral candidate in computer science. “Our goal was to use that data to design a system to encourage editors to create the most important missing articles.”
The researchers began by creating lists of every article in each language, and then cross-referencing these lists to determine which articles were missing in which languages. The researchers then estimated the importance of each missing article based on cultural and geographic relevance. The idea was to rank the value of creating any given article missing in that language relative to all the other missing articles.
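As a rough illustration of the cross-referencing and ranking step, the following minimal Python sketch treats each language edition as a set of language-independent concept labels and uses a set difference to find what one edition lacks. The function names, the toy data, and the importance scores are all hypothetical placeholders; the researchers' actual importance estimate is not reproduced here.

```python
from typing import Dict, List, Set


def missing_articles(editions: Dict[str, Set[str]], target: str) -> Set[str]:
    """Concepts covered in at least one other edition but absent from `target`."""
    covered_elsewhere: Set[str] = set()
    for lang, concepts in editions.items():
        if lang != target:
            covered_elsewhere |= concepts
    return covered_elsewhere - editions[target]


def rank_missing(missing: Set[str], importance: Dict[str, float]) -> List[str]:
    """Order missing concepts from most to least important (ties broken by name)."""
    return sorted(missing, key=lambda c: (-importance.get(c, 0.0), c))


# Toy data: hypothetical concept labels per language edition.
editions = {
    "en": {"El Niño", "Rainfall", "Agriculture"},
    "fr": {"El Niño", "Rainfall"},
    "mg": {"Rainfall"},
}

# Placeholder scores standing in for an importance estimate
# (assumed values, not taken from the research).
importance = {"El Niño": 0.9, "Agriculture": 0.6}

ranked_for_mg = rank_missing(missing_articles(editions, "mg"), importance)
print(ranked_for_mg)
```

In this toy setup the Malagasy edition lacks two concepts, and the ranking simply sorts them by the assumed score, so each language gets its own ordered list of gaps.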
“We had to create a system of rankings that would be meaningful to editors in different cultural and linguistic communities because Wikipedia is shaped by the editors’ choices,” Zia said.
The researchers hypothesized that a system that accurately predicted the popularity of missing articles would appeal to editors by suggesting where their voluntary efforts would deliver the most value to their linguistic communities and, presumably, afford them the greatest personal satisfaction.
To test this premise, the researchers designed a complex experiment. They began with the 4.9 million articles in English Wikipedia and identified those with no counterpart among the 1.6 million articles in French Wikipedia.
The researchers then chose the 300,000 most important English articles missing from French Wikipedia. These articles were randomly divided into three groups of 100,000 articles each and distributed to selected editors.
The crux of the experiment involved two groups of 6,000 editors, each of whom had made at least one edit in both the English and French Wikipedias in the 12 months before the experiment. On June 25, 2015, each of these editors received an email pointing them to five unique missing articles and suggesting that translating one from English into French would be a service to the community.
In one group, the five choices were assigned at random from the master list of important articles missing from French Wikipedia.
For the second group, the five choices were drawn from a separate list of important missing articles but were attuned to each editor’s presumed interests, based on articles each had edited in the past.
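The difference between the two kinds of suggestions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the topic labels, the overlap-count matching rule, and both function names are hypothetical, and the article does not describe how the researchers actually matched articles to editors' interests. For brevity the toy run uses one small pool and two suggestions per editor rather than separate lists and five suggestions.

```python
import random
from typing import Dict, List, Set


def random_suggestions(pool: List[str], k: int, rng: random.Random) -> List[str]:
    """Pick k distinct articles uniformly at random from the pool."""
    return rng.sample(pool, k)


def personalized_suggestions(
    pool_topics: Dict[str, Set[str]], editor_topics: Set[str], k: int
) -> List[str]:
    """Pick the k articles whose topics overlap most with the editor's past edits."""

    def overlap(article: str) -> int:
        return len(pool_topics[article] & editor_topics)

    # Highest overlap first; ties broken alphabetically for determinism.
    return sorted(pool_topics, key=lambda a: (-overlap(a), a))[:k]


# Hypothetical missing articles, each tagged with assumed topic labels.
pool_topics = {
    "El Niño": {"climate", "ocean"},
    "Monsoon": {"climate", "weather"},
    "Baroque music": {"music"},
    "Sonata": {"music"},
    "Trade wind": {"weather"},
    "Glacier": {"climate"},
}

# Topics inferred from an editor's past edits (assumed for illustration).
editor_topics = {"climate", "weather"}

rng = random.Random(0)  # fixed seed so the random arm is reproducible
random_pick = random_suggestions(sorted(pool_topics), 2, rng)
tailored_pick = personalized_suggestions(pool_topics, editor_topics, 2)
print(random_pick, tailored_pick)
```

The random arm ignores the editor entirely, while the tailored arm favors articles that share topics with the editor's history, which is the contrast the experiment was designed to measure.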
A month after sending the emails, the researchers measured how many of the missing articles had been created. They found that simply pointing editors in the first group toward five random missing articles doubled the organic article creation rate.
In the second group, where the scientists tailored the five suggestions to the editors’ interests, they tripled the rate at which editors plugged article gaps.
Based on these results, the Wikimedia Foundation has developed an experimental tool that lets editors find gaps in their local language Wikipedia and points them to an entry in a second, familiar language that can serve as a starting point for translating the article or creating it from scratch.
The research is described in detail at arxiv.org.