apjama/Somali_NLP_data

Somali NLP Data

Description

This is the sister repo of Somali_NLP. The data here have been collected from a range of different places, and we have sometimes gone to great lengths to clean them up. For project aims and milestones, please see the Somali NLP repo linked above.

If you’ve any ideas about how to clean up the data or otherwise make it better, please get in touch. I can be reached on Twitter.

Here’s a quick overview of the data so far.

The Wikipedia folder

Here you’ll find three CSV files of about 8 MB each, which is large but not unmanageable. Between them, these three files contain the entire Somali Wikipedia corpus. Each file has two columns: one for the article title and one for the article text.
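As a rough illustration of reading one of these files, here is a minimal sketch using Python's standard csv module. The column names `title` and `text` are assumptions for the sake of the example (check the header row of the actual files), and the in-memory sample stands in for a real file from the folder.

```python
import csv
import io


def read_articles(handle, title_col="title", text_col="text"):
    """Return a list of (title, text) pairs from an open CSV file.

    The default column names are hypothetical; pass the real ones
    if the header row in the Wikipedia files differs.
    """
    reader = csv.DictReader(handle)
    return [(row[title_col], row[text_col]) for row in reader]


# Small in-memory sample standing in for one of the three CSV files.
sample = io.StringIO(
    'title,text\n'
    'Muqdisho,"Muqdisho waa caasimadda Soomaaliya."\n'
)
articles = read_articles(sample)
```

For a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `read_articles`. If very long articles trigger a field-size error, `csv.field_size_limit` can raise the limit.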

The Masaryk University folder

We took this corpus by Masaryk University, which reportedly comprises over 80 million tokens (individual words). We cleaned up these tokens, removed the XML data around them, and removed duplicates. We then sorted the tokens into grammatical categories (is a word a verb, an adjective, a noun, etc.). These categories still need a LOT of work because many tokens are still uncategorized, but the foundation is there.
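The cleanup steps described above (stripping XML and removing duplicates) could look something like the sketch below. It is not the script we actually ran. It assumes the raw corpus has one token or XML tag per line, and the function name `clean_tokens` and the sample lines are made up for illustration.

```python
import re

# Matches anything that looks like an XML tag, e.g. <doc id="1"> or </s>.
TAG_RE = re.compile(r"<[^>]+>")


def clean_tokens(lines):
    """Strip XML tags, drop empty lines, and remove duplicate tokens.

    The order of first appearance is kept so the output is stable.
    """
    seen = set()
    unique = []
    for line in lines:
        token = TAG_RE.sub("", line).strip()
        if token and token not in seen:
            seen.add(token)
            unique.append(token)
    return unique


# Hypothetical raw lines: tags and tokens, one per line.
raw = ['<doc id="1">', "<s>", "magaalo", "waa", "magaalo", "</s>", "</doc>"]
tokens = clean_tokens(raw)
```

Keeping a set alongside the list makes the duplicate check constant-time per token, which matters at the scale of tens of millions of tokens.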

The Hadrawi folder

The Hadrawi data were contributed by Mohamed Ainab. They can be found in the original repo here.

About

a place to hold all that somali nlp data
