Somali NLP Data

Description

This is the sister repo of Somali_NLP. The data here was collected from a range of sources, and in some cases we went to great lengths to clean it up. For project aims and milestones, please see the Somali NLP repo linked above.

If you have any ideas about how to clean up or otherwise improve the data, please get in touch. I can be reached on Twitter.

Here’s a quick overview of the data so far.

The Wikipedia folder

Here you’ll find three CSV files of about 8 MB each, which is large but not unmanageable. Between them, these three files contain the entire Somali Wikipedia corpus. Each file has two columns: one for the article title and one for the article text.
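As a rough illustration, the files could be read with Python's standard csv module. This is only a sketch: the function name load_articles is made up for this example, and it assumes each file starts with a header row and that the title column comes before the text column, so check the files and adjust if that is not the case.

```python
import csv
from pathlib import Path

# Article bodies can be long, so raise the csv module's per-field size limit.
csv.field_size_limit(10_000_000)


def load_articles(folder):
    """Yield (title, text) pairs from every .csv file in `folder`.

    Assumption: the first row of each file is a header, and the
    first two columns are the article title and the article text.
    """
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row
            for row in reader:
                if len(row) >= 2:  # ignore blank or malformed rows
                    yield row[0], row[1]
```

Calling list(load_articles("wikipedia")) from the repo root would then collect every article from the three files into one list, assuming the folder is named wikipedia.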

The Masaryk University folder

We took this corpus from Masaryk University, which reportedly comprises over 80 million tokens (individual words). We cleaned up these tokens, removed the XML data around them, and removed duplicates. We then sorted the tokens into grammatical categories (is a word a verb, an adjective, a noun, and so on). These categories still need a LOT of work because many tokens are still uncategorized, but the foundation is there.
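The cleanup steps described above (removing the XML around tokens and removing duplicates) could look roughly like the sketch below. It is not the script we actually used: the name clean_tokens is hypothetical, and it assumes the raw corpus has one token per line with XML-like tags either on their own lines or wrapped around a token.

```python
import re

# Matches any XML-like tag, e.g. <doc id="1">, <s>, </s>.
TAG = re.compile(r"<[^>]+>")


def clean_tokens(lines):
    """Return unique tokens in first-seen order, with XML-like tags removed.

    Assumption: `lines` is an iterable of strings, one token per line,
    possibly mixed with lines that contain only markup.
    """
    seen = set()
    unique = []
    for line in lines:
        token = TAG.sub("", line).strip()
        if token and token not in seen:  # skip markup-only lines and repeats
            seen.add(token)
            unique.append(token)
    return unique
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so the raw corpus does not need to be held in memory all at once. Only the set of unique tokens is kept.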

The Hadrawi folder

The Hadrawi data were contributed by Mohamed Ainab. They can be found in the original repo here.
