We have build corpus for Kazakh language from Wikipedia dump (https://dumps.wikimedia.org/kkwiki/). Using a WikiExtractor (https://github.com/attardi/wikiextractor) to parse data, and nltk to build n-grams.
A total of 21 million words were collected. With almost 600 thousand words of different derivations.