This repo contains crawled Vietnamese text from multiple sources.
This is a list of high-quality, topic-centric public data sources. We have collected and cleaned them from multiple sources. All of the datasets listed below are free.
Here are the ways we clean the data:
- Removal of emojis
- Removal of emoticons
- Removal of URLs
- Removal of HTML tags
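The cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the script actually used for these datasets: the function name `clean_text`, the regular expressions, and the order of the steps are all assumptions. HTML tags are stripped before URLs so that a URL inside an attribute does not swallow the surrounding text.

```python
import re

# All patterns below are illustrative assumptions, not the repo's real rules.
HTML_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, smileys, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)
# Western-style emoticons such as :) ;-) =D, only when they stand alone
# between whitespace, so that ordinary punctuation is left untouched.
EMOTICON_RE = re.compile(r"(?<!\S)[:;=][\-o\*']?[\)\]\(\[dDpP/\\|](?!\S)")


def clean_text(text):
    """Remove HTML tags, URLs, emojis, and emoticons, then collapse whitespace."""
    text = HTML_TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    print(clean_text("Xin chào 😀 :) <b>Việt Nam</b> https://example.com"))
```

The emoji ranges cover the most common blocks only, so a production pipeline would likely need a fuller list or a dedicated library.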
The Binhvq News Corpus was crawled from news sites on the internet and contains 50 GB of text.
OSCAR (Open Super-large Crawled Aggregated coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. OSCAR contains about 32 GB of Vietnamese text after duplicates are discarded.
Short and long stories crawled from the internet by QAI, totaling 10 GB of text.
More than 1 million sentences collected from the internet by QAI.