Skip to content

qai-research/Open_Crawl_Vietnamese_Text

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 

Repository files navigation

Open Crawl Vietnamese Text

This repo contains crawled Vietnamese text from multiple sources.

This list of a topic-centric public data sources in high quality . We have collected and cleaned them from multiple sources. All of the datasets listed below are free.

Here are the ways we clean the data:

  • Removal of emojis

  • Removal of emoticons

  • Removal of URLs

  • Removal of HTML tags

1. Binhvq News Corpus:

Binhvq News Corpus was crawled from news on the internet with size of 50GB text.

link_raw, link_clean

2. Oscar corpus vietnamese crawl:

OSCAR or Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. Oscar has mostly 32 GB vietnamese text discarded duplicates.

link_raw, link_clean

3. Dataset story VietNamese :

Including texts of short and long story with size of 10 GB crawled by QAI on the internet.

link_clean

4. Dataset poem VietNamese :

More than 1 million sentences collected by QAI on the internet.

link_clean

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages