Releases: bltlab/mot

V1.9

09 Apr 03:11
bfc960d
  • Scrape up to April 1, 2024

  • Improved filtering of <!-- IMAGE --> markers and their variants
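
As an illustration of the kind of filtering described above, here is a minimal sketch. The release notes do not say which variants are handled, so the pattern (case-insensitive, optional whitespace, optional trailing text inside the comment) and the names IMAGE_PLACEHOLDER and drop_image_placeholders are assumptions for this example, not the project's actual code.

```python
import re

# Assumed pattern: an HTML-style comment whose body starts with the word
# "IMAGE" in any case, optionally followed by other text before the "-->".
IMAGE_PLACEHOLDER = re.compile(r"<!--\s*IMAGE\b.*?-->", re.IGNORECASE | re.DOTALL)


def drop_image_placeholders(text: str) -> str:
    """Return text with every matching image placeholder comment removed."""
    return IMAGE_PLACEHOLDER.sub("", text)
```

Other comments, such as <!-- note -->, are left untouched by this sketch because the pattern requires the word IMAGE at the start of the comment body.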

V1.8

20 Nov 19:21
63ef942
  • Added scraping from April 2023 to November 15, 2023

1.7

04 May 23:14
  • Additional data scraped from October 2022 to the end of April 2023

v1.6

11 Oct 22:40
7511f2c

v1.5

02 Sep 20:08
8f697c5
  • Added segmentation for remaining languages
  • Improvements to some of the existing segmentation models
  • Cases of both under-segmentation and over-segmentation were found and addressed

v1.4

08 Jul 20:22
6263108

  • Updated scrape through July 1st, 2022
  • Fix missing yue documents
  • Change yue to cmn and voacambodia from khm to eng
  • Authors extraction from metadata improved
  • Paragraph splits extraction improved

v1.3

16 Jun 18:24
d8c1df5

Release 1.3 with updated scrapes through the end of May 2022.

v1.2

12 May 15:32
91162df
  • Added segmentation for all languages except: ben, bod, kat, kur
  • Better publication date coverage
  • Remove zero-width spaces from segmentation and tokenization output for Thai, Lao, Khmer (zero-width spaces are kept in the original paragraph text)
  • Release as described in camera-ready LREC 2022 paper
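
The zero-width space handling noted above can be sketched as follows. This is a minimal illustration and not the project's code: the function name strip_zero_width_space is hypothetical, and dropping tokens that become empty is an assumption made for this example.

```python
ZERO_WIDTH_SPACE = "\u200b"  # Unicode U+200B


def strip_zero_width_space(tokens: list[str]) -> list[str]:
    """Remove U+200B from each token and drop tokens left empty."""
    cleaned = (token.replace(ZERO_WIDTH_SPACE, "") for token in tokens)
    return [token for token in cleaned if token]


# The original paragraph string is not modified; only the derived
# token list is cleaned.
paragraph = "ab\u200bcd"
tokens = strip_zero_width_space(["ab\u200b", "\u200b", "cd"])
```

Here tokens ends up as ["ab", "cd"], while paragraph still contains the zero-width space.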

v1.1

24 Mar 01:41
  • Additional scraping from January 2022 to March 1, 2022.

  • Fix for Cantonese segmentation

  • Add segmentation for Portuguese and Urdu

  • Added source code

v1.0

14 Jan 18:05
1e6a25d

This is the release of Multilingual Open Text v1.0.
This is a corpus of public domain news in 44 languages.

Contributors

@cpalenmichel
@haewonkim620
@ConstantineLignos