
Releases: bitextor/bitextor-testing-output

Bitextor testing output

29 May 08:56
19b0958

Testing output files, which differ from v11, have been generated using a commit very close to this commit.

Changes were caused by:

  • New version of bicleaner-hardrules

Bitextor testing output

08 Mar 13:33
19b0958

Testing output files, which differ from v10, have been generated using a commit very close to this commit.

Changes were caused by:

  • Fixed bug in documents output rule
  • Fixed dictionary-based docalign feature: mutually linked documents

Bitextor testing output

02 Mar 12:56
19b0958

Testing output files, which differ from v9, have been generated using a commit very close to this commit.

Changes were caused by:

  • New method for scoring in the TF-IDF MT-based document aligner

Bitextor testing output

25 Jan 13:05
19b0958

Testing output files, which differ from v8, have been generated using a commit very close to this commit.

Changes were caused by:

  • New model for the dictionary-based document aligner, retrained because of the Scikit-learn version bump.
  • New bicleaner models due to the same Scikit-learn version bump.
  • New version of bicleaner-hardrules (lowercasing before scoring, new fastspell version, etc.), which changes the Bicleaner and Bicleaner-AI scores too.
  • New document output test (number 40 in run-tests-min and 80 in run-tests)

Bitextor testing output

30 Nov 10:41
19b0958

Testing output files, which differ from v7, have been generated using a commit very close to this commit.

Changes were caused by:

  • New text2prevertical change (avoiding strip to preserve original WARC spaces) introduced in bitextor/bitextor#245 modifies test 11 results.
  • New metadata added to the output files in tests 13 and 73
  • Tests 70, 71, 72, 73 and 102 in run-tests.gz have been run on CPU and on older-architecture Nvidia GPUs (both give the same result) instead of a newer-architecture GPU (Nvidia A100), which has different precision.
  • Test 102 in run-tests.gz uses --disable_minimal_length in Bicleaner through the new Bitextor option --bicleanerExtraArgs. This changed three sentence pairs that previously received a Bicleaner score of 0 from the minimal-length hard rule (source, target or both were 2 tokens long); they are still filtered out because their score remains below 0.5.
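
The effect described for test 102 can be illustrated with a minimal sketch. The function name, the pair identifier and the score 0.31 are made up for illustration, and the inclusive comparison (`>=`) is an assumption; only the 0.5 threshold comes from the note above. It shows why a pair that is no longer zeroed by the hard rule can still be dropped.

```python
# Illustrative sketch only: not Bitextor or Bicleaner code.
THRESHOLD = 0.5  # threshold mentioned in the release note


def keep_pair(score: float, threshold: float = THRESHOLD) -> bool:
    """Return True if a sentence pair survives score filtering.

    Assumption: pairs scoring at or above the threshold are kept.
    """
    return score >= threshold


# Hypothetical pair: score 0 from the minimal-length hard rule, and a
# made-up classifier score once that rule is disabled.
scores_before = {"pair_a": 0.0}
scores_after = {"pair_a": 0.31}

kept_before = [p for p, s in scores_before.items() if keep_pair(s)]
kept_after = [p for p, s in scores_after.items() if keep_pair(s)]

# The score changed, but the pair is filtered out in both runs.
print(kept_before, kept_after)
```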

Bitextor testing output

23 Nov 11:56
19b0958

Testing output files, which differ from v6, have been generated using a commit very close to this commit.

Changes were caused by:

25/11/2022: Changed run-test-min.tgz to add dir2warc test outputs.

Bitextor testing output

23 Nov 08:34
19b0958

Testing output files, which differ from v5, have been generated using a commit very close to this commit.

Changes have apparently been caused by:

  • Adding the number of the final paragraph in a document, if the option is enabled.

Bitextor testing output

13 Oct 14:29
19b0958

Testing output files, which differ from v4, have been generated using a commit very close to this commit.

Changes have apparently been caused by:

  • Removing the default tokenizer from Bicleaner (a tokenizer is now used only if the user provides one)
    • Because the Bicleaner scores changed, the number of sentences in some tests has changed, given the configured threshold.
  • The Bicleaner AI submodule was updated, and scores might have changed for this reason as well.
  • Some output files have a different order since the sorting condition has been slightly changed (e.g. run-deferred-tests.tgz).
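
When only the line order of an output file changes, an order-insensitive comparison can confirm that the content is otherwise identical. The following is a minimal sketch under that assumption; the helper name and the sample lines are hypothetical and are not part of the Bitextor test suite.

```python
from collections import Counter
from typing import Iterable


def same_lines_any_order(old_lines: Iterable[str], new_lines: Iterable[str]) -> bool:
    """Compare two sequences of lines as multisets, ignoring their order.

    Duplicated lines still count: two copies of a line are not equal to one.
    """
    return Counter(old_lines) == Counter(new_lines)


# Hypothetical sample lines standing in for two versions of an output file.
old_output = ["url1\turl2\thello\thola", "url3\turl4\tbye\tadios"]
new_output = ["url3\turl4\tbye\tadios", "url1\turl2\thello\thola"]

print(same_lines_any_order(old_output, new_output))
```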

Update (after the release was published):

  • Tests 40 and 50 have been enabled again: bitextor/bicleaner#72
  • Test 40.1 was failing, which led us to think that hunalign was returning non-deterministic values depending on the machine on which the tests were executed. In fact, we had not noticed that a different dictionary was being used, and that was why the values differed. In GHA, the tests are executed concurrently on separate machines, whereas locally all the tests were executed concurrently on the same machine, so locally the dictionary was being replaced. Fix: bitextor/bitextor@2a69167
  • Older tests had been uploaded for the run-tests.tgz file. This has been fixed.
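
The dictionary clash in test 40.1 is an instance of concurrent tests sharing one file path on the same machine. One common way to avoid that kind of clash is to give each test a private working directory. The sketch below illustrates the idea only: it is not necessarily what the linked fix does, and the function names, the file name `dictionary.dic` and the sample contents are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List, Tuple


def run_test_isolated(test_id: str, dictionary_text: str) -> str:
    """Write the dictionary into a private temporary directory and read it back.

    Each call gets its own directory, so concurrent tests cannot overwrite
    each other's dictionary file.
    """
    with tempfile.TemporaryDirectory(prefix=f"test-{test_id}-") as workdir:
        dict_path = Path(workdir) / "dictionary.dic"  # hypothetical file name
        dict_path.write_text(dictionary_text, encoding="utf-8")
        # A real test would run the aligner here, pointing it at dict_path.
        return dict_path.read_text(encoding="utf-8")


def run_all(tests: List[Tuple[str, str]]) -> List[str]:
    """Run all tests concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(lambda args: run_test_isolated(*args), tests))


results = run_all([("40.1", "hello\thola\n"), ("40.2", "hello\tbonjour\n")])
print(results)
```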

Bitextor testing output

23 Sep 11:13
19b0958

Testing output files, which differ from v3, have been generated using a commit very close to this commit.

Changes have apparently been caused by:

  • The sentence splitter was printing the total number of paragraphs found during paragraph identification. This was a problem because we might not receive all paragraphs as input (e.g. paragraphs removed by boilerplate removal), so the total paragraph count might be lower than the last paragraph id. This count has been removed.
  • Vecalign was printing the target URL as the source URL (commit).
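
The paragraph-count issue in the first item can be reproduced with a small sketch. The ids, the removed set and the function name are hypothetical; the point is only that, once some paragraphs are removed before splitting, the number of paragraphs seen can be lower than the highest paragraph id.

```python
from typing import Iterable, Set, Tuple


def paragraph_stats(paragraph_ids: Iterable[int], removed: Set[int]) -> Tuple[int, int]:
    """Return (count of remaining paragraphs, highest remaining paragraph id)."""
    remaining = [pid for pid in paragraph_ids if pid not in removed]
    return len(remaining), max(remaining)


# Hypothetical document with 5 paragraphs (1-based ids); paragraphs 2 and 4
# are dropped, e.g. by boilerplate removal, before sentence splitting.
count, last_id = paragraph_stats([1, 2, 3, 4, 5], removed={2, 4})

# The splitter sees 3 paragraphs, but the last one still carries id 5,
# so reporting "paragraph 5 of 3" would be inconsistent.
print(count, last_id)
```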

Bitextor testing output

06 Sep 10:48
19b0958

Testing output files, which differ from v2, have been generated using a commit very close to this commit.

Changes have apparently been caused by: