chores: include backward compatibility section
SethFalco committed Sep 19, 2023
1 parent bf52b0f commit fdc3282
Showing 4 changed files with 24 additions and 16 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/translations.yml
@@ -1,7 +1,7 @@
on:
workflow_dispatch:
schedule:
- cron: "0 0 1 * *"
- cron: "0 0 1 * *"

jobs:
translations:
18 changes: 13 additions & 5 deletions CONTRIBUTING.md
@@ -20,15 +20,15 @@ At this stage you should be ready to develop! Feel free to manually execute comm

Before submitting PRs, execute tests to ensure existing functionality doesn't break. Please introduce new tests for new code.

-You can execute tests with the following command:
+You can execute tests with:

```sh
npm run test
```

## Manual Execution

-You can run this manually over a local copy of tldr-pages. First clone a copy of tldr-pages somewhere on your device.
+You can run this manually over a local copy of tldr-pages. First clone a copy of tldr-pages:

```sh
git clone https://github.com/tldr-pages/tldr.git
@@ -40,10 +40,18 @@ Then build tldr-translation-pairs-gen:
npm run build
```

-Finally, you can execute the command from the transpiled sources.
+Finally, you can execute the command from the transpiled sources:

```sh
-npm run tldr-translation-pairs-gen -- --source {PATH_TO_TLDR-PAGES}
+npm run tldr-translation-pairs-gen -- --source {{path/to/tldr_dir}}
```

-Read the README or help command for more information on how to use this and arguments.
+Read the README or help command for more information on how to use this.
+
+## Backward Compatibility
+
+The GitHub Actions artifacts of this repository are consumed by [OPUS-ingest](https://github.com/Helsinki-NLP/OPUS-ingest). Do not make backward incompatible changes to them.
+
+Changing the file formats, directory structure, and file names **must** be avoided. If it is necessary to alter these, an accompanying pull request **must** be submitted to OPUS-ingest.
+
+See [tldr-pages corpus on OPUS-ingest](https://github.com/Helsinki-NLP/OPUS-ingest/tree/master/corpus/tldr-pages) for more details.
18 changes: 9 additions & 9 deletions README.md
@@ -12,13 +12,13 @@

## About

-This is a CLI application for parsing all tldr pages from the [tldr-pages/tldr](https://github.com/tldr-pages/tldr) repository, and creating a dataset that maps the translations in internationalized pages. The primary purpose is to provide an additional dataset for [OPUS](https://opus.nlpl.eu/), a collection of translated resources from the web, readily available in a standardized format for other tools or research.
+This is a CLI application for parsing all tldr pages from the [tldr-pages/tldr](https://github.com/tldr-pages/tldr) repository, and producing a dataset that maps the strings across localized pages. The primary motivation was to provide an additional corpus for [OPUS](https://opus.nlpl.eu/), a collection of translated resources from the web, readily available in standardized formats.

### What is OPUS?

-[OPUS](https://opus.nlpl.eu/) is public dataset of translated text on the web. All translations are derived from freely available and openly licensed sources, so the translations themselves are safe to use with minimal restrictions. These datasets are helpful for a variety of applications such as research and machine learning.
+OPUS is public dataset of translated text on the web. All translations are derived from freely available and openly licensed sources, so the translations themselves are safe to use with minimal restrictions. These datasets are helpful for a variety of applications such as research and machine learning.

-A notable project that uses the OPUS dataset is [LibreTranslate](https://libretranslate.com/), powered by [argos-translate](https://github.com/argosopentech/argos-translate/). It's a free, open-source, and self-hostable machine translation API that doesn't depend on third-party services. Now by contributing translations to tldr-pages, we're collectively providing more data that will be used to improve machine translations and support additional languages.
+A notable project that uses the OPUS corpuses is [LibreTranslate](https://libretranslate.com/), powered by [argos-translate](https://www.argosopentech.com/). It's a free, open-source, and self-hostable machine translation API that doesn't depend on third-party services. Now by translating tldr-pages, we're collectively contributing more data to improve open-source machine translations!

## Usage

@@ -32,27 +32,27 @@ git clone https://github.com/tldr-pages/tldr.git

### Execute tldr-translation-pairs-gen

-Once you have tldr-pages locally, you should be able to point tldr-translation-pairs-gen to the directory using the `--source` argument. This will output a file for every combination of languages to the `dataset/` directory, with all alignments that can be found between translated pages.
+Once you have tldr-pages locally, you can point tldr-translation-pairs-gen to the directory using the `--source` argument. This will output a file for every combination of languages to the `dataset/` directory, with all alignments that can be found between localized pages.

```sh
-tldr-translation-pairs-gen --source {{path/to/sources}}
+tldr-translation-pairs-gen --source {{path/to/tldr_dir}}
```

You can also pass the `--format` argument to specify a different output format. The supported file formats are TMX ([Translation Memory eXchange](https://en.wikipedia.org/wiki/Translation_Memory_eXchange)), XML, CSV, and JSON.

```sh
-tldr-translation-pairs-gen --source {{path/to/sources}} --format csv
+tldr-translation-pairs-gen --source {{path/to/tldr_dir}} --format csv
```

## Excluded Strings

-When generating the dataset, you'll find that not all strings are included. Due to how the project is structured, and the current translation workflow, there are instances where the order or number of examples differ. This results in the internationalized pages falling out of sync.
+When generating the dataset, you'll find that not all strings are included. Due to how the project is structured, and the current translation workflow, there are instances where the order or number of examples differ. This results in the localized pages falling out of sync.

-Each example in a page features two strings, the description of what the command does, and the command itself. To work around the aforementioned issue, we parse each example and use the command as an identifier.
+Each example in a page features two strings, the description of the command, and the command itself. To work around the aforementioned issue, we parse each example and use the command as an identifier.

To map strings between languages, we parse all examples, remove tokens between curly braces (i.e. `{{path/to/file}}`) as they can be internationalized, and then find the pairing example in the page of other languages if it exists.

-However, sometimes after removing the content between curly braces, two or more examples in the same page may have the same content because the only difference was the tokens. In these cases, we omit them from dataset as there is no way to unambiguously know which command is the pairing example.
+Sometimes after removing the content between curly braces, two or more examples in the same page have the same content because the only difference was the tokens. In these cases, we omit them from corpus as there's no way to unambiguously determine which command is the pairing example.

Here is a real-world example of the problem: the English version was modified after the French translation was made, so now the pages have fallen out of sync. If we made pairs using the index, we'd create mismatches.

2 changes: 1 addition & 1 deletion src/constants.ts
@@ -1 +1 @@
-export const VERSION = "0.2.0";
+export const VERSION = "0.2.1";
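
For reference, the matching approach described in the README's Excluded Strings section (strip the content of `{{...}}` tokens, use the remaining command as an identifier, and omit commands that become ambiguous) could be sketched roughly as below. This is only an illustration under stated assumptions: the `Example` type and the `normalizeCommand`, `indexByCommand`, and `pairExamples` helpers are hypothetical names, not taken from the tldr-translation-pairs-gen sources, and the real implementation may differ.

```typescript
// Hypothetical shape of a parsed tldr example; not the project's actual type.
interface Example {
  description: string;
  command: string;
}

// Empty every {{...}} placeholder so localized tokens do not affect matching.
function normalizeCommand(command: string): string {
  return command.replace(/\{\{.*?\}\}/g, "{{}}");
}

// Index examples by normalized command. A key that occurs more than once
// in the same page is ambiguous, so it is dropped from the index entirely.
function indexByCommand(examples: Example[]): Map<string, Example> {
  const index = new Map<string, Example>();
  const ambiguous = new Set<string>();
  for (const example of examples) {
    const key = normalizeCommand(example.command);
    if (ambiguous.has(key)) {
      continue;
    }
    if (index.has(key)) {
      index.delete(key);
      ambiguous.add(key);
      continue;
    }
    index.set(key, example);
  }
  return index;
}

// Pair the descriptions of examples whose normalized commands match
// unambiguously in both pages, regardless of their position in the page.
function pairExamples(source: Example[], target: Example[]): Array<[string, string]> {
  const targetIndex = indexByCommand(target);
  const pairs: Array<[string, string]> = [];
  indexByCommand(source).forEach((example, key) => {
    const match = targetIndex.get(key);
    if (match !== undefined) {
      pairs.push([example.description, match.description]);
    }
  });
  return pairs;
}
```

Because examples are keyed by command rather than by index, a page whose examples were reordered after translation still pairs correctly, while examples that differ only in their placeholders are left out of the result.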
