Akshayanti/cross-lingual-tools

cross-lingual-tools

Tools for working with cross-lingual data, all developed in-house. The repository contains the following folders; click on a link to see the contents of that folder.

  1. Root Directory

    Files that are not part of any project but can be used on their own as standalone files.

  2. Parallel Data

    Tools for checking the accuracy of alignments and the quality of parallel data.

  3. Tagset Converter

    Convert the PDT, Penn Treebank, Perseus and PDT-based Tamil tagsets into the UD tagset.

Root Directory

  1. langCodes.tsv

    TSV file containing the language codes for 134 languages in 4 major standards, sorted alphabetically by language name. The two columns are named Language and Standard Code; the second holds a comma-separated value ordered as ISO 639-1 code, ISO 639-2 code, ISO 639-3 code, WALS code.

    The following notations are used in the comma-separated values:

    Notation        Implication
    XXX             The list is too long to fit here
    abc [A, B, C]   abc is the inclusive code, along with the ones in brackets
    [A, B, C]       All the codes listed are used, each for a different dialect/variation of the language
    -               The language is not coded in this standard

    Information on WALS can be found here.
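
Because a bracketed list such as [A, B, C] contains commas of its own, splitting the Standard Code value on every comma would break it apart. The following is a minimal sketch of a bracket-aware reader, assuming the file has a header row with the column names given above. The function names, the dictionary keys and the sample row are hypothetical and are not taken from the repository.

```python
import csv
import io

# Order of the codes inside the Standard Code column, as described above.
# The key names themselves are an assumption made for this sketch.
STANDARDS = ("iso639_1", "iso639_2", "iso639_3", "wals")


def split_codes(cell):
    """Split a Standard Code cell on commas that are not inside [...]."""
    parts, current, depth = [], [], 0
    for ch in cell:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def read_lang_codes(handle):
    """Map each language name to a dict holding its four codes."""
    reader = csv.DictReader(handle, delimiter="\t")
    table = {}
    for row in reader:
        codes = split_codes(row["Standard Code"])
        table[row["Language"]] = dict(zip(STANDARDS, codes))
    return table


# Made-up sample row that follows the notation above (not copied from langCodes.tsv).
sample = "Language\tStandard Code\nExamplish\tex, exa, exa [exb, exc], -\n"
codes = read_lang_codes(io.StringIO(sample))
print(codes["Examplish"])
```

To read the real file, pass an open file object instead of the in-memory sample, for example `read_lang_codes(open("langCodes.tsv", encoding="utf-8", newline=""))`.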

  2. wals.py

    Python 3 script to:

    • Find the most similar languages to given language.
    • Find the centroid language of a given genus, i.e. a language most similar to other languages of the genus.
    • Find languages that are most dissimilar to any other language in the given genus.
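
The three tasks above can be illustrated with a small sketch. This README does not state how wals.py scores similarity, so the measure used here (the fraction of shared WALS features with equal values) is an assumption, as is reading "most dissimilar" as "lowest mean similarity to the rest of the genus". The language codes, feature names and function names are made up for illustration.

```python
def similarity(a, b):
    """Fraction of features defined for both languages that have equal values."""
    shared = [f for f in a if f in b and a[f] and b[f]]
    if not shared:
        return 0.0
    return sum(a[f] == b[f] for f in shared) / len(shared)


def mean_similarity(code, languages):
    """Average similarity of one language to every other language in the group."""
    others = [c for c in languages if c != code]
    if not others:
        return 0.0
    total = sum(similarity(languages[code], languages[c]) for c in others)
    return total / len(others)


def centroid(languages):
    """Language with the highest mean similarity to the rest of the group."""
    return max(languages, key=lambda c: mean_similarity(c, languages))


def most_dissimilar(languages, n):
    """The n languages with the lowest mean similarity to the rest of the group."""
    return sorted(languages, key=lambda c: mean_similarity(c, languages))[:n]


# Toy genus: hypothetical language codes mapped to hypothetical feature values.
genus = {
    "aaa": {"f1": "1", "f2": "1", "f3": "1"},
    "bbb": {"f1": "1", "f2": "1", "f3": "2"},
    "ccc": {"f1": "2", "f2": "1", "f3": "2"},
}
print(centroid(genus))
print(most_dissimilar(genus, 2))
```

In the toy data, bbb differs from each of the other two languages in only one feature, while aaa and ccc differ from each other in two, so bbb has the highest mean similarity.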

    List of Arguments (all compulsory):

    • -i or --input: Input file containing the WALS data in a tsv-format

    List of Positional Arguments, and the sub-arguments (Mutually-exclusive):

    • similarity: Display the WALS code and similarity scores for most similar languages to given input language's WALS code.

      Sub-Arguments     Function
      -c or --code      Input WALS code for the source language
      -n or --number    Number of languages to be displayed in the output
    • centroid: Display the WALS code and similarity scores for the centroid language of an input genus, i.e. a language most similar to other languages of the genus.

      Sub-Arguments     Function
      -g or --genus     Input genus to find the centroid for
    • dissimilarity: Display the WALS Code and similarity scores of the languages that are most dissimilar to any other language in the given genus.

      Sub-Arguments     Function
      -g or --genus     Input genus to find the most dissimilar languages in
      -n or --number    Number of languages to be displayed in the output

      The input file for the task can be downloaded from here.

    Usage:

    • python3 wals.py -i input_file similar -c <wals_code> -n <output_count>
    • python3 wals.py -i input_file centroid -g <genus_name>
    • python3 wals.py -i input_file dissimilar -g <genus_name> -n <output_count>
