cxt2vec

About

This repository contains code for vectorising contexts for use in models such as MTCue and LMCue.

Getting Started

Prerequisites

To run the software, install the following package (preferably in an Anaconda environment or similar):

sentence_transformers: pip install sentence-transformers

Installation

Clone the repo

git clone https://github.com/st-vincent1/cxt2vec.git

Navigate to the repository
```
cd cxt2vec
```

Run the embedding code:

python main.py --path [path to data] --dest_path [path to output embeddings] --suffixes [comma-separated suffixes of data to embed] --model [context model]
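
As a rough illustration of how the command-line template above could be parsed, here is a minimal sketch using `argparse`. The flag names are taken from the template; the function name `parse_args`, the help strings, and the choice to split `--suffixes` on commas are assumptions made for illustration, and the actual `main.py` may be written differently.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical parser mirroring the flags in the template; main.py may differ."""
    parser = argparse.ArgumentParser(description="Embed context files (sketch).")
    parser.add_argument("--path", required=True, help="folder containing the data")
    parser.add_argument("--dest_path", required=True, help="folder for the output embeddings")
    parser.add_argument(
        "--suffixes",
        required=True,
        # Turn "cxt,en" into ["cxt", "en"], dropping empty entries.
        type=lambda s: [part.strip() for part in s.split(",") if part.strip()],
        help="comma-separated suffixes of the files to embed",
    )
    parser.add_argument("--model", required=True, help="name of the context model")
    return parser.parse_args(argv)
```

Passing a list to `parse_args` makes the sketch easy to try without a shell, e.g. `parse_args(["--path", "data/en-pl", "--dest_path", "out", "--suffixes", "cxt,en", "--model", "minilm"])`.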

Usage

For a data folder which looks like this:

data
 |- en-pl
     |- train.en
     |- train.pl
     |- valid.en
     |- valid.pl
     |- test.en
     |- test.pl
     |- context
         |- valid.speaker_gender.cxt
         |- valid.formality.cxt
         |- valid.writer_names.cxt
         |- ...
         |- valid.0.en
         |- valid.1.en
         |- ...
         |- test.0.en
         |- ...
         |- train.speaker_gender.cxt
         |- train.plot.cxt
         |- ...
 |- en-de
 |- ...

The following command

python main.py --paths data/en-pl --dest_path data/en-pl/embeddings --suffixes cxt,en --model minilm

finds all contexts in data/en-pl/context that match the given suffixes, in order (cxt, then en), and embeds them into a single file per split (train/valid/test) per suffix. It also produces a .json file per split containing the metadata needed to interpret the embedding files in training code.
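
To make the matching step concrete, the sketch below groups context file names by split and then by suffix, keeping the suffix order that was requested. It assumes file names of the form `<split>.<name>.<suffix>`, as in the folder listing above; `group_contexts` is a hypothetical helper written for illustration, not a function from this repository.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_contexts(filenames, suffixes):
    """Group context file names by split, then by suffix (hypothetical helper)."""
    # Each split gets one list per requested suffix, in the order given.
    grouped = defaultdict(lambda: {suffix: [] for suffix in suffixes})
    for name in sorted(filenames):
        parts = PurePosixPath(name).name.split(".")
        if len(parts) < 3:
            continue  # not of the assumed form <split>.<name>.<suffix>
        split, suffix = parts[0], parts[-1]
        if suffix in suffixes:
            grouped[split][suffix].append(name)
    return dict(grouped)


files = [
    "valid.speaker_gender.cxt",
    "valid.formality.cxt",
    "valid.0.en",
    "valid.1.en",
    "test.0.en",
    "train.plot.cxt",
]
groups = group_contexts(files, ["cxt", "en"])
```

Here `groups["valid"]` holds two `cxt` files and two `en` files, each list being what would be embedded into one output file for that split and suffix.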

The result will therefore look like so:

data
 |- en-pl
     |- embeddings
         |- train.minilm.cxt.bin
         |- train.minilm.cxt.idx
         |- train.minilm.en.bin
         |- train.minilm.en.idx
         |- train.minilm.json
         |- valid.minilm.cxt.bin
         |- valid.minilm.cxt.idx
         |- valid.minilm.en.bin
         |- valid.minilm.en.idx
         |- valid.minilm.json
         |- test.minilm.cxt.bin
         |- test.minilm.cxt.idx
         |- test.minilm.en.bin
         |- test.minilm.en.idx
         |- test.minilm.json
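
The naming pattern in the listing above can be written down as a short sketch that builds the expected output paths for a run. The pattern `<split>.<model>.<suffix>.{bin,idx}` plus one `<split>.<model>.json` per split is read off the listing; `expected_outputs` is a hypothetical helper, and nothing here assumes anything about the contents of the files.

```python
from pathlib import PurePosixPath


def expected_outputs(dest_path, model, suffixes, splits=("train", "valid", "test")):
    """List the output paths implied by the listing above (hypothetical helper)."""
    dest = PurePosixPath(dest_path)
    outputs = []
    for split in splits:
        for suffix in suffixes:
            # One .bin/.idx pair per split and suffix.
            for ext in ("bin", "idx"):
                outputs.append(str(dest / f"{split}.{model}.{suffix}.{ext}"))
        # One metadata file per split.
        outputs.append(str(dest / f"{split}.{model}.json"))
    return outputs


paths = expected_outputs("data/en-pl/embeddings", "minilm", ["cxt", "en"])
```

A list like this can be compared against the destination folder after a run to spot missing files before starting training.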

Training models on this data

See the examples in, e.g., the [MTCue repository](https://github.com/st-vincent1/MTCue) for how to use the embeddings to train models.

License

Distributed under the MIT License. See LICENSE.txt for more information.
