3.0 #135

Merged: 470 commits (Feb 17, 2023)

Commits (470)
9e2156d
Corrected permissions
tiberiu44 Jan 7, 2021
c600029
Bugfix
tiberiu44 Jan 8, 2021
c2a3649
Added GPU support at runtime
tiberiu44 Jan 8, 2021
07243ae
Wrong config package
tiberiu44 Jan 8, 2021
7980d96
Refactoring
tiberiu44 Jan 8, 2021
61c5b3a
refactoring
tiberiu44 Jan 8, 2021
efdef5b
add lightning to dependencies
tiberiu44 Jan 8, 2021
8d9ed74
Dummy test
tiberiu44 Jan 8, 2021
2d68eda
Dummy test
tiberiu44 Jan 8, 2021
e595d51
Tweak
tiberiu44 Jan 8, 2021
e6c839f
Tweak
tiberiu44 Jan 8, 2021
f2df03e
Update test
tiberiu44 Jan 8, 2021
5476e0b
Test
tiberiu44 Jan 8, 2021
483f1f9
Finished loading for UD CONLL-U format
tiberiu44 Jan 8, 2021
0cb76de
Working on tagger
tiberiu44 Jan 9, 2021
72cdcfb
Work on tagger
tiberiu44 Jan 9, 2021
3987217
tagger training
tiberiu44 Jan 9, 2021
96a0e96
tagger training
tiberiu44 Jan 9, 2021
1dee270
tagger training
tiberiu44 Jan 9, 2021
6b8c8f6
Sync
tiberiu44 Jan 10, 2021
08e5d8f
Sync
tiberiu44 Jan 10, 2021
c7a2684
Sync
tiberiu44 Jan 10, 2021
47c76a7
Sync
tiberiu44 Jan 10, 2021
daf9d6b
Tagger working
tiberiu44 Jan 10, 2021
0e857f1
Better weight for aux loss
tiberiu44 Jan 10, 2021
25547a2
Better weight for aux loss
tiberiu44 Jan 11, 2021
cac1484
Added save and printing for tagger and shared options class
dumitrescustefan Jan 11, 2021
aa435b0
Multilanguage evaluation
tiberiu44 Jan 13, 2021
6f3598f
Saving multiple models
dumitrescustefan Jan 13, 2021
33531e2
Updated ignore list
tiberiu44 Jan 13, 2021
293e368
Merge branch '3.0' of https://github.com/adobe/NLP-Cube into 3.0
tiberiu44 Jan 13, 2021
78b1f1e
Added XLM-Roberta support
tiberiu44 Jan 13, 2021
09a7717
Using custom ro model
tiberiu44 Jan 13, 2021
ba42dff
Score update
dumitrescustefan Jan 13, 2021
e80575c
Bugfixing
tiberiu44 Jan 14, 2021
58ae614
Code refactor
tiberiu44 Jan 14, 2021
ad57f3b
Refactor
tiberiu44 Jan 14, 2021
31d8476
Added option to load external config
tiberiu44 Jan 14, 2021
c891dce
Added option to select LM-model from CLI or config
tiberiu44 Jan 14, 2021
73cd151
added option to overwrite config lm from CLI
tiberiu44 Jan 14, 2021
96ea7fb
Bugfix
tiberiu44 Jan 14, 2021
638b99f
Working on parser
tiberiu44 Jan 14, 2021
5bdab7c
Sync work on parser
tiberiu44 Jan 15, 2021
96c6b78
Parser working
tiberiu44 Jan 15, 2021
974f510
Removed load limit
tiberiu44 Jan 15, 2021
8b78026
Bugfix in evaluation
tiberiu44 Jan 15, 2021
9ca836b
Added bi-affine attention
tiberiu44 Jan 15, 2021
b89382d
Added experimental ChuLiuEdmonds tree decoding
tiberiu44 Jan 16, 2021
b9b3f66
Better config for parser and bugfix
tiberiu44 Jan 16, 2021
1d35d40
Added residuals to tagging
tiberiu44 Jan 16, 2021
7145b02
Model update
tiberiu44 Jan 17, 2021
d3b24da
Switched to AdamW optimizer
tiberiu44 Jan 18, 2021
499d3d3
Working on tokenizer
tiberiu44 Jan 18, 2021
64f5b4c
Working on tokenizer
tiberiu44 Jan 18, 2021
7bdf518
Training working - validation to do
tiberiu44 Jan 18, 2021
5b3fad5
Bugfix in language id
tiberiu44 Jan 18, 2021
d258ab3
Working on tokenization validation
tiberiu44 Jan 19, 2021
5b0601c
Tokenizer working
tiberiu44 Jan 19, 2021
888cf4f
YAML update
dumitrescustefan Jan 19, 2021
a8bc130
Merge branch '3.0' of https://github.com/adobe/NLP-Cube into 3.0
dumitrescustefan Jan 19, 2021
b178643
Bug in LMHelper
dumitrescustefan Jan 19, 2021
6e40594
Tagger is working
dumitrescustefan Jan 19, 2021
582f330
Tokenizer is working
tiberiu44 Jan 19, 2021
7c124ef
Merge branch '3.0' of https://github.com/adobe/NLP-Cube into 3.0
tiberiu44 Jan 19, 2021
da267af
bfix
tiberiu44 Jan 19, 2021
82c3442
bfix
tiberiu44 Jan 19, 2021
954c68a
Bugfix for bugfix :)
tiberiu44 Jan 19, 2021
4286610
Sync
tiberiu44 Jan 19, 2021
c0c01ca
Tokenizer worker
tiberiu44 Jan 20, 2021
dcc7a6b
Tagger working
dumitrescustefan Jan 24, 2021
4060387
Trainer updates
dumitrescustefan Jan 24, 2021
44b126d
Trainer process now working
dumitrescustefan Jan 26, 2021
dba40ce
Added .DS_Store
rscctest Jan 27, 2021
5211ea9
Added datasets for Compound Word Expander and Lemmatizer
rscctest Jan 27, 2021
d974ff1
Added collate function for lemma+compound
rscctest Jan 27, 2021
80487ca
Added training and validation step
rscctest Jan 27, 2021
c53bc2e
Updated config for Lemmatizer
rscctest Jan 28, 2021
a2d8345
Minor fixes
rscctest Jan 28, 2021
479e6fb
Removed duplicate entries from lemma and cwe
rscctest Jan 28, 2021
7e766ef
Added training support for lemmatizer
rscctest Jan 28, 2021
fe41cb0
Removed debug directives
rscctest Jan 28, 2021
f87f920
Lemmatizer in testing phase
rscctest Jan 28, 2021
39799e2
removed unused line
rscctest Jan 28, 2021
25387ae
Bugfix in Lemma dataset
rscctest Jan 28, 2021
11734e8
Corrected validation issue with gs labels being sent to the forward m…
rscctest Jan 28, 2021
d27571c
Lemmatizer training done
rscctest Jan 28, 2021
f4b3f29
Compound word expander ready
rscctest Jan 28, 2021
0868000
Sync
rscctest Jan 28, 2021
6befc0f
Added support for FastText, Transformers and Languasito LM models
rscctest Jan 29, 2021
c95fb5d
Added multi-lm support for tokenizer
rscctest Jan 29, 2021
6bd843e
Added support for multiword tokens
rscctest Jan 29, 2021
04569a5
Sync
rscctest Jan 29, 2021
57778ac
Bugfix in evaluation
rscctest Jan 29, 2021
c367702
Added Languasito as a subpackage
rscctest Jan 29, 2021
2c83d74
Added path to local Languasito
rscctest Jan 29, 2021
094c268
Bugfixing all around
rscctest Jan 29, 2021
08ff6b1
Removed debug printing
rscctest Jan 29, 2021
f534ecb
Bugfix for no-space languages that actually contain spaces :)
rscctest Jan 29, 2021
e27f3bd
Bugfix for no-space languages that actually contain spaces :)
rscctest Jan 29, 2021
ac69ef9
Fixed GPU support
tiberiu44 Jan 30, 2021
1f8cbd7
Biaffine transform for LAS and relative head location (RHL) for UAS
tiberiu44 Jan 31, 2021
cbfd546
Bugfix
tiberiu44 Jan 31, 2021
a8f4714
Tweaks
tiberiu44 Jan 31, 2021
85e16d0
moved rhl to lower layer
tiberiu44 Jan 31, 2021
be89087
Added configurable option for RHL
tiberiu44 Feb 1, 2021
dac1fb7
Safenet for spaces in languages that should use no spaces
rscctest Feb 1, 2021
2b64dc5
Better defaults
tiberiu44 Feb 1, 2021
8999afc
Merge branch '3.0' of https://github.com/adobe/NLP-Cube into 3.0
tiberiu44 Feb 1, 2021
e3ca3ea
Sync
tiberiu44 Feb 2, 2021
4ceb712
Cleanup parser
tiberiu44 Feb 2, 2021
d8998c9
Bilinear xpos and attrs
rscctest Feb 2, 2021
9c63fe1
Added Biaffine module from Stanza
rscctest Feb 2, 2021
25ba112
Tagger with reduced number of parameters:
rscctest Feb 2, 2021
c65306c
Parser with conditional attrs
rscctest Feb 2, 2021
39c79c4
Working on tokenizer runtime
rscctest Feb 2, 2021
9c4a4aa
Tokenizer process 90% done
tiberiu44 Feb 2, 2021
1764995
Added runtime for parser, tokenizer and tagger
tiberiu44 Feb 2, 2021
a9fa45d
Added quick test for runtime
tiberiu44 Feb 2, 2021
2acbfdc
Test for e2e
tiberiu44 Feb 3, 2021
d69da61
Merge
tiberiu44 Feb 4, 2021
a93d852
Added support for multiple word embeddings at the same time
Feb 4, 2021
896d2f4
Merge branch '3.0' of https://github.com/adobe/NLP-Cube into 3.0
Feb 4, 2021
bf03444
Bugfix
Feb 4, 2021
7c93955
Added multiple word representations for tokenizer
rscctest Feb 5, 2021
8405b6c
moved mask_concat to utils.py
rscctest Feb 5, 2021
216eb19
Added XPOS prediction to pipeline
tiberiu44 Feb 6, 2021
32884c7
Bugfix in tokenizer shifted word embeddings
tiberiu44 Feb 6, 2021
23f9660
Using Languasito tokenizer for HF tokenization
tiberiu44 Feb 6, 2021
895e9c8
Bugfix
tiberiu44 Feb 7, 2021
d89962d
Bugfixing
tiberiu44 Feb 7, 2021
525d04d
Merge branch '3.0' of https://github.com/adobe/NLP-Cube into 3.0
tiberiu44 Feb 7, 2021
b66d26d
Bugfixing
tiberiu44 Feb 7, 2021
88e34ed
Bugfix
tiberiu44 Feb 7, 2021
25c7fb2
Runtime fixing
tiberiu44 Feb 7, 2021
45c2c21
Sync
rscctest Feb 8, 2021
6f612df
Added spa for FT and Languasito
tiberiu44 Feb 8, 2021
ecb1367
Added spa for FT and Languasito
tiberiu44 Feb 8, 2021
37152ca
Minor tweaks
tiberiu44 Feb 8, 2021
5aaa22c
Added configuration for RNN layers
tiberiu44 Feb 8, 2021
7d91756
Bugfix for spa
rscctest Feb 9, 2021
7dbefc1
HF runtime fix
tiberiu44 Feb 9, 2021
fec6a2d
Mixed test fasttext+transformer
tiberiu44 Feb 10, 2021
3b187a2
Added word reconstruction and MHA
rscctest Feb 11, 2021
5a86b25
Sync
rscctest Feb 11, 2021
205852c
Bugfix
rscctest Feb 11, 2021
5a3e9ed
bugfix
rscctest Feb 11, 2021
b20309f
Added masked attention
rscctest Feb 11, 2021
04f1d66
Sync
tiberiu44 Feb 12, 2021
e582eaa
Added test for runtime
tiberiu44 Feb 12, 2021
3a23314
Bugfix in mask values
tiberiu44 Feb 12, 2021
e5a7391
Updated test
tiberiu44 Feb 12, 2021
969ac73
Added full mask dropout
tiberiu44 Feb 13, 2021
e5a0212
Added resume option
tiberiu44 Feb 13, 2021
a5356f4
Removed useless printouts
tiberiu44 Feb 13, 2021
e1706bc
Removed useless printouts
tiberiu44 Feb 13, 2021
87038c1
Switched to eval at runtime
tiberiu44 Feb 14, 2021
8961f8e
multiprocessing added
dumitrescustefan Feb 14, 2021
033bf88
Added full mask dropout for word decoder
tiberiu44 Feb 15, 2021
673dec5
Bugfix
tiberiu44 Feb 15, 2021
e3e33de
Residual
tiberiu44 Feb 16, 2021
c5f9c16
Added lexical-contextual cosine loss
tiberiu44 Feb 16, 2021
cf23762
Removed full mask dropout from WordDecoder
tiberiu44 Feb 16, 2021
74bde22
Bugfix
tiberiu44 Feb 16, 2021
b4697d9
Training script generation update
dumitrescustefan Feb 18, 2021
6a4f93b
Added residual
tiberiu44 Feb 22, 2021
7363e78
Updated languasito to pickle tokenized lines
dumitrescustefan Feb 22, 2021
139a3bb
Updated languasito to pickle tokenized lines
dumitrescustefan Feb 22, 2021
fe7e087
Updated languasito to pickle tokenized lines
dumitrescustefan Feb 22, 2021
c95d60d
Not training for seq len > max_seq_len
rscctest Feb 22, 2021
197b213
Added seq limits for collates
rscctest Feb 22, 2021
685555d
Passing seq limits from collate to tokenizer
rscctest Feb 22, 2021
023b67a
Skipping complex parsing
rscctest Feb 22, 2021
96c0eca
Working on word decomposer
rscctest Feb 24, 2021
93b1717
Model update
tiberiu44 Feb 25, 2021
3396702
Sync
tiberiu44 Feb 25, 2021
f0ce55e
Bugfix
tiberiu44 Feb 25, 2021
d621e7e
Bugfix
tiberiu44 Feb 25, 2021
ec154da
Bugfix
tiberiu44 Feb 25, 2021
a8e090a
Using all reprs
tiberiu44 Mar 2, 2021
3176356
Dropped immediate context
tiberiu44 Mar 5, 2021
17cf390
Multi train script added
dumitrescustefan Mar 6, 2021
f9f5e12
Changed gpu parameter type to string, for multiple gpus int failed
dumitrescustefan Mar 11, 2021
affd46c
Updated pytorch_lightning callback method to work with newer version
dumitrescustefan Mar 13, 2021
9fc2ed5
Updated pytorch_lightning callback method to work with newer version
dumitrescustefan Mar 13, 2021
a0bd6a9
Transparently pass PL args from the command line; skip over empty com…
dumitrescustefan Mar 13, 2021
4d67bb3
Fix typo
dumitrescustefan Mar 22, 2021
775a949
Refactoring and on the way to working API
dumitrescustefan Apr 2, 2021
879a965
API load working
dumitrescustefan Apr 3, 2021
c988509
Partial _call_ working
dumitrescustefan Apr 3, 2021
76ca34f
Partial _call_ working
dumitrescustefan Apr 3, 2021
9829a36
Added partly working api and refactored everything back to cube/. Com…
dumitrescustefan May 8, 2021
32b20d6
api is working
rscctest Aug 5, 2021
17fdca6
Fixing api
rscctest Aug 5, 2021
bec1b1c
Updated readme
rscctest Aug 5, 2021
5669a9e
Update Readme to include flavours
rscctest Aug 5, 2021
03ad480
Device support
rscctest Aug 5, 2021
c11888e
api update
rscctest Aug 5, 2021
eecbb3b
Updated package
rscctest Aug 5, 2021
e1e9b0b
Tweak + results
rscctest Aug 5, 2021
51f5ed2
Clarification
rscctest Aug 5, 2021
8a373f6
Test update
rscctest Aug 8, 2021
fb21a67
Update
rscctest Aug 8, 2021
00b8a66
Sync
rscctest Aug 9, 2021
9cc59d8
Update README
rscctest Aug 9, 2021
e08ce39
Bugfixing
rscctest Aug 9, 2021
e4c365a
Bugfix and api update
rscctest Aug 9, 2021
2507c82
Fixed compound
rscctest Aug 9, 2021
fb7069e
Evaluation update
rscctest Aug 9, 2021
9f72f66
Bugfix
tiberiu44 Aug 9, 2021
2a948fc
Merge branch '3.0' of https://github.com/adobe/NLP-Cube into 3.0
tiberiu44 Aug 9, 2021
deb8110
Package update
tiberiu44 Aug 9, 2021
a93176c
Bugfix for large sentences
rscctest Aug 11, 2021
a3bbc83
Pip package update
rscctest Aug 11, 2021
fa1c60f
Corrected spanish evaluation
rscctest Aug 11, 2021
b29fc89
Package version update
rscctest Aug 11, 2021
9aadd75
Fixed tokenization issues on transformers
rscctest Aug 11, 2021
d6e7c3e
Removed pinned memory
tiberiu44 Aug 11, 2021
e60526e
Bugfix for GPU tensors
tiberiu44 Aug 11, 2021
b82b380
Update package version
tiberiu44 Aug 11, 2021
b7c241a
Automatically detecting hidden state size
tiberiu44 Aug 12, 2021
909327e
Automatically detecting hidden state size
tiberiu44 Aug 12, 2021
9d75f67
Automatically detecting hidden state size
tiberiu44 Aug 12, 2021
cc49b1b
Sync
tiberiu44 Aug 12, 2021
8e000c2
Evaluation update
tiberiu44 Aug 12, 2021
5d48ddd
Package update
tiberiu44 Aug 12, 2021
b61bc16
Bugfix
rscctest Aug 13, 2021
9569e32
Bugfixing
tiberiu44 Aug 15, 2021
d95c8a5
Package version update
tiberiu44 Aug 15, 2021
4933255
Merge branch '3.0' of github.com:adobe/NLP-Cube into 3.0
tiberiu44 Aug 15, 2021
454366d
Bugfix
tiberiu44 Aug 15, 2021
65c1d09
Package version update
tiberiu44 Aug 15, 2021
8ab611e
Update evaluation for Italian
tiberiu44 Aug 15, 2021
b561d7f
tentative support torchtext>=0.9.0 (#127)
KoichiYasuoka Aug 15, 2021
5254d3b
Update package dependencies
tiberiu44 Aug 15, 2021
b799812
Merge branch '3.0' of github.com:adobe/NLP-Cube into 3.0
tiberiu44 Aug 15, 2021
4f2b854
RC
tiberiu44 Aug 15, 2021
0717fe4
Merge branch 'master' into 3.0
tiberiu44 Aug 15, 2021
2f5a664
Dummy word embeddings
tiberiu44 Feb 1, 2022
275b75a
Update params
tiberiu44 Feb 1, 2022
7d60005
Better dropout values
tiberiu44 Feb 1, 2022
40f2b89
Skipping long words
tiberiu44 Feb 1, 2022
09e2ea0
Skipping long words
tiberiu44 Feb 1, 2022
5fc7060
dummy we -> float
tiberiu44 Feb 1, 2022
c1e8d65
Added gradient clipping
tiberiu44 Feb 1, 2022
3b238a2
Update tokenizer
tiberiu44 Feb 2, 2022
ed8706f
Update tokenizer
tiberiu44 Feb 2, 2022
2f30c4d
Sync
tiberiu44 Feb 3, 2022
ca3c504
DCWE
tiberiu44 Feb 8, 2022
05e35df
Working on DCWE
tiberiu44 Feb 10, 2022
e2a084c
Merge branch 'master' into 3.0
tiberiu44 Feb 17, 2023
Files changed
5 changes: 1 addition & 4 deletions cube/api.py
@@ -126,10 +126,7 @@ def __call__(self, text: Union[str, Document], flavour: Optional[str] = None):
self._lm_helper.apply(doc)
self._parser.process(doc, self._parser_collate, num_workers=0)
self._lemmatizer.process(doc, self._lemmatizer_collate, num_workers=0)
for seq in doc.sentences:
for w in seq.words:
if w.upos =='PUNCT':
w.lemma = w.word

return doc


30 changes: 27 additions & 3 deletions cube/io_utils/config.py
@@ -87,7 +87,10 @@ def __init__(self, filename=None, verbose=False):
self.cnn_filter = 512
self.lang_emb_size = 100
self.cnn_layers = 5
self.external_proj_size = 300
self.rnn_size = 50
self.rnn_layers = 2
self.external_proj_size = 2

self.no_space_lang = False

if filename is None:
@@ -139,9 +142,10 @@ def __init__(self, filename=None, verbose=False):
self.head_size = 100
self.label_size = 200
self.lm_model = 'xlm-roberta-base'
self.external_proj_size = 300
self.external_proj_size = 2
self.rhl_win_size = 2
self.rnn_size = 50
self.rnn_size = 200

self.rnn_layers = 3

self._valid = True
@@ -275,6 +279,26 @@ def __init__(self, filename=None, verbose=False):
self.load(filename)


class DCWEConfig(Config):
def __init__(self, filename=None, verbose=False):
super().__init__()
self.char_emb_size = 256
self.case_emb_size = 32
self.num_filters = 512
self.kernel_size = 5
self.lang_emb_size = 32
self.num_layers = 8
self.output_size = 300 # this will be automatically updated at training time, so do not change

if filename is None:
if verbose:
sys.stdout.write("No configuration file supplied. Using default values.\n")
else:
if verbose:
sys.stdout.write("Reading configuration file " + filename + " \n")
self.load(filename)


class GDBConfig(Config):
def __init__(self, filename=None, verbose=False):
super().__init__()
83 changes: 83 additions & 0 deletions cube/networks/dcwe.py
@@ -0,0 +1,83 @@
import torch
import torch.nn as nn
import pytorch_lightning as pl
from typing import *
import sys

sys.path.append('')
from cube.networks.modules import WordGram, LinearNorm
from cube.io_utils.encodings import Encodings
from cube.io_utils.config import DCWEConfig


class DCWE(pl.LightningModule):
encodings: Encodings
config: DCWEConfig

def __init__(self, config: DCWEConfig, encodings: Encodings):
super(DCWE, self).__init__()
self._config = config
self._encodings = encodings
self._wg = WordGram(num_chars=len(encodings.char2int),
num_langs=encodings.num_langs,
num_layers=config.num_layers,
num_filters=config.num_filters,
char_emb_size=config.lang_emb_size,
case_emb_size=config.case_emb_size,
lang_emb_size=config.lang_emb_size
)
self._output_proj = LinearNorm(config.num_filters // 2, config.output_size, w_init_gain='linear')
self._improve = 0
self._best_loss = 9999

def forward(self, x_char, x_case, x_lang, x_mask, x_word_len):
pre_proj = self._wg(x_char, x_case, x_lang, x_mask, x_word_len)
proj = self._output_proj(pre_proj)
return proj

def _get_device(self):
if self._output_proj.linear_layer.weight.device.type == 'cpu':
return 'cpu'
return '{0}:{1}'.format(self._output_proj.linear_layer.weight.device.type,
str(self._output_proj.linear_layer.weight.device.index))

def configure_optimizers(self):
return torch.optim.AdamW(self.parameters())

def training_step(self, batch, batch_idx):
x_char = batch['x_char']
x_case = batch['x_case']
x_lang = batch['x_lang']
x_word_len = batch['x_word_len']
x_mask = batch['x_mask']
y_target = batch['y_target']
y_pred = self.forward(x_char, x_case, x_lang, x_mask, x_word_len)
loss = torch.mean((y_pred - y_target) ** 2)
return loss

def validation_step(self, batch, batch_idx):
x_char = batch['x_char']
x_case = batch['x_case']
x_lang = batch['x_lang']
x_word_len = batch['x_word_len']
x_mask = batch['x_mask']
y_target = batch['y_target']
y_pred = self.forward(x_char, x_case, x_lang, x_mask, x_word_len)
loss = torch.mean((y_pred - y_target) ** 2)
return {'loss': loss.detach().cpu().numpy()[0]}

def validation_epoch_end(self, outputs: List[Any]) -> None:
mean_loss = sum([output['loss'] for output in outputs])
mean_loss /= len(outputs)
self.log('val/loss', mean_loss)
self.log('val/early_meta', self._improve)

def save(self, path):
torch.save(self.state_dict(), path)

def load(self, model_path: str, device: str = 'cpu'):
self.load_state_dict(torch.load(model_path, map_location='cpu')['state_dict'])
self.to(device)
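For context, the new DCWE module regresses a character-level word representation onto a target word embedding with a plain mean-squared-error loss (see training_step above). Below is a minimal standalone sketch of that objective; ToyCharEncoder, the tensor sizes and the optimizer settings are illustrative stand-ins for the repository's WordGram encoder and config values, not the actual API.

```python
import torch
import torch.nn as nn

# Stand-in for the repository's WordGram encoder: maps padded character ids
# of shape (batch, max_word_len) to a fixed-size word representation.
class ToyCharEncoder(nn.Module):
    def __init__(self, num_chars=100, char_emb_size=32, hidden_size=64):
        super().__init__()
        self.emb = nn.Embedding(num_chars, char_emb_size, padding_idx=0)
        self.rnn = nn.GRU(char_emb_size, hidden_size, batch_first=True)

    def forward(self, x_char):
        h, _ = self.rnn(self.emb(x_char))
        return h[:, -1, :]  # last hidden state as the word vector


encoder = ToyCharEncoder()
proj = nn.Linear(64, 300)  # project to the target embedding size
optim = torch.optim.AdamW(list(encoder.parameters()) + list(proj.parameters()))

x_char = torch.randint(1, 100, (8, 12))  # 8 words, up to 12 characters each
y_target = torch.randn(8, 300)           # e.g. pre-computed word embeddings

y_pred = proj(encoder(x_char))
loss = torch.mean((y_pred - y_target) ** 2)  # same MSE objective as DCWE
loss.backward()
optim.step()
```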



21 changes: 21 additions & 0 deletions cube/networks/lm.py
@@ -211,6 +211,27 @@ def apply_raw(self, batch):
pass


class LMHelperDummy(LMHelper):
def __init__(self, device: str = 'cpu', model: str = None):
pass

def get_embedding_size(self):
return [1]

def apply(self, document: Document):
for ii in tqdm.tqdm(range(len(document.sentences)), desc="Pre-computing embeddings", unit="sent"):
for jj in range(len(document.sentences[ii].words)):
document.sentences[ii].words[jj].emb = [[1.0]]

def apply_raw(self, batch):
embeddings = []
for ii in range(len(batch)):
c_emb = []
for jj in range(len(batch[ii])):
c_emb.append([1.0])
embeddings.append(c_emb)
return embeddings
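The LMHelperDummy added here lets the pipeline run without any real language model by giving every word a constant one-dimensional embedding (get_embedding_size returns [1]). A rough illustration of what apply_raw produces; the batch format below is an assumed list of tokenized sentences, not taken from the repository:

```python
# Hypothetical input: a list of tokenized sentences.
batch = [["A", "dummy", "sentence"], ["Another", "one"]]

# Mirrors LMHelperDummy.apply_raw: every token gets the constant vector [1.0].
embeddings = [[[1.0] for _token in sentence] for sentence in batch]

print(embeddings)  # [[[1.0], [1.0], [1.0]], [[1.0], [1.0]]]
```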

if __name__ == "__main__":
from ipdb import set_trace

7 changes: 4 additions & 3 deletions cube/networks/modules.py
@@ -427,9 +427,10 @@ def __init__(self, num_chars: int, num_langs: int, num_filters=512, char_emb_siz
super(WordGram, self).__init__()
NUM_FILTERS = num_filters
self._num_filters = NUM_FILTERS
self._lang_emb = nn.Embedding(num_langs + 1, lang_emb_size)
self._tok_emb = nn.Embedding(num_chars + 1, char_emb_size)
self._case_emb = nn.Embedding(4, case_emb_size)
self._lang_emb = nn.Embedding(num_langs + 1, lang_emb_size, padding_idx=0)
self._tok_emb = nn.Embedding(num_chars + 3, char_emb_size, padding_idx=0)
self._case_emb = nn.Embedding(4, case_emb_size, padding_idx=0)

self._num_layers = num_layers
convolutions_char = []
cs_inp = char_emb_size + lang_emb_size + case_emb_size
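The WordGram change above mostly consists of adding padding_idx=0 to the character, case and language embeddings (and reserving two extra character ids). As a quick reminder of what padding_idx does, here is a small self-contained check; the sizes are arbitrary:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4, padding_idx=0)
ids = torch.tensor([[0, 3, 5, 0]])  # 0 is the padding id

out = emb(ids)
print(out[0, 0])            # the padding row is all zeros

out.sum().backward()
print(emb.weight.grad[0])   # and it receives no gradient updates
```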
22 changes: 14 additions & 8 deletions cube/networks/parser.py
@@ -76,7 +76,8 @@ def __init__(self, config: ParserConfig, encodings: Encodings, language_codes: [
self._upos_emb = nn.Embedding(len(encodings.upos2int), 64)

self._rnn = nn.LSTM(NUM_FILTERS // 2 + config.lang_emb_size + config.external_proj_size, config.rnn_size,
num_layers=config.rnn_layers, batch_first=True, bidirectional=True, dropout=0.33)
num_layers=config.rnn_layers, batch_first=True, bidirectional=True, dropout=0.1)


self._pre_out = LinearNorm(config.rnn_size * 2 + config.lang_emb_size, config.pre_parser_size)
# self._head_r1 = LinearNorm(config.pre_parser_size, config.head_size)
@@ -137,9 +138,10 @@ def forward(self, X):
for ii in range(len(x_word_emb_packed)):
we = unpack(x_word_emb_packed[ii], sl, x_sents.shape[1], self._get_device())
if word_emb_ext is None:
word_emb_ext = self._ext_proj[ii](we.float())
word_emb_ext = self._ext_proj[ii](we)
else:
word_emb_ext = word_emb_ext + self._ext_proj[ii](we.float())
word_emb_ext = word_emb_ext + self._ext_proj[ii](we)


word_emb_ext = word_emb_ext / len(x_word_emb_packed)
word_emb_ext = torch.tanh(word_emb_ext)
@@ -153,7 +155,8 @@

word_emb = self._word_emb(x_sents)

x = mask_concat([word_emb, char_emb, word_emb_ext], 0.33, self.training, self._get_device())
x = mask_concat([word_emb, char_emb, word_emb_ext], 0.1, self.training, self._get_device())


x = torch.cat([x, lang_emb[:, 1:, :]], dim=-1)
# prepend root
@@ -172,7 +175,8 @@
res = tmp
else:
res = res + tmp
x = torch.dropout(tmp, 0.2, self.training)
x = torch.dropout(tmp, 0.1, self.training)

cnt += 1
if cnt == self._config.aux_softmax_location:
hidden = torch.cat([x + res, lang_emb], dim=1)
@@ -184,7 +188,8 @@
# aux tagging
lang_emb = lang_emb.permute(0, 2, 1)
hidden = hidden.permute(0, 2, 1)[:, 1:, :]
pre_morpho = torch.dropout(torch.tanh(self._pre_morpho(hidden)), 0.33, self.training)
pre_morpho = torch.dropout(torch.tanh(self._pre_morpho(hidden)), 0.1, self.training)

pre_morpho = torch.cat([pre_morpho, lang_emb[:, 1:, :]], dim=2)
upos = self._upos(pre_morpho)
if gs_upos is None:
@@ -200,11 +205,12 @@
word_emb_ext = torch.cat(
[torch.zeros((word_emb_ext.shape[0], 1, self._config.external_proj_size), device=self._get_device(),
dtype=torch.float), word_emb_ext], dim=1)
x = mask_concat([x_parse, word_emb_ext], 0.33, self.training, self._get_device())
x = torch.cat([x_parse, word_emb_ext], dim=-1) #mask_concat([x_parse, word_emb_ext], 0.1, self.training, self._get_device())
x = torch.cat([x, lang_emb], dim=-1)
output, _ = self._rnn(x)
output = torch.cat([output, lang_emb], dim=-1)
pre_parsing = torch.dropout(torch.tanh(self._pre_out(output)), 0.33, self.training)
pre_parsing = torch.dropout(torch.tanh(self._pre_out(output)), 0.1, self.training)

# h_r1 = torch.tanh(self._head_r1(pre_parsing))
# h_r2 = torch.tanh(self._head_r2(pre_parsing))
# l_r1 = torch.tanh(self._label_r1(pre_parsing))
7 changes: 6 additions & 1 deletion cube/networks/tagger.py
@@ -1,6 +1,9 @@
import sys

sys.path.append('')
import os, yaml


os.environ["TOKENIZERS_PARALLELISM"] = "false"
import pytorch_lightning as pl
import torch.nn as nn
@@ -14,6 +17,7 @@
from cube.networks.utils import MorphoCollate, MorphoDataset, unpack, mask_concat
from cube.networks.modules import WordGram


class Tagger(pl.LightningModule):
def __init__(self, config: TaggerConfig, encodings: Encodings, language_codes: [] = None, ext_word_emb=0):
super().__init__()
@@ -276,7 +280,8 @@ def validation_epoch_end(self, outputs):
# print("\n\n\n", upos_ok / total, xpos_ok / total, attrs_ok / total,
# aupos_ok / total, axpos_ok / total, aattrs_ok / total, "\n\n\n")

def load(self, model_path:str, device: str = 'cpu'):
def load(self, model_path: str, device: str = 'cpu'):

self.load_state_dict(torch.load(model_path, map_location='cpu')['state_dict'])
self.to(device)

27 changes: 22 additions & 5 deletions cube/networks/tokenizer.py
@@ -39,8 +39,8 @@ def __init__(self, config: TokenizerConfig, encodings: Encodings, language_codes
conv_layer = nn.Sequential(
ConvNorm(cs_inp,
NUM_FILTERS,
kernel_size=5, stride=1,
padding=2,
kernel_size=3, stride=1,
padding=1,
dilation=1, w_init_gain='tanh'),
nn.BatchNorm1d(NUM_FILTERS))
conv_layers.append(conv_layer)
@@ -49,7 +49,13 @@
self._wg = WordGram(len(encodings.char2int), num_langs=encodings.num_langs)
self._lang_emb = nn.Embedding(encodings.num_langs + 1, config.lang_emb_size, padding_idx=0)
self._spa_emb = nn.Embedding(3, 16, padding_idx=0)
self._output = LinearNorm(NUM_FILTERS // 2 + config.lang_emb_size, 5)
self._rnn = nn.LSTM(NUM_FILTERS // 2 + config.lang_emb_size,
config.rnn_size,
num_layers=config.rnn_layers,
bidirectional=True,
batch_first=True)
self._output = LinearNorm(config.rnn_size * 2, 5)


ext2int = []
for input_size in self._ext_word_emb:
@@ -103,20 +109,29 @@ def forward(self, batch):
half = self._config.cnn_filter // 2
res = None
cnt = 0

skip = None
for conv in self._convs:
conv_out = conv(x)
tmp = torch.tanh(conv_out[:, :half, :]) * torch.sigmoid((conv_out[:, half:, :]))
if res is None:
res = tmp
else:
res = res + tmp
x = torch.dropout(tmp, 0.2, self.training)
x = torch.dropout(tmp, 0.1, self.training)
cnt += 1
if cnt != self._config.cnn_layers:
if skip is not None:
x = x + skip
skip = x

x = torch.cat([x, x_lang], dim=1)
x = x + res
x = torch.cat([x, x_lang], dim=1)
x = x.permute(0, 2, 1)

x, _ = self._rnn(x)

return self._output(x)

def validation_step(self, batch, batch_idx):
@@ -297,7 +312,9 @@ def process(self, raw_text, collate: TokenCollate, batch_size=32, num_workers: i
return d

def configure_optimizers(self):
return torch.optim.AdamW(self.parameters())
optimizer = torch.optim.AdamW(self.parameters(), lr=1e-3, weight_decay=1e-4)
return optimizer


def _compute_early_stop(self, res):
for lang in res:
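One notable change in this file is that the tokenizer now runs the gated-convolution features through a bidirectional LSTM before the 5-way per-character output layer, instead of projecting the CNN features directly. A minimal shape-flow sketch of that new head follows; the sizes stand in for cnn_filter, lang_emb_size, rnn_size and rnn_layers and are assumptions, not the repository's defaults:

```python
import torch
import torch.nn as nn

batch, seq_len = 2, 40
feat_size = 256 + 100        # stand-in for cnn_filter // 2 + lang_emb_size
rnn_size, rnn_layers = 200, 2

rnn = nn.LSTM(feat_size, rnn_size, num_layers=rnn_layers,
              bidirectional=True, batch_first=True)
output = nn.Linear(rnn_size * 2, 5)   # 5 per-character tokenization labels

x = torch.randn(batch, feat_size, seq_len)   # conv output: (batch, channels, time)
x = x.permute(0, 2, 1)                       # LSTM expects (batch, time, channels)
h, _ = rnn(x)                                # (batch, time, 2 * rnn_size)
logits = output(h)                           # (batch, time, 5)
print(logits.shape)                          # torch.Size([2, 40, 5])
```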
2 changes: 2 additions & 0 deletions cube/networks/utils.py
@@ -106,6 +106,8 @@ def __init__(self, document: Document, for_training=True):
word = w.word
lemma = w.lemma
upos = w.upos
if len(word) > 25:
continue

key = (word, lang_id, upos)
if key not in lookup or for_training is False: