We released a major update and recompiled the corpus. First, the data now include documents issued through the end of 2023. Second, some new variables were added. Third, we provided morphosyntactic tagging in CoNLL-U format.
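For readers new to CoNLL-U, the following is a minimal sketch of reading such tagging with the Python standard library. It relies only on the general CoNLL-U conventions (ten tab-separated columns per token, comment lines starting with "#", a blank line between sentences). The function name parse_conllu and the sample sentence are illustrative assumptions and are not taken from the corpus files.

```python
# Column names follow the CoNLL-U specification (ten tab-separated fields).
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        tokens.append(dict(zip(CONLLU_FIELDS, columns)))
    if tokens:
        sentences.append(tokens)
    return sentences


# Hypothetical sample, made up for illustration only.
SAMPLE = (
    "# text = Закон принят.\n"
    "1\tЗакон\tзакон\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj:pass\t_\t_\n"
    "2\tпринят\tпринять\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

sentences = parse_conllu(SAMPLE)
print(len(sentences), [token["upos"] for token in sentences[0]])
```

A dedicated CoNLL-U library may be more convenient for large files; the sketch only shows the shape of the format.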
We updated the corpus of the most complicated sentences after fixing a bug that corrupted some of the sentences. Note that, for technical reasons, every number more than three digits long is replaced with "999".
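To make the effect of the number replacement concrete, here is a small sketch of an equivalent normalization. It is an assumption about what the rule looks like when applied to raw text, not the code used to build the corpus; the function name mask_long_numbers and the sample string are hypothetical.

```python
import re

# Any run of four or more digits counts as "more than three digits long".
LONG_NUMBER = re.compile(r"\d{4,}")


def mask_long_numbers(text):
    """Replace every run of four or more digits with the placeholder 999."""
    return LONG_NUMBER.sub("999", text)


# Hypothetical sample sentence, made up for illustration.
masked = mask_long_numbers("Федеральный закон от 2006 года № 152")
print(masked)
```

In practice this means that "999" in the corpus may stand for any longer number, so years and long document numbers cannot be recovered from the sentence text.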
We have published a corpus of the most complicated sentences in Russian legal texts. It was built by segmenting the texts into sentences and then selecting a subset of them according to metrics. The corpus is a Unicode CSV file. See the file most_complicated_sentences.zip.
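A minimal sketch of reading a CSV file directly from a zip archive with the Python standard library is given below. It assumes UTF-8 encoding and takes the first .csv member of the archive; the member name and the column names in the demo are hypothetical, since the actual layout of most_complicated_sentences.zip is not described here.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_source, encoding="utf-8"):
    """Return the rows of the first CSV member of a zip archive as dicts."""
    with zipfile.ZipFile(zip_source) as archive:
        csv_names = [name for name in archive.namelist()
                     if name.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no CSV file found in the archive")
        with archive.open(csv_names[0]) as raw:
            # Wrap the binary stream so the csv module receives decoded text.
            text_stream = io.TextIOWrapper(raw, encoding=encoding, newline="")
            return list(csv.DictReader(text_stream))


# Demo on an in-memory archive with made-up content (hypothetical columns).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo.csv", "sentence_id,text\n1,Пример предложения.\n")
buffer.seek(0)

rows = read_csv_from_zip(buffer)
print(rows[0]["text"])
```

For the real file, pass the path to the downloaded archive instead of the in-memory buffer, and adjust the encoding if UTF-8 turns out to be the wrong guess.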
We added documents adopted through the end of 2017. All zip files were re-uploaded to the download source because of minor changes to some of the documents, so you need to download all zip files again. Links and md5sums have been updated accordingly.
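After downloading the archives again, the published md5sums can be checked along the following lines. This is a sketch using hashlib from the standard library; the helper names md5_of_file and matches_md5 are hypothetical, and the expected checksum must be taken from the published list.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Return True if the file's MD5 equals the published checksum."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Demo on a temporary file; a real check would use a downloaded zip file
# and the checksum published for it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name
try:
    demo_ok = matches_md5(demo_path, hashlib.md5(b"hello").hexdigest())
    print(demo_ok)
finally:
    os.remove(demo_path)
```

A mismatch usually means the file is an older copy or the download was interrupted, in which case it should be downloaded again.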
We added a Python 3 example showing how to load data from this dataset into a Pandas DataFrame. We hope it helps you understand how to use the dataset. Commit: https://github.com/irlcode/RusLawOD/commit/54efe4dbeb3b28cdb309eb27c31b1ca3f749f712
We modified the XML and re-uploaded all files accordingly. Commit: https://github.com/irlcode/RusLawOD/commit/c5778eadd7f2dc2ecc4e81f626ca5d3251e3fc40
The dataset was made available.