If you want to contribute to the DaNLP project, your help is very welcome. These guidelines will help you contribute to the repository by:
- adding new tutorials
- adding new benchmarks
- adding new models and datasets
If you don't know how to contribute to an open source repository (i.e. how to open a pull request), here is an example tutorial.
Please follow the commit message style described here.
Before pushing a pull request, make sure that the tests pass by running `python -m unittest discover` (or `coverage run -m unittest discover`).
You're welcome to help us write new Jupyter notebook tutorials (in English) about how to use and apply Danish NLP.
For instance, you can write about:
- how to use DaNLP tools for a specific application, e.g.:
  - NER for a recommendation system or customer support
  - dependency parsing for question answering or building a knowledge graph
  - ...
- how to train a model for a specific task
- ...
For inspiration, you can have a look at our current tutorials here.
Keep in mind that DaNLP focuses on industry-friendly Danish NLP, so the examples should use tools and data that are freely available for commercial purposes.
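For instance, the opening cells of a tutorial will typically load a DaNLP model and run it on a small Danish example. Here is a minimal sketch, assuming the `load_bert_ner_model` loader from `danlp.models` (check the package documentation for the exact loader names and `predict()` output):

```python
# Minimal sketch of an opening tutorial cell.
# Assumes the load_bert_ner_model loader from danlp.models; check the
# package documentation for the exact loader names and predict() output.
from danlp.models import load_bert_ner_model

# Downloads the model on the first run and loads it from the local cache
ner_tagger = load_bert_ner_model()

# Tag a small Danish example sentence
tokens, labels = ner_tagger.predict("Jens Hansen arbejder hos Alexandra Instituttet i Aarhus.")
print(list(zip(tokens, labels)))
```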
Steps for adding a tutorial to the DaNLP repo:
- add the Jupyter notebook tutorial to the `examples/tutorials` folder
- update the README.md in the same folder by adding the name of the tutorial (with a brief description) to the list of tutorials
We update our benchmark scripts when:
- we add a new dataset to DaNLP;
- we add a new model to DaNLP;
- new (commercially available) tools are released for Danish NLP and we want to compare them to the DaNLP models and evaluate them against our data.
Refer to the specific benchmark subsection if you are adding a new model or dataset.
If you want to benchmark a model that is not part of the DaNLP package, it is possible (we evaluated, for example, the NERDA models on our NER dataset, see script).
Steps when evaluating a new model:
- add the code to the corresponding script in the benchmarks folder, e.g. in `examples/benchmarks/ner_benchmarks.py` when benchmarking a new NER model (see the sketch after this list)
- add any required packages to `requirements_benchmarks.txt`
- update our documentation with the results from benchmarking.
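The sketch below illustrates what such a benchmark entry could look like for NER. It assumes the `DDT` loader from `danlp.datasets` with its `load_as_simple_ner(predefined_splits=True)` method, and `tagger` is a hypothetical model object exposing `predict(tokens)`; adapt it to the conventions already used in `examples/benchmarks/ner_benchmarks.py`:

```python
# Rough sketch of a benchmark entry for a new NER model.
# DDT.load_as_simple_ner is assumed from danlp.datasets; `tagger` is a
# hypothetical object with predict(tokens) -> one IOB tag per token.
import time

from danlp.datasets import DDT
from seqeval.metrics import classification_report


def benchmark_ner_tagger(tagger):
    # Load the predefined test split: (list of token lists, list of tag lists)
    _, _, test = DDT().load_as_simple_ner(predefined_splits=True)
    sentences, gold_tags = test

    start = time.time()
    predictions = [tagger.predict(tokens) for tokens in sentences]
    print("Tagged {} sentences in {:.1f}s".format(len(sentences), time.time() - start))

    # Entity-level precision, recall and F1
    print(classification_report(gold_tags, predictions))
```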
If you want to add a new model or dataset to the DaNLP repository, contact us (danlp@alexandra.dk) first in order to:
- make sure that the model or data is in line with our focus (i.e. industry-friendly Danish NLP);
- send us your model or dataset, so that we can upload it to our server.
Code for loading/using models and datasets should be added to, respectively, the `models` and `datasets` folders.
Each model or dataset should be provided with at least one `load` function.
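As a rough illustration, a load function for a hypothetical model could follow the pattern below (the `download_model` helper and `DEFAULT_CACHE_DIR` are assumed to come from `danlp.download`; mirror the signatures of the existing load functions):

```python
# Sketch of a load function for a hypothetical model, placed in the models folder.
# download_model and DEFAULT_CACHE_DIR are assumed from danlp.download;
# mirror the existing load functions for the exact signatures.
from danlp.download import DEFAULT_CACHE_DIR, download_model


class MyNewModel:
    """Hypothetical wrapper exposing a simple predict() interface."""

    def __init__(self, model_path):
        self.model_path = model_path

    def predict(self, text):
        raise NotImplementedError  # placeholder: run inference here


def load_my_new_model(cache_dir=DEFAULT_CACHE_DIR, verbose=False):
    """Download 'my.new.model' if needed and return a ready-to-use wrapper."""
    model_path = download_model('my.new.model', cache_dir, verbose=verbose)
    return MyNewModel(model_path)
```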
When you add code for a new model or a new dataset to the danlp repository, you should also:
- add code for testing it (required)
- add code for evaluating it against our benchmarks (optional, but it would greatly help us and the community using it)
- add some documentation about it (optional, but it would greatly help us and the community using it)
Include all of this in your pull request. More details about these three steps follow.
Add a test for your model or dataset to the test folder:
- in `tests/test_datasets.py` for a new dataset,
- in `tests/test_{framework}_models.py` for a new model (where `framework` is `spacy`, `flair` or `bert` -- or create a new test file if you introduce a new framework).

Then run the test(s) (e.g. `python -m unittest tests.test_datasets`) to check that your code doesn't fail.
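For example, a test for a hypothetical new dataset added to `tests/test_datasets.py` could look like this (`load_my_new_dataset` is a placeholder for the load function that ships with the dataset):

```python
# Sketch of a test for a hypothetical new dataset, added to tests/test_datasets.py.
# load_my_new_dataset is a placeholder for the load function shipped with the dataset.
import unittest

from danlp.datasets import load_my_new_dataset  # hypothetical loader


class TestMyNewDataset(unittest.TestCase):

    def test_my_new_dataset(self):
        # Loading should succeed and return a non-empty dataset
        data = load_my_new_dataset()
        self.assertGreater(len(data), 0)


if __name__ == '__main__':
    unittest.main()
```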
Benchmark your model on our dataset(s) or our model(s) on your dataset.
Update or create a new benchmark script: in `examples/benchmarks`, add the code for benchmarking in the corresponding task file.
If your model introduces a new task, create a new benchmark file for this task.
Don't forget to add any required packages to `requirements_benchmarks.txt`.
Update our documentation with the results from benchmarking -- in the corresponding `{task}.md` file, or create a new one for a new task.
If you are adding a new task (hence a new benchmark script), update the list of scripts in the README file.
Add (markdown) documentation in `docs`:
- in `datasets.md` for a new dataset
- in `tasks/{task}.md` for a new model for a specific task (e.g. in `tasks/pos.md` for a POS tagger); create a new file if introducing a new task.
You can also add `frameworks` documentation with examples of how to use the model.