This repo contains a Jupyter notebook that introduces language models and word embeddings by training a word2vec model on datasets of 100K and 1M sentences from German news articles.
You need Python and JupyterLab installed on your machine.
- Run "jupyter lab" in your terminal.
- Clone this repo.
- Download the corpus folder from Wortschatz Leipzig, unpack it, and save the file "deu_news_2022_1M-sentences.txt" in the "data" folder. The file is not included in this repo because it exceeds 100 MB.
- Navigate to this repo using the file manager inside JupyterLab.
- Open "Notebook.ipynb" and enjoy!
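
Before opening the notebook, it can help to check that the corpus file ended up where the steps above put it. The following is a minimal sketch, not code from the notebook. It assumes the file uses the usual Leipzig Corpora layout of one sentence per line, prefixed by a numeric ID and a tab. The function name `read_sentences` and the simple whitespace tokenization are illustrative choices.

```python
from itertools import islice
from pathlib import Path

# Path from the setup steps above, relative to the repo root.
CORPUS_PATH = Path("data") / "deu_news_2022_1M-sentences.txt"


def read_sentences(path, limit=None):
    """Yield each non-empty sentence as a list of lowercase tokens.

    Assumption: each line looks like "<id><TAB><sentence>". A line without
    a tab is treated as a bare sentence.
    """
    with open(path, encoding="utf-8") as handle:
        # islice stops after `limit` lines; limit=None reads the whole file.
        for line in islice(handle, limit):
            # Keep only the text after the first tab, if there is one.
            text = line.rstrip("\n").split("\t", 1)[-1]
            tokens = text.lower().split()
            if tokens:
                yield tokens


if __name__ == "__main__":
    if CORPUS_PATH.exists():
        # Print the first three tokenized sentences as a quick sanity check.
        for tokens in read_sentences(CORPUS_PATH, limit=3):
            print(tokens)
    else:
        print(f"Missing {CORPUS_PATH}; see the download step above.")
```

Run it from the repo root. An iterable of token lists like this is the input shape that word2vec trainers such as gensim accept, although the notebook may prepare its data differently.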