First, download the desired Wiktionary dump from this repository. For instance, here are the latest links for some languages:
Then, run the following command to extract the IPA data:
python -m extract frwiktionary-latest-pages-articles.xml.bz2 fr.tsv
Note that you can disable the cleaning step with the -r flag to get all detected entries:
python -m extract -r frwiktionary-latest-pages-articles.xml.bz2 fr.raw.tsv
The output is easily loaded and processed using pandas:
import pandas as pd
df = pd.read_csv("fr.tsv", sep="\t", na_filter=False)
df = df.sort_values(["text", "pronunciation", "language"])
df = df.drop_duplicates()
# pandas >= 1.5 names this keyword lineterminator (line_terminator was removed in pandas 2.0)
df.to_csv("fr.sorted.tsv", index=False, sep="\t", encoding="utf-8", lineterminator="\n")
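If pandas is not available, the same file can probably be read with the standard library alone. The sketch below is only an illustration: it assumes the TSV starts with a header row naming the columns text, pronunciation and language (inferred from the sort call above), and the helper name load_pronunciations and the sample rows are hypothetical, not actual output of the extractor.

```python
import csv
import io
from collections import defaultdict


def load_pronunciations(handle):
    """Map each word to its distinct pronunciations, in file order.

    Assumes a tab-separated file whose header row contains at least
    the columns "text" and "pronunciation".
    """
    reader = csv.DictReader(handle, delimiter="\t")
    lookup = defaultdict(list)
    for row in reader:
        ipa = row["pronunciation"]
        # Skip exact repeats so each pronunciation is listed once per word.
        if ipa not in lookup[row["text"]]:
            lookup[row["text"]].append(ipa)
    return dict(lookup)


# Tiny in-memory sample standing in for fr.tsv (illustrative rows only).
sample = io.StringIO(
    "text\tpronunciation\tlanguage\n"
    "chat\tʃa\tfr\n"
    "chat\tʃa\tfr\n"
    "fils\tfis\tfr\n"
    "fils\tfil\tfr\n"
)
lookup = load_pronunciations(sample)
print(lookup["fils"])
```

To run this against a real extraction, replace the sample with open("fr.tsv", encoding="utf-8", newline="") and pass that file handle to the function.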