For HuggingFace: indicating closest ISO 639-3 code? #13

Closed
alexis-michaud opened this issue Aug 24, 2022 · 9 comments · Fixed by #15

@alexis-michaud

alexis-michaud commented Aug 24, 2022

Esteemed maintainers of Glottolog,

An opportunity to connect state-of-the-art Natural Language Processing (NLP) with linguistics in its unrestricted, 'full-diversity' mode: HuggingFace Transformers meet Glottolog?

NLP colleagues go to HuggingFace datasets to run experiments on all the language data they can lay their grubby hands on. It seems important to 'push' data from less-documented / less-studied / less-resourced / endangered languages up there, as a contribution to connecting the world of language documentation, description & conservation with the world of state-of-the-art NLP research. The stakes are high for both fields. (For anyone interested in longer reads, there's the argument of the ComputEL conference series, for instance.)

Currently, HuggingFace uses IANA language codes, not Glottolog codes. Take Japhug and Na: the datasets from the Pangloss Collection have been made available as HuggingFace datasets, here. We would like to use Glottocodes to identify the language varieties of these two corpora: japh1234 for Japhug, and yong1288 for Na.

But we can't input those in the metadata. @BenjaminGalliot had to remove the Glottocodes and provide only the closest ISO 639-3 equivalents. Glottocodes are currently confined to (i) the corpus card description, (ii) the language details, and (iii) the subcorpora names. (Pull request and discussion are here.)

This causes trouble for linguists, for reasons which are obvious to us but not so to computer science researchers. Thus, Japhug is one of the rGyalrongic (=Jiarong, rGyalrong) languages: it does not have an Ethnologue (3-letter) code of its own. So labelling it as 'jya' (Jiarong) is under-specific.

For want of proper referencing in the metadata, finding 'Japhug' becomes really hard (defeating the purpose of the whole plan of a HF deposit): as the language name is not foregrounded in the metadata, a search for 'Japhug' among corpora returns zero results. ('Na' has false positives like 'Vietnamese'.) Another occasion to confirm that we really want Glottocodes!

Wouldn't it be great if Glottolog, a CIL (Cool and Internationally Leading) database of language names (and more), committed to Open Science, met HuggingFace, a CIL (Cool and Internationally Leading) group: "The AI community building the future" committed to Open Science?

Specifically, the question raised by the HF team (here) is: "is there a DB of language codes you would recommend? That would contain all ISO 639-1, 639-2 or 639-3 codes and be kept up to date, and ideally that would be accessible as a Node.js npm package?"

The ball is in your court for answering this question, isn't it? Looks like an opening for adoption of Glottocodes (with ISO compatibility), for the mutual benefit of NLP research and linguistics + language documentation, doesn't it? To what extent would pyglottolog fit the bill / do the job? (API documentation here) I'm reaching my technical limitations here: I can't assess the distance between what pyglottolog offers and what the HF team needs.

@alexis-michaud alexis-michaud changed the title from "For HuggingFace: making Glottolog available as a Node.js npm package?" to "For HuggingFace: making Glottocodes DB available as a Node.js npm package?" Aug 24, 2022
@alexis-michaud
Author

(this issue probably belongs in pyglottolog rather than here, right?)

@xrotwang
Collaborator

xrotwang commented Aug 25, 2022 via email

@alexis-michaud
Author

alexis-michaud commented Aug 25, 2022

Glottocodes registered as BCP47 codes: indeed, that is what was recently suggested in the discussion (3rd point here): using Glottolog codes after an -x- tag in the BCP-47 format to maintain BCP-47 validity. Maybe that's all that needs to be said?
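
The `-x-` suggestion can be sketched in a few lines of Python. This is only an illustration, not something proposed in the HuggingFace discussion: the function name `bcp47_with_glottocode` is hypothetical, the pattern assumes Glottocodes are four lowercase alphanumeric characters followed by four digits, and pairing `jya` with `japh1234` is an assumption based on the closest ISO 639-3 code mentioned in this thread. BCP-47 private-use subtags may be up to eight alphanumeric characters long, so a Glottocode fits in one subtag.

```python
import re

# Assumed shape of a Glottocode: four lowercase alphanumerics + four digits.
GLOTTOCODE_RE = re.compile(r"^[a-z0-9]{4}[0-9]{4}$")


def bcp47_with_glottocode(base_tag: str, glottocode: str) -> str:
    """Append a Glottocode to a BCP-47 tag as a private-use subtag.

    Hypothetical helper: it checks only the Glottocode's shape and
    does not validate `base_tag` against the IANA registry.
    """
    if not GLOTTOCODE_RE.match(glottocode):
        raise ValueError(f"not a well-formed Glottocode: {glottocode!r}")
    return f"{base_tag}-x-{glottocode}"


print(bcp47_with_glottocode("jya", "japh1234"))  # jya-x-japh1234
```

The resulting tag stays valid BCP-47, while tools that do not know Glottolog can still fall back on the part before `-x-`.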

I'd still have a suggestion concerning an additional distribution format of Glottolog data tailored to the needs of computer scientists. (With apologies in case this comment is wide of the mark: I'm not an expert on computational matters.) I have a feeling that it would be useful (even though it's redundant information) to indicate the closest ISO 3-letter code equivalents for all language varieties, including "dialects".

The automatically generated list that Benjamin Galliot produced with pyglottolog (see here) lacks a 3-letter code for dialects (columns G and H in Benjamin's spreadsheet). The absence of this piece of information (currently useful for various purposes) might cause enough friction that someone who wants to do simple table lookup would go and search elsewhere for a simpler tool.

Thus, Yongning Na is (correctly) indicated in Glottolog (as yong1270) as corresponding to the 3-letter language code nru (links to Ethnologue and OLAC are provided), but this piece of information is not copied into the lines for its two dialects (Lataddi, lata1234, and Yongning, yong1288). So the 3-letter language code nru does not appear in Benjamin's pyglottolog export for the Lataddi (line 23194) and Yongning (line 23195) dialects.
[image]
The fact that both come under Ethnologue nru is not accessible through table lookup. To obtain this piece of information ("What is the closest ISO 639-3 code?"), it is necessary to go through a reasoning involving several steps, requiring familiarity with how Glottolog is structured:
(i) realizing why the information is absent: "it's not a bug, it must be because this entry is of a subtype that does not have this information... Yes! Column N says this entry is a "dialect", not a "language", and "dialects" do not always get an ISO 639-3 code of their own in this table".
(ii) finding one's way to the information on the higher-level grouping to which the variety at issue belongs ("this is a dialect of... let's see... yong1270! OK let's move to the corresponding line. Good, now we can look up the ISO 639-3 column for that higher-level grouping ("language"). Here it is: nru!")

My impression (for what it's worth) is that computer scientists who want a database of language names would want the information laid out flat in the table. Looking up a Glottocode and finding the closest ISO 639-3 code in the relevant column on the same line would be cool & helpful. So the table would be like this:
[image: the same table with the closest 3-letter codes filled in]
(I added the missing 3-letter codes manually, but it could be done automatically based on the information in the right-hand column, of course.)
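
For what it's worth, the automatic version of steps (i) and (ii) could look roughly like the sketch below. It is a toy under stated assumptions: the three rows use the Glottocodes and the `nru` code quoted above, but the field names (`level`, `iso`, `parent`), the dictionary layout, and the function name `closest_iso` are hypothetical and do not reflect the actual pyglottolog export or its API. Levels above the "language" row are left out.

```python
from typing import Dict, Optional

# Toy table keyed by Glottocode; field names are hypothetical.
rows: Dict[str, dict] = {
    "yong1270": {"name": "Yongning Na", "level": "language",
                 "iso": "nru", "parent": None},
    "lata1234": {"name": "Lataddi", "level": "dialect",
                 "iso": None, "parent": "yong1270"},
    "yong1288": {"name": "Yongning", "level": "dialect",
                 "iso": None, "parent": "yong1270"},
}


def closest_iso(glottocode: str, table: Dict[str, dict]) -> Optional[str]:
    """Walk up the parent links until a row with an ISO 639-3 code is found."""
    seen = set()
    code: Optional[str] = glottocode
    while code is not None and code not in seen:
        seen.add(code)  # guard against accidental cycles in the data
        row = table.get(code)
        if row is None:
            return None
        if row["iso"]:
            return row["iso"]
        code = row["parent"]
    return None


# Flattened table: every row carries its closest ISO code on the same line.
flat = {code: {**row, "closest_iso": closest_iso(code, rows)}
        for code, row in rows.items()}

for code, row in flat.items():
    print(code, row["name"], row["closest_iso"])
```

With the extra column in place, a plain table lookup on `yong1288` or `lata1234` returns `nru` without any knowledge of how the hierarchy is organised.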

I can't see any clear drawback in adding the closest 3-letter code for all language varieties. Then linguists (like me) could more easily push for the use of Glottocodes, with the argument that by using Glottocodes you also get the ISO 639-3 codes. (People who care about the many caveats surrounding language codes and language names can always find information & discussions elsewhere.)

Don't know if this makes any sense as seen from Kahlaische Straße 10? :)

@alexis-michaud
Author

Just in case someone from the Glottolog team feels like jumping in, the conversation on the HuggingFace repo is continuing.

@alexis-michaud alexis-michaud changed the title from "For HuggingFace: making Glottocodes DB available as a Node.js npm package?" to "For HuggingFace: indicating closest ISO 639-3 code?" Sep 6, 2022
@alexis-michaud
Author

I have changed the Issue title: apologies for moving the goalposts, but it seems the Hugging Face team is not so focused on getting the database accessible as a Node.js npm package. Instead, recent episodes in the discussion focused on how to retrieve the next closest ISO 639-3 code for a given Glottocode. The idea is to use Glottocodes, and also to use the Glottocode to arrive at the nearest ISO 639-3 code.
Thus, the metadata of a given dataset would contain:

  1. a Glottocode tag (e.g. japh1234 for the Japhug language)
  2. a matching ISO 639-3 tag when there is one (for Japhug: there is no matching tag, so that field would be empty)
  3. if (2) is not provided: a 'next closest' ISO 639-3 tag when one is available (e.g. for Japhug: jya), with the caveat that (3) is not an exact match.

(Empty fields for both (2) and (3) would indicate that the language at issue is an isolate and lacks an ISO 639-3 code.)
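
A minimal sketch of what such a three-field record could look like, purely as an illustration: the class name `LanguageTags`, the field names, and the helper `make_tags` are hypothetical and are not part of the HuggingFace metadata schema. The Japhug values are the ones given in the list above; the second example uses yong1270 and nru as quoted earlier in this thread.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class LanguageTags:
    glottocode: str                          # field (1), always present
    iso639_3: Optional[str] = None           # field (2), exact match if any
    closest_iso639_3: Optional[str] = None   # field (3), not an exact match


def make_tags(glottocode: str,
              exact_iso: Optional[str] = None,
              closest_iso: Optional[str] = None) -> LanguageTags:
    """Build the record; (3) is kept only when (2) is not provided."""
    if exact_iso is not None:
        closest_iso = None
    return LanguageTags(glottocode, exact_iso, closest_iso)


# Japhug: no matching ISO 639-3 tag, 'jya' as the next closest one.
print(asdict(make_tags("japh1234", closest_iso="jya")))

# Yongning Na: matching ISO 639-3 tag, so field (3) stays empty.
print(asdict(make_tags("yong1270", exact_iso="nru")))
```

Keeping (2) and (3) as separate fields makes the caveat explicit: a consumer can tell from the record alone whether the ISO code is an exact match.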

@alexis-michaud
Author

alexis-michaud commented Sep 6, 2022

As noted in the conversation on the HuggingFace repo, I appreciate that, from the perspective of Glottolog, it may appear unnecessary and potentially misleading to provide next-closest ISO 639-3 codes: additional work with no obvious gain. It could help with interoperability, though. So I make bold to mention the discussion to the team. Apologies for the 'noise' in case this is irrelevant.
All best wishes

@xrotwang
Collaborator

xrotwang commented Sep 6, 2022 via email

@xrotwang
Collaborator

@alexis-michaud with "next-closest ISO 639-3 code" you mean an ISO code of scope "individual", right? Otherwise, there could be multiple closest ISO codes - of dubious value. If so, I think Glottolog could provide this, presumably in the [glottolog-cldf dataset]. While this dataset isn't a "just-one-table" dataset, I'd still think it is easy enough to access to make integration into a HuggingFace workflow possible.

@alexis-michaud
Author

alexis-michaud commented Sep 13, 2022

Exactly, that is the idea: "an ISO code of scope "individual"".
Great to hear that it is feasible and not hopelessly difficult.

@xrotwang xrotwang transferred this issue from glottolog/glottolog Sep 13, 2022
xrotwang added a commit that referenced this issue Oct 13, 2022
@xrotwang xrotwang added this to the v4.7 milestone Nov 28, 2022
@xrotwang xrotwang mentioned this issue Dec 5, 2022