For HuggingFace: indicating closest ISO 639-3 code? #13
Comments
(this issue probably belongs in pyglottolog rather than here, right?) |
No, it belongs here, I think, because it's about an additional distribution
format of Glottolog data.
I'm still not entirely sure what exactly Glottolog should provide here.
Maybe glottocodes registered as BCP47 codes might have the same effect?
|
Glottocodes registered as BCP47 codes: indeed, that is what was recently suggested in the discussion (3rd point here): using Glottolog codes after an -x- tag in the BCP-47 format to maintain BCP-47 validity. Maybe that's all that needs to be said?

I'd still have a suggestion concerning an additional distribution format of Glottolog data tailored to the needs of computer scientists. (With apologies in case this comment is wide of the mark: I'm not an expert on computational matters.) I have a feeling that it would be useful (even though it's redundant information) to indicate the closest ISO 3-letter code equivalents for all language varieties, including "dialects". The automatically generated list that Benjamin Galliot produced with pyglottolog (see here) lacks a 3-letter code for dialects (columns G and H in Benjamin's spreadsheet). The absence of this piece of information (currently useful for various purposes) might cause enough friction that someone who wants to do simple table lookup would go and search elsewhere for a simpler tool.

Thus, Yongning Na is (correctly) indicated in Glottolog (as yong1270) as corresponding to the 3-letter language code nru (links to Ethnologue and OLAC are provided), but this piece of information is not copied into the lines for its two dialects (Lataddi, lata1234, and Yongning, yong1288). So the 3-letter language code nru does not appear in Benjamin's pyglottolog export for the Lataddi (line 23194) and Yongning (line 23195) dialects.

My impression (for what it's worth) is that computer scientists who want a database of language names would want the information laid out flat in the table. Looking up a Glottocode and finding the closest ISO 639-3 code in the relevant column on the same line would be cool & helpful. So the table would be like this:

I can't see any clear drawback in adding the closest 3-letter code for all language varieties. Then linguists (like me) could more easily push for the use of Glottocodes, with the argument that by using Glottocodes you also get the ISO 639-3 codes. (People who care about the many caveats surrounding language codes and language names can always find information & discussions elsewhere.) Don't know if this makes any sense as seen from Kahlaische Straße 10? :) |
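To make the -x- suggestion concrete, here is a minimal sketch in Python of what such a tag could look like; the helper name is made up for illustration, and the only rule relied on is BCP 47's requirement that each private-use subtag after -x- be 1 to 8 alphanumeric characters, which an 8-character Glottocode satisfies:

```python
def tag_with_glottocode(iso639_3: str, glottocode: str) -> str:
    """Build a BCP 47 tag that stays valid while carrying a Glottocode.

    The primary language subtag is the closest ISO 639-3 code; the
    Glottocode rides along as a private-use subtag after '-x-'.
    Glottocodes are 8 characters (4 alphanumerics + 4 digits), well
    within the 8-character limit on private-use subtags.
    """
    if not (len(glottocode) == 8 and glottocode.isalnum() and glottocode[4:].isdigit()):
        raise ValueError(f"not a well-formed Glottocode: {glottocode!r}")
    return f"{iso639_3}-x-{glottocode}"

# Yongning dialect of Yongning Na: closest ISO 639-3 code is 'nru'
print(tag_with_glottocode("nru", "yong1288"))  # -> nru-x-yong1288
```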
Just in case someone from the Glottolog team feels like jumping in, the conversation on the HuggingFace repo is continuing. |
I have changed the Issue title: apologies for moving the goalposts, but it seems the Hugging Face team is not so focused on getting the database accessible as a Node.js npm package. Instead, recent exchanges in the discussion have focused on how to retrieve the next-closest ISO 639-3 code for a given Glottocode. The idea is to use Glottocodes as identifiers, and to be able to go from a Glottocode to the nearest ISO 639-3 code.
(Empty fields for both (2) and (3) would indicate that the language at issue is an isolate and lacks an ISO 639-3 code.) |
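As a rough illustration of that retrieval step, here is a sketch that walks up the Glottolog classification from a given Glottocode until it hits a languoid with an ISO 639-3 code. It assumes a local clone of the glottolog data repository and relies on pyglottolog's Glottolog.languoid() together with the Languoid attributes iso and parent; please check those names against the pyglottolog API documentation before relying on it.

```python
from pyglottolog import Glottolog

def closest_iso639_3(glottolog: Glottolog, glottocode: str):
    """Return the ISO 639-3 code of the languoid itself, or of its
    nearest ancestor that has one; None if no ancestor is ISO-coded.

    For the dialects lata1234 (Lataddi) and yong1288 (Yongning), this
    should surface 'nru', the code attached to the language-level
    languoid yong1270 (Yongning Na).
    """
    languoid = glottolog.languoid(glottocode)
    while languoid is not None:
        if languoid.iso:                # ISO 639-3 code, if any
            return languoid.iso
        languoid = languoid.parent      # one step up the classification
    return None

# Usage, with a local clone of https://github.com/glottolog/glottolog:
# g = Glottolog('/path/to/glottolog')
# closest_iso639_3(g, 'yong1288')  # expected: 'nru'
```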
As noted in the conversation on the HuggingFace repo, I appreciate that, from the perspective of Glottolog, it may appear unnecessary and potentially misleading to provide next-closest ISO 639-3 codes: additional work with no obvious gain. It could help with interoperability, though. So I make bold to mention the discussion to the team. Apologies for the 'noise' in case this is irrelevant. |
I'll be back from vacation next week and will try to answer then.
|
@alexis-michaud with "next-closest ISO 639-3 code" you mean an ISO code of scope "individual", right? Otherwise there could be multiple closest ISO codes, which would be of dubious value. If so, I think Glottolog could provide this, presumably in the glottolog-cldf dataset. While this dataset isn't a "just-one-table" dataset, I'd still think it is easy enough to access to make integration into a HuggingFace workflow possible. |
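To gauge how easy that access might be, here is a sketch of a flat Glottocode-to-ISO lookup built from the dataset's LanguageTable, read with nothing but the standard library csv module (pycldf would work just as well). The file path and the column names ID, ISO639P3code and Language_ID are assumptions based on CLDF conventions and should be checked against the dataset's cldf-metadata.json.

```python
import csv

# Read the LanguageTable of a local copy of glottolog-cldf
# (path and column names are assumptions, see the note above).
with open('glottolog-cldf/cldf/languages.csv', newline='', encoding='utf-8') as f:
    rows = {row['ID']: row for row in csv.DictReader(f)}

def closest_iso(glottocode):
    """Own ISO 639-3 code if present; otherwise the code of the
    language-level languoid a dialect belongs to; otherwise None."""
    row = rows.get(glottocode)
    if row is None:
        return None
    if row.get('ISO639P3code'):
        return row['ISO639P3code']
    parent = row.get('Language_ID')    # language a dialect is assigned to
    if parent in rows:
        return rows[parent].get('ISO639P3code') or None
    return None

# closest_iso('yong1288')  # expected: 'nru'
```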
Exactly, that is the idea: an ISO code of scope "individual". |
Esteemed maintainers of Glottolog,
An opportunity to connect state-of-the-art Natural Language Processing (NLP) with linguistics in its unrestricted, 'full-diversity' mode: HuggingFace Transformers meet Glottolog?
NLP colleagues go to HuggingFace datasets to run experiments on all the language data they can lay their grubby hands on. It seems important to 'push' data from less-documented / less-studied / less-resourced / endangered languages up there, as a contribution to connecting the world of language documentation, description & conservation with the world of state-of-the-art NLP research. The stakes are high for both fields. (For anyone interested in longer reads, there's the argument of the ComputEL conference series, for instance.)
Currently, HuggingFace uses IANA language codes, not Glottolog codes. Thus for Japhug and Na: the datasets from the Pangloss Collection have been made available as HuggingFace datasets, here. We would like to use Glottocodes to identify the language varieties of these two corpora: japh1234 for Japhug, and yong1288 for Na.
But we can't enter those in the metadata. @BenjaminGalliot had to remove the Glottolog codes and only provide the closest ISO 639-3 equivalents. Glottolog codes are currently confined to (i) the corpus card description, (ii) language details, and (iii) subcorpora names. (Pull request and discussion are here.)
This creates trouble for linguists, for reasons that are obvious to us but not so obvious to computer science researchers. Thus, Japhug is one of the rGyalrongic (=Jiarong, rGyalrong) languages: it does not have an Ethnologue (3-letter) code of its own. So labelling it as 'jya' (Jiarong) is under-specific.
For want of proper referencing in the metadata, finding 'Japhug' becomes really hard (defeating the purpose of the whole plan of a HF deposit): as the language name is not foregrounded in the metadata, a search for 'Japhug' among corpora returns zero results. ('Na' has false positives like 'Vietnamese'.) Another occasion to confirm that we really want Glottocodes!
Wouldn't it be great if Glottolog, a CIL (Cool and Internationally Leading) database of language names (and more), committed to Open Science, met HuggingFace, a CIL (Cool and Internationally Leading) group: "The AI community building the future" committed to Open Science?
Specifically, the question raised by the HF team (here) is: "is there a DB of language codes you would recommend? That would contain all ISO 639-1, 639-2 or 639-3 codes and be kept up to date, and ideally that would be accessible as a Node.js npm package?"
The ball is in your court for answering this question, isn't it? It looks like an opening for the adoption of Glottocodes (with ISO compatibility), for the mutual benefit of NLP research and linguistics + language documentation. To what extent would pyglottolog fit the bill / do the job? (API documentation here) I'm reaching my technical limits here: I can't assess the distance between what pyglottolog offers and what the HF team needs.
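If a static, npm-friendly artefact turned out to be the most convenient hand-off, one option (sketched below under the same pyglottolog assumptions as above, with a placeholder path) would be to dump the languoid inventory to a flat JSON file once; any environment, including a Node.js package, could then bundle and query it without touching Python.

```python
import json
from pyglottolog import Glottolog

g = Glottolog('/path/to/glottolog')   # local clone of the glottolog data repository

# One flat record per languoid: glottocode, name, and the languoid's own
# ISO 639-3 code (None for most dialects; a closest-ISO column could be
# added with the ancestor walk sketched earlier in the thread).
table = {
    languoid.id: {'name': languoid.name, 'iso639_3': languoid.iso}
    for languoid in g.languoids()
}

with open('glottolog_flat.json', 'w', encoding='utf-8') as f:
    json.dump(table, f, ensure_ascii=False, indent=2)
```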