Refresh journal lists periodically #5749

Merged
merged 24 commits into from
Aug 31, 2020

Conversation

koppor
Member

@koppor koppor commented Dec 16, 2019

Before a release, the journal lists have to be updated manually. With this PR, the update happens periodically, similar to our CSL files (refs #5718).

@tobiasdiez
Member

What's actually the reason for having them in an extra repository, and not here in the main one?

@koppor
Member Author

koppor commented Dec 16, 2019

I stumbled upon this while reading our howto for releasing a new version:

[image]

Currently, we collect the data at https://abbrv.jabref.org/. We split the lists up by field:

[image]

For instance, we have a script for fetching the latest abbreviations from https://mathscinet.ams.org. See https://github.com/JabRef/abbrv.jabref.org/blob/master/update_mathscinet.py for details.

The split of the lists helps

What's changed in journalLists.txt

(Didn't you ask that - I just made the effort. Mhh...)

TODO:

Backport PRs on journalLists.csv / check whether they are included in abbrv.jabref.org

@tobiasdiez
Member

tobiasdiez commented Dec 16, 2019

What could also work is to include the abbreviation files directly here in this repo, and then run the combine script as part of the build process. This might reduce the management overhead concerning backporting.

Did you check the performance and memory consumption of the current version? I'm a bit worried that 100k new abbreviations take their toll.

@koppor
Member Author

koppor commented Dec 17, 2019 via email

@tobiasdiez
Member

Do we have that many contributions to the abbreviation list? I thought it was something like twice per year...

@koppor
Member Author

koppor commented Dec 18, 2019

Even though it is not that often, I find the homepage https://abbrv.jabref.org more welcoming than a subdirectory in the JabRef source with a README.md explaining how to contribute. The target group of contributors to JabRef is different from the group contributing to abbrv.jabref.org. The former are people who know how to code; the latter are just users of JabRef with some IT skills.

Maybe it is no issue for the second group to overlook that they don't need to set up a workspace, they don't need to run tests, ... (see https://github.com/JabRef/jabref/blob/master/CONTRIBUTING.md and https://github.com/JabRef/jabref/blob/master/.github/PULL_REQUEST_TEMPLATE.md).

I would keep the abbrv repo separate to address the other target group.

Maybe I should do a study using the GenderMag method by @mendezc1 to underpin (or reject) my argument more scientifically. WDYT @igorsteinmacher?

@koppor
Member Author

koppor commented Dec 19, 2019

Performance test is not necessary any more (#5769)

I just checked the performance and think the list is too big. Memory usage increased from about 400 MB to 1 GB after abbreviating one journal title.

I wonder about the quality - need to find out when the files diverged. We had the requirement to create the list based on our abbreviation list.

Moreover, the quality of the data seemed way worse than it was before. A lot of abbreviations don't have dots anymore.

@github-actions
Contributor

Please do not update journalList.csv directly. Please update the list at https://github.com/JabRef/abbrv.jabref.org/tree/master/journals

@koppor koppor self-assigned this Dec 20, 2019
@jlaehne
Contributor

jlaehne commented Mar 1, 2020

I modified combine_journal_lists.py to separately merge each of the lists on abbrv.jabref with the combined 'journalList.csv'. Checking how much the list grows gives a good idea of where all the extra entries come from:

journals/journal_abbreviations_acs.csv: 1717
Combined key count: 15531, Not in journalList.csv: 326

journals/journal_abbreviations_ams.csv: 2547
Combined key count: 17366, Not in journalList.csv: 2161

journals/journal_abbreviations_annee-philologique.csv: 848
Combined key count: 16047, Not in journalList.csv: 842

journals/journal_abbreviations_dainst.csv: 1590
Combined key count: 16782, Not in journalList.csv: 1577

journals/journal_abbreviations_entrez.csv: 19506
Combined key count: 33601, Not in journalList.csv: 18396

journals/journal_abbreviations_geology_physics.csv: 681
Combined key count: 15273, Not in journalList.csv: 68

journals/journal_abbreviations_ieee.csv: 262
Combined key count: 15463, Not in journalList.csv: 258

journals/journal_abbreviations_lifescience.csv: 9744
Combined key count: 15238, Not in journalList.csv: 33

journals/journal_abbreviations_mathematics.csv: 3312
Combined key count: 18281, Not in journalList.csv: 3076

journals/journal_abbreviations_mechanical.csv: 4759
Combined key count: 15478, Not in journalList.csv: 273

journals/journal_abbreviations_medicus.csv: 3168
Combined key count: 15970, Not in journalList.csv: 765

journals/journal_abbreviations_meteorology.csv: 119
Combined key count: 15205, Not in journalList.csv: 0

journals/journal_abbreviations_sociology.csv: 94
Combined key count: 15290, Not in journalList.csv: 85

journals/journal_abbreviations_webofscience-dots.csv: 87224
Combined key count: 91085, Not in journalList.csv: 75880

journals/journal_abbreviations_webofscience.csv: 87225
Combined key count: 91086, Not in journalList.csv: 75881

journals/journal_abbreviations_general.csv: 4538
Combined key count: 16927, Not in journalList.csv: 1722

From this, ams, annee-philologique, dainst, entrez, ieee, mathematics, sociology and webofscience were probably not part of the repository when journalList.csv was created.

However, acs, mechanical, medicus and general also have several hundred entries added after the lists were last merged.

The biggest increases come from entrez.csv and the webofscience files. Leaving them out still gives a moderate database size (instead of the 15205 entries currently in journalList.csv):
Combined key count: 24606

However, merging all but these three files with journalList.csv gives
Combined key count: 25492
So journalList.csv also still contains about 900 abbreviations that were probably added after the last merge.
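
A minimal sketch of the per-list comparison described above. It is not the modified combine_journal_lists.py; the semicolon-separated `full name;abbreviation` layout, the function names, and the sample rows are assumptions for illustration. In the numbers above, each combined key count equals the 15205 entries of journalList.csv plus the "Not in journalList.csv" count, and the sketch computes it the same way.

```python
import csv
import io


def read_abbreviations(text, delimiter=";"):
    """Parse 'full name;abbreviation' rows into a dict keyed by full name.

    The delimiter is an assumption about the file layout.
    """
    entries = {}
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or malformed rows
        entries[row[0].strip()] = row[1].strip()
    return entries


def merge_report(base, other):
    """Return (size of other, combined key count, keys of other missing from base)."""
    missing = [name for name in other if name not in base]
    return len(other), len(base) + len(missing), len(missing)


# Made-up sample data standing in for journalList.csv and one field list.
journal_list = read_abbreviations(
    "Physical Review B;Phys. Rev. B\n"
    "Journal of Applied Physics;J. Appl. Phys.\n"
)
field_list = read_abbreviations(
    "Journal of Applied Physics;J. Appl. Phys.\n"
    "Applied Physics Letters;Appl. Phys. Lett.\n"
)

size, combined, new = merge_report(journal_list, field_list)
print(f"sample list: {size}")
print(f"Combined key count: {combined}, Not in journalList.csv: {new}")
```

For real files, the text passed to `read_abbreviations` would come from reading each CSV in the journals/ directory.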

@jlaehne
Contributor

jlaehne commented Mar 1, 2020

The huge size of the webofscience files can probably be explained by the fact that they contain A LOT of conference proceedings, where several conference names lead to the same abbreviation. The most striking case is P. Soc. Photo-opt. Ins., which is the abbreviation for 6044 entries! (See retorquere/zotero-better-bibtex#1436 (comment))
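
One way to find such collisions is to count how many full names share each abbreviation. The sketch below assumes the list has already been loaded into a dict from full name to abbreviation; the function name and the conference names in the sample are invented for illustration.

```python
from collections import Counter


def abbreviation_collisions(entries, min_count=2):
    """Return (abbreviation, count) pairs used by at least min_count full names,
    most frequent first."""
    counts = Counter(entries.values())
    return [(abbr, n) for abbr, n in counts.most_common() if n >= min_count]


# Invented proceedings titles; only the abbreviation is taken from the thread.
sample = {
    "Hypothetical Optics Conference 2001": "P. Soc. Photo-opt. Ins.",
    "Hypothetical Optics Conference 2002": "P. Soc. Photo-opt. Ins.",
    "Hypothetical Imaging Symposium": "P. Soc. Photo-opt. Ins.",
    "Journal of Applied Physics": "J. Appl. Phys.",
}

for abbr, n in abbreviation_collisions(sample):
    print(f"{abbr}: {n} entries")
```

Abbreviations with a very high count are also the ones that cannot be unabbreviated unambiguously.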

@jlaehne
Contributor

jlaehne commented Mar 1, 2020

Another issue is that abbrv.jabref contains both lists with dots in the abbreviations and lists without dots. I don't think it makes sense to merge them into a single journalList, as you normally want one style or the other, not a mix.

So it could make sense to maintain two journalLists in the future.

@jlaehne
Contributor

jlaehne commented Mar 1, 2020

To backport the changes to journalList.csv, I would propose to identify the roughly 900 entries that are not in any of the other lists with dots and add them to general.csv using a Python script.

Afterwards, I would propose to maintain two lists
a) without dots, merging entrez, index-medicus
Combined key count: 22451
(+ webofscience: 101891)
b) with dots, merging general, ams, acs, geology-physics, lifescience, mathematics, mechanical, meteorology, sociology
Combined key count: 21627
(+ webofscience: 97024)

If you don't want overly large indices because of the memory issue, leave out the webofscience lists; people can load them separately if they want this huge extra chunk.

ieee seems to be added to JabRef separately anyway.

The external lists astronomy (very small anyway) and economics are not in the csv format yet.

dainst and annee-philologique use a non-ISO abbreviation style without dots and spaces, or largely even just a string of capital letters. I added a PR for a comment in the readme.md.
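
To decide which of the two proposed lists (with dots or without dots) a source file belongs to, one could measure the share of abbreviations that contain a dot. This is only a heuristic sketch: the function names and the 0.5 threshold are assumptions, and single-word titles such as Nature carry no dot in either style, which is why a share is used rather than an all-or-nothing check.

```python
def dotted_share(abbreviations):
    """Fraction of abbreviations that contain at least one dot (0.0 for an empty list)."""
    if not abbreviations:
        return 0.0
    return sum(1 for abbr in abbreviations if "." in abbr) / len(abbreviations)


def dot_style(abbreviations, threshold=0.5):
    """Classify a list as 'dots' or 'no-dots'; the threshold is an arbitrary assumption."""
    return "dots" if dotted_share(abbreviations) >= threshold else "no-dots"


with_dots = ["J. Appl. Phys.", "Phys. Rev. B", "Nature"]
without_dots = ["J Appl Phys", "Phys Rev B", "Nature"]

print(dot_style(with_dots))
print(dot_style(without_dots))
```

A share well between 0 and 1 would hint at a file that mixes both styles and needs manual cleanup.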

@koppor koppor force-pushed the master branch 5 times, most recently from b8ef7b7 to 21c6e5e Compare March 4, 2020 17:02
@koppor koppor changed the title Refresh journal lists periodically [WIP] Refresh journal lists periodically Mar 6, 2020
@jlaehne
Contributor

jlaehne commented Mar 9, 2020

Another note on the webofscience lists: they contain some odd variations of journal names and, given their size, are hard to keep clean. In particular, capitalization is wrong for abbreviations such as AIP, IOP, IEEE, ..., which appear as Aip, Iop, Ieee, .... Also, e.g., Journal of Physics D: Applied Physics becomes Journal of Physics D-applied Physics in the webofscience lists.

I guess this behaviour is due to the ISI lists being fully capitalized, which was then translated using a first letter capitalized scheme (and I don't see a better way unless they provide an interface to get the non-capitalized lists).

So I would argue in favour of keeping them as optional files that are not integrated into the standard journal list(s). Edit: Maybe even add a note on abbrv.jabref site to mention the caveats of these lists!
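
The capitalization artifacts described above can be reproduced with a naive word-by-word conversion of fully capitalized names. The sketch below is a guess at that kind of scheme, not the conversion actually used for the webofscience lists; the function name, the small-word set, and the all-caps input strings are assumptions.

```python
# Words left in lower case when they are not the first word (assumed set).
SMALL_WORDS = {"of", "the", "and", "in", "on", "for"}


def naive_title_case(name):
    """Capitalize only the first letter of each whitespace-separated word."""
    result = []
    for index, word in enumerate(name.lower().split()):
        if index > 0 and word in SMALL_WORDS:
            result.append(word)
        else:
            # str.capitalize() upper-cases the first character and lower-cases
            # the rest, so acronyms and the part after a hyphen lose their capitals.
            result.append(word.capitalize())
    return " ".join(result)


print(naive_title_case("JOURNAL OF PHYSICS D-APPLIED PHYSICS"))
print(naive_title_case("IEEE TRANSACTIONS ON ROBOTICS"))
```

Since the all-caps source no longer records which words are acronyms, repairing the output would need an explicit list of known acronyms such as AIP, IOP and IEEE.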

@jlaehne
Contributor

jlaehne commented Mar 9, 2020

Do you know this project: https://github.com/marcinwrochna/abbrevIso
With frontend: https://marcinwrochna.github.io/abbrevIso/
And API: https://tools.wmflabs.org/abbreviso/

It takes the official list of ISO4 abbreviations of single words, plus the general rules defined in the ISO4 specifications to deduce the abbreviation for any journal name you input.

Could be an alternative or complementary (when missing in the lists) approach to abbreviate journal names. But of course, it does not handle unabbreviation, for which there is no alternative to lists. It can also be a way to check the consistency of existing lists and it might make sense to link to the frontend on the abbrv.jabref website, so that people who want to add abbreviations can check for the correct one.
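
To illustrate the idea of deducing an abbreviation from per-word abbreviations, here is a toy sketch. It is not the abbrevIso implementation or its API: the four-entry word table is a hand-picked stand-in for the official word list, the set of dropped words is an assumption, and the real ISO 4 rules are considerably more involved.

```python
# Tiny hand-picked stand-in for the official word list.
WORD_ABBREVIATIONS = {
    "journal": "J.",
    "applied": "Appl.",
    "physics": "Phys.",
    "letters": "Lett.",
}
# Articles, prepositions and conjunctions to drop (assumed set).
DROPPED_WORDS = {"of", "the", "and", "in", "on", "for"}


def abbreviate(title):
    """Abbreviate a title word by word; words without a table entry are kept as-is."""
    parts = []
    for word in title.split():
        key = word.lower()
        if key in DROPPED_WORDS:
            continue
        parts.append(WORD_ABBREVIATIONS.get(key, word))
    return " ".join(parts)


print(abbreviate("Journal of Applied Physics"))
print(abbreviate("Applied Physics Letters"))
```

The mapping only runs forward: once words such as "of" have been dropped, the full title cannot be reconstructed from the abbreviation alone, which is why unabbreviation still depends on lists.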

@koppor
Member Author

koppor commented May 25, 2020

I need to dive into the comments and the delta between the journal lists here and the other repo. Need to think about whether we need a combined journal list. Did not have the time for it yet.

@koppor
Member Author

koppor commented Jul 27, 2020

Still need to dive into the comments. Complicated stuff, needs more than one hour of concentration.

@koppor koppor added this to the v5.2 milestone Aug 27, 2020
@koppor
Member Author

koppor commented Aug 27, 2020

We need to work on this for the next release as this (somehow) blocks contributors using IntelliJ 2020.2

@koppor
Member Author

koppor commented Aug 31, 2020

Thanks to the work by @jlaehne, this was "just" to implement the steps described by @tobiasdiez.

@koppor koppor marked this pull request as ready for review August 31, 2020 22:45
@koppor koppor changed the title [WIP] Refresh journal lists periodically Refresh journal lists periodically Aug 31, 2020
@koppor koppor merged commit 93c6f97 into master Aug 31, 2020
@koppor koppor deleted the update-journal-lists-automatically branch August 31, 2020 23:00
Siedlerchr added a commit that referenced this pull request Sep 1, 2020
* upstream/master:
  Squashed 'src/main/resources/csl-styles/' changes from a8dafef..fad76fe
  Update journalList.mv
  Update journalList.mv
  Fix github-push-action
  Revert "Try to refresh on master push"
  Try to refresh on master push
  Refresh journal lists periodically (#5749)
  Cancel Previous Workflow Runs (#6826)
koppor pushed a commit that referenced this pull request Dec 15, 2021
60bf7d5 Add encyclopedia type to wikipedia-templates.csl (#5778)
031afe1 Update and rename dependent/organization-studies.csl to organization-… (#5779)
7ed71e7 Update harvard-newcastle-university.csl (#5765)
46bab91 Update annals-of-oncology.csl (#5760)
6158ae6 Create research-in-plant-disease.csl (#5738)
04422a8 Create chemmedchem.csl (#5753)
7c11521 Create clinical-kidney-journal.csl (#5749)
e7ee983 Create kit-karlsruher-institut-fur-technologie-germanistik-ndl-neuere… (#5729)
96a1106 Update article format for STM journal (#5755)
a4ca057 Update historia-scribere.csl (#5748)

git-subtree-dir: buildres/csl/csl-styles
git-subtree-split: 60bf7d5

4 participants