Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate groups in an input file are not detected #7554

Closed
1 task done
sauliusg opened this issue Mar 20, 2021 · 65 comments · Fixed by #7558
Closed
1 task done

Duplicate groups in an input file are not detected #7554

sauliusg opened this issue Mar 20, 2021 · 65 comments · Fixed by #7558

Comments

@sauliusg
Copy link
Contributor

sauliusg commented Mar 20, 2021

JabRef version 5.2, 5.3, master commit 049acb9 on Linuxmint-20.1.

JabRef 100.0.0
Linux 5.8.0-45-generic amd64
Java 14.0.2
JavaFX 16+8

Duplicated groups in different branches of the group tree are not detected when a database that contains them is read in.

But on a second thought, and after playing a bit with JabRef 5.x master branch, I start thinking that this is not a bug but actually a very useful feature! I've opened a detailed discussion in https://discourse.jabref.org/t/hierarchical-groups-with-duplicated-names-are-actually-working-in-jabref-5-x/2619. So don't fix it!

Steps to reproduce the behavior:

  1. Start a new JabRef process, create a new library, add an entry, save the database as 'new.bib' (see the attached 'biblio.zip');
  2. Show the groups interface, create two groups 'Human' and 'Computer', create a subgroup 'Languages' in 'Human', save over the 'new.bib';
  3. Assign the new entry to 'Human/Languages';
  4. Copy 'new.bib' to 'new-duplicate-group.bib';
  5. Open the 'new-duplicate-group.bib' with a text editor, and duplicate the 'Languages' group in the 'Computer' branch:
saulius@starta duplicate-groups-created-by-hand-are-not-detected/ $ diff -u new.bib new-duplicate-group.bib 
--- new.bib	2021-03-20 14:13:07.189182296 +0200
+++ new-duplicate-group.bib	2021-03-20 13:35:47.348262111 +0200
@@ -17,4 +17,5 @@
 1 StaticGroup:Human\;2\;1\;0x8a8a8aff\;\;\;;
 2 StaticGroup:Languages\;0\;1\;0x8a8a8aff\;\;\;;
 1 StaticGroup:Computer\;0\;1\;0x8a8a8aff\;\;\;;
+2 StaticGroup:Languages\;0\;1\;0x6a7a9aff\;\;\;;
 }

The resulting BibTeX files are attached in 'biblio.zip':
biblio.zip

  1. Load the edited file into JabRef; no error report is produced, both groups are shown (correctly?), even though the GUI would not allow me to create such situation; the entry is reported to belong to both groups:
    Screenshot from 2021-03-20 13-37-01

The manually created duplicate is used only to reproduce a minimal example of this behaviour; in real life large number of duplicates emerged when porting previous group trees from the previous revisions of my database.

This behaviour actually seems to be very useful, since it allows me to create (once again!) groups with identical names in different places of the group hierarchy, and assignment to one such group adds the same entry to all such groups, which might be very useful (detailed discussion in https://discourse.jabref.org/t/hierarchical-groups-with-duplicated-names-are-actually-working-in-jabref-5-x/2619 ).

This is how it looks after some editing of the database:
Screenshot from 2021-03-20 16-38-55

The resulting bibtex file is added in 'experimets-with-group-hierarchy.zip'.
experimets-with-group-hierarchy.zip

Log File
Paste an excerpt of your log file here
@tobiasdiez
Copy link
Member

We disallowed to have two groups with the same name (even in different hierarchical levels / subtrees), because they would contain exactly the same entries (as you have remarked). This lead to many confusions for users, so we prevented it. See #1495 for details.

I'm not sure if we really need a warning message. The only way to get duplicate names is by manually editing the bib file (or upgrading from an older version). I would say that's pretty rare.

@sauliusg
Copy link
Contributor Author

sauliusg commented Mar 21, 2021

We disallowed to have two groups with the same name (even in different hierarchical levels / subtrees), because they would contain exactly the same entries (as you have remarked). This lead to many confusions for users, so we prevented it. See #1495 for details.

I think the bibliography tool should allow to do what users want, and not to "disallow" something. JabRef supports a lot of very different bibliography management styles, which is its big advantage, but the block in the GUI that inhibits creation of groups with the same name, even in different parts of the hierarchy, is a serious usability issue, and in this particular respect JabRef clearly looses to Zotero :( (even thougt the general policy of JabRef is so much superior of what other bibliography managers offer!)

I'm not sure if we really need a warning message. The only way to get duplicate names is by manually editing the bib file (or upgrading from an older version). I would say that's pretty rare.

If a group with the same name in different places of the hierarchy (or even in the same group) is being created, it would be nice if the GUI would issue a warning here, not a blocking error. The warning would explain that the group with the same name already exists (and mention its places in the hierarchy), but let the user do what he/she wants. Explicit and complete information wold remove most of confusion, IMHO. The GUI could even suggest a unique group name based on its current place in the group hierarchy.

@sauliusg
Copy link
Contributor Author

sauliusg commented Mar 21, 2021

The only way to get duplicate names is by manually editing the bib file (or upgrading from an older version). I would say that's pretty rare.

In my case, it is not rare but ubiquitous, and judging from other responses in #1495 and other related posts many users miss the group hierarchy with arbitrary group names. I tried to use a workaround in #1495, like "Diabetes > Treatment", and with the current implementation it is very, very inconvenient.

@sauliusg
Copy link
Contributor Author

sauliusg commented Mar 21, 2021

So lets sum up the current situation:

  1. The current group implementation has useful features (nicely summarised in Serialization of groups in entry #629: independence of BibTeX entries, possibility to manually edit BibTeX file for grouping, among the others);
  2. Users want flexible hierarchy and want to be able to create groups with whatever names they want, without a tool standing in their way;

A possible (easy?) way to reconcile these two requirements could be:

  • JabRef supports (as it already does now!) groups with identical names anywhere in the hierarchy; no changes is needed here, just the promise that the future versions of JabRef will not "disallow" such use of JabRef :);
  • The GUI allows creating such groups if the user wants to (so that we do not have to save the database, close JabeRef, edit the file by hand, restart Jabref every time I want the group that has a name already in the hierarchy). For this and the following functionality, a change would be needed in JabRef but the change will be small and localised (affecting, presumably, only the group creation dialogue).
  • The GUI warns the user that the group already exists elsewhere in the hierarchy, and that of created this group will contain the same entries as all other groups with the same name (I estimate that 50% cases this is what is actually needed);
  • The GUI suggests a unique name based on the hierarchy, by appending previous levels of the tree, and possibly numbering, until the name becomes unique; a user can chose that name by a single click or can continue using the duplicated name (with merged entries.

Example:

Groups:
Bioinformatics
- Drug design
-- Diabetes
--- Treatment
Medicine
- Diabetes mellitus
-- Treatment
- Diabetes
-- Treatment
- Asthma

When creating a subgroup called Treatment in Asthma, the GUI could respond:

Warning:
A group named 'Treatment' already exists in 'Medicine/Diabetes/Treatment', 'Medicine/Diabetes mellitus/Treatment'. 
If you create the group 'Treatment' here, it will contain entries from all mentioned groups.
[x] Use a unique name 'Treatment (Asthma)'
[OK] <- the button is *always* active!

I deliberately suggest using 'Treatment (Asthma)' instead of 'Asthma/Treatment' to avoid impression that the group name is strictly bound to group's position in the hierarchy. It is not; so moving group 'Treatment (Asthma)', say, one level up to 'Medicine' would not make its name "incorrect", in the sense that the group still contains papers about asthma treatment no matter where in the hierarchy it resides. This also means that JabRef does not need to bother about renaming groups when their position in the hierarchy changes (as is implemented now).

The suggested change would:

  • mostly solve most peoples problems with the current group implementation;
  • make the suggested workaround for group uniqueness more usable;
  • Add additional capability to JabRef (at no or very little implementation cost!) that would allow the same group to be placed in different parts of the hierarchy; a use case demonstrated by my SQL example shows that this can be very useful.
  • Explicit warnings in group creation dialogue and complete information about the same group names wold remove (most of) the confusion.

@sauliusg
Copy link
Contributor Author

PS. If the developers agree on such (or similar) design I would take a the issue and offer a PR :)

@sauliusg
Copy link
Contributor Author

sauliusg commented Mar 21, 2021

The only way to get duplicate names is by manually editing the bib file (or upgrading from an older version). I would say that's pretty rare.

Another use case for groups with the same name – have a look at the screenshot example with computer/human languages:
The groups 'Language' in different branches of hierarchies are union groups, and contain (under the current implementation!) different sets of papers, since they contain different subgroups (Programming/Languages – 17 papers; Data description/Languages – 13 papers). As expected. These groups have no papers assigned to them directly; this is not needed since the purpose of these groups is to summarise all papers assigned to their subgroups, which they do nicely.

In this setting there is no need to name the 'Language' groups differently, that would be very awkward. Also, the group's semantics will not change if I move them to different location in the tree (with all their subgroups), for example into 'Computers/Software' as opposed to 'Computers/Hardware' at some stage.

@sauliusg
Copy link
Contributor Author

A logic (classification) case for non-unique group names:

  • currently, a paper can belong to several different groups (of course :);
  • but, by the same token, a subgroup can also belong to several different super-groups (together with all papers assigned to it)!

@sauliusg
Copy link
Contributor Author

I'm not sure if we really need a warning message.

If you mean a warning when opening file with duplicated group names, then I agree, there is not need for the warning there. Which just means that it is a legit situation :)

@tobiasdiez
Copy link
Member

I agree with you that it sometimes could be advantages to have the same group in two different places. However,

subgroup can also belong to several different super-groups (together with all papers assigned to it)

is a highly unusual concept from a user-interface perspective. For example, files with the same name but in different folders don't automatically sync their content. I think we should really stick to the tree nature of the group interface, meaning that groups are independent.

As a way forward, I would propose the following:

  • Allow to have groups with the same name in different positions. I.e. properly fix Group hierarchy lost when moving to BibEntry #1495. Warning: this is might be very complicated.
  • Once this is done, we can think about adding a "symlink" group that acts as a copy of a different group.

What do you think? Help on this of course very much appreciated.

@sauliusg
Copy link
Contributor Author

sauliusg commented Mar 21, 2021

I agree with you that it sometimes could be advantages to have the same group in two different places.

ACK.

However,

subgroup can also belong to several different super-groups (together with all papers assigned to it)

is a highly unusual concept from a user-interface perspective.

I was thinking about this. It could be that we see the concept as unusual only because we are so much used to the file system tree structure. But file system tree is a bad tool for bibliography management (tried it, switched to JabRef :) ), precisely because we can not place the same file in several different directories (or rather we can, but this is not very convenient).

Groups in bibliography managers have the advantage that we can place the same file (or the same subgroup) in different hierarchies. In this case groups serve as a sort of "canned queries" on the database.

When one gets used to the current implementation of JabRef groups, one starts thinking that the group tree and the group assignments are actually disjoint features. A tree is described in @Comment{jabref-meta: grouping: element, no matter how the entries are assigned to grops; the group assignments is written in entries, no matter how the groups are moved around in the tree. Orthogonality of these two features is nice. Once you realise this, the uniqueness of group name no longer seems important or necessary, and synonymous groups can have many uses.

For example, files with the same name but in different folders don't automatically sync their content.

Well, this is not always true. On Unix (and Unix-like systems like Linux and probably on Mac), a hard-linked file will be "syncronised" across two directories. Viz.:

saulius@tasmanijos-velnias hl-demo/ $ mkdir folder1
saulius@tasmanijos-velnias hl-demo/ $ mkdir folder2
saulius@tasmanijos-velnias hl-demo/ $ echo "Initial content." > folder1/file.txt
saulius@tasmanijos-velnias hl-demo/ $ ln -v folder1/file.txt folder2
'folder2/file.txt' => 'folder1/file.txt'
saulius@tasmanijos-velnias hl-demo/ $ head folder?/*
==> folder1/file.txt <==
Initial content.

==> folder2/file.txt <==
Initial content.
saulius@tasmanijos-velnias hl-demo/ $ echo "Replacing content in folder1..." > folder1/file.txt
saulius@tasmanijos-velnias hl-demo/ $ head folder?/*
==> folder1/file.txt <==
Replacing content in folder1...

==> folder2/file.txt <==
Replacing content in folder1...

So the concept of writing things into one place and finding the change in another place is not that foreign, even if sometimes different to trace.

The concept of "same group in different tree branches" is quite similar to hard links in Unix (not soft links).

I think we should really stick to the tree nature of the group interface, meaning that groups are independent.

Here more input from users would be very welcome. I and you can see the situation one way, but it is hard to see how other users who did not see the code internals (or even the BibTeX database internals) will perceive the situation. I have started a discussion on https://discourse.jabref.org/t/hierarchical-groups-with-duplicated-names-are-actually-working-in-jabref-5-x/2619, as was advised in JabRef instructions, but so far it did not attract much comments or voting (well, it's weekend...). Maybe some feedback can be collected there?

I am, personally, currently very enthusiastic about just allowing duplicate group names, with a warning, and start using them. I'm right now playing with the JabRef tip, and all functions with duplicate group names seem to work as expected.

I must admit that I experienced several mysterious "index out of bound" exceptions, but they are very difficult to reproduce, and they seem to happen also with unique group names, so I do not know if this is connected with the groups. But in all in all having same-name groups does not seem to cause any bug trouble and works as expected.

As a way forward, I would propose the following:

Allow to have groups with the same name in different positions. I.e. properly fix #1495. Warning: this is might be very complicated.

I would be very happy with such solution. However, as you say, the fix might be very complicated, so it makes sense to see what can be done with simple means, even if functionality is not ideal.

An astonishingly simple fix is in my PR #7558 .

Of course, we (I) can try to implement the true group hierarchy, naming groups 'Computers/Programming/Languages' and 'Human/Languages', as suggested as a workaround in #1495, and showing just the leaf names in the group tree. I do not think that renaming a group or moving it in the hierarchy will cause big performance issues; groups tend to be small on average, I trust they are in dictionaries (so finding a group is constant time op), and renaming the group has exactly the same performance overhead even with the current implementation (since all entries belonging to the group must be updated when the group is renamed). So this sounds doable.

But, as said, maybe a "lazy" solution proposed in the above PR is just good enough? If so we can have it as a semi-permanent solution and move on. At my site, many other things preclude migration to JabRef 5.4 (from 5.2), for example timestamp issue (#7527), file parsing issue (#7553).

* Once this is done, we can think about adding a "symlink" group that acts as a copy of a different group.

Currently entry assignments to groups works as hard links, and this seems to be enough. If the full power of tree hierarchy is implemented (i.e. we can assign whatever names we want to groups in different tree branches), then symlinks no longer seem necessary. Probably they will be an overkill, IMHO.

What do you think? Help on this of course very much appreciated.

I'll be more than happy to give any help I can, in implementation or otherwise; just my time resources are not always abundant (but I understand that everyone's time is precious :), so I'll do my best).

@sauliusg
Copy link
Contributor Author

sauliusg commented Mar 21, 2021

As a way forward, I would propose the following:

Allow to have groups with the same name in different positions. I.e. properly fix #1495. Warning: this is might be very complicated.

One more thought about a possible fix:

  1. Let's add a unique ID field, an integer, to each group described in @Comment{jabref-meta: grouping:. This field shall be unique within the BibTeX file/database, and shall be assigned by JabRef;

E.g.:

-----------------vvv----------- The group ID
1 StaticGroup:sg\;1\;2\;1\;0x8a8a8aff\;\;\;;
2 StaticGroup:Bibliometrics\;2\;0\;1\;\;\;\;;
2 StaticGroup:Biochemistry\;3\;0\;1\;\;\;\;;
3 StaticGroup:Carbonic anhydrases\;4\;0\;1\;\;\;\;;
... and so on
  1. The uniqueness of the group shall be decided not based on its name, but on its ID;
  2. In each entry, the group = {...} list shall contain group names with IDs appended;
    E.g.:
group = {Carbonic anhydrases(4), Biochemistry(3)},
  1. For the user, in the GUI only group names will be displayed, not the IDs.
  2. Multiple groups with the same name shall be allowed as per PR Allow identical group names #7558 (I know, I am biased ;) ); these names can be distinguished by different group IDs assigned automatically by JabRef;

In this way, we will have a unique group name(id) string identifying each group for each entry, and the current group implementation remains as it is. From the user's perspective, however, groups can be given any names, even overlapping with other names in the hierarchy. Each entry will be assigned to precisely one unique group, and no ambiguity (or multiple assignments) will emerge.

  1. Symlinks to identical groups can be implemented by duplicating the group name+ID, if a user wishes so, in different places of the group hierarchy;
  2. When copying an entry to another BibTeX database, old IDs would be discarded and new matches will be created based on names and possibly on user feedback (No. 7 is a future enhancement).

Since the ID is always added as a fixed format numeric suffix, no ambiguity ever arises. For instance, if a user names her group "slaughterhouse(5)", and JabRef assigns ID 1024 to this group, then the JabRef entries will identify this group as slaughterhouse(5)(1024), and only the last braced number will be interpreted as an ID.

When groups are moved around the tree, or even renamed, no changes need to be done to the BibTeX entries, since groups IDs do not change. Group names and/or IDs no longer need to reflect group's position in the group hierarchy.

There will be hardly more than 10000 groups in a typical database, so the IDs will be comfortable 1 to 4 digit numbers, manageable also by humans.

What is your opinion about such design?

@tobiasdiez
Copy link
Member

The id solution would be the best from an implementation point of view (since then we indeed have an unique identifier). However, the idea was that users can also edit the groups directly by changing the groups field. That's no longer possible (without looking up the id of the group...).

The only way I see for this is to use the complete path from the root to the group and save this in the groups field. But then you have to worry about migration...

@sauliusg
Copy link
Contributor Author

sauliusg commented Mar 22, 2021

The id solution would be the best from an implementation point of view (since then we indeed have an unique identifier). However, the idea was that users can also edit the groups directly by changing the groups field. That's no longer possible (without looking up the id of the group...).

Thought about that. Editing a BibTeX file is in fact possible, at least as good as it is now:

  1. In the current setup, you have look up the group name anyway: how do you know otherwise if your group is named "Asthma>Treatment', 'Treatment (Asthma)' or 'Treatment of asthma'? Capitalisation also matters (checked), so you need to look up if it was 'Treatment' or 'treatment' or 'TREATMENT'. If you do this, you will also find out that it was actually 'treatment(23)' or similar. A command can be as simple as `grep 'group.*=' biblio.tex | grep -i treatment', of 'Edit|Find' in a GUI text editor. And if you mistype, the entry will not show up in any group of your tree, so once again you either need to know your group name precisely by heart or yo need to look it up...
  2. Since IDs, as mentioned, will be small integer numbers, and stay stable, they are mildly manageable by humans. When I talk about TCP ports 80 and 22, you do not need to look up in /etc/services to figure out what I am talking about, do you? :) The same will happen for most popular groups.
  3. We will have to be able to import entries that do not yet have IDs assigned (e.g. when importing an old style database), or assigned in a different namespace/database (when pasting). In that case, JabRef will have to look up the ID using group name and either use its ID (if the name is unique), create a new ID (if the group did not yet exist) or ask user which group or groups in the hierarchy do they want to put the entry into. Very natural, IMHO.

In short, when editing, the ID number will be a natural extension of the group's name, and can handled together with the name. In JabRef internally ID will be the sole unique index identifying the group.

There is a topic to think – whether just the ID needs to be unique or the combination 'name(ID)', and if just ID is unique then what should JabRef do when it encounters the same ID on the other group... But this is solvable, I believe.

@sauliusg
Copy link
Contributor Author

The only way I see for this is to use the complete path from the root to the group and save this in the groups field. But then you have to worry about migration...

Indeed, if you have the full path in the group name, then you need to update it each time the group is moved in the tree. You also need to cope with errors when incorrect path is edited in manually. I think this is a complex solution, and complexity arises from severe denormalisation (in database sense).

In case you have just an ID, you only need to update entries when you rename groups – which you also have to do in the current implementation, and I believe it is working (need to check ;). Moving groups in a tree around does not affect the entries. I would thus favour the 'name(ID)' solution.

@tobiasdiez
Copy link
Member

But finding the name is easy: I just look at the left side bar for the group, and put exactly what I see in the groups field and it works. That's harder for the id-based solution (you could add a tooltip but yeah...).

Yes, the "complete path" solution also has disadvantages...

@sauliusg
Copy link
Contributor Author

sauliusg commented Mar 22, 2021

But finding the name is easy: I just look at the left side bar for the group, and put exactly what I see in the groups field and it works. That's harder for the id-based solution (you could add a tooltip but yeah...).

True, I somehow overlooked this. But then you can not simply copy that text with a mouse :) (or can you?).

But I agree that having no IDs is simpler. This brings us again to a simple implementation with the current group code that just adds possibility to create identically named groups. I have hacked also the delete group possibility, without removing entries from other groups that have the same name (see the updated PR #7558), and it works OK for "Delete group and subgroups" command. For just "Delete subgroups" command I have troubles (entries are left with the deleted group name), but this behaviour may be related to #7556.

@AEgit
Copy link

AEgit commented Mar 27, 2021

Just as a comment from my side: As mentioned by @tobiasdiez you have to be very careful if you allow users to have duplicate groups. Problems I remember from the past (I am afraid, cannot find the respective bug reports anymore, except for #1495):

  1. In JabRef 3.8.2 all hell broke loose if you had two groups with the same name (inadvertently created!) at the same hierarchy level: when scrolling through the groups tree, you would reach these groups at some point - when that happened, the whole group tree disappeared/was no longer visible and you had to force-quit JabRef to make it usable again (and then manually edit the respective groups names in a text editor, to avoid this from happening again).

  2. If you had duplicated group names, removing items from groups could have side-effects that were not obvious to the user. The user might have forgotten that they had these duplicate group names (especially at the time, when no (stable) group name check was implemented in JabRef). Removing the item from one group would also remove it from the group with the duplicated name - even though that was not the intention of the user.

Now, I am not saying, that this will happen with the current JabRef version as well. But I just want to raise awareness, that these decisions can have major implications downstream.

@sauliusg
Copy link
Contributor Author

sauliusg commented Mar 28, 2021

Dear @AEgit, thank you very much for the comment.

Just as a comment from my side: As mentioned by @tobiasdiez you have to be very careful if you allow users to have duplicate groups. Problems I remember from the past (I am afraid, cannot find the respective bug reports anymore, except for #1495):

In JabRef 3.8.2 all hell broke loose if you had two groups with the same name (inadvertently created!) at the same hierarchy level: when scrolling through the groups tree, you would reach these groups at some point - when that happened, the whole group tree disappeared/was no longer visible and you had to force-quit JabRef to make it usable again (and then manually edit the respective groups names in a text editor, to avoid this from happening again).

I have checked with my group tree, with all disambiguation prefixes removed: the problem is no longer there in the current master tip and in the suggested PR #7558. The groups, also with duplicated names, can be scrolled and used without problems.

If you had duplicated group names, removing items from groups could have side-effects that were not obvious to the user. The user might have forgotten that they had these duplicate group names (especially at the time, when no (stable) group name check was implemented in JabRef).

Sure, the user will not keep in mind all possible configurations and duplications of all groups...

Removing the item from one group would also remove it from the group with the duplicated name - even though that was not the intention of the user.

I agree. Therefore I suggest in the PR #7558 make a very simple check – the group is only removed from the group list in the entries if there are no more groups with that name left in the group tree.

In this way, as long as the deleted group exists somewhere in the tree (possibly in multiple places), the entries stay assigned to these groups.

I have just noticed that when I rename one of the dupliactes, its entries are reassigned to the new group and vanish from all old group duplicates. This is not what I (and probably most other users) would expect; the fix should be easy – if the group(s) with the original name exist somewhere in the tree, the old group name should be retained in the entries. The entries will thus belong to two groups if one of the duplicates is renamed. Only of the group name vanishes from the tree, it should be removed from the entries as well.

I'll amend the PR to fix this.

Now, I am not saying, that this will happen with the current JabRef version as well. But I just want to raise awareness, that these decisions can have major implications downstream.

My intuition is that as long as group objects inside JabRef are identified by, say, their index in an array or by their position in an internal tree structure, and not by their (external!) names, no big problems should arise.

I agree that the behaviour that emerges on the duplicate group names is somewhat unusual, but it is now quite clear to me, and as long as it is consistent and no group information loss arises when manipulating groups, this might actually be an advantage of Jabref :).

The most important issue is to preserve group information without information loss when importing old group files. As @swekia wrote in the recent issue #7442, creating a good grouping for yourself is "many, many hours of manual work" (which I can confirm from my side).

To import groups created in the pre-JabRef 3.x version, when the group structure changed to unique (and so duplicates where very likely present in the input databases), a following import strategy might help:

a) if a group name is unique leave its name as it is;
b) if an old group name is a duplicate, postfix it with the position in the tree (e.g. in braces, 'Treatment (Medicine/Asthma)'); in this way a unique name is created with all old entries assigned to it (no info loss). A user can then rename or retain the names, as they wish;

If at some point it is decided that the groups in distinct parts of the tree should be distinct (either by appending group ID or by some other mechanism), importing from the current JabRef group tree (with possible duplicates, which, recall, can be made by manual editing of the file!) should assign the same entries to all new groups, make groups "synchronised" if syncrinisation is supported, and if not – issue a warning that this function is lost. IMHO this would very much ease the pain of upgrading and make work experience with JabRef much smoother :).

Note that import of previous databases is a separate issue from the current behaviour of the JabRef group system at any implementation point, and it should be addressed separately.

Seems like the road ahead is pretty save with these considerations... Or are there any dragons ahead which I have overlooked?

@AEgit
Copy link

AEgit commented Mar 29, 2021

@sauliusg : Thank you for this extensive explanation.
I think I would be happy if:

  1. There is always a warning when a new group is created who's name is NOT unique (as proposed here before).
  2. There is a warning when a (previous) JabRef database is imported which does not contain unique groups. It would be best, if the position of these groups in the group tree could be indicated. Note, that this is of relevance for users with many groups - I am using thousands of manually created groups (so, yes, I would agree that creating a good grouping takes "many, many hours of manual work", if not more ;)).
  3. You suggested that when importing old group names the name of non-unique groups could be fixed with the position in the tree. While I agree, that this sounds like a reasonable quick fix/workaround, I wonder, what effect this will have on databases with thousands of groups and a large group tree depth. Again, in my example, I have a group tree depth of up to 26 - if my database were to contain any non-unique groups, your solution would result in groups with VERY VERY long names. I wonder, whether those names would then exceed some threshold for character limits in group names and what that effect could have (in either saving this modified group trees or the display of them).

@sauliusg
Copy link
Contributor Author

sauliusg commented Mar 29, 2021

I think I would be happy if:

There is always a warning when a new group is created who's name is NOT unique (as proposed here before).

Agreed; a possible implementation is in the suggested in the PR #7558 (and a screenshot is present demonstrating the look-n-feel). Please have a look at the branch/screenshot how you find it.

There is a warning when a (previous) JabRef database is imported which does not contain unique groups. It would be best, if the position of these groups in the group tree could be indicated. Note, that this is of relevance for users with many groups - I am using thousands of manually created groups (so, yes, I would agree that creating a good grouping takes "many, many hours of manual work", if not more ;)).

I'm very glad that you confirm the importance of grouping for the users of JabRef, and your info volumes are impressive :).

Several ideas how JabRef could highlight duplicated groups in a convenient way:

  1. For groups that are encountered more than once in the tree, show a small bubble (as a superscript, for instance) with a number of repetitions; e.g. "My Groups/Computers/Programming/Languages³/SQL²";
  2. A tool tip could be displayed for that bubble explaining that "The group 'SQL' is present 2 times in the group tree. All its copies will always have exactly the same entries" (or similar text);
  3. The duplicated group name could be written in italics, to highlight its peculiar character.

Which of the display features, or which combination you think are most usable?

I think, however, that it would be very inconvenient if JabRef would pop up a warning box each time a database with duplicated groups is read in. The duplicates should be accepted and dealt with as the user wishes.

You suggested that when importing old group names the name of non-unique groups could be fixed with the position in the tree. While I agree, that this sounds like a reasonable quick fix/workaround, I wonder, what effect this will have on databases with thousands of groups and a large group tree depth. Again, in my example, I have a group tree depth of up to 26 - if my database were to contain any non-unique groups, your solution would result in groups with VERY VERY long names. I wonder, whether those names would then exceed some threshold for character limits in group names and what that effect could have (in either saving this modified group trees or the display of them).

Indeed, deep trees like yours will have problems. A further refinement of the "tree prefix" method would be to use only as many levels of the tree as is needed for disambiguation. For instance, if two of your no-synonymous groups would be at level 26, but they would differ at the 24-th level, only two extra group prefixes would be needed ('Leaf Group (branch24/branch25)').

If even this is too long (one would set some limit, as you suggest), one could resort to middle character contraction (as in 'i18n') and/or to plain numbering ('i18n, 2nd group'). Of course this is only the import behaviour, the user can then rename the groups as one wishes.

Note that in my suggestion this would only be relevant for importing old group trees that gad synonymous but different groups. If your group has no synonyms, or if you synonyms follow the suggested behaviour (i.e. they are intended to always keep the same entries), no transformation of your tree would be necessary.

So, are we going for it? :)

NB. The PR #7558 does not contain import functionality, it only contains changes in new format group handling.

@AEgit
Copy link

AEgit commented Mar 30, 2021

So far sounds this good to me. The import should not matter too much in my personal case (hopefully) since I tried to make sure, that all my group names were unique (I am still forced to use 3.8.2 since I cannot use JabRef before the following feature request is implemented: #4237).
But I thought, I should mention these issues, since other users might not have given that much attention to this.

So here my take on the import feature (which might be added later?):

  1. The user should be able to turn on/off the warning when dealing with a database, that contains non-unique group names. I see your point that it can be inconvenient for users who have decided to accept the duplicates. But for ?most users (including myself) it is essential that it is clear, that the database contains non-unique group names that must be dealt with. Those should be able to see this every time they have to deal with such a database. Therefore the warning needs to be a feature that guarantees to find the duplicates and it needs to be available to those who require it. Therefore maybe a switch to turn on/off this warning message could be a solution (but maybe there is a better one)?
  2. You suggest various ways on how JabRef could highlight duplicated groups. All of them seem reasonable to me, but they have one flaw. They do not work well, if you have very large group trees, i. e. thousands of groups. A user is not going to scroll through the whole group tree to find all the duplicates. Instead, JabRef should tell the user directly, where to find those groups. Two things come to my mind:
    a) Either the groups filtering can be used to display all non-unique groups
    AND/OR
    b) a separate list is shown, that tells the users exactly where to find the non-unique groups by giving the full group path for each affected group
    There might be a better way to do this, but these are the two options that come to my mind at the moment, that deal with the problem of non-unique groups within massive group trees.

Thanks for your help!

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 2, 2021

  • The user should be able to turn on/off the warning when dealing with a database, that contains non-unique group names. I see your point that it can be inconvenient for users who have decided to accept the duplicates. But for ?most users (including myself) it is essential that it is clear, that the database contains non-unique group names that must be dealt with. Those should be able to see this every time they have to deal with such a database. Therefore the warning needs to be a feature that guarantees to find the duplicates and it needs to be available to those who require it. Therefore maybe a switch to turn on/off this warning message could be a solution (but maybe there is a better one)?

In the PR that I suggest groups with repeated names should become, once again, a normal situation in JabRef databases. Thus making a popup window each time you start JabRef, even if you can "configure it away", looks like an inconsistent behaviour (why should a program stop your workflow to warn about a feature which it created itself and which fully supports?) See below for a possible idea.

  • You suggest various ways on how JabRef could highlight duplicated groups. All of them seem reasonable to me, but they have one flaw. They do not work well, if you have very large group trees, i. e. thousands of groups. A user is not going to scroll through the whole group tree to find all the duplicates. Instead, JabRef should tell the user directly, where to find those groups. Two things come to my mind:
    a) Either the groups filtering can be used to display all non-unique groups
    AND/OR
    b) a separate list is shown, that tells the users exactly where to find the non-unique groups by giving the full group path for each affected group
    There might be a better way to do this, but these are the two options that come to my mind at the moment, that deal with the problem of non-unique groups within massive group trees.

A quick indication of whether the current database contains groups with repeated names would be a dashboard-style indicator that says:

There are 8 groups with _repeated_ names in the database.

This announcement could be added right below the "Filter groups" input field, and above the "All entries" group.
Optionally, to could contain number of all groups as well (and so it would be displayed always, not only when groups with repeated names are present).
Also optionally, there could be 3-state switch next to the announcement: "show only repeated groups/show only unique groups/show all groups", so that you can locate such groups fast (and rename them if you wish).

"Optionally" in this context means that, IMHO, JabRef group interface both with and without these features would be very usable, so maintainers/community should consider whether the benefits of adding extra these features justifies extra work and code increase.

If the group count and duplicated groups are not displayed in JabRef itself, they are still possible to find quickly with some command-line magic :):

  1. Display all groups with repeated names:
saulius@starta multipl-groups/ $ awk -F: '$1~/ StaticGroup$/{print $2}' c-standard-test.bib | awk -F'\' '{print $1}' | sort | uniq -c | sort -nr -k1,1 | awk '$1>1'
      2 Ontologies
      2 MOFs
      2 Languages
      2 Databases
      2 Data
      2 Crystallography
      2 COD
      2 CIF
  1. Could how many such groups you have:
saulius@starta multipl-groups/ $ awk -F: '$1~/ StaticGroup$/{print $2}' c-standard-test.bib | awk -F'\' '{print $1}' | sort | uniq -c | sort -nr -k1,1 | awk '$1>1' | wc -l
8

These bash commands were run on Linux, but should work on MaOS and on Windows with CygWin as well; they can be defined as short commands such as 'bibtex-count-goups' or similar.

From a perspective of Unix principles as I understand them, a function that can be conveniently factored out into a separate, independent program should be factored out and not clutter the main tool. But this might be a different philosophy from what Windows/Mac users are used to... :)

@AEgit
Copy link

AEgit commented Apr 2, 2021

A quick indication of whether the current database contains groups with repeated names would be a dashboard-style indicator that says:

There are 8 groups with _repeated_ names in the database.

This announcement could be added right below the "Filter groups" input field, and above the "All entries" group.

If the "repeated" names are mentioned in the announcement, then this would indeed already improve the situation. Note, that with the current JabRef version not having the full path is less of a problem, since you can easily filter the groups using the groups filter if you know, which groups are affected.
With the old JabRef (e.g., 3.8.2) this was not possible, since there was no such filter.
Yet, having the direct path or automatically filtering the respective groups (as you propose in the options) would be even better, since it is not intuitive to the user, that one has to find the duplicated groups by themselves.

Maybe I should explain further, why this can be a problem: If you have a database with thousands of groups, it might not be obvious when creating a new group, that it already exists somewhere in the database (you will not notice this immediately just by looking at the groups panel). If a user wants to avoid having unique groups (there are domain-specific reasons for this, which I won't go into), not knowing about the existence and the location of these non-unique group names can be an issue.

As for the command-line magic: Yes, that would be a workaround and I appreciate you sharing the respective scripts. But I would argue, that this is detrimental to the user experience. People will want to have a simple reference does-it-all-for-you tool.
Further: If we were to go by a strict reading of Unix principles, then one shouldn't even use JabRef in the first place. It is definitely not a simple tool for a single use, but a quite complex piece of software. If you wanted to, you could also emulate most (all?) functions of JabRef using Emacs + various scripts (in fact I know people, that do that) - but for most people that would just be inconvenient and I do not think that is the goal of JabRef.

@tobiasdiez
Copy link
Member

Thanks for the ongoing discussion. The twitter poll clearly showed that many users would like to have groups with the same name. However, it became also clear that it might be confusing if the entries are automatically shared. In the devcall we also discussed that there are also scenarios where the groups with the same name actually don't show the same entries (e.g. if one uses the hierarchical modifiers), potentially leading to even more confusion.

For this reason I would strongly propose to go the id-based solution:

  • Each group gets an unique identifier
  • The identifier is stored in the groups field (and no longer the name of the group)
  • Users can create completely independent groups with the same name (even in the same hierarchy level if they desire, although a warning should be shown in this case)
  • In case, they want to have a mirror group they can use an automatic keyword group pointing to the groups field and searching for the identifier of the other group.

This unique identifier will also be very helpful for other scenarios. For example, we are planning a new feature in connection with JabRef Online, where users can share a group. For this a unique identifier is essential, since one needs to a way to identify the group even for different users.

@sauliusg your work here in this PR is very much appreciated. But I think the id-based system is the more universal way forward. Would you be interested in implementing it? We core developers help of course where we can!

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 2, 2021

@sauliusg your work here in this PR is very much appreciated. But I think the id-based system is the more universal way forward. Would you be interested in implementing it? We core developers help of course where we can!

Let me think if the PR can be extended, with reasonable effort, to handle uniquely identified groups.

I'm a bit worried that the full unique group ID solution, while I agree that it is cleaner, but is at the same time more complicated, will take more time to implement, may introduce new bugs; whereas we need badly to continue working with our bibliographies, and the limitation on group names makes this cumbersome...

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 2, 2021

For this reason I would strongly propose to go the id-based solution:

* Each group gets an unique identifier

I agree that this is a more universal and more usual solution.

But then, more questions arise:

  • what identifiers should be used? There are several options:
  1. UUIDs. They are world unique (alowing users to share their grouped data without interferences), but not easily human readable, so manual editing of a BibTeX file would become cumbersome;
  2. Numeric group IDs (1,2,3,...,999). Small numbers would be easier to handle, but would only be meaningful within one database;
  3. Group names+IDs ( Languages(15), Languages(16), Treatment(4), Treatment(7) ) – both unique and human-readable, but more complicated to handle (one needs to decide what to do when names match but number's don't, etc.)
  4. Group names+UUID? (Languages(6153c16e-93d1-11eb-b0a3-73a851179dd4), Languages(782c5ea0-93d1-11eb-8716-8bccdb471bb8) ) – unique, user-readable and editable, but long.
  5. UUID sha1 or md5 hash fragments? Sorter but higher risk of collision.
  6. Names + option 5?
  7. Anything else?

Technically, (4) will probably be the easiest to handle both manually and in JabRef code, but will bloat the files.

Your ideas?

* The identifier is stored in the groups field (and no longer the name of the group)

* Users can create completely independent groups with the same name (even in the same hierarchy level if they desire, although a warning should be shown in this case)

True.

* In case, they want to have a mirror group they can use an automatic keyword group pointing to the `groups` field and searching for the identifier of the other group.

Or create a duplicate manually, using a text editor :). The bottom line is that JabRef, IMHO, will have to deal with such situations anyway.

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 2, 2021

For example, we are planning a new feature in connection with JabRef Online, where users can share a group. For this a unique identifier is essential, since one needs to a way to identify the group even for different users.

In that case UUID or UUID+Group name seems the best suitable solution.

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 2, 2021

Maybe I should explain further, why this can be a problem: If you have a database with thousands of groups, it might not be obvious when creating a new group, that it already exists somewhere in the database (you will not notice this immediately just by looking at the groups panel). If a user wants to avoid having unique groups (there are domain-specific reasons for this, which I won't go into), not knowing about the existence and the location of these non-unique group names can be an issue.

This is actually handled in the proposed PR: when you try to create a group with the name that already exists in your tree, a yellow exclamation mark appears, and a tool-tip explains what would happen if you go for it. This is proposed instead of bluntly forbidding such groups. You can decide whether you want a duplicated group name, or whether you add more charters for disambiguation.

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 3, 2021

This unique identifier will also be very helpful for other scenarios. For example, we are planning a new feature in connection with JabRef Online, where users can share a group. For this a unique identifier is essential, since one needs to a way to identify the group even for different users.

A a side note: as far as group/file sharing goes, this problem we solved at my site once and for all, by simply using an SVN repo for storing bibliographies. This solution works perfectly since 2008, and Subversion gives numerous advantages: a) unlimited undo b) full history of changes c) synchronisation (with automatic merges) on different computers, for different users, and even in different directories of the same computer d) easy and informative diff's... and many more bonuses.

What makes it difficult to share data is namely the decision to forbid (in the GUI) groups with identical names. It is all that natural that I (as a user sg) will want to have group "Languages", and my colleague also wants to have group "Languages". Why not? If the same papers land into that group, this is on one hand a benefit for us, on the other hand easy to sort out afterwards if we decide to split the group.

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 3, 2021

This is actually handled in the proposed PR: when you try to create a group with the name that already exists in your tree, a yellow exclamation mark appears, and a tool-tip explains what would happen if you go for it. This is proposed instead of bluntly forbidding such groups. You can decide whether yo want a duplicated group name, or whether you add more charters for disambiguation.

Ok, if this works rock-solid, then there should be no problems,

Well, I can not promise that it is "rock-solid" ("no warranty" disclaimer, as always :); but I have now tested my branch for quite a while and things work as expected; I also do not see any big dangers ahead since, as I have mentioned, the proposed PR #7558 is a conservative extension, it does not change the current design of the groups, just allows to use more features in it.

except for (a) older databases, as discussed before, or (b) users that change their opinion in regards to unique groups.
(a) So basically only someone, who has an older database, requires unique groups, but is not aware that the older database currently does not meet the requirements, would need help from JabRef in finding those or (b) people who later on realise that they prefer unique groups.

People who read old databases with unique groups will not be affected at all, since the extension leaves all old functions intact.

People who wish to maintain unique groups in their databases are in the position to do so easily; essentially they get the same message as before if they try to create group with a duplicated name. The only difference is that in the proposed PR #7558, you are no longer left with the disabled "OK" button, but you have a explanation and a choice. In addition, several small fixes take care that group renaming and deleting does not loose information when you have duplicated groups. Since you can have duplicated groups anyway (disabling "OK" button in the GUI does not ensure that program invariant – unique group names – is maintained in a database), the fixes are needed regardless to how GUI behaves.

People (like me) who want to reuse group names are in a position to do so conveniently.

People who later want to make groups unique can do so by renaming one or several groups; the code fixes in the PR #7558 take care that the entries stay assigned to the old group when a duplicated group is renamed, and to the new group if you select "assign all entries to the new group". So the system is very flexible in what you do.

Seems like everyone-wins situation, doesn't it?

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 3, 2021

@sauliusg your work here in this PR is very much appreciated. But I think the id-based system is the more universal way forward. Would you be interested in implementing it? We core developers help of course where we can!

@tobiasdiez These are hard choices. May I suggest the following roadmap for group interface:

  1. We take, as a quick temporary fix, the PR Allow identical group names #7558 , with the enhancements suggested by @AEgit , namely with a dashboard line indicating how many repeated (duplicated) groups are there in the database, and possibly with a button to show all these groups. I take responsibility to enhance the PR Allow identical group names #7558 in the suggested way; of course the help and hints from the developers are extremely appreciated since I am not (yet) very familiar with the GUI code of the JabRef source;
  2. We then move to fix the bibtex parser/writer (issue Markdown comments dissapear on new start when having # chars #7010 (rediscovedred in Entries with hash symbos ('#') are processed incorrectly #7553)) and time stamps (issue new time field creation date - missing possibility to read other field names; missing backup of old information #7351 (redicovered in Updated JabRef deletes 'timestamp' fields from the database #7527)); IMHO these are really stumbling blocks to migrate to a new version of JabRef (at least at our site); I am ready to contribute on these fixes as well, if needed;
  3. We then deliberate how to make transition to truly unique groups in JabRef, without breaking what people have already built for themselves using the current JabRef implementations, and considering all possible use scenarios (sharing groups on a central JabRef server, sharing groups via version control, etc.). I could give a helping and here as well, but, as said, time is scarce so this will not happen instantly. Also, a broader discussion with users is probably needed to design (3) in a best way.

With (1) and (2) in place, people (including me) can go on migrating to new JabRef (I would do this in my branch and start using it on the spot, and the switch to >= version 5.4 when this comes out; I can not use 5.2 since it still has #7548 PR not merged and I can not use 5.3 since it corrupts my database (issue #7010). Those who use unique group names within their databases, as @AEgit described it in detail, would continue to have full functionality available to them. Their databases would continue to work as before.

Unique shared groups and collaboration scenarios are nice to have; however, these features are trickier to implement and needs careful design and more time; they could come somewhat later. Their introduction would be compatible with changes in (1) if it is compatible with the current JabRef group implementation.

Would you (developers) agree on such process?

Update: actually, after reading discussion in #7010 I found a workaround for my database: setting "Resolve strings for all fields except: url;abstract;not" saves and reads the database correctly again, without touching the abstract field.

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 3, 2021

I suggest we accept that we loose manual editing and would go for uuid's. The user experience can be improved afterwards by using a tag-based interface, for easy management of groups from the entry editor.

On a third thought, it seems that there is a way to introduce UUIDs without loosing editability:

  1. UUIDs and group names are added to ' StaticGroup:' lines;
  2. Bibtex entries retain names; if names are unique, they are not changed (UUID is not appended);
  3. If names in entries are not unique, they are disambiguated automatically using UUID appended in braces (e.g. Languages(36592084-944f-11eb-b17a-5fe0541bb1b3));
  4. If user maintains a database with duplicated group names, functionality of PR Allow identical group names #7558 handles the situation properly.

This would ensure smooth migration from the current group implementation and to the future eventual sharing of groups on a server.

@tobiasdiez
Copy link
Member

#7554 (comment) These are all valid concerns. But most of these disadvantages can be taken care of by a more advanced implementation. For example, search for groups = ... can look up the groups by name, and then search for the entries that have the corresponding group-identifier. Similarly, before sharing (e.g. export) one could replace the group name by the identifier.

#7554 (comment) The problem I see with this approach is that there is a time where people may create groups with duplicate names. This makes the migration a lot harder. Thus I would prefer to sort this out before.

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 3, 2021

#7554 (comment) The problem I see with this approach is that there is a time where people may create groups with duplicate names. This makes the migration a lot harder. Thus I would prefer to sort this out before.

If people will create groups with duplicate names, even after being warned, this means that they need such groups. Why not supporting this workflow?

For eventual migration to groups with IDs, I have outlined one of the possible migration strategies above; it seems to be simple and to maintain smooth transition. Sure one needs to be cautious, but there seems to no unsurmountable difficulties on this path.

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 3, 2021

#7554 (comment) These are all valid concerns. But most of these disadvantages can be taken care of by a more advanced implementation. For example, search for groups = ... can look up the groups by name, and then search for the entries that have the corresponding group-identifier. Similarly, before sharing (e.g. export) one could replace the group name by the identifier.

More advanced implementation inevitably means more complicated, has chances to more bugs, and takes more time to implement.

As you surely know: the cheapest, fastest and most reliable software components are those that are absent :)

@tobiasdiez
Copy link
Member

If people will create groups with duplicate names, even after being warned, this means that they need such groups. Why not supporting this workflow?

I really would like people to have groups with duplicate names. It's just that I feel this should be properly implemented instead of using a temporary workaround that makes the proper implementation harder.

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 3, 2021

I really would like people to have groups with duplicate names. It's just that I feel this should be properly implemented instead of using a temporary workaround that makes the proper implementation harder.

My argument would be that, on one hand, the proper implementation will not be made (much) harder by the suggested change; on the other hand, a quick fix now would enable users to continue working on their databases, while at the same time developers could design and implement group update without hectic, in parallel, and have enough time to consider multiple possibilities and multiple scenarios.

@AEgit
Copy link

AEgit commented Apr 4, 2021

@tobiasdiez These are hard choices. May I suggest the following roadmap for group interface:

From a user perspective this roadmap sounds like a good plan.

@AEgit
Copy link

AEgit commented Apr 4, 2021

#7554 (comment) The problem I see with this approach is that there is a time where people may create groups with duplicate names. This makes the migration a lot harder. Thus I would prefer to sort this out before.

If people will create groups with duplicate names, even after being warned, this means that they need such groups. Why not supporting this workflow?

I reckon one problem here could be, that user initially (mistakenly) ignore the warning and only later on realize, that they actually wanted unique groups. Yes, that would be a user mistake - but it is something to think about. Fortunately, you already outlined an approach to deal with this issue: If the non-unique groups are flagged and the groups can be filtered to show the non-unique ones, then the user can easily solve this problem within JabRef.

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 4, 2021

#7554 (comment) The problem I see with this approach is that there is a time where people may create groups with duplicate names. This makes the migration a lot harder. Thus I would prefer to sort this out before.

If people will create groups with duplicate names, even after being warned, this means that they need such groups. Why not supporting this workflow?

I reckon one problem here could be, that user initially (mistakenly) ignore the warning and only later on realize, that they actually wanted unique groups. Yes, that would be a user mistake - but it is something to think about. Fortunately, you already outlined an approach to deal with this issue: If the non-unique groups are flagged and the groups can be filtered to show the non-unique ones, then the user can easily solve this problem within JabRef.

Correct, the non-unique groups should be indicated, and I would implement this in the PR #7558 if we get a go from developers :).

When discovered, duplicated groups can be renamed without loss of information – this already works in the current PR (tested in several different situations). Entries assigned to the original group stay in that group, and you have a choice whether to include the same entries into the newly renamed group or not. Here the existing JabRef functionality (a popup that asks whether the modified group should receive the entries of the original group) turned out very useful.

@AEgit
Copy link

AEgit commented Apr 4, 2021

I really would like people to have groups with duplicate names. It's just that I feel this should be properly implemented instead of using a temporary workaround that makes the proper implementation harder.

My argument would be that, on one hand, the proper implementation will not be made (much) harder by the suggested change; on the other hand, a quick fix now would enable users to continue working on their databases, while at the same time developers could design and implement group update without hectic, in parallel, and have enough time to consider multiple possibilities and multiple scenarios.

I would agree with that - people, who already need this functionality, would have immediately access to it. Given, that as @sauliusg points out, this is technically already part of JabRef (it is just the GUI that forbids creating non-unique groups), the resulting changes to JabRef appear minor.
At a later stage, an improved solution as proposed by @tobiasdiez could be implemented - but this might take some time during which the users which requires this feature are left without options.
Note also, that by allowing this minor fix, experience can be gained regarding the pros and cons of allowing non-unique group names. Maybe there are some hidden drawbacks that are not obvious at the moment.

Finally, if that is not convincing. maybe a compromise could be a solution. Implement the fix as described by @sauliusg but add an additional setting option to the preferences, where the user can actively control whether non-unique groups should be allowed in the GUI or not. Set the default to not allowing the creation of non-unique groups in the GUI (the current default). Add a description/tooltip indicating, that the user has to know, what he is doing if he wishes to allow non-unique groups and maybe call this a BETA feature (given that a later point in time this implementation might be replaced by a more sophisticated one). I reckon this should address the possible concerns and hopefully not be too much more work to implement?

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 4, 2021

@tobiasdiez Regarding migration plan to truly unique groups with names in the entries (i.e. user-editable):

  • If names in entries are not unique, they are disambiguated automatically using UUID appended in braces (e.g. Languages(36592084-944f-11eb-b17a-5fe0541bb1b3));

The UUID would always be a suffix with distinct syntax. Therefore, it is always possible to include a name that looks like it has UUID at the end – just add the true UUID as a suffix. E.g.:
Uuid like group name (68a0a878-950d-11eb-b17b-ef8af80ecb58)(63f1d8d9-7505-4dde-af90-8a83642fbc33)

Here ``Uuid like group name (68a0a878-950d-11eb-b17b-ef8af80ecb58)is a *group* *name* (which does not contain UUID, although it looks superficially like containing UUID), and63f1d8d9-7505-4dde-af90-8a83642fbc33` is the UUID for that group.

Thus even in the very unlikely event of the user generating such strange UUID-like names, JabRef would be able to handle them, and still stay compatible with the current group = {Group Name} implementation. The only limitation would be that for such crazy names (names that end in a UUID-like string) you will have to include UUID (while you can skip it in the rest of the cases). I strongly suspect that the only place in the world where such group names will be used will be JabRef test cases that check the functionality :D. So this seems very safe and usable.

@AEgit
Copy link

AEgit commented Apr 4, 2021

I suggest we accept that we loose manual editing and would go for uuid's. The user experience can be improved afterwards by using a tag-based interface, for easy management of groups from the entry editor.

This only came to my mind later on and it is not directly related to the problem at hand (so, if this goes too much off-topic, maybe we should bring this discussion elsewhere): but, does a tag-based interface reflect the hierarchy that groups provide?
E.g., let's say I have the main group "Physics" and the subgroups "Temperature" and "Mass". If an article is assigned to "Temperature", I reckon its tag would be "Temperature". But how would a tag-based system reflect the fact, that within the groups hierarchy it is the subgroup of "Physics". You cannot just add the tag "Physics" since it is not part of the group, it is only found lower in the hierarchy. Would you need separate tags for the actual group assignment and for the position within the hierarchy? That would result in many "subtags" if you have a deep groups tree (unrelated to this, but also to consider - if an item is assigned to many groups, it will also have many tags).
And yes, the position in the groups hierarchy is relevant (I can provide an actual example, if necessary).
These are just things, that came to my mind, while thinking about the tags-feature.

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 4, 2021

E.g., let's say I have the main group "Physics" and the subgroups "Temperature" and "Mass". If an article is assigned to "Temperature", I reckon its tag would be "Temperature". But how would a tag-based system reflect the fact, that within the groups hierarchy it is the subgroup of "Physics". You cannot just add the tag "Physics" since it is not part of the group, it is only found lower in the hierarchy. Would you need separate tags for the actual group assignment and for the position within the hierarchy? That would result in many "subtags" if you have a deep groups tree (unrelated to this, but also to consider - if an item is assigned to many groups, it will also have many tags).

A very good point – and fortunately, the current JabRef group design handles this nicely.

No, you do not need to assign sub-tags :). What you need to is to declare the "Physics" group to be a "union" group (i.e. set "Hierarchical context" to "union" (see the attached image)). Then you can create groups "Temperature" and "Mass", and place them as subgroups of "Physics". This latter fact is recorded only in the JabRef jabref-meta: grouping: comment at the end of the file, not in the entries. The entries will contain only group = {Temperature} tag if assigned to the "Temperature" group.
If you now click on the "Physics" group in the JabRef group GUI, you will get entries from both "Temperature" and "Mass" subgroups, as expected for a 'union' group. If you add more subgroups to "Physics", entries from the new groups will be displayed in "Physics" as well. If you reorganise your tree and move the Temperature group elsewhere, this will be automatically reflected in the display of "Physics" and all other 'union' (and, I suppose, 'intersection') groups.
Screenshot from 2021-04-04 09-35-18

In this way you can quickly navigate in the group hierarchy and have papers assigned to only "Temperature", ony "Mass", or to "Physics" which contains both entries from the "Temperature" and "Mass" subgroups.

If you have two 'Physics' groups in two distinct places of the tree, each of them will show entries (possibly different ones) from their respective subgroups, in addition to identical papers directly assigned to the "Physics" group. I plan to keep non-leaf groups "empty", i.e. not having papers assigned directly to them but only to their leaf subgroups, but of course nothing prevents me at the moment from adding papers directly to "Physics" if I wish so, so that they appear in all "Physics" groups in my tree – might be useful if I wan to have some important papers always at my mouse click, no matter what else I select on the topic.

@AEgit
Copy link

AEgit commented Apr 4, 2021

Sorry, I might not have explained this well. I had the impression that @tobiasdiez was planning to add a tag-based feature which should replace the current groups hierarchy. That is where my questions are coming from.

What I have been using so far, was the global union/intersection settings (they were available in 3.8.2 so - I hope - they are still available in the current version 5).
This allows the following:
a) If I select only the group "Mass", I will only see the articles assigned to "Mass". The same applies to an example where I only select the group "Temperature" or I only select the group "Physics".
b) If I select the groups "Mass" and "Temperature", I will see the articles assigned to both groups (depending on the global setting I chose).
c) If I select the groups "Mass", "Temperature", and "Physics" I will see the articles assigned to all three groups.

That is exactly, what I want - if I select "Physics" I only want to see articles, that have directly been assigned to "Physics". I do not want to see articles, that are part of subgroups of "Physics" UNLESS I explicitly select those subgroups as well.
This is all possible with JabRef 3.8.2 (and hopefully also with version 5 - would need to test that) - but I am not so sure, how you would provide this functionality with a newly introduced tag-based system.

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 4, 2021

Sorry, I might not have explained this well. I had the impression that @tobiasdiez was planning to add a tag-based feature which should replace the current groups hierarchy. That is where my questions are coming from.

Oh I see! I misunderstood you, sorry. No, the group tree is implemented in 5.x, and, AFAIU, there are not plans to remove it... (are we right, @tobiasdiez ?)

What I have been using so far, was the global union/intersection settings (they were available in 3.8.2 so - I hope - they are still available in the current version 5).
This allows the following:
a) If I select only the group "Mass", I will only see the articles assigned to "Mass". The same applies to an example where I only select the group "Temperature" or I only select the group "Physics".

Works in 5.x and in PR #7558.

b) If I select the groups "Mass" and "Temperature", I will see the articles assigned to both groups (depending on the global setting I chose).

Works as above, except for a "union/intersection" toggle bug (you need to restart the JabRef for your change to take effect).

c) If I select the groups "Mass", "Temperature", and "Physics" I will see the articles assigned to all three groups.

Works, depending on the global settings and on the "Physics" "Hierarchical context" setting, I guess.

That is exactly, what I want - if I select "Physics" I only want to see articles, that have directly been assigned to "Physics". I do not want to see articles, that are part of subgroups of "Physics" UNLESS I explicitly select those subgroups as well.
This is all possible with JabRef 3.8.2 (and hopefully also with version 5 - would need to test that) - but I am not so sure, how you would provide this functionality with a newly introduced tag-based system.

As of current, in my tests, the 5.x supports all functionality what you described.

In the suggested roadmap above, all this functionality could be (hopefully) retained, plus group disambiguation UUID cold be added. If I correctly understand, unique unambiguous group IDs and the current group hierarchy is in the plans, isn't it, @tobiasdiez ?

@tobiasdiez
Copy link
Member

tobiasdiez commented Apr 4, 2021

The groups tree stays of course! My idea was to replace the groups field in the entry editor (which right now is only a text box) by tags that show the group name and on hover the complete path of the group.

Concerning the future steps, how would the migration to an id-based system look like after we allowed groups with duplicate names? Right now we can assume that the information in the groups field uniquely identifies a group (the only way to not have this unique identifier is by either migrating from a very old JabRef or manually editing the groups tree - in both cases we can simply issue a warning and take the first group matched by the name). Thus the migration is very easy, one simply needs to replace the group name by its id. But once it is officially allowed to have groups with the same name, this identification breaks down.

Moreover, I don't really see a reason for urgency right now. The serialization of groups in the groups field has been done 5 years ago, and shortly thereafter the restriction in the ui was added. So this is in place for quite some time. A few months more doesn't change much in the grand scheme of things, and I actually don't think that a proper solution takes so long to implement. The changes should be comparable to the changes in the linked PR, plus some additional tests.

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 4, 2021

Concerning the future steps, how would the migration to an id-based system look like after we allowed groups with duplicate names? Right now we can assume that the information in the groups field uniquely identifies a group (the only way to not have this unique identifier is by either migrating from a very old JabRef or manually editing the groups tree - in both cases we can simply issue a warning and take the first group matched by the name). Thus the migration is very easy, one simply needs to replace the group name by its id. But once it is officially allowed to have groups with the same name, this identification breaks down.

Migration after the proposed PR will be just as easy as in the case of "unique" group names. The point is that the groups with identical names are still unique, in the sense that they contain, by definition, the same entries. So you have two options when migrating:

  1. either add the same UUID to the "duplicated" groups (or, rather, the same group represented in multiple place in the group tree); or
  2. Add different UUIDs to such groups, and assign all entries in the original "duplicated" group to all such groups; issue a warning that from now on, handling of group synonyms in different tree branches has changed.

In both cases the users' group assignments will not change, and their work will be preserved. The different behaviour will have to be made explicit, but, once understood, it can either can be used. Also, nothing prevents me from adding the same group with the same UUID to multiple branches in the tree, so the functionality of PR #7558 is still needed, in one way or another, even when you transition to a new group implementation.

Moreover, as you say yourself, "the /only/ way to not have this unique identifier is by either migrating from a very old JabRef or manually editing the groups tree - in both cases we can simply issue a warning and take the first group matched by the name" – which means that JabRef will have to deal with duplicated group names, no matter what. Thus one can not "officially allow" or "disallow" duplicated groups – the bibtex file is an external input for JabRef, managed at the users' discretion, and it can be processed by many external tools, many of which we don't even know. So asking whether "one should allow duplicates" is a wrong question based on unsatisfyable assumptions; what could be asked instead is "how do we deal with duplicated groups in the way that is most convenient for the users, and reasonably easy to implement/support for developers"?

@AEgit has a very good point: "experience can be gained regarding the pros and cons of allowing non-unique group names. Maybe there are some hidden drawbacks that are not obvious at the moment." Or maybe the user will like the feature and start using it. I definitely see the use for this feature for myself.

So the bottom line would be: migrating to group names based on UUIDs will be as easy with PR #7558 as it is now, since the groups essentially continue to be unique, in the sense that there is only one set of entries assigned to each named group.

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 4, 2021

The groups tree stays of course!

That's very reassuring! :)

My idea was to replace the groups field in the entry editor (which right now is only a text box) by tags that show the group name and on hover the complete path of the group.

This can be done in any group representation, even after the PR #7558 – for groups that are present several times in the group tree you could just show all their locations, one location per line :)

@tobiasdiez
Copy link
Member

either add the same UUID to the "duplicated" groups (or, rather, the same group represented in multiple place in the group tree); or

That's not gona work. The point of having id's was that they uniquely identify a group...otherwise what's the point?

Add different UUIDs to such groups, and assign all entries in the original "duplicated" group to all such groups; issue a warning that from now on, handling of group synonyms in different tree branches has changed.

That's my point: we allow now duplicate group names, but they will behave differently (and more correctly in my opinion) after the migration to the id approach. Sorry, but I don't see the advantages. To be honest, in the we now spent discussing this, one could have easily implemented the solution for id-based groups.

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 4, 2021

Moreover, I don't really see a reason for urgency right now. The serialization of groups in the groups field has been done 5 years ago, and shortly thereafter the restriction in the ui was added.

To me, this was a very unfortunate decision. It rendered the group mechanism unusable for my purposes (lost all group assignments that I had), and I am very upset each time when I see the deactivated OK button in the group creation dialogue, even more so since I know that this restriction is completely unnecessary, and even checked this statement in practice on my JabRef clone :).

So this is in place for quite some time. A few months more doesn't change much in the grand scheme of things, and I actually don't think that a proper solution takes so long to implement. The changes should be comparable to the changes in the linked PR, plus some additional tests.

May I quote another related Issue (#7442): "To be honest, I'm really angry at the moment as I'm not using JabRef for fun but really need it for work and I cannot waste that much time on recreating tagging, grouping etc. over and over again". It is actually a big praise to JabRef that it is used in professional environment, but also adds certain constraints on what changes and timings the users can tolerate – we need to have the work done, and hope that the tool will help us and spare us time – provided we can give due support to your guys who develop JabRef :).

As a matter of fact, for me the issue is rather urgent. For 5 years, we could not use the group interface, so my colleagues abandoned JabRef grouping in favour of CLI tools; some neighbouring labs switched to completely different solutions; in case you are interested I have currently my grouping tree in Zotero (which does not put any restrictions on my group names) and do some command-line fiddling to sync the two repos, my bibtex file which I manage in JabRef and the Zotero DB. But this year I have started a new course in our university and have new students who need to manage bibliographies for their theses. I need to recommend them a good tool for managing bibliographies. I can not offer them a pure command line or dual tool solution, they have still many things to learn. JabRef would be a candidate no. 1 if it had an arbitrary (including duplicated) group name support and would not crash too often. I need to make the recommendation in the next weeks; two or three months are way too long to wait, by that time the students need to have their thesis well advanced, and definitely all literature overview done.

So if we find a solution how to handle multiple duplicated group names into JabRef, for example as my suggested PR or in some other way, I would pull or write fixes/workarounds for #7010, #7442, #7351 and #7606 into my branch; if I know that these fixes will be in the next official JabRef release, we (I and my students) can then start working next week with my branch and switch back to mainstream when the JabRef release is out (no hurry there, then). I would sync my Zotero tree with JabRef, finally have my groups assigned and continue working on my main subjects :).

As quick fix for grouping now is much much more important for us than a "correct" solution in tree months...

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 4, 2021

either add the same UUID to the "duplicated" groups (or, rather, the same group represented in multiple place in the group tree); or

That's not gona work.

It will work the same way as it works now. It will work the same way as it works now. In the essence, the current JabRef code handles multiple groups with the same name :) I understand that this was not planned, but it can be viewed as an emergent phenomenon :).

The point of having id's was that they uniquely identify a group...otherwise what's the point?

The point in having UUIDs would be to help users have truly different groups in different parts of the tree that have identical names. This feature can be added later, as you say, in two or three months if timing permits.

@AEgit
Copy link

AEgit commented Apr 4, 2021

The groups tree stays of course! My idea was to replace the groups field in the entry editor (which right now is only a text box) by tags that show the group name and on hover the complete path of the group.

Cheers, that is re-assuring. I am still not sure, as to how this is going look and/or what the benefits are (the good thing at the moment is, that groups can easily be modified in any text editor), but probably this is something the user needs to see in action to understand the advantages.

@sauliusg
Copy link
Contributor Author

sauliusg commented Apr 4, 2021

Sorry, but I don't see the advantages.

The advantage is outlined in my timeline description above.

To be honest, in the we now spent discussing this, one could have easily implemented the solution for id-based groups.

Could be, but I do not see how. When I claimed three weeks ago that I think one can allow identical group names safely and conveniently in just a few lines of code, I then made code that demonstrates this (#7558). I do not see how I can make an UUID based solution in a comparable time with a comparable effort. You know JabRef code much better, so maybe for you it is easer to implement, but for me it would definitely take more time and effort; on the months range, actually, compatible with your estimate.

ALso, as you said, the full UUID solution needs to be thought through. Leaving just UUIDs in bibtex entries and eliminating names is not a good option, as I have mentioned before...

@AEgit
Copy link

AEgit commented Apr 4, 2021

That's my point: we allow now duplicate group names, but they will behave differently (and more correctly in my opinion) after the migration to the id approach. Sorry, but I don't see the advantages. To be honest, in the we now spent discussing this, one could have easily implemented the solution for id-based groups.

I reckon @sauliusg point is that his solution will just enable something, that is already implicitly possible with the current JabRef version (either by using old databases or by editing the database manually in a text editor). As such, it does not pose a major change in the implementation of JabRef. The ID based approach, on the other hand, requires major thinking on its potential downstream effects and thus will not be that easy to implement. If this more sophisticated ID approach is implemented without much thought, it could lead to much more problematic situations then the available simple approach (which only enables something that is already possible without using the GUI).

I am only a user and I will probably also not use non-unique group names, so I am in no position to demand anything - take my input here just as thoughts on the topic. Having said that and given these different opinions on the topic, would it be worthwhile to have a sort of a BETA version of Jabref (similar to a separate branch on Github, but I have something more stable in mind), which allows users to use features, which the main developers of JabRef do not think should currently appear in the main/stable version? It could either be a separate version (though I think that would just make things more complicated) or the currently used developer version, but with an additional preference window, which allows the user to enable certain BETA features, e.g. the creation of non-unique groups.

@koppor
Copy link
Member

koppor commented Apr 10, 2021

but with an additional preference window, which allows the user to enable certain BETA features, e.g. the creation of non-unique groups.

I think, our different branches should cover that request. We aim for reducing the number of preferences and keep the number testing candidates small. We even cannot catch up with the current PRs; thus there IMHO is no chance to maintain a beta version. Maybe, if some developer jumps in ^^

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants