
Implement album merging for duplicates #2725

Merged: 6 commits into beetbox:master, Nov 11, 2017

Conversation

udiboy1209 (Contributor):

Fixes #112

Would removing the album before running the merge task make sense? This code seems to work now, but I think it causes problems when re-importing a library that already contains duplicate albums. The tests don't fail, though.

sampsyo (Member) left a comment:

This is incredibly cool! As you can tell from the small issue number, this is an extremely long-standing request, and this is a nice, elegant approach.

I actually like the current strategy you have, which resembles a "re-import." Removing the items from the library first shouldn't be necessary.

I made a few small comments and asked one question. Mainly, a couple of helper functions would reduce duplication, and some additional comments explaining the intent would help future readers of the code.

Would you also mind adding to the importer guide in the documentation to describe what this option does?

@@ -1352,7 +1358,26 @@ def emitter(task):
])
return pipeline.multiple(ipl.pull())

resolve_duplicates(session, task)
if type(task) != MergedImportTask:
sampsyo (Member):

I know this seems crazy, but just to check: is this condition actually necessary? That is, if we merge a bunch of albums into one, would it be OK to check again for duplicates of the newly-tagged merged album?

udiboy1209 (Contributor, Author):

I did this because the initially detected duplicates aren't removed, so they would be marked again in this MergedImportTask's duplicate list and give the user a second prompt. Choosing "Remove" at that stage would be the correct way to go, which is basically what this condition does automatically. It also avoids another pointless check for duplicates that we already know exist.

Is there a way to remove those albums from the library so that they don't get detected as duplicates again? Perhaps by setting the album's id to None?

sampsyo (Member):

Aha, got it. This makes me think that there might be a slightly deeper issue to address here: we probably shouldn't detect a duplicate when the "duplicate" album actually consists of the same songs as the newly-imported album! This should probably be true for any re-import situation, not just the new functionality you've added here.

In particular, consider this check from find_duplicates:

beets/beets/importer.py

Lines 637 to 639 in 9c6910d

album_paths = set(i.path for i in album.items())
if album_paths != task_paths:
duplicates.append(album)

This says that we don't detect a duplicate if the set of songs is exactly the same between the old and new album. Perhaps this should be changed to a subset check instead. Does that seem reasonable?
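
For illustration, the subset version could look roughly like this, reusing the names from the excerpt above (album_paths, task_paths, duplicates); this is only a sketch of the idea, not necessarily the exact final patch:

album_paths = set(i.path for i in album.items())
# Only report the old album as a duplicate if it contains files that are
# not part of the incoming task. If all of its files are included in the
# task (an exact re-import or a merge), it is skipped.
if not album_paths.issubset(task_paths):
    duplicates.append(album)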

udiboy1209 (Contributor, Author):

A subset check would be correct. I will make that change, since the subset case would only occur in the merge scenario. We also wouldn't need the separate MergedImportTask class and check then.

duplicate_items = task.duplicate_items(session.lib)
for item in duplicate_items:
item.id = None
item.album_id = None
sampsyo (Member):

This looks like the same rigamarole we have to do when re-importing (which is logical). Instead of duplicating that code, maybe we need a helper function for it.
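
For illustration, such a helper might look like this (the name _freshen_items is hypothetical, not an existing beets function):

def _freshen_items(items):
    # Clear the database IDs so the items are treated as brand-new objects
    # on the next store, the same dance the re-import path already does.
    for item in items:
        item.id = None
        item.album_id = None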

iter([merged_task]),
lookup_candidates(session),
user_query(session)
])
sampsyo (Member):

Similarly, maybe a helper function would be useful here to encapsulate the temporary pipeline as used here and in user_query.
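
A rough sketch of what that helper could look like, building on the snippet above (the name _extend_pipeline is illustrative; pipeline is beets.util.pipeline as already imported in importer.py):

def _extend_pipeline(tasks, *stages):
    # Run the given tasks through a temporary pipeline built from the given
    # stages and feed every resulting task back into the main pipeline.
    ipl = pipeline.Pipeline([iter(tasks)] + list(stages))
    return pipeline.multiple(ipl.pull())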

Commit added: "also implement repeating code as helper functions"
udiboy1209 (Contributor, Author) commented Nov 3, 2017:

The problem now with re-importing duplicate albums from the library is that a separate session is created for each album. If the user decides to merge the two in the first session, the second session runs into an error because the files have probably been moved. A similar error should occur when the user decides to remove the duplicates in the first session. What is done in that case?

sampsyo (Member) commented Nov 3, 2017:

Hmm, I don't quite understand… what should I do to trigger a problem? For example, is it enough just to choose to merge and then "apply changes" in the temporary sub-pipeline that's created, or is there something more exotic that I need to do?

udiboy1209 (Contributor, Author):

I'll explain better. There are two albums in the library, A and B, which are duplicates (imported using the "Keep both" option).

Now when I try re-importing these two albums, an import session is created for each of A and B. In the first session, when B is marked as a duplicate, if I choose Merge, the tracks of B are re-imported together with A into one album.

This works well and is the intended result, but the second import session, which was working on the original album B, would now fail. The same would happen if I chose to Remove the duplicates of A, since the files of B would then be deleted.

wisp3rwind (Member):

(Disclaimer: I have neither read this PR entirely nor looked at the relevant code in detail.)

Maybe, if you could detect this situation from the task that is processed second, you could make the user_query stage yield a beets.pipeline.BUBBLE (which means aborting the task and not executing any further pipeline stages) early, before showing any prompt. That would work because stages of the same type (e.g. user_query) run strictly in sequence, so by the time the second one runs, the decision about merging the first has already been made. One remaining point of failure that would need investigation is any access to old database items that have been changed by the first task.

sampsyo (Member) commented Nov 4, 2017:

Ah! I see; thanks for explaining, @udiboy1209, and for adding some thoughts, @wordofglass. In summary, the problem only happens when re-imports collide with merges (which is not an exotic scenario). When we merge with an old album, we don't also want to import the old album separately.

In a way, this is similar to the problem we used to have with recent duplicates. If you imported album X and then another copy of album X again immediately afterward, the first copy might not have reached the database yet, so no duplicate would be detected. We used to solve this with an extra data structure on the side to keep track of "recently imported" albums, but we've since replaced that with a strategy that just adds albums to the database earlier—before the files are copied and the metadata is updated.

@wordofglass's suggestion amounts to bringing back approximately the same strategy. The user_query stage can keep track of in-library albums that have been merged already and emit a bubble if they're encountered later on. Sounds perfect to me!
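
A minimal sketch of that bookkeeping (merged_paths, record_merge, and should_skip are hypothetical names, simplified from whatever the real user_query stage would actually carry around):

from beets.util import pipeline

merged_paths = set()  # files already absorbed into a merged album this run

def record_merge(duplicate_items):
    # Remember the files that were just merged into another album.
    merged_paths.update(i.path for i in duplicate_items)

def should_skip(task):
    # True if every file of this task was consumed by an earlier merge.
    task_paths = set(i.path for i in task.items)
    return bool(task_paths) and task_paths.issubset(merged_paths)

In user_query, a task for which should_skip(task) is true could then yield pipeline.BUBBLE before any prompt is shown, exactly as @wordofglass describes.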

sampsyo (Member) commented Nov 9, 2017:

Yay! This looks perfect from here. Are there any outstanding concerns, or shall we hit the merge button?

udiboy1209 (Contributor, Author):

Great! Merge is good to go for me.

sampsyo merged commit a6215f3 into beetbox:master on Nov 11, 2017.
sampsyo added a commit that referenced this pull request on Nov 11, 2017: "Implement album merging for duplicates".
sampsyo added another commit that referenced this pull request on Nov 11, 2017.
sampsyo (Member) commented Nov 11, 2017:

Awesome! Thank you again for tackling this—it's been a very long-standing issue that many people have requested, and this is a very elegant solution. I hereby award you the Pull Request of the Month award. 🥇

sean-abbott:
Awesome, thank you so much! Just in time for me to merge my library with my partner's. :-)

DArtagan mentioned this pull request on Feb 9, 2018.
udiboy1209 deleted the merge-albums branch on March 8, 2018.