Create a new, deduplicated mailbox with unique emails too (Documentation: What is "discarded"?) #599

Open
turian opened this issue Jan 13, 2024 · 3 comments

turian commented Jan 13, 2024

Is your feature request related to a problem? Please describe.

I have several mailboxes, with many duplicates.

I want to create a new mailbox, with all de-duplicated mail from the old mailboxes, including non-duplicates.

Documentation confusion
I'm puzzling over the documentation, because it is not really clear what "selected" and "discarded" mean.

Let's say there are emails 1A, 1B, and 2. (1A and 1B are duplicates in different mailboxes.)

Whatever strategy I choose, 1A and 1B are compared and one is selected and the other is discarded.

But what happens to 2?
a) It has no hash matches, so it is never compared or selected, and it isn't copied to my new mailbox. In that case I am stuck on how to solve my problem.
b) There is always a "selected" mail, even when a mail is unique and has no hash matches.

Can you please clarify? (I also think a documentation update would help. I read over the main docs and didn't understand, which is why I am posting.)
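
To make the goal concrete, here is a minimal sketch of the outcome I am after, written with Python's standard `mailbox` module only. It does not use this project's code or its hashing rules: `KEY_HEADERS`, `mail_key` and `merge_deduplicated` are hypothetical names, and hashing four headers is just an assumption for illustration. It behaves like reading (b): a unique mail is a group of one and is still copied.

```python
import hashlib
import mailbox

# Hypothetical choice of headers; the real tool's hashing rules may differ.
KEY_HEADERS = ("Date", "From", "To", "Subject")


def mail_key(msg):
    """Hash a few headers into a grouping key (an illustrative stand-in)."""
    joined = "\n".join(str(msg.get(name, "")) for name in KEY_HEADERS)
    return hashlib.sha224(joined.encode("utf-8")).hexdigest()


def merge_deduplicated(source_paths, dest_path):
    """Copy one mail per key from all sources into a new mbox.

    Unique mails form a group of one and are copied as well.
    Returns the number of mails written.
    """
    seen = set()
    dest = mailbox.mbox(dest_path)
    try:
        for path in source_paths:
            src = mailbox.mbox(path, create=False)
            try:
                for msg in src:
                    key = mail_key(msg)
                    if key in seen:
                        continue  # a later duplicate: leave it behind
                    seen.add(key)
                    dest.add(msg)
            finally:
                src.close()
        dest.flush()
    finally:
        dest.close()
    return len(seen)
```

With 1A and 2 in one mailbox and 1B in another, this would write 1A and 2 to the new mailbox. The first mail seen for each key wins, which is an arbitrary choice in this sketch.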

@turian turian added 🎁 feature request Not existing yet and need to be implemented 🙏 help wanted I can't do this alone and need contributors labels Jan 13, 2024

turian commented Jan 13, 2024

One other thing that isn't clear from the documentation:

If two items tie, e.g. have the same datestamp, is a tiebreak made? That would be logical, but a strict reading of the documentation implies that BOTH emails are selected.

Meaning, if 1A and 1B have identical timestamps, are BOTH selected and acted upon? Or is just one selected, for actions that typically operate on a single message?
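
The two readings can be sketched in a few lines of Python. This is only an illustration of the question, not the tool's implementation: the `Mail` record, both function names, and the `(source, position)` tiebreak order are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mail:
    source: str     # path of the mailbox the mail came from
    position: int   # index of the mail within that mailbox
    timestamp: int  # Unix time parsed from the Date header


def select_all_newest(mails):
    """Strict reading: keep every mail sharing the newest timestamp."""
    newest = max(m.timestamp for m in mails)
    return [m for m in mails if m.timestamp == newest]


def select_one_newest(mails):
    """Tiebreak reading: newest first, then lowest (source, position)."""
    return min(mails, key=lambda m: (-m.timestamp, m.source, m.position))


mail_1a = Mail("a.mbox", 0, 1609459200)
mail_1b = Mail("b.mbox", 0, 1609459200)
print(len(select_all_newest([mail_1a, mail_1b])))    # 2
print(select_one_newest([mail_1a, mail_1b]).source)  # a.mbox
```

Under the strict reading both mails survive a tie; under the tiebreak reading exactly one does, and the choice is deterministic because the secondary key never ties for distinct mails.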


turian commented Jan 13, 2024

Just to follow up, I still could not determine the behavior. I fed each source file to GPT-4 to look for the code that would answer my question, but I could not identify which code handles unique emails during deduplication or resolves ties in duplicate selection.


turian commented Jan 13, 2024

So I am constructing a toy mbox to understand the behavior, but now I am more confused than ever:

From test@example.com Thu Jan  1 00:00:00 2021
Subject: Duplicate Email 1
Date: Thu, 1 Jan 2021 00:00:00 +0000
From: test@example.com
To: recipient@example.com

This is a duplicate email.

From test@example.com Thu Jan  1 00:00:00 2021
Subject: Duplicate Email 1
Date: Thu, 1 Jan 2021 00:00:00 +0000
From: test@example.com
To: recipient@example.com

This is a duplicate email.

From test@example.com Thu Jan  1 00:01:00 2021
Subject: Slightly Different Email
Date: Thu, 1 Jan 2021 00:01:00 +0000
From: test@example.com
To: recipient@example.com

This email is slightly different.

From test@example.com Thu Jan  1 00:02:00 2021
Subject: Unique Email
Date: Thu, 1 Jan 2021 00:02:00 +0000
From: test@example.com
To: recipient@example.com

This is a unique email.

Running the select-newest strategy with the move-discarded action on this mbox gives:


● Step #5 - Report and statistics
╒════════════╤════════╤══════════════════════════════════════════════════════════════╕
│ Mails      │ Metric │ Description                                                  │
╞════════════╪════════╪══════════════════════════════════════════════════════════════╡
│ Found      │      4 │ Total number of mails encountered from all mail sources.     │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Rejected   │      0 │ Number of mails rejected individually because they were      │
│            │        │ unparseable or did not have enough metadata to compute       │
│            │        │ hashes.                                                      │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Retained   │      4 │ Number of valid mails parsed and retained for deduplication. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Hashes     │      3 │ Number of unique hashes.                                     │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Unique     │      0 │ Number of unique mails (which where automatically added to   │
│            │        │ selection).                                                  │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Duplicates │      4 │ Number of duplicate mails (sum of mails in all duplicate     │
│            │        │ sets with at least 2 mails).                                 │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Skipped    │      4 │ Number of mails ignored in the selection step because the    │
│            │        │ whole set they belong to was skipped.                        │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Discarded  │      0 │ Number of mails discarded from the final selection.          │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Selected   │      0 │ Number of mails kept in the final selection on which the     │
│            │        │ action will be performed.                                    │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Copied     │      0 │ Number of mails copied from their original mailbox to        │
│            │        │ another.                                                     │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Moved      │      0 │ Number of mails moved from their original mailbox to         │
│            │        │ another.                                                     │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Deleted    │      0 │ Number of mails deleted from their mailbox in-place.         │
╘════════════╧════════╧══════════════════════════════════════════════════════════════╛
╒════════════════════╤════════╤════════════════════════════════════════════════════════════╕
│ Duplicate sets     │ Metric │ Description                                                │
╞════════════════════╪════════╪════════════════════════════════════════════════════════════╡
│ Total              │      3 │ Total number of duplicate sets.                            │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Single             │      0 │ Total number of sets containing only a single mail with no │
│                    │        │ applicable strategy. They were automatically kept in the   │
│                    │        │ final selection.                                           │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Encoding │      0 │ Number of sets skipped from the selection process because  │
│                    │        │ they had encoding issues.                                  │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Size     │      0 │ Number of sets skipped from the selection process because  │
│                    │        │ they were too dissimilar in size.                          │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Content  │      0 │ Number of sets skipped from the selection process because  │
│                    │        │ they were too dissimilar in content.                       │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Strategy │      3 │ Number of sets skipped from the selection process because  │
│                    │        │ the strategy could not be applied.                         │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Deduplicated       │      0 │ Number of valid sets on which the selection strategy was   │
│                    │        │ successfully applied.                                      │
╘════════════════════╧════════╧════════════════════════════════════════════════════════════╛

This suggests:

  • If a mail has enough headers, it is considered (the report seems to call this "retained") and will be either selected or discarded. Otherwise it is "rejected", which is important but not well documented, since rejected mails are not acted upon.
  • Since the two duplicate emails have identical timestamps, both are selected. This is technically correct, but hard to work around when you do want tiebreak behavior.

Anyway, what seems clear is that all emails are selected, so move-discarded moved none. move-selected should therefore move ALL of them, right? But when I run the same command with move-selected, nothing happens and the mbox is unchanged:

● Step #3 - Select mails in each group
info: select-newest strategy will be applied on each duplicate set to select candidates.
info: ◼ 2 mails sharing hash 05a3285c1254315fa50966ae1bed99e47ab51a592d9e728a7a70e526
info: Check mail differences are below the thresholds.
info: Select all mails sharing the newest 1609459200 timestamp...
warning: Skip set: all 2 mails within were selected. The strategy criterion was not able to discard some.
info: Check mail differences are below the thresholds.
info: Select all mails sharing the newest 1609459260 timestamp...
warning: Skip set: all 1 mails within were selected. The strategy criterion was not able to discard some.
info: Check mail differences are below the thresholds.
info: Select all mails sharing the newest 1609459320 timestamp...
warning: Skip set: all 1 mails within were selected. The strategy criterion was not able to discard some.

● Step #4 - Perform action on selected mails
info: Perform move-selected action...
warning: No mail selected to perform action on.

● Step #5 - Report and statistics
╒════════════╤════════╤══════════════════════════════════════════════════════════════╕
│ Mails      │ Metric │ Description                                                  │
╞════════════╪════════╪══════════════════════════════════════════════════════════════╡
│ Found      │      4 │ Total number of mails encountered from all mail sources.     │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Rejected   │      0 │ Number of mails rejected individually because they were      │
│            │        │ unparseable or did not have enough metadata to compute       │
│            │        │ hashes.                                                      │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Retained   │      4 │ Number of valid mails parsed and retained for deduplication. │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Hashes     │      3 │ Number of unique hashes.                                     │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Unique     │      0 │ Number of unique mails (which where automatically added to   │
│            │        │ selection).                                                  │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Duplicates │      4 │ Number of duplicate mails (sum of mails in all duplicate     │
│            │        │ sets with at least 2 mails).                                 │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Skipped    │      4 │ Number of mails ignored in the selection step because the    │
│            │        │ whole set they belong to was skipped.                        │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Discarded  │      0 │ Number of mails discarded from the final selection.          │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Selected   │      0 │ Number of mails kept in the final selection on which the     │
│            │        │ action will be performed.                                    │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Copied     │      0 │ Number of mails copied from their original mailbox to        │
│            │        │ another.                                                     │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Moved      │      0 │ Number of mails moved from their original mailbox to         │
│            │        │ another.                                                     │
├────────────┼────────┼──────────────────────────────────────────────────────────────┤
│ Deleted    │      0 │ Number of mails deleted from their mailbox in-place.         │
╘════════════╧════════╧══════════════════════════════════════════════════════════════╛
╒════════════════════╤════════╤════════════════════════════════════════════════════════════╕
│ Duplicate sets     │ Metric │ Description                                                │
╞════════════════════╪════════╪════════════════════════════════════════════════════════════╡
│ Total              │      3 │ Total number of duplicate sets.                            │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Single             │      0 │ Total number of sets containing only a single mail with no │
│                    │        │ applicable strategy. They were automatically kept in the   │
│                    │        │ final selection.                                           │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Encoding │      0 │ Number of sets skipped from the selection process because  │
│                    │        │ they had encoding issues.                                  │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Size     │      0 │ Number of sets skipped from the selection process because  │
│                    │        │ they were too dissimilar in size.                          │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Content  │      0 │ Number of sets skipped from the selection process because  │
│                    │        │ they were too dissimilar in content.                       │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Skipped - Strategy │      3 │ Number of sets skipped from the selection process because  │
│                    │        │ the strategy could not be applied.                         │
├────────────────────┼────────┼────────────────────────────────────────────────────────────┤
│ Deduplicated       │      0 │ Number of valid sets on which the selection strategy was   │
│                    │        │ successfully applied.                                      │
╘════════════════════╧════════╧════════════════════════════════════════════════════════════╛
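
Going only by the log and report above, the observed numbers are consistent with the rule modeled below: within each set, every mail sharing the newest timestamp is a candidate, and if that covers the whole set, the set is skipped and none of its mails reach the action step. This is a guess reconstructed from the output, not the tool's source; `apply_select_newest`, the counter names, and the `hash-…` keys are placeholders.

```python
def apply_select_newest(duplicate_sets):
    """Model the rule the log suggests; not the tool's actual code.

    duplicate_sets maps a hash to the timestamps of the mails sharing it.
    Returns counters named after the report rows.
    """
    stats = {"selected": 0, "discarded": 0, "skipped": 0, "sets_skipped": 0}
    for timestamps in duplicate_sets.values():
        newest = max(timestamps)
        candidates = [t for t in timestamps if t == newest]
        if len(candidates) == len(timestamps):
            # Nothing could be discarded, so the whole set is skipped
            # and none of its mails reach the action step.
            stats["skipped"] += len(timestamps)
            stats["sets_skipped"] += 1
            continue
        stats["selected"] += len(candidates)
        stats["discarded"] += len(timestamps) - len(candidates)
    return stats


# The three sets from the toy mbox, keyed by placeholder hashes.
toy = {
    "hash-dup": [1609459200, 1609459200],
    "hash-different": [1609459260],
    "hash-unique": [1609459320],
}
print(apply_select_newest(toy))
# {'selected': 0, 'discarded': 0, 'skipped': 4, 'sets_skipped': 3}
```

If this model is right, it would explain why both move-discarded and move-selected are no-ops here: the tied pair and the two single-mail sets all end up skipped, so the selection is empty either way.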
