Check for duplicate sources in messages #156

zachcmadsen · 2024-01-27T01:38:58Z

With rounded line numbers, it's possible for messages to have duplicate sources. This checks for duplicate sources before updating messages. An alternative approach is deduplicating sources after all the messages are added/updated. I went with the "on-the-fly" approach since it's simpler.

The check scans all of the sources for a message. If performance is a concern, we could try other ways to check for duplicates.

Fixes #154

google-cla · 2024-01-27T01:39:01Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

codecov-commenter · 2024-01-27T12:34:28Z

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (fba1b0a) 90.74% compared to head (ba57994) 90.80%.

Files	Patch %	Lines
i18n-helpers/src/xgettext.rs	94.73%	0 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #156      +/-   ##
==========================================
+ Coverage   90.74%   90.80%   +0.05%     
==========================================
  Files          11       11              
  Lines        2399     2436      +37     
  Branches     2399     2436      +37     
==========================================
+ Hits         2177     2212      +35     
  Misses        159      159              
- Partials       63       65       +2


kdarkhan · 2024-01-27T12:36:28Z

i18n-helpers/src/xgettext.rs

@@ -41,7 +41,13 @@ fn strip_link(text: &str) -> String {

 fn add_message(catalog: &mut Catalog, msgid: &str, source: &str) {
    let sources = match catalog.find_message(None, msgid, None) {
-        Some(msg) => wrap_sources(&format!("{}\n{}", msg.source(), source)),
+        Some(msg) => {
+            if msg.source().contains(source) {


I think the existence check should be done per line rather than with a plain string contains. For example, the proposed logic would fail if the source already contains the entry some-dir/README.md and the new source is README.md.
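To make the failure case concrete, here is a minimal sketch comparing the two checks. The function names are hypothetical and not taken from the PR, and the per-line variant assumes each source entry sits on its own line, which may not match how the field is stored at this revision.

```rust
/// Substring check, as in the diff under review: any textual overlap
/// counts as "already present".
fn contains_substring(existing: &str, source: &str) -> bool {
    existing.contains(source)
}

/// Per-line check (hypothetical): the new source must equal a whole
/// entry. Assumes one source entry per line.
fn contains_line(existing: &str, source: &str) -> bool {
    existing.lines().any(|line| line == source)
}
```

With existing set to "some-dir/README.md" and source set to "README.md", the substring check reports a duplicate and the new source would be dropped, while the per-line check does not.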

That makes sense. I'll update the logic and add a test case for that scenario

Good catch @kdarkhan! I think this could be handled nicely by

1. Separating the sources with \n when add_message is called.
2. Use lines() to split the source field back to lines and then deduplicate these lines.
3. Do a call to wrap_sources to nicely wrap things.

Both steps 2 and 3 could be done after adding all messages to the catalog. I think that would give us linear complexity. Right now, we have some potential for O(n²) complexity.
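The deduplication step from this outline could be sketched roughly as follows. This is an illustration under assumptions, not the code that was merged: dedup_source_lines is a hypothetical helper, it expects the newline-separated field described in step 1, and the final wrapping is left to wrap_sources, whose implementation is not shown in this thread.

```rust
use std::collections::HashSet;

/// Hypothetical helper: removes repeated lines from a newline-separated
/// sources field, keeping the first occurrence of each line in order.
/// Runs in expected linear time in the number of lines.
fn dedup_source_lines(sources: &str) -> String {
    let mut seen = HashSet::new();
    sources
        .lines()
        // `insert` returns false when the line was already seen.
        .filter(|line| seen.insert(*line))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Using a set makes no assumption about where the duplicates appear, at the cost of some extra memory per message.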

Putting one filename per line removes ambiguity about filenames that contain spaces. I'm not actually sure how such filenames ought to be stored in the sources field, so feel free to experiment a bit with xgettext if you are curious!

Thanks for the outline @mgeisler! I pushed an update along those lines. What I have assumes that duplicate sources are consecutive. Are there cases where that doesn't hold?

> What I have assumes that duplicate sources are consecutive. Are there cases where that doesn't hold?

I think it should hold right now: we walk over the contents of the book in the order given by Book::iter. It says that it does a depth-first traversal of the files of the book, so it ought to do the right thing!
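For reference, the "consecutive duplicates" assumption matters because a dedup that only compares neighbours leaves non-adjacent repeats in place. A small sketch using Vec::dedup from the standard library; the helper name is hypothetical and this is not necessarily how the pushed update is written:

```rust
/// Hypothetical helper: drops only adjacent repeated lines, so it is
/// correct only when equal sources always appear next to each other.
fn dedup_consecutive(sources: &str) -> Vec<&str> {
    let mut lines: Vec<&str> = sources.lines().collect();
    // Vec::dedup removes consecutive equal elements only.
    lines.dedup();
    lines
}
```

If the traversal order ever stopped grouping equal sources together, a repeat separated by another entry would survive this pass.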

This change looks good to me.

Here is the diff of the generated messages.pot from the comprehensive-rust repo.

https://gist.github.com/kdarkhan/9e270ae1e5842abd10f826dcaa5940aa

kdarkhan · 2024-02-09T00:10:25Z

Thanks for the contribution @zachcmadsen! Feel free to pick up the follow-up task #171 if you want.

zachcmadsen added 2 commits January 26, 2024 17:04

Add a check to prevent duplicate sources in messages

cd8f9fc

Add a test

ba57994

kdarkhan reviewed Jan 27, 2024


Deduplicate sources after all messages are added to the catalog

994ac6e

kdarkhan approved these changes Feb 7, 2024


kdarkhan merged commit 4853f55 into google:main Feb 8, 2024
7 checks passed

zachcmadsen deleted the zach/dedup-sources branch February 9, 2024 08:18


kdarkhan Jan 27, 2024

zachcmadsen Jan 29, 2024

mgeisler Feb 6, 2024

zachcmadsen Feb 7, 2024

mgeisler Feb 7, 2024

kdarkhan Feb 7, 2024
