Umlauts in copyrights are removed from output files #1566

Open
MMarwedel opened this issue May 15, 2019 · 12 comments

@MMarwedel

Hi,
when scanning files containing umlauts, the umlauts are converted to plain ASCII characters. It would be better to keep them in their original form.
Sample file:
https://chromium.googlesource.com/native_client/nacl-newlib/+/master/newlib/libc/time/strptime.c
Output:
"holders": [
{
"value": "Kungliga Tekniska Hogskolan (Royal Institute of Technology, Stockholm, Sweden).",
"start_line": 2,
"end_line": 4
}
],
"copyrights": [
{
"value": "Copyright (c) 1999 Kungliga Tekniska Hogskolan (Royal Institute of Technology, Stockholm, Sweden).",
"start_line": 2,
"end_line": 4
}
],

The right output would be ... Högskolan ...

@pombredanne
Member

@MMarwedel this is indeed the case. What happens is in fact called transliteration, that is, converting Unicode characters to a plain ASCII form. One effect of this is to remove umlauts and other diacritical marks. I agree this is not perfect, and I am not sure what the rationale was when the decision was made. Since we are porting to Python 3, which uses Unicode by default, this may no longer be an issue in the near future. So I will keep this open until we have completed the port, and we can revisit how to fix this then. Would this work for you?
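
For reference, a minimal sketch of what such a transliteration does, using Python's standard unicodedata module (ScanCode's actual implementation may differ):

import unicodedata

def to_ascii(text):
    # Decompose accented characters (e.g. "ö" -> "o" + combining diaeresis),
    # then drop every code point that does not fit into ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Kungliga Tekniska Högskolan"))
# Kungliga Tekniska Hogskolan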

@MMarwedel
Author

Yes, this would work for me. As I am doing some postprocessing of the results anyway, I can fix the umlauts there for the few cases I found.

@pombredanne
Member

If you are doing some post-processing, it would be best to handle this in here if possible... everyone could then benefit?

@MMarwedel
Author

Hmm... while I would like to share code, I guess the post-processing may not be in a state easily usable by other projects. And my employer would have to agree too.

@pombredanne
Member

your call :)

@Ben-Thelen

Ben-Thelen commented Jun 11, 2021

Now that the port to Python 3 has happened, what are the chances of this issue getting another look?

I had a look into the current logic, and it seems that the parse tree breaks when using the non-ASCII string version of a line (achieved e.g. by setting the to_ascii parameter of the prepare_text_line function in src/cluecode/copyrights.py to False).

E.g. the tree for the line

Copyright (c) 2004-2007 Gerhard Häring

changes from (with to_ascii=True)

(S
  (COPYRIGHT
    Copyright/COPY
    (c)/COPY
    (NAME-YEAR
      (NAME-YEAR
        (NAME-YEAR
          (YR-RANGE (YR-RANGE 2004-2007/YR))
          Gerhard/NNP
          Haring/NNP)))))

to (with to_ascii=False)

(S
  (COPYRIGHT
    Copyright/COPY
    (c)/COPY
    (NAME-YEAR
      (NAME-YEAR
        (NAME-YEAR (YR-RANGE (YR-RANGE 2004-2007/YR)) Gerhard/NNP))))
  Häring/NN)

I could not find out exactly what is breaking here internally, but it's a lead at least.

A workaround could be to carry the original UTF-8 string along through the process and to allow (maybe via a flag?) writing out that original string to the result JSON instead of the prepared ASCII string.
This would leave the internal logic untouched, and umlauts and other UTF-8 characters could still be displayed properly in the result.
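
A rough sketch of that suggestion (hypothetical names, not ScanCode's actual code): pair each prepared ASCII line with its original text, so detection keeps running on ASCII while reporting can fall back to the original.

from dataclasses import dataclass

@dataclass
class PreparedLine:
    number: int      # 1-based line number in the source file
    original: str    # e.g. "Copyright (c) 2004-2007 Gerhard Häring"
    ascii_text: str  # e.g. "Copyright (c) 2004-2007 Gerhard Haring"

def prepare(number, line, to_ascii):
    # Detection logic would keep using ascii_text; the report writer
    # could emit original for the same line range instead.
    return PreparedLine(number, line, to_ascii(line))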

@pombredanne
Member

With the latest switch from NLTK to Pygmars (https://github.com/nexB/pygmars/) we now have more opportunities to fix the lexing and support Unicode all the way.
Why Haring is lexed as NNP while Häring is lexed as NN comes down to several possible factors:

  1. the text may have been converted/transliterated to ASCII in pre-processing, which drops the umlaut
  2. once you keep the umlauts, the regexes used for token recognition, aka lexing (https://github.com/nexB/scancode-toolkit/blob/3f7da81d6b207ac2b1d384defb83a5f2c82216f4/src/cluecode/copyrights.py#L456), are not aware of certain characters, at two levels, as the two fixes below show; see also the toy illustration after this list
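
A toy illustration of the second factor (these are not the actual lexer rules): an ASCII-only character class never matches the umlaut, so the token falls through to a weaker label.

import re

# ASCII-only name pattern in the spirit of an NNP lexer rule:
ascii_name = re.compile(r"[A-Z][a-z]+$")
print(bool(ascii_name.match("Haring")))    # True  -> tagged NNP
print(bool(ascii_name.match("Häring")))    # False -> falls back to NN

# Unicode-aware variant: [^\W\d_] matches any letter, including "ä".
unicode_name = re.compile(r"[A-Z][^\W\d_]+$")
print(bool(unicode_name.match("Häring")))  # True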

So I made a test with these fixes:

diff --git a/src/pygmars/lex.py b/src/pygmars/lex.py
index f60a9de..7fe50bc 100644
--- a/src/pygmars/lex.py
+++ b/src/pygmars/lex.py
@@ -91,7 +91,10 @@
         """
         try:
             self._matchers = [
-                (re.compile(m).match if isinstance(m, str) else m, label)
+                (
+                    re.compile(m, re.UNICODE).match if isinstance(m, str) else m,
+                    label,
+                )
                 for m, label in matchers
             ]
         except Exception as e:

and:

diff --git a/src/cluecode/copyrights.py b/src/cluecode/copyrights.py
index 74c5293..10214a4 100644
--- a/src/cluecode/copyrights.py
+++ b/src/cluecode/copyrights.py
@@ -3374,14 +3374,14 @@
 remove_man_comment_markers = re.compile(r'.\\"').sub
 
 
-def prepare_text_line(line, dedeb=True, to_ascii=True):
+def prepare_text_line(line, dedeb=True, to_ascii=False):
     """
     Prepare a text ``line`` for copyright detection.
 
     If ``dedeb`` is True, remove "Debian" <s> </s> markup tags seen in
     older copyright files.
 
-    If ``to_ascii`` convert the text to ASCiI characters.
+    If ``to_ascii`` convert the text to ASCII characters.
     """
     # remove some junk in man pages: \(co
     line = (line

and voilà!

$ echo " * Copyright (c) 1999 Kungliga Tekniska Högskolan
>  * (Royal Institute of Technology, Stockholm, Sweden). 
>  * All rights reserved." > baz
$ scancode -c --json-pp - baz
Setup plugins...
Collect file inventory...
Scan files for: copyrights with 1 process(es)...
[####################] 0             
{
  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "21.6.7",
      "options": {
        "input": [
          "baz"
        ],
        "--copyright": true,
        "--json-pp": "-"
      },
      "notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
      "start_timestamp": "2021-07-07T092932.468312",
      "end_timestamp": "2021-07-07T092932.580056",
      "duration": 0.11177682876586914,
      "message": null,
      "errors": [],
      "extra_data": {
        "files_count": 1
      }
    }
  ],
  "files": [
    {
      "path": "baz",
      "type": "file",
      "copyrights": [
        {
          "value": "Copyright (c) 1999 Kungliga Tekniska H\u00f6gskolan (Royal Institute of Technology, Stockholm, Sweden)",
          "start_line": 1,
          "end_line": 2
        }
      ],
      "holders": [
        {
          "value": "Kungliga Tekniska H\u00f6gskolan (Royal Institute of Technology, Stockholm, Sweden)",
          "start_line": 1,
          "end_line": 2
        }
      ],
      "authors": [],
      "scan_errors": []
    }
  ]
}
Scanning done.
Summary:        copyrights with 1 process(es)
Errors count:   0
Scan Speed:     9.14 files/sec. 
Initial counts: 1 resource(s): 1 file(s) and 0 directorie(s) 
Final counts:   1 resource(s): 1 file(s) and 0 directorie(s) 
Timings:
  scan_start: 2021-07-07T092932.468312
  scan_end:   2021-07-07T092932.580056
  scan: 0.11s
  total: 0.12s
Removing temporary files...done.


>>> print( "Copyright (c) 1999 Kungliga Tekniska H\u00f6gskolan (Royal Institute of Technology, Stockholm, Sweden)")
Copyright (c) 1999 Kungliga Tekniska Högskolan (Royal Institute of Technology, Stockholm, Sweden)
>>> 

@sschuberth
Collaborator

This is still an issue with ScanCode 31.2.1. The statement Copyright (C) 2011 Felix Geisendörfer appears as

  "copyrights": [
    {
      "copyright": "Copyright (c) 2011 Felix Geisendorfer",
      "start_line": 1,
      "end_line": 1
    }
  ],

in the result. This is problematic, as in the worst case it might even have legal implications, if people named "Felix Geisendörfer" and "Felix Geisendorfer" both exist.

@pombredanne
Copy link
Member

@sschuberth We still have not addressed this... I made some tests in the past in d35e308, but they induced too many issues to complete this.

An alternative could be an "oe" and "ae" transliteration for German... do you think this could work out? (I am not saying this would be simpler, btw.) ... but then, there are all the other languages.

@sschuberth
Collaborator

An alternative could be an "oe" and "ae" transliteration for German... do you think this could work out

No, I don't think so, because "oe" is not really equivalent to "ö" in the German language. It's just a work-around if you have to stick to ASCII characters, but strictly (like, legally) speaking, e.g. "Möller", "Moeller" and "Moller" are all different (and valid) family names.

@sschuberth
Collaborator

As a follow-up question: In

  "copyright": "Copyright (c) 2011 Felix Geisendorfer",
  "start_line": 1,
  "end_line": 1

is "copyright" always a full line match? Because if so, we could probably do some hacky post-processing that goes over all copyright findings and re-extracts the lines from the real files to get the original string.

@pombredanne
Member

re: #1566 (comment)

is "copyright" always a full line match? Because if so, we could probably do some hacky post-processing that goes over all copyright findings and re-extracts the lines from the real files to get the original string.

No, there are cases where a statement spans multiple lines, and many cases where what comes before and after a copyright statement is not part of the copyright at all. That being said, we could find a way to get back to the original unprocessed text, but the difficulty is that words and letters do not align one-for-one between the original text and its transliteration. For now, we track neither positions nor offsets, whether of characters or words.

The general approach is roughly as follows (a toy sketch follows the list):

  • transliterate and/or extract strings (for binaries)
  • collect lines of text from that
  • identify regions of lines that may contain copyright/authors
  • for each region tokenize text in words
  • lex tokens to recognize and tag token sequences (such as a name, date range, copyright sign, etc.)
  • parse token sequences with a grammar to recognize actual copyright/author statements
  • do various misc post detection cleanups
  • return the statements (and holders separately)
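
A toy end-to-end sketch of these stages on a single line (illustrative only; ScanCode's real lexer rules and grammar are far richer):

import re

LEX_RULES = [
    (re.compile(r"(?i)copyright|\(c\)"), "COPY"),
    (re.compile(r"\d{4}(-\d{4})?"), "YR-RANGE"),
    (re.compile(r"[A-Z][^\W\d_]+"), "NNP"),  # Unicode-aware proper name
]

def lex(tokens):
    # Tag each token with the label of the first fully matching rule.
    return [(tok, next((lbl for rx, lbl in LEX_RULES if rx.fullmatch(tok)), "JUNK"))
            for tok in tokens]

def parse(tagged):
    # Tiny "grammar": leading COPY token(s) plus a year range make a statement.
    labels = [lbl for _, lbl in tagged]
    if labels and labels[0] == "COPY" and "YR-RANGE" in labels:
        return " ".join(tok for tok, _ in tagged)

line = "Copyright (c) 2004-2007 Gerhard Häring"
print(parse(lex(line.split())))
# Copyright (c) 2004-2007 Gerhard Häring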

As I said, we never track positions or offsets in the original text (which could be a binary). This would be technically possible, but tracking these has a big overhead. We only track line numbers.
