Umlauts in copyrights are removed from output files #1566
@MMarwedel this is indeed the case. What happens is in fact called transliteration, that is, converting Unicode characters to a plain ASCII form. Its effect is to remove umlauts and all other punctuation. I agree this is not perfect, and I am not sure what the rationale was when the decision was made. Since we are porting to Python 3, which uses Unicode by default, it may no longer be an issue in the near future. So I will keep this open until we have completed the port so we can revisit how to fix this then. Would this work for you? |
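For readers unfamiliar with the effect, here is a minimal sketch of ASCII transliteration using only the Python standard library. It is an illustration, not ScanCode's actual code path: the `to_ascii` helper is a hypothetical name, and the NFKD-plus-ASCII-encode approach is an assumption chosen because it reproduces the output reported in this issue.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Fold text to plain ASCII (hypothetical helper, not ScanCode's API)."""
    # NFKD splits "ö" into "o" followed by a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" then silently drops every non-ASCII code
    # point, including the combining mark, so only the base letter survives.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Kungliga Tekniska Högskolan"))
print(to_ascii("Felix Geisendörfer"))
```

The folding is one-way: once the combining mark is dropped, nothing in the ASCII string says whether the original letter was "o" or "ö".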
Yes, this would work for me. As I am doing some postprocessing of the results anyway, I can fix the umlauts there for the few cases I found. |
If you are doing some post processing it would be best to handle this in here if possible... everyone could then benefit? |
Hmm... while I like to share code, I guess the post processing may not be in a state that is easily usable for other projects. And my employer would have to agree too. |
your call :) |
Now that the port to Python 3 has happened, what are the chances of this issue getting a look? I had a look into the current logic and it seems that the parse tree breaks when using the non-ASCII string version of a line (achieved by e.g. setting the E.g. the tree for the line
changes from (with
to (with
I could not find out exactly what is breaking here internally, but it's a lead at least. A workaround could be to carry the original UTF-8 string along through the process and to allow (maybe via a flag?) writing that original string to the result.json instead of the prepared ASCII string. |
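The workaround suggested above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Line` record, the `prepare` and `detect` functions, the `keep_original` flag, and the toy regular expression that stands in for the real grammar-based detector. It only shows the idea of running detection on the ASCII form while keeping the original text available for output.

```python
import re
import unicodedata
from dataclasses import dataclass


@dataclass
class Line:
    number: int    # 1-based line number in the scanned file
    original: str  # text as read from the file
    prepared: str  # ASCII form handed to the detector


def to_ascii(text: str) -> str:
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def prepare(lines):
    """Pair every original line with its ASCII transliteration."""
    return [Line(n, text, to_ascii(text)) for n, text in enumerate(lines, 1)]


# Toy stand-in for the real detector: it only looks for "Copyright (c)".
COPYRIGHT = re.compile(r"copyright\s+\(c\)", re.IGNORECASE)


def detect(lines, keep_original=False):
    """Match on the prepared text, report either form depending on the flag."""
    results = []
    for line in lines:
        if COPYRIGHT.search(line.prepared):
            value = line.original if keep_original else line.prepared
            results.append({
                "value": value.strip(),
                "start_line": line.number,
                "end_line": line.number,
            })
    return results


source = ["Copyright (C) 2011 Felix Geisendörfer", "All rights reserved."]
print(detect(prepare(source), keep_original=True))
```

This sketch sidesteps the hard part by reporting whole lines. A real detection returns only part of a line, or spans several lines, so the original and prepared strings would have to be aligned more finely than this.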
With the latest switch from NLTK to Pygmars https://github.com/nexB/pygmars/ we now have more opportunities to fix lexing and support Unicode all the way.
So I made a test with these fixes:
and:
and voila!
|
This is still an issue with ScanCode 31.2.1. The Copyright (C) 2011 Felix Geisendörfer appears as
in the result. This is problematic as, in the worst case, it might even have legal implications if both a "Felix Geisendörfer" and a "Felix Geisendorfer" exist. |
@sschuberth We still have not addressed this ... I have made some tests in the past in d35e308 but there were too many induced issues to complete this. An alternative could be an "oe" and "ae" transliteration for German... do you think this could work out (I am not saying this could be simpler, btw)? ... but then, there are all the other languages. |
No, I don't think so, because "oe" is not really equivalent to "ö" in the German language. It's just a "work-around" if you have to stick to ASCII characters, but strictly (like, legally) speaking e.g. "Möller", "Moeller" and "Moller" are all different (and valid) family names. |
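The collision described above is easy to reproduce. The sketch below uses a hypothetical `GERMAN_MAP` table and `german_ascii` helper (neither exists in ScanCode) to apply the conventional "ae"/"oe"/"ue" spelling and then groups names by their folded form.

```python
# Conventional German ASCII spellings; the table itself is an illustration.
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def german_ascii(name: str) -> str:
    """Replace umlauts and sharp s with their two-letter ASCII spellings."""
    return name.translate(GERMAN_MAP)


# Group three distinct family names by their folded form.
folded = {}
for name in ["Möller", "Moeller", "Moller"]:
    folded.setdefault(german_ascii(name), []).append(name)

print(folded)
```

"Möller" and "Moeller" end up under the same key, so this mapping only trades one ambiguity for another. It would also be wrong for other languages that use the same letters.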
As a follow-up question: In
is "copyright" always a full line match? Because if so, we could probably do some hacky post-processing that goes over all copyright findings and re-extracts the lines from the real files to get the original string. |
re: #1566 (comment)
No, there are cases where a statement spans multiple lines, and many cases where what comes before and after a copyright statement is not part of the copyright at all. That being said, we could find a way to get back to the original unprocessed text, but the difficulty is that words and letters do not align one for one between the original text and its transliteration. For now we track neither positions nor offsets, be it of characters or words. The general approach is roughly:
As I said, we never track the position or offsets in the original text (which could be a binary). This would be technically possible, but there is a big overhead to track these. We only track line numbers |
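To make the alignment problem concrete, here is a minimal sketch of what per-character offset tracking could look like in a post-processing step that has only the detected value and its line numbers. The function names `to_ascii_with_offsets` and `recover_original` are hypothetical, and the transliteration is the standard-library NFKD folding, which is an assumption and may differ from what ScanCode does.

```python
import unicodedata


def to_ascii_with_offsets(text):
    """Return (ascii_text, offsets); offsets[i] is the index in `text` of the
    original character that produced ascii_text[i]."""
    ascii_chars, offsets = [], []
    for index, char in enumerate(text):
        for piece in unicodedata.normalize("NFKD", char):
            if ord(piece) < 128:  # keep ASCII pieces, drop combining marks
                ascii_chars.append(piece)
                offsets.append(index)
    return "".join(ascii_chars), offsets


def recover_original(source_lines, value, start_line, end_line):
    """Find `value` in the reported line range, return the original span."""
    if not value:
        return None
    window = "\n".join(source_lines[start_line - 1:end_line])
    ascii_text, offsets = to_ascii_with_offsets(window)
    found = ascii_text.find(value)
    if found == -1:
        return None  # the detector normalized more than just the characters
    first = offsets[found]
    last = offsets[found + len(value) - 1]
    # Keep combining marks that directly follow the last matched character.
    while last + 1 < len(window) and unicodedata.combining(window[last + 1]):
        last += 1
    return window[first:last + 1]


lines = ["/*", " * Copyright (c) 1999 Kungliga Tekniska Högskolan", " */"]
print(recover_original(
    lines, "Copyright (c) 1999 Kungliga Tekniska Hogskolan", 2, 2))
```

This only succeeds when the detected value appears verbatim in the transliterated window. A statement that spans several lines, like the `strptime.c` example in this issue, has comment markers and line breaks removed during detection, so the plain `find` fails and the function returns `None`. Handling that would need the same offset bookkeeping through every normalization step, which is the overhead mentioned above.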
Hi,
when scanning files that contain umlauts, the umlauts are converted to their plain, non-umlaut letters. It would be better to keep them in their original form.
Sample file:
https://chromium.googlesource.com/native_client/nacl-newlib/+/master/newlib/libc/time/strptime.c
Output:
"holders": [
  {
    "value": "Kungliga Tekniska Hogskolan (Royal Institute of Technology, Stockholm, Sweden).",
    "start_line": 2,
    "end_line": 4
  }
],
"copyrights": [
  {
    "value": "Copyright (c) 1999 Kungliga Tekniska Hogskolan (Royal Institute of Technology, Stockholm, Sweden).",
    "start_line": 2,
    "end_line": 4
  }
],
The right output would be ... Högskolan ...