sort/rling sorting disagreement #30

Open
PenguinKeeper7 opened this issue Oct 31, 2021 · 8 comments

PenguinKeeper7 commented Oct 31, 2021

sort and rling appear to disagree on how to order empty lines. Any ideas? (Tested on Windows, with sort run through Git Bash and WSL.)

$ LC_ALL=C sort testFile.txt > testFile2.txt

./rling -2 "testFile2.txt" NUL
...
File "testFile2.txt" is not in sorted order at line 2
Line 1:
Line 2: 0x020x02☻

Test file:
https://anonfiles.com/ZfEcEfR5uf/testFile_txt

0xVavaldi commented Dec 3, 2021

#25 (comment)

@roycewilliams
Contributor

@PenguinKeeper7's example shows that he's setting LC_ALL=C for his run of sort.

@roycewilliams
Contributor

@hops could I impose upon you to look at this one briefly, as time allows? I'm not clear about what the root cause is.

flaggx1 commented Jan 6, 2022

This does appear to be a real issue; I've seen it on Linux with special characters. Here is an example in which only the second line contains a tab.

echo $'testing\ntesting\t' > file1
LC_ALL=C sort file1 > file1_sorted
rling -2 file1_sorted NUL

File "file1_sorted" is not in sorted order at line 2
Line 1: testing
Line 2: testing0x090x09
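
One possible reading of this output, not confirmed anywhere in this thread, is that the check compares lines through their terminator. The sketch below contrasts two hypothetical comparators (the names `line_len`, `cmp_content` and `cmp_with_newline` are illustrative and are not taken from the rling source). The first compares only the line contents in byte order, which is what one would expect from a sort in the C locale; the second lets the trailing `'\n'` take part. Because tab is 0x09 and newline is 0x0A, the two disagree on `testing` versus `testing\t`. Both assume newline-terminated input with no NUL bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of a line up to, but not including, '\n'. */
static size_t line_len(const char *s) {
    const char *nl = strchr(s, '\n');
    return nl ? (size_t)(nl - s) : strlen(s);
}

/* Byte-wise comparison of line contents only; a line that is a
   prefix of a longer line sorts first. */
int cmp_content(const char *a, const char *b) {
    size_t la = line_len(a), lb = line_len(b);
    size_t n = la < lb ? la : lb;
    int r = memcmp(a, b, n);
    if (r != 0)
        return r;
    return (la > lb) - (la < lb);
}

/* Byte-wise comparison that treats the terminating '\n' as data.
   Assumes both inputs end with '\n'. */
int cmp_with_newline(const char *a, const char *b) {
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;
    while (*p == *q && *p != '\n') {
        p++;
        q++;
    }
    return (int)*p - (int)*q;
}
```

With these definitions, `cmp_content("testing\n", "testing\t\n")` is negative while `cmp_with_newline("testing\n", "testing\t\n")` is positive, which would match the "not in sorted order" report above if the checker behaves like the second function.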

@0xVavaldi

Not a full fix per se, but this fixes the tab character. Let me know if other characters cause problems as well and we can look at fixing those too.

@0xVavaldi

int mystrcmp(const char *a, const char *b) {
  const unsigned char *s1 = (const unsigned char *) a;
  const unsigned char *s2 = (const unsigned char *) b;
  unsigned char c1, c2;

  do {
    c1 = *s1++;
    if (c1 < 10)      /* skip one byte below '\n' (0x0a), e.g. tab (0x09) */
      c1 = *s1++;
    c2 = *s2++;
    if (c2 < 10)      /* same single-byte skip on the second line */
      c2 = *s2++;
    if (c1 == '\n')   /* end of the first line */
      return c1 - c2;
  } while (c1 == c2);

  return c1 - c2;
}

This is a better fix, but the real issue is also that the sort function isn't sorting correctly.

echo $'testing\ntesting\x03\ntesting\x02' > file1
./rling file1 file1_rling
hexdump -c file1_rling

0000000   t   e   s   t   i   n   g  \n   t   e   s   t   i   n   g 003
0000010  \n   t   e   s   t   i   n   g 002  \n
000001a
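
For comparison, here is a minimal sketch of a plain byte-order sort over lines stored with explicit lengths, which would place `testing` first, then `testing\x02`, then `testing\x03`. The `line_t` struct and the names `line_cmp` and `sort_lines` are hypothetical and are not rling's code; only the standard `qsort` and `memcmp` are used. Keeping the length separate from the data means the terminator never takes part in the comparison.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical line record: pointer plus explicit length, so no
   terminator byte is ever compared and NUL bytes are allowed. */
typedef struct {
    const char *data;
    size_t len;
} line_t;

/* qsort callback: plain byte order, shorter prefix first. */
static int line_cmp(const void *pa, const void *pb) {
    const line_t *a = pa;
    const line_t *b = pb;
    size_t n = a->len < b->len ? a->len : b->len;
    int r = n ? memcmp(a->data, b->data, n) : 0;
    if (r != 0)
        return r;
    return (a->len > b->len) - (a->len < b->len);
}

/* Sort an array of line records in place. */
void sort_lines(line_t *lines, size_t count) {
    qsort(lines, count, sizeof *lines, line_cmp);
}
```

Calling `sort_lines` on the three lines from the example above reorders them so that the line ending in 0x02 comes before the line ending in 0x03.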

@0xVavaldi

I pushed a new fix a while ago but forgot to clarify that this PR should resolve this issue entirely.

@PenguinKeeper7
Author

The above PR helps in some situations but doesn't fix the issue entirely, so it is still very much open.

$ cat /dev/random | head -n 50000 > test.txt
$ LC_ALL=C sort test.txt -o test2.txt
$ ./rling -2 test2.txt test3.txt
Estimated memory required: 52,429,024 (50.00Mbytes)
Allocated in 0.0563 seconds
Start processing input "test2.txt"
File "test2.txt" is not in sorted order at line 205
Line 204: 0x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x00
Line 205: 0x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x000x00
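
The two lines reported here consist only of NUL bytes and differ only in length, so a check built on NUL-terminated string functions would not see their full contents. Below is a minimal sketch of a sortedness check over a raw buffer, assuming lines are separated by `'\n'` and compared in plain byte order. The function name `first_unsorted_line` is hypothetical, and this is not rling's implementation.

```c
#include <stddef.h>
#include <string.h>

/* Returns the 1-based number of the first line that sorts before its
   predecessor in byte order, or 0 if the whole buffer is sorted. */
size_t first_unsorted_line(const char *buf, size_t size) {
    const char *prev = NULL;
    size_t prev_len = 0, lineno = 0, pos = 0;

    while (pos < size) {
        const char *cur = buf + pos;
        /* memchr takes an explicit length, so NUL bytes are just data. */
        const char *nl = memchr(cur, '\n', size - pos);
        size_t len = nl ? (size_t)(nl - cur) : size - pos;
        lineno++;

        if (prev != NULL) {
            size_t n = prev_len < len ? prev_len : len;
            int r = n ? memcmp(prev, cur, n) : 0;
            /* Out of order if the bytes compare greater, or if the
               previous line is a longer line with the same prefix. */
            if (r > 0 || (r == 0 && prev_len > len))
                return lineno;
        }
        prev = cur;
        prev_len = len;
        pos += len + (nl ? 1 : 0);
    }
    return 0;
}
```

Under this comparison a shorter run of NUL bytes followed by a longer one counts as sorted, which is the ordering `LC_ALL=C sort` produced in the transcript above.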
