Multi-threaded run changes line order #16

Open
solardiz opened this issue Sep 20, 2020 · 1 comment

solardiz commented Sep 20, 2020

Input file generation:

./john -w=all.lst -ru -stdo > all.lst-rules-with-dupes

all.lst is from https://download.openwall.net/pub/wordlists/all.gz (MD5 f7b3b76d15bbb95fcb267ea6be108cce), john is current bleeding-jumbo with its default john.conf. The resulting all.lst-rules-with-dupes is 173188126 lines, 2037345891 bytes (MD5 4c221f4df353aae89bdcd6888e92887a).

These commands produce the same unique lines, but in different order:

$ ./rling -t 1 all.lst-rules-with-dupes /dev/shm/t1
$ ./rling -t 2 all.lst-rules-with-dupes /dev/shm/t2
$ md5sum /dev/shm/t?
59b8b432957640387ba2b83d2583c792  /dev/shm/t1
625f25208a5ea41f4fb03fc51626c68b  /dev/shm/t2
$ wc -l /dev/shm/t?
 164074000 /dev/shm/t1
 164074000 /dev/shm/t2

t1 is the same as what JtR's unique program produces, t2 isn't.

Edit: more detail: t2 changes between command invocations. This is on Scientific Linux 6.10 (so old glibc, and I had to add -lrt for clock_gettime to be found). I tried with two gcc versions (system default gcc 4.4.7 and devtoolset-8 gcc 8.2.1), with the same behavior.
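The observation above (same unique lines, different order) can be checked without relying on checksums alone. The following is a minimal sketch in Python, assuming both outputs fit in memory; the function name `compare_outputs` and the sample data are hypothetical and not part of rling or JtR.

```python
from collections import Counter

def compare_outputs(lines_a, lines_b):
    """Classify two deduplicated outputs.

    Returns "identical" if the sequences match exactly, "reordered" if they
    hold the same lines with the same counts in a different order, and
    "different" otherwise.
    """
    if lines_a == lines_b:
        return "identical"
    if Counter(lines_a) == Counter(lines_b):
        return "reordered"
    return "different"

# Hypothetical stand-ins for /dev/shm/t1 and /dev/shm/t2.
t1 = ["alpha", "beta", "gamma"]
t2 = ["beta", "alpha", "gamma"]
print(compare_outputs(t1, t2))  # reordered
```

For files of the size reported here (164074000 lines), sorting both outputs on disk and comparing the sorted copies is likely to be more practical than holding everything in memory.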

Waffle2 (Collaborator) commented Sep 23, 2020

I have identified and replicated the issue. The core of the problem is that rling splits the file into large "chunks" and processes them on multiple cores at the same time. For example, in your test file, the word "svn7" appears at lines 62541, 43312836, 71731224, 71733302 and 71749022. Depending on the number of cores (threads) in use, a later occurrence of "svn7" may be processed before an earlier one. The file itself is not being re-ordered, of course; it is just that the duplicate that gets dropped is not necessarily a later one in the file. I was able to see this behaviour on several different systems, and in all cases the correct number of lines was output, all without duplication.
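The effect described above can be reproduced with a toy model. This is only a sketch, not rling's code: the function `dedupe_chunks` is hypothetical, and the `chunk_order` argument stands in for whatever order the threads happen to reach the shared "seen" set. Surviving lines are emitted in file order, so the only thing that changes is which copy of a duplicate survives.

```python
def dedupe_chunks(lines, chunk_size, chunk_order):
    """Deduplicate `lines`, visiting fixed-size chunks in `chunk_order`.

    The first copy of a line to be *processed* is kept, which is not
    necessarily the first copy in the file. Survivors are returned in
    file order.
    """
    seen = set()
    kept = []  # indices of surviving lines
    for c in chunk_order:
        start = c * chunk_size
        for i in range(start, min(start + chunk_size, len(lines))):
            if lines[i] not in seen:
                seen.add(lines[i])
                kept.append(i)
    return [lines[i] for i in sorted(kept)]

lines = ["svn7", "abc", "xyz", "svn7"]

in_order = dedupe_chunks(lines, 2, [0, 1])  # chunk 0 handled first
swapped = dedupe_chunks(lines, 2, [1, 0])   # chunk 1 handled first

print(in_order)  # ['svn7', 'abc', 'xyz']
print(swapped)   # ['abc', 'xyz', 'svn7']
```

Both results contain the same three unique lines, but in the second run the copy of "svn7" at the end of the file survives, which moves it in the output.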

All of that said, "first in file wins" is what the principle of least astonishment calls for, and there will be a change to the code to implement this (though I may offer a switch, as it is significantly faster to process the file as cores become available than to wait for a previous block to complete before starting the next one).
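One way to get "first in file wins" without forcing chunks to run in sequence is to remember the lowest line number seen for each distinct line. The sketch below illustrates that idea only; it is an assumption, not the change planned for rling (which, as described above, may instead wait for earlier blocks to complete). The function name `dedupe_first_wins` and the `chunk_order` parameter are hypothetical.

```python
def dedupe_first_wins(lines, chunk_size, chunk_order):
    """Deduplicate `lines` so the earliest copy in the file always survives,
    regardless of the order in which chunks are processed."""
    first_seen = {}  # line text -> lowest index at which it occurs
    for c in chunk_order:
        start = c * chunk_size
        for i in range(start, min(start + chunk_size, len(lines))):
            text = lines[i]
            if text not in first_seen or i < first_seen[text]:
                first_seen[text] = i
    # Emit survivors in file order.
    return [lines[i] for i in sorted(first_seen.values())]

lines = ["svn7", "abc", "xyz", "svn7"]

print(dedupe_first_wins(lines, 2, [0, 1]))  # ['svn7', 'abc', 'xyz']
print(dedupe_first_wins(lines, 2, [1, 0]))  # ['svn7', 'abc', 'xyz']
```

The output no longer depends on `chunk_order`. In a real multi-threaded program the update of `first_seen` would need to be atomic or lock-protected, which is part of the cost the optional switch mentioned above would let users avoid.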
