Multi-threaded run changes line order #16

solardiz · 2020-09-20T11:06:43Z

Input file generation:

./john -w=all.lst -ru -stdo > all.lst-rules-with-dupes

all.lst is from https://download.openwall.net/pub/wordlists/all.gz (MD5 f7b3b76d15bbb95fcb267ea6be108cce), john is current bleeding-jumbo with its default john.conf. The resulting all.lst-rules-with-dupes is 173188126 lines, 2037345891 bytes (MD5 4c221f4df353aae89bdcd6888e92887a).

These commands produce the same unique lines, but in different order:

./rling -t 1 all.lst-rules-with-dupes /dev/shm/t1
./rling -t 2 all.lst-rules-with-dupes /dev/shm/t2

$ md5sum /dev/shm/t?
59b8b432957640387ba2b83d2583c792  /dev/shm/t1
625f25208a5ea41f4fb03fc51626c68b  /dev/shm/t2
$ wc -l /dev/shm/t?
 164074000 /dev/shm/t1
 164074000 /dev/shm/t2

t1 is the same as what JtR's unique program produces, t2 isn't.

Edit: more detail: t2 changes between command invocations. This is on Scientific Linux 6.10 (so old glibc, and I had to add -lrt for clock_gettime to be found). I tried with two gcc versions (system default gcc 4.4.7 and devtoolset-8 gcc 8.2.1) - same behavior.


Waffle2 · 2020-09-23T01:42:07Z

I have identified and replicated the issue. The core of the problem is that rling splits the file into large "chunks" and processes them on multiple cores at the same time. For example, in your test file, the word "svn7" appears at lines 62541, 43312836, 71731224, 71733302 and 71749022. Depending on the number of cores (threads) in use, the later occurrences of "svn7" may be processed before the earlier ones. The file itself is, of course, not re-ordered; the only difference is which duplicates are dropped, and these are not necessarily the later ones in the file. I was able to see this behaviour on several different systems, and in all cases the correct number of lines was output, all without duplication.
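To make the effect concrete, here is a minimal Python sketch of the behaviour described above. It is not rling's code: the function name `dedupe_in_chunk_order`, the tiny chunk size and the sample lines are all hypothetical, and the "threads" are simulated by visiting chunks in a chosen order. It only shows that keeping whichever copy is seen first in processing order yields the same set of lines but a different output order.

```python
def dedupe_in_chunk_order(lines, chunk_size, chunk_order):
    """Keep one copy of each line; the copy kept is the first one *processed*.

    chunk_order simulates the order in which worker threads happen to
    pick up chunks. Surviving lines are emitted in original file order.
    """
    seen = set()
    kept_indexes = []
    for chunk in chunk_order:
        start = chunk * chunk_size
        for idx in range(start, min(start + chunk_size, len(lines))):
            if lines[idx] not in seen:
                seen.add(lines[idx])
                kept_indexes.append(idx)
    return [lines[i] for i in sorted(kept_indexes)]


# Hypothetical miniature input: "svn7" occurs three times.
lines = ["svn7", "a", "b", "svn7", "c", "svn7"]

in_order = dedupe_in_chunk_order(lines, 2, [0, 1, 2])
out_of_order = dedupe_in_chunk_order(lines, 2, [2, 1, 0])

print(in_order)      # the first "svn7" survives
print(out_of_order)  # the last "svn7" survives, so it moves to the end
```

Both results contain the same four unique lines, which matches the identical `wc -l` counts and differing MD5 sums reported above.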

All of that said, "first in file wins" is the principle of least astonishment, and there will be a change to the code to implement this (though I may offer a switch, as it is significantly faster to process the file as cores become available than to wait for a previous block to complete before starting the next one).
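One way to get "first in file wins" regardless of processing order is to remember the lowest line index seen for each line, rather than whichever copy arrived first. The Python sketch below illustrates that idea only; it is an assumption made for illustration, not the change planned for rling, and the name `dedupe_first_wins` and the sample data are hypothetical.

```python
def dedupe_first_wins(lines, chunk_size, chunk_order):
    """Keep the earliest occurrence of each line, whatever the chunk order."""
    first_index = {}  # line -> lowest index at which it has been seen
    for chunk in chunk_order:
        start = chunk * chunk_size
        for idx in range(start, min(start + chunk_size, len(lines))):
            line = lines[idx]
            if line not in first_index or idx < first_index[line]:
                first_index[line] = idx
    # Emit survivors in original file order.
    return [lines[i] for i in sorted(first_index.values())]


# Hypothetical miniature input: "svn7" occurs three times.
sample = ["svn7", "a", "b", "svn7", "c", "svn7"]

forward = dedupe_first_wins(sample, 2, [0, 1, 2])
backward = dedupe_first_wins(sample, 2, [2, 1, 0])
print(forward == backward)
```

In a real multi-threaded program the shared table would also need synchronisation, which is part of the speed trade-off mentioned above.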
