Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New regex patterns to search for cripto wallets seed phrases #558

Closed
lfcnassif opened this issue May 21, 2021 · 23 comments
Closed

New regex patterns to search for cripto wallets seed phrases #558

lfcnassif opened this issue May 21, 2021 · 23 comments
Assignees

Comments

@lfcnassif
Copy link
Member

lfcnassif commented May 21, 2021

According to Marcelo Ruback, usually seed phrases use from 12 to 24 words from a 2048 words fixed dictionary. We can register regexes for the dictionaries used by the most important wallet softwares to locate seed phrases. I think the number of false positives would not be high.

@fmpfeifer
Copy link
Member

Those seed phrases also have checksum, so it is possible to write a validator for the regex as well.

@lfcnassif
Copy link
Member Author

@fmpfeifer
Copy link
Member

I'm remember I have tried to implement that before as a simple regex, like that:
SEEDPHRASE, 0, 0, false = ((abandon|ability|able ... |zone|zoo)[ \t\n]?){12,24}

It doesn't work. The Initializing RegextTask takes forever and the memory is all eaten even before the processing itself starts.

image
image

@lfcnassif
Copy link
Member Author

Hum interesting... Java default Pattern class compiles that regex almost instantly and uses very few memory. But past tests done years ago have shown java Pattern is almost 50 times slower when matching, that's the reason I have chosen dk.brics.automaton library over other alternatives...

@lfcnassif
Copy link
Member Author

lfcnassif commented May 24, 2021

Current implementation combines all configured regexes in a single large automaton for matching. Maybe a single automaton just for this regex would use less resources and compile time...

@fmpfeifer
Copy link
Member

Tried to remove all other regexes from RegexConfig.txt. Same result (Initializing RegexTask still running as I write this, 18 GB Mem and going up..)
Maybe using the java pattern for this one would work.

Back when I first tried to implement this, I thought that it would be necessary to write a specific task for this.

@lfcnassif
Copy link
Member Author

Hum sad... Just a related history, I made some tweaks in iped-ahocorasick module in the past (used for carving and based on another automaton library) to use dense arrays instead of sparse pointers to speed up transitions, but using more memory...

Lucene automaton package was based on dk.brics.automaton library, maybe Lucene could have enhanced things...

@lfcnassif
Copy link
Member Author

Tried to remove all other regexes from RegexConfig.txt. Same result (Initializing RegexTask still running as I write this, 18 GB Mem and going up..)
Maybe using the java pattern for this one would work.

Back when I first tried to implement this, I thought that it would be necessary to write a specific task for this.

We could try to implement a specific task using java Pattern and check if running time is acceptable.

@fmpfeifer
Copy link
Member

Tried to remove all other regexes from RegexConfig.txt. Same result (Initializing RegexTask still running as I write this, 18 GB Mem and going up..)
Maybe using the java pattern for this one would work.
Back when I first tried to implement this, I thought that it would be necessary to write a specific task for this.

We could try to implement a specific task using java Pattern and check if running time is acceptable.

I think it is worth to try.

Just for record, I let it go, it took 30 minutes and triggered an OOM.
image

@hauck-jvsh
Copy link
Member

hauck-jvsh commented May 25, 2021

Maybe test some of these libraries to see if they can handle this regex and the runtime speed.
https://tusker.org/regex/regex_benchmark.html

@lfcnassif
Copy link
Member Author

Coincidentally, this is the same benchmark I saw years ago and used as a starting point to test some few libs referenced by it, including dk.brics.automaton. But I don't have and don't remember the results anymore, and I didn't test the regex of this issue that time...

@hauck-jvsh
Copy link
Member

I found another beachmark, it seems more recent. https://github.com/almondtools/regexbench

@hauck-jvsh
Copy link
Member

I made some test and it looks like that none DFA-Matchers can handle the regex. For this specific case, I suggest using an NFA-Matcher that implements a Breadth-first search (BFS), as this regex will lead to lot o back-track (Some references https://kean.blog/post/regex-matcher http://www.amygdalum.net/en/efficient-regular-expressions-java.html). Java standard pattern class implements a Deep-first-search, so I think it will be much slower.

@hauck-jvsh
Copy link
Member

hauck-jvsh commented May 31, 2021

I think that I manage to create a new regex that compiles using the current module.
SEEDPHRASE, 0, 0, false = (abandon|ability|able ... |zone|zoo)( ([ \t\n]+) (abandon|ability|able ... |zone|zoo)){11,23}.
Please @fmpfeifer and @lfcnassif take a look and see if I'm making some mistake or if its suitable for the problem.

@lfcnassif
Copy link
Member Author

Thanks @hauck-jvsh will test the memory usage tomorrow if @fmpfeifer does not beat me!

@fmpfeifer
Copy link
Member

I can't test until next week. I'm in the middle of the Amazon rainforest right now

@lfcnassif
Copy link
Member Author

Wow good luck!

@lfcnassif lfcnassif self-assigned this Jun 1, 2021
@lfcnassif
Copy link
Member Author

lfcnassif commented Jun 1, 2021

Thanks @hauck-jvsh, your regex worked fine! Heap usage is good, seems the + instead of * or ? made a huge difference. I tried to add more separators, but then the heap usage and compilation time exploded. Will just add \r to your regex.

@hauck-jvsh
Copy link
Member

hauck-jvsh commented Jun 1, 2021

I think that the obrigation of a small subset between the series of words leads to a huge optimization in the final automaton, so the * and the ? prevents the optimization.

@lfcnassif
Copy link
Member Author

lfcnassif commented Jun 1, 2021

You're right. I did some more tests including seed phrases for PT language too:

  • without the new regexes, RegexTask takes 2s to initialize and uses ~24MB of heap
  • with just EN regex, RegexTask takes 21s to initialize and uses ~80MB of heap
  • with both EN & PT regexes, RegexTask takes 47s to initialize and uses ~143MB of heap

I will add both regexes to most profiles, just EN version to triage and none to fastmode profile, mainly because of initialization time.

@christoph2806
Copy link

Those seed phrases also have checksum, so it is possible to write a validator for the regex as well.

this is not correct, any combination of words from the dictionary is a valid seed phrase.

@lfcnassif
Copy link
Member Author

this is not correct, any combination of words from the dictionary is a valid seed phrase.

Yes, we used a statistical approach to try to filter out false positives (e.g. when the same word is repeated several times). Of course that has a small chance of ignoring true seeds.

@fmpfeifer
Copy link
Member

Those seed phrases also have checksum, so it is possible to write a validator for the regex as well.

this is not correct, any combination of words from the dictionary is a valid seed phrase.

At least for BIP39 seed phrases, there is a checksum. See How a Seed Phrase is Created

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants