New regex patterns to search for crypto wallet seed phrases #558

lfcnassif · 2021-05-21T20:38:21Z

According to Marcelo Ruback, seed phrases usually use 12 to 24 words from a fixed 2048-word dictionary. We can register regexes for the dictionaries used by the most important wallet software to locate seed phrases. I think the number of false positives would not be high.

fmpfeifer · 2021-05-24T13:35:23Z

Those seed phrases also have a checksum, so it is possible to write a validator for the regex as well.

lfcnassif · 2021-05-24T13:42:39Z

Good, here is a reference: https://github.com/bitcoin/bips/blob/master/bip-0039/bip-0039-wordlists.md

fmpfeifer · 2021-05-24T17:15:29Z

I remember that I tried to implement that before as a simple regex, like this:
SEEDPHRASE, 0, 0, false = ((abandon|ability|able ... |zone|zoo)[ \t\n]?){12,24}

It doesn't work. Initializing RegexTask takes forever, and all the memory is consumed even before the processing itself starts.

lfcnassif · 2021-05-24T18:47:24Z

Hmm, interesting... Java's default Pattern class compiles that regex almost instantly and uses very little memory. But tests done years ago showed that Java Pattern is almost 50 times slower when matching, which is why I chose the dk.brics.automaton library over other alternatives...
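For readers who want to reproduce the compile-time observation, here is a minimal sketch of building the same regex shape with `java.util.regex.Pattern`. The class name `NaiveSeedRegexSketch` and the four-word list used when calling it are illustrative assumptions; the real dictionary has 2048 entries, and this is not the project's code.

```java
import java.util.List;
import java.util.regex.Pattern;

final class NaiveSeedRegexSketch {

    // Builds the shape quoted in the thread: ((w1|w2|...)[ \t\n]?){12,24}
    // Assumes the words contain only letters, so no regex escaping is applied.
    static Pattern compile(List<String> words) {
        String alternation = String.join("|", words);
        return Pattern.compile("((" + alternation + ")[ \\t\\n]?){12,24}");
    }
}
```

Calling `NaiveSeedRegexSketch.compile(words).matcher(text).find()` reports whether a run of at least 12 dictionary words occurs in `text`. Because the separator is optional in this shape, words glued together without whitespace also count.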

lfcnassif · 2021-05-24T18:48:42Z

The current implementation combines all configured regexes into a single large automaton for matching. Maybe a single automaton just for this regex would use fewer resources and less compile time...

fmpfeifer · 2021-05-24T18:56:33Z

I tried removing all other regexes from RegexConfig.txt. Same result (Initializing RegexTask is still running as I write this, 18 GB of memory and going up...).
Maybe using the Java Pattern for this one would work.

Back when I first tried to implement this, I thought that it would be necessary to write a specific task for this.
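One possible shape for such a dedicated task, shown only as a sketch: instead of a regex, tokenize the text on whitespace and look for runs of consecutive dictionary words. This is a swapped-in technique, not what the thread ended up doing, and the class and method names (`SeedRunScannerSketch`, `findRuns`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class SeedRunScannerSketch {

    // Returns every maximal run of consecutive dictionary words whose length
    // is between minWords and maxWords (inclusive), joined by single spaces.
    // Simplification: runs longer than maxWords are discarded entirely.
    static List<String> findRuns(String text, Set<String> dictionary, int minWords, int maxWords) {
        List<String> hits = new ArrayList<>();
        List<String> run = new ArrayList<>();
        for (String token : text.split("[ \\t\\r\\n]+")) {
            if (dictionary.contains(token)) {
                run.add(token);
            } else {
                flush(run, hits, minWords, maxWords);
            }
        }
        flush(run, hits, minWords, maxWords);
        return hits;
    }

    // Records the current run if its length is acceptable, then resets it.
    private static void flush(List<String> run, List<String> hits, int minWords, int maxWords) {
        if (run.size() >= minWords && run.size() <= maxWords) {
            hits.add(String.join(" ", run));
        }
        run.clear();
    }
}
```

A hash-set lookup per token keeps memory proportional to the dictionary size. The trade-off is that it only handles whitespace-delimited tokens and does not integrate with the existing regex configuration.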

lfcnassif · 2021-05-24T19:00:28Z

Hmm, sad... Just a related story: I made some tweaks to the iped-ahocorasick module in the past (used for carving and based on another automaton library) to use dense arrays instead of sparse pointers, which speeds up transitions but uses more memory...

The Lucene automaton package was based on the dk.brics.automaton library, so maybe Lucene has improved things...

lfcnassif · 2021-05-24T19:02:25Z

> Tried to remove all other regexes from RegexConfig.txt. Same result (Initializing RegexTask still running as I write this, 18 GB Mem and going up..)
> Maybe using the java pattern for this one would work.
>
> Back when I first tried to implement this, I thought that it would be necessary to write a specific task for this.

We could try to implement a specific task using java Pattern and check if running time is acceptable.

fmpfeifer · 2021-05-24T19:27:32Z

> > Tried to remove all other regexes from RegexConfig.txt. Same result (Initializing RegexTask still running as I write this, 18 GB Mem and going up..)
> > Maybe using the java pattern for this one would work.
> > Back when I first tried to implement this, I thought that it would be necessary to write a specific task for this.
>
> We could try to implement a specific task using java Pattern and check if running time is acceptable.

I think it is worth trying.

Just for the record, I let it run: it took 30 minutes and then triggered an OOM.

hauck-jvsh · 2021-05-25T22:40:52Z

Maybe we could test some of these libraries to see whether they can handle this regex and how fast they are at runtime.
https://tusker.org/regex/regex_benchmark.html

lfcnassif · 2021-05-25T23:38:53Z

Coincidentally, this is the same benchmark I saw years ago and used as a starting point to test a few of the libs it references, including dk.brics.automaton. But I no longer have the results and don't remember them, and I didn't test the regex from this issue at that time...

hauck-jvsh · 2021-05-28T19:18:16Z

I found another benchmark, which seems more recent: https://github.com/almondtools/regexbench

hauck-jvsh · 2021-05-30T20:26:25Z

I ran some tests and it looks like none of the DFA matchers can handle the regex. For this specific case, I suggest using an NFA matcher that implements a breadth-first search (BFS), as this regex will lead to a lot of backtracking (some references: https://kean.blog/post/regex-matcher http://www.amygdalum.net/en/efficient-regular-expressions-java.html). The Java standard Pattern class implements a depth-first search, so I think it will be much slower.

hauck-jvsh · 2021-05-31T20:37:38Z

I think I managed to create a new regex that compiles using the current module.
SEEDPHRASE, 0, 0, false = (abandon|ability|able ... |zone|zoo)( ([ \t\n]+) (abandon|ability|able ... |zone|zoo)){11,23}.
Please @fmpfeifer and @lfcnassif, take a look and see if I'm making some mistake or if it's suitable for the problem.

lfcnassif · 2021-05-31T23:50:03Z

Thanks @hauck-jvsh, I will test the memory usage tomorrow if @fmpfeifer does not beat me to it!

fmpfeifer · 2021-06-01T00:42:02Z

I can't test until next week. I'm in the middle of the Amazon rainforest right now.

lfcnassif · 2021-06-01T00:56:02Z

Wow good luck!

lfcnassif · 2021-06-01T16:24:23Z

Thanks @hauck-jvsh, your regex worked fine! Heap usage is good; it seems that using + instead of * or ? made a huge difference. I tried to add more separators, but then the heap usage and compilation time exploded. I will just add \r to your regex.
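The shape that worked (one word, then 11 to 23 repetitions of a mandatory separator run plus a word, with \r added to the separators) can be sketched with `java.util.regex.Pattern` just to make the structure concrete. This assumes the spaces around the separator group in the posted regex are formatting only; the project itself compiles the pattern with dk.brics.automaton, and `SeparatedSeedRegexSketch` is a hypothetical name.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SeparatedSeedRegexSketch {

    // word ( [ \t\r\n]+ word ){11,23}  ->  12 to 24 words, separator required
    // Assumes the words contain only letters, so no regex escaping is applied.
    static Pattern compile(List<String> words) {
        String word = "(?:" + String.join("|", words) + ")";
        return Pattern.compile(word + "(?:[ \\t\\r\\n]+" + word + "){11,23}");
    }

    // Returns the first matching phrase, or null when the text has none.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }
}
```

Unlike the optional-separator version, this shape rejects words glued together without whitespace, and a run longer than 24 words yields a match covering only its first 24 words.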

hauck-jvsh · 2021-06-01T17:07:12Z

I think that requiring at least one character from a small separator set between consecutive words leads to a huge optimization in the final automaton, so * and ? prevent that optimization.

lfcnassif · 2021-06-01T17:45:33Z

You're right. I did some more tests, including seed phrases in Portuguese (PT) too:

without the new regexes, RegexTask takes 2s to initialize and uses ~24MB of heap
with just EN regex, RegexTask takes 21s to initialize and uses ~80MB of heap
with both EN & PT regexes, RegexTask takes 47s to initialize and uses ~143MB of heap

I will add both regexes to most profiles, just the EN version to the triage profile, and none to the fastmode profile, mainly because of initialization time.

christoph2806 · 2023-05-11T09:46:38Z

> Those seed phrases also have checksum, so it is possible to write a validator for the regex as well.

This is not correct: any combination of words from the dictionary is a valid seed phrase.

lfcnassif · 2023-05-11T23:01:38Z

> this is not correct, any combination of words from the dictionary is a valid seed phrase.

Yes, we used a statistical approach to try to filter out false positives (e.g. when the same word is repeated several times). Of course that has a small chance of ignoring true seeds.
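The thread does not show the actual heuristic, so the following is only a guess at what a repetition-based filter could look like: reject a candidate when too few of its words are distinct. The class name, method name, and threshold parameter are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SeedRepetitionFilterSketch {

    // Accepts a candidate only if the share of distinct words reaches
    // minDistinctRatio (for example 0.75). The threshold is an assumption.
    static boolean looksPlausible(String[] words, double minDistinctRatio) {
        if (words.length == 0) {
            return false;
        }
        Set<String> distinct = new HashSet<>(Arrays.asList(words));
        return (double) distinct.size() / words.length >= minDistinctRatio;
    }
}
```

As noted above, a filter like this can discard a genuine seed, since a randomly generated phrase may legitimately repeat a word.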

fmpfeifer · 2023-05-25T14:11:24Z

> > Those seed phrases also have checksum, so it is possible to write a validator for the regex as well.
>
> this is not correct, any combination of words from the dictionary is a valid seed phrase.

At least for BIP39 seed phrases, there is a checksum. See "How a Seed Phrase is Created".
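To make the BIP39 checksum concrete, here is a minimal sketch of a validator. In BIP39 each word encodes 11 bits, and the last ENT/32 bits of the phrase are the leading bits of the SHA-256 hash of the ENT entropy bits. The sketch assumes the caller has already mapped each word to its index (0 to 2047) in the word list; `Bip39ChecksumSketch` is a hypothetical name and this is not the validator added to the project.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Bip39ChecksumSketch {

    // wordIndexes: position (0..2047) of each word in the BIP39 word list.
    static boolean hasValidChecksum(int[] wordIndexes) {
        int n = wordIndexes.length;
        // BIP39 phrases have 12, 15, 18, 21 or 24 words.
        if (n < 12 || n > 24 || n % 3 != 0) {
            return false;
        }
        int totalBits = n * 11;
        int checksumBits = totalBits / 33; // CS = ENT / 32 and ENT + CS = totalBits
        int entropyBits = totalBits - checksumBits;

        // Unpack every index into 11 bits, most significant bit first.
        boolean[] bits = new boolean[totalBits];
        for (int i = 0; i < n; i++) {
            int index = wordIndexes[i];
            if (index < 0 || index > 2047) {
                return false;
            }
            for (int b = 0; b < 11; b++) {
                bits[i * 11 + b] = ((index >> (10 - b)) & 1) == 1;
            }
        }

        // Repack the entropy bits into bytes.
        byte[] entropy = new byte[entropyBits / 8];
        for (int i = 0; i < entropyBits; i++) {
            if (bits[i]) {
                entropy[i / 8] |= (byte) (0x80 >> (i % 8));
            }
        }

        byte[] hash;
        try {
            hash = MessageDigest.getInstance("SHA-256").digest(entropy);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }

        // Compare the trailing bits of the phrase with the leading hash bits.
        for (int i = 0; i < checksumBits; i++) {
            boolean expected = ((hash[i / 8] >> (7 - (i % 8))) & 1) == 1;
            if (bits[entropyBits + i] != expected) {
                return false;
            }
        }
        return true;
    }
}
```

For example, the all-zero-entropy English phrase (eleven times "abandon" followed by "about", that is, index 0 eleven times and then index 3) passes, while twelve times "abandon" fails. A 12-word phrase carries only 4 checksum bits, so roughly 1 in 16 random 12-word combinations still passes, and the validator reduces false positives without eliminating them.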

lfcnassif added the enhancement label May 21, 2021

lfcnassif self-assigned this Jun 1, 2021

lfcnassif closed this as completed in c4789f1 Jun 1, 2021

This was referenced Jul 1, 2021

Refactor module configuration #538

Closed

Optimize RegexTask initialization #643

Closed

lfcnassif mentioned this issue Sep 10, 2021

Validation for crypto wallets seed phrases #752

Closed

lfcnassif added a commit that referenced this issue Oct 13, 2021

cherry pick #558 (just the EN regex) and #752 to 3.18.x branch

ec7ec15

andrewschreiber mentioned this issue Jan 23, 2023

Detecting private keys and seed phrases for cryptocurrency wallets gitleaks/gitleaks#1082

Open
