New regex patterns to search for crypto wallet seed phrases #558
Comments
Those seed phrases also have a checksum, so it is possible to write a validator for the regex hits as well. |
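To make the checksum idea concrete, here is a minimal Python sketch of a BIP39 checksum check. It is not IPED code: the function names `bip39_checksum_ok` and `validate_phrase` are made up for illustration, and the caller is assumed to supply the 2048-word list for the language in question.

```python
import hashlib

def bip39_checksum_ok(indices):
    """Check the BIP39 checksum for a list of 11-bit word indices (0..2047)."""
    n = len(indices)
    if n not in (12, 15, 18, 21, 24):
        return False
    if any(not 0 <= i < 2048 for i in indices):
        return False
    # Concatenate the 11-bit indices into one big integer.
    bits = 0
    for i in indices:
        bits = (bits << 11) | i
    total_bits = n * 11
    cs_len = total_bits // 33           # checksum bits: 4 for 12 words, 8 for 24
    ent_len = total_bits - cs_len       # entropy bits: 128 for 12 words, 256 for 24
    entropy = (bits >> cs_len).to_bytes(ent_len // 8, "big")
    checksum = bits & ((1 << cs_len) - 1)
    # The checksum is the first cs_len bits of SHA-256(entropy).
    expected = hashlib.sha256(entropy).digest()[0] >> (8 - cs_len)
    return checksum == expected

def validate_phrase(phrase, wordlist):
    """Hypothetical helper: map words to indices, then check the checksum."""
    index = {w: i for i, w in enumerate(wordlist)}
    words = phrase.split()
    if any(w not in index for w in words):
        return False
    return bip39_checksum_ok([index[w] for w in words])
```

For a 12-word phrase the checksum is only 4 bits, so roughly 1 in 16 random word sequences would still pass. A check like this reduces false positives but does not eliminate them.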
Good. For reference: https://github.com/bitcoin/bips/blob/master/bip-0039/bip-0039-wordlists.md |
Hmm, interesting... Java's default Pattern class compiles that regex almost instantly and uses very little memory. But tests done years ago showed that java Pattern is almost 50 times slower when matching, which is why I chose the dk.brics.automaton library over the alternatives... |
The current implementation combines all configured regexes into a single large automaton for matching. Maybe a separate automaton just for this regex would use fewer resources and less compile time... |
I tried removing all other regexes from RegexConfig.txt. Same result (RegexTask initialization is still running as I write this, at 18 GB of memory and going up...). Back when I first tried to implement this, I thought it would be necessary to write a specific task for it. |
Hmm, sad... Just some related history: in the past I made some tweaks to the iped-ahocorasick module (used for carving and based on another automaton library) to use dense arrays instead of sparse pointers. That sped up transitions at the cost of more memory... Lucene's automaton package was based on the dk.brics.automaton library, so maybe Lucene has improved things... |
We could try to implement a specific task using java Pattern and check if running time is acceptable. |
Maybe test some of these libraries to see whether they can handle this regex and how fast they run. |
Coincidentally, this is the same benchmark I saw years ago and used as a starting point to test a few of the libs it references, including dk.brics.automaton. But I no longer have the results and don't remember them, and I didn't test the regex from this issue at that time... |
I found another benchmark; it seems more recent. https://github.com/almondtools/regexbench |
I ran some tests and it looks like none of the DFA matchers can handle the regex. For this specific case, I suggest using an NFA matcher that implements a breadth-first search (BFS), as this regex will lead to a lot of backtracking (some references: https://kean.blog/post/regex-matcher http://www.amygdalum.net/en/efficient-regular-expressions-java.html). Java's standard Pattern class implements a depth-first search, so I think it will be much slower. |
I think I managed to create a new regex that compiles with the current module. |
Thanks @hauck-jvsh, I will test the memory usage tomorrow if @fmpfeifer does not beat me to it! |
I can't test until next week. I'm in the middle of the Amazon rainforest right now. |
Wow good luck! |
Thanks @hauck-jvsh, your regex worked fine! Heap usage is good; it seems that using + instead of * or ? made a huge difference. I tried to add more separators, but then the heap usage and compilation time exploded. I will just add \r to your regex. |
I think that requiring at least one character from a small set of separators between the words allows a huge optimization in the final automaton, while * and ? prevent that optimization. |
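For readers following along, here is a rough Python sketch of the shape of regex being discussed: a word alternation repeated 12 to 24 times, with one or more separator characters required between words (the + quantifier). The actual regex from this thread is not shown here, so the helper name `build_seed_regex`, the separator set, and the tiny word list are assumptions for illustration only. Python's `re` is also a backtracking engine, so this says nothing about dk.brics.automaton memory usage.

```python
import re

def build_seed_regex(words, min_words=12, max_words=24):
    """Build a pattern matching min_words..max_words dictionary words in a row."""
    # Longer words first, so a backtracking engine tries the full word early.
    alternation = "|".join(
        re.escape(w) for w in sorted(words, key=len, reverse=True)
    )
    word = f"(?:{alternation})"
    # '+' makes at least one separator mandatory between consecutive words.
    sep = r"[ \t\r\n]+"
    return re.compile(
        rf"\b{word}(?:{sep}{word}){{{min_words - 1},{max_words - 1}}}\b"
    )

# Usage with a toy 4-word dictionary (a real one would have 2048 entries).
rx = build_seed_regex(["abandon", "ability", "able", "about"])
hit = rx.search("junk " + " ".join(["abandon"] * 11 + ["about"]) + " junk")
```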
You're right. I did some more tests including seed phrases for PT language too:
I will add both regexes to most profiles, just the EN version to the triage profile, and none to the fastmode profile, mainly because of initialization time. |
This is not correct: any combination of words from the dictionary is a valid seed phrase. |
Yes, we used a statistical approach to try to filter out false positives (e.g. when the same word is repeated several times). Of course, that has a small chance of discarding true seeds. |
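The exact statistical filter is not described in this thread. As a minimal sketch of the general idea only, a repeated-word heuristic could look like the following, where the function name `plausible_seed` and the `max_repeats` threshold are hypothetical choices rather than what was actually implemented.

```python
from collections import Counter

def plausible_seed(words, max_repeats=2):
    """Reject candidates in which any single word occurs too many times."""
    if not words:
        return False
    # Count of the most frequent word in the candidate phrase.
    most_common_count = Counter(words).most_common(1)[0][1]
    return most_common_count <= max_repeats
```

A heuristic like this trades recall for precision. For example, the phrase made of "abandon" eleven times followed by "about" has a valid BIP39 checksum, yet it would be rejected here.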
At least for BIP39 seed phrases, there is a checksum. See How a Seed Phrase is Created |
According to Marcelo Ruback, seed phrases usually use 12 to 24 words from a fixed 2048-word dictionary. We can register regexes for the dictionaries used by the most important wallet software to locate seed phrases. I think the number of false positives would not be high.