Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Setup #1

Merged
merged 8 commits into from
Feb 9, 2023
Merged

Setup #1

merged 8 commits into from
Feb 9, 2023

Conversation

katat
Copy link
Contributor

@katat katat commented Jan 14, 2023

Per https://github.com/zk-email-verify/zk-email-verify/issues/23, this PR contains the following:

Converted the original python compiler to JS

All the logic should be the same, but just in JS to be consistent across the projects and potential use cases within the browser.

Use circom_tester to cover the circuit test

Keep in mind it is for now using my version of circom_tester, which has a pending PR to improve the testing API.

CLI

The command line hasn't been implemented due to a roadblock, which is the regex the compiler supposes right now is not a real regex as per the code here.

So at the moment, to compile with wildcard, it has it substitute with the hardcoded variables. like this. So obviously, when it comes to the command line, the users would find it confusing to compile regex as they can't use the real regex text and they have to be aware of these hardcoded variables.

To get around this, I am currently evaluating regexp-tree to help further process regex texts into the same level as the hardcoded variables for alternative string for the matching, before feeding the processed regex to the nfa processing.

Circuit template

The original regex circuit has final_state_sum to record the total count of the matches. Based on my tests, it seems it is not working as expected. The final value output all the 1 values in the states array signal, which is equivalent to counting the matched text, instead of counting the matches.

In this PR, it also attempts to fix the counting issue at here


Please have a check and feel free to let me know your thoughts.

@Divide-By-0
Copy link
Member

Divide-By-0 commented Jan 14, 2023

Does this circuit compile? count += check_start[i].out; looks like super sus circom code, since the left side is a var and the right side is a signal.

final_state_sum just counts the number of occurrences of the specific state where it's 1 in the state, so you're right that it may be too big if the regex state covers multiple characters. I imagine one way to do this may not be a count variable, but perhaps changing the existing final_state_sum logic a bit. You can count the number of non-contiguous regex matches by adding 1 if the state at that index is a 1 (like right now) and also multiplying by whether it's a different state than the previous location [i.e. state[i][28] * (state[i][28] - state[i-1][28])] with the appropriate loop bounds.

@katat
Copy link
Contributor Author

katat commented Jan 14, 2023

Does this circuit compile?

try run the test by yarn test, then the circuit should be generated in the test folder. (For testing purpose, we should find a way to avoid output the test circuit in the actual folder)

@Divide-By-0
Copy link
Member

Divide-By-0 commented Jan 14, 2023

Your generated circuit also creates a matches variable not used anywhere, what is that for?

Ok I looked at the circom docs again, you're right that the count code is valid. Your count code fails when the regex starts at character 0 however, since your comparator check does not consider this case correctly!

@katat
Copy link
Contributor Author

katat commented Jan 14, 2023

Your generated circuit also creates a matches variable not used anywhere, what is that for?

That should be removed for now. It was meant for the restructure refactor.

Ok I looked at the circom docs again, you're right that the count code is valid. Your count code fails when the regex starts at character 0 however, since your comparator check does not consider this case correctly!

Thanks for pointing out. Yeah. I wasn't sure if the use of count var is valid actually. Good to know this is the case. I should update the test cases to improve the coverage.

@katat
Copy link
Contributor Author

katat commented Jan 15, 2023

Ok I looked at the circom docs again, you're right that the count code is valid. Your count code fails when the regex starts at character 0 however, since your comparator check does not consider this case correctly!

I just pushed the code to take care of this scenario.

Another issue I found is the regex compiler does not quite align with the regex gramma. For example, email was meant for @(${generator.word_char}+) should be different from email was meant for @${generator.word_char}+ as the latter won't reveal the account name because it doesn't contain a group by wrapping the character set with (). But the regex circuit generated using email was meant for @${generator.word_char}+ can still reveal the account name, which means it returns a false positive result. I reckon this issue can be reproduced in the zk-email-verify as well. (You can test the difference of these two regexs using https://regexr.com/)

So I guess it would be some reviews needed to make sure the dfa processor align with the regex gramma.

@Divide-By-0
Copy link
Member

Divide-By-0 commented Jan 20, 2023

That's possible, we intended for the code to just extract specific states that are usually contiguous substrings in the context of zk-email -- I can see how it might need a bit of massaging to make certain states the same or different! It would be a great contribution to try to highlight some of those differences and help plan to modify it to make it more accurate. Curious your thoughts here.

I think we are almost ready to merge here -- I just had a question about how to manage the old Python code. Perhaps it makes sense to keep it around in a folder in case anyone wants to use it, but I can also see it diverging over time as we optimize the JS and make it more feature rich. So I think it makes sense to merge this PR into master and continue to work on top of it, and just have a note with the original source of the Python code, what do you think?

@katat
Copy link
Contributor Author

katat commented Jan 20, 2023

I think python needs to be deprecated when we have the JS version. As I mentioned earlier, it would be better to have a CLI to support the regex compilation and fix the bugs in the current python compiler and in the JS version derived.

One thing I would like to make sure of is the maintenance of the codebase of the compiler. It is quite difficult to navigate the codebase from the lexical to the circuit generator. What I am currently trying to achieve is to make the code easier for maintenance in the long run and potentially avoid bugs that the compilers currently has.

I am now prototyping with the regexp-tree library, which seems quite mature in the whole workflow from parsing regex to dfa transition table generation, to generate a circuit for dfa transition based on a provided regex. Once we are able to correctly generate dfa circuit using the regexp-tree, then I think a lot of problems at the moment will be resolved by themselves, such as being better aligned with the regex grammar or providing more error messages related to regex compilation process, which is critical to generalize the regex compilation via CLI I think.

I should be able to deliver an usable CLI to compile regex in a week.

@Divide-By-0
Copy link
Member

Awesome! These both sound like great goals, I am very excited to see it. Let me know if there's any way I can support it.

@katat katat force-pushed the init branch 5 times, most recently from 0615ccf to 710a35a Compare January 25, 2023 07:17
@katat
Copy link
Contributor Author

katat commented Jan 25, 2023

For now, I have created a simple CLI and wrap with the regexp-tree library to have a simple check on the regex gramma and display proper error messages if a regex grammar is supported or not.

To use the CLI, now you can run the command yarn compile "abc (a|b|c)+", then it should generate a circom file at build/compiled.circom.

Currently, character class, such as abc [a-c]+ is not supported. When a regex with this character class is presented, the CLI will display an error message to notify this is yet to be supported, to avoid confusion when interacting with the CLI commands.

In the upcoming releases, the circuit generator could be updated to integrate with regexp-tree for generating circuits with an aim to improve the robustness, flexibility and maintenance of regex circuit compiler. (After some further researches, there are some uncertainties for the integration work, so I will create another issue to outline the plan)

Meanwhile, I have added a github action to run the tests automatically for PRs. I think we can change the PR settings such that only the ones passed the CI tests can be merged.

Please have a play around with the CLI, and let me know if any comments for this PR.

@katat katat force-pushed the init branch 2 times, most recently from f725599 to e81c6eb Compare February 5, 2023 09:40
@Divide-By-0 Divide-By-0 merged commit 7bdb1c6 into main Feb 9, 2023
ewynx referenced this pull request in hashcloak/noir-zk-regex Sep 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants