Handle embedded quote in mmcif #619

Merged on Jul 14, 2024 (5 commits)

@0ut0fcontrol 0ut0fcontrol commented Jun 30, 2024

fix #570

Use three regex patterns to match the fields within a single line, so that embedded quotes in mmCIF files are handled:

  single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)"
  double_quote_pattern = r'("(?:"(?! )|[^"])*")(?:\s|$)'
  unquoted_pattern = r"([^\s]+)"
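
As a rough illustration of how the three patterns could work together, the sketch below joins them into one alternation and walks over a line with `re.finditer`. The helper name `split_fields` and the sample line are hypothetical and not taken from the Biotite source; the actual implementation in `cif.py` may differ.

```python
import re

# The three patterns from the PR description
single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)"
double_quote_pattern = r'("(?:"(?! )|[^"])*")(?:\s|$)'
unquoted_pattern = r"([^\s]+)"

# Try the quoted patterns first and fall back to the unquoted one
combined_pattern = re.compile(
    "|".join([single_quote_pattern, double_quote_pattern, unquoted_pattern])
)


def split_fields(line):
    """Hypothetical helper: split one line into its fields."""
    fields = []
    for match in combined_pattern.finditer(line):
        # Exactly one capturing group takes part in each match;
        # `lastindex` tells which one (1 = single, 2 = double, 3 = unquoted)
        field = match.group(match.lastindex)
        if match.lastindex in (1, 2):
            # Strip the enclosing quotes, keep embedded ones
            field = field[1:-1]
        fields.append(field)
    return fields


print(split_fields("1 polymer 'N-terminal 5'-phosphate' \"a b\" ?"))
# ['1', 'polymer', "N-terminal 5'-phosphate", 'a b', '?']
```

The quote in `5'-phosphate` is followed by `-` rather than a space, so it is kept as part of the field instead of closing it.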

GPT-4's explanation of single_quote_pattern:

This regex single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)" is engineered to identify and extract substrings enclosed in single quotes from a larger text, with a particular sensitivity to handle internal apostrophes correctly. Let's dissect this expression to understand how it functions:

  1. ': This matches the opening single quote ' of the target substring.

  2. (?: ... ): This is a non-capturing group, which means it groups the contained pattern parts without storing the matched substring. This is used here mainly for grouping purposes without needing backreferences.

  3. '(?! ): This matches a single quote ' only if it is not immediately followed by a space; the (?! ) part is a negative lookahead assertion. This allows the regex to match apostrophes within words (like in contractions such as don't) without treating them as the end of the quoted substring. Note that the lookahead checks for a literal space only, whereas the terminator in step 8 accepts any whitespace character.

  4. |: The logical OR operator presents an alternative within the non-capturing group. It separates the negative lookahead for internal apostrophes from the next part of the pattern.

  5. [^']: This is a negated character class that matches any character except a single quote '. This part of the expression ensures that the regex consumes all characters within the quotes until it encounters the next single quote, which might signify the end of the quoted substring.

  6. *: This quantifier applies to the non-capturing group, allowing the contained pattern to repeat any number of times — including zero times — thus enabling the regex to match quoted substrings of any length.

  7. ': Matches the closing single quote of the substring.

  8. (?:\s|$): Another non-capturing group that operates as a condition for what follows the closing quote. It matches either:

    • \s: Any whitespace character (space, tab, newline), ensuring that the quoted substring is followed by whitespace, or
    • $: The end of the string (or of each line, if the pattern is compiled with re.MULTILINE), allowing for the quoted substring to appear at the end of the text.
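
The eight steps above can be written out as an annotated pattern using Python's `re.VERBOSE` flag. This is only a sketch for readability: the name `verbose_single_quote` is made up here, and the space in the lookahead is written as `[ ]` because verbose mode ignores bare whitespace.

```python
import re

# Compact form from the PR description
single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)"

# Same pattern, annotated step by step (hypothetical name)
verbose_single_quote = re.compile(
    r"""
    (               # capture the quoted field, including both quotes
      '             # 1. opening quote
      (?:           # 2. non-capturing group for the field body
        '(?![ ])    # 3. a quote that is not followed by a space
        |           # 4. or
        [^']        # 5. any character other than a quote
      )*            # 6. repeated zero or more times
      '             # 7. closing quote
    )
    (?:\s|$)        # 8. followed by whitespace or the end of the string
    """,
    re.VERBOSE,
)

# Both forms should agree on every input
for text in ["'5'-end' rest", "'plain'", "'a' 'b'", "'unterminated"]:
    compact = re.match(single_quote_pattern, text)
    verbose = verbose_single_quote.match(text)
    assert (compact is None) == (verbose is None)
    if compact is not None:
        assert compact.group(1) == verbose.group(1)
```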

The Key Points:

  • The pattern is designed to efficiently target substrings enclosed in single quotes within a larger string or document.
  • It smartly handles situations where an apostrophe is part of the enclosed text (like in contractions) without mistakenly recognizing it as the end of the quoted section.
  • By requiring the quoted substring to be followed by a space or the end of the text, it imposes a sensible boundary condition to identify discrete quoted substrings within a flow of text.
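
A few quick checks can make the boundary condition concrete. The helper `first_quoted` is a hypothetical name used only for this sketch, and the inputs are invented examples rather than lines from a real mmCIF file.

```python
import re

single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)"


def first_quoted(text):
    """Hypothetical helper: the single-quoted field at the start of text, or None."""
    match = re.match(single_quote_pattern, text)
    return match.group(1) if match else None


# A quote followed by a space closes the field
assert first_quoted("'alpha' 'beta'") == "'alpha'"
# A quote followed by another character stays inside the field
assert first_quoted("'alpha'beta' gamma") == "'alpha'beta'"
# A quote at the very end of the string also closes the field
assert first_quoted("'alpha'") == "'alpha'"
# Without any closing quote there is no match
assert first_quoted("'alpha beta") is None
```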

This regex could be particularly useful in text parsing applications where accurately distinguishing between quoted strings and regular text is crucial, such as in natural language processing tasks, data extraction, or in developing syntax highlighters for code editors.

@0ut0fcontrol
Contributor Author

@padix-key
I'm sorry, I've been too busy with work and haven't had much time to delve into regex.
Regex can be quite a headache.

Could you take a look at this solution?
I'm not sure if the test covers all scenarios, and I'm thinking of adding more tests.
Do you have any suggestions?

@0ut0fcontrol 0ut0fcontrol force-pushed the handle_embedded_quote branch from a09e8ab to 23f4e2f Compare June 30, 2024 09:36
@0ut0fcontrol 0ut0fcontrol marked this pull request as ready for review June 30, 2024 09:54
Member

@padix-key padix-key left a comment

Hi, thank you for preparing the fix! I have put a few suggestions into the review.

Review threads:

  • src/biotite/structure/io/pdbx/cif.py (5 threads, 1 outdated)
  • tests/structure/test_pdbx.py (1 thread, outdated)

@0ut0fcontrol
Contributor Author

Thank you for your review. I will provide feedback as soon as possible. I will have time in the evening or during the weekend.

@padix-key
Member

Thanks for the benchmarks. I will look into your comments tomorrow.

@padix-key
Member

It seems your approach is the most efficient one (at least I could not come up with a better one). So only two discussions remain.

@0ut0fcontrol
Contributor Author

Thank you for your review, I will finish this PR ASAP.

@0ut0fcontrol
Contributor Author

@padix-key
All discussions have been resolved. This PR is ready for review.

Member

@padix-key padix-key left a comment

Looks good to me! Thanks again for delving into regex and for the thorough benchmarks.

@padix-key padix-key merged commit 0404084 into biotite-dev:main Jul 14, 2024
19 of 20 checks passed
Successfully merging this pull request may close these issues.

Failed to deserialize category 'entity' with ValueError: No closing quotation
2 participants