I worked a bit on this performance issue and can propose a function that handles unquoting faster than the current regex implementation.
I reused the benchmark from PR #619 to compare all the solutions:
```
run number=1000000 times timeit.timeit('test_str_split(line)', number=number, globals=locals()) = 2.673100476997206 s
run number=1000000 times timeit.timeit('test_shlex_split(line)', number=number, globals=locals()) = 33.558549724999466 s
run number=1000000 times timeit.timeit('test_re_split(line)', number=number, globals=locals()) = 17.103288242000417 s
run number=1000000 times timeit.timeit('test_partitioned_split(line)', number=number, globals=locals()) = 4.747100908996799 s
```
Here is the testing code:
```python
import re
import shlex

# Declare quoting pattern and pre-compile regex once
single_quote_pattern = r"('(?:'(?! )|[^'])*')(?:\s|$)"
double_quote_pattern = r'("(?:"(?! )|[^"])*")(?:\s|$)'
unquoted_pattern = r"([^\s]+)"
# Combine the patterns using alternation
combined_pattern = f"{single_quote_pattern}|{double_quote_pattern}|{unquoted_pattern}"
FINAL_REGEX = re.compile(combined_pattern)


def split_line(line):
    values = line.split()
    for k in range(len(values)):
        # Remove quotes
        if (values[k][0] == '"' and values[k][-1] == '"') or (
            values[k][0] == "'" and values[k][-1] == "'"
        ):
            values[k] = values[k][1:-1]
    return values


def _split_one_line(line):
    """
    Split a line into its fields.
    Supports embedded quotes (' or "), like `'a dog's life'` to `a dog's life`.
    """
    # Special case of multiline value, where the line starts with ';'
    if line[0] == ";":
        return [line[1:]]
    # Find all matches
    matches = FINAL_REGEX.findall(line)
    # Extract non-empty groups from the matches
    fields = []
    for match in matches:
        field = next(group for group in match if group)
        if field[0] == field[-1] == "'" or field[0] == field[-1] == '"':
            field = field[1:-1]
        fields.append(field)
    return fields


def partitioned_split_line(line: str) -> list[str]:
    # Special case of multiline value, where the line starts with ';'
    if line[0] == ";":
        return [line[1:]]
    # Initialize loop variables
    res = []
    concat_flag = False
    quote = ""
    to_concat = ""
    # Loop over the line
    while line:
        # Split the line on whitespace
        word, _, line = line.partition(" ")
        # Handle the case where the word starts with a quote...
        if not concat_flag and word.startswith(("'", '"')):
            quote = word[0]
            # ... and ends with a quote as well, meaning we have a single quoted word
            if word.endswith(quote):
                res.append(word[1:-1])
                continue
            # ... and does not end with a quote, meaning we are in a multiword
            # quoted string.
            # We will need to concatenate the words until we find the closing quote
            concat_flag = True
            to_concat = word[1:]
        # Handle the case where we are in the middle of a multiword quoted string
        elif concat_flag and to_concat and quote:
            # Check if the word ends with the matching closing quote
            if word.endswith(quote):
                concat_flag = False
                to_concat += " " + word[:-1]
                quote = ""
                res.append(to_concat)
            # If not, we continue to concatenate the words
            else:
                to_concat += " " + word
        # Handle the easiest case, where we have an unquoted word
        else:
            res.append(word)
    return res


line = """ATOM 1 N PRO A 1 '1.400' "10.00" 10.000 1.00 0.00 N """


def test_re_split(line):
    _split_one_line(line)


def test_str_split(line):
    split_line(line)


def test_shlex_split(line):
    shlex.split(line)


def test_partitioned_split(line):
    partitioned_split_line(line)


if __name__ == "__main__":
    import timeit

    number = 1000000
    print(
        f"run {number=} times {timeit.timeit('test_str_split(line)', number=number, globals=locals()) =} s"
    )
    print(
        f"run {number=} times {timeit.timeit('test_shlex_split(line)', number=number, globals=locals()) =} s"
    )
    print(
        f"run {number=} times {timeit.timeit('test_re_split(line)', number=number, globals=locals()) =} s"
    )
    print(
        f"run {number=} times {timeit.timeit('test_partitioned_split(line)', number=number, globals=locals()) =} s"
    )
```
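As a quick behavioral sanity check, the whitespace-aware splitter should agree with `shlex.split` on a typical quoted record. The sketch below repeats `partitioned_split_line` from the snippet above so it runs standalone; the sample line is an invented illustration, not taken from a real file:

```python
import shlex


def partitioned_split_line(line: str) -> list[str]:
    # Same logic as the proposal above, repeated so this snippet is standalone
    if line[0] == ";":
        return [line[1:]]
    res = []
    concat_flag = False
    quote = ""
    to_concat = ""
    while line:
        word, _, line = line.partition(" ")
        if not concat_flag and word.startswith(("'", '"')):
            quote = word[0]
            if word.endswith(quote):
                res.append(word[1:-1])
                continue
            concat_flag = True
            to_concat = word[1:]
        elif concat_flag and to_concat and quote:
            if word.endswith(quote):
                concat_flag = False
                to_concat += " " + word[:-1]
                quote = ""
                res.append(to_concat)
            else:
                to_concat += " " + word
        else:
            res.append(word)
    return res


# A quoted field containing a space survives the split
demo = "ATOM 1 N 'C 1' PRO A 1"
fields = partitioned_split_line(demo)
assert fields == shlex.split(demo)
assert fields == ['ATOM', '1', 'N', 'C 1', 'PRO', 'A', '1']
```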
I copied the `_split_one_line` function currently used in biotite, to also measure the performance improvement of a pre-compiled regex over a plain `findall`.
Given the better performance of `partitioned_split_line`, would it be possible to use this whitespace-safe function to read `atom_site` blocks by default? Or at least to expose a parameter in the reader to enable it?
Hi, this looks intriguing. It would be amazing to get rid of this rather hacky `expect_whitespace` parameter with no significant performance penalty. Could you create a PR for this? Then we can check whether it handles all the edge cases in the test suite, and what the overall performance impact is, using the benchmarks in the CI.
I have to work with in-house CIF files whose `label_atom_id` values include whitespace (which is allowed by the official mmCIF dictionary), but I found that biotite does not expect whitespace in the `atom_site` category, in order to optimize parsing performance.
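For illustration (the record below is a made-up example, not from a real file), a plain `str.split` tears apart a quoted atom name containing a space, while a quote-aware splitter such as `shlex.split` keeps the field intact:

```python
import shlex

# Hypothetical atom_site record with a quoted atom name containing a space
record = "ATOM 1 N 'N A' PRO"

naive = record.split()      # quote-unaware: the quoted field is torn apart
safe = shlex.split(record)  # quote-aware: the field stays intact

assert naive == ['ATOM', '1', 'N', "'N", "A'", 'PRO']
assert safe == ['ATOM', '1', 'N', 'N A', 'PRO']
```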