Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Find-based replace implementation #146

Open
hiiamboris opened this issue May 22, 2023 · 18 comments
Open

Find-based replace implementation #146

hiiamboris opened this issue May 22, 2023 · 18 comments

Comments

@hiiamboris
Copy link

hiiamboris commented May 22, 2023

Rationale

Current replace implementation was stiched together by different people with different small needs and no central design:

  • it switches from find to parse mode when /deep is used and in other hard to predict cases, totally changing semantics and disabling the possibility of /same comparison and fast hash lookups
  • usage of active values (functions), which works only in find mode, adds loop properties to it and makes a little step into map-each/mapparse territory
  • usage of change makes it of quadratic complexity, seriously limiting applicability
  • its internal logic is extremely hard to follow, so only tests can ensure some level of validity
  • some of the find features (/only, /same, /part) are not supported by it
  • most used mode /all is not the default one

See also red/red#4945 previously ignored

Proposal

New implementation:

Arguments in favor of mapparse and against replace/parse refinement:

  • docstrings become very hard to write, ambiguous (main reason)
  • zero shared code
  • not only pattern, but also value changes meaning in /parse mode
  • replace expects a constant value (it's optimized for that), while mapparse evaluates a code block (how to tell /parse mode if block should be treated as code or as a value)?
  • in /parse mode: /same makes no sense, /only does not apply to pattern anymore
  • mapparse is consistent with forparse, and if we integrate it into replace, what to do with forparse?
  • forparse and mapparse are loops, so they support break & continue in their code blocks, so their logic is different
  • as a loop, mapparse has different argument order - same as the other loops (foreach, repeat)
  • extra refinement increases call length

Performance comparison

I have made two variants of the new implementation: replace2 is straightforward and compiler-friendly, while replace1 uses blocks of code to make the loop tighter (but they are interpreted). replace/all is the old implementation.

in place, big buffer
209 ms  [replace1 block 0 1]
106 ms  [replace2 block 1 2]
2486 ms [replace/all block 2 0]

length change, big buffer
216 ms  [replace1 block 0 [1 2]]
107 ms  [replace2 block 0 [1 2]]
7040 ms [replace/all block 0 [1 2]]

length change, mid buffer
421 ms  [loop 100 [replace1 block 0 [1 2] replace1 block [1 2] 0]]
214 ms  [loop 100 [replace2 block 0 [1 2] replace2 block [1 2] 0]]
318 ms  [loop 100 [replace/all block 0 [1 2] replace/all block [1 2] 0]]

length change, small buffer
672 ms  [loop 10000 [replace1 block 0 [1 2] replace1 block [1 2] 0]]
328 ms  [loop 10000 [replace2 block 0 [1 2] replace2 block [1 2] 0]]
254 ms  [loop 10000 [replace/all block 0 [1 2] replace/all block [1 2] 0]]

length change, mid buffer (hash)
2729 ms [loop 100 [replace1 hash 0 [1 2] replace1 hash [1 2] 0]]
2225 ms [loop 100 [replace2 hash 0 [1 2] replace2 hash [1 2] 0]]
27395 ms        [loop 100 [replace/all hash 0 [1 2] replace/all hash [1 2] 0]]

length change, small buffer (hash)
797 ms  [loop 10000 [replace1 hash 0 [1 2] replace1 hash [1 2] 0]]
504 ms  [loop 10000 [replace2 hash 0 [1 2] replace2 hash [1 2] 0]]
669 ms  [loop 10000 [replace/all hash 0 [1 2] replace/all hash [1 2] 0]]

init time
1064 ms [repeat i 100000 [replace1 block i - 1 i]]
455 ms  [repeat i 100000 [replace2 block i - 1 i]]
459 ms  [repeat i 100000 [replace/all block i - 1 i]]

Due to compiler benefits replace2 usually wins over replace1. And considering that get-words in paths cannot be compiled yet, and it includes workarounds for red/red#5319 and red/red#5320 , it should perform better once those kludges are removed. Compared to the old implementation, performance is much more predictable and linear, with worst case slowdown currently down to 23% due to allocation of a new buffer.

Tests

replace test suite adapted for the new version is here. Some tests were refitted with mapparse. Some currently fail due to red/red#5321

@greggirwin
Copy link
Contributor

Great stuff @hiiamboris.

if all [deep once] [cause-error 'script 'bad-refines []] ;-- incompatible

Isn't it possible that you'd want to replace a single value, anywhere in a deep structure? I agree that the normal case is /all with /deep, and this could be relaxed later if asked for. First use case I think of is that some data value shows up in an error scenario, and you want to test with the data to see if it happens in every case, or what related data there may be.

@greggirwin
Copy link
Contributor

unless any-block? series [deep: off]

This is an interesting one. If we remove paren support from paths, they shouldn't be nestable, right?

@greggirwin
Copy link
Contributor

greggirwin commented May 22, 2023

Also on paths, are we sure we want this behavior?

>> replace1/deep copy [a b c [a/b/c]] 'b 'x
== [a x c [a/x/c]]

In some cases that will be exactly what we want, but should we always treat block/list the same?

@greggirwin
Copy link
Contributor

Does size need to be dynamic?

;; pattern size will be found out after first match:
size: [size: offset? match find/:case/:same/:only/match/tail match :pattern]

It's niftier than size: either only [1][length? pattern], but even with the comment I had to scan and parse things a few times to figure it out as future references and do use are tricky. It is a nice use of fixed+dynamic refinements though.

@greggirwin
Copy link
Contributor

greggirwin commented May 22, 2023

Setup thought, just more straight-line:

start: series					;-- starting offset may be adjusted if part is negative
limit: system/words/case [
	none? limit [tail series]
	integer? limit [skip start limit]	;-- convert limit to series, or will have to update it all the time
	; no action if already a series.
]
if back?: negative? offset? start limit [	;-- flip start/end if limit was negative, always go forward
	start: limit
	limit: series
]

@greggirwin
Copy link
Contributor

I'm reminded that I don't love =?, preferring same?. I don't think it's a great symbol, given the meaning. === seems better if we're escalating comparison strictness, from = == ===, but even then I prefer same? for explicitness.

@hiiamboris
Copy link
Author

Isn't it possible that you'd want to replace a single value, anywhere in a deep structure? I agree that the normal case is /all with /deep, and this could be relaxed later if asked for. First use case I think of is that some data value shows up in an error scenario, and you want to test with the data to see if it happens in every case, or what related data there may be.

This case sounds made up to me ;) But anyway, if replace/once returns position after the replacement, what should it return in deep mode? Another series? How do you continue your replacement from another series? This is the main issue.

unless any-block? series [deep: off]

This is an interesting one. If we remove paren support from paths, they shouldn't be nestable, right?

Paths constructed at runtime will still be.

Also on paths, are we sure we want this behavior?

>> replace1/deep copy [a b c [a/b/c]] 'b 'x
== [a x c [a/x/c]]

In some cases that will be exactly what we want, but should we always treat block/list the same?

Not sure, but paths are troublesome without it. Question is, is it worth another refinement, e.g. /paths?

There's also another direction here: to run deep replacement on all inner series - strings, binaries, vectors. Should we have control over it or we just forward everyone into mapparse with this?

Does size need to be dynamic?

;; pattern size will be found out after first match:
size: [size: offset? match find/:case/:same/:only/match/tail match :pattern]

It's niftier than size: either only [1][length? pattern], but even with the comment I had to scan and parse things a few times to figure it out as future references and do use are tricky. It is a nice use of fixed+dynamic refinements though.

Let it be your homework then to find cases where size: either only [1][length? pattern] doesn't work :)
I'm using find because my head would explode from an attempt to figure out a reliable future-proof and concise way of manually keeping it in sync with find.

Setup thought, just more straight-line:

That may be less nested, but also heavier for the no-limit case.

I haven't by the way considered the past-tail index case here. I'd prefer a language-wide solution for it, rather than patching holes arbitrarily.

@greggirwin
Copy link
Contributor

Good thoughts. More questions. :^\

@greggirwin
Copy link
Contributor

That may be less nested, but also heavier for the no-limit case.

I'm all for profiling it in the grand scheme of replace. :^)

@hiiamboris
Copy link
Author

hiiamboris commented May 23, 2023

I'm all for profiling it in the grand scheme of replace. :^)

OK (init time, sample size=10):

@greggirwin
Copy link
Contributor

Cool, I can't tell what those numbers mean though, unless it makes the whole thing 10% slower, which is surprising.

@hiiamboris
Copy link
Author

It's no surprise. offset? is a mezz.

@greggirwin
Copy link
Contributor

OK then, just flip the blocks so the check is none? limit.

@hiiamboris
Copy link
Author

Done.

@greggirwin
Copy link
Contributor

greggirwin commented May 23, 2023

Cool. How is =? better than none? here, for my edification, as it's non-idiomatic?

@hiiamboris
Copy link
Author

Cool. How is =? better than none? here, for my edification, as it's non-idiomatic?

>> clock/times [1 =? none] 1e7
0.12 μs	[1 =? none]
>> clock/times [none? 1] 1e7
0.25 μs	[none? 1]

@dockimbel
Copy link
Member

unless pos =? match

I stumbled for a couple seconds there, as I wondered what was =?... Seriously, everybody is using the much more readable same?, we should remove =? entirely, as it's really non-intuitive. You could simply write unless same? pos match...

Also:

>> clock/times [none =? limit] 1e7
0.06 μs	[none =? limit]
== false
>> clock/times [not limit] 1e7
0.06 μs	[not limit]

@greggirwin
Copy link
Contributor

greggirwin commented May 23, 2023

@hiiamboris, which was why I said " in the grand scheme of replace" initially. profile tells the story nicely.

+1 for removing =?, but then I feel the same about unless. Do you remember why Carl added it? :^)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants