-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Harmonization of partial strings and strings #276
Comments
-- and: How does a pointer now look like? Where is it tagged? |
Yes, partial strings and complete strings should be consolidated into a single data type, I agree. The separation that exists now is for reasons dating back to the very start of the project. That is, Scryer still largely operates under the memory model of Rust, which has no GC and is RAII-based. In particular, Partial strings are implemented mostly in the terms described in your Representing Strings video (thank you for that resource, btw!) but complete strings aren't, yet. Once I begin writing the garbage collector, I plan to introduce custom data types for wrapping references to heap data, which should be much more compact than those generated by Rust, as they'll no longer need much of their current overhead. Overall they will more faithfully reflect the design of the WAM, and mark the beginning of a shift away from a Rust-style memory model and toward a style of memory management typical of other Prolog implementations, one over which I'll have a much finer degree of control. The pointer variants that you name, ie., With the current implementation of partial strings, I've tried to get Ulrich's semantics right and provide as close an approximation as I can to the representation he originally envisioned, under the current constraints of the memory model. I plan to lift those constraints soon and get even closer to his original idea by adding GC, and in the process satisfy most if not all of these concerns, but that will take still more time. |
Still I am a bit baffled. It seems that two pointers fit into a single variable. Right? |
.. or bettter: two pointers fit into a single register |
@mthom, could you please confirm/reject my two comments? |
Also from me, thank you for your explanation! I also have a follow-up question, regarding this statement: the number of variants dealing with partial strings will be reduced to just the two, partial string heap locations/byte offsets and partial string tails. In particular, why is there a variant for "partial string tails"? As I understand it, a tail of a partial string is simply a cell in the heap, much like any other cell. If that cell contains a variable, then it is the tail of a partial string (i.e., partial list). We recognize that it is a variable due to its being tagged as a variable. Otherwise, the tail is whatever term is present there, again tagged with the appropriate tag. For instance, the cell can become a pointer to a partial string, or the functor It should not be necessary to indicate specifically that any cell is the tail of a partial string. In the predicate So, only pointers to the heap should be necessary, because the cells are themselves tagged to indicate what is present there. A string (or partial string) is recognized by its tag, and from there the interpreter must sequentially read the bytes (UTF-8 decoding one char after the other) until a |
That's right, two pointers in a single variable. |
I think you're right, |
In other words, the key operation I expect to see somewhere is an actual scan of the heap, starting from an appropriately tagged pointer that points to the start of a string (whether partial or complete), and reading the characters, bytewise, using UTF-8 decoding, as they appear in the string, until a This is what you are now already working on in tandem with GC? I am very much looking forward to this! |
There are many opportunities to do batched comparisons of strings against strings, but for now Scryer follows the classic WAM style of unifying a single list element at a time, even on partial and complete strings. That's an optimization I intend to add before moving on to GC. One such trick is to detect patterns like:
and then compile the For GC, I'm still surveying papers for a suitable algorithm. I'm now leaning toward "Increment copying garbage collection for WAM-based Prolog systems" by Demoen and Vandeginste, but the exposition is unclear on some points. Even if I could say I understood it all, it seems too complex. So I'm open to other suggestions. |
Before looking at algorithms consider first what you want to maintain by GC. Variable ordering. It's effectively implementation dependent, but many rely on a specific ordering which roughly corresponds to the age of a variable. That is, as long as there is no unification happening, the order is preserved. From a standard's viewpoint this ordering needs to be preserved only during sorting. I.e., within a sort, all variables are ordered the same way. But still, people rely on such an order outside of sorting. For this reason, Morris I (CACM 1978) is used by most collectors to preserve relative variable ordering and permit chronological backtracking. Brief, Mark and Slide. Did you find anything of this mentioned in the paper? I did not on a first scan. Partial strings. There are some extra challenges. In particular, since the heap can no longer be scanned directly (but then, it is marked first) Also naive copying can destroy the sharing. Think of: Destructive, non-backtrackable updates. Some of these make sense. There was even an attempt to standardize them, but it did not find sufficient acceptance. At least, uses like Sharing of db-terms. A ground fact does not need to copy the ground term at all, it might be shared directly. Creating partial strings. Partial lists that have characters as elements may now be converted into partial strings. Partial strings themselves might be merged. Sharing of identical subterms. When terms are ==, they could be shared. Think of an expanded version of a blam list. The general algorithm behind is minimization of finite automata.
|
This is now brilliantly resolved, so I am closing this issue. The main remaining takeaway are the points about GC in #276 (comment). Thank you a lot! |
I have an implementation question regarding partial strings and strings: Why are these cases considered separately?
I mean for example, in
machine_state.rs
, there are definitions forPStrTail
and also forStringTail
:However, a string — as far as the internal implementation is concerned — can be represented as a partial string that ends with [] (instead of with a variable).
For instance, the partial string
[a,b,c|Xs]
is represented on the heap as:abc00000
Xs
and the string
"abc"
, i.e., the list[a,b,c]
, is represented on the heap as:abc00000
[]
Please see #24, which actually starts for (complete) strings, and mentions partial strings as "a step further". #95 contains examples for both cases. The point is that these cases are completely uniform:
0
terminates a string (or partial string), and the next cell on the heap tells us whether it is a partial list (if the next cell is a variable) or a complete list (if the next cell is[]
).Note that in this representations, strings (both partial and complete) are terminated by
0
, it is never necessary to store their lengths explicitly. Further instantiating a partial string means that the variable in the next heap cell becomes bound to a string in this representation (or to any other term).Hence, to implement both partial and complete strings efficiently, only a single tag seems to be necessary, indicating a pointer to a string in this representation.
I have uploaded a short video that explains this representation as I understand it:
https://www.metalevel.at/prolog/videos/representing_strings
Does this help to simplify the implementation, so that for example fewer tags and cases are needed?
The text was updated successfully, but these errors were encountered: