Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harmonization of partial strings and strings #276

Closed
triska opened this issue Mar 1, 2020 · 12 comments
Closed

Harmonization of partial strings and strings #276

triska opened this issue Mar 1, 2020 · 12 comments

Comments

@triska
Copy link
Contributor

triska commented Mar 1, 2020

I have an implementation question regarding partial strings and strings: Why are these cases considered separately?

I mean for example, in machine_state.rs, there are definitions for PStrTail and also for StringTail:

pub(super) enum HeapPtr {
    HeapCell(usize),
    PStrChar(usize, usize),
    PStrTail(usize, usize),
    StringChar(usize, Rc),
    StringTail(usize, Rc),
}

However, a string — as far as the internal implementation is concerned — can be represented as a partial string that ends with [] (instead of with a variable).

For instance, the partial string [a,b,c|Xs] is represented on the heap as:

abc00000
Xs

and the string "abc", i.e., the list [a,b,c], is represented on the heap as:

abc00000
[]

Please see #24, which actually starts for (complete) strings, and mentions partial strings as "a step further". #95 contains examples for both cases. The point is that these cases are completely uniform: 0 terminates a string (or partial string), and the next cell on the heap tells us whether it is a partial list (if the next cell is a variable) or a complete list (if the next cell is []).

Note that in this representations, strings (both partial and complete) are terminated by 0, it is never necessary to store their lengths explicitly. Further instantiating a partial string means that the variable in the next heap cell becomes bound to a string in this representation (or to any other term).

Hence, to implement both partial and complete strings efficiently, only a single tag seems to be necessary, indicating a pointer to a string in this representation.

I have uploaded a short video that explains this representation as I understand it:

https://www.metalevel.at/prolog/videos/representing_strings

Does this help to simplify the implementation, so that for example fewer tags and cases are needed?

@UWN
Copy link

UWN commented Mar 1, 2020

-- and: How does a pointer now look like? Where is it tagged?

@mthom
Copy link
Owner

mthom commented Mar 1, 2020

Yes, partial strings and complete strings should be consolidated into a single data type, I agree. The separation that exists now is for reasons dating back to the very start of the project. That is, Scryer still largely operates under the memory model of Rust, which has no GC and is RAII-based. In particular, Rc<String> is used to convey strings from the parser to the Scryer runtime.

Partial strings are implemented mostly in the terms described in your Representing Strings video (thank you for that resource, btw!) but complete strings aren't, yet. Once I begin writing the garbage collector, I plan to introduce custom data types for wrapping references to heap data, which should be much more compact than those generated by Rust, as they'll no longer need much of their current overhead. Overall they will more faithfully reflect the design of the WAM, and mark the beginning of a shift away from a Rust-style memory model and toward a style of memory management typical of other Prolog implementations, one over which I'll have a much finer degree of control.

The pointer variants that you name, ie., PStrChar(usize, usize), etc., do not store the length of partial strings. They store, first, the location of the partial string in the heap, and the byte offset into the partial string, never the length. I've watched Representing Strings several times, and I'm certain this is consonant with the standard outlined there. The HeapPtr type is only ever written to the S field of the WAM, BTW, and does not appear in any WAM registers, on the stack, or in the heap. It's solely used for managing offsets into lists between WAM instructions that manipulate list-like data. Further variants aren't needed for actual lists because of how lists are compressed, hence the HeapCell variant. But, pointers into partial strings are represented similarly in the WAM proper, and once complete strings are properly enshrined as a kind of specialized partial string, the number of variants dealing with partial strings will be reduced to just the two, partial string heap locations/byte offsets and partial string tails. I should stress that partial strings and complete strings' contents are not currently inlined in the heap, they are stored in dedicated buffers pointed to from within the heap, hence the need for both a heap offset and a byte offset.

With the current implementation of partial strings, I've tried to get Ulrich's semantics right and provide as close an approximation as I can to the representation he originally envisioned, under the current constraints of the memory model. I plan to lift those constraints soon and get even closer to his original idea by adding GC, and in the process satisfy most if not all of these concerns, but that will take still more time.

@UWN
Copy link

UWN commented Mar 1, 2020

Still I am a bit baffled. It seems that two pointers fit into a single variable. Right?

@UWN
Copy link

UWN commented Mar 1, 2020

.. or bettter: two pointers fit into a single register

@UWN
Copy link

UWN commented Mar 2, 2020

@mthom, could you please confirm/reject my two comments?

@triska
Copy link
Contributor Author

triska commented Mar 2, 2020

Also from me, thank you for your explanation! I also have a follow-up question, regarding this statement:

the number of variants dealing with partial strings will be reduced to just the two, partial string heap locations/byte offsets and partial string tails.

In particular, why is there a variant for "partial string tails"? As I understand it, a tail of a partial string is simply a cell in the heap, much like any other cell. If that cell contains a variable, then it is the tail of a partial string (i.e., partial list). We recognize that it is a variable due to its being tagged as a variable. Otherwise, the tail is whatever term is present there, again tagged with the appropriate tag.

For instance, the cell can become a pointer to a partial string, or the functor '.' with two arguments (this is necessary to represent 0-bytes that occur within strings. Or it can be (or become) the atom []. Each of these are indicated by a specific tag (a marker that indicates whether variable, an atom, or a partial string is stored in the cell), i.e., a specific bit combination that indicates the type of the cell at that place.

It should not be necessary to indicate specifically that any cell is the tail of a partial string. In the predicate partial_string/3, the third argument is a variable that points to the cell where the tail is stored, and that cell will often contain a variable, hence being tagged as a variable on the heap. Once that variable becomes instantiated, the cell is tagged accordingly, and the (pointer to the) instantiated term - whatever it is - is stored in the cell that originally denoted the tail.

So, only pointers to the heap should be necessary, because the cells are themselves tagged to indicate what is present there. A string (or partial string) is recognized by its tag, and from there the interpreter must sequentially read the bytes (UTF-8 decoding one char after the other) until a 0-byte is encountered. The following cell tells us (via its tag) how it continues, notably whether this is a partial or complete string. Is this the representation you are now envisaging?

@mthom
Copy link
Owner

mthom commented Mar 2, 2020

It seems that two pointers fit into a single variable. Right?

That's right, two pointers in a single variable.

@mthom
Copy link
Owner

mthom commented Mar 2, 2020

I think you're right, PStrTail shouldn't be needed anywhere. Will confirm later tonight.

@triska
Copy link
Contributor Author

triska commented Mar 3, 2020

In other words, the key operation I expect to see somewhere is an actual scan of the heap, starting from an appropriately tagged pointer that points to the start of a string (whether partial or complete), and reading the characters, bytewise, using UTF-8 decoding, as they appear in the string, until a 0-byte is encountered, then inspecting the next cell to determine whether the string is complete ([]) or partial (a variable follows and can be recognized by its tag), or a 0-byte must be added to the string ('.'('\0', ...)), or another (partial or complete) string follows as continuation of the present one (indicated again by an appropriately tagged pointer) etc. The key aspect seems to be to ensure that all low-level primitives, most notably unification, are made aware of this representation of strings/lists.

This is what you are now already working on in tandem with GC? I am very much looking forward to this!

@mthom
Copy link
Owner

mthom commented Mar 3, 2020

There are many opportunities to do batched comparisons of strings against strings, but for now Scryer follows the classic WAM style of unifying a single list element at a time, even on partial and complete strings. That's an optimization I intend to add before moving on to GC. One such trick is to detect patterns like:

foo([b,a,r|L], ...)

and then compile the [b,a,r|L] to code that instantiates a partial string with those contents instead of the usual list of characters. Do I have the right idea there?

For GC, I'm still surveying papers for a suitable algorithm. I'm now leaning toward "Increment copying garbage collection for WAM-based Prolog systems" by Demoen and Vandeginste, but the exposition is unclear on some points. Even if I could say I understood it all, it seems too complex. So I'm open to other suggestions.

@UWN
Copy link

UWN commented Mar 3, 2020

Before looking at algorithms consider first what you want to maintain by GC.

Variable ordering. It's effectively implementation dependent, but many rely on a specific ordering which roughly corresponds to the age of a variable. That is, as long as there is no unification happening, the order is preserved. From a standard's viewpoint this ordering needs to be preserved only during sorting. I.e., within a sort, all variables are ordered the same way. But still, people rely on such an order outside of sorting. For this reason, Morris I (CACM 1978) is used by most collectors to preserve relative variable ordering and permit chronological backtracking. Brief, Mark and Slide. Did you find anything of this mentioned in the paper? I did not on a first scan.

Partial strings. There are some extra challenges. In particular, since the heap can no longer be scanned directly (but then, it is marked first) Also naive copying can destroy the sharing. Think of: PS = "abcde", A1 = PS, A1 = [_|A2], A2 = [_|A3], copy_term(p(A3,A2,A1,[]), X)

Destructive, non-backtrackable updates. Some of these make sense. There was even an attempt to standardize them, but it did not find sufficient acceptance. At least, uses like call_nth/2 should be possible in a safe manner. One simple rule is: A structure whose argument should be destructively changed must contain a variable as argument (to preserve identity).

Sharing of db-terms. A ground fact does not need to copy the ground term at all, it might be shared directly.

Creating partial strings. Partial lists that have characters as elements may now be converted into partial strings. Partial strings themselves might be merged.

Sharing of identical subterms. When terms are ==, they could be shared. Think of an expanded version of a blam list. The general algorithm behind is minimization of finite automata.

p(X) :- p(s(X)).  % sic

?- p(X).  % runs out of space.

?- X = s(X), p(X).   % loops ideally in constant space

mthom added a commit that referenced this issue Mar 17, 2020
@triska
Copy link
Contributor Author

triska commented Apr 12, 2020

This is now brilliantly resolved, so I am closing this issue.

The main remaining takeaway are the points about GC in #276 (comment).

Thank you a lot!

@triska triska closed this as completed Apr 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants