Partial strings: Padding (almost) never needed #385

UWN · 2020-04-22T10:39:09Z

Whenever a partial string is created, its length is most often known. Thus there is no need for padding after the terminating 0-byte.

Except maybe for #251

triska · 2023-07-22T06:28:10Z

Do you mean that, for example, if we have the string "a", then instead of:

     +--------+
     |a0000000|
     +--------+
     |  []/0  |
     +--------+

we can "align" it so that only the terminating 0 is needed, and avoid the remaining padding:

     +--------+
     |......a0|
     +--------+
     |  []/0  |
     +--------+

So, this would free the first 6 bytes in the cell (and at the most it frees 7 bytes for a string), is this would you mean? What would be the advantage of this, can we do anything with these cells? It seems we could store another very short (complete) string there, is this the intention?

UWN · 2023-07-22T07:00:33Z

It is rather that processing such a more compact string becomes a tiny bit more efficient.

triska · 2023-07-22T07:12:04Z

As I understand it, as soon as 0 is encountered, the string is found to be terminated. Does the position within the cell make this more efficient? Intuitively, I expect the aligned RAM access that reads the WAM cell to take the most time when accessing the string, and it should in comparison not matter where the terminating 0 is within the cell, but maybe the 0 is accessed more efficiently if it is in the "lowest" (8-bytes) position of the cell, is that the intent?

UWN · 2023-07-22T09:53:12Z

Just space reduction (and thus cachelines and the like). As a worst case example consider [a,b,c,d,e,f,g|Xs] which requires only two memory cells in this configuration and requires three when some padding occurs.

UWN · 2023-07-22T10:03:58Z

... or as another worst case consider [a|Xs]: Without partial strings, this requires two cells. With partial strings this may require three cells, when the a is at the end of one cell, the next cell is just one zero (plus padding) and the third cell is Xs.

triska · 2024-06-09T16:53:58Z

with partial strings this may require three cells, when the a is at the end of one cell, the next cell is just one zero (plus padding) and the third cell is Xs.

So we have:

     +--------+
     |.......a|
     +--------+
     |00000000|
     +--------+
     |   Xs   |
     +--------+

Is this right? One of my questions is still: Why do we need the padding at all in this case, and in fact in any case? I.e., what about (. means "anything"):

     +--------+
     |.......a|
     +--------+
     |0.......|
     +--------+
     |   Xs   |
     +--------+

Is there any reason to even care about the bytes that follow after 0 in the second cell? In which cases does their content make a difference?

My other question concerns the [a,b,c,d,e,f,g|Xs] case and other longer strings: The most compact representation clearly is:

     +--------+
     |abcdefg0|
     +--------+
     |   Xs   |
     +--------+

An unfortunate representation would be for example:

     +--------+
     |....abcd|
     +--------+
     |efg00000|
     +--------+
     |   Xs   |
     +--------+

However, what exactly is Scryer Prolog expected to do now about strings to avoid padding? The above example seems to indicate that we should start with strings "early" in a cell, so that they do not "reach into" yet another cell at the end. On the other hand, we can also avoid padding by "pushing" strings to the right, so that for example instead of representing "abc" as:

     +--------+
     |abc00000|
     +--------+
     |  []/0  |
     +--------+

we push it to the right and store it as:

     +--------+
     |....abc0|
     +--------+
     |  []/0  |
     +--------+

But is there any advantage in preventing the padding here? It seems we cannot do anything with the "available" bytes in front of the string, so it seems the best strategy is to always start a string at the "start" of a cell, because this will make the most use of the available space in the cell. Is there an example where another strategy yields fewer occupied cells in total?

triska · 2024-06-09T17:03:17Z

Attempting to answer my own question, it seems that yes, we can indeed do something with the available space at the start, but indeed only if we do not "pad" at all, but stop after encountering the first 0 byte: In that case, we can use the space to store other (short) strings, such as "xyz":

     +--------+
     |xyz0abc0|
     +--------+
     |  []/0  |
     +--------+

I somehow doubt this is the intention here though?

UWN · 2024-06-10T09:22:12Z

In that case, we can use the space to store other (short) strings, such as "xyz"

I never thought about that. Think of it:

|a0b0c0d0|
+--------+
|  []/0  |

So two cells now represent what would be otherwise 8. Factor of 4! Nice, but GC might become much more difficult.

Why do we need the padding at all in this case, and in fact in any case?

When you start writing a partial string without knowing the actual length. For longer strings the padding might be a small overhead compared to computing the length first.

(deleted text)

triska · 2024-06-11T16:54:49Z

I summarize the issue as I understand it:

What follows after 0 in the final cell of a string does not matter; in an extreme scenario, even another (short) string can be stored there.
It is an advantage to minimize the padding. That is, for a string of known length ("known" can also be generalized to mean that it can be efficiently computed before storing the string on the heap), it is an advantage to store it "pushed to the right", so that the terminating 0 is the last byte in a cell.

The reason for (2) is subtle and was so far not explained in this issue. To see the advantage, suppose we store the string "abcdefgh" as:

     +--------+
     |abcdefgh|
     +--------+
     |0.......|
     +--------+
     |  []/0  |
     +--------+

It is better to store it instead "pushed to the right", as:

     +--------+
     |.......a|
     +--------+
     |bcdefgh0|
     +--------+
     |  []/0  |
     +--------+

This also takes 3 cells. But now we have an advantage: If the string prefix "a" is no longer needed, then it can be reclaimed by discarding the cell it occupies, because this short prefix now occupies a cell of its own and can be discarded in isolation. The same reasoning holds also for longer prefices: The entire cell can then be sooner discarded when the prefix is no longer needed.

At least that's my understanding of this issue. Is this the intended advantage?

UWN · 2024-06-12T05:43:25Z

What follows after 0 in the final cell of a string does not matter

Yes. But filling it with 0s makes sense, because this helps to detect invalid partial string pointers. That is, a partial string pointer that points to a 0 is invalid. Of course, that should never happen, but it is as useful for d-bugging as the Rust messages are.

UWN mentioned this issue Apr 22, 2020

atom_chars/2 incorrect #379

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Partial strings: Padding (almost) never needed #385

Partial strings: Padding (almost) never needed #385

UWN commented Apr 22, 2020 •

edited

Loading

triska commented Jul 22, 2023

UWN commented Jul 22, 2023

triska commented Jul 22, 2023

UWN commented Jul 22, 2023

UWN commented Jul 22, 2023

triska commented Jun 9, 2024 •

edited

Loading

triska commented Jun 9, 2024

UWN commented Jun 10, 2024 •

edited

Loading

triska commented Jun 11, 2024

UWN commented Jun 12, 2024

Partial strings: Padding (almost) never needed #385

Partial strings: Padding (almost) never needed #385

Comments

UWN commented Apr 22, 2020 • edited Loading

triska commented Jul 22, 2023

UWN commented Jul 22, 2023

triska commented Jul 22, 2023

UWN commented Jul 22, 2023

UWN commented Jul 22, 2023

triska commented Jun 9, 2024 • edited Loading

triska commented Jun 9, 2024

UWN commented Jun 10, 2024 • edited Loading

triska commented Jun 11, 2024

UWN commented Jun 12, 2024

UWN commented Apr 22, 2020 •

edited

Loading

triska commented Jun 9, 2024 •

edited

Loading

UWN commented Jun 10, 2024 •

edited

Loading