Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partial strings: Padding (almost) never needed #385

Open
UWN opened this issue Apr 22, 2020 · 10 comments
Open

Partial strings: Padding (almost) never needed #385

UWN opened this issue Apr 22, 2020 · 10 comments

Comments

@UWN
Copy link

UWN commented Apr 22, 2020

Whenever a partial string is created, its length is most often known. Thus there is no need for padding after the terminating 0-byte.

Except maybe for #251

@triska
Copy link
Contributor

triska commented Jul 22, 2023

Do you mean that, for example, if we have the string "a", then instead of:

     +--------+
     |a0000000|
     +--------+
     |  []/0  |
     +--------+

we can "align" it so that only the terminating 0 is needed, and avoid the remaining padding:

     +--------+
     |......a0|
     +--------+
     |  []/0  |
     +--------+

So, this would free the first 6 bytes in the cell (and at the most it frees 7 bytes for a string), is this would you mean? What would be the advantage of this, can we do anything with these cells? It seems we could store another very short (complete) string there, is this the intention?

@UWN
Copy link
Author

UWN commented Jul 22, 2023

It is rather that processing such a more compact string becomes a tiny bit more efficient.

@triska
Copy link
Contributor

triska commented Jul 22, 2023

As I understand it, as soon as 0 is encountered, the string is found to be terminated. Does the position within the cell make this more efficient? Intuitively, I expect the aligned RAM access that reads the WAM cell to take the most time when accessing the string, and it should in comparison not matter where the terminating 0 is within the cell, but maybe the 0 is accessed more efficiently if it is in the "lowest" (8-bytes) position of the cell, is that the intent?

@UWN
Copy link
Author

UWN commented Jul 22, 2023

Just space reduction (and thus cachelines and the like). As a worst case example consider [a,b,c,d,e,f,g|Xs] which requires only two memory cells in this configuration and requires three when some padding occurs.

@UWN
Copy link
Author

UWN commented Jul 22, 2023

... or as another worst case consider [a|Xs]: Without partial strings, this requires two cells. With partial strings this may require three cells, when the a is at the end of one cell, the next cell is just one zero (plus padding) and the third cell is Xs.

@triska
Copy link
Contributor

triska commented Jun 9, 2024

with partial strings this may require three cells, when the a is at the end of one cell, the next cell is just one zero (plus padding) and the third cell is Xs.

So we have:

     +--------+
     |.......a|
     +--------+
     |00000000|
     +--------+
     |   Xs   |
     +--------+

Is this right? One of my questions is still: Why do we need the padding at all in this case, and in fact in any case? I.e., what about (. means "anything"):

     +--------+
     |.......a|
     +--------+
     |0.......|
     +--------+
     |   Xs   |
     +--------+

Is there any reason to even care about the bytes that follow after 0 in the second cell? In which cases does their content make a difference?

My other question concerns the [a,b,c,d,e,f,g|Xs] case and other longer strings: The most compact representation clearly is:

     +--------+
     |abcdefg0|
     +--------+
     |   Xs   |
     +--------+

An unfortunate representation would be for example:

     +--------+
     |....abcd|
     +--------+
     |efg00000|
     +--------+
     |   Xs   |
     +--------+

However, what exactly is Scryer Prolog expected to do now about strings to avoid padding? The above example seems to indicate that we should start with strings "early" in a cell, so that they do not "reach into" yet another cell at the end. On the other hand, we can also avoid padding by "pushing" strings to the right, so that for example instead of representing "abc" as:

     +--------+
     |abc00000|
     +--------+
     |  []/0  |
     +--------+

we push it to the right and store it as:

     +--------+
     |....abc0|
     +--------+
     |  []/0  |
     +--------+

But is there any advantage in preventing the padding here? It seems we cannot do anything with the "available" bytes in front of the string, so it seems the best strategy is to always start a string at the "start" of a cell, because this will make the most use of the available space in the cell. Is there an example where another strategy yields fewer occupied cells in total?

@triska
Copy link
Contributor

triska commented Jun 9, 2024

Attempting to answer my own question, it seems that yes, we can indeed do something with the available space at the start, but indeed only if we do not "pad" at all, but stop after encountering the first 0 byte: In that case, we can use the space to store other (short) strings, such as "xyz":

     +--------+
     |xyz0abc0|
     +--------+
     |  []/0  |
     +--------+

I somehow doubt this is the intention here though?

@UWN
Copy link
Author

UWN commented Jun 10, 2024

In that case, we can use the space to store other (short) strings, such as "xyz"

I never thought about that. Think of it:

|a0b0c0d0|
+--------+
|  []/0  |

So two cells now represent what would be otherwise 8. Factor of 4! Nice, but GC might become much more difficult.

Why do we need the padding at all in this case, and in fact in any case?

When you start writing a partial string without knowing the actual length. For longer strings the padding might be a small overhead compared to computing the length first.

(deleted text)

@triska
Copy link
Contributor

triska commented Jun 11, 2024

I summarize the issue as I understand it:

  1. What follows after 0 in the final cell of a string does not matter; in an extreme scenario, even another (short) string can be stored there.
  2. It is an advantage to minimize the padding. That is, for a string of known length ("known" can also be generalized to mean that it can be efficiently computed before storing the string on the heap), it is an advantage to store it "pushed to the right", so that the terminating 0 is the last byte in a cell.

The reason for (2) is subtle and was so far not explained in this issue. To see the advantage, suppose we store the string "abcdefgh" as:

     +--------+
     |abcdefgh|
     +--------+
     |0.......|
     +--------+
     |  []/0  |
     +--------+

It is better to store it instead "pushed to the right", as:

     +--------+
     |.......a|
     +--------+
     |bcdefgh0|
     +--------+
     |  []/0  |
     +--------+

This also takes 3 cells. But now we have an advantage: If the string prefix "a" is no longer needed, then it can be reclaimed by discarding the cell it occupies, because this short prefix now occupies a cell of its own and can be discarded in isolation. The same reasoning holds also for longer prefices: The entire cell can then be sooner discarded when the prefix is no longer needed.

At least that's my understanding of this issue. Is this the intended advantage?

@UWN
Copy link
Author

UWN commented Jun 12, 2024

What follows after 0 in the final cell of a string does not matter

Yes. But filling it with 0s makes sense, because this helps to detect invalid partial string pointers. That is, a partial string pointer that points to a 0 is invalid. Of course, that should never happen, but it is as useful for d-bugging as the Rust messages are.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants