Add TaggedRef{T} / TaggedPtr{T} / ImmediateOrRef{T} #47735

Open
fingolfin opened this issue Nov 29, 2022 · 24 comments
Labels
performance Must go faster

Comments

@fingolfin (Contributor)

One optimization I miss in Julia compared to other GC-based language implementations is "tagged pointers".

The problem

Consider the BigInt type: it is a lot slower than Int and takes up more memory. Computing something like x * y where x = big(0); y = big(1) requires dereferencing pointers, multiple loads from this indirect data, then an allocation and more writes. In contrast, when x and y are Int, 1-2 machine instructions suffice on most architectures.

Tagged pointers / tagged reference

GC languages that work a lot with multi-precision integers often compensate for this using tagged pointers: the idea is to store small instances of the type inside the pointer itself. To distinguish between a proper pointer and a fake pointer containing data, a "tag" is added to the pointer. E.g. on modern architectures, pointers usually must be even, if not a multiple of a machine word; so divisible by 8 on 64-bit architectures. (Well, at least pointers to objects; but in general it also tends to be preferable to use such aligned pointers for performance reasons.) The idea then is to use bit 0 (the LSB) to indicate a tagged value: if it is 1 (thus the "pointer" has an odd value), we assume this is not an actual pointer, but rather that it contains data. That way one can store up to 63 bits (resp. 31 bits on 32-bit architectures) of "immediate" data.
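To make the scheme concrete, here is a small language-agnostic sketch in Python (all names invented for illustration) of encoding a 63-bit signed integer in a 64-bit word with bit 0 as the tag:

```python
WORD = 64
MASK = (1 << WORD) - 1  # simulate a 64-bit machine word

def tag_int(x):
    """Encode a small signed integer as an 'immediate' tagged word:
    shift left by one and set bit 0 (real pointers are even)."""
    assert -(1 << 62) <= x < (1 << 62), "value must fit in 63 bits"
    return ((x << 1) | 1) & MASK

def is_tagged(word):
    """Bit 0 set -> immediate value; bit 0 clear -> ordinary pointer."""
    return word & 1 == 1

def untag_int(word):
    """Recover the signed integer (arithmetic shift right by one)."""
    word >>= 1
    if word & (1 << 62):   # sign-extend the 63-bit payload
        word -= 1 << 63
    return word
```

Any properly aligned pointer has bit 0 clear, so the two cases never collide.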

There are many other applications for this besides multi precision integers:

  • BigFloat
  • Rational: can store numerator and denominator directly if they are small, else pointer to full object
  • strings: up to 7 ASCII characters can be stored in a tagged pointer, plus a length
  • other arithmetic objects can also benefit, just by using tagged values to encode a few special values; e.g. for polynomials, being able to encode small constant polynomials in a tagged pointer can improve the efficiency of semi-dense matrices over these polynomials
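For instance, the short-string case from the list above can be sketched as follows (a hypothetical Python layout, not any existing implementation): bit 0 holds the tag, bits 1-3 a length, and up to 7 ASCII bytes fill the remaining bits:

```python
def pack_str(s):
    """Pack up to 7 ASCII characters into a tagged 64-bit word.
    Hypothetical layout: bit 0 = tag, bits 1-3 = length,
    the string's bytes in bits 8 and up."""
    data = s.encode("ascii")
    assert len(data) <= 7, "only up to 7 characters fit"
    word = 1 | (len(data) << 1)
    for i, b in enumerate(data):
        word |= b << (8 * (i + 1))
    return word

def unpack_str(word):
    """Recover the string from a tagged word."""
    assert word & 1, "not an immediate string"
    n = (word >> 1) & 0b111
    return bytes((word >> (8 * (i + 1))) & 0xFF for i in range(n)).decode("ascii")
```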

A possible Julia user interface to tagged pointers

My suggestion would be to add a new parametric type TaggedRef{T} (requiring some kernel support, see below). Functionally I naively imagine it would behave similarly to a Union{Ref{T}, UInt} e.g. for purposes of type inference.

Methods that should exist (likely with different/better names -- I mostly care about the functionality that should be there, so that I can sketch example usage), where tr isa TaggedRef{T}:

  • istagged(tr) returns true if the ref is tagged, else false
  • tr[] acts like [] for a normal Ref{T} if it is not tagged; if it is tagged, throws an exception (as with a ref that is not assigned)
  • getrawvalue(tr) returns the raw data underlying tr, including the tag bit
    • probably throws an error if tr is an actual pointer? though I am not sure that's needed (most likely it is faster if it doesn't do that)
    • I deliberately do not want to mask away or shift away or whatever the tag bit, this should be left to the code using this immediate data
    • so ideally this can be a single machine instruction
  • setrawvalue(tr, val::UInt) -- set the raw value. Possibly should enforce that the tag bit is set, by using val | 1
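To make the proposed semantics concrete, here is a toy Python model of the API sketched above (purely illustrative; the real TaggedRef{T} would need compiler and GC support):

```python
class TaggedRef:
    """Toy model of the proposed TaggedRef{T}: holds either a real
    reference or a raw tagged machine word (bit 0 set)."""
    def __init__(self, obj=None):
        self._obj = obj   # the referenced object, if any
        self._raw = None  # the raw tagged word, if any

    def istagged(self):
        return self._raw is not None

    def getindex(self):          # models tr[]
        if self.istagged():
            raise ValueError("ref is tagged; no object to dereference")
        if self._obj is None:
            raise ValueError("ref is unassigned")
        return self._obj

    def getrawvalue(self):       # raw data, including the tag bit
        return self._raw

    def setrawvalue(self, val):  # enforce the tag bit via val | 1
        self._raw = val | 1
        self._obj = None
```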

An example

Using a TaggedRef one can now implement fast(er) bigints. Say like this:

struct FastBigInt
  val::TaggedRef{BigInt}
  function FastBigInt(x::Int = 0)
    # test if x fits into 63 bits; this can be tested
    # efficiently by checking that the top two bits of x are equal
    if (x >> 62) == (x >> 63)
      r = TaggedRef{BigInt}()
      setrawvalue(r, reinterpret(UInt, x) << 1 | 1)
    else
      r = TaggedRef{BigInt}(big(x))
    end
    return new(r)
  end
  function FastBigInt(x::BigInt)
    # TODO: actually if x is small then perhaps convert it into
    # a tagged value?
    return new(TaggedRef{BigInt}(x))
  end
end

# conversion functions... e.g.
function Base.Int(x::FastBigInt)
  if istagged(x.val)
    rawx = getrawvalue(x.val)
    return reinterpret(Int, xor(rawx, 1)) >> 1
  end
  return Int(x.val[]) # delegate to BigInt conversion code
end

function Base.:+(x::FastBigInt, y::FastBigInt)
   if istagged(x.val) && istagged(y.val)
      rawx = getrawvalue(x.val) # get tagged data, with tag bit
      rawy = getrawvalue(y.val) # ditto
      # ... insert code which adds rawx, rawy, signed, and checks
      # for overflow etc.
      ...
   elseif istagged(x.val)
      return FastBigInt(Int(x) + y.val[])
   elseif istagged(y.val)
      return FastBigInt(x.val[] + Int(y))
   else
      return FastBigInt(x.val[] + y.val[])
   end
end
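A hedged Python model of the fast-path addition sketched above (all names invented here): it mimics the `(x << 1) | 1` encoding with a 63-bit immediate range and falls back to a boxed representation on overflow:

```python
SMALL_MIN, SMALL_MAX = -(1 << 62), (1 << 62) - 1

def fits_small(x):
    # x fits in 63 bits iff its top two bits agree (arithmetic shifts)
    return (x >> 62) == (x >> 63)

def encode(x):
    """Return a tagged word for small x, else a boxed 'big' value."""
    return ((x << 1) | 1) if fits_small(x) else ("big", x)

def add(a, b):
    """Fast path: both operands immediate and the sum stays small."""
    if isinstance(a, int) and isinstance(b, int):
        s = (a >> 1) + (b >> 1)          # strip tags, add payloads
        if SMALL_MIN <= s <= SMALL_MAX:  # no overflow: retag
            return (s << 1) | 1
        return ("big", s)                # overflow: fall back to boxed
    # slow path: at least one boxed operand
    av = a[1] if isinstance(a, tuple) else a >> 1
    bv = b[1] if isinstance(b, tuple) else b >> 1
    return encode(av + bv)
```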

Thoughts on implementation

Some kernel support for this would be needed (else I wouldn't post this here but just implement it).

  1. The GC would need to know about it, and follow the "pointer" stored in the tagged ref only when it is not tagged. On a purely technical level it shouldn't be too hard to do this: just add another special case for jl_tagged_ref_type in the GC loop and in there, use the low bit to decide whether or not to queue the referenced object. Of course adding yet another top-level special case to that check isn't great
  2. The (de)serialization code would need to know about it
  3. not sure what / if anything else?
@Seelengrab (Contributor)

Rational

Rational is, depending on the type parameter, already stored inline - surely that is faster than forcing an indirection via a pointer? What's the usecase?

BigInt & BigFloat

I have not checked since it's GPL source code, but I'd be very surprised if GMP (which we use for BigInt) doesn't already use tagged pointers or similar optimizations internally (though speeding that case up seems dubious to me either way - if most numbers are smaller than a minimum for BigInt to be efficient, wouldn't UInt128 or similar be more appropriate for that particular application?). Since MPFR uses GMP internally, making that work with tagged pointers seems hard, especially since we then can't easily take advantage of hardware instructions for quickly multiplying floating points, due to having to unpack them from the pointer first. There's no natural point where we can just use a bit for a different purpose in IEEE after all.

polynomials

What's the advantage over a bitfield or a primitive type to encode these?


How would transparent decay from a FastBigInt to a regular BigInt work, without type instability? Or are you proposing to replace BigInt with FastBigInt..?

@fingolfin (Contributor, Author)

Rational

Rational is, depending on the type parameter, already stored inline - surely that is faster than forcing an indirection via a pointer? What's the usecase?

E.g. Rational{BigInt} encodes big(1)//big(2) as essentially two pointers (inlined, of course, as you point out -- but pointers anyway); this can be encoded as a tagged value.

However, I am not suggesting that Rational{T} be switched to use TaggedRef, but rather that with TaggedRef, anyone who needs to deal with these things all the time (e.g. people like me working on abstract algebra) could write their own rational type using TaggedRef.

There is a reason I am proposing TaggedRef in the title of this issue, not "a list of types using TaggedRef" -- this thing is meant to be a tool, and I just listed a bunch of real-world example applications ("real world" as in: there exist software systems out there which use these techniques, have been using them often for decades, and for good reasons).

BigInt & BigFloat

I have not checked since it's GPL source code, but I'd be very surprised if GMP (which we use for BigInt) doesn't already use tagged pointers or similar optimizations internally

You are mistaken; no, it does not. This is one of multiple reasons why we don't use Julia's BigInt, but rather our own type in Nemo.jl and Oscar.jl and other packages dealing with abstract algebra.

Besides, even if GMP did something of this kind, it would not help as much as a "native" tagged pointer type, as long as Julia invokes generic functions like mpz_add, which can't be inlined. This is a major bottleneck (I spent a decade dealing with these things on a low level).

Note: this technique is not something I made up. E.g. LISP implementations have used this for decades. Several computer algebra systems I work on or am familiar with use this technique (FLINT, GAP, Singular, ...). I know that Ruby and Java implementations use variations of this. Certainly many more.

(One thing, though: I just checked Wikipedia, and it uses "tagged pointer" slightly differently than what I do here. Perhaps a different name would be better, e.g. ImmediateValueRef{T}, to indicate that this is a ref/pointer which can encode certain "immediate values".)

(though speeding that case up seems dubious to me either way - if most numbers are smaller than a minimum for BigInt to be efficient, wouldn't UInt128 or similar be more appropriate for that particular application?).

No. There are tons of applications where "most of the time" you have fairly small numbers, but "sometimes" they can get big, and you can't easily anticipate when which happens. Naive approaches like "first do all computations with Int and, if you detect an overflow, switch to BigInt / redo everything with BigInt" are usually neither practical nor efficient.

Besides, even in code that "knows" big values will appear, having optimizations for small values can have tremendous payoff.

Since MPFR uses GMP internally, making that work with tagged pointers seems hard, especially since we then can't easily take advantage of hardware instructions for quickly multiplying floating points, due to having to unpack them from the pointer first. There's no natural point where we can just use a bit for a different purpose in IEEE after all.

I am sure tagged pointers could be used for MPFR; but I am not interested in those myself, so I won't spend time debating this :-).

polynomials

What's the advantage over a bitfield or a primitive type to encode these?

How would you use a bitfield or primitive type to encode a polynomial over a number field or over an arbitrary commutative ring?

How would transparent decay from a FastBigInt to a regular BigInt work, without type instability? Or are you proposing to replace BigInt with FastBigInt..?

Realistically, I am pretty sure FastBigInt would be preferable for 99% of code currently using BigInt. But that's something only time will tell; at this point I am not proposing to replace anything. I merely gave an example to showcase how TaggedRef can be used in a practical setting, to build useful new things. And perhaps eventually this could be useful to improve BigInt, or to implement a better alternative (it may not be possible to ever "replace" BigInt due to compatibility concerns).

@Seelengrab (Contributor)

However, I am not suggesting that Rational{T} be switched to use TaggedRef, but rather that with TaggedRef, anyone who needs to deal with these things all the time (e.g. people like me working on abstract algebra) could write their own rational type using TaggedRef.

Gotcha, I misunderstood your proposal then. I was under the impression this should be used for all things BigInt internally (and perhaps more..?), since that's what you mentioned (thus my mention of MPFR, which julia uses for BigFloat). That's also where my confusion about Rational came from, since using such ImmediateValueRef for Rational in general would lead to regressions for e.g. Rational{Int}, which I was actually thinking of with "not being able to be inlined anymore". I don't think Rational{BigInt} can be special cased without modification of BigInt, hence a new type would be needed.

You are mistaken; no, it does not. This is one of multiple reasons why we don't use Julia's BigInt, but rather our own type in Nemo.jl and Oscar.jl and other packages dealing with abstract algebra.

Ah, I see :/ That's surprising.

Note: this technique is not something I made up. E.g. LISP implementations have used this for decades. Several computer algebra systems I work on or am familiar with use this technique (FLINT, GAP, Singular, ...). I know that Ruby and Java implementations use variations of this. Certainly many more.

I'm not suggesting you did, I'm certainly aware of existing work featuring these sorts of things :) I'm mostly aware of the linux kernel using a similar technique with the upper bits of pointers for page management. I just understood your proposal as a deeper modification than it truly is, that's all. I'm no stranger to the challenges a BigInt library has to face when there's any kind of speed to be had (that's not to say I actually know what I'm talking about though, to be clear 😅 ). I think the biggest motivation for using that technique is to avoid a cache miss on the deref when only few (or just a single) elements would be read, is that right?

How would you use a bitfield or primitive type to encode a polynomial over a number field or over an arbitrary commutative ring?

My mistake, I only thought of polynomials with binary coefficients. The idea was to encode the existence of any given exponent in the polynomial as a bit in the bitfield (or a primitive type of the appropriate size, if we're in modular arithmetic), but that is of course exactly what large polynomials would use a BigInt for. I'd then store the coefficients as a sparse vector, indexed by the bit positions of the set bits in that bitfield (perhaps there's a clever variation on this using a dense-array backed octree?).

Pure conjecture though, feel free to disregard - you've probably already explored that :)
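For what it's worth, the GF(2)-coefficient case that the bitfield idea covers can be illustrated in a few lines of Python (invented names, purely illustrative): each bit of an integer is one coefficient, addition of such polynomials is XOR, and multiplication is carry-less shift-and-XOR:

```python
def poly_add(p, q):
    """Over GF(2), adding coefficients mod 2 is just XOR."""
    return p ^ q

def poly_mul(p, q):
    """Carry-less multiplication: shift-and-XOR."""
    r = 0
    while q:
        if q & 1:
            r ^= p
        p <<= 1
        q >>= 1
    return r

# e.g. x^2 + 1 is 0b101, x + 1 is 0b011
```

As noted, this works only for binary coefficients; over a number field or a general commutative ring, coefficients cannot be packed into single bits this way.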


Out of curiosity, is having a julia wrapper type implementing this without special GC support not an option? As far as I'm aware, you're pretty much allowed to do whatever you want with Ptr typed fields and GC does not track any memory associated with that if you allocate it yourself (my gut says your response is about wanting to use julia-allocated memory for this, which would indeed be a problem).

Regardless, I think this would be more appropriate as part of an effort for custom allocators - from what I've read in discussions in other PRs, your usecase mostly deals with using such tagged pointers created outside of julia, correct? I personally think it would make sense to have a custom allocator responsible for memory management of these objects, but I'm admittedly out of my depth in terms of feasibility of having that in julia (and/or it may just be me wanting that). My gut also says this shouldn't really need special GC support and it should work with a primitive type of the native pointer size, for which you define the appropriate unsafe_load etc. you mentioned above.

@vtjnash (Sponsor Member)

vtjnash commented Nov 29, 2022

I like it. I think you can assume at least 3 tag bits also

For 32 bit, I assume we would also make this object be 64-bit sized, to be most useful?

@fingolfin (Contributor, Author)

However, I am not suggesting that Rational{T} be switched to use TaggedRef, but rather that with TaggedRef, anyone who needs to deal with these things all the time (e.g. people like me working on abstract algebra) could write their own rational type using TaggedRef.

Gotcha, I misunderstood your proposal then.

Most likely that means it was not phrased as well as it should have been :-/. So it's good you asked, and hopefully things are a bit clearer now :-). Please keep asking.

I'm no stranger to the challenges a BigInt library has to face when there's any kind of speed to be had

Ah, very nice, and interesting. Well, I think it'll be difficult to match performance even for just, say, addition, in pure Julia -- one really wants access to machine instructions like add-with-carry. (Perhaps there is a way to get those using some advanced LLVM magic? That'd be quite cool.) But it definitely would be quite interesting to see how close one can get!

I think the biggest motivation for using that technique is to avoid a cache miss on the deref when only few (or just a single) elements would be read, is that right?

That's a major part, yes, but it's more than that (talking specifically about large-integer handling here): on the fast path (where the value is immediate), a few fast CPU instructions can often handle everything. E.g. addition of two "immediate integers" can be done like this, if we use two bits for tagging and assume bit 0 is 1 and bit 1 is 0 (using pseudo-assembler):

   and r1, x, y
   bittest r1, 0  # bit 0 is set in both -> both were immediate
   jump-if-false   slowpath
   add r2, x, y
   jump-if-carry  slowpath   # could also use the result, but why bother
   xor r2, 3  # fix up the tag bits
   return r2
slowpath:
   ...
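The pseudo-assembler above can be double-checked with a small Python model (names invented): with the payload shifted left by two and tag bits `01`, the sum of two tagged words carries tag bits `10`, and the final `xor 3` restores `01`:

```python
def tag2(v):
    """Two-tag-bit encoding: payload << 2, low bits = 01."""
    return (v << 2) | 0b01

def fast_add(x, y):
    """Mirror of the pseudo-assembler (overflow check omitted)."""
    if (x & y) & 1 == 0:   # 'and' + 'bittest': were both immediate?
        raise NotImplementedError("slowpath")
    r = x + y              # the tags add up to 0b10 in the low bits
    return r ^ 3           # fix up the tag bits back to 0b01
```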

How would you use a bitfield or primitive type to encode a polynomial over a number field or over an arbitrary commutative ring?

My mistake, I only thought of polynomials with binary coefficients. The idea was to encode the existence of any given exponent in the polynomial as a bit in the bitfield (or a primitive type of the appropriate size, if we're in modular arithmetic), but that is of course exactly what large polynomials would use a BigInt for. I'd then store the coefficients as a sparse vector, indexed by the bit positions of the set bits in that bitfield (perhaps there's a clever variation on this using a dense-array backed octree?).

Typically multivariate polynomials are indeed stored sparsely; for those, the "tagged pointer" / "immediate value" optimization is not that useful.

The idea of using bitmasks for exponent vectors is indeed being used in the best code bases for multivariate polynomials, in various ways, e.g. also to speed up divisibility tests.

But univariates often are stored densely. And also for matrices there are situations where sparse storage is not useful (the matrix is "too dense"), yet many entries in the matrix are 0 or 1 (but of course as polynomials), and here it can be beneficial if these values are stored "directly".

Pure conjecture though, feel free to disregard - you've probably already explored that :)

Out of curiosity, is having a julia wrapper type implementing this without special GC support not an option? As far as I'm aware, you're pretty much allowed to do whatever you want with Ptr typed fields and GC does not track any memory associated with that if you allocate it yourself (my gut says your response is about wanting to use julia-allocated memory for this, which would indeed be a problem).

Exactly: I want to benefit from the GC, and be able to do these things with pure Julia code, too.

Indeed, right now we import tons of data types from C libraries, use Ptr for these libraries, and even use tagging techniques there, too. But this is limited in what it can do; and it all comes with a price as well: we need to use finalizers extensively to manage the memory, and those are not free either.

Since Julia's BigInt uses Ptr, too, in principle it would be possible to teach it to use "tagged" values, too, but it'd still be a major change as it couldn't call functions like mpz_add directly anymore. Plus the Julia kernel has dedicated code for (de)serializing BigInt values; if those Ptr suddenly were tagged, this serializing code would need to be adapted.

Regardless, I think this would be more appropriate as part of an effort for custom allocators - from what I've read in discussions in other PRs, your usecase mostly deals with using such tagged pointers created outside of julia, correct?

No, not really. Well, perhaps that, too; but for me it's mostly about thinking forward and coming up with ways to replace some of that native / external code with pure Julia code. But I keep coming back to the lack of tagged pointers / immediate values. Without those, all kinds of optimizations are off the table.

And I am not talking about something small in some cases. Take this admittedly somewhat silly and artificial example (I have real examples but they wouldn't make sense w/o explaining a lot of theory first):

function foo(s::BigInt, x::BigInt)
  for i in 1:x
    s += i
  end
  return s
end

function foo2(s::BigInt, x::BigInt)
  for i in 1:x
    GMP.MPZ.add!(s,s,i)
  end
  return s
end

x = big(2)^20

@time foo(big(0), x)
@time foo(big(2)^60, x)
@time foo(big(2)^160, x)

@time foo2(big(0), x)
@time foo2(big(2)^60, x)
@time foo2(big(2)^160, x)

Output:

julia> @time foo(big(0), x)
  0.340893 seconds (10.49 M allocations: 224.000 MiB, 29.50% gc time)
549756338176

julia> @time foo(big(2)^60, x)
  0.338118 seconds (10.49 M allocations: 224.000 MiB, 29.11% gc time)
1152922054363185152

julia> @time foo(big(2)^160, x)
  0.324207 seconds (10.49 M allocations: 240.000 MiB, 27.39% gc time)
1461501637330902918203684832716283020205688881152

julia> @time foo2(big(0), x)
  0.276934 seconds (8.39 M allocations: 176.000 MiB, 28.10% gc time)
549756338176

julia> @time foo2(big(2)^60, x)
  0.281817 seconds (8.39 M allocations: 176.000 MiB, 28.36% gc time)
1152922054363185152

julia> @time foo2(big(2)^160, x)
  0.279814 seconds (8.39 M allocations: 176.000 MiB, 28.68% gc time)
1461501637330902918203684832716283020205688881152

Now compare this to a similar program in GAP, an interpreted language which uses an "immediate integer optimization":

foo:=function(s, x)
  local i;
  for i in [1..x] do
    s := s + i;
  od;
  return s;
end;

x := 2^20;
foo(0, 2^20); time;
foo(2^60, 2^20); time;
foo(2^160, 2^20); time;

results in the following (the given timings are in milliseconds):

gap> foo(0, 2^20); time;
549756338176
5
gap> foo(2^60, 2^20); time;
1152922054363185152
26
gap> foo(2^160, 2^20); time;
1461501637330902918203684832716283020205688881152
26

So this blows Julia out of the water, being 100 times faster, even when compared to Julia code which supposedly uses in-place addition. Admittedly, it is no longer quite that much faster when it can't resort to immediate integers for all values (the last example starts with an accumulator value of 2^160, which of course does not fit into the tagged pointer), but even then it is still roughly 10 times faster.

Also, I was a bit unfair: you may very well argue "but you added small integers, so you should have used add_ui! in Julia." Indeed, if I do that, I get similar performance:

julia> function foo3(s::BigInt, x::Int)
         for i in 1:x
           GMP.MPZ.add_ui!(s,s,i)
         end
         return s
       end
foo3 (generic function with 1 method)

julia> @time foo3(big(0), 2^20)
  0.002998 seconds (3 allocations: 48 bytes)
549756338176

julia> @time foo3(big(0), 2^20)
  0.003058 seconds (3 allocations: 48 bytes)
549756338176

julia> @time foo3(big(2)^160, 2^20)
  0.003070 seconds (5 allocations: 128 bytes)
1461501637330902918203684832716283020205688881152

But this works only because I knew beforehand that the values were small. In general code I don't have that luxury and must assume that in the worst case, all values are "big".

I personally think it would make sense to have a custom allocator responsible for memory management of these objects, but I'm admittedly out of my depth in terms of feasibility of having that in julia (and/or it may just be me wanting that). My gut also says this shouldn't really need special GC support and it should work with a primitive type of the native pointer size, for which you define the appropriate unsafe_load etc. you mentioned above.

Well, any existing implementation I know of has dedicated support in the GC, because the GC needs to know what is a pointer and what is not.

I've thought long and hard about this in the past two years, and I really don't see how to achieve this without support by the Julia kernel -- though it's very well possible I missed something, and I'd be happy to learn about a concrete way to do it differently. We already got "foreign type" support added to the Julia kernel, of course, but those don't seem sufficient to me -- inside a foreign object I can do whatever I want, including storing tagged pointers. But to get to the foreign object, Julia already had to dereference a pointer. (Another problem is the lack of support for serializing foreign objects, see issue #46214).

@Seelengrab (Contributor)

Seelengrab commented Nov 29, 2022

Most likely that means it was not phrased as well as it should have been :-/. So it's good you asked, and hopefully things are a bit clearer now :-). Please keep asking.

You're welcome! Always happy to look like a fool when it makes things clearer for the next person :)

Ah, very nice, and interesting. Well, I think it'll be difficult to match performance even for just, say, addition, in pure Julia -- one really wants access to machine instructions like add-with-carry. (Perhaps there is a way to get those using some advanced LLVM magic? That'd be quite cool.) But it definitely would be quite interesting to try how could one can get!

Well, my naive version gets within 2x performance of GMP - I sadly don't have GAP installed, so I can't test that on my machine :/ I just pushed my local version to github, so a comparison of this specific benchmark on your machine would be welcome :)

julia> function foo(a, b)
        for x in 1:b
            BigNums.add!(a, x)
        end
        a
    end


julia> @time foo(ArbUInt(0), UInt(2^20))
  0.007725 seconds (2 allocations: 80 bytes)
0x8000080000

julia> @time foo(ArbUInt(2)^160, UInt(2^20))
  0.008077 seconds (32 allocations: 1.797 KiB)
0x10000000000000000000000000000008000080000

julia> @time foo2(big(2)^160, UInt(2^20)) # the same but with the GMP intrinsic hardcoded
  0.004596 seconds (5 allocations: 128 bytes)
1461501637330902918203684832716283020205688881152

julia> big"0x10000000000000000000000000000008000080000"
1461501637330902918203684832716283020205688881152

so it seems like you've found the performance problem in my multiplication - thank you! Though to be fair, I have not optimized the library too extensively, it's just a "let me try to play with this" kind of deal :) Seems like I really do need to take a look at this hot loop with MCAnalyzer.jl, to figure out how to schedule the additions better and use a better add_with_carry than naively promoting to a larger type! I don't have ranges implemented yet, so no ArbUInt + ArbUInt test, I'm afraid.. I could imagine having a "fat array" kind of deal, with the first word being stored explicitly outside of the array.. Should have the same advantage as an immediate value pointer, but without the unpacking stuff 🤔 I'll have to tinker a bit once I find the time.

Regardless, now that my misunderstanding is cleared up, this seems like a good idea 👍

@brenhinkeller brenhinkeller added the performance Must go faster label Nov 29, 2022
@StefanKarpinski (Sponsor Member)

I agree with @vtjnash that this is a nice API proposal and fills a big GAP for us. Making it UInt64 everywhere seems appealing since 64-bit integers aren't that terrible on 32-bit systems and otherwise it's hard to spare the bits there. I'm thinking about the other use for tagged pointers, which would be a string type that is effectively Union{String15, String} and wondering if we could use this to implement that as well.

@vtjnash (Sponsor Member)

vtjnash commented Nov 30, 2022

String7, but yes

@StefanKarpinski (Sponsor Member)

Right, our pointers aren't that fat (yet).

@c42f (Member)

c42f commented Mar 13, 2023

I want this! Having short strings encoded as immediate values would be huge.

Perhaps a different name would be better, e.g. ImmediateValueRef{T}, to indicate that this is a ref/pointer which can encode certain "immediate values".

I like this name better. Another possibility: ImmediateOrRef{T}? I don't like TaggedRef because it refers to the implementation rather than the semantics of this type.

Or it may be generalized to ImmediateOrRef{J,T}, where J is a bits type and T a pointer. (Can't do bits types with 63 bits though...)

But seeing this, it's so close conceptually to Union{J,T}, that it's tempting to want this optimization for all possible fields of type Union{J,T}, hoping that the following could be a new, faster representation of String:

struct String
    data::Union{NTuple{7,UInt8}, _StringBuf}
end

Viewing it as a storage optimization for Union does have problems, specifically that Union{UInt8, Integer} is simplified to Integer so I guess it might be infeasible. But the similarity is tantalizing.


Here's another speculative use case I was thinking about yesterday, before I saw this issue:

I'm experimenting with adding a single new field to Expr for tracking the source location; I only want this to be a single pointer in size, but I want to be able to store one of either

  • An index into a source map
  • A pointer to some form of lossless syntax tree like JuliaSyntax.GreenNode

This would allow us to store more detailed source location information and have it loaded lazily without overly bloating Expr

@vchuravy (Member)

Viewing it as a storage optimization for Union does have problems, specifically that Union{UInt8, Integer} is simplified to Integer so I guess it might be infeasible. But the similarity is tantalizing.

Yeah, I am wondering if we want it instead as field semantics, similar to @atomic.

@c42f (Member)

c42f commented Mar 13, 2023

I am wondering if we want it instead as field semantics, similar to @atomic.

I like this idea. Then we get to express it in the code as a Union, which it basically is. But we sidestep any issues of changing the ABI for all structs containing unions.

On the other hand... what if fields with a Union of a small-enough bits type and a pointer could have this optimization turned on by default? This would change the ABI for all such structs and have big consequences in the C API I guess. Seems kind of breaking, but it may unlock free performance for a wide range of user code?

@Tokazama (Contributor)

Could we tie this approach into an efficient Option type and/or an enum type backed with something like a tagged union?

@Seelengrab (Contributor)

Seelengrab commented Mar 17, 2023

Not sure whether an efficient Option (I guess you're thinking of the sort of optimizations Rust is doing with Option<NonZeroU64>)/specializing the compiler on that is really necessary - I've recently experimented with porting these examples to naive julia and except for the last one, a basic

struct None end

struct Some{T}
    x::T
end

const Option{T} = Union{Some{T}, None}

performed the same as raw UInt. The only case it didn't like was with primitive type NonZeroUInt64 64 end returning None when the given number was 0, but I think that's due to our compiler not wanting to merge two paths that produce the same singleton None for some reason. I can link a MWE in a gist if you like.

@Tokazama (Contributor)

There's already an issue about how that version of Option is slower than it needs to be (#45079). I can share more benchmarks to show how Union{Some{T}, Nothing} is unnecessarily slow, but I think unions are a well-known issue, and having a simple method like getfield_oftype(obj, field, ExpectedType, default) backed by tagged bits would be one way to avoid allocating loads just to check a type, and could potentially help reduce the number of union issues.

@Seelengrab (Contributor)

Seelengrab commented Mar 17, 2023

That's a bit of a different case than I'm thinking of, since introducing the Union-typed field removes any possibility for dropping the Union entirely, if it's statically inferrable not to be needed - let me type the naive port back up again and I'll link it, to explain what I mean.

@fingolfin fingolfin changed the title Add TaggedRef{T} / TaggedPtr{T} Add TaggedRef{T} / TaggedPtr{T} / ImmediateOrRef{T} Mar 22, 2023
@StefanKarpinski
Copy link
Sponsor Member

On the other hand... what if fields with a Union of a small-enough bits type and a pointer could have this optimization turned on by default? This would change the ABI for all such structs and have big consequences in the C API I guess. Seems kind of breaking, but it may unlock free performance for a wide range of user code?

Fields with Union type are already pretty unintelligible from C code since they don't use a C-compatible object layout. Yes, such a field will currently be represented as a pointer to a tagged Julia object, which C code could conceivably work on, but it seems like people should be using a C API for that kind of field access anyway, and that API could be made to work with the changed layout. In other words, I kind of doubt this would affect much code, and for the code it does affect, I really don't think this is part of anything we've promised not to change.

@c42f
Copy link
Member

c42f commented Apr 10, 2023

I feel like there's some low level overlap between this and the discussion at #48883 - both cases seek to extend the Julia object model so that the GC can distinguish bits from pointers based on data which doesn't overlap with the bits part.

One could think of ImmediateOrRef{T} as a sum type of T and UInt63 (if that existed), along with a layout optimization which packs the sum type's tag into the unused bits of the pointer.
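The layout optimization described here can be illustrated with plain integer arithmetic (a sketch only; `tag_immediate`, `is_immediate` and `untag` are hypothetical names, and a real implementation would live in the runtime and operate on GC-managed references rather than raw words):

```julia
# Pack either a small signed immediate or an aligned pointer into one
# 64-bit word. Object pointers are at least 8-byte aligned, so bit 0 is
# always zero for them; an odd word therefore cannot be a valid pointer.

tag_immediate(x::Int64) = (reinterpret(UInt64, x) << 1) | 0x01  # 2x + 1, tag in bit 0
is_immediate(w::UInt64) = (w & 0x01) == 0x01
untag(w::UInt64) = reinterpret(Int64, w) >> 1  # arithmetic shift restores the sign

w = tag_immediate(Int64(-5))
is_immediate(w)   # true: the word is odd
untag(w)          # -5
```

This is exactly the sum-type-with-packed-tag view: the tag bit distinguishes the UInt63-ish immediate payload from a reference, at the cost of one bit of payload.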

@Seelengrab
Copy link
Contributor

Another thought: bitpacking things into pointers is definitely not portable across architectures (AVR for one doesn't require pointers to be even, as far as I know - you can happily load single bytes by incrementing a pointer by 1), so implementing this as some form of a primitive type with explicit semantics instead of just wrapping a specially-treated Ptr{T} or somesuch seems better to me. This is not necessarily blocking for this proposal overall though.

@Tokazama
Copy link
Contributor

Another thought: bitpacking things into pointers is definitely not portable across architectures (AVR for one doesn't require pointers to be even, as far as I know - you can happily load single bytes by incrementing a pointer by 1)

This is bumping up against the limits of my knowledge, so forgive me if this is a dumb question. I thought we use something similar for marking pointers in the garbage collector, so how does Julia work at all on such platforms?

@Seelengrab
Copy link
Contributor

Seelengrab commented Apr 10, 2023

I thought we use something similar for marking pointers in the garbage collector, so how does Julia work at all on such platforms?

It doesn't, they are not supported :) Maybe they are some day in the far future. Even if they were, they wouldn't use our existing GC for memory management anyway. Just thought it would be good to mention that here, so someone looking through old issues at a later date has a breadcrumb of why a specific feature doesn't work in some settings (IF that ever comes to be).

@vtjnash
Copy link
Sponsor Member

vtjnash commented Apr 10, 2023

bitpacking into Julia references (the constraint on them being part of the Julia object model and not naked pointers is important here) is fine on all architectures. It is simply a constraint imposed by the Julia object model that must be respected by all implementations of the runtime.
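As a quick sanity check of that constraint on a typical 64-bit build (only an illustration; it peeks at raw addresses via `pointer_from_objref`, which is otherwise not something to rely on):

```julia
# Heap-allocated mutable Julia objects are aligned to at least 8 bytes,
# so the low bits of a reference are guaranteed to be zero and thus
# available for use as tag bits.
r = Ref{Int}(1)                      # a small heap-allocated object
addr = UInt(pointer_from_objref(r))
addr % 8 == 0                        # true: bits 0-2 are free
```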

@Seelengrab
Copy link
Contributor

Do Julia references have a defined size/model to work with? The issue I can imagine is that if we assume the lower e.g. 4 bits are free for use by the GC, we'd throw away about a quarter of the addressable space on AVR, since pointers there are only 16 bits wide.

@fingolfin
Copy link
Contributor Author

Julia won't support a 16-bit arch like AVR, so I don't think discussing it any further is helpful.

Labels
performance Must go faster