remove extra word from object header (better approach to alignment) #10898

Closed
JeffBezanson opened this issue Apr 19, 2015 · 17 comments
Labels: GC (Garbage collector), performance (Must go faster), priority (This should be addressed urgently)
Comments

@JeffBezanson
Member

Currently every object has an extra word (added by 0d8cec3) to make the data area 16-byte aligned. This is not ok. The extra word should be removed, and alignment should instead be done by offsetting the first object in a page, and putting extra space between objects as necessary.

I believe the alignment rules should be (sizes not including tag):

  • 64-bit platforms: objects < 16 bytes aligned 8, all others aligned 16
  • 32-bit platforms: objects < 8 bytes aligned 4, otherwise same as 64-bit
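
The rules above can be written down as a small function. This is only a sketch of one reading of the proposal: `proposed_alignment` is a hypothetical helper, not part of Julia's runtime, and "otherwise same as 64-bit" is interpreted as "8 for sizes 8 to 15, 16 for sizes of 16 and up".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not Julia runtime code): alignment of the data area
 * for an object of `size` bytes, tag excluded. `word_size` is the pointer
 * size of the target: 8 on 64-bit platforms, 4 on 32-bit platforms. */
static size_t proposed_alignment(size_t size, size_t word_size)
{
    assert(word_size == 4 || word_size == 8);
    if (word_size == 4 && size < 8)
        return 4;                 /* 32-bit only: small objects aligned 4 */
    return size < 16 ? 8 : 16;    /* shared rule: < 16 bytes -> 8, else 16 */
}
```

For example, a boxed 8-byte value on a 64-bit platform would get 8-byte alignment under this reading, while a 32-byte struct would get 16.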

cc @vtjnash @carnaval

Clarification: this issue asks that we fix the win64 issue without adding a word to every object. Implementing the above alignment scheme can be done if convenient, but is not itself required to close this.

JeffBezanson added the performance (Must go faster), priority (This should be addressed urgently), and GC (Garbage collector) labels Apr 19, 2015
JeffBezanson added this to the 0.4.0 milestone Apr 19, 2015
@pao
Member

pao commented Apr 19, 2015

If this plays into the C interop of structures, then there's an exception for x86 Linux--doubles are only aligned 4 by default on that platform & architecture.

If you need to reliably match native alignment, jl_native_alignment() accesses the appropriate LLVM API.

@vtjnash
Member

vtjnash commented Apr 19, 2015

the C-interop story here would be based on the semantics of malloc, which are as Jeff described above.

@vtjnash
Member

vtjnash commented Apr 19, 2015

i disagree that this is a priority or a v0.4 target. v0.4.x seems more reasonable to me

@JeffBezanson
Member Author

This has to be fixed immediately. You can't just add an extra word to every object and then shrug and say we don't have time to fix it.

@JeffBezanson
Member Author

It's particularly egregious that boxed Float64s and Int64s (etc.) are now 50% bigger for no reason, because they don't end up 16 aligned anyway, and they don't need to be. This is a major performance regression that affects a large amount of code, to fix a relatively narrow issue.
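
The 50% figure follows from simple arithmetic on a 64-bit platform: one tag word plus an 8-byte payload is 16 bytes, and a second header word makes it 24. A minimal sketch, assuming 8-byte words and ignoring any rounding up to pool size classes (`boxed_size` and `growth_percent` are illustrative names, not runtime functions):

```c
#include <assert.h>
#include <stddef.h>

/* Bytes occupied by a boxed value with `header_words` header words and a
 * payload of `payload_bytes`. Assumes a 64-bit platform (8-byte words). */
static size_t boxed_size(size_t header_words, size_t payload_bytes)
{
    const size_t word = 8;
    assert(header_words >= 1);
    return header_words * word + payload_bytes;
}

/* Integer percentage growth from old_size to new_size. */
static size_t growth_percent(size_t old_size, size_t new_size)
{
    assert(old_size > 0 && new_size >= old_size);
    return (new_size - old_size) * 100 / old_size;
}
```

Here `boxed_size(1, 8)` is 16, `boxed_size(2, 8)` is 24, and `growth_percent(16, 24)` is 50.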

@carnaval
Contributor

So all those ABI considerations are still pretty vague to me. Given a 64-bit arch, and the fact that type tags and data have to be contiguous, we are in fact forced to waste 8 bytes per 16-byte object to satisfy those constraints, right?
But this is only needed for vector type loads/stores and setjmp, right?

@carnaval
Contributor

(By 16-byte objects I meant 16 bytes without the tag.)

@JeffBezanson
Member Author

Yes, we would still waste 8 bytes for some objects, but that's a lot better than wasting 8 bytes for all objects.

It would be fine with me to fix this as narrowly as possible, and only ensure alignment for jmp_buf where it is needed. We can leave vector type alignment for another day.

@tkelman
Contributor

tkelman commented Apr 19, 2015

this probably should've been fixed before #10579 (comment) was merged at all

JeffBezanson changed the title from "better approach to object alignment" to "remove extra word from object header (better approach to alignment)" Apr 19, 2015
@carnaval
Contributor

So for example, on my linux/x64 box, a jmp_buf is 200 bytes, which brings a jl_task_t to 320 bytes (including tag). 320 is both a multiple of 16 and (conveniently) an available pool size, so even if we apparently don't need it on this ABI, we are actually guaranteed to have task->ctx be 16-byte aligned (since its offset inside the struct is 80 bytes).
(All of this is without the current "fat tag" hack, if I'm not mistaken.)
If we could just arrange for the same situation to arise on Windows (by tweaking pool sizes and task_t padding, maybe reordering the fields), aren't we done? At least until we want SIMD aligned loads and stores.

@vtjnash
Member

vtjnash commented Apr 19, 2015

the compiler will not make that easy, since it is already guaranteeing that jmp_buf is at a 16-byte offset from the start of the struct

@carnaval
Contributor

Oh. Now I get it. So with a careful packing pragma and manual padding this should be doable, right?
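
As a rough illustration of the manual-padding idea, the amount of padding to place before a field can be computed from the tag size and the field's offset within the struct. This sketch assumes the allocation cell (the address of the tag) is itself aligned to `align`, which is an assumption about the pool layout and not something stated in this thread; `pad_for_field` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes of padding to insert before a struct field so
 * that the field lands on an `align`-byte boundary, given that the struct is
 * preceded by a tag of `tag_size` bytes and the tag address is assumed to be
 * `align`-aligned. `align` must be a power of two. */
static size_t pad_for_field(size_t tag_size, size_t field_offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t misalign = (tag_size + field_offset) & (align - 1);
    return misalign ? align - misalign : 0;
}
```

Under that assumption, a field at struct offset 64 behind an 8-byte tag would need 8 bytes of padding to reach a 16-byte boundary, while a field at offset 56 would need none.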

@ScottPJones
Contributor

I would also agree with @JeffBezanson that this should be fixed for 0.4.0, not some 0.4.x.
Memory usage has a huge effect on performance, esp. when you are dealing with lots and lots of processes on a machine...

@vtjnash
Member

vtjnash commented Apr 25, 2015

> Implementing the above alignment scheme can be done if convenient, but is not itself required to close this.

as it turns out, this was required by one of the dsp tests. i'm just waiting for travis to greenlight the i686 code to merge this.

vtjnash added a commit that referenced this issue Apr 27, 2015
mbauman pushed a commit to mbauman/julia that referenced this issue Jun 6, 2015
@ScottPJones
Contributor

I've been having some crazy ideas about alignment and allocation in Julia... may be all wet...
Say the box takes 8 bytes, and you have 8 bytes for either a pointer or data value. Those only need 8-byte alignment. But if you have a UInt128/Int128 (e.g. like ByteVec), it is better if you can keep that part always aligned. If you have a structure that is <= 56 bytes, it would be good to try to ensure that it is always in the same cache line as its box.
You could intertwine pools to get the best alignment without wasting space. For example, for one cache line:
[box] [box] [16-bytes aligned] [box] [box] [16-bytes aligned]
[box] [16-bytes unaligned] [box] [32-bytes, 32-byte aligned]
[box][box] [48-bytes 16-byte aligned]
etc.
It doesn't seem like julia currently is very cache line aware... but I very well may be mistaken...

@vtjnash
Member

vtjnash commented Jun 19, 2015

what is a box? why does your example not seem to have the data in the box?

modern allocators have generally found that it is more efficient to segregate allocations by size. this wastes some space on odd-sized allocations, but is generally much faster at allocating and at walking the pool (since every cell has a constant size). it also saves the byte on each allocation that would otherwise store the size of the subsequent data field.
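
A minimal sketch of a size-segregated pool, to make the trade-off concrete. This is a generic free-list pool and not Julia's GC code; all names (`pool_t`, `pool_init`, `pool_alloc`, `pool_free`, `pool_destroy`) are hypothetical. Every cell has the same size, so no per-object size field is needed and walking the pool is a fixed-stride loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A free cell stores the link to the next free cell in its own memory. */
typedef struct free_cell { struct free_cell *next; } free_cell_t;

typedef struct {
    size_t cell_size;       /* constant size of every cell in this pool */
    size_t n_cells;
    unsigned char *mem;     /* backing storage: n_cells * cell_size bytes */
    free_cell_t *freelist;  /* singly linked list of free cells */
} pool_t;

/* Returns 0 on success, -1 if the backing allocation fails. */
static int pool_init(pool_t *p, size_t cell_size, size_t n_cells)
{
    assert(n_cells > 0 && cell_size >= sizeof(free_cell_t));
    assert(cell_size % sizeof(void *) == 0);
    p->mem = malloc(cell_size * n_cells);
    if (!p->mem)
        return -1;
    p->cell_size = cell_size;
    p->n_cells = n_cells;
    p->freelist = NULL;
    /* Thread the cells back to front so cell 0 is handed out first. */
    for (size_t i = n_cells; i-- > 0; ) {
        free_cell_t *c = (free_cell_t *)(p->mem + i * cell_size);
        c->next = p->freelist;
        p->freelist = c;
    }
    return 0;
}

/* Pop a cell off the free list; NULL when the pool is exhausted. */
static void *pool_alloc(pool_t *p)
{
    free_cell_t *c = p->freelist;
    if (c)
        p->freelist = c->next;
    return c;
}

/* Push a cell back onto the free list. */
static void pool_free(pool_t *p, void *cell)
{
    free_cell_t *c = cell;
    c->next = p->freelist;
    p->freelist = c;
}

static void pool_destroy(pool_t *p)
{
    free(p->mem);
    p->mem = NULL;
    p->freelist = NULL;
}
```

Allocation and deallocation are each a couple of pointer operations, which is the speed advantage described above; the cost is that an odd-sized request must be rounded up to the nearest cell size.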

if you've hit the memory allocator, you've already missed the fast path of staying entirely in registers / on the stack, and you are paying for a couple of extra function calls and data copies.

@ScottPJones
Contributor

OK, I'm still learning about julia's allocator, but it looked like for many things like strings at least, there was 8 bytes of "box"ing, and then either a value or 8-byte pointer.

With my idea for intertwined pools, depending on how they are allocated, the pool still looks like it is a constant size, it simply has an offset to the next one larger than the element size.
In the above example, you'd have aligned 16-byte values, at a 64-byte cache-line offset + 48, with the tag information at offset 40, and the next aligned one in the pool would be in the next cache line.
There is no extra "byte" needed.
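
The offsets described here can be checked with a few lines of arithmetic. This sketch only models the numbers given in the comment (a 64-byte stride, data at offset 48, tag at offset 40) and assumes the pool base is cache-line aligned; the names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE  64  /* stride between consecutive elements of this pool */
#define DATA_OFFSET 48  /* 16-byte value sits at cache-line offset + 48 */
#define TAG_OFFSET  40  /* its 8-byte tag sits immediately before it */

/* Byte offset, from a cache-line-aligned pool base, of the i-th value. */
static size_t data_offset(size_t i)
{
    return i * CACHE_LINE + DATA_OFFSET;
}

/* Byte offset, from the same base, of the i-th tag. */
static size_t tag_offset(size_t i)
{
    assert(TAG_OFFSET + 8 == DATA_OFFSET);  /* tag is contiguous with data */
    return i * CACHE_LINE + TAG_OFFSET;
}
```

Every `data_offset(i)` is a multiple of 16, and each tag shares a cache line with its value. The element address is still a constant-stride computation, so no per-object size field is involved; the remaining 40 bytes of each line would be left for the other pools that share it.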

Staying in registers/stack is great for the objects you are currently working on, yes... but if you've got lots of 128-bit or 256-bit fields or even 512-bit fields, you really want those to be optimally aligned (i.e. for SSE, AVX, and AVX-512 instructions).
