Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deque subscript _modify accessor heap allocates #164

Closed
2 tasks done
Lukasa opened this issue Aug 22, 2022 · 13 comments · Fixed by #165
Closed
2 tasks done

Deque subscript _modify accessor heap allocates #164

Lukasa opened this issue Aug 22, 2022 · 13 comments · Fixed by #165
Labels
bug Something isn't working
Milestone

Comments

@Lukasa
Copy link

Lukasa commented Aug 22, 2022

Consider the following simple function (the operation is not useful, it's just intended to be complex enough to defeat being optimized away and simple enough to make the generated assembly simple):

func twiddle(index: Int, twiddle: UInt8, deque: inout Deque<UInt8>) {
    deque[index] ^= twiddle
}

When compiled in release mode using swift-collections 1.0.2 and current main (c3fdcf7), this generates the following assembly on my arm64 Mac:

_$s18alloc_modify_repro10DequeReproV7twiddle5indexAD5dequeySi_s5UInt8V0D6Module0D0VyAHGztFZTf4nnnd_n:        // function signature specialization <Arg[3] = Dead> of static alloc_modify_repro.DequeRepro.twiddle(index: Swift.Int, twiddle: Swift.UInt8, deque: inout DequeModule.Deque<Swift.UInt8>) -> ()
sub        sp, sp, #0x40                               ; CODE XREF=_$s18alloc_modify_repro10DequeReproV4mainyyFZTf4d_n+208
stp        x20, x19, [sp, #0x20]
stp        fp, lr, [sp, #0x30]
add        fp, sp, #0x30
mov        x19, x1
mov        x1, x0                                      ; argument #2 for method _$s11DequeModule0A0VyxSiciMs5UInt8V_Tg5
mov        x0, sp                                      ; argument #1 for method _$s11DequeModule0A0VyxSiciMs5UInt8V_Tg5
mov        x20, x2
bl         _$s11DequeModule0A0VyxSiciMs5UInt8V_Tg5     ; generic specialization <Swift.UInt8> of DequeModule.Deque.subscript.modify : (Swift.Int) -> A
mov        x8, x0
ldrb       w9, [x1]
eor        w9, w9, w19
strb       w9, [x1]
mov        x0, sp
mov        w1, #0x0
blr        x8
ldp        fp, lr, [sp, #0x30]
ldp        x20, x19, [sp, #0x20]
add        sp, sp, #0x40
ret

As you can see, _modify has not been inlined but it has been specialized. As a result, the call to the continuation has been split and delayed (visible as blr x8).

_modify has compiled to:

_$s11DequeModule0A0VyxSiciMs5UInt8V_Tg5:        // generic specialization <Swift.UInt8> of DequeModule.Deque.subscript.modify : (Swift.Int) -> A
stp        x24, x23, [sp, #-0x40]!                     ; CODE XREF=_$s18alloc_modify_repro10DequeReproV7twiddle5indexAD5dequeySi_s5UInt8V0D6Module0D0VyAHGztFZTf4nnnd_n+32
stp        x22, x21, [sp, #0x10]
stp        x20, x19, [sp, #0x20]
stp        fp, lr, [sp, #0x30]
add        fp, sp, #0x30
mov        x21, x1
mov        x22, x0
mov        w0, #0x28                                   ; argument "size" for method imp___stubs__malloc
bl         imp___stubs__malloc                         ; malloc
str        x0, [x22]
stp        x21, x20, [x0]
tbnz       x21, 0x3f, loc_100012600

mov        x19, x0
ldr        x23, [x20]
mov        x0, #0x0
bl         _$sSo10HeapObjectVMa                        ; type metadata accessor for __C.HeapObject
ldr        x8, [x23, #0x18]
cmp        x8, x21
b.le       loc_100012604

add        x22, x19, #0x20
mov        x0, x23                                     ; argument "arg0" for method imp___stubs__swift_isUniquelyReferenced_nonNull_native
bl         imp___stubs__swift_isUniquelyReferenced_nonNull_native ; swift_isUniquelyReferenced_nonNull_native
tbnz       w0, 0x0, loc_1000125b0

bl         _$s11DequeModule0A0V8_StorageV15_makeUniqueCopyyyFs5UInt8V_Tg5 ; generic specialization <Swift.UInt8> of DequeModule.Deque._Storage._makeUniqueCopy() -> ()

loc_1000125b0:
ldr        x8, [x20]                                   ; CODE XREF=_$s11DequeModule0A0VyxSiciMs5UInt8V_Tg5+88
ldr        x9, [x8, #0x20]!
str        x9, [x19, #0x10]
add        x9, x9, x21
ldur       x10, [x8, #-0x10]
str        x10, [x19, #0x18]
cmp        x9, x10
csel       x10, xzr, x10, lt
sub        x9, x9, x10
add        x8, x8, x9
ldrb       w8, [x8, #0x8]
strb       w8, [x19, #0x20]
adr        x0, #0x100012608
nop
mov        x1, x22
ldp        fp, lr, [sp, #0x30]
ldp        x20, x19, [sp, #0x20]
ldp        x22, x21, [sp, #0x10]
ldp        x24, x23, [sp], #0x40
ret

loc_100012600:
brk        #0x1                                        ; CODE XREF=_$s11DequeModule0A0VyxSiciMs5UInt8V_Tg5+44

loc_100012604:
brk        #0x1                                        ; CODE XREF=_$s11DequeModule0A0VyxSiciMs5UInt8V_Tg5+72

Notice the call to malloc at the start of the function. This call drastically dominates the performance of _modify, and it makes Deque extremely difficult to use in performance sensitive code.

Information

  • Package version: 1.0.2 or main (c3fdcf7)
  • Platform version: 21G83
  • Swift version: swift-driver version: 1.45.2 Apple Swift version 5.6.1 (swiftlang-5.6.0.323.66 clang-1316.0.20.12) Target: arm64-apple-macosx12.0

Checklist

  • If possible, I've reproduced the issue using the main branch of this package.
  • I've searched for existing GitHub issues.

Steps to Reproduce

Use the above code sample in an empty Swift project that depends on swift-collections.

Expected behavior

No allocations in _modify.

Actual behavior

A direct call to malloc and a subsequent call to free.

@Lukasa Lukasa added the bug Something isn't working label Aug 22, 2022
@Lukasa
Copy link
Author

Lukasa commented Aug 22, 2022

The latest nightly build on Linux successfully inlines the _modify accessor into the calling code, and as a result the malloc is optimised out. This seems pretty fragile to me based on the current code.

@Lukasa
Copy link
Author

Lukasa commented Aug 22, 2022

(Latest nightly on Linux is swift-DEVELOPMENT-SNAPSHOT-2022-08-18-a.

@Lukasa
Copy link
Author

Lukasa commented Aug 22, 2022

The same is true in the latest nightly on macOS, the _modify accessor inlines.

@Lukasa
Copy link
Author

Lukasa commented Aug 22, 2022

The 5.7 nightlies, however, still allocate.

@xwu
Copy link
Contributor

xwu commented Aug 22, 2022

I wonder what change in the compiler made it capable of inlining this and optimizing out the malloc. Might be worth making sure that there was an adequate test case added in whatever PR that was in the compiler repo to ensure this optimization doesn’t regress.

@lorentey
Copy link
Member

We have very limited options here to work around this. Would marking the accessor @inline(__always) help here?

@lorentey
Copy link
Member

Hm, force-inlining the accessor does appear to help in Release builds. The code size implications are somewhat worrying, but I'm not sure what else can we do on this end.

I can try moving the preparations & cleanup into separate inlinable functions, perhaps that would be enough.

@lorentey
Copy link
Member

Yep, that seems to help:

  @inlinable
  public subscript(index: Int) -> Element {
    ...
    @inline(__always) // https://github.com/apple/swift-collections/issues/164
    _modify {
      precondition(index >= 0 && index < count, "Index out of bounds")
      var (slot, value) = _prepareForModify(at: index)
      defer {
        _finalizeModify(slot, value)
      }
      yield &value
    }
  }

  @inlinable
  internal mutating func _prepareForModify(at index: Int) -> (_Slot, Element) {
    _storage.ensureUnique()
    // We technically aren't supposed to escape storage pointers out of a
    // managed buffer, so we escape a `(slot, value)` pair instead, leaving
    // the corresponding slot temporarily uninitialized.
    return _storage.update { handle in
      let slot = handle.slot(forOffset: index)
      return (slot, handle.ptr(at: slot).move())
    }
  }

  @inlinable
  internal mutating func _finalizeModify(_ slot: _Slot, _ value: Element) {
    _storage.update { handle in
      handle.ptr(at: slot).initialize(to: value)
    }
  }
DequeTests`specialized DequeTests.twiddle(index:twiddle:deque:):
    0x105d07fcc <+0>:   sub    sp, sp, #0x40
    0x105d07fd0 <+4>:   stp    x22, x21, [sp, #0x10]
    0x105d07fd4 <+8>:   stp    x20, x19, [sp, #0x20]
    0x105d07fd8 <+12>:  stp    x29, x30, [sp, #0x30]
    0x105d07fdc <+16>:  add    x29, sp, #0x30
    0x105d07fe0 <+20>:  tbnz   x0, #0x3f, 0x105d0803c    ; <+112> [inlined] Swift runtime failure: precondition failure at <compiler-generated>
    0x105d07fe4 <+24>:  mov    x20, x2
    0x105d07fe8 <+28>:  mov    x19, x1
    0x105d07fec <+32>:  mov    x21, x0
    0x105d07ff0 <+36>:  ldr    x22, [x2]
    0x105d07ff4 <+40>:  mov    x0, #0x0
    0x105d07ff8 <+44>:  bl     0x105d0345c               ; type metadata accessor for __C.HeapObject
    0x105d07ffc <+48>:  ldr    x8, [x22, #0x18]
    0x105d08000 <+52>:  cmp    x8, x21
    0x105d08004 <+56>:  b.le   0x105d08040               ; <+116> [inlined] Swift runtime failure: precondition failure at <compiler-generated>
    0x105d08008 <+60>:  add    x0, sp, #0x8
    0x105d0800c <+64>:  mov    x1, x21
    0x105d08010 <+68>:  bl     0x105d0f4ec               ; generic specialization <Swift.UInt8> of DequeModule.Deque._prepareForModify(at: Swift.Int) -> (DequeModule._DequeSlot, A) at <compiler-generated>
->  0x105d08014 <+72>:  ldrb   w8, [sp, #0x8]
    0x105d08018 <+76>:  ldr    x9, [x20]
    0x105d0801c <+80>:  eor    w8, w8, w19
    0x105d08020 <+84>:  add    x9, x9, x0
    0x105d08024 <+88>:  strb   w8, [x9, #0x28]
    0x105d08028 <+92>:  ldp    x29, x30, [sp, #0x30]
    0x105d0802c <+96>:  ldp    x20, x19, [sp, #0x20]
    0x105d08030 <+100>: ldp    x22, x21, [sp, #0x10]
    0x105d08034 <+104>: add    sp, sp, #0x40
    0x105d08038 <+108>: ret    
    0x105d0803c <+112>: brk    #0x1
    0x105d08040 <+116>: brk    #0x1

@tbkka
Copy link

tbkka commented Aug 31, 2022

CC: @eeckstein

lorentey added a commit to lorentey/swift-collections that referenced this issue Sep 1, 2022
…a performance issue

Resolves apple#164.

(cherry picked from commit 98cd753)
@eeckstein
Copy link
Member

Not inlined co-routines almost always allocate. There is not much we can do here.
Except to give co-routines high priority for inlining.

@Lukasa
Copy link
Author

Lukasa commented Sep 1, 2022

I'm inclined to say we should investigate that idea: the cost of the coroutine allocations ends up being really painful.

@jckarter
Copy link
Member

jckarter commented Sep 1, 2022

Coroutine lowering is currently pretty naive and allocates a lot more than it needs to. We have a whole bunch of issues tracking various improvements to the lowering that would reduce the likelihood of allocation without relying on inlining.

@eeckstein
Copy link
Member

But still, the inline buffer size (which is 2 or 3 words) is a hard limit. Unless we increase that, it's still very likely that a co-routine will allocate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants