I'm trying to find an explanation for a certain kind of trace abort which is among the main causes of loop blacklisting by the compiler in my use cases.
As an example, I'm using code from the apps.ipv6.fragmenter app, which is currently waiting to be upstreamed (#1383 and #1384). The actual code doesn't matter; I see the same effect all the time with various pieces of code. I have not found a reasonably small program that exhibits the issue in a reproducible manner.
Here is the relevant part of the push() method of this particular version of the fragmenter app:
```lua
for _ = 1, link.nreadable(input) do -- Code line 313
   local pkt = link.receive(input)
   local mtu = self.mtu
   if self.pmtud then
      local h = ffi.cast(ether_ipv6_header_ptr_t, pkt.data)
      local entry = self.dcache:lookup_ptr(h.ipv6.dst_ip)
      if entry then
         mtu = entry.value.mtu
      end
   end
   -- FIXME: assumes that there is always room to store the MTU at
   -- the end of the payload.
   ffi.cast("uint16_t *", pkt.data + pkt.length)[0] = mtu
   if pkt.length <= mtu + ether_header_len then
      -- No need to fragment; forward it on.
      counter.add(self.shm["out-ipv6-frag-not"])
      link.transmit(output, pkt)
   else
      -- Packet doesn't fit into the MTU; requeue it on the input link
      -- for the fragmentation loop below.
      link.transmit(input, pkt)
   end
end

for _ = 1, link.nreadable(input) do -- Code line 336
   local pkt = link.receive(input)
   local mtu = ffi.cast("uint16_t *", pkt.data + pkt.length)[0]
   local next_header =
      ffi.cast(ether_ipv6_header_ptr_t, pkt.data).ipv6.next_header
   pkt_box[0] = pkt
   self:fragment_and_transmit(next_header, pkt_box, mtu)
   packet.free(pkt_box[0])
end
```
The line numbers that are relevant in the following JIT dumps are 313 (the start of the first loop) and 336 (the start of the second loop).
In the JIT dump, there are multiple attempts to compile the loop starting at line 313, but they all fail and the loop is eventually blacklisted. All of these attempts come in one of exactly two flavors. The first one looks like this:
What's happening here is that the interpreter is in the middle of an iteration of the loop when the FORL byte code becomes hot (at least one iteration must have completed at this point, because otherwise the FORL byte code would not have been executed at all). The FORL instruction is not part of the trace, but it is still evaluated: the loop variable is incremented and checked against the upper limit (which was determined by link.nreadable(input) earlier, when the FORI byte code was executed). If the loop condition still holds, the interpreter jumps to the loop body and recording continues.
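For reference, this is roughly what the two byte codes do, desugared into plain Lua; this is a sketch only, assuming a positive step (the real byte codes also handle negative steps and type checks):

```lua
-- How the interpreter executes "for i = start, limit, step do ... end",
-- written out as plain Lua (sketch; assumes step > 0).
local function numeric_for(start, limit, step, body)
   local i = start            -- FORI: initialize and test once on loop entry
   if i > limit then return end
   repeat
      body(i)                 -- the loop body; this is what gets traced
      i = i + step            -- FORL: increment, re-test, branch back to body
   until i > limit
end
```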
However, in this case it must have turned out that the loop condition no longer holds (i.e. the iteration that was running when recording started was the final one), so the interpreter records the first instruction that follows the loop, which happens to be the load of the lower bound of the loop starting at line 336, corresponding to the byte code KSHORT 5 1. The recorder notices that this instruction is not within the body of the loop being recorded and aborts.
The second flavor of aborted traces related to that loop is this:
In this case, execution of the FORL instruction results in the recording of an actual iteration of the loop. The trace passes through the then branches of both if statements, executes link.transmit(output, pkt) and finally arrives at the FORL instruction of the first loop (located at the end of the loop body). Why does it abort there? The trace recorder recognizes that this iteration was the last one for this execution of the loop. Therefore, it cannot close the trace into an actual loop and must abort.
Here is the complete sequence of aborted traces that eventually leads to blacklisting of the FORL byte code of the loop starting at line 313. After the abort of trace 613, the FORL is replaced by an IFORL, and every trace passing through it from then on aborts.
What all of this means, I guess, is that this loop appears to have exactly one or two iterations left whenever it becomes hot, until it is blacklisted. But how can that be?
The loop starts out with the default hot counter (it is counted down, and a value of zero triggers the recorder). The first thing to note is that loop byte codes don't have individual hot counters but share a pretty small hash table of counters indexed (essentially) by the address of the byte code. That means that a particular loop can be traced earlier than expected due to collisions, but that does no harm and seems unrelated to the issue under discussion.
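For illustration, here is a minimal Lua model of that shared counter table; the slot count and the hot-loop threshold are assumptions based on stock LuaJIT's lj_dispatch.h and may differ in a given fork:

```lua
-- Minimal model of LuaJIT's shared hot-counter table (a sketch, not the
-- real implementation). Assumes 64 slots and a loop threshold of 56.
local HOTCOUNT_SIZE = 64
local HOTLOOP = 56
local hotcount = {}

-- Counters are indexed by (scaled) bytecode address, so two distinct
-- loops can collide in the same slot and "heat up" each other.
local function slot(pc)
   return math.floor(pc / 4) % HOTCOUNT_SIZE
end

-- Called conceptually on every executed FORL; reaching zero hands
-- control to the trace recorder.
local function loop_event(pc)
   local i = slot(pc)
   local c = (hotcount[i] or HOTLOOP) - 1
   hotcount[i] = c
   return c <= 0  -- true: this FORL just became hot
end
```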
Each time recording is aborted, the hot counter is reset with a penalty value added, so it takes more iterations until the next recording attempt. The penalty value is not constant but includes some pseudo-randomness.
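A toy model of that mechanism, building on the sketch above; the constants are assumptions taken from stock LuaJIT's lj_trace.c (PENALTY_MIN = 36, PENALTY_MAX = 60000, 4 bits of randomness), and hotcount_set() is a hypothetical helper standing in for restarting the countdown:

```lua
-- Toy model of the abort penalty (a sketch of penalty_pc() in lj_trace.c).
local PENALTY_MIN, PENALTY_MAX = 36, 60000
local penalty = {}   -- per-bytecode-PC penalty cache

local function hotcount_set(pc, val)  -- hypothetical helper
   hotcount[slot(pc)] = val           -- next attempt after 'val' more events
end

local function on_abort(pc)
   local val = penalty[pc]
   if val == nil then
      val = PENALTY_MIN
   else
      -- Roughly double the previous penalty, plus some pseudo-randomness.
      val = val * 2 + math.random(0, 15)
      if val > PENALTY_MAX then
         return "blacklist"           -- patch FORL to IFORL, never retry
      end
   end
   penalty[pc] = val
   hotcount_set(pc, val)
   return "retry"
end
```

Starting from 36 and roughly doubling, the penalty crosses PENALTY_MAX after about eleven consecutive aborts, at which point the byte code is blacklisted.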
Given all that, how can it be that recording always happens during the last one or two iterations? The only reasonable explanation seems to be that the loop actually never has more than two packets to process over an extended period of time. I guess this can in fact happen occasionally, but I'm seeing the same behavior over and over with all sorts of loops, which makes this explanation look pretty improbable to me.
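That hypothesis is easy to check against the countdown model above: if every call of push() drains a batch of n packets, where within a batch does the FORL become hot? A quick sketch, assuming one counter decrement per executed FORL (the real decrements are scaled, and penalties change the threshold between attempts):

```lua
-- How many iterations remain in the current batch when the loop becomes
-- hot, given batches of 'n' packets and a countdown starting at 'threshold'.
local function iterations_left_when_hot(n, threshold)
   local c = threshold
   while true do
      for i = 1, n do
         c = c - 1                      -- one hot-counter event per FORL
         if c == 0 then return n - i end
      end
   end
end
print(iterations_left_when_hot(2, 56))   --> 0
print(iterations_left_when_hot(100, 56)) --> 44
```

With batches of two packets the hot event always lands on the last iteration, while with large batches plenty of iterations remain; the observed aborts would therefore require persistently tiny batches.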
I'm looking for a deeper understanding of this phenomenon because it seems to be at the core of some of the performance issues I see when my programs are subjected to certain changes in workload.
I have to read the JIT code and think about this, but please indulge me in a hot take :-)
Is this just a bug in lj_record.c? If the JIT is waiting for the program counter to return to the start of the loop in order to complete the trace, and this is not happening because the loop is terminating, then perhaps the JIT should instead complete the trace when it reaches the FORL, i.e. "pretend" for the purposes of code generation that the branch back into the loop is taken? This seems straightforward if the taken branch would have immediately completed the loop without recording any further instructions.
Generally I am bugged whenever heuristics like "leaving loop in root trace" lead to blacklistings. Just shouldn't happen IMHO.