
Chunk ingestion performance regression because of compaction logic #8101

Open
rgolovanov opened this issue Nov 12, 2024 · 11 comments · Fixed by #8103
Labels
🪳 bug (Something isn't working) · 🚀 performance (Optimization, memory use, etc) · project-many-entities-perf · 🦟 regression (A thing that used to work in an earlier release)

Comments

@rgolovanov
Contributor

rgolovanov commented Nov 12, 2024

Describe the bug
Latency in the Viewer grows rapidly when dozens of scalar values are logged. Performance has degraded between 0.17.0 and 0.19.0 (or possibly as early as 0.18.1).

To Reproduce
Steps to reproduce the behavior:

  1. pip install rerun-sdk==0.17.0
  2. python rerun_latency_test.py (see code below)
  3. Check latency in performance metrics in Viewer
  4. Repeat steps (1)-(3) for version 0.19.0 and observe how quickly the latency climbs past a few seconds.

Expected behavior
The latency stays within 10-15ms range as it was in earlier versions.

Screenshots
[screenshot]

Test script rerun_latency_test.py

The numbers will probably differ depending on the machine, but in my case the problem is visible when logging 10×5×3 = 150 scalar values.

import rerun as rr
import time
import random
import threading
import itertools

def log_scalar_values():
    rr.init(f"TestLatency", spawn=True, recording_id="default")
    while not stop_event.is_set():
        for i, j, k in itertools.product(range(0, 10), range(0, 5), range(0, 3)):
            scalar_value = random.uniform(0, 1)  # Generate a random scalar value
            rr.log(f"AAA/{i}/{j}/{k}/value", rr.Scalar(scalar_value))
            print(f"Log {scalar_value}")
        time.sleep(0.001)  # Sleep for 10ms to achieve approximately 100Hz


stop_event = threading.Event()

logging_thread = threading.Thread(target=log_scalar_values)
logging_thread.start()

input("Press Enter to stop logging...\n")
stop_event.set()
logging_thread.join()
print("Logging stopped.")

Desktop (please complete the following information):

  • OS: Windows 11

Rerun version
Detected in 0.19.0 but likely appeared in 0.18.0 or 0.18.1

@rgolovanov rgolovanov added the 👀 needs triage and 🪳 bug labels Nov 12, 2024
@teh-cmc teh-cmc self-assigned this Nov 12, 2024
@teh-cmc teh-cmc added the 🚀 performance and 🦟 regression labels and removed the 👀 needs triage label Nov 12, 2024
@teh-cmc teh-cmc added this to the 0.20 - Maps, H.264 milestone Nov 12, 2024
@teh-cmc
Member

teh-cmc commented Nov 12, 2024

Thanks for the neat little repro @rgolovanov, that made it a breeze 👍

@rgolovanov
Contributor Author

@teh-cmc Thanks for the extremely fast fix 🙏 We'll be looking forward to the new release 😊

@rgolovanov
Contributor Author

@teh-cmc I installed rerun_sdk-0.20.0a4+dev-cp38-abi3-win_amd64.whl from the latest development build, which includes your fix. For me, it still shows growing latency...
[screenshot]

@teh-cmc
Member

teh-cmc commented Nov 13, 2024

> @teh-cmc I installed rerun_sdk-0.20.0a4+dev-cp38-abi3-win_amd64.whl from the latest development build, which includes your fix. For me, it still shows growing latency...

Both 0.17 and the latest development wheel seem to have the same latencies for me @rgolovanov, i.e. they can both sustain the above script at 100Hz but fall behind at 1kHz.

Note that the script in the issue description has a typo in it -- it's running at 1kHz, not 100Hz:

- time.sleep(0.001)  # Sleep for 10ms to achieve approximately 100Hz
+ time.sleep(0.010)  # Sleep for 10ms to achieve approximately 100Hz

@rgolovanov
Contributor Author

Yes, frequency matters of course. I think I started local testing at 100Hz and then switched to 1kHz because it was an easier and faster way to reveal the issue. With time.sleep(0.010) intervals the latency also starts growing, it just takes longer. In my case it reached 40s of latency when the recorded data was ~130MiB...

@teh-cmc
Member

teh-cmc commented Nov 13, 2024

> In my case it reached 40s of latency when the recorded data was ~130MiB...

The amount of data already present shouldn't have any impact on the ingestion speed.

What I assume you're experiencing here is another issue: the fact that ingestion and rendering both happen on the same thread as of today:

By the time you've ingested 130MiB worth of scalars, and assuming your plot view is configured to display them all, rendering times will skyrocket and therefore ingestion will start lagging behind because the thread is being hogged by the renderer.
Try hiding the plot view, and you should see the issue go away (or use the selection panel to set its time bounds).

The reason this might appear to be faster on 0.17 is simply that scalars consume at least an order of magnitude more memory on 0.17 than on 0.18+, so 130MiB worth of scalars on 0.17 is far fewer scalars than on 0.20, and therefore takes far less time to render.
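For the time-bounds workaround mentioned above, here is a minimal sketch using the Python blueprint API. It assumes rerun.blueprint with TimeSeriesView, VisibleTimeRange and TimeRangeBoundary (available in recent SDK versions) and the default log_time timeline; the /AAA origin matches the repro script.

import rerun as rr
import rerun.blueprint as rrb

# Sketch: only plot the last 10 seconds around the time cursor, so the renderer
# does not have to process and tessellate the entire recording every frame.
blueprint = rrb.Blueprint(
    rrb.TimeSeriesView(
        origin="/AAA",  # entity subtree logged by the repro script
        time_ranges=[
            rrb.VisibleTimeRange(
                "log_time",  # assumes the default wall-clock timeline
                start=rrb.TimeRangeBoundary.cursor_relative(seconds=-10.0),
                end=rrb.TimeRangeBoundary.cursor_relative(),
            )
        ],
    )
)

rr.init("TestLatency", spawn=True, default_blueprint=blueprint)

The same effect can be achieved interactively by selecting the plot view and restricting its visible time range in the selection panel.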

@rgolovanov
Contributor Author

@teh-cmc I updated to the new official release 0.20.0 and can confirm that performance is still low. The latency grows rapidly even when only 10s of the recording (current time with an offset) is shown in the view. As a workaround we use the older version 0.16.1.

I know the description of the issue mentions 0.17.0, but could you profile the performance difference between 0.16.1 and 0.20.0? Perhaps there is something else that could be optimized?

@teh-cmc
Member

teh-cmc commented Nov 18, 2024

> @teh-cmc I updated to the new official release 0.20.0 and can confirm that performance is still low. The latency grows rapidly even when only 10s of the recording (current time with an offset) is shown in the view. As a workaround we use the older version 0.16.1.
>
> I know the description of the issue mentions 0.17.0, but could you profile the performance difference between 0.16.1 and 0.20.0? Perhaps there is something else that could be optimized?

😞 I'll have a look.

@emilk
Member

emilk commented Nov 20, 2024

Here is a profile dump:

[screenshot]

I see two separate problems

Slow ingestion

We spend 10ms ingesting input (which is our limit). This is slow because there are so many entities, i.e. it is part of the broader many-entities performance problem.

Zooming in on the ingestion of a single chunk:
[screenshot]

Slow plots

There are several reasons why the plots are slow.

Many plot lines

One is that there are so many lines. Each line needs processing:

[screenshot]

Slow egui rendering

The second problem is that they are rendered using egui, apparently with way too many line primitives, leading to slow tessellation:

[screenshot]

And also a lot of primitives to upload to the GPU:

[screenshot]

Conclusions

The slow ingestion is a real problem, and needs to be fixed.

The slow rendering is just an effect of rendering this craziness:

[screenshot]

This is not a realistic plot view of real data. This is madness :)

As soon as I zoom in, the performance is good again:
[screenshot]

A potential solution is to just aggregate the line plots a lot more aggressively, i.e. render them more coarsely.

@emilk emilk changed the title from "Performance degradation in Viewer when log multiple scalar values" to "Chunk ingestion performance regression because of compaction logic" Nov 20, 2024
@teh-cmc
Member

teh-cmc commented Nov 20, 2024

TL;DR:

  • The slow rendering of the plot itself is nothing new -- there is always room to improve plot rendering, and we should always strive to do so, but this is not needed for 0.20.1.
  • The slow ingestion is because the reproduction script in the top comment represents the worst possible case when logging (i.e. lots of different entities, with just enough time spent sleeping that the client-side micro-batcher never gets any opportunity to actually batch anything, at least with the default settings).
    This leads to a regression in ingestion time in that case, because all these tiny chunks that couldn't be compacted on the client side now have to be compacted on the server side.
    I'll try and see if there are some very low-hanging fruit to make that faster for 0.20.1, but more likely than not this will require some more serious work and will have to wait for 0.21.

@emilk
Member

emilk commented Nov 21, 2024

It is likely that some of this ingestion overhead can be mitigated with more aggressive SDK-side chunk compaction. This can be enabled by changing the environment variable RERUN_FLUSH_TICK_SECS, e.g. RERUN_FLUSH_TICK_SECS=0.2. This trades a little bit of latency (in this case 0.2 seconds) for larger chunks, which should ingest much faster.
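A minimal sketch of how this could be applied to the repro script, assuming the SDK picks up RERUN_FLUSH_TICK_SECS from the process environment when the recording stream is created (setting it in the shell before launching the script, e.g. RERUN_FLUSH_TICK_SECS=0.2 python rerun_latency_test.py, amounts to the same thing):

import os

# Must be set before rr.init() creates the recording stream / micro-batcher.
os.environ["RERUN_FLUSH_TICK_SECS"] = "0.2"  # flush every 200ms instead of the default

import rerun as rr

rr.init("TestLatency", spawn=True, recording_id="default")
# ... log scalars as in the script above; they should now arrive in larger,
# less frequent chunks, at the cost of up to ~0.2s of extra latency.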
