
storescu.py is ~35x slower than DCMTK's storescu when saving ultrasound studies #650

Closed · mself opened this issue Jun 18, 2021 · 10 comments

mself (Contributor) commented Jun 18, 2021

Describe the bug
storescu.py is ~35x slower than DCMTK's storescu when saving ultrasound studies.

Expected behavior
I am exporting a 124-frame uncompressed ultrasound DICOM dataset (208 MB) to OsiriX Lite running locally on a Mac.
When I use storescu.py, it takes 19 s; when I use DCMTK's storescu, it takes only 0.5 s.

Steps To Reproduce

% time python storescu.py localhost 11112 data/tmp_export/2/00001
python storescu.py localhost 11112 data/tmp_export/2/00001  2.38s user 1.78s system 21% cpu 19.201 total

% time storescu localhost 11112 data/tmp_export/2/00001   
storescu localhost 11112 data/tmp_export/2/00001  0.06s user 0.07s system 23% cpu 0.546 total

I did some profiling, and 96% of the time is being spent in _thread.lock.acquire() called from DIMSEServiceProvider::get_msg():

python -m cProfile -s cumtime storescu.py localhost 11112 data/tmp_export/2/00001
   Ordered by: cumulative time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    554/1    0.002    0.000   19.011   19.011 {built-in method builtins.exec}
        1    0.010    0.010   19.011   19.011 storescu.py:5(<module>)
        1    0.018    0.018   18.726   18.726 storescu.py:212(main)
        1    0.000    0.000   18.590   18.590 association.py:1570(send_c_store)
    12741   18.187    0.001   18.187    0.001 {method 'acquire' of '_thread.lock' objects}
        5    0.000    0.000   18.184    3.637 threading.py:264(wait)
        3    0.000    0.000   18.184    6.061 queue.py:153(get)
        1    0.000    0.000   18.167   18.167 dimse.py:171(get_msg)

It looks like calls to _thread.lock.acquire() take 1.4 ms on average, which seems very slow.

Any ideas on whether this is a configuration issue? Or could using collections.deque instead of queue.Queue help? I have read that it can be up to 20x faster (https://newbedev.com/queue-queue-vs-collections-deque).

Thank you!

Your environment

Darwin-20.5.0-x86_64-i386-64bit
Python  3.7.9 (default, Mar  1 2021, 16:48:13) 
[Clang 12.0.0 (clang-1200.0.32.29)]
pydicom  1.4.2
pynetdicom  1.5.3
scaramallion (Member) commented Jun 18, 2021

See #620 for some optimisations, as well as the SCP tutorial. Also, storescp.py is written more as an example than as a fully optimised SCP.

Also, there's some info in #639.

I'll take a look and see what effect deque has, but it might not be suitable.

mself (Contributor, Author) commented Jun 19, 2021

After reading the notes in #531, I tried adjusting the value of DULServiceProvider::_run_loop_delay. Reducing the loop delay makes a huge improvement in performance!
- _run_loop_delay = 0.001 -> 2.54s user 1.62s system 21% cpu 19.195 total
- _run_loop_delay = 0.0001 -> 1.00s user 1.37s system 74% cpu 3.192 total
- _run_loop_delay = 0.00001 -> 0.99s user 1.36s system 135% cpu 1.726 total
- _run_loop_delay = 0.000001 -> 1.08s user 1.42s system 145% cpu 1.722 total

Decreasing _run_loop_delay from 1 ms to 10 μs increases performance by 11x!

Going from 10 μs to 1 μs did not make a noticeable difference. This is on a 2020 MacBook Air with the M1 processor.

With this change, storescu.py is only 3.6x slower than storescu rather than 35x, which is a huge improvement.
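
A minimal sketch of how to apply the override from user code (this assumes that setting the private attribute on the association's DUL provider takes effect, as in pynetdicom 1.5.x; since it is private, it may change without notice):

```python
from pydicom import dcmread
from pynetdicom import AE, StoragePresentationContexts

ae = AE()
ae.requested_contexts = StoragePresentationContexts

assoc = ae.associate("localhost", 11112)
if assoc.is_established:
    # _run_loop_delay is private: the per-iteration sleep (in seconds) of
    # the DUL reactor loop. The default is 0.001 (1 ms); 0.00001 (10 us)
    # gave the ~11x speedup measured above.
    assoc.dul._run_loop_delay = 0.00001

    ds = dcmread("path/to/dataset.dcm")  # hypothetical path
    status = assoc.send_c_store(ds)
    assoc.release()
```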

mself (Contributor, Author) commented Jun 21, 2021

I also tested this with findscu.py. Decreasing _run_loop_delay from 1 ms to 10 μs also increased the performance of findscu.py by ~2x when running an MWL (Modality Worklist) query that returns 1,000 results.

Is there any downside to making DULServiceProvider::_run_loop_delay significantly smaller? Or at least making it a configurable setting?

scaramallion (Member) commented

Higher CPU usage when messages aren't being exchanged.

The medium-term plan is to move away from threading, so it's probably not worth me adding (another) config option, especially since it's already configurable (if "private").

mself (Contributor, Author) commented Jun 21, 2021

OK, got it. I only use SCU mode, so messages are essentially always being exchanged. I did not notice any increase in CPU usage for my application.

I also did a quick test of replacing queue.Queue with collections.deque. It was easy to do for DULServiceProvider::event_queue and to_provider_queue, since those never do any blocking get() calls. You can just change put() -> append() and get() -> popleft() everywhere (and also queue.Empty to IndexError), as in the sketch below.

However, this only made a small speed improvement (~5%). Still, it seems like a simpler design, since you don't need any of the locking features of queue.Queue, and this is exactly the use case collections.deque is made for.
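
A minimal sketch of the substitution (names are illustrative, not the actual pynetdicom internals):

```python
from collections import deque

# Before: a synchronised queue.Queue
#   q = queue.Queue()
#   q.put(event)
#   event = q.get(block=False)  # raises queue.Empty when empty

# After: a deque; individual append()/popleft() calls are atomic, so no
# extra locking is needed when there are no blocking get() calls
event_queue = deque()
event_queue.append("some-event")    # put()  -> append()
try:
    event = event_queue.popleft()   # get()  -> popleft()
except IndexError:                  # queue.Empty -> IndexError
    event = None
```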

Good luck on the new non-threading design!

mself (Contributor, Author) commented Jun 21, 2021

I submitted PR #651 to address this without having to set _run_loop_delay to a smaller value. The speedup is even better than setting _run_loop_delay to 1 μs, and without any impact on idle CPU usage.

stefanoostwegel commented Jul 12, 2021

I have managed to resolve performance issues by changing the Windows timer resolution.
Maybe this could be worth something for you?
https://github.com/pydicom/pynetdicom/issues/531#issuecomment-878139375
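
A minimal sketch of that kind of workaround, assuming the linked comment uses the winmm multimedia timer API (on Windows, time.sleep() granularity follows the system timer resolution, which defaults to ~15.6 ms, so a 1 ms reactor sleep can actually block for ~15 ms):

```python
import ctypes
import time

# Request a 1 ms system timer resolution so short sleeps are honoured
ctypes.windll.winmm.timeBeginPeriod(1)
try:
    start = time.perf_counter()
    time.sleep(0.001)  # with the default ~15.6 ms resolution this can take ~15 ms
    print(f"slept for {time.perf_counter() - start:.4f} s")
finally:
    # Always restore the default resolution when done
    ctypes.windll.winmm.timeEndPeriod(1)
```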

mself (Contributor, Author) commented Jul 12, 2021

Thank you, I did review that change. I am running on Mac, so I don't think this issue applies in my case.

The PR that I submitted removes the need to sleep at all when additional events are in the queue ready to be processed, so it is better than making the sleep interval shorter.

stefanoostwegel commented

Oh, I did not realise this fix is probably for Windows only...
What would happen if we were to skip the sleeps altogether?
Is it truly important to have them?

mself (Contributor, Author) commented Jul 12, 2021

The DUL "reactor" processes events that are sent between the SCU and the SCP. When there are no events in the queue, you need to wait for an event to arrive over the network. The _run_loop_delay parameter determines how long to sleep, which is a tradeoff between (1) how much CPU the system uses while waiting, and (2) how fast the system responds to new events from the network.

But, when sending (or receiving) a large DICOM, the DICOM is processed as a large number of separate events. If there are more events waiting in the queue to be processed, there is no need to sleep before processing them. Currently, the code sleeps before every event. PR #651 modifies the reactor to only sleep when the event queue is empty.
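
A minimal sketch of that change (names and structure are illustrative, not the actual pynetdicom reactor):

```python
import time
from collections import deque

def reactor_loop(event_queue: deque, process_event, run_loop_delay: float = 0.001) -> None:
    """Process queued events, sleeping only when there is nothing to do."""
    while True:
        try:
            event = event_queue.popleft()
        except IndexError:
            # Queue is empty: yield the CPU while waiting for network traffic
            time.sleep(run_loop_delay)
            continue
        # Queue had an event: handle it immediately, with no sleep, so a
        # transfer made up of many events is not throttled by the loop delay
        process_event(event)
```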
