
storescu.py is ~35x slower than DCMTK's storescu when saving ultrasound studies #650

Closed · mself opened this issue Jun 18, 2021 · 10 comments

mself (Contributor) commented Jun 18, 2021

Describe the bug
storescu.py is ~35x slower than DCMTK's storescu when saving ultrasound studies.

Expected behavior
I am exporting a 124-frame uncompressed ultrasound DICOM dataset (208 MB) to OsiriX Lite running locally on a Mac.
When I use storescu.py, it takes 19 s; when I use DCMTK's storescu, it takes only 0.5 s.

Steps To Reproduce

% time python storescu.py localhost 11112 data/tmp_export/2/00001
python storescu.py localhost 11112 data/tmp_export/2/00001  2.38s user 1.78s system 21% cpu 19.201 total

% time storescu localhost 11112 data/tmp_export/2/00001   
storescu localhost 11112 data/tmp_export/2/00001  0.06s user 0.07s system 23% cpu 0.546 total

I did some profiling, and 96% of the time is being spent in _thread.lock.acquire() called from DIMSEServiceProvider::get_msg():

python -m cProfile -s cumtime storescu.py localhost 11112 data/tmp_export/2/00001
   Ordered by: cumulative time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    554/1    0.002    0.000   19.011   19.011 {built-in method builtins.exec}
        1    0.010    0.010   19.011   19.011 storescu.py:5(<module>)
        1    0.018    0.018   18.726   18.726 storescu.py:212(main)
        1    0.000    0.000   18.590   18.590 association.py:1570(send_c_store)
    12741   18.187    0.001   18.187    0.001 {method 'acquire' of '_thread.lock' objects}
        5    0.000    0.000   18.184    3.637 threading.py:264(wait)
        3    0.000    0.000   18.184    6.061 queue.py:153(get)
        1    0.000    0.000   18.167   18.167 dimse.py:171(get_msg)

It looks like calls to _thread.lock.acquire() take 1.4 ms on average, which seems very slow.

Any ideas on whether this is a configuration issue? Or could using collections.deque instead of queue.Queue help? I have read that it can be up to 20x faster (https://newbedev.com/queue-queue-vs-collections-deque).

Thank you!

Your environment

Darwin-20.5.0-x86_64-i386-64bit
Python  3.7.9 (default, Mar  1 2021, 16:48:13) 
[Clang 12.0.0 (clang-1200.0.32.29)]
pydicom  1.4.2
pynetdicom  1.5.3
scaramallion (Member) commented Jun 18, 2021

See #620 for some optimisations, as well as the SCP tutorial. Also, storescp.py is written more as an example than as a fully optimised SCP.

Also, there's some info in #639.

I'll take a look and see what effect deque has, but it might not be suitable.

mself (Contributor, Author) commented Jun 19, 2021

After reading the notes in #531, I tried adjusting the value of DULServiceProvider::_run_loop_delay. Reducing the loop delay makes a huge improvement in performance!
- _run_loop_delay = 0.001 -> 2.54s user 1.62s system 21% cpu 19.195 total
- _run_loop_delay = 0.0001 -> 1.00s user 1.37s system 74% cpu 3.192 total
- _run_loop_delay = 0.00001 -> 0.99s user 1.36s system 135% cpu 1.726 total
- _run_loop_delay = 0.000001 -> 1.08s user 1.42s system 145% cpu 1.722 total

Decreasing _run_loop_delay from 1 ms to 10 μs increases performance by 11x!

Going from 10 μs to 1 μs did not make a noticeable difference. This is on a 2020 MacBook Air with the M1 processor.

With this change, storescu.py is only 3.6x slower than storescu rather than 35x, which is a huge improvement.
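
A minimal sketch of how to apply the override from user code (this assumes that setting the private attribute on the association's DUL provider takes effect, as in pynetdicom 1.5.x; since it is private, it may change without notice):

```python
from pydicom import dcmread
from pynetdicom import AE, StoragePresentationContexts

ae = AE()
ae.requested_contexts = StoragePresentationContexts

assoc = ae.associate("localhost", 11112)
if assoc.is_established:
    # _run_loop_delay is private: the per-iteration sleep (in seconds) of
    # the DUL reactor loop. The default is 0.001 (1 ms); 0.00001 (10 us)
    # gave the ~11x speedup measured above.
    assoc.dul._run_loop_delay = 0.00001

    ds = dcmread("path/to/dataset.dcm")  # hypothetical path
    status = assoc.send_c_store(ds)
    assoc.release()
```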

mself (Contributor, Author) commented Jun 21, 2021

I also tested this with findscu.py. Decreasing _run_loop_delay from 1 ms to 10 μs also increased the performance of findscu.py by ~2x when running an MWL (Modality Worklist) query that returns 1,000 results.

Is there any downside to making DULServiceProvider::_run_loop_delay significantly smaller? Or at least making it a configurable setting?

scaramallion (Member) commented

Higher CPU usage when messages aren't being exchanged.

The medium-term plan is to move away from threading, so it's probably not worth me adding (another) config option, especially since it's already configurable (if "private").

mself (Contributor, Author) commented Jun 21, 2021

OK, got it. I only use SCU mode, so messages are essentially always being exchanged. I did not notice any increase in CPU usage for my application.

I also did a quick test of replacing queue.Queue with collections.deque. It was easy to do for DULServiceProvider::event_queue and to_provider_queue, since those never do any blocking get() calls. You can just change put() -> append() and get() -> popleft() everywhere (and also queue.Empty to IndexError), as in the sketch below.

However, this only made a small speed improvement (~5%). Still, it seems like a simpler design, since you don't need any of the locking features of queue.Queue, and this is exactly the use case collections.deque is made for.
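
A minimal sketch of the substitution (names are illustrative, not the actual pynetdicom internals):

```python
from collections import deque

# Before: a synchronised queue.Queue
#   q = queue.Queue()
#   q.put(event)
#   event = q.get(block=False)  # raises queue.Empty when empty

# After: a deque; individual append()/popleft() calls are atomic, so no
# extra locking is needed when there are no blocking get() calls
event_queue = deque()
event_queue.append("some-event")    # put()  -> append()
try:
    event = event_queue.popleft()   # get()  -> popleft()
except IndexError:                  # queue.Empty -> IndexError
    event = None
```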

Good luck on the new non-threading design!

mself (Contributor, Author) commented Jun 21, 2021

I submitted PR #651 to address this without having to set _run_loop_delay to a smaller value. The speedup is even better than setting _run_loop_delay to 1 μs, and without any impact on idle CPU usage.

stefanoostwegel commented Jul 12, 2021

I have managed to resolve performance issues by changing the Windows timer resolution.
Maybe this could be worth something for you?
https://github.com/pydicom/pynetdicom/issues/531#issuecomment-878139375
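
A minimal sketch of that kind of workaround, assuming the linked comment uses the winmm multimedia timer API (on Windows, time.sleep() granularity follows the system timer resolution, which defaults to ~15.6 ms, so a 1 ms reactor sleep can actually block for ~15 ms):

```python
import ctypes
import time

# Request a 1 ms system timer resolution so short sleeps are honoured
ctypes.windll.winmm.timeBeginPeriod(1)
try:
    start = time.perf_counter()
    time.sleep(0.001)  # with the default ~15.6 ms resolution this can take ~15 ms
    print(f"slept for {time.perf_counter() - start:.4f} s")
finally:
    # Always restore the default resolution when done
    ctypes.windll.winmm.timeEndPeriod(1)
```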

mself (Contributor, Author) commented Jul 12, 2021

Thank you, I did review that change. I am running on Mac, so I don't think this issue applies in my case.

The PR that I submitted removes the need to sleep at all when additional events are in the queue ready to be processed, so it is better than making the sleep interval shorter.

stefanoostwegel commented

Oh, I did not realise this fix is probably for Windows only...
What would happen if we were to skip the sleeps altogether?
Is it truly important to have them?

mself (Contributor, Author) commented Jul 12, 2021

The DUL "reactor" processes events that are sent between the SCU and the SCP. When there are no events in the queue, you need to wait for an event to arrive over the network. The _run_loop_delay parameter determines how long to sleep, which is a tradeoff between (1) how much CPU the system uses while waiting, and (2) how fast the system responds to new events from the network.

But, when sending (or receiving) a large DICOM, the DICOM is processed as a large number of separate events. If there are more events waiting in the queue to be processed, there is no need to sleep before processing them. Currently, the code sleeps before every event. PR #651 modifies the reactor to only sleep when the event queue is empty.
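
A minimal sketch of that change (names and structure are illustrative, not the actual pynetdicom reactor):

```python
import time
from collections import deque

def reactor_loop(event_queue: deque, process_event, run_loop_delay: float = 0.001) -> None:
    """Process queued events, sleeping only when there is nothing to do."""
    while True:
        try:
            event = event_queue.popleft()
        except IndexError:
            # Queue is empty: yield the CPU while waiting for network traffic
            time.sleep(run_loop_delay)
            continue
        # Queue had an event: handle it immediately, with no sleep, so a
        # transfer made up of many events is not throttled by the loop delay
        process_event(event)
```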
