readNativeFrames Performance Optimizations #161

Open
suyashkumar opened this issue Dec 19, 2020 · 18 comments · Fixed by #160

@suyashkumar (Owner)

This is an issue to track various investigations and changes to optimize the readNativeFrames implementation, which pprof profiles showed taking up a majority of the CPU time when parsing DICOMs with Native pixel data.

More comprehensive tests and benchmarks were added in #160 to aid with this.

suyashkumar self-assigned this Dec 19, 2020
suyashkumar added a commit that referenced this issue Dec 19, 2020
This change reduces the number of allocations in `readNativeFrames`. Instead of allocating a slice for each pixel's samples, a single flat buffer slice of size `pixelsPerFrame*samplesPerPixel` is allocated upfront. Ranges within that flat slice are then referenced from the larger 2D slice. As a result there are only two calls to `make`, which yields significant performance gains.
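Roughly, the allocation pattern described above looks like the following sketch (the function name and element type here are illustrative, not the library's actual code):

```go
// allocFrame makes exactly two allocations: one flat backing buffer for all
// samples, and one outer slice whose entries are views into that buffer.
func allocFrame(pixelsPerFrame, samplesPerPixel int) [][]int {
    buf := make([]int, pixelsPerFrame*samplesPerPixel)
    pixels := make([][]int, pixelsPerFrame)
    for i := range pixels {
        pixels[i] = buf[i*samplesPerPixel : (i+1)*samplesPerPixel]
    }
    return pixels
}
```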

On my machine running `make bench-diff`:
```
name           old time/op  new time/op  delta
Parse/1.dcm-8  2.29ms ± 1%  1.95ms ± 3%  -15.00%  (p=0.008 n=5+5)
Parse/2.dcm-8  2.27ms ± 1%  1.94ms ± 0%  -14.41%  (p=0.008 n=5+5)
Parse/3.dcm-8  2.10ms ± 0%  1.81ms ± 2%  -13.77%  (p=0.008 n=5+5)
Parse/4.dcm-8   240ms ± 1%   196ms ± 4%  -18.27%  (p=0.008 n=5+5)
Parse/5.dcm-8  33.6ms ± 1%  27.9ms ± 0%  -17.00%  (p=0.008 n=5+5)
```

We see similar results in the [GitHub action benchmark](https://github.com/suyashkumar/dicom/pull/157/checks?check_run_id=1574909456#step:5:26).

Presumably the percentage performance gains would be even higher for DICOMs with more Native PixelData (e.g. more frames, pixels per frame, samples per pixel, etc).

This helps address #161.
@suyashkumar (Owner, Author)

#157 led to some significant performance improvements by reducing slice allocations.

@suyashkumar (Owner, Author)

Also, on this subject, @kaxap has observed that eliminating calls to `binary.Read` may reduce slice allocations and improve performance a bit, so that's something we should try. In particular, doing something like the following to read uint8s, uint16s, uint32s, etc.:

```go
buf := make([]byte, bitsAllocated/8)
for pixel := 0; pixel < pixelsPerFrame*samplesPerPixel; pixel++ {
    if _, err := io.ReadFull(d, buf); err != nil {
        return nil, err
    }
    val := int(binary.LittleEndian.Uint16(buf))
    // ... store val in the frame data ...
}
```

The main advantage here is reusing the buffer slice (since bitsAllocated should be constant), instead of binary.Read allocating a new one internally on each call.

@kaxap (Contributor) commented Dec 20, 2020

It is also possible to lazily load frames from the file, which would result in a smaller memory footprint (some DICOM files can be gigabytes in size).

@suyashkumar (Owner, Author) commented Dec 20, 2020

This is somewhat along the lines of #162--which would allow interested parties to skip reading the pixel data, or perhaps to pause reading before actually parsing the PixelData (something the Parser API can do today at the tag-by-tag level, but not frame by frame). We could also read in the bytes and do nothing with unpacking them until a user asks for the image.Image, which is almost what we do today. I'm leaning towards something similar to the current behavior plus an option to not read PixelData (or potentially any blocklist of tags). We also already have a "streaming" option where frames can be consumed as they are parsed, instead of having to wait for the full set of frames (or the whole DICOM) to be parsed.
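For reference, a rough sketch of that streaming pattern using the library's frame channel (exact signatures and channel-closing behavior may differ across versions, so treat this as illustrative rather than canonical):

```go
package main

import (
    "log"

    "github.com/suyashkumar/dicom"
    "github.com/suyashkumar/dicom/pkg/frame"
)

func main() {
    // Frames are sent on this channel as they are parsed, so a consumer can
    // process them without waiting for the whole DICOM to finish.
    frameChan := make(chan *frame.Frame)
    done := make(chan struct{})
    go func() {
        defer close(done)
        for fr := range frameChan {
            _ = fr // e.g. render a thumbnail from the first frame
        }
    }()

    ds, err := dicom.ParseFile("example.dcm", frameChan)
    if err != nil {
        log.Fatal(err)
    }
    <-done // assumes the parser closes frameChan when it finishes
    _ = ds
}
```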

What do you think? Also, would love to learn more about your use cases if you have any particular ones. Cheers!

@kaxap (Contributor) commented Dec 20, 2020

In most cases, all I want to do is read some tags and render a single frame (for a thumbnail or a preview). A lazy mode (where we save per-frame offsets into the file instead of reading all frames into RAM) looks very appealing for this case.
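A minimal sketch of what that lazy mode could look like (hypothetical types and names; nothing like this exists in the library today):

```go
package dicomlazy

import "io"

// lazyFrame records where a frame's raw bytes live in the file instead of
// holding decoded samples in memory.
type lazyFrame struct {
    offset int64 // byte offset of this frame's pixel data in the file
    length int64 // number of bytes belonging to this frame
}

// load seeks to the recorded offset and reads only that frame's bytes.
func (lf lazyFrame) load(r io.ReadSeeker) ([]byte, error) {
    if _, err := r.Seek(lf.offset, io.SeekStart); err != nil {
        return nil, err
    }
    buf := make([]byte, lf.length)
    if _, err := io.ReadFull(r, buf); err != nil {
        return nil, err
    }
    return buf, nil
}
```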

@bbeckrest commented May 18, 2021

Regarding memory usage, we've recently had our server (32GB RAM) run out of memory while attempting to parse a 1GB zipfile containing dicoms.

Using pprof, I've benchmarked parsing a 55MB zipfile containing 222 dicom images (each image 526KB, total ~117MB unzipped) and noticed readNativeFrames() ballooning memory usage to ~1.74GB.

Is this high memory usage (15x) expected?

[Screenshot: pprof profile showing readNativeFrames memory usage]

@suyashkumar (Owner, Author)

Interesting, thank you for reporting! I don't think this is expected. Out of curiosity, what version of the library are you using? Does it include some of the recent optimizations included in this issue?

I'll take a closer look tonight and dive deep if needed this weekend.

@bbeckrest commented May 18, 2021

Thanks for looking into this! I don't have much experience profiling performance so I appreciate the help.

Our prod server that ran out of memory is using v0.4.5, which obviously does not have the recent optimizations.

However, my benchmark is using v1.0.3. We were hoping to see that upgrading the library would solve the issue.

@suyashkumar (Owner, Author)

Thanks! So, something that sticks out to me is that we pack smaller integers into the possibly larger Go int type in the NativeFrame. This is done in order to have a clean API with the same static types for the caller no matter what the underlying integer type might be (e.g. 8-bit or 16-bit unsigned integers). I think this is useful from the user's perspective: they have one NativeFrame type to deal with, and they can use regular ol' ints.

However, this can easily cause large increases in memory compared to the size of a DICOM on disk--for example, an underlying set of 8-bit integers in the DICOM would be read into a slice of normal Go ints, which might be 32 or 64 bits wide (up to 8 times more memory).

This is a situation in which generics in Go would be very helpful, so we could have a NativeFrame<intLikeType>.

Simple option?

One temporary and somewhat simple solution could be using uint32 in this API instead of int, if I can verify that DICOMs don't ever have 64 bits of data in them for NativeFrames (this saves 32 bits per int on 64-bit wide systems).

More involved option

Another option is exploring a different NativeFrame type per integer size. While it's not as clean as accessing the underlying data directly, I could envision something like the below, where we make NativeFrame an interface instead of a struct, and have 3 structs implementing that interface, each wrapping an underlying int type.

```go
type NativeFrame interface {
    Rows() int
    Cols() int
    BitsPerSample() int
    GetPixel(x, y int) int
}

type nativeFrame8 struct {
    underlyingData []uint8
}
// ...
type nativeFrame16 struct {
    underlyingData []uint16
}
// ...
```
In the absence of generics, this allows our memory representation to be as small as possible while still exposing a consistent integer API for anything that wants to use the NativeFrame data. That being said, if a caller copies the data out of the underlying NativeFrame and grabs it as ints, we'll run into the same expansion-in-memory problem anyway, just downstream from here (e.g. when turning this into an image later, if it goes through an intermediate int stage)... so we might need to look at a couple of our use cases to see.

Let me know if folks have any thoughts or if you can shed light on any use case specific details.

Also, this might not explain all of the memory overhead--will look into other possible options soon too.

We should be able to use BenchmarkReadNativeFrames to see the impact of various possible changes on the memory profile, at least within that function.

@bbeckrest commented May 27, 2021

Reading through the DICOM spec, it appears that allocating more than 32 bits is rare. If I'm understanding correctly, 64 bits are only used for Double Float Pixel Data.

FWIW, I've looked at ~90 chest CT scans that have been uploaded to our web app by doctors and they all allocate 16 bits.

Our use case involves determining all viable series & a single "best" series within a CT scan with 1-N series, which is eventually segmented. For each series, we parse the header to immediately filter out series that don't meet our criteria (not enough slices, too large slice thickness, etc.). After getting a list of viable series, we parse & store the pixel data for every slice in these series (this can take up a lot of memory if the scan has many viable series). After creating a 3D volume from the series' pixel data, we save jpg images as slices through the volume for each anatomical plane (coronal, axial, sagittal). Saving the images doesn't require much memory since we write each pixel sequentially, converting it from int to uint16 (similar to NativeFrame.GetImage()).

We are in the process of refactoring code for our use case described above so that we're only storing pixel data for one series at a time, which will cut down our memory usage. Obviously, it'd still be nice to reduce footprint while parsing.

If you don't mind, I'm going to take a stab at implementing the NativeFrame interface. I don't think we should worry too much about downstream memory usage with this implementation. The user can drop bits by converting the int returned by GetPixel() to whatever uint type they wish so int isn't stored long-term. However, we should still be mindful of the performance implications from this conversion.

@suyashkumar (Owner, Author) commented May 28, 2021

@bbeckrest That is my understanding as well re: the size of the pixel data.

Regarding the NativeFrame interface I proposed in #161 (comment), I think it's reasonable, keeping in mind my performance comment below.

What I meant by the performance comment is that calling GetPixel() int will, in certain circumstances, allocate a full-size integer. In some cases that integer may only be temporary and will be garbage collected--for example when iterating and processing pixel by pixel--and that is the scenario where we'll see a drastic reduction in memory. In other scenarios we won't see a reduction in memory footprint, at the cost of a slightly more complex API. That might be an okay tradeoff, but it probably only covers a narrow set of use cases. On the other hand, not using int in the API and instead always using int32 will always help on 64-bit systems--though letting the user ask for other int sizes from the API probably helps here too, as long as we're careful about overflow. Maybe we should have APIs for other int sizes that return an error on overflow, e.g. GetAsUInt16() (uint16, error).
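A tiny sketch of what such an overflow-checked accessor could look like (the receiver type and field names here are hypothetical, not the library's actual API):

```go
package dicomframe

import "fmt"

// nativeFrameInt is a hypothetical frame backed by full-size ints.
type nativeFrameInt struct {
    data []int
}

// GetAsUInt16 returns the sample at index i as a uint16, or an error if the
// stored value would overflow.
func (f *nativeFrameInt) GetAsUInt16(i int) (uint16, error) {
    v := f.data[i]
    if v < 0 || v > 0xFFFF {
        return 0, fmt.Errorf("sample %d overflows uint16: %d", i, v)
    }
    return uint16(v), nil
}
```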

Overall I think my proposal in #161 (comment) is okay, and it should be a good tradeoff in the interim, especially if it will help you with your use case. I'm not 100% sure it will solve all of your problems if, during the creation of the 3D volume, you ever hold the whole series in memory as full ints--but you of course have a better sense of that than I do! I think that generics in Go will help a lot here, so we can have PixelData<IntType> types (or something like that).

Please let me know if you'd like to send a PR for this, and let me know if there are any questions! No worries if you don't have time, I can knock this out at some point soon too. 😄

@bbeckrest

@suyashkumar Yeah the proposal in #161 (comment) would be helpful for my use-case since the volume could be stored as uint16.

Yesterday, I attempted to implement it but got stuck and realized I didn't really know what I was doing (here's my attempt). It's a decently large & important change so I'll probably defer to you here. I was having an issue implementing the NativeFrames interface while minimizing the API change.

@suyashkumar (Owner, Author)

Hi @bbeckrest! There is another issue that may somehow be related to this (though I'm not exactly sure how or why). For files that have an unusual bitsAllocated, we ended up not reading the pixel data correctly, which I'm addressing in #210. Another user of the library was experiencing around 1.7GB of memory being allocated (that wasn't being cleaned up properly by the Go GC), and their DICOM was triggering this condition. It could still very well be an independent issue--just figured I'd mention it!

Anyway, I do think I will circle back to my proposed solution above, and will take a look at your attempt as well!

@ducquangkstn (Contributor)

> though I'm not exactly sure how or why

We have a DICOM (33 MB) with bitsAllocated = 1:

  • using pydicom to read the pixel data results in an array of type uint8, using 251 MB (~33 MB * 8)
  • using this library results in about 2 GB (~251 MB * 8) of memory allocated; the output array is []int (64-bit)

I guess the issue comes from the data type representation.

@suyashkumar (Owner, Author)

Yep, in the case you describe that's what's happening (though in the comment above I was referring to another issue).

#161 (comment) is where I lay out a possible interface for handling this in as clean of a way as possible. I can probably also add a getData interface{} that people can type assert to the right underlying type too if they just want the underlying data.

Though now that Go has generics in 1.18, we may be able to adopt them to have a generic NativeFrame. But if we do, folks will need a hard dependency on at least Go 1.18, and we would have generics exposed in a public API (which may not be that bad at this point...).
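For illustration, a generics-based NativeFrame could look roughly like the following (Go 1.18+; the constraint and field names are hypothetical, not necessarily what the library ends up shipping):

```go
package dicomframe

// intLike constrains the sample types a frame can hold (hypothetical name).
type intLike interface {
    ~uint8 | ~uint16 | ~uint32 | ~int
}

// NativeFrame stores samples at their natural width instead of widening
// everything to a full-size int.
type NativeFrame[T intLike] struct {
    Rows, Cols      int
    SamplesPerPixel int
    Data            []T // flat; len == Rows*Cols*SamplesPerPixel
}

// RawData exposes the concrete slice for callers that just want the
// underlying data at its native width.
func (f *NativeFrame[T]) RawData() []T { return f.Data }
```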

@suyashkumar (Owner, Author) commented May 27, 2024

I have been playing around with some loose hacking on what a multi-go-integer NativeFrame might look like here: #315. It needs a lot of clean up and extra testing, but feel free to play with it if you have dicoms that were previously causing you issues.

Looking at the frame APIs again, I do think they need to be revisited in general, but this is something that should get the desired effect at least for now.

@suyashkumar (Owner, Author)

So with the exploratory changes in #315, I've learned of another large contributor to this memory ballooning--and interestingly, if I understand correctly, it's not the uint16->int64 expansion.

Right now, frames store the parsed data in a 2D slice, something like [][]uint16, with the first dimension being the pixel and the inner dimension being the values per pixel. Set up this way, we have one []uint16 slice per pixel. It turns out a slice header in Go is itself 24 bytes (!) (initial ref), so we're allocating 24 bytes per pixel often just to store 1-3 values per pixel (each only 2 bytes in the uint16 case)!

So, at a minimum, we might want to store a flat underlying representation of the values in the frame, and then offer APIs that help you navigate by pixel as needed (that is, if we even want to unpack NativePixelData at all during parsing).
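A minimal sketch of that flat representation (names are illustrative, not the library's actual types):

```go
package dicomframe

// flatFrame keeps all samples in one flat slice, avoiding a 24-byte slice
// header per pixel.
type flatFrame struct {
    samplesPerPixel int
    data            []uint16 // pixel i's samples start at i*samplesPerPixel
}

// pixel returns pixel i's samples as a view into the flat slice; no new
// backing array is allocated.
func (f flatFrame) pixel(i int) []uint16 {
    start := i * f.samplesPerPixel
    return f.data[start : start+f.samplesPerPixel]
}
```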

Either way, even if we process it later, we'll want to be mindful of this. I haven't verified that this is the issue, but I'm guessing much of the memory is actually going to the slice headers rather than to the uint16->int64 expansion!

@suyashkumar (Owner, Author)

Some updates: in #315 (comment) you'll see an implementation that yields significant memory and CPU improvements for native PixelData processing. I still need to iterate and clean up some tests and API details, though. Please let me know if you have any thoughts in general here--this will be a pretty significant change to how Native PixelData is represented and interacted with.

suyashkumar added a commit that referenced this issue Aug 9, 2024
…and supporting generic data slices. (#315)

This change refactors the Go NativeFrame representation to vastly reduce the memory footprint (by __90%__ in some cases) and CPU usage (~30% in some cases). It does this by:

- Storing data in a flat slice instead of a nested one (e.g. `[][]uint16`). This helps because the 24-byte header of each inner Go slice is very significant when those inner slices only hold 1-3 uint16s, which is often the case. More details in my comment here: #161 (comment)
- Using generics so the Go NativeFrame Data slice can use an appropriately sized Go integer based on the bitsAllocated (e.g. the underlying slice can be `uint8`, `uint16`, etc. instead of always a full-size `int`).

__This is an API breaking change.__ I'm not sure I love the INativeFrame API and naming, but introducing an interface likely makes it easier to interact with the NativeFrame without forcing propagation of generic signatures everywhere. It also still preserves access to the underlying concrete integer slice if needed. Having a more efficient default NativeFrame representation is important, so let's get this in and iterate if needed before the next release.

This fixes #291 and addresses a big part of #161.

## Benchmarks

See the PR GitHub Action [here](https://github.com/suyashkumar/dicom/actions/runs/10294852102/job/28493571486?pr=315#step:5:69), with an example run from one of the runs screenshotted below:

Memory reduction:
<img width="1149" alt="Screenshot 2024-08-07 at 10 24 56 PM" src="https://github.com/user-attachments/assets/625ac3fd-e947-400f-9969-9a42ff1539b3">

CPU reduction:
<img width="1185" alt="Screenshot 2024-08-07 at 10 29 16 PM" src="https://github.com/user-attachments/assets/415f4b55-e069-402b-91d7-200bee0babb9">