Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Most efficient region reading from tiled multi-resolution tiff images #100

Closed
nd26 opened this issue May 19, 2019 · 26 comments
Closed

Most efficient region reading from tiled multi-resolution tiff images #100

nd26 opened this issue May 19, 2019 · 26 comments
Labels

Comments

@nd26
Copy link

nd26 commented May 19, 2019

Hi!

This is not really an issue, but more of advice seeking. The context is that I am working with histopathological images (known as whole slide images) that have a typical size of 100k x 100k pixels, and have therefore dedicated libraries to perform reading and writing. The most popular example is openslide, and then there is also asap. However, following a recent acquisition of an NVME disk with >3gb/s read speeds, I've noticed zero to negligible improvements in reading regions from such images, using these libraries. Therefore, I had to delve deeper to figure out the reasons. And this has led me to lower-level libraries, i.e. libvips.

Reading regions from these images is the general bottleneck of my methodology (as it happens many times per second) and as such, I need to speed it up as much as possible, up until another (probably CPU-bound) bottleneck appears. What i've done so far is that I benchmarked the "Image.new_from_file" method by reading the same large tiff image from three different disks, an HDD, an SSD, and then the two new NVMes.

image

image

So here are the questions: - Is this the most efficient way to do the reading, i.e. do we really need to read the whole image before reading the region? - Since these are pyramidal tiffs, is there a way to access a specific level/resolution? - Generally, do you think these times are appropriate for disks with the following read speeds and given that the image has a size of 2.4 GB (2,402,766,863 bytes)?

HDD: ~100MB/s
SSD: ~500MB/s
NVMe: ~2.8 GB/s

@jcupitt
Copy link
Member

jcupitt commented May 19, 2019

Hello @nd26,

WSI files are usually compressed with jpeg2000, so almost all your runtime will be spent inside your jp2k decoder. The speed of the disc won't really have any effect.

You'd need to share your benchmark for me to comment on the speed. Is this for machine learning? pyvips 2.1.6 / libvips 8.8 have a new region feature which can give a nice speedup for this type of application (generating 1000s of eg. 32x32 pixel areas). If it's for image display, libvips has a fast path for this already, have a look at vipsdisp. Yes, you can read out levels from the pyramid, have a look at the vips_openslideload() docs.

https://github.com/jcupitt/vipsdisp

https://libvips.github.io/libvips/API/current/VipsForeignSave.html#vips-openslideload

OpenSlide uses OpenJPEG for decode, so you could investigate ways of speeding that up. There's been some talk of GPU acceleration for decode, and there are patches for AVX as well. They should help.

In terms of hardware, a huge CPU with many cores will help too.

@jcupitt
Copy link
Member

jcupitt commented May 19, 2019

For a simple benchmark, I'd try:

$ time vips avg CMU-1.svs 
233.716228

real	0m38.827s
user	1m22.160s
sys	0m2.017s

ie. find the average pixel value. That should max your CPU. If it doesn't, there's some more speed to squeeze from your IO system.

@nd26
Copy link
Author

nd26 commented May 20, 2019

Hi @jcupitt,

I appreciate your help. Thank you!

You'd need to share your benchmark for me to comment on the speed.

SSD

(pytorch_env) nd26@volta:~/Downloads$ time vips avg CMU-1.svs
233.716228

real	0m56.508s
user	0m57.908s
sys	0m3.824s

NVME

(pytorch_env) nd26@volta:~/Downloads$ time vips avg /nvme1/CMU-1.svs
233.716228

real	0m56.817s
user	0m57.760s
sys	0m4.100s

HDD

(pytorch_env) nd26@volta:~/Downloads$ time vips avg /media/nd26/data/CMU-1.svs 
233.716228

real	0m56.622s
user	0m57.984s
sys	0m4.076s

Is this for machine learning?

Yes. Just to clarify, I do not need to simply preprocess the whole slide images a priori into tiles, but instead read patches on the go while the network is training, which is why I am trying to make the reading part as efficient as possible.

pyvips 2.1.6 / libvips 8.8 have a new region feature which can give a nice speedup for this type of application (generating 1000s of eg. 32x32 pixel areas).

Is this what it's been used when I employ the "crop" method?

OpenSlide uses OpenJPEG for decode, so you could investigate ways of speeding that up. There's been some talk of GPU acceleration for decode, and there are patches for AVX as well. They should help.

Could you elaborate on this? I am more than interested.

In terms of hardware, a huge CPU with many cores will help too.

Any recommendations?

Thanks again!

@jcupitt
Copy link
Member

jcupitt commented May 20, 2019

You wrote:

real 0m56.622s
user 0m57.984s

That's curious, you should see user about 2x real on a two-core machine. What processor are you using?

libvips has quite a high set-up cost for pipelines, but when set up, runs quickly. If you are generating images of over 512x512 or so, it's fine, but for 32x32 patches or whatever, pipeline cost will dominate.

Here's a benchmark:

#!/usr/bin/python3
  
import sys
import pyvips

image = pyvips.Image.new_from_file(sys.argv[1])

patch_size = 64
n_across = image.width // patch_size
n_down = image.height // patch_size
x_max = n_across - 1
y_max = n_down - 1

n_patches = 0
for y in range(0, n_down):
    print("row {} ...".format(y))
    for x in range(0, n_across):
        patch = image.crop(x * patch_size, y * patch_size,
                           patch_size, patch_size)
        patch_f = patch.fliphor()

        patch0 = patch.write_to_memory()
        patch1 = patch.rot90().write_to_memory()
        patch2 = patch.rot180().write_to_memory()
        patch3 = patch.rot270().write_to_memory()

        patch4 = patch_f.write_to_memory()
        patch5 = patch_f.rot90().write_to_memory()
        patch6 = patch_f.rot180().write_to_memory()
        patch7 = patch_f.rot270().write_to_memory()

        n_patches += 8

print("{} patches generated".format(n_patches))

So this is building a new pipeline for every patch, executing it, and shutting it down. On this machine, it generates 12,512 patches in 41s from CMU-1-Small-region.svs.

Here's a version using the new libvips 8.8 fetch feature:

#!/usr/bin/python3
  
import sys
import pyvips

def fetch(region, patch_size, x, y):
    return region.fetch(patch_size * x, patch_size * y, patch_size, patch_size)

image[0] = pyvips.Image.new_from_file(sys.argv[1])
image[1] = image0.rot90()
image[2] = image0.rot180()
image[3] = image0.rot270()

image[4] = image[0].fliphor()
image[5] = image[4].rot90()
image[6] = image[4].rot180()
image[7] = image[4].rot270()

reg = [pyvips.Region.new(x) for x in image]

patch_size = 64
n_across = image0.width // patch_size
n_down = image0.height // patch_size
x_max = n_across - 1
y_max = n_down - 1

n_patches = 0
for y in range(0, n_down):
    print("row {} ...".format(y))
    for x in range(0, n_across):
        patch0 = fetch(reg[0], patch_size, x, y)
        patch1 = fetch(reg[1], patch_size, y_max - y, x)
        patch2 = fetch(reg[2], patch_size, x_max - x, y_max - y)
        patch3 = fetch(reg[3], patch_size, y, x_max - x)

        patch4 = fetch(reg[4], patch_size, x_max - x, y)
        patch5 = fetch(reg[5], patch_size, y_max - y, x_max - x)
        patch6 = fetch(reg[6], patch_size, x, y_max - y)
        patch7 = fetch(reg[7], patch_size, y, x)

        n_patches += 8

print("{} patches generated".format(n_patches))

Now this is making 8 pipelines and fetching patches from them. It's not rebuilding the pipeline each time, so it's much quicker. This version makes 12,512 patches in 0.5s from CMU-1-Small-region.svs.

@jcupitt
Copy link
Member

jcupitt commented May 20, 2019

re. OpenJPEG, have a look at at the mailing list, there has been some discussion of AVX and OpenCL acceleration, plus some branches with code.

@nd26
Copy link
Author

nd26 commented May 21, 2019

That's curious, you should see user about 2x real on a two-core machine. What processor are you using?

description: CPU
          product: Intel(R) Core(TM) i3-8350K CPU @ 4.00GHz

libvips has quite a high set-up cost for pipelines, but when set up, runs quickly. If you are generating images of over 512x512 or so, it's fine, but for 32x32 patches or whatever, pipeline cost will dominate.

I generate patches with a resolution that is a multiple of 224x224. 448x448 and 896x896 pixels are the most common. It's a hyperparameter.

Here's a version using the new libvips 8.8 fetch feature:

I've ran into the "libvips too old" error while trying to test this. Installation seems to have went well, but not integrated into anaconda. I've downloaded "vips-8.8.0-rc3.tar.gz" from https://github.com/libvips/libvips/releases/tag/v8.8.0-rc3 and followed the typical configure-make-install procedure with no errors. Do you know why this might be the case?

re. OpenJPEG, have a look at at the mailing list, there has been some discussion of AVX and OpenCL acceleration, plus some branches with code.

Where can I find their mailing list? Sorry, I couldn't find it.

@jcupitt
Copy link
Member

jcupitt commented May 21, 2019

I'd guess you have several libvipses installed and it's picking up the wrong one. Try:

import logging
logging.basicConfig(level=logging.DEBUG)
import pyvips

And you'll see some diagnostics.

ojg mailing list:

https://groups.google.com/forum/#!forum/openjpeg/join

@nd26
Copy link
Author

nd26 commented Jul 16, 2019

Hi @jcupitt,

I just had the chance to get back to this. I've managed to run your code, and indeed, the latter (with fetch) is impressively faster. Correct me if I am wrong, but I think the following way is the only way I can take advantage of the fetch method for my problem.

If I am not mistaken, the speedup is due to having the pipelines set up once, instead of every time a patch is needed. However, in my application, a single patch is retrieved from a single whole slide image. This happens around 10 (.. the batch size) times a second from different whole slide images. I cannot know a priori which patches will need to be extracted, and since patches are extracted sequentially from all whole slide images, the only way I can see the above working is to have a pipeline setup for each whole slide image before the training commences, and then when a patch needs to be extracted, to simply retrieve the pipeline and use the fetch function instead of "read_region". Is this correct?

Empirically, do you think this will speedup the reading operation? Note that the patches have 896 x 896 pixels.

@jcupitt
Copy link
Member

jcupitt commented Jul 16, 2019

Those are big patches!

How many slide images will you be working from?

@nd26
Copy link
Author

nd26 commented Jul 16, 2019

Those are big patches!

Indeed!

How many slide images will you be working from?

Around 650!

@jcupitt
Copy link
Member

jcupitt commented Jul 16, 2019

650 isn't so many. I would open them all (ie. make 650 pipelines) and then use fetch to pull out tiles.

You could try using crop as well, but fetch would probably be quicker.

Hopefully you have loads of memory!

@nd26
Copy link
Author

nd26 commented Jul 16, 2019

650 isn't so many. I would open them all (ie. make 650 pipelines) and then use fetch to pull out tiles.

You could try using crop as well, but fetch would probably be quicker.

Hopefully you have loads of memory!

Alright, I will do that once the current experiment ends. I have 32GB RAM, and 24GB GPU VRAM. Hopefully, it will be enough.

By the way, you said:

That's curious, you should see user about 2x real on a two-core machine. What processor are you using?

Is this an issue/suboptimality given the following CPU?

description: CPU
product: Intel(R) Core(TM) i3-8350K CPU @ 4.00GHz

@jcupitt
Copy link
Member

jcupitt commented Jul 16, 2019

Try --vips-progress, it gives some more info about computation:

$ time vips avg CMU-2.svs --vips-progress
vips temp-1: 78000 x 30462 pixels, 8 threads, 128 x 128 tiles, 256 lines in buffer
vips temp-1: done in 26.8s          
230.618127

real	0m26.848s
user	2m3.476s
sys	0m1.818s

So I have 4 cores, 8 threads and see ~4x speedup.

@andreped
Copy link

@nd26 did you manage to get any boost in reading performance using pyvips instead of openslide in the batch generator? I'm studying the same thing right now, but I'm not finding openslide to be that slow tbh. Of course storing patches in a hdf5-file and reading from that directly is faster, but not more than about 2x. And isn't openslide already using libvips?

@Abhinav-Telukunta
Copy link

Abhinav-Telukunta commented Mar 24, 2021

You wrote:

real 0m56.622s
user 0m57.984s

That's curious, you should see user about 2x real on a two-core machine. What processor are you using?

libvips has quite a high set-up cost for pipelines, but when set up, runs quickly. If you are generating images of over 512x512 or so, it's fine, but for 32x32 patches or whatever, pipeline cost will dominate.

Here's a benchmark:

#!/usr/bin/python3
  
import sys
import pyvips

image = pyvips.Image.new_from_file(sys.argv[1])

patch_size = 64
n_across = image.width // patch_size
n_down = image.height // patch_size
x_max = n_across - 1
y_max = n_down - 1

n_patches = 0
for y in range(0, n_down):
    print("row {} ...".format(y))
    for x in range(0, n_across):
        patch = image.crop(x * patch_size, y * patch_size,
                           patch_size, patch_size)
        patch_f = patch.fliphor()

        patch0 = patch.write_to_memory()
        patch1 = patch.rot90().write_to_memory()
        patch2 = patch.rot180().write_to_memory()
        patch3 = patch.rot270().write_to_memory()

        patch4 = patch_f.write_to_memory()
        patch5 = patch_f.rot90().write_to_memory()
        patch6 = patch_f.rot180().write_to_memory()
        patch7 = patch_f.rot270().write_to_memory()

        n_patches += 8

print("{} patches generated".format(n_patches))

So this is building a new pipeline for every patch, executing it, and shutting it down. On this machine, it generates 12,512 patches in 41s from CMU-1-Small-region.svs.

Here's a version using the new libvips 8.8 fetch feature:

#!/usr/bin/python3
  
import sys
import pyvips

def fetch(region, patch_size, x, y):
    return region.fetch(patch_size * x, patch_size * y, patch_size, patch_size)

image[0] = pyvips.Image.new_from_file(sys.argv[1])
image[1] = image0.rot90()
image[2] = image0.rot180()
image[3] = image0.rot270()

image[4] = image[0].fliphor()
image[5] = image[4].rot90()
image[6] = image[4].rot180()
image[7] = image[4].rot270()

reg = [pyvips.Region.new(x) for x in image]

patch_size = 64
n_across = image0.width // patch_size
n_down = image0.height // patch_size
x_max = n_across - 1
y_max = n_down - 1

n_patches = 0
for y in range(0, n_down):
    print("row {} ...".format(y))
    for x in range(0, n_across):
        patch0 = fetch(reg[0], patch_size, x, y)
        patch1 = fetch(reg[1], patch_size, y_max - y, x)
        patch2 = fetch(reg[2], patch_size, x_max - x, y_max - y)
        patch3 = fetch(reg[3], patch_size, y, x_max - x)

        patch4 = fetch(reg[4], patch_size, x_max - x, y)
        patch5 = fetch(reg[5], patch_size, y_max - y, x_max - x)
        patch6 = fetch(reg[6], patch_size, x, y_max - y)
        patch7 = fetch(reg[7], patch_size, y, x)

        n_patches += 8

print("{} patches generated".format(n_patches))

Now this is making 8 pipelines and fetching patches from them. It's not rebuilding the pipeline each time, so it's much quicker. This version makes 12,512 patches in 0.5s from CMU-1-Small-region.svs.

@jcupitt can you write this same code in C language for generating patches using libvips functions: vips_image_new_from_file(),vips_crop() ,vips_image_write_to_file()?

@jcupitt
Copy link
Member

jcupitt commented Mar 24, 2021

It's pretty easy, just build the pipeline, then use vips_region_fetch() to pull pixels out of it:

https://libvips.github.io/libvips/API/current/VipsRegion.html#vips-region-fetch

@Abhinav-Telukunta
Copy link

Abhinav-Telukunta commented Mar 25, 2021

It's pretty easy, just build the pipeline, then use vips_region_fetch() to pull pixels out of it:

https://libvips.github.io/libvips/API/current/VipsRegion.html#vips-region-fetch

@jcupitt Thanks for helping me out. I have used libvips to generate patches. But,when I included the function vips_region_fetch, it says implicit declaration of vips_region_fetch function.
Here is the error:

test.c:16:18: warning: implicit declaration of function ‘vips_region_fetch’; did you mean ‘vips_region_black’? [-Wimplicit-function-declaration]
   VipsPel *patch=vips_region_fetch(region,0,0,width,height,width*height);
                  ^~~~~~~~~~~~~~~~~
                  vips_region_black
test.c:16:18: warning: initialization makes pointer from integer without a cast [-Wint-conversion]
/tmp/cc6OF4Sh.o: In function `main':
/content/test.c:16: undefined reference to `vips_region_fetch'
collect2: error: ld returned 1 exit status

In which header file vips_region_fetch() function is defined?

@jcupitt
Copy link
Member

jcupitt commented Mar 25, 2021

It's part of the standard API from libvips 8.8, I think. Perhaps you have an old version installed?

@Abhinav-Telukunta
Copy link

I have installed libvips 8.4.5.

@jcupitt
Copy link
Member

jcupitt commented Mar 25, 2021

You'll need to update -- your libvips is five years old.

@Abhinav-Telukunta
Copy link

Yeah!! I have updated it now to libvips 8.8 and its working fine. Thank you.

@wenqf11
Copy link

wenqf11 commented Feb 20, 2022

@jcupitt I applied fetch method to 4 types of WSI for 2048*2048 patch, mrxs, tiff, ndpi, svs, they both use JPEG compression schema. their fetch time for one single patch as follows:
mrxs: average = 0.271734s
tiff: average = 0.852996s
ndpi: average = 0.318353s
svs: average = 0.329577s
Why is tiff slower than the others?

@jcupitt
Copy link
Member

jcupitt commented Feb 20, 2022

Hi @wenqf11,

fetch is quick for small tiles and random reads, especially if you have several pipelines you want to run at once (the example above generates the 8 rotates and flips for each 32x32 patch). This is often what you'd use to train a classifier. If you want large tiles (eg. your 2048x2048), then crop is much quicker.

This is because of how they work. fetch skips all the libvips pipeline setup and teardown, so it can help if only a small number of pixels are going to be computed. But because there's no pipeline, it can't do much parallelization. crop builds a full pipeline for each tile, so start and stop are slower, but it can compute pixels in parallel.

Here's a benchmark:

#!/usr/bin/python3

import random
import time
import sys

import pyvips

if len(sys.argv) != 4:
    print("usage: ./fetch-vs-crop.py IMAGE SIZE N-TILES")

size = int(sys.argv[2])
n_tiles = int(sys.argv[3])
print(f"{n_tiles} tiles, each {size}x{size} pixels")

image = pyvips.Image.new_from_file(sys.argv[1]) 

print(f"random read")

start = time.time()
for _ in range(n_tiles):
    x = random.randint(0, image.width - size)
    y = random.randint(0, image.height - size)
    tile = image.crop(x, y, size, size)
    # necessary to force evaluation
    avg = tile.avg()
end = time.time()
print(f"crop: time per tile = {1000 * (end - start) / n_tiles:.2f}ms")

start = time.time()
region = pyvips.Region.new(image)
for _ in range(n_tiles):
    x = random.randint(0, image.width - size)
    y = random.randint(0, image.height - size)
    p = region.fetch(x, y, size, size)
end = time.time()
print(f"fetch: time per tile = {1000 * (end - start) / n_tiles:.2f}ms")

print(f"sequential read")

image = pyvips.Image.new_from_file(sys.argv[1])

start = time.time()
tile_number = 0
# most tile formats are column-major internally
for y in range(0, image.height - size, size):
    for x in range(0, image.width - size, size):
        tile = image.crop(x, y, size, size)
        # necessary to force evaluation
        avg = tile.avg()

        tile_number += 1
        if tile_number > n_tiles:
            break

    if tile_number > n_tiles:
        break
end = time.time()
print(f"crop: time per tile = {1000 * (end - start) / n_tiles:.2f}ms")

start = time.time()
tile_number = 0
region = pyvips.Region.new(image)
for y in range(0, image.height - size, size):
    for x in range(0, image.width - size, size):
        x = random.randint(0, image.width - size)
        y = random.randint(0, image.height - size)
        p = region.fetch(x, y, size, size)

        tile_number += 1
        if tile_number > n_tiles:
            break

    if tile_number > n_tiles:
        break
end = time.time()
print(f"fetch: time per tile = {1000 * (end - start) / n_tiles:.2f}ms")

I see:

$ ./fetch-vs-crop.py ~/pics/openslide/CMU-1.svs 32 1000
1000 tiles, each 32x32 pixels
random read
crop: time per tile = 2.70ms
fetch: time per tile = 2.01ms
sequential read
crop: time per tile = 1.94ms
fetch: time per tile = 1.92ms

So for one small tile (no flips and rotates), fetch is quicker for random reads. They are the same speed for sequential reads, since the cache helps both equally. fetch would be faster if there were flips as well.

The situation changes with large tiles:

$ ./fetch-vs-crop.py ~/pics/openslide/CMU-1.svs 2048 100
100 tiles, each 2048x2048 pixels
random read
crop: time per tile = 21.40ms
fetch: time per tile = 130.32ms
sequential read
crop: time per tile = 19.11ms
fetch: time per tile = 129.85ms

Now fetch is very slow because it can't parallelize. There's little difference between sequential and random, since the libvips pixel caches won't help.

tiff is slower for parallel JPEG read since the libtiff JPEG decoder is single-threaded.

@wenqf11
Copy link

wenqf11 commented Feb 20, 2022

@jcupitt ok, thanks a lot!

@lafith
Copy link

lafith commented Jan 24, 2023

patch0 = patch.write_to_memory()
        patch1 = patch.rot90().write_to_memory()
        patch2 = patch.rot180().write_to_memory()
        patch3 = patch.rot270().write_to_memory()

        patch4 = patch_f.write_to_memory()
        patch5 = patch_f.rot90().write_to_memory()
        patch6 = patch_f.rot180().write_to_memory()
        patch7 = patch_f.rot270().write_to_memory()

Hi,
I am trying to use this code for Deep Learning Training. How to convert these patches, say patch0, to a numpy array?

@jcupitt
Copy link
Member

jcupitt commented Jan 24, 2023

Hi @lafith,

Please open a new issue for new questions -- it'll make it easier for other people to find answers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants