[Vulkan] Unexpected creation of buffer larger than 4GB failing at runtime. #13196

pashu123 · 2023-04-20T16:16:52Z

What happened?

[VULKAN] ! Validation Error: [ VUID-vkAllocateMemory-pAllocateInfo-01713 ] Object 0: handle = 0x55ba3e384b30, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0xe9a2b96f | vkAllocateMemory: attempting to allocate 1698693120 bytes from heap 2,but size of that heap is only 257949696 bytes. The Vulkan spec states: pAllocateInfo->allocationSize must be less than or equal to VkPhysicalDeviceMemoryProperties::memoryHeaps[memindex].size where memindex = VkPhysicalDeviceMemoryProperties::memoryTypes[pAllocateInfo->memoryTypeIndex].heapIndex as returned by vkGetPhysicalDeviceMemoryProperties for the VkPhysicalDevice that device was created from (https://vulkan.lunarg.com/doc/view/1.3.239.0/linux/1.3-extensions/vkspec.html#VUID-vkAllocateMemory-pAllocateInfo-01713)

Steps to reproduce your issue

Model IR: https://storage.googleapis.com/shark-public/prashant/unet_upcast/unet.mlir

Compile command:
iree-compile --iree-input-type=none --iree-hal-target-backends=vulkan -iree-vulkan-target-triple=ampere-rtx3090-linux --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-flow-detach-elementwise-from-named-ops,iree-flow-convert-1x1-filter-conv2d-to-matmul,iree-preprocessing-convert-conv2d-to-img2col,iree-preprocessing-pad-linalg-ops{pad-size=32}))' unet_check.mlir -o out.vmfb

Run command:
iree-run-module --device=vulkan --function=forward --input=2x4x96x96xf16=0.5 --input=1xf16=1.0 --input=2x77x1024xf16=0.5 --module=out.vmfb --vulkan_debug_utils=true --vulkan_debug_verbosity=4 --vulkan_validation_layers=true

What component(s) does this issue relate to?

Runtime

Version information

No response

Additional context

No response


allieculp · 2023-04-20T17:43:40Z

@antiagainst Please help to assign priority here.

powderluv · 2023-04-24T04:09:53Z

@antiagainst / @benvanik any thoughts on how we should handle this? We run into this when we go from the Stable Diffusion 512x512 model to the SD 768 base model, since the weights are larger (@pashu123 ?). Running the base SD 512x512 model at a 768x768 resolution works OK.

We will soon have to support 1024x1024 models (https://stable-diffusion-art.com/sdxl-beta/), so any guidance on this issue is appreciated.

antiagainst · 2023-05-03T18:35:31Z

This is actually a different issue from the one we discussed internally about the 4GB storage buffer allocation limit. This reads like we are allocating a larger-than-allowed device local + host visible buffer, against a heap limit of 257,949,696 bytes. I recall that some previous-generation NVIDIA cards had such a 256MB limit.
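As a quick sanity check on the numbers in the reported validation message, here is a minimal Python sketch of the rule it cites (allocationSize must not exceed the size of the target heap). The function name `allocation_fits_heap` is made up for illustration and is not part of Vulkan or IREE; the two sizes are copied from the error text.

```python
# Sizes copied from the validation message in the original report.
REQUESTED_BYTES = 1_698_693_120  # "attempting to allocate ... bytes"
HEAP_SIZE_BYTES = 257_949_696    # "size of that heap is only ... bytes"

MIB = 1024 * 1024


def allocation_fits_heap(allocation_size: int, heap_size: int) -> bool:
    """Hypothetical helper mirroring VUID-vkAllocateMemory-pAllocateInfo-01713:
    an allocation is valid only if it is no larger than its heap."""
    return allocation_size <= heap_size


print(f"requested: {REQUESTED_BYTES / MIB:.0f} MiB")  # 1620 MiB
print(f"heap size: {HEAP_SIZE_BYTES / MIB:.0f} MiB")  # 246 MiB
print("fits:", allocation_fits_heap(REQUESTED_BYTES, HEAP_SIZE_BYTES))  # False
```

The heap works out to exactly 246 MiB, which is consistent with the small device local + host visible heap described above.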

For context, note that 4GB is a specification limit on how large a storage buffer can be. Although VkMemoryAllocateInfo uses VkDeviceSize (uint64_t) for allocationSize, specifying storage buffer descriptors requires VkDescriptorBufferInfo, whose range field must be less than or equal to maxStorageBufferRange. maxStorageBufferRange is a member of VkPhysicalDeviceLimits and has type uint32_t, which caps it at 4GB. To really have storage buffers larger than 4GB, we'd need to push for a spec change.

For storage buffers in this particular case, I checked the IR generated at the stream level. ScheduleAllocation packs all transient buffers into one allocation (https://github.com/openxla/iree/blob/a88bfe9167da4832725f2efc26efbabc75138588/compiler/src/iree/compiler/Dialect/Stream/Transforms/ScheduleAllocation.cpp#L1007), causing us to see a large 6,926,017,728-byte buffer: https://gist.github.com/antiagainst/07e3bffc314ace011f9175fc3182dab6. Sorting the transient buffers packed together, we can see the largest one was 3,397,386,240 bytes, so that's less than the 4GB threshold (though not far from it). So for this case, we should still be good w.r.t. transient storage buffers, given that I'd assume slices used for descriptors are still within the 4GB range.
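The arithmetic behind that reasoning can be sketched in a few lines of Python. This is only an illustration: `range_is_bindable` is a hypothetical helper, and the default limit assumes a device that reports the largest value a uint32_t can hold; a real device may report a smaller maxStorageBufferRange.

```python
# Largest value representable in a uint32_t, i.e. the ceiling for any
# maxStorageBufferRange a device can report.
UINT32_MAX = 2**32 - 1  # 4294967295

# Sizes quoted in the comment above.
PACKED_TRANSIENT_BYTES = 6_926_017_728   # all transients packed together
LARGEST_TRANSIENT_BYTES = 3_397_386_240  # largest single transient


def range_is_bindable(range_bytes: int,
                      max_storage_buffer_range: int = UINT32_MAX) -> bool:
    """Hypothetical check: a descriptor range is valid only if it does not
    exceed the device's maxStorageBufferRange."""
    return range_bytes <= max_storage_buffer_range


# Binding the whole packed allocation as one range would exceed the limit.
print(range_is_bindable(PACKED_TRANSIENT_BYTES))   # False
# Binding only the slice for the largest transient stays within it.
print(range_is_bindable(LARGEST_TRANSIENT_BYTES))  # True
# Remaining headroom for that slice, in bytes.
print(UINT32_MAX - LARGEST_TRANSIENT_BYTES)
```

So the packed allocation is only a problem if a descriptor covers the whole buffer rather than the slice a dispatch actually uses.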

benvanik · 2023-05-03T18:40:30Z

#stream.resource_config can be used to control the packing
e.g.

#splitResourceConstantsConfig = #stream.resource_config<{
  max_allocation_size = 16,
  min_buffer_offset_alignment = 16,
  max_buffer_range = 1073741824,
  min_buffer_range_alignment = 16,
  index_bits = 32
}>

you can set this as a compiler flag: --iree-stream-resource-max-allocation-size=
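To illustrate what a maximum allocation size means for packing, here is a rough Python sketch. It is not IREE's packing algorithm: `pack_into_allocations` is a hypothetical helper that places buffers in order and opens a new allocation whenever the next buffer would cross the cap. Only the first size below comes from this thread; the other two are invented so that the total matches the 6,926,017,728-byte packed buffer.

```python
def pack_into_allocations(sizes, max_allocation_size, alignment=16):
    """Hypothetical in-order packing: returns a list of allocations, each a
    list of (offset, size) slices, with no allocation exceeding the cap."""
    allocations = []
    current, offset = [], 0
    for size in sizes:
        if size > max_allocation_size:
            raise ValueError(f"buffer of {size} bytes exceeds the cap")
        # Round the running offset up to the required alignment.
        aligned = -(-offset // alignment) * alignment
        if current and aligned + size > max_allocation_size:
            # The buffer would cross the cap: close this allocation.
            allocations.append(current)
            current, aligned = [], 0
        current.append((aligned, size))
        offset = aligned + size
    if current:
        allocations.append(current)
    return allocations


# First size is from the thread; the other two are made-up placeholders.
sizes = [3_397_386_240, 2_000_000_000, 1_528_631_488]
for i, allocation in enumerate(pack_into_allocations(sizes, 2**32 - 1)):
    print(i, allocation)
```

With a cap of 2**32 - 1 bytes, this splits the three buffers into two allocations instead of a single allocation larger than 4GB.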

in this case we shouldn't be allocating either transients or constants as host-visible - that sounds like a bug if we are - only staging buffers and external buffers should be host visible (today)

antiagainst · 2023-05-03T18:46:14Z

@benvanik: IIUC #stream.resource_config only controls PackAllocations, but not ScheduleAllocations where the transient buffers are initially packed together? We may need to connect it to ScheduleAllocations too.

benvanik · 2023-05-03T18:46:25Z

(also, would be good to look into the model - needing a 3.3gb transient tensor is weird unless this is training)

benvanik · 2023-05-03T18:48:41Z

ah yeah, it's mostly used for constants today - doing it for allocations is harder as they're dynamic - I think the imminent fix here is to make sure this memory is not host-visible (it shouldn't be)

antiagainst · 2023-05-03T18:51:38Z

Yeah. There are actually two issues mixed together. This particular issue's title is about the 4GB limit, but the validation error it reports is not. We were discussing another issue internally that is about the 4GB limit, with the following validation error:

[VULKAN] ! Validation Error: [ VUID-VkWriteDescriptorSet-descriptorType-00333 ] Object 0: handle = 0x980f360000000011, type = VK_OBJECT_TYPE_DESCRIPTOR_SET_LAYOUT; | MessageID = 0xf2fc081c | vkCmdPushDescriptorSetKHR() VkWriteDescriptorSet[1] failed update validation: Write update to Push Descriptors defined with VkDescriptorSetLayout 0x980f360000000011[] binding #1 failed with error message: Attempted write update to buffer descriptor failed due to: For buffer VkBuffer 0x8f226e000000025f[] VkDescriptorBufferInfo range is 6926017728 which is greater than this device's maxStorageBufferRange (4294967295). The Vulkan spec states: If descriptorType is VK_DESCRIPTOR_TYPE_STORAGE_BUFFER or VK_DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC, the range member of each element of pBufferInfo, or the effective range if range is VK_WHOLE_SIZE, must be less than or equal to VkPhysicalDeviceLimits::maxStorageBufferRange (https://vulkan.lunarg.com/doc/view/1.3.243.0/linux/1.3-extensions/vkspec.html#VUID-VkWriteDescriptorSet-descriptorType-00333)

So it's confusing here.

antiagainst · 2023-05-03T19:03:22Z

For the validation error originally reported in this issue, I cannot find allocations with a size of 1698693120 when --compile-to=hal. So I'm not sure this is still an issue. @pashu123 please double check and see whether that's still relevant.

pashu123 · 2023-05-29T16:59:00Z

Sorry for the confusion - I also see the same error on A100 [VULKAN] ! Validation Error: [ VUID-VkWriteDescriptorSet-descriptorType-00333 ] Object 0: handle = 0x967dd1000000000e, type = VK_OBJECT_TYPE_DESCRIPTOR_SET_LAYOUT; | MessageID = 0xf2fc081c | vkCmdPushDescriptorSetKHR() VkWriteDescriptorSet[1] failed update validation: Write update to Push Descriptors defined with VkDescriptorSetLayout 0x967dd1000000000e[] binding #1 failed with error message: Attempted write update to buffer descriptor failed due to: For buffer VkBuffer 0x891e2c0000000284[] VkDescriptorBufferInfo range is 6937997888 which is greater than this device's maxStorageBufferRange (4294967295). The Vulkan spec states: If descriptorType is VK_DESCRIPTOR_TYPE_STORAGE_BUFFER or VK_DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC, the range member of each element of pBufferInfo, or the effective range if range is VK_WHOLE_SIZE, must be less than or equal to VkPhysicalDeviceLimits::maxStorageBufferRange (https://vulkan.lunarg.com/doc/view/1.3.243.0/linux/1.3-extensions/vkspec.html#VUID-VkWriteDescriptorSet-descriptorType-00333)

pashu123 · 2023-05-29T17:03:49Z

I was originally running the above on an RTX 3090, which is why the two reports show different validation errors.

pashu123 · 2023-05-29T17:04:45Z

@antiagainst Let me know if you need more info.

powderluv · 2023-06-01T23:19:54Z

Can we use https://gpuopen-librariesandsdks.github.io/VulkanMemoryAllocator/html/enabling_buffer_device_address.html ? There were some references to it from KhronosGroup/Vulkan-Docs#1016

pashu123 added the bug 🐞 Something isn't working label Apr 20, 2023

github-project-automation bot added this to (Deprecated) IREE Apr 20, 2023

github-project-automation bot moved this to Inbox in (Deprecated) IREE Apr 20, 2023

ScottTodd added hal/vulkan Runtime Vulkan GPU HAL backend codegen/spirv SPIR-V code generation compiler backend labels Apr 20, 2023

powderluv assigned antiagainst Apr 20, 2023

powderluv added this to the Collab: Nod.ai milestone Apr 20, 2023

allieculp moved this from Inbox to Needs Scheduling in (Deprecated) IREE Apr 20, 2023

benvanik changed the title ~~[VULKAN] Creation of buffer larger than 4GB.~~ [Vulkan] Unexpected creation of buffer larger than 4GB failing at runtime. May 3, 2023

powderluv mentioned this issue Jun 5, 2023

[vulkan] Support VK_KHR_buffer_device_address and PhysicalStorageBuffer #13945

Open
