Remove temporary buffer from AddConvexPolyFilled #1811

G-glop · 2018-05-13T08:15:06Z

Math stays the same, only other change is the early exit, to prevent reading a zero sized array when initializing n0.

When i have it in my head: could it also auto detect winding order?
Currently it assumes clockwise winding, which is ok, but isn't documented anywhere.

Long overdue update:

Firt i want to say that your code has pretty much optimal performace, as in number of operations and throughput. That's why my first proposal failed abysmally, it both needed to compute one more sqrt, which matters when half of the inputs have 3 or 4 points, and was painfully serial.

So an intermediate buffer is necassary. But what about alloca? Well the key realization is that while another buffer is needed, more memory is not. That's what i did last year, re-use the vertex buffer to also store the normals. What i didn't do, and which prevented me from matching the original performace, was this:
ImVec2* temp_normals = (ImVec2*)((size_t)(_VtxWritePtr + vtx_count) & ~(alignof(ImVec2) - 1)) - points_count;
Which is again reusing the vertex buffer, but now it's nice, transparent and tightly packed.

My question to you is: "Should ImGui do this type of memory aliasing?".
(and, ugh, how to do it in a portable way - just say that ImVec2 is always 4 byte aligned?)

I also tried reusing the vertex buffer in the same way, and just inlining all the vector operators in the original.

Performance testing:
Using example_null as a base, a loop of NewFrame, 3x ShowDemoWindow, Render. Additonaly ImDrawList::AddPolyline and ImFont::RenderText are stubbed out to amplify the effect of AddConvexPolyFilled. The loop runs until 5 seconds have passed (batched to avoid the overhead of getting current time). Both Visual Studio 2017 and 2015 are tested (v141 and v140)

Signed,
an annoying GitHub user.

ocornut · 2018-05-13T13:37:24Z

Hello,

Thanks for submitting this.
What are you trying to solve? Assuming it is meant to be a performance improvement: Do you have a process and measurements to share comparing the speed of this version and the current one? I typically compare performance on both fully optimized and debug builds.

EDIT What was your intent with removing the alloca() ? alloca is probably one of the fastest possible thing one could possibly to do (it's 1 addition to increment the stack pointer) the data is in the stack, and the second loop will read back from data freshly written to the cache. However your single loop would also probably be a beneficial thing to explore.

As a crude performance test, I have bound the two variants of the code on the CTRL key (adding if (ImGui::GetIO().KeyCtrl) {} in that function to select the algorithm) and modified the ShowExampleAppCustomRendering() to keep only the filled primitives and display everything 200 times, then measured average time over 6 seconds:

200 iterations of rendering all the filled shapes:

Win32 Release Optimized:
Before: 1.48 ~ 1.50 ms
After: 1.66 ~ 1.69 ms (slower)

Win64 Release Optimized:
Before: 1.21 ~ 1.24 ms
After: 1.35 ~ 1.38 ms (slower)

Win32 Debug Non-Optimized (with MSVC debug cruft)
Before: 7.27 ~ 7.31 ms
After: 7.70 ~ 7.74 ms (slower)

etc.
So this would need to be reworked because right now unless I messed up my tests this is slower every time.

avoid the modulo in the loop like the original code did
for the Debug perf, avoiding calling ImVec2() constructors would probably also be beneficial
the test for the last point should be moved outside the loop (the [2] [4] [5] operators would need to be offset probably by points_count >= 2 ? -9 : -6 (untested)
etc.

Also smaller/other things that would need be fixed later:

use imgui coding style (braces, spaces)
squash the dozens commits into one (14 commits?!)

Thanks,
Omar

…ockwise order path. (#1811)

ocornut · 2018-05-13T13:58:15Z

About this:

When i have it in my head: could it also auto detect winding order?
Currently it assumes clockwise winding, which is ok, but isn't documented anywhere.

I have added a comment about anti-aliased filling requiring a path in clockwise order.
I don't think we can detect clockwiseness without a perf hit. We could eventually provide different primitives to either fill a CCW path or optional detect and apply AA properly in all situation but I don't think there is a need for it now.

G-glop · 2018-05-14T09:38:09Z

Removing alloca is mostly for compatibility. (it's only other use is in AddPolyline, which can be modified in the same way)

Performance shouldn't have changed. Doing register-register ops ins't a win when pipelining hides the latency of reading memory.

I made a benchmark of processing a rounded rectangle. (not yet pushed) And MSVC's sampling profiler shows 10% of the whole run time is spent on if(scale < 100.0f), but only in some configurations, otherwise performance is identical.

But i'm only measuring the pure run time of the function itself, maybe the fact that the fill triangles are interleaved with the outline degrades performance? If so, it's easy to de-interleave them, it also may be worthwhile to also use a different algorithm that doesn't produce long, thin triangles.

ocornut · 2018-05-14T09:45:58Z

Removing alloca is mostly for compatibility. (it's only other use is in AddPolyline, which can be modified in the same way)

Which compatibility issue do you have with alloca ?
I don't mind changing the rendering function especially if we get a small boost or a simplification, but if it's for compatibility we need to also state what the compatibility issue is.

But i'm only measuring the pure run time of the function itself, maybe the fact that the fill triangles are interleaved with the outline degrades performance? If so, it's easy to de-interleave them,

I guess it's more reasonable and representative of real use to keep them interleaved. Regardless, a simple manufactured/looping test would probably fit in the instruction cache anyway even with multiple functions running.

it also may be worthwhile to also use a different algorithm that doesn't produce long, thin triangles.

What's the problem with long thin triangles?

Mildly unrelated, but if you fancy looking at it, the non-AA thick strokes are broken (see #288, one of the oldest PR alive, it was submitted just at the time we switched to AA primitives).

G-glop · 2018-05-29T11:03:56Z

Sorry for taking so long, i got sidetracked trying (and failing) to install vTune, to see what's really slowing it down.

But reading the disassembly i think i know what's going on. In the old version the normal computing loop got unrolled to call sqrtf 5 times each iteration (nevermind that sqrtf is a wrapper around the AVX sqrtf instruction which computes 8 sqrtfs at once). Meanwhile my version has a critical path of computing the normals sequentially.

Without doing explicit SIMD (intrinsics/ispc) i don't think i can get a (reliable) performance boost. Maybe a compromise with a small fixed size buffer will yield the best result.

About the compatibility issue; alloca itself is fine (tho someone will be surprised when using a small stack and drawing complex polygons), the issue reolves around the implied __chkstk_ms. Windows only reserves one page above the stack to commit in when the stack grows. This all works fine untilyou have stack frames larger than 4kB, then you could access memory beyond the reserved page and cause a seqfault, so chkstk is inserted to sequentially touch the pages.

MinGW has (had?) a problem with properly handling it, causing link errors.

And lastly the short version why long thin traingles aren't so great:

GPU's first determine which 8x8 tiles a triangle overlaps, in those tiles it applies the same process but for 2x2 groups of pixels, if the traingle is overlapping a group then the shader is executed for all the pixels in that group and pixels outside the triangle are discarded.

Additinaly mobile/newer generation GPUs use tile renderers, which add another level with even larger tiles designed to fit in cache.

long source

So long thin triangles hit a lot of tiles and groups, but contribute little pixels to the output.

G-glop · 2018-05-30T07:58:25Z

Ok, i think i will leave it at this. Using the output as a buffer gets slightly worser performance when it's the only thing in cache and slightly better performace when there's cache pressure.

Should i do the same for AddPolyLine?

ocornut · 2021-11-08T15:45:10Z

Long overdue update:
Firt i want to say that your code has pretty much optimal performace, as in number of operations and throughput. That's why my first proposal failed abysmally, it both needed to compute one more sqrt, which matters when half of the inputs have 3 or 4 points, and was painfully serial.
So an intermediate buffer is necassary. But what about alloca? Well the key realization is that while another buffer is needed, more memory is not. That's what i did last year, re-use the vertex buffer to also store the normals. What i didn't do, and which prevented me from matching the original performace, was this:
ImVec2* temp_normals = (ImVec2*)((size_t)(_VtxWritePtr + vtx_count) & ~(alignof(ImVec2) - 1)) - points_count;
Which is again reusing the vertex buffer, but now it's nice, transparent and tightly packed.

While wandering into old issues I noticed you post an interesting update in Dec 2018 which was made as an edit of the original post therefore without any notification coming. We'll investigate that new idea you posted in your edit.

G-glop · 2021-11-08T20:07:52Z

I forgot to explain the different plotted quantities!

Quantity	Description
initial inline	Original function but with everything inlined, essentially what `9ad3419` did
joined	Compute normals on the fly (initial idea of the PR)
vtx reuse safe	Store normals in `ImDrawVert::pos`, iteration order avoids clobbering
idx reuse	Store normals in a buffer typecast from `_IdxWritePtr`
vtx reuse	Store normals in a buffer typecast from `_VtxWritePtr` and shifted such that writing to `_VtxWritePtr` doesn't clobber it too soon (what the PR currently does, modulo meaningless formatting changes)

More on compatibility: using alloca apparently requires a preprocessor include dance and silencing 3 warnings. Take your pick if it's better or worse than some cheeky buffer reuse, with or without typecasting.

I need to merge master into my branch and re-run the benchmarks. But overall I like the "idx reuse" strategy the most because it can't as easily suffer clobbering. Just generate all the vertices first and then the indexes. It's almost a drop-in replacement even for the much more complicated AddPolyline.

Thank you for taking a look at almost a 3 year idle PR!

rokups · 2022-04-06T13:36:35Z

I profiled it with Tracy. Profiler instrumentation was added only around two loops that use temp_normals. I do not see a meaningful difference here.

This trace is original code.
External trace is code from this PR.

Debug

Release

Unsure how to implement total time metrics, maybe they are different because profiling sessions were not exactly equal, but average execution times are nearly same.

…ts. (#5704, #1811)

commit 5884219 Author: cfillion <cfillion@users.noreply.github.com> Date: Wed Sep 28 23:37:39 2022 -0400 imgui_freetype: Assert if bitmap size exceed chunk size to avoid buffer overflow. (ocornut#5731) commit f2a522d Author: ocornut <omarcornut@gmail.com> Date: Fri Sep 30 15:43:27 2022 +0200 ImDrawList: Not using alloca() anymore, lift single polygon size limits. (ocornut#5704, ocornut#1811) commit cc5058e Author: ocornut <omarcornut@gmail.com> Date: Thu Sep 29 21:59:32 2022 +0200 IO: Filter duplicate input events during the AddXXX() calls. (ocornut#5599, ocornut#4921) commit fac8295 Author: ocornut <omarcornut@gmail.com> Date: Thu Sep 29 21:31:36 2022 +0200 IO: remove ImGuiInputEvent::IgnoredAsSame (revert part of 839c310), will filter earlier in next commit. (ocornut#5599) Making it a separate commit as this leads to much indentation change. commit 9e7f460 Author: ocornut <omarcornut@gmail.com> Date: Thu Sep 29 20:42:45 2022 +0200 Fixed GetKeyName() for ImGuiMod_XXX values, made invalid MousePos display in log nicer. (ocornut#4921, ocornut#456) Amend fd408c9 commit 0749453 Author: ocornut <omarcornut@gmail.com> Date: Thu Sep 29 19:51:54 2022 +0200 Menus, Nav: Fixed not being able to close a menu with Left arrow when parent is not a popup. (ocornut#5730) commit 9f6aae3 Author: ocornut <omarcornut@gmail.com> Date: Thu Sep 29 19:48:27 2022 +0200 Nav: Fixed race condition pressing Esc during popup opening frame causing crash. commit bd2355a Author: ocornut <omarcornut@gmail.com> Date: Thu Sep 29 19:25:26 2022 +0200 Menus, Nav: Fixed using left/right navigation when appending to an existing menu (multiple BeginMenu() call with same names). (ocornut#1207) commit 3532ed1 Author: ocornut <omarcornut@gmail.com> Date: Thu Sep 29 18:07:35 2022 +0200 Menus, Nav: Fixed keyboard/gamepad navigation occasionally erroneously landing on menu-item in parent when the parent is not a popup. (ocornut#5730) Replace BeginMenu/MenuItem swapping g.NavWindow with a more adequate ImGuiItemFlags_NoWindowHoverableCheck. Expecting more subtle issues to stem from this. Note that NoWindowHoverableCheck is not supported by IsItemHovered() but then IsItemHovered() on BeginMenu() never worked: fix should be easy in BeginMenu() + add test is IsItemHovered(), will do later commit d5d7050 Author: ocornut <omarcornut@gmail.com> Date: Thu Sep 29 17:26:52 2022 +0200 Various comments As it turns out, functions like IsItemHovered() won't work on an open BeginMenu() because LastItemData is overriden by BeginPopup(). Probably an easy fix. commit e74a50f Author: Andrew D. Zonenberg <azonenberg@drawersteak.com> Date: Wed Sep 28 08:19:34 2022 -0700 Added GetGlyphRangesGreek() helper for Greek & Coptic glyph range. (ocornut#5676, ocornut#5727) commit d17627b Author: ocornut <omarcornut@gmail.com> Date: Wed Sep 28 17:38:41 2022 +0200 InputText: leave state->Flags uncleared for the purpose of backends emitting an on-screen keyboard for passwords. (ocornut#5724) commit 0a7054c Author: ocornut <omarcornut@gmail.com> Date: Wed Sep 28 17:00:45 2022 +0200 Backends: Win32: Convert WM_CHAR values with MultiByteToWideChar() when window class was registered as MBCS (not Unicode). (ocornut#5725, ocornut#1807, ocornut#471, ocornut#2815, ocornut#1060) commit a229a7f Author: ocornut <omarcornut@gmail.com> Date: Wed Sep 28 16:57:09 2022 +0200 Examples: Win32: Always use RegisterClassW() to ensure windows are Unicode. (ocornut#5725) commit e0330c1 Author: ocornut <omarcornut@gmail.com> Date: Wed Sep 28 14:54:38 2022 +0200 Fonts, Text: Fixed wrapped-text not doing a fast-forward on lines above the clipping region. (ocornut#5720) which would result in an abnormal number of vertices created. commit 4d4889b Author: ocornut <omarcornut@gmail.com> Date: Wed Sep 28 12:42:06 2022 +0200 Refactor CalcWordWrapPositionA() to take on the responsability of minimum character display. Add CalcWordWrapNextLineStartA(), simplify caller code. Should be no-op but incrementing IMGUI_VERSION_NUM just in case. Preparing for ocornut#5720 commit 5c4426c Author: ocornut <omarcornut@gmail.com> Date: Wed Sep 28 12:22:34 2022 +0200 Demo: Fixed Log & Console from losing scrolling position with Auto-Scroll when child is clipped. (ocornut#5721) commit 12c0246 Author: ocornut <omarcornut@gmail.com> Date: Wed Sep 28 12:07:43 2022 +0200 Removed support for 1.42-era IMGUI_DISABLE_INCLUDE_IMCONFIG_H / IMGUI_INCLUDE_IMCONFIG_H. (ocornut#255) commit 73efcec Author: ocornut <omar@miracleworld.net> Date: Tue Sep 27 22:21:47 2022 +0200 Examples: disable GL related warnings on Mac + amend to ignore list. commit a725db1 Author: ocornut <omarcornut@gmail.com> Date: Tue Sep 27 18:47:20 2022 +0200 Comments for flags discoverability + add to debug log (ocornut#3795, ocornut#4559) commit 325299f Author: ocornut <omarcornut@gmail.com> Date: Wed Sep 14 15:46:27 2022 +0200 Backends: OpenGL: Add ability to #define IMGUI_IMPL_OPENGL_DEBUG. (ocornut#4468, ocornut#4825, ocornut#4832, ocornut#5127, ocornut#5655, ocornut#5709) commit 56c3eae Author: ocornut <omarcornut@gmail.com> Date: Tue Sep 27 14:24:21 2022 +0200 ImDrawList: asserting on incorrect value for CurveTessellationTol (ocornut#5713) commit 04316bd Author: ocornut <omarcornut@gmail.com> Date: Mon Sep 26 16:32:09 2022 +0200 ColorEdit3: fixed id collision leading to an assertion. (ocornut#5707) commit c261dac Author: ocornut <omarcornut@gmail.com> Date: Mon Sep 26 14:50:46 2022 +0200 Demo: moved ShowUserGuide() lower in the file, to make main demo entry point more visible + fix using IMGUI_DEBUG_LOG() macros in if/else. commit 51bbc70 Author: ocornut <omarcornut@gmail.com> Date: Mon Sep 26 14:44:26 2022 +0200 Backends: SDL: Disable SDL 2.0.22 new "auto capture" which prevents drag and drop across windows, and don't capture mouse when drag and dropping. (ocornut#5710) commit 7a9045d Author: ocornut <omarcornut@gmail.com> Date: Mon Sep 26 11:55:07 2022 +0200 Backends: WGPU: removed Emscripten version check (currently failing on CI, ensure why, and tbh its redundant/unnecessary with changes of wgpu api nowadays) commit 83a0030 Author: ocornut <omarcornut@gmail.com> Date: Mon Sep 26 10:33:44 2022 +0200 Added ImGuiMod_Shortcut which is ImGuiMod_Super on Mac and ImGuiMod_Ctrl otherwise. (ocornut#456) commit fd408c9 Author: ocornut <omarcornut@gmail.com> Date: Thu Sep 22 18:58:33 2022 +0200 Renamed and merged keyboard modifiers key enums and flags into a same set:. ImGuiKey_ModXXX -> ImGuiMod_XXX and ImGuiModFlags_XXX -> ImGuiMod_XXX. (ocornut#4921, ocornut#456) Changed signature of GetKeyChordName() to use ImGuiKeyChord. Additionally SetActiveIdUsingAllKeyboardKeys() doesn't set ImGuiKey_ModXXX but we never need/use those and the system will be changed in upcoming commits. # Conflicts: # docs/CHANGELOG.txt # imgui.h # imgui_demo.cpp

ocornut · 2024-08-23T12:40:19Z

I was curious to have another look at this today, but it has become very confusing to follow because:

posts have been edited + proposed PR have been changed multiple times, it's hard to make sense of any measurement and comparisons done.
imgui code itself has been changed (we don't use alloca anymore)

As of today if I look at the leftover diff, the only difference becomes:

-- ImVec2* temp_normals = (ImVec2*)alloca(points_count * sizeof(ImVec2))
++ ImVec2* temp_normals = (ImVec2*)((size_t)(_VtxWritePtr + vtx_count) & ~(alignof(ImVec2) - 1)) - points_count;

But this doesn't apply the same to current code.

Considering the last posts from from April 2022 suggested no measurable difference I think we can close this.
At this point after this has been simplified down to moving that temporary buffer merely to a different location, I don't believe it would make any difference. Let me know if you don't agree.

ocornut added a commit that referenced this pull request May 13, 2018

Comments about AddConvexPolyFilled(), PathFillConvex() requiring a cl…

f8ca7f4

…ockwise order path. (#1811)

G-glop force-pushed the convexpoly branch from a7230b3 to d914a39 Compare May 13, 2018 19:40

ocornut force-pushed the master branch from a74014e to 2a2bb89 Compare May 13, 2018 20:31

ocornut added the drawing/drawlist label May 14, 2018

G-glop added 4 commits December 31, 2018 10:34

remove buffer from AddConvexPolyFilled

90b04ed

use the vertex output as a buffer

85fcab0

update to newest

939d53b

alias end of the vertex buffer as a tempoary array

f38dd95

G-glop force-pushed the convexpoly branch from 6e89be6 to f38dd95 Compare December 31, 2018 09:43

ocornut force-pushed the master branch from 008954d to 16c0a02 Compare January 31, 2019 12:45

ocornut force-pushed the master branch from 2d5eb47 to 6db0766 Compare April 24, 2019 15:40

ocornut force-pushed the master branch from 1b84280 to aca6ee1 Compare May 11, 2019 09:34

ocornut force-pushed the master branch from c426094 to 54c49b5 Compare July 2, 2019 16:34

ocornut force-pushed the master branch from 51085b5 to baae057 Compare July 23, 2019 20:40

ocornut force-pushed the master branch from 09caf50 to 13258f5 Compare November 12, 2020 11:20

ocornut force-pushed the master branch 2 times, most recently from 0c1e5bd to bb6a60b Compare August 27, 2021 19:10

ocornut added the optimization label Nov 8, 2021

ocornut force-pushed the master branch 2 times, most recently from 8b83e0a to d735066 Compare December 13, 2021 11:31

ocornut force-pushed the master branch from 9aa764c to b0a6cd6 Compare January 3, 2022 20:47

ocornut force-pushed the master branch from 51c64d8 to b3b85d8 Compare January 17, 2022 14:13

ocornut force-pushed the master branch from b3b85d8 to 0755767 Compare January 17, 2022 14:21

ocornut force-pushed the master branch 3 times, most recently from c817acb to 8d39063 Compare February 15, 2022 16:25

ocornut force-pushed the master branch from 8e10b7a to fc203c7 Compare April 12, 2022 13:30

ocornut mentioned this pull request Sep 26, 2022

ImDrawList::AddPolyline stack overflow on large polylines #5704

Closed

ocornut added a commit that referenced this pull request Sep 30, 2022

ImDrawList: Not using alloca() anymore, lift single polygon size limi…

f2a522d

…ts. (#5704, #1811)

ocornut force-pushed the master branch from 14d4b36 to c9aef16 Compare January 2, 2023 15:25

ocornut force-pushed the master branch from 4812791 to fac19e1 Compare February 7, 2023 18:29

ocornut changed the title ~~Remove tempoary buffer from AddConvexPolyFilled~~ Remove temporary buffer from AddConvexPolyFilled Apr 7, 2023

ocornut closed this Aug 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove temporary buffer from AddConvexPolyFilled #1811

Remove temporary buffer from AddConvexPolyFilled #1811

G-glop commented May 13, 2018 •

edited

Loading

ocornut commented May 13, 2018 •

edited

Loading

ocornut commented May 13, 2018

G-glop commented May 14, 2018

ocornut commented May 14, 2018

G-glop commented May 29, 2018 •

edited

Loading

G-glop commented May 30, 2018 •

edited

Loading

ocornut commented Nov 8, 2021

G-glop commented Nov 8, 2021 •

edited

Loading

rokups commented Apr 6, 2022

ocornut commented Aug 23, 2024

Remove temporary buffer from AddConvexPolyFilled #1811

Remove temporary buffer from AddConvexPolyFilled #1811

Conversation

G-glop commented May 13, 2018 • edited Loading

Long overdue update:

ocornut commented May 13, 2018 • edited Loading

ocornut commented May 13, 2018

G-glop commented May 14, 2018

ocornut commented May 14, 2018

G-glop commented May 29, 2018 • edited Loading

G-glop commented May 30, 2018 • edited Loading

ocornut commented Nov 8, 2021

G-glop commented Nov 8, 2021 • edited Loading

rokups commented Apr 6, 2022

ocornut commented Aug 23, 2024

G-glop commented May 13, 2018 •

edited

Loading

ocornut commented May 13, 2018 •

edited

Loading

G-glop commented May 29, 2018 •

edited

Loading

G-glop commented May 30, 2018 •

edited

Loading

G-glop commented Nov 8, 2021 •

edited

Loading