Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove temporary buffer from AddConvexPolyFilled #1811

Closed
wants to merge 4 commits into from

Conversation

G-glop
Copy link

@G-glop G-glop commented May 13, 2018

Math stays the same, only other change is the early exit, to prevent reading a zero sized array when initializing n0.

When i have it in my head: could it also auto detect winding order?
Currently it assumes clockwise winding, which is ok, but isn't documented anywhere.

Long overdue update:

Firt i want to say that your code has pretty much optimal performace, as in number of operations and throughput. That's why my first proposal failed abysmally, it both needed to compute one more sqrt, which matters when half of the inputs have 3 or 4 points, and was painfully serial.

So an intermediate buffer is necassary. But what about alloca? Well the key realization is that while another buffer is needed, more memory is not. That's what i did last year, re-use the vertex buffer to also store the normals. What i didn't do, and which prevented me from matching the original performace, was this:
ImVec2* temp_normals = (ImVec2*)((size_t)(_VtxWritePtr + vtx_count) & ~(alignof(ImVec2) - 1)) - points_count;
Which is again reusing the vertex buffer, but now it's nice, transparent and tightly packed.

My question to you is: "Should ImGui do this type of memory aliasing?".
(and, ugh, how to do it in a portable way - just say that ImVec2 is always 4 byte aligned?)

I also tried reusing the vertex buffer in the same way, and just inlining all the vector operators in the original.

Performance testing:
Using example_null as a base, a loop of NewFrame, 3x ShowDemoWindow, Render. Additonaly ImDrawList::AddPolyline and ImFont::RenderText are stubbed out to amplify the effect of AddConvexPolyFilled. The loop runs until 5 seconds have passed (batched to avoid the overhead of getting current time). Both Visual Studio 2017 and 2015 are tested (v141 and v140)

Signed,
an annoying GitHub user.

@ocornut
Copy link
Owner

ocornut commented May 13, 2018

Hello,

Thanks for submitting this.
What are you trying to solve? Assuming it is meant to be a performance improvement: Do you have a process and measurements to share comparing the speed of this version and the current one? I typically compare performance on both fully optimized and debug builds.

EDIT What was your intent with removing the alloca() ? alloca is probably one of the fastest possible thing one could possibly to do (it's 1 addition to increment the stack pointer) the data is in the stack, and the second loop will read back from data freshly written to the cache. However your single loop would also probably be a beneficial thing to explore.

As a crude performance test, I have bound the two variants of the code on the CTRL key (adding if (ImGui::GetIO().KeyCtrl) {} in that function to select the algorithm) and modified the ShowExampleAppCustomRendering() to keep only the filled primitives and display everything 200 times, then measured average time over 6 seconds:

200 iterations of rendering all the filled shapes:

Win32 Release Optimized:
Before: 1.48 ~ 1.50 ms
After: 1.66 ~ 1.69 ms (slower)

Win64 Release Optimized:
Before: 1.21 ~ 1.24 ms
After: 1.35 ~ 1.38 ms (slower)

Win32 Debug Non-Optimized (with MSVC debug cruft)
Before: 7.27 ~ 7.31 ms
After: 7.70 ~ 7.74 ms (slower)

etc.
So this would need to be reworked because right now unless I messed up my tests this is slower every time.

  • avoid the modulo in the loop like the original code did
  • for the Debug perf, avoiding calling ImVec2() constructors would probably also be beneficial
  • the test for the last point should be moved outside the loop (the [2] [4] [5] operators would need to be offset probably by points_count >= 2 ? -9 : -6 (untested)
  • etc.

Also smaller/other things that would need be fixed later:

  • use imgui coding style (braces, spaces)
  • squash the dozens commits into one (14 commits?!)

Thanks,
Omar

ocornut added a commit that referenced this pull request May 13, 2018
@ocornut
Copy link
Owner

ocornut commented May 13, 2018

About this:

When i have it in my head: could it also auto detect winding order?
Currently it assumes clockwise winding, which is ok, but isn't documented anywhere.

I have added a comment about anti-aliased filling requiring a path in clockwise order.
I don't think we can detect clockwiseness without a perf hit. We could eventually provide different primitives to either fill a CCW path or optional detect and apply AA properly in all situation but I don't think there is a need for it now.

@G-glop
Copy link
Author

G-glop commented May 14, 2018

Removing alloca is mostly for compatibility. (it's only other use is in AddPolyline, which can be modified in the same way)

Performance shouldn't have changed. Doing register-register ops ins't a win when pipelining hides the latency of reading memory.

I made a benchmark of processing a rounded rectangle. (not yet pushed) And MSVC's sampling profiler shows 10% of the whole run time is spent on if(scale < 100.0f), but only in some configurations, otherwise performance is identical.

But i'm only measuring the pure run time of the function itself, maybe the fact that the fill triangles are interleaved with the outline degrades performance? If so, it's easy to de-interleave them, it also may be worthwhile to also use a different algorithm that doesn't produce long, thin triangles.

@ocornut
Copy link
Owner

ocornut commented May 14, 2018

Removing alloca is mostly for compatibility. (it's only other use is in AddPolyline, which can be modified in the same way)

Which compatibility issue do you have with alloca ?
I don't mind changing the rendering function especially if we get a small boost or a simplification, but if it's for compatibility we need to also state what the compatibility issue is.

But i'm only measuring the pure run time of the function itself, maybe the fact that the fill triangles are interleaved with the outline degrades performance? If so, it's easy to de-interleave them,

I guess it's more reasonable and representative of real use to keep them interleaved. Regardless, a simple manufactured/looping test would probably fit in the instruction cache anyway even with multiple functions running.

it also may be worthwhile to also use a different algorithm that doesn't produce long, thin triangles.

What's the problem with long thin triangles?

Mildly unrelated, but if you fancy looking at it, the non-AA thick strokes are broken (see #288, one of the oldest PR alive, it was submitted just at the time we switched to AA primitives).

@G-glop
Copy link
Author

G-glop commented May 29, 2018

Sorry for taking so long, i got sidetracked trying (and failing) to install vTune, to see what's really slowing it down.

But reading the disassembly i think i know what's going on. In the old version the normal computing loop got unrolled to call sqrtf 5 times each iteration (nevermind that sqrtf is a wrapper around the AVX sqrtf instruction which computes 8 sqrtfs at once). Meanwhile my version has a critical path of computing the normals sequentially.

Without doing explicit SIMD (intrinsics/ispc) i don't think i can get a (reliable) performance boost. Maybe a compromise with a small fixed size buffer will yield the best result.

About the compatibility issue; alloca itself is fine (tho someone will be surprised when using a small stack and drawing complex polygons), the issue reolves around the implied __chkstk_ms. Windows only reserves one page above the stack to commit in when the stack grows. This all works fine untilyou have stack frames larger than 4kB, then you could access memory beyond the reserved page and cause a seqfault, so chkstk is inserted to sequentially touch the pages.

MinGW has (had?) a problem with properly handling it, causing link errors.

And lastly the short version why long thin traingles aren't so great:

GPU's first determine which 8x8 tiles a triangle overlaps, in those tiles it applies the same process but for 2x2 groups of pixels, if the traingle is overlapping a group then the shader is executed for all the pixels in that group and pixels outside the triangle are discarded.

Additinaly mobile/newer generation GPUs use tile renderers, which add another level with even larger tiles designed to fit in cache.

long source

So long thin triangles hit a lot of tiles and groups, but contribute little pixels to the output.

@G-glop
Copy link
Author

G-glop commented May 30, 2018

Ok, i think i will leave it at this. Using the output as a buffer gets slightly worser performance when it's the only thing in cache and slightly better performace when there's cache pressure.

Should i do the same for AddPolyLine?

@ocornut
Copy link
Owner

ocornut commented Nov 8, 2021

Long overdue update:
Firt i want to say that your code has pretty much optimal performace, as in number of operations and throughput. That's why my first proposal failed abysmally, it both needed to compute one more sqrt, which matters when half of the inputs have 3 or 4 points, and was painfully serial.
So an intermediate buffer is necassary. But what about alloca? Well the key realization is that while another buffer is needed, more memory is not. That's what i did last year, re-use the vertex buffer to also store the normals. What i didn't do, and which prevented me from matching the original performace, was this:
ImVec2* temp_normals = (ImVec2*)((size_t)(_VtxWritePtr + vtx_count) & ~(alignof(ImVec2) - 1)) - points_count;
Which is again reusing the vertex buffer, but now it's nice, transparent and tightly packed.

While wandering into old issues I noticed you post an interesting update in Dec 2018 which was made as an edit of the original post therefore without any notification coming. We'll investigate that new idea you posted in your edit.

@G-glop
Copy link
Author

G-glop commented Nov 8, 2021

I forgot to explain the different plotted quantities!

Quantity Description
initial inline Original function but with everything inlined, essentially what 9ad3419 did
joined Compute normals on the fly (initial idea of the PR)
vtx reuse safe Store normals in ImDrawVert::pos, iteration order avoids clobbering
idx reuse Store normals in a buffer typecast from _IdxWritePtr
vtx reuse Store normals in a buffer typecast from _VtxWritePtr and shifted such that writing to _VtxWritePtr doesn't clobber it too soon (what the PR currently does, modulo meaningless formatting changes)

More on compatibility: using alloca apparently requires a preprocessor include dance and silencing 3 warnings. Take your pick if it's better or worse than some cheeky buffer reuse, with or without typecasting.

I need to merge master into my branch and re-run the benchmarks. But overall I like the "idx reuse" strategy the most because it can't as easily suffer clobbering. Just generate all the vertices first and then the indexes. It's almost a drop-in replacement even for the much more complicated AddPolyline.

Thank you for taking a look at almost a 3 year idle PR!

@rokups
Copy link
Contributor

rokups commented Apr 6, 2022

I profiled it with Tracy. Profiler instrumentation was added only around two loops that use temp_normals. I do not see a meaningful difference here.

This trace is original code.
External trace is code from this PR.

Debug
dbg_old_vs_new

Release
rel_old_vs_new

Unsure how to implement total time metrics, maybe they are different because profiling sessions were not exactly equal, but average execution times are nearly same.

fangshun2004 added a commit to fangshun2004/imgui that referenced this pull request Sep 30, 2022
commit 5884219
Author: cfillion <cfillion@users.noreply.github.com>
Date:   Wed Sep 28 23:37:39 2022 -0400

    imgui_freetype: Assert if bitmap size exceed chunk size to avoid buffer overflow. (ocornut#5731)

commit f2a522d
Author: ocornut <omarcornut@gmail.com>
Date:   Fri Sep 30 15:43:27 2022 +0200

    ImDrawList: Not using alloca() anymore, lift single polygon size limits. (ocornut#5704, ocornut#1811)

commit cc5058e
Author: ocornut <omarcornut@gmail.com>
Date:   Thu Sep 29 21:59:32 2022 +0200

    IO: Filter duplicate input events during the AddXXX() calls. (ocornut#5599, ocornut#4921)

commit fac8295
Author: ocornut <omarcornut@gmail.com>
Date:   Thu Sep 29 21:31:36 2022 +0200

    IO: remove ImGuiInputEvent::IgnoredAsSame (revert part of 839c310), will filter earlier in next commit. (ocornut#5599)

    Making it a separate commit as this leads to much indentation change.

commit 9e7f460
Author: ocornut <omarcornut@gmail.com>
Date:   Thu Sep 29 20:42:45 2022 +0200

    Fixed GetKeyName() for ImGuiMod_XXX values, made invalid MousePos display in log nicer.  (ocornut#4921, ocornut#456)

    Amend fd408c9

commit 0749453
Author: ocornut <omarcornut@gmail.com>
Date:   Thu Sep 29 19:51:54 2022 +0200

    Menus, Nav: Fixed not being able to close a menu with Left arrow when parent is not a popup. (ocornut#5730)

commit 9f6aae3
Author: ocornut <omarcornut@gmail.com>
Date:   Thu Sep 29 19:48:27 2022 +0200

    Nav: Fixed race condition pressing Esc during popup opening frame causing crash.

commit bd2355a
Author: ocornut <omarcornut@gmail.com>
Date:   Thu Sep 29 19:25:26 2022 +0200

    Menus, Nav: Fixed using left/right navigation when appending to an existing menu (multiple BeginMenu() call with same names). (ocornut#1207)

commit 3532ed1
Author: ocornut <omarcornut@gmail.com>
Date:   Thu Sep 29 18:07:35 2022 +0200

    Menus, Nav: Fixed keyboard/gamepad navigation occasionally erroneously landing on menu-item in parent when the parent is not a popup. (ocornut#5730)

    Replace BeginMenu/MenuItem swapping g.NavWindow with a more adequate ImGuiItemFlags_NoWindowHoverableCheck.
    Expecting more subtle issues to stem from this.
    Note that NoWindowHoverableCheck is not supported by IsItemHovered() but then IsItemHovered() on BeginMenu() never worked: fix should be easy in BeginMenu() + add test is IsItemHovered(), will do later

commit d5d7050
Author: ocornut <omarcornut@gmail.com>
Date:   Thu Sep 29 17:26:52 2022 +0200

    Various comments

    As it turns out, functions like IsItemHovered() won't work on an open BeginMenu() because LastItemData is overriden by BeginPopup(). Probably an easy fix.

commit e74a50f
Author: Andrew D. Zonenberg <azonenberg@drawersteak.com>
Date:   Wed Sep 28 08:19:34 2022 -0700

    Added GetGlyphRangesGreek() helper for Greek & Coptic glyph range. (ocornut#5676, ocornut#5727)

commit d17627b
Author: ocornut <omarcornut@gmail.com>
Date:   Wed Sep 28 17:38:41 2022 +0200

    InputText: leave state->Flags uncleared for the purpose of backends emitting an on-screen keyboard for passwords. (ocornut#5724)

commit 0a7054c
Author: ocornut <omarcornut@gmail.com>
Date:   Wed Sep 28 17:00:45 2022 +0200

    Backends: Win32: Convert WM_CHAR values with MultiByteToWideChar() when window class was registered as MBCS (not Unicode). (ocornut#5725, ocornut#1807, ocornut#471, ocornut#2815, ocornut#1060)

commit a229a7f
Author: ocornut <omarcornut@gmail.com>
Date:   Wed Sep 28 16:57:09 2022 +0200

    Examples: Win32: Always use RegisterClassW() to ensure windows are Unicode. (ocornut#5725)

commit e0330c1
Author: ocornut <omarcornut@gmail.com>
Date:   Wed Sep 28 14:54:38 2022 +0200

    Fonts, Text: Fixed wrapped-text not doing a fast-forward on lines above the clipping region. (ocornut#5720)

    which would result in an abnormal number of vertices created.

commit 4d4889b
Author: ocornut <omarcornut@gmail.com>
Date:   Wed Sep 28 12:42:06 2022 +0200

    Refactor CalcWordWrapPositionA() to take on the responsability of minimum character display. Add CalcWordWrapNextLineStartA(), simplify caller code.

    Should be no-op but incrementing IMGUI_VERSION_NUM just in case.
    Preparing for ocornut#5720

commit 5c4426c
Author: ocornut <omarcornut@gmail.com>
Date:   Wed Sep 28 12:22:34 2022 +0200

    Demo: Fixed Log & Console from losing scrolling position with Auto-Scroll when child is clipped. (ocornut#5721)

commit 12c0246
Author: ocornut <omarcornut@gmail.com>
Date:   Wed Sep 28 12:07:43 2022 +0200

    Removed support for 1.42-era IMGUI_DISABLE_INCLUDE_IMCONFIG_H / IMGUI_INCLUDE_IMCONFIG_H. (ocornut#255)

commit 73efcec
Author: ocornut <omar@miracleworld.net>
Date:   Tue Sep 27 22:21:47 2022 +0200

    Examples: disable GL related warnings on Mac + amend to ignore list.

commit a725db1
Author: ocornut <omarcornut@gmail.com>
Date:   Tue Sep 27 18:47:20 2022 +0200

    Comments for flags discoverability + add to debug log (ocornut#3795, ocornut#4559)

commit 325299f
Author: ocornut <omarcornut@gmail.com>
Date:   Wed Sep 14 15:46:27 2022 +0200

    Backends: OpenGL: Add ability to #define IMGUI_IMPL_OPENGL_DEBUG. (ocornut#4468, ocornut#4825, ocornut#4832, ocornut#5127, ocornut#5655, ocornut#5709)

commit 56c3eae
Author: ocornut <omarcornut@gmail.com>
Date:   Tue Sep 27 14:24:21 2022 +0200

    ImDrawList: asserting on incorrect value for CurveTessellationTol (ocornut#5713)

commit 04316bd
Author: ocornut <omarcornut@gmail.com>
Date:   Mon Sep 26 16:32:09 2022 +0200

    ColorEdit3: fixed id collision leading to an assertion. (ocornut#5707)

commit c261dac
Author: ocornut <omarcornut@gmail.com>
Date:   Mon Sep 26 14:50:46 2022 +0200

    Demo: moved ShowUserGuide() lower in the file, to make main demo entry point more visible + fix using IMGUI_DEBUG_LOG() macros in if/else.

commit 51bbc70
Author: ocornut <omarcornut@gmail.com>
Date:   Mon Sep 26 14:44:26 2022 +0200

    Backends: SDL: Disable SDL 2.0.22 new "auto capture" which prevents drag and drop across windows, and don't capture mouse when drag and dropping. (ocornut#5710)

commit 7a9045d
Author: ocornut <omarcornut@gmail.com>
Date:   Mon Sep 26 11:55:07 2022 +0200

    Backends: WGPU: removed Emscripten version check (currently failing on CI, ensure why, and tbh its redundant/unnecessary with changes of wgpu api nowadays)

commit 83a0030
Author: ocornut <omarcornut@gmail.com>
Date:   Mon Sep 26 10:33:44 2022 +0200

    Added ImGuiMod_Shortcut which is ImGuiMod_Super on Mac and ImGuiMod_Ctrl otherwise. (ocornut#456)

commit fd408c9
Author: ocornut <omarcornut@gmail.com>
Date:   Thu Sep 22 18:58:33 2022 +0200

    Renamed and merged keyboard modifiers key enums and flags into a same set:. ImGuiKey_ModXXX -> ImGuiMod_XXX and ImGuiModFlags_XXX -> ImGuiMod_XXX. (ocornut#4921, ocornut#456)

    Changed signature of GetKeyChordName() to use ImGuiKeyChord.
    Additionally SetActiveIdUsingAllKeyboardKeys() doesn't set ImGuiKey_ModXXX but we never need/use those and the system will be changed in upcoming commits.

# Conflicts:
#	docs/CHANGELOG.txt
#	imgui.h
#	imgui_demo.cpp
@ocornut ocornut changed the title Remove tempoary buffer from AddConvexPolyFilled Remove temporary buffer from AddConvexPolyFilled Apr 7, 2023
@ocornut
Copy link
Owner

ocornut commented Aug 23, 2024

I was curious to have another look at this today, but it has become very confusing to follow because:

  • posts have been edited + proposed PR have been changed multiple times, it's hard to make sense of any measurement and comparisons done.
  • imgui code itself has been changed (we don't use alloca anymore)

As of today if I look at the leftover diff, the only difference becomes:

-- ImVec2* temp_normals = (ImVec2*)alloca(points_count * sizeof(ImVec2))
++ ImVec2* temp_normals = (ImVec2*)((size_t)(_VtxWritePtr + vtx_count) & ~(alignof(ImVec2) - 1)) - points_count;

But this doesn't apply the same to current code.

Considering the last posts from from April 2022 suggested no measurable difference I think we can close this.
At this point after this has been simplified down to moving that temporary buffer merely to a different location, I don't believe it would make any difference. Let me know if you don't agree.

@ocornut ocornut closed this Aug 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants