SoftGPU perf opportunities #17613
Collected some data on that. For "Soulcalibur: Broken Destiny" in a 3D scene, the distribution of pixels covered per quad:
Where would the 0 pixel quads come from? Do you count alpha and depth testing there?
No, I just count the quads before any per-pixel tests. So, 0-pixel quads are rejected right away and are pretty cheap. As for where they come from: if you just walk the entire AABB (as some simple half-space rasterizers do) you'll get a lot of 0-pixel quads (a triangle covers 11/36 of its AABB on average, and never more than half).
Ah right, that makes sense. BTW, Unknown has done quite a bit of exploration of the raster-in-quads idea in the past.
A more obvious example is tiny triangles - a triangle so small that it does not actually cover any sampling points still necessitates testing at least one quad.
Actually, I did write out quads, but it introduced a lot of complexity and was often neutral, sometimes better, and sometimes worse for overall performance. The changes can be seen here and are mostly dedicated to the JIT. As you mention, a lot of time is spent in the barycentric stuff, so I moved instead to improving parallelism and trying to identify safe cases of rectangles or other things that could be calculated more simply and efficiently (previously a lot of time was wasted on things like full-screen triangle strips.)
Even with AVX2 it's definitely complicated when you think about most textures being CLUT, so you end up in a situation where you're doing a gather from a gather. I have not really attacked that. A simpler (maybe) thing I've thought about is pre-caching certain textures based on heuristics. We end up with some annoying wasteful cases where we burn a lot of time in the sampler:
The PSP itself had a texture cache (which would be difficult to accurately model for sure), so there might be something to pre-caching and flushing that cache from texflush or texsync GE commands (I forget which, one seems to be related to block transfer actually iirc.) That said, it might make more sense to keep it thread-local to maintain high parallelism. And I'm not certain pre-caching would even be worth it... maybe CPU caches make these cases as fast as I could hope for anyway. I'm pretty sure doing it every draw would not be worth it.
Well, I've got an 8700 CPU (which I consider somewhat high end) and my goal so far has just been to move from measuring "seconds per frame" to making most games largely playable on my CPU. But I had someone try on an M1 and it ran quite well, I think actually as well as or better than on my 8700. It looked like clang was vectorizing decently. As with Intel though, I've put nearly zero thought into 32-bit and have only ever looked at arm64.
This might just be cargo culting - I think you're right that the sum should be constant. That said, I recommend measuring any such changes both in-game and with frame dumps, because sometimes changes have subtle and unexpected impacts...
There's also things like scissor mask and bias that impact that a bit, but you get into an economics problem trying to account for it all: you can negatively impact a game that is in the playable range for a very small benefit to a game that is either already well playable or not very playable anyway. It's probably worth more experimentation, but there's a bunch of tradeoffs for sure.

One reason for the two pixels is that things are intentionally aligned to a 2x2 grid for parallelism reasons. So you can end up with an unfortunate "upper half" and "lower half", but it's worth it overall to make sure you don't risk overlap between threads running different areas of the screen.

That's another area where there could be a lot of gains, but I've struggled to find better heuristics than the current ones; it's a tricky case of tradeoffs as well. Right now we often end up with, say, 60% of threads taking on 85% of the work, and if we could even it out more it'd lead to better performance (on multi-core processors.) -[Unknown]
Also worth noting: the barycentric triangle-slice thing is currently the least accurate part of the pipeline. Blending, sampling/filtering, tests, etc. are all pretty darn accurate now. Some of the coordinate rounding seems a bit off before the slicing, but the weighting etc. is not accurate. There's also the matrix multiplications etc., which are probably inaccurate. But even in through-mode (which skips a lot of that), the weighting is slightly incorrect across a triangle. Maybe it'll never be accurate, and I'm not necessarily trying to make softgpu accurate at all costs, but I have definitely aimed to make it accurate where reasonable, and sometimes that has led to perf improvements (some places are accurate with divide-by-256 even where you might think it should be 255, etc.)

What I will say is the "region" registers are very interesting and made me think the PSP might do something similar to what we currently do. You have two registers split into:
For example, suppose I had a scissor set to constrain drawing to 10,10 - 20,20. Considering only the "distance", my memory is that it goes from the top left of the primitive. So if distance was 5,5... you'd only possibly get 10,10-15,15 filled. Although I'm not sure if it's the TL of the primitive before or after scissor.

Then there's rate. This was weird, and no game ever sets it to anything other than 0,0. From memory, this boosts the "consumption rate" of distance at a fixed point of 2.8. The maximum value, 0x3FF, would consume almost 5 "distance" for each pixel, but also "zoom" the primitive out by almost 5x (i.e. texturing and everything else scales too.) The standard value consumes 1 distance per pixel and so is at 1x zoom for the primitive. But most documentation names these registers region "x1,y1" and "x2,y2". Anyway, just a curiosity about how this might work on the PSP that I thought worth mentioning. -[Unknown]
Hm, this is interesting (and counterintuitive).
Some of it does sound like trying to beat the CPU cache at its own game, which I'm not optimistic about. I also remember how I tried to add a special case for "all texel addresses are equal" to my SSE2 toy softrender, and it didn't help much even in the best case (it didn't hurt much either). Though here we can bake conversion from the underlying format into caching, which may be a win.
Oh, yes, clang is happy to auto-vectorize on `-O2`.
Are tasks fine-grained enough to make job-stealing worthwhile?
Now that I think about it, it might be faster to skip that.
I have thought about trying a few different things. I was maybe initially a bit overzealous in trying to avoid early-exits (could branch if all pixels fail the alpha test), and maybe going from 2x2 to 4x1 would be better (possible con: maybe worse for tall, thin triangles, which are not that uncommon.)
Hm, at least if it's split into bins, I remembered it being aligned. But maybe it was something I was planning to do or once did and decided wasn't worth it? It has been through various changes.
That's interesting and does look pretty terrible. I do see some improvement from changing that, even with some of our specializations. Will just do a quick pull. -[Unknown]
Doing as many early-exits as possible is probably quite beneficial - it's running on a CPU after all, not a GPU.
My intuition is the same, but needs testing, obviously. Tangentially related, but I'm considering a PR to centralize the 32-bit x86 workaround along the lines of:

```cpp
#if PPSSPP_ARCH(X86)
// [Detailed rationale here].
#define SAFE_M128(v)  _mm_loadu_ps   (reinterpret_cast<const float*>  (&(v)))
#define SAFE_M128I(v) _mm_loadu_si128(reinterpret_cast<const __m128i*>(&(v)))
#else  // x64. FWIW, also works on non-x86.
#define SAFE_M128(v)  (v)
#define SAFE_M128I(v) (v)
#endif
```

to eliminate the ad-hoc casts scattered around.
Open to suggestions for better names, too.
In several places division is used instead of a shift. When signed, the compiler has to insert extra logic to match truncating division, which can hurt perf (unless truncating division is both possible and the desired behaviour).
That looks like it might be a worthwhile improvement, where applicable. Feel free to PR.
Note that this function, too, is not normally called very often. But it probably should be uint32_t. -[Unknown]
x86 with JIT aside, wouldn't it get called quite often on ARM (4 times per quad)?
I think this can be simplified:

```cpp
static inline void GetTextureCoordinatesProj(const VertexData& v0, const VertexData& v1, const float p, float &s, float &t) {
	// This is for texture matrix projection.
	float q0 = 1.f / v0.clipw;
	float q1 = 1.f / v1.clipw;
	float wq0 = p * q0;
	float wq1 = (1.0f - p) * q1;

	float q_recip = 1.0f / (wq0 + wq1);
	float q = (v0.texturecoords.q() * wq0 + v1.texturecoords.q() * wq1) * q_recip;

	q_recip *= 1.0f / q;

	s = (v0.texturecoords.s() * wq0 + v1.texturecoords.s() * wq1) * q_recip;
	t = (v0.texturecoords.t() * wq0 + v1.texturecoords.t() * wq1) * q_recip;
}
```

Namely, the `(wq0 + wq1)` factor cancels: `q_recip * (1.0f / q)` is just `1.0f / (v0.texturecoords.q() * wq0 + v1.texturecoords.q() * wq1)`, so one of the two divisions can be dropped.

There's a trick where you calculate 2 reciprocals for the price of 1:

```cpp
// Naive.
float inv_a = 1.0f / a;
float inv_b = 1.0f / b;

// Trick.
float inv_ab = 1.0f / (a * b);
float inv_a = inv_ab * b;
float inv_b = inv_ab * a;
```

This can also be extended to 3 or more reciprocals, with more overhead. It does come with a lot of fine print, though (zeroes botch everything; results are non-exact where they could be exact), and might not be worth it anyway (CPUs can run several divisions in parallel).
Looks like the constants only need to be computed once (I doubt the compiler hoists them, since it doesn't see the whole picture): ppsspp/GPU/Software/Rasterizer.cpp, lines 1126-1127 in 42d4b5d.
Fair enough.
This is probably a case of tweaking something that was slightly wrong and then not fixing it, oops. Yes, should make sense to simplify.
If w is zero or inf/NaN, generally graphics won't draw anyway on a PSP. That said, unless it realizes a measurable FPS gain, I'd be wary of doing things like that.
Yeah, I've tried this and it didn't help much and may have hurt. Maybe it's worth doing, but in this case the code only runs in throughmode (no matrix transforms, used for 2D drawing typically, most often low triangle counts.) And another if per triangle hurts all those tiny hand/finger parts of 3D models, although only very slightly. But at least before, it was such a small difference I couldn't tell it from error with or without. -[Unknown]
Hm. For the traditional top-left rule, it's something like this:

```cpp
// Get bias for oriented edge A->B, assuming (visible) triangles are CCW
// in Cartesian (y-up) coordinates.
static inline int edge_function_bias(const Vec2<int> &A, const Vec2<int> &B)
{
	bool is_top_left = (B.y == A.y && B.x < A.x) || (B.y < A.y);
	return is_top_left ? 0 : -1;
}
```

Not sure what the fill rules are for the PSP, but this feels off (at least the implementation; the semantics might be correct).
As mentioned in hrydgard#17613 (comment).
Looks like this misclassifies some inputs:

```cpp
static inline bool IsRightSideOrFlatBottomLine(const Vec2<int>& vertex, const Vec2<int>& line1, const Vec2<int>& line2)
{
	if (line1.y == line2.y) {
		// just check if vertex is above us => bottom line parallel to x-axis
		return vertex.y < line1.y;
	} else {
		// check if vertex is on our left => right side
		return vertex.x < line1.x + (line2.x - line1.x) * (vertex.y - line1.y) / (line2.y - line1.y);
	}
}
```

Consider:

```cpp
Vec2<int> A( 63, -59);
Vec2<int> B(-49, 73);
Vec2<int> C( 67, -64);

bool result = IsRightSideOrFlatBottomLine(C, A, B);  // <-- false.
```

However, in the (rather thin) triangle ABC, C lies strictly on the right side of the directed line A->B (the exact cross product comes out negative), so the expected result is true: the integer division truncates the edge's x at C's scanline down to exactly C.x, and the strict `<` then fails. Edge AB also happens to pass through the integer point (7,7). A similar example exists for the other winding order.

I think what it tries to compute can be written as:

```cpp
static inline bool IsRightSideOrFlatBottomLine(const Vec2<int>& vertex, const Vec2<int>& line1, const Vec2<int>& line2)
{
	if (line1.y == line2.y) {
		// just check if vertex is above us => bottom line parallel to x-axis
		return vertex.y < line1.y;
	} else {
		// check if vertex is on our left => right side
		int64_t delta = int64_t(vertex.x - line1.x) * int64_t(line2.y - line1.y) - int64_t(line2.x - line1.x) * int64_t(vertex.y - line1.y);
		return line2.y > line1.y ? (delta < 0) : (delta > 0);
	}
}
```
Experiments indicate that the discrepancy occurs with about 0.0038% probability for uniformly distributed vertices.
The compilers don't care to optimize this, by the way.
Yeah, that function is 10 years old and I think came from Dolphin: b0d3848. We used to have much worse coverage issues, but they were caused by other, worse problems. I think we probably don't really have a test that directly covers this; the closest we have only touch it indirectly. Should probably create a specific coverage test for each primitive, I guess, and check right-to-left order, etc. -[Unknown]
ppsspp/GPU/Software/SamplerX86.cpp, lines 1080-1088 in f3d95a2:
Loading 32 bits at a time does seem consistently faster in SSE2: https://godbolt.org/z/1j75fz9aM
I sometimes get a case where
Just to mention the obvious: since w0+w1+w2 = const, the interpolation can be rewritten to exploit the constant sum.
Well, wait, we can't use PINSRD in this path - we use that for SSE4. The question here is, would it be better to MOVD from RAM and then blend? Probably only relevant to try that on a pre-SSE4 CPU. So I just left it as two PINSRW.
At least on a Coffee Lake, I had pretty significant gains from using VPGATHERDD in PPSSPP (#15275.) Sure have come a long way since those speeds, heh.
Hm, this sounds like it could be a good idea. Might have to be careful about any accuracy impacts as you mention, although since we go at half pixels I don't know how often any weight lands directly on zero. But of course, it can happen. -[Unknown]
The path named

On my (pre-SSE4) machine (adapted the same code from godbolt):

That said, loading to a temporary in memory may be more hassle in JIT, so it's understandable to not bother with it for a (probably) rarely-used path.

On a different note, out of curiosity, I tried running the GL renderer on Mesa's software rasterizer, and it was about twice as fast as PPSSPP's SW rendering (14 FPS vs. 7, though still a lot slower than OpenGL, which runs at a full 60 FPS) in the 3D scene in Soulcalibur. On Windows, to test this, you can simply place Mesa3D's opengl32.dll next to PPSSPP's executable.
Running our hardware renderers in an outside software rasterizer is "cheating" in a number of ways - they will benefit from pre-decoded cached textures, for example, while our software renderer decodes on every access, and will inherit all the other inaccuracies of the hardware renderers, of course.
What should happen
As mentioned, some (possibly dumb; I'm trying to wrap my head around it) observations on SoftGPU.

First off, here's what performance looks like (Linux, 32-bit x86 with SSE2 but without SSE4):

This is me playing a couple of minutes of "Soulcalibur: Broken Destiny", mostly a 3D scene.

- The profile is flat, without `--callgraph` - it just shows where (and how often) samples land.
- Even without JIT, `Rasterizer::DrawSinglePixel` accounts for surprisingly little - much less than `Sampler::SampleLinearLevel` (there might be some games that do not use linear much).
- A lot of time is spent in `Rasterizer::DrawTriangleSlice` itself (and whatever is inlined into it).

Now, looking at the code, the biggest thing is that most of the processing is done per pixel. `DrawTriangleSlice` actually works on 2x2 pixel quads, but then most of the actual processing is per-pixel inside the quad. All of `state.drawPixel`, `state.nearest`, and `state.linear` are per-pixel, even when JIT-ed. I assume this is because the first order of business was to get it right, which is easier with per-pixel functions.

Converting it to entirely quad-based seems like a lot of work (especially since it involves both JIT and non-JIT parts, in sync). It also seems lucrative for performance. Even purely scalar quad-based versions are likely to be faster than their per-pixel counterparts, since the various `if (state.whatever)` checks would be amortized. And the texture lookup seems like the only thing that does not SIMD-ify readily, pre-AVX (and emulating `gather` in plain scalar+SSE is not even that bad). Going by names like `DrawSinglePixel`, the idea seems to have been there all along. Aside: I don't have statistics for how common tiny triangles are, or what percentage of 2x2 quads are full.

`Vec3<...>` and `Vec4<...>` are used extensively, but some operations seem missing (notably bitwise stuff). Also, only x86 has SIMD paths for operations on them, and I don't think the default paths auto-vectorize, since auto-vectorization is under `-O3`, but PPSSPP uses `-O2`. Not sure if performance on ARM is a concern.

Don't see why ppsspp/GPU/Software/Rasterizer.cpp, line 963 in a56f74c, is per-pixel (per-quad). Normally the w0+w1+w2=const invariant holds for the entire triangle (the entire screen, actually), unless some weird per-edge scaling is done. When I tried computing it once per `DrawTriangleSlice`, there appeared no visible problems.