
SoftGPU: Rasterize triangles in chunks of 4 pixels #9635

Merged
4 commits merged into hrydgard:master on Apr 23, 2017

Conversation

unknownbrackets
Collaborator

Currently, this is generally a bit slower, but it's a step in the right direction.

The last commit shows the benefit of this change in one area. Sample performance change in Tales of Destiny 2 (before this PR -> after this PR, with the throughmode perf commit):

130% -> 200% - Logos during intro
35% -> 70% - Load save screen
64% -> 50% - 3D overworld

-[Unknown]

@hrydgard
Owner

I like it. Good prep for mipmapping. As this clearly shows, SIMD becomes quite easy when you write it just like the straight-line code, but with one component from each pixel in each lane; once fully applied, there's no way this won't be faster than doing it a single pixel at a time.

Buildbots are a little unhappy though.
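The "straight-line code, one pixel per lane" idea can be sketched like this; a minimal illustration (names and the Z-interpolation example are hypothetical, not PPSSPP code). The scalar per-pixel expression `z = z0 + dzdx * x` keeps its shape, with each `__m128` lane holding the value for one of 4 adjacent pixels:

```cpp
#include <xmmintrin.h>

// Interpolated Z for pixels x, x+1, x+2, x+3 in one register.
// Same formula as the scalar code, just four lanes wide.
static inline __m128 InterpZ4(float z0, float dzdx, int x) {
  __m128 xs = _mm_add_ps(_mm_set1_ps((float)x),
                         _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f));
  return _mm_add_ps(_mm_set1_ps(z0), _mm_mul_ps(_mm_set1_ps(dzdx), xs));
}
```

The point is that the lane dimension is invisible at the call site: the code reads like the one-pixel version.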

@unknownbrackets
Collaborator Author

Sure, although it can be a bit of a pain. We still have some sort of skew issue - a (0, 11)-(0, 11) 1:1 draw doesn't actually draw 1:1 in Crisis Core (the cross button symbol in the bottom right, in nearest filtering). I guess that means #8282 didn't handle all the cases.

But, probably better to start fixing these things in a four-pixel pipeline anyway.

-[Unknown]

GPU/Math3D.h (outdated)

```diff
@@ -634,6 +634,13 @@ class Vec4
 	return Vec4(VecClamp(x, l, h), VecClamp(y, l, h), VecClamp(z, l, h), VecClamp(w, l, h));
 }

+Vec4 Reciprocal() const
+{
+	// In case we use doubles, maintain accuracy.
```
Owner


Can you clarify this? If T is a double, 1.0f will just be automatically cast to 1.0 and the division will be performed at double precision. Is this intended or not? I'm confused :)

Collaborator Author


Sorry, I was worrying about the accuracy problems and trying to mess with values to fix things - removed the comment.

-[Unknown]

@hrydgard
Owner

hrydgard commented Apr 23, 2017

Hm, I was thinking (this is not a call for action, just thoughts for the future): instead of using Vec4 across the lanes everywhere, an alternate way of formulating the math might be to use Vec2, Vec3, and Vec4 composed out of __m128. Like, Vec4<__m128>, then you can still perform "vector operations" and use various operator overloads etc while ignoring the fact that you're doing it for four pixels at a time.

And you'd have a type like "scalar" which would be used where a single float is currently used, just an __m128 with overloads like a Vec4 just not named so.

Not sure how confusing that would be though.
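A minimal sketch of that idea (the names `Scalar4` and `Vec4Lanes` are illustrative, not PPSSPP code): a "scalar" type that is secretly four pixels wide, plus the usual vector template instantiated with it, so existing-looking math silently processes 4 pixels:

```cpp
#include <xmmintrin.h>

// A "scalar" that actually holds the same quantity for 4 pixels.
struct Scalar4 {
  __m128 v;
  Scalar4(float f) : v(_mm_set1_ps(f)) {}
  Scalar4(__m128 m) : v(m) {}
  Scalar4 operator+(Scalar4 o) const { return Scalar4(_mm_add_ps(v, o.v)); }
  Scalar4 operator*(Scalar4 o) const { return Scalar4(_mm_mul_ps(v, o.v)); }
};

// The familiar Vec4<T> shape, usable with T = Scalar4 (or raw __m128
// wrapped with overloads), ignoring the 4-pixels-at-a-time detail.
template <typename T>
struct Vec4Lanes {
  T x, y, z, w;
  Vec4Lanes operator+(const Vec4Lanes &o) const {
    return Vec4Lanes{x + o.x, y + o.y, z + o.z, w + o.w};
  }
};
```

The appeal is that the rasterizer math reads unchanged; the cost is that every "float" in the old code has to become the wide type, which is where the non-SSE fallback gets awkward.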

@unknownbrackets
Collaborator Author

unknownbrackets commented Apr 23, 2017

Well, that sounds like it'd make the non-SSE paths more complicated.

I'd actually like to move the pipeline to jit, at first built as a chain of func calls (like vertexjit or more like MIPS Comp_Generic really), in steps. Then we can construct a key, select a jit program from a cache, and run it.

In that scenario, it might be ideal to use 16 u8s for colors, or maybe two pairs of 8 u16s to simplify blending. But not sure. Want to be mindful of available regs.

-[Unknown]
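The "8 u16s to simplify blending" point can be sketched as follows (a hypothetical helper, not PPSSPP code): with channels widened to 16 bits, two RGBA pixels fit one XMM register and a multiply-blend stays in range without further unpacking:

```cpp
#include <emmintrin.h>

// (src * factor) >> 8 per channel, for 8 u16 channels (two RGBA8 pixels
// widened to 16 bits). 255 * 255 = 65025 still fits in 16 bits, so no
// widening to 32 bits is needed for the multiply.
static inline __m128i MulBlend8x16(__m128i src, __m128i factor) {
  return _mm_srli_epi16(_mm_mullo_epi16(src, factor), 8);
}
```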

@hrydgard
Owner

Yeah, good points - lots of this can definitely be done at 16-bit or 8-bit.

@hrydgard hrydgard merged commit 4f0e1a0 into hrydgard:master Apr 23, 2017
@unknownbrackets unknownbrackets deleted the softgpu branch April 23, 2017 19:21
@unknownbrackets
Collaborator Author

A little experiment:
https://github.com/hrydgard/ppsspp/compare/master...unknownbrackets:samplerjit?expand=1

Just wanted to try it quickly and texel lookup was a nice self-contained piece. A bit underwhelming (considering ApplyTexturing is typically 20-40% of wall time), about 10% FPS improvement at best. Not terribly optimal though, and obviously would want to at least decode 16-bit directly to xmms (maybe via a jit ABI, and 4 texels at a time.)

The best profiling results were SampleNearest 21% -> SamplerJit 9% in Hexyz Force (at barely 64 FPS.) Probably need a "texture cache" for better performance...

-[Unknown]
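One way the "decode 16-bit directly to xmms" step might look, as a sketch (the function and channel order are illustrative; the PSP's actual per-format layouts differ): expanding four 4444 texels to 8888 in one register instead of texel-by-texel:

```cpp
#include <emmintrin.h>

// Expand 4 x 4444 texels (in the low 64 bits of t4) to 4 x 8888.
// Each 4-bit channel n becomes the byte n * 0x11, i.e. n | (n << 4).
static inline __m128i Decode4444x4(__m128i t4) {
  const __m128i nib = _mm_set1_epi32(0x0000000F);
  __m128i v = _mm_unpacklo_epi16(t4, _mm_setzero_si128());  // 4 x u32
  __m128i r = _mm_and_si128(v, nib);
  __m128i g = _mm_and_si128(_mm_srli_epi32(v, 4), nib);
  __m128i b = _mm_and_si128(_mm_srli_epi32(v, 8), nib);
  __m128i a = _mm_srli_epi32(v, 12);
  r = _mm_or_si128(r, _mm_slli_epi32(r, 4));
  g = _mm_or_si128(g, _mm_slli_epi32(g, 4));
  b = _mm_or_si128(b, _mm_slli_epi32(b, 4));
  a = _mm_or_si128(a, _mm_slli_epi32(a, 4));
  return _mm_or_si128(_mm_or_si128(r, _mm_slli_epi32(g, 8)),
                      _mm_or_si128(_mm_slli_epi32(b, 16),
                                   _mm_slli_epi32(a, 24)));
}
```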

@hrydgard
Owner

If ApplyTexturing was 20% and you got a 10% total improvement, that means that you approximately doubled the speed of texel fetching, which isn't too bad still. But yeah, would also have expected a little better than that...

@unknownbrackets
Collaborator Author

I wonder if rather than linear sampling from [(u,v), (u+1,v), (u,v+1), (u+1,v+1)] we instead just always sampled from even/odd pairs: u0 = u & ~1; v0 = v & ~1;. Based on swizzling I would sorta assume this is what the hardware MIGHT do?

If we did that, it should be possible to simply calculate all 4 addresses after the first one without much effort...

-[Unknown]

@hrydgard
Owner

hrydgard commented May 11, 2017

Not sure I understand what you mean. The texture coordinates are often very dissimilar from the pixel locations on the screen, imagine any perspective mapping or a rotated mapping.

Of course when drawing 1:1 rectangles, there are many possible optimizations including skipping the UV calculations altogether.

@unknownbrackets
Collaborator Author

I mean when doing linear sampling (the 4 samples used to interpolate.) Currently it does winding:

https://github.com/hrydgard/ppsspp/blob/master/GPU/Software/Rasterizer.cpp#L209

I don't mean when drawing multiple pixels, this is for just one pixel.

-[Unknown]

@hrydgard
Owner

Right, some simplification may be possible. You only need to calculate one address to fetch from, and then just offset by 1 horizontally and (texw) vertically to get the other three - if it weren't for wrapping and clamping which might have you fetch from either the same address, or alternatively from the other side of the texture. Not sure how to do this in the most elegant way.
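The scheme described above can be sketched like this (a hypothetical helper, not PPSSPP code), with the clamp case folded into the neighbor offsets; wrap mode would instead mask with `& (texw - 1)` for power-of-two sizes:

```cpp
// Indices of the 4 bilinear neighbors of (u, v) in a texw x texh texture,
// clamping at the right and bottom edges. One base index, then +1
// horizontally and +texw vertically, as described above.
static inline void Neighbors4(int u, int v, int texw, int texh, int idx[4]) {
  int u1 = (u + 1 < texw) ? u + 1 : u;  // clamp at the right edge
  int v1 = (v + 1 < texh) ? v + 1 : v;  // clamp at the bottom edge
  idx[0] = v * texw + u;    // (u,   v)
  idx[1] = v * texw + u1;   // (u+1, v)
  idx[2] = v1 * texw + u;   // (u,   v+1)
  idx[3] = v1 * texw + u1;  // (u+1, v+1)
}
```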

@unknownbrackets
Collaborator Author

unknownbrackets commented May 11, 2017

Well, my point is that if the U and V are even, then you're guaranteed:

  • U+1 will always be available by incrementing a few bytes.
  • V+1 will always be available based on bufw (not swizzled) or a fixed number (if swizzled.)

(you are not guaranteed these things if U or V are odd - in that case +1 might go to a new tile, when swizzled, so you end up needing to re-examine U and V.)

Wrapping and clamping won't cause problems in that case unless it's a 1x1 mip level, which can be special cased (since they are power of two sized.) In that case one might early-out of linear sampling anyway.

So if we (in linear filtering only) always sample based on even and odd UVs, things get much simpler for sampling all four at once.

-[Unknown]
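The even-base argument can be made concrete, assuming the PSP's swizzle layout of 16-byte-wide, 8-row blocks (so 8 texels per block row at 16bpp; the helper names are illustrative): from an even coordinate, +1 never crosses into the next block, while from an odd coordinate it can:

```cpp
// Block coordinates for a swizzled 16bpp texture, assuming
// 16-byte-wide (8 texels), 8-row blocks.
static inline int BlockX16bpp(int u) { return u / 8; }
static inline int BlockY(int v) { return v / 8; }
```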

@hrydgard
Owner

hrydgard commented May 11, 2017

But that doesn't really work, does it? Let's imagine one dimension texture t[], and your sole texture coordinate U is:

0.5  :     We need lerp(t[0], t[1], 0.5).  Works!
2.25 :     We need lerp(t[2], t[3], 0.25). Works!
1.75 :     We need lerp(t[1], t[2], 0.75). Ooops... 1 is odd. 

Or are you saying that we'll get around that by rewriting the last equation to lerp(t[2], t[1], 0.25) by xoring the indices by the low bit and 1.0-x the lerp factor?
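The rewrite being asked about rests on the identity lerp(a, b, t) == lerp(b, a, 1-t); a quick check with the U = 1.75 example above:

```cpp
// Standard linear interpolation.
static inline float Lerp(float a, float b, float t) {
  return a + (b - a) * t;
}
```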

@unknownbrackets
Collaborator Author

D'oh, right. I wasn't thinking about the lerp step later, of course. I'm stupid.

-[Unknown]

@unknownbrackets
Collaborator Author

Interestingly, I found that with samplerjit, the thread loop (which is really naive) is mostly just waiting longer. I wonder if I have a threading bug somehow, or if it's just showing the naivety of slicing by y...

We could probably "bin" and trivially discard based on say 60x68 tiles or something, right?

-[Unknown]
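The binning idea above can be sketched as follows (a hypothetical helper; the 64x64 tile size is illustrative, the comment floats "60x68 or something"): assign each triangle's screen-space bounding box to the tiles it touches, so worker threads pick up tiles rather than scanlines:

```cpp
#include <algorithm>
#include <vector>

struct Tri { int minX, minY, maxX, maxY; };  // screen-space bounding box

static const int kTileShift = 6;  // 64x64 tiles

// For each tile (row-major), the indices of triangles touching it.
// Triangles whose box misses a tile entirely are trivially discarded
// for that tile, which is the point of binning.
std::vector<std::vector<int>> BinTriangles(const std::vector<Tri> &tris,
                                           int screenW, int screenH) {
  int tilesX = (screenW + 63) >> kTileShift;
  int tilesY = (screenH + 63) >> kTileShift;
  std::vector<std::vector<int>> bins(tilesX * tilesY);
  for (int i = 0; i < (int)tris.size(); ++i) {
    int tx0 = std::max(tris[i].minX, 0) >> kTileShift;
    int ty0 = std::max(tris[i].minY, 0) >> kTileShift;
    int tx1 = std::min(tris[i].maxX, screenW - 1) >> kTileShift;
    int ty1 = std::min(tris[i].maxY, screenH - 1) >> kTileShift;
    for (int ty = ty0; ty <= ty1; ++ty)
      for (int tx = tx0; tx <= tx1; ++tx)
        bins[ty * tilesX + tx].push_back(i);
  }
  return bins;
}
```

Bounding-box binning over-includes thin diagonal triangles, but it keeps per-tile work lists cheap to build, which matters when most triangles are small.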

@hrydgard
Owner

That would explain some lack of speedup, yeah...

Tiled binning is a good way to go for multithreading rendering, definitely better than slicing by Y if you have many small triangles, which we generally do. Finding the optimal tile size is gonna be quite some trial and error though, I'm sure.

@unknownbrackets
Collaborator Author

A bit better now with linear in the jit, but just not much faster...

master...unknownbrackets:samplerjit

Fewer rounding errors this way, though.

-[Unknown]
