SIMDe #12
Comments
DPPS should already be implemented; maybe you are hitting simd-everywhere/simde#648. Anyway, duh, shit. I can't believe this flew under my radar this spring when I was looking for alternatives to SSEPlus. |
@AndreyTykhonov I am interested in hearing more about your SIMDe failure to compile; can you share more details? |
I was at the very beginning. I primarily use C#, so C++ is pretty foreign to me. I compiled the sample Pin 3.14 project and added an instruction watch like in this code, but when I try to include the SSE 4.1 header I get a huge number of errors related to Pin modules (even without any SIMDe calls or variables, just from the include!). I tested SIMDe in a console app and it works perfectly, but with Pin something goes wrong. So I decided I shouldn't spend my time learning C++ to fix Cyberpunk in place of the developers, and removed the solution along with the game, lol. But I thought it could be handy for someone, so I posted the information about SIMDe here. Before this I compiled a project with SIMDe's mm_dp_ps, looked at the asm code in a debugger, and injected into the game a jmp to new memory containing the asm code from the compiled exe, lol. I even got as far as the Cyberpunk logos, but it was too crazy, so I stopped this research. |
Pin can only include 3 very specific headers. If you are trying to extend the new icudt.dll that's a completely different approach. |
@AndreyTykhonov thanks! Looks very promising! Actually I've been looking for implementations since I crashed into those AVX instructions after the prologue. Pintool does implement everything, but it doesn't allow using its implementations freely. If you wish to add support for it, you will have to include SIMDe headers, add another
Refactor the calls like:
And the trickiest part - find all those usages of I used a modified version of |
Wow! Thanks! Did you fix project compilation with the SIMDe headers? I'm not too good at C++; after including the header I got hundreds of errors :D If you can attach a project with the SIMDe header connected that compiles, I would be grateful! And about offsets: I already got the offsets of all SSE 4.1 & SSE 4.2 instructions for version 1.04. Here are my results, maybe you'll find a use for them (beware: there are starting offsets like Cyberpunk2077.AK::WriteBytesMem::Count, but I can recreate the file with only Cyberpunk2077 as the reference start point). Actually I tried to fix some instructions with assembler, so this is what I fixed:
My asm code contained errors, so some of the methods returned wrong values; I think that's the cause of the freeze and crash. But it looks like these are all the methods needed to get to the menu. Anyway, here is the full list of SSE functions that the game uses (if I haven't forgotten something):
|
So glad to see this thread, because I was doing exactly what is described here: I've found all SSE 4.1 and SSE 4.2 function calls and was emulating them with SIMDe. I'm stuck earlier on the path though: I'm experimenting with DPPS emulation, and after the first emulated DPPS call I get "Access violation exception when trying to access 0x00000000000". It seems that DPPS returns wrong values to the registers. My best guess is that after calling ExceptionHandler the values of the registers are restored to the stacked ones, and I was trying to resolve that. However, according to comments here, I might be wrong. SIMDe was successfully included into the project without any errors (a few warnings); however, I was using only the popcnt_hotpatch project. @AndreyTykhonov if this is not the case for you, let's investigate. For me it was very easy: I extracted the SIMDe sources into a subfolder next to the popcnt_hotpatch project and used relative paths for the includes.
As for this code - the issue is that variables a and b should be XMM registers. They come within the exception context as _M128A structure, so appropriate casting needs to be made. Unless I'm missing anything. I've ended up with something like:
And after the first call I'm getting "Access violation exception". With SDE Cyberpunk works. I'm working with 1.06 binary. Would be glad to get deeper into this. Suggestions? UPD. Actual code for DPPS emulation is a bit different than above:
|
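For readers following along, here is a minimal, portable sketch of what a DPPS emulation has to compute once both operands have been copied out of the exception context. It is not the code from this comment (that snippet did not survive in the thread). `Xmm128`, `make_xmm`, `lane` and `emulate_dpps` are hypothetical names; `Xmm128` only mirrors the two-halves layout of the Windows `M128A` structure so the example builds anywhere.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the Windows M128A layout (two 64-bit halves).
struct Xmm128 {
    std::uint64_t low;
    std::int64_t high;
};

// Pack four floats into the 16-byte register image (lane 0 first).
Xmm128 make_xmm(float x0, float x1, float x2, float x3) {
    const float lanes[4] = {x0, x1, x2, x3};
    Xmm128 out;
    std::memcpy(&out, lanes, sizeof out);
    return out;
}

// Read one float lane (0..3) back out of the register image.
float lane(const Xmm128& v, int index) {
    float lanes[4];
    std::memcpy(lanes, &v, sizeof lanes);
    return lanes[index];
}

// Scalar model of "dpps dst, src, imm8":
//   imm8 bits 4..7 choose which lane products take part in the sum,
//   imm8 bits 0..3 choose which result lanes receive the sum (others become 0).
Xmm128 emulate_dpps(const Xmm128& dst, const Xmm128& src, std::uint8_t imm8) {
    float t[4];
    for (int i = 0; i < 4; ++i) {
        const bool selected = (imm8 & (0x10u << i)) != 0;
        t[i] = selected ? lane(dst, i) * lane(src, i) : 0.0f;
    }
    // Pairwise summation, the order given in Intel's pseudocode for DPPS.
    const float sum = (t[0] + t[1]) + (t[2] + t[3]);

    float r[4];
    for (int i = 0; i < 4; ++i) {
        r[i] = (imm8 & (1u << i)) ? sum : 0.0f;
    }
    return make_xmm(r[0], r[1], r[2], r[3]);
}
```

In a real handler the result would be written back into the destination XMM slot of the context record before resuming. With SIMDe, the body would presumably reduce to a `simde_mm_dp_ps` call, whose mask argument must be a compile-time constant, so one call site per distinct immediate is needed.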
I've made some progress with SIMDe, no more "Access violation" issue. SIMDe seems pretty effective and generates something like 2-7 ASM lines instead of a single DPPS call. Now I need to build sseInstructions.json.txt, which is kind of tricky. I have the full list of instructions and their offsets, but I do not have the lengths of the instructions. I've used IDA Pro to get those, and I can get the lengths of instructions one by one. But there are 1739 matches for SSE 4.1 and SSE 4.2 instructions, so going through them manually via hotpatch.log is not an option. Neither is manually going through the IDA search results. The list of instructions provided by @AndreyTykhonov is different for Cyberpunk 1.06. It has also: But do not have: Can someone help me either with parsing the IDA results from the search occurrences window or with offsets, instructions + their lengths? |
@EvgeniySpinov glad to see progress on this! My list didn't contain AVX instructions since I'm not parsing them. 1851 SSE 4.1 / 4.2 instructions:
|
Looks exactly what is needed! Thank you for sharing. I do not have 1.1 though, but will get an update. Question meanwhile: could you please share a way how you generate this? Really curious of the approach and would like to use it for other projects as well. Also a question: In this call And one more question about file contents. Some of the offsets are calculated from functions like "Cyberpunk2077.AK::ReadBytesSkip::Count+D1AF". Is there a way to get absolute offset for all instructions? |
@EvgeniySpinov I can regenerate the list without "Cyberpunk2077.AK::ReadBytesSkip::Count+D1AF" if you need, with just offsets from the exe base position. All values are hexadecimal. My steps to generate the list:
|
Right, that doesn't look like something I'll be able to quickly reproduce :) I have some progress with an IDA script, but I propose we unite our efforts. Could you please regenerate the file with absolute offset positions? Meanwhile I'll try to write a wrapper for the JSON to translate those instructions into HOTFIX calls in C++ and implement them with SIMDe. If that works, then we can look into the details of getting the list of calls + offsets in a more automated way. |
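As an aside on the "symbol+offset" references discussed above: converting them to offsets from the exe base is mechanical once the offset of each named symbol is known. The sketch below assumes such a table is available from whatever tool produced the list. `SymbolTable`, `resolve_rva` and the numeric values are hypothetical and only illustrate the arithmetic.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical table: symbol name -> offset of that symbol from the exe base.
using SymbolTable = std::unordered_map<std::string, std::uint64_t>;

// Turn "Name+HEX" (or a bare "Name") into an offset from the exe base.
// Returns std::nullopt when the symbol is unknown or the hex part is malformed.
std::optional<std::uint64_t> resolve_rva(const std::string& ref,
                                         const SymbolTable& symbols) {
    const std::size_t plus = ref.rfind('+');
    const std::string name = ref.substr(0, plus);  // whole string if no '+'
    std::uint64_t displacement = 0;

    if (plus != std::string::npos) {
        const std::string hex = ref.substr(plus + 1);
        if (hex.empty()) return std::nullopt;
        std::size_t used = 0;
        try {
            displacement = std::stoull(hex, &used, 16);  // values are hexadecimal
        } catch (const std::exception&) {
            return std::nullopt;
        }
        if (used != hex.size()) return std::nullopt;  // trailing junk
    }

    const auto it = symbols.find(name);
    if (it == symbols.end()) return std::nullopt;
    return it->second + displacement;
}
```

Registering the module name itself with offset 0 lets plain "Cyberpunk2077+..." references go through the same function.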
Not that a general fix for AVX would hurt, but anyway my dudes wasn't that already fixed in patch 1.05 for cyberpunk? |
My understanding is that it was - I was able to play on my Phenom X6 1090T, which doesn't have AVX, with only the SSE 4.x patches. AVX was removed after a shitstorm on the CDPR forums from people with server Xeons, which do not have AVX either. |
Spent some time today moving forward on this one. Some of the instructions are represented in weird way. For example: Is that an address where dpp float should be taken for operation? IDA reports on this address: Also this one: In IDA: Anyone knows how to fetch second register value in C++ code from the exception? (without these instruction calls - I can get through few logos, apparently while game is loading the rest of the stuff. Emulated only DPPS for now) |
Ok, I've progressed through:
After experimenting with commenting out instruction calls, the game starts and crashes as it should (with an illegal instruction call). My guess is that this is somehow due to the number of instruction calls, heap size, etc., because commenting out different sets of calls allows the game to launch, so the problem is not with the calls themselves. IDA can also start the game in debug mode. The resulting DLL with all the calls is 1.5M. When I comment out around 10% of the instruction calls (even the same call, like DPPS for instance), the DLL might shrink to 300-400Kb and then the game launches. So the current observation is: big DLL - process crashes instantly; small DLL - process starts. @mirh @AndreyTykhonov Have you seen such behavior before? Do you know which direction I should dig into? |
Ok, guys, I've made it. Everything works. 1727 lines with various instruction calls. The problem is ... I get 3 fps. Same as with Intel SDE. Completely unplayable, as you may guess. How the hell does this guy do it: https://cs.rin.ru/forum/viewtopic.php?f=10&t=71329 Look for "SSE 4.x". His patch works perfectly - I get 30-40 fps hitting my GPU. |
Yeah, luther_d is one sick fella. |
I've based my updates on popcnt_hotfix. Do you think it's a better idea to use PIN to intercept instruction calls before they happen and emulate those calls with SIMDe, instead of exception handling? |
I'm not really the sharpest tool in the shed, to be honest. My uneducated guess, without any kind of actual profiling, is indeed that exception handling is the biggest performance offender. |
Great article, which means that SDE is already using PIN tool JIT compilation in order to intercept instruction calls before any exception. And performance is equal to our solution - which surprises me, tbh; I would expect SDE to be faster, since we're working with exceptions. As a POPCNT emulator, the idea of this tool is great - emulating only 1 instruction instead of a whole CPU architecture allows launching the game with minimal impact. However, the whole SSE 4.x stack is heavy. BTW, Intel SDE is still being developed, so there would probably be a way to emulate only a selected set of instructions. Haven't checked for popcnt, but there is probably a switch by now. There definitely is for SSE 4.1, 4.2, 4.3, etc. Need to get in touch with luther_d and understand how this could be tackled. My best guess is that luther_d is not emulating all of the instruction calls required. Likely he operates on a subfunction level, jumping over functions which contain SSE 4.x where possible and emulating their output when not. |
Hello from the SIMDe project! When using SIMDe to cope with SSE4.1 instructions not available on the running processor, do you compile using the highest SIMD level available (like SSE3, SSE2, etc..) or are you using the unoptimized fallback implementations? |
Hey @mr-c, thank you for coming to our bonfire :) You've got a great project and great fellows who help people like me to use it. If you mean compiler and linker options, then highest supported SIMD level is SSE2 for Phenom X6 1090T, which is default for MSVC 2019. I didn't change anything there. SSE3 is partially supported as figured later on: had to emulate pabsd and pshufb calls, cause they were causing invalid code exceptions. If you mean using SSE1,2 within SIMDe calls, then in here: simd-everywhere/simde#694 I was told that I should not mix them, i.e. either all native or SIMDe. So did I. |
:-) I'm the SIMDe cheerleader, all the credit goes to our amazing contributors! Yep, I meant compiler options. The MSVC equivalent of gcc's '-msse2', which seems to be According to https://www.cpu-world.com/CPUs/K10/AMD-Phenom%20II%20X6%201090T%20Black%20Edition%20-%20HDT90ZFBK6DGR%20(HDT90ZFBGRBOX).html I see that SSE3 is supported, but I don't see a MSVC command line option for that. Does MSVC automatically set |
Phenom X6 1090T seems to have incomplete SSE3 support. It supports IA SSE3, but not IA Supplemental SSE3. I do not know what the difference is, but the instructions mentioned above are SSE3 instructions and I still had to emulate them. But I've added SIMDE_ARCH_X86_SSE3 1 - that didn't trigger a full rebuild. It seems the functions I'm using from SIMDe mostly use SSE2 functions. |
@EvgeniySpinov Interesting! I wasn't aware of the SSE3 sub-levels. Can you remind me (maybe with a link) how SIMDe is being compiled/used? |
SSE3 is SSE3, SSSE3 is SSSE3. MSVC has way less automatic granularity than, say, gcc but still I think intrinsics should do it.
No, because like ogurets said, he's using probe and not JIT. |
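On the SSE3 vs. SSSE3 point: pabsd and pshufb belong to SSSE3 (Supplemental SSE3), which CPUID reports through a separate feature bit from SSE3. A small sketch of decoding the relevant bits from the ECX value returned by CPUID leaf 1 follows. `SimdLevels` and `decode_cpuid_leaf1_ecx` are hypothetical names, and the function takes the raw register value as input so it stays portable.

```cpp
#include <cstdint>

// Feature flags of interest, as reported in ECX by CPUID leaf 1.
struct SimdLevels {
    bool sse3;
    bool ssse3;
    bool sse41;
    bool sse42;
    bool popcnt;
    bool avx;  // CPU capability only; OS support (OSXSAVE/XGETBV) is a separate check
};

// Decode the ECX register value obtained from CPUID with EAX = 1.
SimdLevels decode_cpuid_leaf1_ecx(std::uint32_t ecx) {
    const auto bit = [ecx](unsigned n) { return ((ecx >> n) & 1u) != 0; };
    SimdLevels levels;
    levels.sse3   = bit(0);   // SSE3
    levels.ssse3  = bit(9);   // Supplemental SSE3 (pshufb, pabsd, ...)
    levels.sse41  = bit(19);  // SSE4.1 (dpps, ...)
    levels.sse42  = bit(20);  // SSE4.2
    levels.popcnt = bit(23);  // POPCNT
    levels.avx    = bit(28);  // AVX
    return levels;
}
```

The raw value can be obtained with `__cpuid` from `<intrin.h>` on MSVC or `__get_cpuid` from `<cpuid.h>` on GCC/Clang. A CPU that sets bit 0 but not bit 9 would match the behaviour described above.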
Oops, I was wrong, you should use |
Luther_d's solution is not based on exception handling. It requests new memory on game start and writes ASM code to that memory, which is executed instead of the unsupported instructions (for example, dpps xmm0, xmm1, 7F and dpps xmm1, xmm2, 7F get DIFFERENT CODE). After the new memory is created, he injects jmps to it: dpps xmm0, xmm1, 7F becomes jmp OFFSET_IN_MEMORY plus nops, so this solution is very fast. As I understand it, luther_d's solution is not automated, since he keeps releasing fixes with new overloads; I think he restarts the game and fixes things until it works, which is why it's not going to be updated. We could possibly port the fix to new game versions in a few steps:
But there can be new instructions, so without good ASM knowledge we can't do much. I tried to use the same method at the beginning |
@mr-c I've added SIMDE_X86_SSE3_NATIVE - it didn't trigger a rebuild either. I think SSE3 is rarely used, in SIMDe as well. Regarding SIMDe compilation: I didn't get your question. Do you mean which command line is used to compile the DLL, or how SIMDe is used in the source code? @AndreyTykhonov That is possible. Is that an assumption, or have you spoken to him? Asking because there are a few questions:
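To make the jmp-plus-nops idea concrete, here is a sketch of how the replacement bytes for one patched instruction could be computed. It only builds the byte sequence; it does not write to process memory and is not luther_d's code. `make_jmp_patch` and its parameters are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Build the bytes that overwrite one unsupported instruction:
// "jmp rel32" (E9 + 32-bit displacement) padded with NOPs (90) up to the
// original instruction length, so following instructions stay aligned.
std::vector<std::uint8_t> make_jmp_patch(std::uint64_t instr_addr,
                                         std::size_t instr_len,
                                         std::uint64_t stub_addr) {
    constexpr std::size_t kJmpLen = 5;  // 1 opcode byte + 4 displacement bytes
    if (instr_len < kJmpLen) {
        throw std::invalid_argument("instruction too short for jmp rel32");
    }
    // rel32 is relative to the address of the byte after the jmp.
    const std::int64_t delta =
        static_cast<std::int64_t>(stub_addr - (instr_addr + kJmpLen));
    if (delta < std::numeric_limits<std::int32_t>::min() ||
        delta > std::numeric_limits<std::int32_t>::max()) {
        throw std::out_of_range("stub is not within +/-2 GiB of the instruction");
    }

    std::vector<std::uint8_t> patch(instr_len, 0x90);  // start with all NOPs
    patch[0] = 0xE9;
    const auto rel = static_cast<std::uint32_t>(static_cast<std::int32_t>(delta));
    for (std::size_t i = 0; i < 4; ++i) {
        patch[1 + i] = static_cast<std::uint8_t>(rel >> (8 * i));  // little-endian
    }
    return patch;
}
```

A 6-byte register-form dpps therefore becomes a 5-byte jmp plus one nop. The stub at `stub_addr` would have to end with a jump back to `instr_addr + instr_len`, and on Windows the target page presumably has to be made writable (for example with `VirtualProtect`) before the bytes are copied in.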
Regarding ASM code generation - I think it's possible to automate with SIMDe: create all calls combinations that is met within the EXE including:
Get their code in ASM disabling all (or almost all) compiler optimizations. Then do like you've said: before DPPS call for instance - jump to new memory region with needed stuff + NOOP. Question is: how to force application to behave differently on particular instruction call from DLL? |
I compared RDR2 & Cyberpunk with and without fix, it replaces unsupported instructions with jumps to new memory
As I understand, he is modifying executable memory after game started, there is no hook on exceptions, fix probably contains offsets that should be replaced, not automated at all, like in JSON that I created
I think theoretically it's possible, but I see a problem with registers. I don't know how you can manage something like "dpps xmm01, [rax+r1*4], 7F" inside SIMDe, as well simple registers like xmm0, xmm1
That's why I said that every instruction is different in memory. He basically created different methods for each instruction type: dpps xmm0, xmm1, 7F and dpps xmm1, xmm2, 7F is in different memory regions, so it can be easily jumpable without additional parameters |
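The register question can be answered from the instruction bytes themselves: for the register-to-register form, the ModRM byte (plus an optional REX prefix) names both XMM operands. Below is a sketch of decoding that form only. `DppsRegForm` and `decode_dpps_reg_form` are hypothetical names, and memory operands such as [rax+...] are deliberately not handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Decoded fields of "dpps xmmA, xmmB, imm8" (register-to-register form only).
struct DppsRegForm {
    unsigned dst;        // destination XMM register index (0..15)
    unsigned src;        // source XMM register index (0..15)
    std::uint8_t imm8;   // immediate mask
    std::size_t length;  // total instruction length in bytes
};

// Encoding: 66 [REX] 0F 3A 40 /r ib. Returns std::nullopt for anything else,
// including the memory-operand form.
std::optional<DppsRegForm> decode_dpps_reg_form(const std::uint8_t* p,
                                                std::size_t n) {
    std::size_t i = 0;
    if (n == 0 || p[i] != 0x66) return std::nullopt;
    ++i;

    std::uint8_t rex = 0;
    if (i < n && (p[i] & 0xF0) == 0x40) rex = p[i++];  // optional REX prefix

    if (n - i < 5) return std::nullopt;  // need 0F 3A 40, ModRM, imm8
    if (p[i] != 0x0F || p[i + 1] != 0x3A || p[i + 2] != 0x40) return std::nullopt;

    const std::uint8_t modrm = p[i + 3];
    if ((modrm >> 6) != 3) return std::nullopt;  // mod != 11b: memory operand

    DppsRegForm out;
    out.dst = ((modrm >> 3) & 7u) | ((rex & 0x04) ? 8u : 0u);  // ModRM.reg + REX.R
    out.src = (modrm & 7u) | ((rex & 0x01) ? 8u : 0u);         // ModRM.rm + REX.B
    out.imm8 = p[i + 4];
    out.length = i + 5;
    return out;
}
```

For the bytes 66 0F 3A 40 C1 7F this yields dst = 0, src = 1, imm8 = 0x7F, that is dpps xmm0, xmm1, 7F. The decoded indices could then select the matching XMM slots in the exception context, or the matching pre-generated stub.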
Thanks for sharing your observations.
Do you or anyone in this thread know how to perform such jumps in executable memory from a DLL, i.e. knowing the offset, jump to another space before the instruction is executed? You said that you were experimenting with ASM calls previously. Did you try the same approach?
It seems that not all of the 1727 instruction calls play critical role in here. I had around 12-15 jump over calls (just jump to next instruction) which were hit by the program (I've checked) and game still worked fine. While if to jump over all instructions - that leads to a crash before the 1st logo. One more way to move forward without getting too deep into ASM: get list of instructions involved in rendering (have some ideas how to do this) and move them to the head of the list, so they would be found soonest while exception is processed. I've noticed that until loading a save game - my fix was behaving the same as with luther fix, i.e. 100% CPU load during startup, smooth video feed, smooth menu, etc. While during loading a saved game it clearly started to lag. Probably game utilizes instructions closer to the end of the list during loading and rendering and this causes additional delays. |
How do you go from SIMDe source code to object code? What transformations, compilation options, and defines are set? |
Compiler:
Linker:
While we're searching for other options, I've attached the most recent source code, just to give more information on how it is done currently. |
Thanks @EvgeniySpinov ; I'm not seeing where you set |
Right, I did this before inclusion of sse4.2.h like so: However after no effect on resulting DLL removed it. I've put it back in my sources just to make sure it's always there. |
Ok, I've built a quick profiler which outputs the "hottest" instruction calls. There were not too many of those, around 200, so I've put them at the beginning of the list (they were closer to the end before). The performance gain was around 2 times, so instead of 3 fps I get 5-6. Already better than SDE, but still too slow. It looks like the biggest penalty is exception handling. Need to find a way to modify executable space in order to jump before the instructions and not enter exception handling. No other way around it. |
Dug around the PIN tool a bit with a view to using JIT. In theory it looks good, but the documentation warns about performance impact, and according to the README.md of this repo @ogurets already tried this approach, using the PIN tool just for a popcnt instruction call, and reported a visible performance hit. I have 1 more idea for how to tackle the problem without PINs and SDEs:
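The reordering described above can be sketched as follows, assuming the handler walks the patch list linearly and a profiler has recorded a hit count per entry. `PatchEntry`, `order_by_hits` and `find_entry` are hypothetical names, not the structures used in popcnt_hotpatch.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One patched instruction: where it is, how long it is, how often it was hit.
struct PatchEntry {
    std::uint64_t offset;  // offset from the exe base
    std::uint8_t length;   // instruction length in bytes
    std::uint64_t hits;    // count recorded by the profiler
};

// Move the most frequently hit entries to the front so a linear scan
// finds them after fewer comparisons. Ties keep their original order.
void order_by_hits(std::vector<PatchEntry>& entries) {
    std::stable_sort(entries.begin(), entries.end(),
                     [](const PatchEntry& a, const PatchEntry& b) {
                         return a.hits > b.hits;
                     });
}

// Linear lookup, as assumed for the exception handler; nullptr if not found.
const PatchEntry* find_entry(const std::vector<PatchEntry>& entries,
                             std::uint64_t offset) {
    for (const PatchEntry& e : entries) {
        if (e.offset == offset) return &e;
    }
    return nullptr;
}
```

A hash map keyed by offset would make the lookup cost independent of ordering, but as noted above the exception dispatch itself is likely the dominant cost, so either change only trims the smaller part of the overhead.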
Not so neat as just adding DLLs, but after looking into way of modifying executable code from DLL - that doesn't seem an easy way to do, or I'm missing something. What do you think? |
@EvgeniySpinov probably the best way is to start the game suspended, inject the DLL and unfreeze the game. I don't know how to force C++ to mark an input variable as a specific register, like: As I mentioned before, we can actually update the Luther_d fix to the latest version, and Luther_d already updated the fix to version 1.1, but this will work only with Cyberpunk, not for the next games. Based on the Steam Hardware Survey, only 1.5% of PCs don't have SSE4.1. Even if we can play Cyberpunk 2077 at 40 fps, the next games will be slower. In several years new games will give 20 fps, and I'm not even talking about Phenom single-thread performance. Let's face it: it's time to upgrade the PC; even a budget i3 will give TWICE as much FPS. I think it's not worth continuing the research and we should close this issue. Do you agree? |
The only issue I'm seeing is that Windows DEP randomizes memory space for DLLs each time application launches. This is done to prevent DLLs from doing exactly this thing: modifying executable memory and jump to addresses populated by DLL. Cause viruses doing this as well.
This is a very valid point. I'm also struggling to find a way to use exact registers, for reading and writing. It seems that kind of access is at the ASM level :(
Yep, I know that luther_d has done fix update and I'm happily using it. The thing is yes - those fixes, specially from luther_d are not available for all games with SSE 4.x, so that was kind of a vector for me.
You're absolutely right, and this is a valid point which I was also thinking of. I'm glad to see that you have the same rationale and shared it. However, I'm doing this not for actual gaming on a Phenom, but rather to understand C++, DLLs and instruction hacking better. I was really excited to see and use the SIMDe project for that and to collaborate with you on this one. People with a Phenom (like me) do not even need to upgrade to play games - now there is a variety of clouds where you can game without a hassle. I tried that with Cyberpunk as well, before luther_d's first fix - Full HD, 60 fps rock solid on Nvidia GFN. Not sure I want to go into ASM space - that seems too much, but currently it also seems to be the only way to get the level of fixes luther_d provides. So I'm kind of puzzled. |
I'm pretty sure DEP can be disabled |
https://github.com/simd-everywhere/simde
This project looks promising. I tried to add mm_dp_ps support (to fix SSE 4.1 in Cyberpunk) but failed to compile after that. Maybe you will be interested