Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VM calls error handler for memory allocation errors #880

Closed
khvzak opened this issue Mar 25, 2023 · 12 comments · Fixed by #895
Closed

VM calls error handler for memory allocation errors #880

khvzak opened this issue Mar 25, 2023 · 12 comments · Fixed by #895
Assignees
Labels
bug Something isn't working

Comments

@khvzak
Copy link
Contributor

khvzak commented Mar 25, 2023

I work on implementing memory limits for Luau in the mlua crate and noticed that the VM calls error handler (provided to lua_pcall) for memory allocation errors (LUA_ERRMEM code).

PUC Lua does not do this (from 5.1 doc):

LUA_ERRMEM: memory allocation error. For such errors, Lua does not call the error handler function.

In reality it triggers LUA_ERRERR later in the error handler (eg. when collecting traceback) which hides the original LUA_ERRMEM code.

Also, in this case there would be no error object on top the stack (and any returned object will be overwritten with LUA_MEMERRMSG)

@zeux
Copy link
Collaborator

zeux commented Mar 28, 2023

LUA_ERRMEM: memory allocation error. For such errors, Lua does not call the error handler function.

Hmm, interesting. My recollection is that the policy we have for this is:

  • xpcall can handle OOM errors
  • in case the error handler itself results in another OOM, xpcall fails - just as it would if the error handler failed for any other reason

And yeah I think the expectation is that the error message on the stack is synthesized during the call to xpcall's error function. We "pin" the error message string (LUA_MEMERRMSG) to make sure it's always resident in memory.

One possibility is that we should preserve the type of error in case xpcall error handler fails - so instead of generating LUA_ERRERR, we could keep LUA_MEMERR if the handler failed with OOM?

The reason why it can be beneficial for xpcall to handle OOM is that OOM may trigger due to large allocations, but there might be enough memory still left to actually process the error.

@khvzak
Copy link
Contributor Author

khvzak commented Mar 28, 2023

And yeah I think the expectation is that the error message on the stack is synthesized during the call to xpcall's error function. We "pin" the error message string (LUA_MEMERRMSG) to make sure it's always resident in memory.

Unfortunately it happens only after calling error handler.
Error message is set by calling seterrorobj(L, status, oldtop); here

I agree that it's beneficial to handle OOM, but currently there is no way to figure out (in the error handler) what was the original error code. In case of LUA_ERRMEM on top of the stack can be anything but original exception.
Currently as workaround I check a special flag set by allocator func, that OOM happened, and bypassing the message handler (code).

One possibility is that we should preserve the type of error in case xpcall error handler fails - so instead of generating LUA_ERRERR, we could keep LUA_MEMERR if the handler failed with OOM?

This sounds like a good solution, but would solve the problem only partially (ie. not hiding original memory error).

@zeux
Copy link
Collaborator

zeux commented Mar 28, 2023

Unfortunately it happens only after calling error handler.

Yup you're right. That's definitely a problem. Now, would it help you if that was fixed and there was an actual error message "not enough memory" on the stack?

@khvzak
Copy link
Contributor Author

khvzak commented Mar 28, 2023

Yup you're right. That's definitely a problem. Now, would it help you if that was fixed and there was an actual error message "not enough memory" on the stack?

I think this would help.
In my case, also preserving LUA_ERRMEM if error handler OOMed is not neccessary, since I always can collect traceback by instructing memory allocator to relax limits.

But in general, keepking LUA_ERRMEM for handler OOM seems a good idea too.

--
For non-api users this could be a bit misleading since error("not enough memory") would be indistinguishable from real memory error. But I don't have strong opinion on this case.

@zeux
Copy link
Collaborator

zeux commented Mar 28, 2023

But in general, keepking LUA_ERRMEM for handler OOM seems a good idea too.

Yeah, the caveat here is that needs a different interface. You are using lua_pcall presumably but that is bound to the normal Lua C stack contract - it can't smuggle an extra integer easily, other than passing it on the stack (which changes the "signature" of the error callback), and it can't put the error in the actual thread status (lua_status) because threads must only execute code in "OK" status. So for this to work we'd need a custom "C protected call" interface - maybe similar to lua_cpcall, but with a different callback signature that's richer than lua_CFunction.

At any rate it looks like we should fix the error message issue. That's going to improve the behavior regardless of whether luaD_pcall is used by a C host, or a Luau script. Further improvements would involve adding a new C interface and can follow later.

We'll take a look at this shortly.

@zeux zeux self-assigned this Mar 28, 2023
@khvzak
Copy link
Contributor Author

khvzak commented Mar 30, 2023

@zeux maybe there is an option to add new function, let's say int lua_laststatus(lua_State *L) to return status of last execution of luaD_rawrunprotected.
Then error (message) handler can use it to retrieve a reason why it was called.

This would not require +1 interface.

@zeux
Copy link
Collaborator

zeux commented Apr 3, 2023

This would mechanically work, but that interface doesn't seem super clean and may not be future proof.

That said, if you're using lua_pcall, then you would get the correct error (LUA_ERRMEM) as a return value from lua_pcall, you just won't have this inside the error handler. If the error handler fails again with OOM, we need to preserve the ERRMEM error code instead of doing what we do now, which is replacing it with LUA_ERRERR unconditionally. I think this is what you meant above, apologies if I misread this before. That gives the calling C code an opportunity to correctly translate the error regardless of what the handler does.

The only alternative to this right now that I see is just reverting to the behavior of Lua 5.x (which is to never call the handler in the first place), but that seems a little less satisfying. We can probably implement the more complete solution first, and only remove the call to error hander on OOM later if we need to.

@zeux
Copy link
Collaborator

zeux commented Apr 4, 2023

Something like this is what I had in mind. One problem is that lua_pcall must obscure the status in case the error handler succeeds; if that is not done (else branch in errstatus != 0 check), on the script side xpcall error function result gets effectively ignored. Neither of these is ideal, which might mean it's beneficial to revert to Lua 5.x behavior after all...

diff --git a/VM/include/lua.h b/VM/include/lua.h
index 9fb35555051..94c7a599aec 100644
--- a/Client/Luau/VM/include/lua.h
+++ b/Client/Luau/VM/include/lua.h
@@ -28,7 +28,7 @@ enum lua_Status
     LUA_OK = 0,
     LUA_YIELD,
     LUA_ERRRUN,
-    LUA_ERRSYNTAX,
+    LUA_ERRSYNTAX, // legacy error code, preserved for compatibility
     LUA_ERRMEM,
     LUA_ERRERR,
     LUA_BREAK, // yielded for a debug breakpoint
diff --git a/VM/src/ldo.cpp b/VM/src/ldo.cpp
index 61146155235..91a3766267d 100644
--- a/VM/src/ldo.cpp
+++ b/VM/src/ldo.cpp
@@ -17,6 +17,8 @@
 
 #include <string.h>
 
+LUAU_FASTFLAGVARIABLE(LuauBetterOOMHandling, false)
+
 /*
 ** {======================================================
 ** Error-recovery functions
@@ -79,22 +81,17 @@ public:
 
     const char* what() const throw() override
     {
-        // LUA_ERRRUN/LUA_ERRSYNTAX pass an object on the stack which is intended to describe the error.
-        if (status == LUA_ERRRUN || status == LUA_ERRSYNTAX)
-        {
-            // Conversion to a string could still fail.  For example if a user passes a non-string/non-number argument to `error()`.
+        // LUA_ERRRUN passes error object on the stack
+        if (status == LUA_ERRRUN || (status == LUA_ERRSYNTAX && !FFlag::LuauBetterOOMHandling))
             if (const char* str = lua_tostring(L, -1))
-            {
                 return str;
-            }
-        }
 
         switch (status)
         {
         case LUA_ERRRUN:
-            return "lua_exception: LUA_ERRRUN (no string/number provided as description)";
+            return "lua_exception: runtime error";
         case LUA_ERRSYNTAX:
-            return "lua_exception: LUA_ERRSYNTAX (no string/number provided as description)";
+            return "lua_exception: invalid syntax";
         case LUA_ERRMEM:
             return "lua_exception: " LUA_MEMERRMSG;
         case LUA_ERRERR:
@@ -553,16 +550,33 @@ LUA_SECURE int luaD_pcall(lua_State* L, Pfunc func, void* u, ptrdiff_t old_top,
         // call user-defined error function (used in xpcall)
         if (ef)
         {
-            // if errfunc fails, we fail with "error in error handling"
-            if (luaD_rawrunprotected(L, callerrfunc, restorestack(L, ef)) != 0)
-                status = LUA_ERRERR;
+            if (FFlag::LuauBetterOOMHandling)
+            {
+                // push error object to stack top if it's not already there
+                if (status != LUA_ERRRUN)
+                    seterrorobj(L, status, L->top);
+
+                // if errfunc fails, we fail with "error in error handling" or "not enough memory"
+                int errstatus = luaD_rawrunprotected(L, callerrfunc, restorestack(L, ef));
+
+                if (errstatus != 0)
+                    status = errstatus == LUA_ERRMEM ? LUA_ERRMEM : LUA_ERRERR;
+                else
+                    status = LUA_ERRRUN; // erase original error code to preserve the error object returned by errfunc
+            }
+            else
+            {
+                // if errfunc fails, we fail with "error in error handling"
+                if (luaD_rawrunprotected(L, callerrfunc, restorestack(L, ef)) != 0)
+                    status = LUA_ERRERR;
+            }
         }
 
         // since the call failed with an error, we might have to reset the 'active' thread state
         if (!oldactive)
             L->isactive = false;
 
-        // Restore nCcalls before calling the debugprotectederror callback which may rely on the proper value to have been restored.
+        // restore nCcalls before calling the debugprotectederror callback which may rely on the proper value to have been restored.
         L->nCcalls = oldnCcalls;
 
         // an error occurred, check if we have a protected error callback

@zeux
Copy link
Collaborator

zeux commented Apr 4, 2023

One other interesting case is, what if the function fails with a regular error code, but then the error handler fails with OOM? This is subtly different from other cases in that it's not clear what error we should return, and what error code we should return.

@zeux
Copy link
Collaborator

zeux commented Apr 4, 2023

Alright, I've tweaked the logic to the extent it can possibly go to handle some of these corner cases carefully. A preview version of this change should land at the end of the week behind a fast flag; if that adequately solves all concerns here we can close this.

Briefly, the new behavior is:

  • error handler will be called on OOMs
  • if it succeeds, lua_pcall will return LUA_ERRMEM + whatever result the error handler returns will be on stack
  • if it fails with OOM, lua_pcall will return LUA_ERRMEM + "not enough memory"
  • if it fails for unrelated reasons, lua_pcall will return LUA_ERRERR + "error in error handling". This allows the caller to distinguish between OOM and other errors in the handler.
  • if the function itself fails for non-OOM reasons, but lua_pcall fails with OOM, lua_pcall will also return LUA_ERRERR + "error in error handling". This is so that the caller knows that the function failed for non-OOM reasons.

@khvzak
Copy link
Contributor Author

khvzak commented Apr 4, 2023

Thanks @zeux
The diff looks good. Wondering, the status LUA_ERRSYNTAX should be impossible in Luau, is there any reason keep it in the conditions?

  • if it fails with OOM, lua_pcall will return LUA_ERRMEM + "not enough memory"
  • if the function itself fails for non-OOM reasons, but lua_pcall fails with OOM, lua_pcall will also return LUA_ERRERR + "error in error handling". This is so that the caller knows that the function failed for non-OOM reasons.

This sounds reasonable. The only question I may have, is have this (OOM handling) functionality ever been used? Given that it's broken (results from handler are ignored), it's unlikely?

If think hypothetically, even collecting traceback on OOM can be useless for many cases, as allocation can fail at any point (eg creating a new string), not necessary at place with largest allocation.

@zeux
Copy link
Collaborator

zeux commented Apr 5, 2023

This sounds reasonable. The only question I may have, is have this (OOM handling) functionality ever been used? Given that it's broken (results from handler are ignored), it's unlikely?

Well, any script that calls xpcall is implicitly subject to this behavior :) I don't know to what extent the OOMs are prevalent, but in our internal applications we have seen OOM caught and handled in the wild via pcall, so this will probably clean up some spurious reports due to erroneous first argument in these cases. And yeah, some OOMs are cascading and some are possible to recover from - right now we don't implement emergency GC, which would be useful for recovery here.

The other note is that lua_pcall can be used to collect the traceback without any heap allocations if you do this entirely on C side, using lua_getinfo + assembling the call stack on the native side. This can be used to report the error to telemetry - we do something like this at Roblox, although instead of lua_pcall we use lua_resume (which leaves the error stack in tact at an error).

the status LUA_ERRSYNTAX should be impossible in Luau, is there any reason keep it in the conditions

It will be mostly cleaned up after this change.

andyfriesen added a commit that referenced this issue Apr 7, 2023
* `table.sort` was improved further. It now guarentees N*log(N) time
complexity in the worst case.
* Fix #880

We are also working on fixing final bugs and crashes in the new type
solver.

On the CodeGen front we have a few things going on:
* We have a smarter register allocator for the x86 JIT
* We lower more instructions on arm64
* The vector constructor builtin is now translated to IR

---------

Co-authored-by: Arseny Kapoulkine <arseny.kapoulkine@gmail.com>
Co-authored-by: Vyacheslav Egorov <vegorov@roblox.com>
RomanKhafizianov pushed a commit to RomanKhafizianov/luau that referenced this issue Nov 27, 2023
* `table.sort` was improved further. It now guarentees N*log(N) time
complexity in the worst case.
* Fix luau-lang/luau#880

We are also working on fixing final bugs and crashes in the new type
solver.

On the CodeGen front we have a few things going on:
* We have a smarter register allocator for the x86 JIT
* We lower more instructions on arm64
* The vector constructor builtin is now translated to IR

---------

Co-authored-by: Arseny Kapoulkine <arseny.kapoulkine@gmail.com>
Co-authored-by: Vyacheslav Egorov <vegorov@roblox.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
2 participants