Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Never panic during execution proof check #3504

Merged
merged 20 commits into from
Sep 11, 2019

Conversation

svyatonik
Copy link
Contributor

closes #3470

Description of problem: remote execution proof only contains parts of state (trie) that were touched during execution on remote node. Like (simplifying) if full state is { K1=V1, K2=V2, K3=V3 }, then remote proof could be { K2=V2 }. And if, when executed locally, code will try to access K3, it'll probably panic, because as we all know Externalities not allowed to fail within runtime. So this assumption isn't valid for this case.

When I was thinking about possible solution, I assumed that:

  1. we do not want to make Externalities methods return Result - this'll lead either to changing the whole runtime(s) to return results everywhere, or infinite number of expects in runtime code, or the same number of expects in some glue code (WasmExecutor, runtime support, ...);
  2. we only want to capture unwindings of panics from within Externalities (that's what happens now).

So the solution I suggest is to:

  1. during execution proof check, mark state backend as 'untrusted';
  2. if state backend is untrusted, then we're turning on special panic handler mode (NeverAbort), which will ignore all further changes to Abort until we drop the guard;
  3. in addition to Externalities there's now StorageExternalities. The main difference is that all methods of all implementations of StorageExternalities should only modify its internal state (like OverlayChanges), which must be discarded if execution fails;
  4. in wasm executor there are now two methods that provide access to Externalities: with_ext - will work as before. with_storage_ext - will provide access only to StorageExternalities part AND will capture all panics, turning them into Error::Runtime;
  5. so, when working with 'untrusted state backend' if panic occurs within StorageExternalities when wasm code is executing, it will be captured and transformed into execution error, allowing us to handle this properly. As before, if 'trusted state backend' is used (it is, in all cases, but the remote execution proof check), all externalities panics will lead to process abort.

@svyatonik svyatonik added A0-please_review Pull request needs code review. M4-core labels Aug 28, 2019
@svyatonik svyatonik requested a review from pepyakin as a code owner August 28, 2019 15:02
Copy link
Contributor

@Demi-Marie Demi-Marie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar nits. Also, a proper enum should be used instead of a Boolean flag.

core/state-machine/src/lib.rs Outdated Show resolved Hide resolved
core/state-machine/src/lib.rs Outdated Show resolved Hide resolved
core/state-machine/src/lib.rs Outdated Show resolved Hide resolved
core/state-machine/src/lib.rs Show resolved Hide resolved
core/state-machine/src/lib.rs Outdated Show resolved Hide resolved
core/state-machine/src/lib.rs Outdated Show resolved Hide resolved
Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>
@svyatonik svyatonik requested a review from kianenigma as a code owner August 29, 2019 05:12
/// Unwind when panic occurs.
Unwind,
/// Always unwind even if someone changes strategy to Abort afterwards.
NeverAbort,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no, this is worrysome because if I see force_abort it doesn't mean that it will abort. The name force_abort doesn't add clarity as well...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I agree :) I could rename to unwind(), abort() and force_unwind() if you think it is better. Or you could suggest some other naming

Copy link
Contributor

@pepyakin pepyakin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes looks good broadly, however...

TBH, I wasn't really happy about this panic_abort situtation in externalities which seemed to be clunky, error prone and hard to reason about. But this PR brings it to a new level.

Now we admit that panics can happen in not runtime code and in some situations they are OK even though they are forced to abort the process. To add on that this looks esp. weird in the light of our proof-or-remove policy for the panics.

I agree with your point that this case is more of a edge-case and the runtime should not care about it (after all what kind of handling we can do there anyway? The only reasonable way I see is to panic). I am just not sure about the solution.

And yeah, I don't have a better idea either...

core/executor/src/wasm_executor.rs Outdated Show resolved Hide resolved
core/state-machine/src/lib.rs Outdated Show resolved Hide resolved
core/state-machine/src/lib.rs Outdated Show resolved Hide resolved
core/state-machine/src/lib.rs Outdated Show resolved Hide resolved
core/state-machine/src/lib.rs Outdated Show resolved Hide resolved
core/state-machine/src/lib.rs Show resolved Hide resolved
core/state-machine/src/lib.rs Outdated Show resolved Hide resolved
svyatonik and others added 3 commits August 30, 2019 07:27
Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>
Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>
@svyatonik
Copy link
Contributor Author

I'm not super-happy with this solution too, but also was unable to find better solution for 2 days. Tried many options (could share some on request), but this one looked 'mostly OK' to me.

And let me be a devil's advocate here - expects in Ext implementation do not fit proof-or-remove policy, for sure. And I would prefer them not to be there. Nevertheless, there were so many issues that we have found + not ignored simply because process has aborted + fixed because they're panics.

@pepyakin
Copy link
Contributor

pepyakin commented Sep 2, 2019

To be clear, I am not proposing removing panics there and I also share the opinion that aborting the process is better than swallowing/coercing these panics to runtime execution errors.

Ok, I think a better solution might be to centralize panic handling.

I.e. instead of directly specifying code scopes where either to abort, to unwind or to forceful abort, we could instead do something like this:

           ┌──────────────────────────────┐   
           │  > Panics are caught here <  │   
           │Handling depends on source and│   
           │          exec mode           │   
        ┌──┴──────────────────────────────┴──┐
        │                                    │
        │                                    │
        │  ▲                                 │
           │                                  
           │                                  
           │    ┌─────────────────┐           
           │    │                 │           
           │    │                 │           
  Panics   │    │     Runtime     │           
 bubble up │    │                 │           
 in stack  │    │                 │           
           │    │                 │           
           │    └─────────────────┘           
           │                                  
           │    ━━━━━━━━━━━━━━━━━━━           
           │                                  
                   Externalities              

We track the source of a panic to know where a specific panic came from: runtime or externalities. All panics unwind regardless where they came from. We catch them before entering to the runtime code. We handle these panics depending on the mode (trusted/not trusted) and source. E.g. we would return a runtime error if it came from the runtime, if panic came from externalities, we would return a special backend error in case we are in untrusted backend mode or straight abort if the backend is trusted.

@svyatonik
Copy link
Contributor Author

I'm not sure I got the idea - your suggestion (at least how I understood it) is to throw ExternalitiesError() from all ext.rs panics. But there's no way to associate some data with panic in rust, right? So we must store the 'type' of panic somewhere else. Some global var? So basically - either we will add something like

match storage.get(hash) {
  Ok(value) => Ok(value),
  Err(error) => {
     i_m_going_to_panic_inside_externalities();
     panic!("Externalities not allowed to fail within runtime");
  }
}

, or

let ext_result = catch_unwind(|| { ext.storage(hash); })
if let Err(error) = ext_result {
     i_m_going_to_panic_inside_externalities();
     resume_unwind(error);
} 

If that's what you suggesting, I do not see - why it is better than scoped approach :/

@pepyakin
Copy link
Contributor

pepyakin commented Sep 3, 2019

Yeah, basically you can think of it as "throwing an exception" and yes, you are right, there is no way to directly associate a custom payload with a panic. However, we already doing precisely that with panic-handler.

So code-wise my idea is something like this.

Externalities could look almost the same from outside.

	fn original_storage(&self, key: &[u8]) -> Option<Vec<u8>> {
		let _guard = ExtGuard::ext_guard();
		self.backend.storage(key).expect(EXT_NOT_ALLOWED_TO_FAIL)
	}

However, they use an ext_guard or something. ExtGuard can also answer the is_in_ext() -> bool query. Note that it is similar to AbortGuard in that it also has a thread_local! flag. However, in contrast to AbortGuard, ExtGuard doesn't change any panic semantics - it is just a dumb flag.

The consequence of this change is that all panics both from the (native) runtime and from externalities always unwind.

Then, somewhere where we call into a runtime, we put a single centralized panic handler:

match catch_unwind(|| call_into_runtime(..)) {
  Ok(result) => Ok(result),
  Err(_) if untrusted_backend && ExtGuard::is_in_ext() => Err(InvalidProof),
  err @ Err(_) => resume_unwind(err),
}

Advantage of this approach is that you'd have a single place where you decide what to do in case of panic, instead of overriding logic in specific places.

@svyatonik
Copy link
Contributor Author

OK, I still do not think it is any better than proposed implementation. That's why:

  1. ExtGuard won't work that way, simply because ExtGuard::drop() will be called on panic, which will set is_in_ext to false again. So we'll either need additional catch_unwinds, or something like:
fn fn_from_ext() {
	we_are_in_ext();
	// do what required
	we_are_not_in_ext();
}

in every method of Ext (and all other non-test Externalities implementations that will ever be created);
2) actually, some kind of ExtGuard will only be helpful where we call catch_unwind - we may want clear that flag. And, tbh, we must clear it, because it has very strange semantics - 'last panic has happened inside Externalities'. This semantics is only defined within runtime call. Compare this to general panic handler strategy, which is always valid;
3) if we'll use resume_unwind, that would mean that the process won't abort if panic happens in non-main thread. I think we agreed that this isn't good;
4) if we'll abort instead of resume_unwind, then we're losing location+backtrace of original panic. We can try to save this in the same global variable, but this makes the whole construction even worse. Like infamous GetLastError function from WinApi;
5) and last, but not the least, we'll have to mark closure used in catch_unwind as AssertUnwindSafe. And though the handler you're proposing should be fine with that, I still do not like the fact - objects that will be passed to that closure aren't actually safe for unwinding. That's why I've separated Externalities and StorageExternalities traits.

@pepyakin
Copy link
Contributor

pepyakin commented Sep 3, 2019

I see really good points there! However, it seems to me that they mostly address the particular code snippet. Of course, I am not even sure if it compiles and there might be other issues: I haven't delved into the code to check everything. IMO they should be solvable (e.g. pt.1 is solved by using a closure) if we put some thought into it (OTOH maybe some local hacks would be required.).

It seems to be that the main issue is with

This semantics is only defined within runtime call. Compare this to general panic handler strategy, which is always valid;

I'd argue that this is perflectly fine and reflects the real state of things. Let's remind ourselves that the runtime, or rather it's native version, is really special. The runtime is written with an assumption that is has some sort of ambient state around it. In wasm runtime we have wasm_executor. For the native API we have this: note all of this ext::with and also the with_externalities. This implies that the runtime is inherently scoped.

Compare this to general panic handler strategy, which is always valid;

Right, although it doesn't seem to me as an advantage: it allows code that overrides overriden modes which would be perfectly valid and there is no way to check that. OTOH, the explicitly scoped approach would allow you to actually check if the closure was called outside the runtime call scope to catch this.

If you bought on the advantages of the approach we can discuss the details : )

@Demi-Marie
Copy link
Contributor

I think the real problem here is that we are using panics for exception handling, which they are’t designed for. That brings in all of exception handling’s problems.

A better idea, in my opinion, is to wrap a catch_unwind around exactly the blocks of code where we do expect to panic, and no others.

@Demi-Marie Demi-Marie self-requested a review September 4, 2019 00:34
@svyatonik
Copy link
Contributor Author

@pepyakin Nope - my comment is not about the snippet, it is mostly about number of hacks we'll need to introduce to 'centralize' panic handling. IMO instead this makes panic handling spread all over the codebase (by introducing additional strange global vars, by preserving panic' artifacts for processing elsewhere) + introducing additional error-prone assumptions (like marking everything AssertUnwindSafe) + by changing implementations. Instead of aborting as soon, as we see the panic, we're trying to introduce (setjmp+longjmp+static)-type exception handling just to...abort.

That said, I don't think that 'centralized' panic handling worth implementing that structure you've suggested + I'm not seeing how this makes panic handling actually centralized. I do not want to continue arguing, though - I think we both have had enough arguments here. I can fix the issue once the 'centralized handling' is ready, but I'm not going to sign it off, or be a co-author. Feel free to close this PR once there's issue/other PR that suggests/introduces that new panic handling mechanism.

@pepyakin
Copy link
Contributor

pepyakin commented Sep 4, 2019

While I still think that the centralized approach is worth to explore, I am not 100% sure that it will be fruitful. Failing to convience you in advantages of this didn't give me more confidence.

Since I am completely agree that we spent on this enough time, I wouldn't block merging this PR and not going to work towards exploring my idea unless someone else comes in and either supports it or has a better idea. (I also think that this hasn't had enough attention from the rest of the team, but oh well).

Anyway, the code looks good in that it implements the stated approach properly.

Copy link
Member

@gavofyork gavofyork left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally ok, couple of nits.

Copy link
Contributor

@Demi-Marie Demi-Marie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@svyatonik svyatonik added A3-needsresolving and removed A0-please_review Pull request needs code review. labels Sep 11, 2019
@svyatonik svyatonik merged commit 12896e9 into master Sep 11, 2019
@svyatonik svyatonik deleted the do_not_panic_when_checking_exec_proof branch September 11, 2019 07:27
andresilva pushed a commit that referenced this pull request Sep 17, 2019
* do not panic during execution proof check

* Update core/state-machine/src/lib.rs

Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>

* Update core/state-machine/src/lib.rs

Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>

* Update core/state-machine/src/lib.rs

Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>

* BackendTrustLevel enum

* up runtime version

* Update core/state-machine/src/lib.rs

Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>

* Update core/state-machine/src/lib.rs

Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>

* fixed some grumbles

* Update core/state-machine/src/testing.rs

Co-Authored-By: Gavin Wood <gavin@parity.io>

* Update core/state-machine/src/lib.rs

Co-Authored-By: Gavin Wood <gavin@parity.io>

* mov where

* spaces -> tabs (to restart build)
Demi-Marie pushed a commit to Demi-Marie/substrate that referenced this pull request Sep 17, 2019
* do not panic during execution proof check

* Update core/state-machine/src/lib.rs

Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>

* Update core/state-machine/src/lib.rs

Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>

* Update core/state-machine/src/lib.rs

Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>

* BackendTrustLevel enum

* up runtime version

* Update core/state-machine/src/lib.rs

Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>

* Update core/state-machine/src/lib.rs

Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>

* fixed some grumbles

* Update core/state-machine/src/testing.rs

Co-Authored-By: Gavin Wood <gavin@parity.io>

* Update core/state-machine/src/lib.rs

Co-Authored-By: Gavin Wood <gavin@parity.io>

* mov where

* spaces -> tabs (to restart build)
en pushed a commit to en/substrate that referenced this pull request Sep 24, 2019
* do not panic during execution proof check

* Update core/state-machine/src/lib.rs

Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>

* Update core/state-machine/src/lib.rs

Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>

* Update core/state-machine/src/lib.rs

Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>

* BackendTrustLevel enum

* up runtime version

* Update core/state-machine/src/lib.rs

Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>

* Update core/state-machine/src/lib.rs

Co-Authored-By: DemiMarie-parity <48690212+DemiMarie-parity@users.noreply.github.com>

* fixed some grumbles

* Update core/state-machine/src/testing.rs

Co-Authored-By: Gavin Wood <gavin@parity.io>

* Update core/state-machine/src/lib.rs

Co-Authored-By: Gavin Wood <gavin@parity.io>

* mov where

* spaces -> tabs (to restart build)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Light client may panic if invalid execution proof is provided
4 participants