Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch programs activation to whole-set based gating #11750

Merged
merged 23 commits into from
Aug 25, 2020

Conversation

jackcmay
Copy link
Contributor

@jackcmay jackcmay commented Aug 20, 2020

Problem

get_entered_epoch_callback isn't called when restoring from snapshots, which is too confusing and error-prone.
Also, we can't simply call it immediately after snapshot restoration because it expects to be called exactly once at each epoch boundary

get_programs and get_builtins returns delta set. i.e. add these new additions of programs to the current available set at the given epoch. This works nicely in the ideal world, where we're running the validator since genesis without ever restarting a perfect bug-free validator.

In reality, we must rely on snapshots 99.999% of time. When restoring from snapshots, the delta set doesn't work quite: we don't persist the current available set (namely bank.message_processor is effectively serde(skip)).

Summary of Changes

So, just reflect the reality by making these functions snapshot-friendly by returning whole-set of available programs at the given epoch. And make it callable from finish_init(), which is called after snapshot restoration.

Also, fix a bunch of other dangerous code along the way.

Also, this is intended to be back-port friendly; so the fix is intentionally not exhaustive. Still, get_entered_epoch_callback is a bit error-prone. Specifically, it must be idempotent. (We could solve this by artificially introducing some intermediate struct like ScheduledBankFeatures or the like instead of mind-opening way of passing &mut Bank).

Deprecates #11736

@jackcmay
Copy link
Contributor Author

jackcmay commented Aug 20, 2020

@ryoqun I have some concerns with these changes, at every bank we are checking for the existence of the account and calling message_processor to override the program's process_instruction. We can probably at least separate the program activation mechanism from the generalized bank APIs to add built-ins.

Maybe genesis programs can have a list of (program, activation_epoch)s that is used by the epoch boundary callback to check if current_epoch == activation_epoch but provides a different callback for finish_init that does 'if current_epoch >= activation_epoch` in order to populate with all the activated programs.

Also, is there any reason not to make the hot fix permanent, seems the bank is the best place for that logic?

And, any reason to keep inflation in genesis-programs? Maybe we should also move that to runtime like the builtin programs are in runtime/src/builtins.rs?

@codecov
Copy link

codecov bot commented Aug 20, 2020

Codecov Report

Merging #11750 into master will increase coverage by 0.0%.
The diff coverage is 86.8%.

@@           Coverage Diff            @@
##           master   #11750    +/-   ##
========================================
  Coverage    82.0%    82.0%            
========================================
  Files         330      330            
  Lines       77174    77464   +290     
========================================
+ Hits        63334    63589   +255     
- Misses      13840    13875    +35     

Comment on lines 564 to 569
if parent.epoch() < new.epoch() {
new.refresh_programs_and_inflation();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here

@ryoqun
Copy link
Member

ryoqun commented Aug 21, 2020

@ryoqun I have some concerns with these changes, at every bank we are checking for the existence of the account and calling message_processor to override the program's process_instruction.

I don't think we're doing at every bank. I think we're doing the checks only when crossing the epoch and when restoring from snapshots: https://github.com/solana-labs/solana/pull/11750/files#r474369115 and notice that finish_init isn't called from Bank::new_from_parent.

@ryoqun
Copy link
Member

ryoqun commented Aug 21, 2020

And, any reason to keep inflation in genesis-programs? Maybe we should also move that to runtime like the builtin programs are in runtime/src/builtins.rs?

Yeah, I'll move it as well. it's a bit odd for inflation to reside at the current place...

@ryoqun
Copy link
Member

ryoqun commented Aug 21, 2020

Maybe genesis programs can have a list of (program, activation_epoch)s that is used by the epoch boundary callback to check if current_epoch == activation_epoch but provides a different callback for finish_init that does 'if current_epoch >= activation_epoch` in order to populate with all the activated programs.

I think this approach will work too. But I originally avoided to do so because it'd make the activation mechanism a bit less flexible. On the other hand, this would lift the idempotent requirement which is currently required under refresh_programs_and_inflation. Anyway, I'll consider the approach a bit.

Also, note that I had this in mind: the most important objective here is make it super-simple to do add/tweak gating logic to avoid similar bugs in the foreseeable future.

FYI, we're doing the current checks at epoch boundaries already: #11750 (comment)

@jackcmay
Copy link
Contributor Author

@ryoqun I have some concerns with these changes, at every bank we are checking for the existence of the account and calling message_processor to override the program's process_instruction.

I don't think we're doing at every bank. I think we're doing the checks only when crossing the epoch and when restoring from snapshots: https://github.com/solana-labs/solana/pull/11750/files#r474369115 and notice that finish_init isn't called from Bank::new_from_parent.

Yeah, sorry I didn't mean every bank, every epoch, but at every epoch we are re-adding these programs, kinda weird

@jackcmay
Copy link
Contributor Author

Can you elaborate on how the (program, activation_epoch) would be less flexible?

@ryoqun
Copy link
Member

ryoqun commented Aug 21, 2020

Yeah, sorry I didn't mean every bank, every epoch, but at every epoch we are re-adding these programs, kinda weird

Yeah, I want to avoid the weird behavior as well, too. I originally thought that the behavior wouldn't occur because Bank::add_builtin_loader(Bank::add_builtin) ensures not to re-add accounts. But, I noticed that Bank::add_native_program does unconditionally add accounts...

I wonder why there is subtle behavior differences between the two similar functions. I'll just align Bank::add_native_program to Bank::add_builtin_loader.

Can you elaborate on how the (program, activation_epoch) would be less flexible?

It's less flexible in that you can't specify arbitrary condition like with closures. So, say we can't conditionally trigger new programs with bank.capitalization() like that. But, it's unlikely we'd ever need that flexibility.

Anyway, thanks for the inputs. Because now I have some more time, I'll put more changes to fix them more properly.

*self.entered_epoch_callback.write().unwrap() = Some(entered_epoch_callback);
self.apply_feature_activations();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, this is needed in addition to call at finish_init.

new.update_epoch_stakes(leader_schedule_epoch);
new.ancestors.insert(new.slot(), 0);
new.parents().iter().enumerate().for_each(|(i, p)| {
new.ancestors.insert(p.slot(), i + 1);
});
if parent.epoch() < new.epoch() {
Copy link
Member

@ryoqun ryoqun Aug 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is moved because of use of get_account inside there.

@@ -175,6 +175,20 @@ pub type ProcessInstruction = fn(&Pubkey, &[KeyedAccount], &[u8]) -> Result<(),
pub type ProcessInstructionWithContext =
fn(&Pubkey, &[KeyedAccount], &[u8], &mut dyn InvokeContext) -> Result<(), InstructionError>;

// These are just type aliases for work around of Debug-ing above function pointers
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean these are temporary?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I hope these are temporary. But, it'll take some time (a year?) considering the duration the upstream bug has been opening: rust-lang/rust#50280

If these are concerning, I'll try another workaround or remove workaround at all. I just want to dbg! some types containing these for quick debugging. This work-around isn't a must; just nice-to-have.

@ryoqun
Copy link
Member

ryoqun commented Aug 21, 2020

Well, the current (EDIT: base commit of this pr) code doesn't populate message_processor.loaders at all if restored from snapshots... I wonder why the current cluster is working..

master when booted from genesis (without snapshot) => 2 bpf loaders:

[runtime/src/bank.rs:587] &new.message_processor = MessageProcessor { programs: ["11111111111111111111111111111111: 0x563895613300", "Config1111111111111111111111111111111111111: 0x563895867a80", "Stake11111111111111111111111111111111111111: 0x5638958224e0", "Vote111111111111111111111111111111111111111: 0x56389584eb10"], loaders: ["BPFLoader2111111111111111111111111111111111: 0x5638952eabd0", "BPFLoader1111111111111111111111111111111111: 0x5638952eabd0"], native_loader: NativeLoader { program_symbol_cache: RwLock { data: {} }, loader_symbol_cache: RwLock { data: {} } }, is_cross_program_supported: true }

master when booted from snapshot (with snapshot) => no loaders:

[runtime/src/bank.rs:587] &new.message_processor = MessageProcessor { programs: ["11111111111111111111111111111111: 0x55dad6dc2300", "Config1111111111111111111111111111111111111: 0x55dad7016a80", "Stake11111111111111111111111111111111111111: 0x55dad6fd14e0", "Vote111111111111111111111111111111111111111: 0x55dad6ffdb10"], loaders: [], native_loader: NativeLoader { program_symbol_cache: RwLock { data: {} }, loader_symbol_cache: RwLock { data: {} } }, is_cross_program_supported: true }

Also, it seems that the message_processor.loaders is always empty regardless of snapshot or no snapshot on v1.2...

@jackcmay
Copy link
Contributor Author

We need more snapshot tests ;-)

@ryoqun
Copy link
Member

ryoqun commented Aug 21, 2020

@jackcmay As for backporting this, if our ultimate goal is to land bpf_loader2 on v1.2, I think we're better off back-porting the prs first, not this one. I just noticed the amount of bpf2 is rather large. Back-porting in git-reversed order (= start to backport this pr now after merging) would be quite pain for this size of changes for me and you. ;)

If we can't backport individually those relevant prs due to known bug (maybe this?), we can create a pr against v1.2 and cherry-pick them in the git-chronological order, which will lessen the degree of merge conflicts.

@ryoqun
Copy link
Member

ryoqun commented Aug 21, 2020

Yeah, I'll move it as well. it's a bit odd for inflation to reside at the current place...

Also, I'll postpone this as a separate pr later. Inflation doesn't change on v1.2. And this will just make this pr (already moderate sized) bigger for no good reason.

self.recheck_cross_program_support();
}

fn recheck_cross_program_support(self: &mut Bank) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why recheck and not just check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not strong reason; just to convey the idempotent connotation. If this sounds odd, I'm completely fine with fn check_....

@ryoqun
Copy link
Member

ryoqun commented Aug 21, 2020

Ugh, I think I've found dos secrutiy... We aren't considering BPFLoader2111111111111111111111111111111111 is pre-occupied by malicious user....:

  • system accounts with some lamports (=> may cause capitalization off)
  • Stake1111111 (=> our cache invalidation is working correctly?)
  • Fake native loader account (=> are we always checking the account.owner before doing dlopen with its account.data?)

@ryoqun
Copy link
Member

ryoqun commented Aug 21, 2020

Anyway, I think we need sysvar-like treatment for existing accounts.

@ryoqun
Copy link
Member

ryoqun commented Aug 21, 2020

Also, it seems that the message_processor.loaders is always empty regardless of snapshot or no snapshot on v1.2...

I'm guessing now message_processor.loaders has been broken for some very long time. Only after #11516, the emptiness problem surfaced on master.

@jackcmay
Copy link
Contributor Author

@ryoqun Can you elaborate more on empty loaders issue you are seeing?

runtime/src/bank.rs Outdated Show resolved Hide resolved
@ryoqun
Copy link
Member

ryoqun commented Aug 21, 2020

@ryoqun Can you elaborate more on empty loaders issue you are seeing?

There are some cases where message_processor.loaders is empty. For example, on v1.2, it's always empty even if we're starting from genesis or snapshot. I think that field should hold bpf_loader at least... But it's working correctly in that validator can process program's transactions. Maybe, dynamic link libbpf_loader.so fallbacks?

For v1.3, I haven't tested.

For HEAD of master at the moment (ie without this pr), message_processor.loaders can be empty when restored from snapshot and causes panic when program's transactions are encountered. When running from genesis, message_processor.loaders isn't empty.

With this pr, message_process.loaders is always with bpf_loader and bpf_loader2. And validator can process program's transactions.

All, can be tested with ./run.sh.

@ryoqun
Copy link
Member

ryoqun commented Aug 21, 2020

oh conflict...

@ryoqun
Copy link
Member

ryoqun commented Aug 21, 2020

Maybe genesis programs can have a list of (program, activation_epoch)s ...

Ok, tests are in place, finally! I'll address this.

if let Some(mut account) = self.get_account(&program_id) {
already_genuine_program_exists = native_loader::check_id(&account.owner);

if !already_genuine_program_exists {
Copy link
Contributor Author

@jackcmay jackcmay Aug 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bring on the fire and brimstone.... what if this account has a very high balance and is obviously worth a lot to someone. Seems like instead or along with we should block known addresses from being added in the first place?

Also, what if someone adds an account and assigns it to the native_loader?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bring on the fire and brimstone....

Hehe, I like thinking hard for corner cases... :)

Also, what if someone adds an account and assigns it to the native_loader?

Yeah, I thought of it. Well, can someone assign accounts to non-existing owner program_id (=NativeLoader1111111111111111111111111111111)?:

ryoqun@ubuqun:~/work/solana/solana$ ./target/release/solana  --url http://api.mainnet-beta.solana.com account BPFLoader1111111111111111111111111111111111

Public Key: BPFLoader1111111111111111111111111111111111
Balance: 0.000000001 SOL
Owner: NativeLoader1111111111111111111111111111111
Executable: true
Rent Epoch: 0
Length: 25 (0x19) bytes
0000:   73 6f 6c 61  6e 61 5f 62  70 66 5f 6c  6f 61 64 65   solana_bpf_loade
0010:   72 5f 70 72  6f 67 72 61  6d                         r_program


ryoqun@ubuqun:~/work/solana/solana$ ./target/release/solana  --url http://api.mainnet-beta.solana.com account NativeLoader1111111111111111111111111111111
Error: AccountNotFound: pubkey=NativeLoader1111111111111111111111111111111

I just blindly thought the original code was covering the case that with mere this condition: native_loader::check_id(&account.owner).

If someone can do the wild, we can add account.executable condition to discern legitimate native loader program from the spoofed native loader program?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if this account has a very high balance and is obviously worth a lot to someone. Seems like instead or along with we should block known addresses from being added in the first place?

;) Yeah, this could be quite a draconian. But, just to be sure, is this really protecting someone from misfortune? Well, people can give lump of sols into wrong account with 1 letter typo... Also, people can send to any of sysvars...

Anyway, I think we should need a preparatory pr to prohibit money transfer to & from some pretty long prefix or suffix? Considering new sysvar or new bpf loaders are rare, this might be a too much. Also, finding nice base58-friendly prefix/postfix seems hard....

Or, I'm also thinking to practice of 2-phased introduction of sysvars/loaders like this:
1(a). Release a patch release which just prohibits any money transfer to certain account at certain epoch.
1(b). Or, just create an account with Placeholder11111 with the executable bit? (might be easiest?)
1(c). Add BPFLoader11111 to frozen account.... (no, this is bad, dos vector)
2. release the acutual meat

self.store_account(program_id, &account);
let mut already_genuine_program_exists = false;
if let Some(mut account) = self.get_account(&program_id) {
already_genuine_program_exists = native_loader::check_id(&account.owner);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does genuine mean here, could just be is_native

Copy link
Member

@ryoqun ryoqun Aug 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. :) Not strongly-opinionated word choice, I wanted to emphasize that an native-loader's accounts can be good or bad. And this is really trying to indicate this variable holds a flag for the good native loader program's account.

I just picked this word from these opposing word pools:

real, legitimate, genuine <=> spoofed/crafted/malicious

@ryoqun
Copy link
Member

ryoqun commented Aug 24, 2020

@jackcmay Thanks for continuous review! I think I've addressed all of review comments. Could you finish off reviewing this so that I can merge this? :)

@ryoqun
Copy link
Member

ryoqun commented Aug 24, 2020

Maybe genesis programs can have a list of (program, activation_epoch)s ...

Ok, tests are in place, finally! I'll address this.

This is done: ebba5bd

genesis-programs/src/lib.rs Outdated Show resolved Hide resolved
@jackcmay
Copy link
Contributor Author

Besides that one last comment about the bank config, this looks great, thanks @ryoqun !

@ryoqun ryoqun merged commit db4bbb3 into master Aug 25, 2020
@ryoqun
Copy link
Member

ryoqun commented Aug 25, 2020

@jackcmay I've just merged this after somewhat extensive local-testing!

@ryoqun ryoqun deleted the fragile-programs-gating branch August 25, 2020 17:07
@ryoqun ryoqun added the v1.3 label Aug 25, 2020
mergify bot pushed a commit that referenced this pull request Aug 25, 2020
* Implement Debug for MessageProcessor

* Switch from delta-based gating to whole-set gating

* Remove dbg!

* Fix clippy

* Clippy

* Add test

* add loader to stable operating mode at proper epoch

* refresh_programs_and_inflation after ancestor setup

* Callback via snapshot; avoid account re-add; Debug

* Fix test

* Fix test and fix the past history

* Make callback management stricter and cleaner

* Fix test

* Test overwrite and frozen for native programs

* Test epoch callback with genesis-programs

* Add assertions for parent bank

* Add tests and some minor cleaning

* Remove unsteady assertion...

* Fix test...

* Fix DOS

* Skip ensuring account by dual (whole/delta) gating

* Fix frozen abi implementation...

* Move compute budget constatnt init back into bank

Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
(cherry picked from commit db4bbb3)

# Conflicts:
#	genesis-programs/src/lib.rs
mergify bot added a commit that referenced this pull request Aug 25, 2020
)

* Switch programs activation to whole-set based gating (#11750)

* Implement Debug for MessageProcessor

* Switch from delta-based gating to whole-set gating

* Remove dbg!

* Fix clippy

* Clippy

* Add test

* add loader to stable operating mode at proper epoch

* refresh_programs_and_inflation after ancestor setup

* Callback via snapshot; avoid account re-add; Debug

* Fix test

* Fix test and fix the past history

* Make callback management stricter and cleaner

* Fix test

* Test overwrite and frozen for native programs

* Test epoch callback with genesis-programs

* Add assertions for parent bank

* Add tests and some minor cleaning

* Remove unsteady assertion...

* Fix test...

* Fix DOS

* Skip ensuring account by dual (whole/delta) gating

* Fix frozen abi implementation...

* Move compute budget constatnt init back into bank

Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
(cherry picked from commit db4bbb3)

# Conflicts:
#	genesis-programs/src/lib.rs

* Fix conflicts

Co-authored-by: Jack May <jack@solana.com>
Co-authored-by: Ryo Onodera <ryoqun@gmail.com>

bank.add_builtin_program("mock_program1", vote_id, mock_ix_processor);
bank.add_builtin_program("mock_program2", stake_id, mock_ix_processor);
assert!(bank.stakes.read().unwrap().vote_accounts().is_empty());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants