This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

merge changes from 2.2.x into develop #10657

Merged 3,349 commits on Aug 30, 2021

@huangminghuang (Contributor) commented Aug 25, 2021

Change Description

This PR merges changes related to state_history_plugin, amqp_trx_plugin, rodeos, and eosio-tester.

  • amqp_trx_plugin don't declare queue 📦 #10618:

    • Update amqp_trx_plugin to not declare queue. Instead it will assume queue exists.
    • Add --amqp-queue-name to cleos to allow user specification of queue instead of hard-coded "trx"
    • EPE-1190, EPE-1192
  • SHiP: Use shared_ptr instead of weak_ptr (explicit and implicit) 📦 #10611 :

    • Fix issue with disconnect of rodeos causing nodeos to core-dump.
    • Fix issue with head->block not valid on a request to nodeos. For example, a producer node that is not producing.
  • rodeos: Add support for stream_wrapper_v1 with std::string route 📦 #10606 :

    • Add new stream_wrapper_v1 with std::string route which removes the restriction to use eosio::name for routing keys.
  • Add recover_key support to eosio-tester #10598 :
    Add support for recover_key in eosio-tester. Requires eosio.cdt#1185 (Include recover_key into tester_intrinsics).

  • Rodeos filter current_time() 📦 #10582 :

    • Rodeos previously supported the current_time() host function only for query wasms. This PR adds support for current_time() in filter wasms.
  • Rename receiver to make code more readable - develop-boxed #10575 :

    • Remove unused receiver and rename used receiver so it is clearer what it represents. No change in functionality just makes the code a little easier to follow.
  • Update to the force-write-stride to 100,000 #10561 :
    Update force-write-stride to 100,000.

  • Use tested rocksdb_options.ini for sample file and add missing max_wr… #10557 :
    Publish the tested options file as the sample file. Add max_write_buffer_number if the options file is not present.

  • SHiP post send_update to main thread to avoid overwhelming the main thread - develop-boxed #10550 :

    • When many rodeos instances were connected, they could overwhelm nodeos. Post send_update to the main thread at medium priority so that other tasks of medium or higher priority have an opportunity to run.
  • Stop rodeos if unable to listen on wql-listen #10549 :

    • Do not let rodeos continue if unable to listen on the provided wql-listen address.
  • Init force_write_stride to 1  #10548 :
    Initialize force_write_stride to 1. It is used as a modulus, so a value of 0 would cause a divide by zero.

  • disable undo_stack by default, set tested typical rocksdb configurati… #10547 :
    For user convenience and to reduce the chance of mistakes, undo_stack is disabled by default,
    and --enable-undo-stack is required to enable it. If an options file is not provided, typical RocksDB configuration
    parameters are used. If --rdb-threads and --rdb-max-files are used, rodeos will not start and a detailed warning is given.

  • Make undo_stack configurable and add quit function for exception hand… #10529 :
    Add --disable-undo-stack to the rodeos startup options to disable undo_stack handling.

    Gracefully shut down rodeos when a fork happens if undo stack is not enabled.

  • Added force-write-stride option in rodeos to control manual rocksdb flush frequency #10508 :
    Details of force-write-stride option in rodeos: Maximum number of blocks to process before forcing rocksdb to flush. This option is primarily useful to control re-sync durations under disaster recovery scenarios (when rodeos has unexpectedly exited). This option ensures blocks stored in rocksdb are at most force-write-stride blocks behind the current head block being processed by rodeos. However, saving too frequently may affect performance. It is likely that rocksdb itself will save rodeos data more frequently than this setting by flushing memtables to disk, based on various rocksdb options. In contrast, when rodeos exits normally, it saves the last block processed by rodeos into rocksdb and will continue processing new blocks from that last processed block number when it next starts up.
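
    The stride check described above, together with the divide-by-zero concern from #10548, can be sketched as follows. This is a minimal illustration and not the rodeos implementation: should_force_flush and blocks_since_forced_flush are hypothetical names, and the sketch assumes forced flushes happen exactly on stride boundaries.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: is a forced RocksDB flush due after processing
// `block_num`? A stride of 0 would make the modulus divide by zero, so it
// is rejected up front (the PR list notes the stride is initialized to 1).
inline bool should_force_flush(std::uint32_t block_num, std::uint32_t force_write_stride) {
   assert(force_write_stride != 0);
   return block_num % force_write_stride == 0;
}

// Hypothetical helper: number of blocks processed since the last forced
// flush, i.e. how far the forced-flush point may lag behind the head block.
// The result is always smaller than the stride.
inline std::uint32_t blocks_since_forced_flush(std::uint32_t block_num, std::uint32_t force_write_stride) {
   assert(force_write_stride != 0);
   return block_num % force_write_stride;
}
```

    With a stride of 100,000, block 200,000 would trigger a forced flush under this model, while block 100,250 would be 250 blocks past the last forced flush.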

  • Add options file configuration to RocksDB and adjust default configur… #10503 :
    Add RocksDB options file support and adjust default configurations. Options file is specified by command line argument --rdb-options-file. Please see Documentation Additions below for details.

  • Fix memory leak (24GB per second) when rodeos starts with --filter-wa… #10519 :
    When rodeos' ship_client fails to connect to nodeos (for example, when nodeos is not running), ship_client calls cloner_session's close() with retry set to true. This triggers cloner_plugin to recreate cloner_session every second until a connection is established. cloner_session creates a rodeos_filter object if rodeos starts with the --filter-wasm option. Although the destructor of rodeos_filter is called before recreation, its memory is not completely freed. The root cause is that the filter_state in rodeos_filter has a member eosio::vm::wasm_allocator wa, which allocates VM memory when it is initialized. This memory must be released by an explicit free().

    The same problem in thread_state is fixed by this PR as well.
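
    The leak pattern and its fix can be illustrated with a small stand-in. Everything below is hypothetical: fake_wasm_allocator only mimics an allocator whose destructor does not release its memory, and filter_state_sketch is not the real filter_state. The point is only that the owning object's destructor calls free() explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Counts regions that are currently allocated (for illustration only).
static int live_regions = 0;

// Stand-in allocator: its destructor does NOT release the region, so the
// owner must call free() explicitly. Not the real eosio::vm::wasm_allocator.
struct fake_wasm_allocator {
   void* region = nullptr;
   explicit fake_wasm_allocator(std::size_t bytes) : region(std::malloc(bytes)) {
      assert(region != nullptr); // the sketch assumes the allocation succeeds
      ++live_regions;
   }
   void free() {
      if (region) {
         std::free(region);
         region = nullptr;
         --live_regions;
      }
   }
};

// Hypothetical owner: free() runs every time the state object is destroyed,
// for example each time a session is torn down and recreated.
struct filter_state_sketch {
   fake_wasm_allocator wa;
   explicit filter_state_sketch(std::size_t bytes) : wa(bytes) {}
   ~filter_state_sketch() { wa.free(); }
   filter_state_sketch(const filter_state_sketch&) = delete;
   filter_state_sketch& operator=(const filter_state_sketch&) = delete;
};
```

    Without the explicit call in the destructor, each recreation in this model would leave one more region allocated.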

  • Add recover_key & assert_recover_key to rodeos #10456 :

    • Add host functions recover_key & assert_recover_key to rodeos for use in queries
  • Passthrough exit code of WASM start function as exit code of eosio-tester #10382 :
    The eosio-tester program would only return an exit code of 1 (to indicate a problem) if it did not attempt to execute the test WASM (e.g. due to running help or a problem parsing arguments) or if it did attempt to execute the test WASM but failed to return successfully because of some exception (including exceptions caused by executing eosio_assert_message). Otherwise, it would return an exit code of 0 (indicating success).

    This works fine for test cases that would check for failures using (directly or indirectly) eosio_assert_message, for example: calling a function which calls eosio::check and the condition fails; or using REQUIRE in Catch2. But in other situations this would mask failures. For example, when using CHECK in Catch2, the errors are printed out and then ultimately the failure of the test run is indicated through the exit code returned by the main function. So in such failures it would appear looking at only the exit code of running eosio-tester (which is what ctest does) that the tests all passed.

    This PR checks to see if a return value exists for the start function that is called when the test WASM is executed. If so, it assumes it is a signed 32-bit integer, extracts that value and returns that as the exit code. If it doesn't exist, it will continue with the current behavior of returning 0.

    Note to fully take advantage of this change, the test WASMs must be compiled with a start function that passes through the return value of main. For that change, see: Passthrough return of main within start function of tester WASMs. eosio.cdt#1101.
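
    The exit-code rule described for this change can be modeled in a few lines. This is a sketch under stated assumptions, not the eosio-tester source: exit_code_from_start is a hypothetical name, and the optional argument stands for "the start function returned a value".

```cpp
#include <cstdint>
#include <optional>

// Hypothetical model: `start_result` holds the signed 32-bit value returned
// by the WASM start function, or nothing if the function returns no value.
inline int exit_code_from_start(std::optional<std::int32_t> start_result) {
   if (!start_result)
      return 0;          // no return value: keep the previous behavior
   return *start_result; // pass the value through as the exit code
}
```

    Under this model a test WASM whose main returns 1 after a failed Catch2 CHECK would make the process exit with 1, which is what ctest inspects.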

  • Rodeos timeout fix #10252 :

    • Remove expires_after calls on connections.
    • Explicitly close connection on read/write error including timeout.
    • Ignore existing wql-idle-timeout option as it does not work as intended (see below)

    boost’s new expires_after timer(s) is a terrible interface if you ask me. rodeos was killing an active connection after the default 30 second timer because the reset of the timer was only resetting one of the internal timers.

    Here is the code for boost expires_after

    if(! impl_->read.pending)
        BOOST_VERIFY(
            impl_->read.timer.expires_after(
                expiry_time) == 0);
    if(! impl_->write.pending)
        BOOST_VERIFY(
            impl_->write.timer.expires_after(
                expiry_time) == 0);
    

    and the comment from the code

    /** Set the timeout for the next logical operation.
        This sets either the read timer, the write timer, or
        both timers to expire after the specified amount of time
        has elapsed. If a timer expires when the corresponding
        asynchronous operation is outstanding, the stream will be
        closed and any outstanding operations will complete with the
        error @ref beast::error::timeout. Otherwise, if the timer
        expires while no operations are outstanding, and the expiraton
        is not set again, the next operation will time out immediately.
        The timer applies collectively to any asynchronous reads
        or writes initiated after the expiration is set, until the
        expiration is set again. A call to @ref async_connect
        counts as both a read and a write.
        @param expiry_time The amount of time after which a logical
        operation should be considered timed out.
    */
    

    So there are TWO timers and if you are not VERY careful it will only set ONE of them.

    In our test case, only one was reset. The other timer would fire and kill the connection.
    Unless I’m missing something, the boost example code also has the same issue.
    https://www.boost.org/doc/libs/1_75_0/libs/beast/example/http/server/async/http_server_async.cpp

    If there is an outstanding (pending) read or write then the associated timer is not touched.
    I think the way forward here is to implement our own timers for idle timeout and read/write operations. This will be worked under a different PR.
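
    The behavior described above can be reproduced with a toy model of the quoted logic. This is not Boost.Beast code; stream_sketch and op_state_sketch are hypothetical types that only mirror the "skip the timer if an operation is pending" rule.

```cpp
#include <chrono>

// Simplified per-direction state: whether an operation is outstanding and
// how much time is left on that direction's timer.
struct op_state_sketch {
   bool pending = false;
   std::chrono::seconds remaining{0};
};

struct stream_sketch {
   op_state_sketch read;
   op_state_sketch write;

   // Mirrors the quoted logic: a timer is reset only when no operation of
   // that kind is currently outstanding.
   void expires_after(std::chrono::seconds expiry_time) {
      if (!read.pending)
         read.remaining = expiry_time;
      if (!write.pending)
         write.remaining = expiry_time;
   }
};
```

    In this model, calling expires_after while a read is pending refreshes only the write timer, so the read timer keeps counting down from its earlier value and can still fire on an otherwise active connection.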

  • Allow rodeos to recover from kill-9 #10164 :

    • If rodeos crashes (kill -9) then the eosvmoc code cache is invalid. If unable to open code cache remove the cache and try again. This allows rodeos to recover from a kill -9.
    • Enhance the existing rodeos test to add a kill -9 to rodeos and verify it recovers correctly.
    • Removed option -max-retained-history-files since it is currently broken
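
    The "remove the cache and try again" recovery can be sketched as below. This is an illustration only: open_cache_with_recovery is a hypothetical name, and the open_cache callback stands in for whatever actually opens the eosvmoc code cache.

```cpp
#include <filesystem>
#include <functional>
#include <system_error>

// Hypothetical recovery routine. `open_cache` returns false (or throws)
// when the cache file is unusable, for example after a kill -9.
inline bool open_cache_with_recovery(
      const std::filesystem::path& cache_file,
      const std::function<bool(const std::filesystem::path&)>& open_cache) {
   auto try_open = [&]() -> bool {
      try {
         return open_cache(cache_file);
      } catch (...) {
         return false;
      }
   };
   if (try_open())
      return true;
   // First attempt failed: discard the possibly invalid cache, retry once.
   std::error_code ec;
   std::filesystem::remove(cache_file, ec);
   return try_open();
}
```

    A single retry keeps the failure mode simple: if the second attempt also fails, the caller sees false and can report the error instead of looping.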
  • Add new get/set param methods #10097 :

    • Add missing get/set param methods.
  • Add get_raw_abi to rodeos #10083 :
    cleos needs this to send transactions to rodeos.

  • add compile time option to enforce a chain id 📦 #10049 :
    SHIP clients often do not perform any validation of the blocks they are receiving. For high assurance deployments, SHIP clients need complete confidence that they are receiving fully validated blocks. #9721 (compile time option to disable light validation options 📦) added a compile time option to disable options in nodeos that disable full validation. With proper deployment permissions, #9963 (unix socket support for state history endpoint 📦) can make it easier to prove that the SHIP client is connecting to the SHIP endpoint it expects (a fully validating, trustworthy nodeos). Still, the SHIP client doesn't actually have proof that a fully validating nodeos is following the expected "real" chain (consider a malicious actor starting a fully trustworthy nodeos with a malicious genesis.json and then feeding that rogue chain with actions on accounts the SHIP client is looking for).

    This adds a compile time option -DEOSIO_REQUIRE_CHAIN_ID= where nodeos will only operate with a given chain id. An additional tool is added, eosio-chainid, which will print the chain id of a given genesis json which may make it easier to determine the proper chain id before building. I'm not sure the additional tool is worth it over just starting nodeos with a genesis json and doing a get info?

    I have no idea what to do about unit tests when this option is enabled.

  • amqp_trace_plugin SHiP protocol for transaction_trace #10057 :

    • Use stable SHiP protocol for transaction_trace
    • Move eosio-tester type conversion to state_history library
  • Update to latest eos-vm on eosio-boxed #10024 :

    • Pull in latest of eos-vm
  • unix socket support for rodeos ship connection & httpql server 📦 #10001 :
    Add support for rodeos connecting to state history endpoint via unix socket (as exposed via unix socket support for state history endpoint 📦 #9963). Also add support for the HTTP QL server to listen on a unix socket. Like http-server-address and state-history-endpoint, wql-listen has been modified such that if it's configured with the empty string the TCP endpoint is completely disabled.

  • unix socket support for state history endpoint 📦 #9963 :
    Oftentimes clients consuming from nodeos' state history endpoint fully trust the data received from this endpoint. It certainly doesn't have to be this way -- a client connected to the state history endpoint could perform light validation, for example. But oftentimes that is not the case; even rodeos fully trusts the state history endpoint it connects to.

    This change adds a new option to state_history_plugin to expose its endpoint over a unix socket in addition to or in lieu of the existing TCP endpoint. In some secure environments this allows tighter coupling of nodeos and rodeos (or other ship consumers) to enforce that the state history client always connects to the expected (trustworthy) nodeos.

    The state-history-endpoint option has been changed to behave similarly to http-server-address: if state-history-endpoint is configured as an empty string, the TCP endpoint is fully disabled.

  • rodeos: allow queries to use sha functions #9708 :
    rodeos: allow queries to use sha functions

  • rodeos filters: print_time_us intrinsic #9307 :
    Add print_time_us intrinsic to rodeos filters.

  • std::promise use #9331 :

    • Get future before promise use to avoid race on set_value & get_future
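
    The ordering this fix describes can be shown with a minimal sketch. compute_on_worker is a hypothetical function, not code from the PR; it only demonstrates taking the future before the promise is handed to another thread, so get_future can never run concurrently with set_value.

```cpp
#include <future>
#include <thread>
#include <utility>

inline int compute_on_worker() {
   std::promise<int> p;
   // Take the future first, while this thread is the only user of `p`.
   std::future<int> f = p.get_future();
   // Only now give the promise to the worker, which fulfills it.
   std::thread worker([pr = std::move(p)]() mutable { pr.set_value(42); });
   int result = f.get(); // blocks until the worker calls set_value
   worker.join();
   return result;
}
```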
  • ack amqp input trx after execution #9303 :

    • Instead of immediately acking an AMQP transaction in amqp_trx_plugin, wait until the transaction has been executed.
    • This will allow AMQP (RabbitMQ) to monitor transaction processing
  • rodeos: checkpoint support #9234 :
    rodeos can now generate checkpoints (live backup of database) using /v1/rodeos/create_checkpoint if --wql-checkpoint-dir is enabled.

  • Rodeos wasm limits #9241 :
    Make rodeos query wasm memory limit configurable

  • rodeos: improve mmap failure prevention and reporting #9205 :
    It appears that rodeos running on a certain k8s cluster is having trouble using mmap. This change:

    • Increases diagnostics to help narrow in on the problem
    • Strengthens the wasm mmap cache so it doesn't dispose (and later have to recreate) regions when other errors occur.
    • Preallocates wasm mmaps ahead of time
  • Start consuming again when connection reestablished to AMQP 📦 #10615 :
    #10541 (Don't consume amqp messages until unpaused) added support for starting up nodeos without consuming from the AMQP queue. However, that change does not start consuming again when the AMQP connection is lost and re-established. This PR adds back the start of consuming when the AMQP connection is re-established.

  • Don't consume amqp messages until unpaused #10541 :
    Don't consume amqp messages until unpaused

  • Add support for specifying reply-to to amqp message 📦 #10412 :

    • Add support to AMQP to specify reply-to.
    • Add amqp-reply-to option to cleos
  • Add block_uuid to transaction trace 📦 #10371 :
    amqp_trx_plugin

    • Add block_uuid to transaction trace message
    • Add block_uuid_message for signaling correlation between block_uuid and block_id.
      • Sent only to routing keys specified by the AMQP header "block-uuid-msg". Add an AMQP header of "block-uuid-msg" with a string value of the routing key to which the block_uuid_message should be sent. If not specified then no block_uuid_message will be sent.
    • Add ABI for amqp_trx_plugin messages. SHiP protocol with addition of amqp_trace_plugin types
  • amqp_trx_plugin ack immediately on trx failure #10334 :

    • The immediate ack on trx failure was only ack'ing on exception but not on a trace with an exception. Now amqp_trx_plugin will ack on trx failure for all failed transactions.
    • Also moved the sending of the trace to after the ack, although no guarantee is provided that the ack will actually happen first, as it is async.
  • amqp_trx_plugin correlation_id #10316 :
    Set correlation_id of the amqp trace message to correspond to the correlation_id of the incoming amqp trx message. For amqp_trace_plugin, amqp trace messages do not set correlation_id since they do not correspond to an incoming amqp message. This matches the intended AMQP use case and clients' expected behavior. Now clients can set the correlation id to whatever they desire, which of course could include the transaction id.

  • Rodeos per block ack #10162 :

    • Wait on ack from AMQP before continuing on to processing the next block.
  • Separate amqp_trx_plugin from amqp_trace_plugin #10141 :

    • amqp_trx_plugin no longer depends on amqp_trace_plugin for sending transaction traces to AMQP. amqp_trace_plugin still exists and provides the ability to send all transaction traces to AMQP. amqp_trx_plugin is set up to only send transaction traces for transactions that it processes.
    • The AMQP reply-to-queue pattern is now used by amqp_trx_plugin. If a reply-to is sent with the transaction then the trace will be sent to the AMQP default exchange connection with that routing-key. The correlation-id is set to the transaction id. If a reply-to is not specified then no trace is sent to AMQP.
  • Reliable amqp_trx_plugin & amqp_trace_plugin #10127 :
    Add reliability options to amqp_trx_plugin & amqp_trace_plugin

    • amqp_trace_plugin now has configurable reliability via amqp-trace-reliable-mode
      • Reliable mode 'exit', exit application on any AMQP publish error.
      • Reliable mode 'queue', queue transaction traces to send to AMQP on connection establishment.
      • Reliable mode 'log', log an error and drop message when unable to directly publish to AMQP.
    • amqp_trace_plugin now has configurable queue via amqp-trace-queue-name option.
    • amqp_trx_plugin now has configurable queue via amqp-trx-queue-name option.
    • Added correlation_id and error callback support to reliable_amqp_publisher.
    • reliable_amqp_publisher now calls the on_fatal_error callback instead of dropping messages; in all cases that callback now calls app().quit().
    • Modified amqp_handler to use single_channel_retrying_amqp_connection to provide reliability and to move away from unsupported AMQP::LibBoostAsioHandler.

    Note: A follow-on PR is planned that will separate amqp_trace_plugin from amqp_trx_plugin.
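
    The three reliable modes listed above can be summarized as a small mapping from mode to the action taken when a publish fails. This is a sketch only: the enum and function names are hypothetical, and only the mode strings 'exit', 'queue', and 'log' come from the description.

```cpp
#include <optional>
#include <string_view>

enum class reliable_mode_sketch { exit_mode, queue_mode, log_mode };

// What happens to a message that cannot be published right away.
enum class publish_failure_action { quit_application, hold_until_connected, log_and_drop };

// Parse the value given to amqp-trace-reliable-mode; unknown values yield
// an empty optional so the caller can reject the configuration.
inline std::optional<reliable_mode_sketch> parse_reliable_mode(std::string_view s) {
   if (s == "exit")  return reliable_mode_sketch::exit_mode;
   if (s == "queue") return reliable_mode_sketch::queue_mode;
   if (s == "log")   return reliable_mode_sketch::log_mode;
   return std::nullopt;
}

inline publish_failure_action on_publish_failure(reliable_mode_sketch mode) {
   switch (mode) {
      case reliable_mode_sketch::exit_mode:  return publish_failure_action::quit_application;
      case reliable_mode_sketch::queue_mode: return publish_failure_action::hold_until_connected;
      case reliable_mode_sketch::log_mode:   return publish_failure_action::log_and_drop;
   }
   return publish_failure_action::log_and_drop; // not reached for valid enum values
}
```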

  • Speculative validation optimizations #9594 :

    • develop-boxed version of Speculative validation optimizations #9593
      • Has the additional line for execute_incoming_transaction calling restart_speculative_block for the amqp_trx_plugin not in develop.
    • unapplied_transaction_queue used to have a process_mode where it would not store aborted or forked-out transactions when it knew that the node would never produce a block. This was an optimization because there seemed to be no reason to cache aborted or forked-out transactions. Since the trx has been validated locally and will not be validated again, why keep it around? If you are producing you need to keep it so on a fork-switch the trxs are not lost. But there seemed to be no reason to keep them for speculative mode. However, the unapplied_transaction_queue is also used when validating a block to look up already validated and constructed transaction_metadata. Keeping the already validated trxs in the queue allows for faster validation of received blocks for all the transactions already received by the node.
    • Before this change, when a speculative node's speculative block ran out of cpu or net, it would stop validating transactions until it received a block -- which would cause it to start a new speculative block after processing the received block. With this PR, a new speculative block is created anytime a block is exhausted during speculative transaction processing.
    • Before this change speculative 12th blocks would use the last-block-time-offset-us which doesn't really make any sense since the block is only speculative and will not be broadcast on the network. With this PR, all speculative blocks use the produce-time-offset-us including the 12th block.
  • Return failure traces when running with amqp_trx_plugin #9872 :

    • Return full transaction trace on failure instead of just the exception when running with amqp_trx_plugin
  • retrying_amqp_connection: commonize more amqp connection management stuff #9294 :
    Getting a bit tired of copy pasting around some amqp connection management stuff in a few places. retrying_amqp_connection tries to solve the typical use case of needing a single channel connected to an AMQP server that is retried on failure. Compared to the code that was in reliable_amqp_publisher previously, a few functional differences are:

    • Doesn’t use AMQP::LibBoostAsioHandler; AMQP-CPP upstream is very clear that this isn’t supported
    • Doesn’t rely on AMQP’s TCP module; that would break a win32 build
    • Retries the connection/channel every second instead of a back off on consecutive failures
    • Doesn’t print the AMQP password to logs (whoops)

    But there will likely be future improvements easier to add this way too, like TLS support with peer auth for example.

  • Using default logger in amqp_trx_plugin #9256 :

    • The new amqp_trx logger is no longer used as the amqp_trx_plugin & amqp_trace_plugin use the default logger.
  • amqp_trx_plugin - transaction in, transaction_trace out #9181 :
    New plugins: amqp_trx_plugin & amqp_trace_plugin

    • amqp_trx_plugin
      • Consumes from AMQP 'trx' queue for speculative execution
      • If amqp_trace_plugin is enabled then errors are reported via the amqp_trace_plugin. Errors include: too many trx in progress and signature issues.

    Binary fc::pack format:

    // consume message types
    using transaction_msg = fc::static_variant<chain::packed_transaction_v0, chain::packed_transaction>;
    
    • amqp_trace_plugin
      • Publishes transaction trace results to 'trace' queue with CorrelationID set to transaction id.

    Binary fc::pack format:

    // publish message types
    struct transaction_trace_exception {
       int64_t error_code; ///< fc::exception code()
       std::string error_message;
    };
    using transaction_trace_msg = fc::static_variant<transaction_trace_exception, chain::transaction_trace>;
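
    A consumer of the published message has to branch on which alternative it received. The sketch below shows that dispatch using std::variant as a stand-in for fc::static_variant; trace_exception_sketch, trace_sketch, and describe are hypothetical names, and trace_sketch is only a placeholder for chain::transaction_trace.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>
#include <variant>

struct trace_exception_sketch {
   std::int64_t error_code;
   std::string  error_message;
};
struct trace_sketch {
   std::string id; // placeholder field, not the real trace layout
};
using trace_msg_sketch = std::variant<trace_exception_sketch, trace_sketch>;

// Produce a one-line summary depending on which alternative is held.
inline std::string describe(const trace_msg_sketch& msg) {
   return std::visit([](const auto& m) -> std::string {
      using T = std::decay_t<decltype(m)>;
      if constexpr (std::is_same_v<T, trace_exception_sketch>)
         return "error " + std::to_string(m.error_code) + ": " + m.error_message;
      else
         return "trace " + m.id;
   }, msg);
}
```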
    

Change Type

Select ONE:

  • Documentation
  • Stability bug fix
  • Other
  • Other - special case

Testing Changes

Select ANY that apply:

  • New Tests
  • Existing Tests
  • Test Framework
  • CI System
  • Other

Consensus Changes

  • Consensus Changes

API Changes

  • API Changes

Documentation Additions

  • Documentation Additions

spoonincode and others added 30 commits July 12, 2021 13:07
add some addtional includes to fix build with boost 1.77 - 2.2
[docs] resourse payer support in cleos is not available in rc1
Update README.md with the entry to download Ubuntu 20.04
[docs] Update docs, fix links, add legal feedback
…way to start RocksDB, provide a warning if no options file is supplied
[2.2.x] Update base image and build script pipelines.

@spoonincode (Contributor) left a comment

this PR's base is master not develop

@huangminghuang huangminghuang changed the base branch from master to develop August 25, 2021 19:36
@spoonincode spoonincode dismissed their stale review August 26, 2021 18:37

k now it's develop

@spoonincode (Contributor) commented

Is it possible to get a list of PRs this is up-porting?

@heifner (Contributor) commented Aug 26, 2021

> Is it possible to get a list of PRs this is up-porting?

Not just a list, but a complete summary of all changes.

# Conflicts:
#	README.md
#	docs/01_nodeos/07_features/15_private_chain_access/index.md
#	docs/20_upgrade-guide/index.md
#	libraries/fc
#	plugins/chain_plugin/chain_plugin.cpp
#	plugins/state_history_plugin/state_history_plugin.cpp
#	programs/cleos/main.cpp
#	tests/Node.py
@huangminghuang huangminghuang force-pushed the huangminghuang/merge-2.2.x branch from 9fd47d7 to 7a56aea Compare August 27, 2021 19:33
@huangminghuang huangminghuang merged commit d42cb43 into develop Aug 30, 2021
huangminghuang added a commit that referenced this pull request Sep 1, 2021
@huangminghuang huangminghuang deleted the huangminghuang/merge-2.2.x branch October 8, 2021 17:36