Prod Release 20/11 #414

Merged 11 commits into stable on Nov 28, 2023
Conversation

morgsmccauley and others added 9 commits November 9, 2023 15:32
## Problem

The Historical backfill is composed of two main steps:
1. Fetching matching blocks from `near-delta-lake`/Databricks index
files
2. Manually filtering all blocks between the `last_indexed_block` from
`near-delta-lake` and the `block_height` in which the historical process
was triggered

Only after both of these steps are completed are the blocks flushed to
Redis. This leads to a large delay between when the process is triggered
and when the blocks are actually executed by Runner, creating the
appearance of things 'not working'.

## Changes
This PR makes the following changes to reduce the time between trigger
and execution:

### Flush indexed blocks immediately
Fetching blocks from the index file (Step 1) is a relatively quick
process. Rather than wait for Step 2, we can flush these blocks
immediately, and then continue on to the following step.
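
To illustrate, here's a minimal TypeScript sketch of the flush-as-you-go idea. The actual backfill implementation is not shown in this description, so `fetchIndexedBlockHeights`, `filterRemainingBlocks`, and the stream key handling are assumed names, and `ioredis` stands in for whichever Redis client is used.

```typescript
import Redis from "ioredis";

// Hypothetical helpers standing in for the real index-file lookup and manual filtering.
declare function fetchIndexedBlockHeights(contractFilter: string): Promise<number[]>;
declare function filterRemainingBlocks(from: number, to: number): AsyncIterable<number>;

const redis = new Redis();

async function historicalBackfill(
  streamKey: string,
  contractFilter: string,
  lastIndexedBlock: number,
  triggerBlock: number,
): Promise<void> {
  // Step 1 is quick: flush the heights from the index files straight away so
  // Runner can start executing without waiting for Step 2.
  for (const height of await fetchIndexedBlockHeights(contractFilter)) {
    await redis.xadd(streamKey, "*", "block_height", height.toString());
  }

  // Step 2: manually filter the remaining range, flushing matches as they are found.
  for await (const height of filterRemainingBlocks(lastIndexedBlock, triggerBlock)) {
    await redis.xadd(streamKey, "*", "block_height", height.toString());
  }
}
```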

### Prefetch blocks in manual filtering
In manual filtering, blocks are processed sequentially. To speed this
process up, we can fetch ahead so that we minimise the time spent
waiting for S3 requests. Luckily, `near-lake-framework` does exactly
this, so manual filtering has been refactored to use it.
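
`near-lake-framework` handles the prefetching internally; the sketch below only illustrates the fetch-ahead idea it relies on, with a hypothetical `fetchBlockFromS3` and filter check.

```typescript
// Hypothetical S3 fetch and filter check, used only to illustrate fetch-ahead.
interface Block { height: number }
declare function fetchBlockFromS3(height: number): Promise<Block>;
declare function blockMatchesFilter(block: Block): boolean;

const PREFETCH_DEPTH = 10; // number of S3 requests kept in flight

// Yields blocks in order while keeping PREFETCH_DEPTH requests outstanding,
// so sequential processing rarely waits on S3.
async function* prefetchedBlocks(start: number, end: number): AsyncGenerator<Block> {
  const inFlight: Promise<Block>[] = [];
  let next = start;

  while (inFlight.length < PREFETCH_DEPTH && next <= end) {
    inFlight.push(fetchBlockFromS3(next++));
  }
  while (inFlight.length > 0) {
    const block = await inFlight.shift()!;
    if (next <= end) {
      inFlight.push(fetchBlockFromS3(next++));
    }
    yield block;
  }
}

async function manualFilter(start: number, end: number, flush: (height: number) => Promise<void>) {
  for await (const block of prefetchedBlocks(start, end)) {
    if (blockMatchesFilter(block)) {
      await flush(block.height);
    }
  }
}
```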

## Results
The following is from a backfill of 19,959,790 blocks run locally. I'd
expect the 'Before' numbers to be much lower on our actual
infrastructure, given the geographical distance from my local machine,
but the results are still positive :).

Time till first block appears on Redis:
- Before: 33 minutes
- After: 2 seconds

Time to completion:
- Before: 33 minutes
- After: 45 seconds

We have two versions of `indexer_rules_engine`: sync and async. This
logic does not need to be `async`, so I've removed the async version and
refactored so that only the sync version is used.

Additionally, I've migrated the `async` tests across to use the `sync`
version, and have added `cargo test` to GitHub Actions so that they are
always run. This required fixing the `set_provisioning_flag` test, as
well as ignoring the integration tests which rely on AWS access keys.

Runner sets the Indexer Config when its thread is started. As a result,
it does not react to updates to that config, such as updates to code.
This is a problem because, unless Runner is restarted, published code
changes won't be used. I've changed it so that the config is read on
each iteration of the loop; that way, config updates will be consumed.
In the short term, this can be tuned if need be, such as reading every X
loops and on every failure. In the long term, improved communication
between Coordinator and Runner could let Coordinator tell Runner when to
re-read the config, rather than Runner doing so all the time.

In addition, the metrics for block wait duration and overall execution
duration were wrong. I've moved the start time to the correct spot.
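
A rough TypeScript sketch of the loop change, with hypothetical helper names (the real Runner internals aren't shown in this description):

```typescript
// Hypothetical helpers; the point is that the config read sits inside the loop.
interface IndexerConfig { code: string; schema: string }
interface StreamMessage { id: string; blockHeight: number }
declare function fetchIndexerConfig(streamKey: string): Promise<IndexerConfig>;
declare function nextStreamMessage(streamKey: string): Promise<StreamMessage>;
declare function executeIndexerFunction(config: IndexerConfig, message: StreamMessage): Promise<void>;

async function streamConsumerLoop(streamKey: string): Promise<void> {
  while (true) {
    // Read the config on every iteration so newly published code is picked up
    // without restarting Runner. This could later be relaxed to every N loops
    // or only after failures if the extra reads become a problem.
    const config = await fetchIndexerConfig(streamKey);

    const message = await nextStreamMessage(streamKey);
    await executeIndexerFunction(config, message);
  }
}
```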
Coordinator pulls indexer config from the registry upon initialization.
While it runs, it continues to check each block for registry update
calls. When the registry is updated, it kicks off a historical process
if necessary, and updates its local indexer configurations to reflect
the new config for that particular indexer.

It turns out that the local indexer config was in fact not updating
correctly: the wrong key was being used to store the config. As a
result, updates to existing indexers were not being captured, while new
indexers would have the wrong config. Coordinator would attempt to set
various statuses, such as provisioning, but the incorrect keys would
lead to errors. Failing to capture indexer config updates also meant
Runner never saw those updates and continued to run stale code.
Restarting Coordinator resolved these issues, since the config is then
read freshly from the registry, hiding the problem for some time before
it would once again reappear.

In addition to the above problem, the configuration is only updated in
Redis when the next matching block appears. So, if a new indexer is
published without a start block, its Redis config and stream will not be
set until a new matching block appears. This can be problematic for
developers who are the producers of the blocks matching their contract
filter; in that case, matching blocks may be rare, leading to a long
delay before updated code is read and executed. Setting the config
immediately after an update is seen eliminates this problem.

I've fixed the particular line which was incorrectly setting the key,
and now set the Redis config immediately after an update is seen, to
resolve the above issues.
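
A heavily simplified TypeScript sketch of the two fixes; the real Coordinator code and its Redis key format aren't shown in this description, so the key shape and names below are purely illustrative.

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Illustrative registration shape and key format (not the actual Coordinator types).
interface IndexerRegistration {
  accountId: string;
  functionName: string;
  code: string;
}

// Deriving the key from the wrong fields is what caused status writes and
// config updates to land under a key nothing else read.
function indexerConfigKey(registration: IndexerRegistration): string {
  return `${registration.accountId}/${registration.functionName}:config`;
}

async function onRegistryUpdate(registration: IndexerRegistration): Promise<void> {
  // Write the config as soon as the registry update is seen, rather than
  // waiting for the next block that matches the indexer's contract filter.
  await redis.set(indexerConfigKey(registration), JSON.stringify(registration));
}
```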
This PR addresses a few pain points of logging and QueryAPI Indexer
development.

1. Logs can now be seen through the QueryAPI editor window via a `Show
Logs` button.
- No need to switch between the `Indexer Status` & `Editor` windows
anymore, saving time and frustration with the editor taking a few
seconds to load.
- Enables more powerful integrations with the editor for future PRs
(e.g. adding blocks to debug directly from the indexer Logs table).

2. Logs are easier to digest and search now.
- Logs are grouped by `block_height`, so the user sees all the messages
belonging to a block's execution together.
- Additionally, the user can use the `search` functionality to search
for any `block_height` where the function was executed and view its log
in its entirety.
- Logs now have relative dates, e.g. (4 mins ago).
- Block heights can be clicked on to take you to the explorer link.
3. Simplifying QueryAPI links.
- There is no need for `&view=status/editor-window` to be passed in
anymore.
- As long as `selectedIndexerPath` is defined, the user is sent to the
editor window. If `&view=status` is passed, the user is redirected to
the indexer logs page within the editor window.
- Note: I have a PR on `queryapi/docs` to reflect these changes:
near/docs#1584
  
Notes: Instead of using GraphQL to fetch logs, we are using a feature of
Hasura which `restifies` your GraphQL queries (see the sketch below).
You can find details here:
https://near-queryapi.api.pagoda.co/console/api/rest/list
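
As a rough example, fetching logs through a restified endpoint could look something like this; the endpoint path, query parameters, and response shape are assumptions, since the actual endpoints are the ones listed at the console link above.

```typescript
// Assumed endpoint path, parameters, and response shape; see the console link
// above for the actual restified endpoints.
interface LogEntry {
  block_height: number;
  message: string;
  timestamp: string;
}

async function fetchIndexerLogs(accountId: string, functionName: string): Promise<LogEntry[]> {
  const params = new URLSearchParams({ account_id: accountId, function_name: functionName });
  const response = await fetch(
    `https://near-queryapi.api.pagoda.co/api/rest/indexer_logs?${params}` // assumed path
  );
  if (!response.ok) {
    throw new Error(`Failed to fetch logs: ${response.status}`);
  }
  const body = await response.json();
  return body.logs as LogEntry[]; // assumed response shape
}
```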
  
- The Old Indexer Status Widget should still be available if needed via
the following link:
https://near.org/dataplatform.near/widget/QueryApi.IndexerStatus?accountId=dataplatform.near&indexer_name=social_feed
  
Other small updates:
- If an invalid indexer path is chosen, the editor will show an error.
- QueryAPI is available for users in read-only mode even if they are not
logged in on near.org.

### This PR does not add WS support. This will be done in a future PR. 
  

![image](https://github.com/near/queryapi/assets/25015977/0e94ede2-6cca-41ae-9dd4-827ff388ba48)

![image](https://github.com/near/queryapi/assets/25015977/01e655cb-786d-4913-9b09-e00014037863)
![queryapi logs testing](https://github.com/near/queryapi/assets/25015977/c66d980c-0988-492e-b225-46371ff7572e)
morgsmccauley requested a review from a team as a code owner on November 20, 2023 at 03:39.
@roshaans (Contributor):

I've updated a frontend environment variable in Terraform which needs to be applied in our Prod environment here: https://github.com/near/near-ops/pull/1440. The change is already applied for our dev env.

@morgsmccauley (Collaborator, Author):

> I've updated a frontend environment variable in Terraform which needs to be applied in our Prod environment here: near/near-ops#1440. The change is already applied for our dev env.

@roshaans could you please apply it?

So after a lot of testing, I found the problem. It turns out that the
lazy loading implementation I used for context.db was set up so that any
call to a db function would first check whether the dmlHandler exists
(load on access) and, if not, instantiate it. This is helpful for
situations where the functions aren't used. However, it had a side
effect: if a bunch of db function calls come in simultaneously, they
will all poll dmlHandler, see that it is uninitialized, and all try to
create a new one. Each initialization will try to create a pool with 10
available clients, and then each of them will create at least one
client. This exhausts the clients very quickly if there are enough of
them. I've changed it so that the dmlHandler initialization is kicked
off immediately instead, and on access, calls await its completion
before running any functions. This ensures there is no more than 1
client active for that invocation. I wrote a test as well which tries
100 simultaneous calls and verified that 1 pool is created.

Transaction support can come next, now that the pooling works correctly.
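
A minimal sketch of the pattern described above, with a hypothetical `DmlHandler` API: the initialization promise is created once, and every db call awaits that same promise, so concurrent calls can no longer race to build multiple pools.

```typescript
// Hypothetical DmlHandler API used only to illustrate the single-promise fix.
declare class DmlHandler {
  static create(connectionParams: object): Promise<DmlHandler>;
  insert(table: string, rows: object[]): Promise<object[]>;
}

class LazyDmlHandler {
  private handlerPromise: Promise<DmlHandler>;

  constructor(connectionParams: object) {
    // Kick off initialization immediately (one pool, one promise)...
    this.handlerPromise = DmlHandler.create(connectionParams);
  }

  async insert(table: string, rows: object[]): Promise<object[]> {
    // ...and await it on access, instead of checking "does it exist yet?",
    // which is what allowed simultaneous calls to each create their own pool.
    const handler = await this.handlerPromise;
    return handler.insert(table, rows);
  }
}
```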
The current block wait metric uses a gauge, which causes spikes in block
wait to be lost. So, I have switched it to a histogram instead, which
does a much better job of capturing peaks and supports a variety of more
useful expressions, some of which I list below. These expressions are
also customizable.

A more accurate average:
```
rate(queryapi_runner_block_wait_duration_milliseconds_sum[$__rate_interval])
/
rate(queryapi_runner_block_wait_duration_milliseconds_count[$__rate_interval])
```

The duration under which 95% of block requests were fulfilled:
```
histogram_quantile(0.95, sum(rate(queryapi_runner_block_wait_duration_milliseconds_bucket[$__rate_interval])) by (le))
```

The % of requests for each indexer where the block wait was under 100ms:
```
  sum(rate(queryapi_runner_block_wait_duration_milliseconds_bucket{le="100"}[$__rate_interval])) by (indexer)
/
  sum(rate(queryapi_runner_block_wait_duration_milliseconds_count[$__rate_interval])) by (indexer)
```
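
For reference, recording the block wait into a histogram could look roughly like this, assuming the Node.js `prom-client` library; the bucket boundaries and helper name are illustrative, not the actual configuration.

```typescript
import { Histogram } from "prom-client";

// Bucket boundaries are illustrative; choose them to bracket expected wait times.
const blockWaitDuration = new Histogram({
  name: "queryapi_runner_block_wait_duration_milliseconds",
  help: "Time spent waiting for the next block, in milliseconds",
  labelNames: ["indexer"],
  buckets: [10, 50, 100, 300, 1000, 3000, 10000, 30000],
});

// Hypothetical block fetch helper.
declare function getNextBlockHeight(streamKey: string): Promise<number>;

async function waitForBlock(indexer: string, streamKey: string): Promise<number> {
  const start = Date.now();
  const blockHeight = await getNextBlockHeight(streamKey);
  // A histogram keeps per-bucket counts over time, so short spikes survive
  // scraping, unlike a gauge whose last value overwrites earlier peaks.
  blockWaitDuration.labels(indexer).observe(Date.now() - start);
  return blockHeight;
}
```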
morgsmccauley self-assigned this on Nov 22, 2023.
morgsmccauley merged commit b257f65 into stable on Nov 28, 2023. 14 checks passed.