
[THREESCALE-9537] Configure batcher policy storage #1452

Merged

Conversation

tkan145
Contributor

@tkan145 tkan145 commented Mar 8, 2024

What

Fixes: https://issues.redhat.com/browse/THREESCALE-9537

Dev notes

1. What shared dict value should we increase?

The 3scale batcher policy uses a few different shared dict caches:

lua_shared_dict cached_auths 20m;
lua_shared_dict batched_reports 20m;
lua_shared_dict batched_reports_locks 1m;

and api_keys if the Caching policy is included in the chain:

lua_shared_dict api_keys 30m;

First, let's run some tests to see how much a 1m shared cache can hold.

lua_shared_dict test 1m;
location = /t {
    content_by_lua_block {
        local rep = string.rep
        local dt = ngx.shared.test
        local val = rep("v", 15)
        local key = rep("k", 32)
        local i = 0
        while i < 200000 do
            local ok, err = dt:safe_add("service_id:_" .. i .. ",user_key:" .. key .. ",metric:hits", val)
            if not ok then
                break
            end
            i = i + 1
        end
        ngx.say(i, " key/value pairs inserted")
    }
}

The reason to use safe_add() here is to prevent the automatic eviction of least recently used items that set() performs upon memory shortage.
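
To make the difference concrete, here is a minimal sketch (assuming the same lua_shared_dict test 1m; declared above): set() silently evicts least-recently-used entries when the dict is full, while safe_add() refuses to evict and returns an error instead.

local dict = ngx.shared.test

-- set() succeeds even when the dict is full, at the cost of evicting
-- least-recently-used entries (the third return value reports this).
local ok, err, forcible = dict:set("some_key", "value")

-- safe_add() never evicts: when the dict is full it fails with "no memory".
local ok2, err2 = dict:safe_add("other_key", "value")
if not ok2 then
    ngx.log(ngx.ERR, "shared dict full, nothing evicted: ", err2)
end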

Querying /t gives the following response body on my Linux x86_64 system. NOTE: you may get a different value, as this depends on the underlying architecture.

4033 key/value pairs inserted.

So a 1m store can hold 4033 key/value pairs with 57-byte keys and 15-byte values. In reality, the actual available space will depend on memory fragmentation, but since these key/value pairs are consistent in size, we should have no problem. More details here

Changing the "dict" store to 10m gives

40673 key/value pairs inserted.

So a 10m store can hold 40673 pairs. It's linear growth, as expected.

| Name | Method | When full | TTL | Growth |
| --- | --- | --- | --- | --- |
| cached_auths | set/get | evict old/expired keys | 10 | 1 for each new transaction |
| batched_reports | safe_add/incr | return error | Flush all keys after the batch_report_seconds timeout (default 10s) | 1 for each transaction that returns 200. If the key exists, the existing key is updated. |
| batched_reports_locks | set/delete | evict old/expired keys | None | 1 for each transaction, but the lock is released in the same function |
| api_keys | get/set/delete | evict old/expired keys | None | |

So we can see that all of them grow at roughly the same rate, but due to the use of safe_add and reports being batched for 10 seconds, only batched_reports will return a "no memory" error.
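
For context, here is a rough illustration of the safe_add/incr pattern described above (not the actual policy code): a report key is created only when it does not exist yet and is incremented afterwards, so a full dict surfaces a "no memory" error instead of silently evicting other batched reports.

local reports = ngx.shared.batched_reports

-- Hypothetical helper: add the report key if missing, then increment it.
local function add_report(key, delta)
    local ok, err = reports:safe_add(key, 0)
    if not ok and err ~= "exists" then
        return nil, err  -- "no memory" once batched_reports is full
    end
    return reports:incr(key, delta)
end

add_report("service_id:_1,user_key:abc,metric:hits", 1)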

A possible workaround is to set batch_report_seconds to a lower value.
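
As a rough sketch of that workaround (not taken from this PR; in a real deployment the value is set through the policy configuration in the policy chain, and the require path below assumes APIcast's package.path is set up so the policy module can be resolved):

-- Load the 3scale batcher policy and configure a shorter reporting window,
-- so batched_reports is flushed more often than the default 10 seconds.
local batcher_policy = require("apicast.policy.3scale_batcher")
local policy = batcher_policy.new({ batch_report_seconds = 5 })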

2. Do I need to increase the batcher policy storage?

Let's do a small test and increase the key size:

| Dict | Size | Key format | Credential length | Key/value pairs |
| --- | --- | --- | --- | --- |
| batched_reports | 20m | `service_id:<service_id>,user_key:<service_credential>,metric:<metric_name>` | 60 | 81409 |
| | | | 120 | 81409 |
| | | | 142 | 40705 |
| | | | 400 | 20400 |

With a key larger than 400 bytes, for batched_reports to be completely filled it would require 20400/10 = 2040 req/sec. It's very unlikely that a single gateway will be hit with this much traffic.
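
As a quick sanity check of that arithmetic (illustrative only):

-- Requests per second needed to fill batched_reports before the default flush.
local max_pairs = 20400           -- pairs that fit a 20m dict with ~400-byte keys (table above)
local batch_report_seconds = 10   -- default flush interval
print(max_pairs / batch_report_seconds)  -- 2040 req/sec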

@eguzki do you know what is the highest load a single gateway can handle?

Verification Steps

Filling the storage is a bit tricky, so I just check whether the generated configuration file contains the correct value.

  • Create an apicast-config.json file with the following content:
cat <<EOF >apicast-config.json
{
    "services": [
        {
            "id": "1",
            "backend_version": "1",
            "proxy": {
                "hosts": [
                    "one"
                ],
                "api_backend": "https://echo-api.3scale.net:443",
                "backend": {
                    "endpoint": "http://127.0.0.1:8081",
                    "host": "backend"
                },
                "policy_chain": [
                    {
                        "name": "apicast.policy.apicast"
                    }
                ],
                "proxy_rules": [
                    {
                        "http_method": "GET",
                        "pattern": "/",
                        "metric_system_name": "hits",
                        "delta": 1,
                        "parameters": [],
                        "querystring_parameters": {}
                    }
                ]
            }
        }
    ]
} 
EOF
  • Check out this branch and start the dev environment
make development
make dependencies
  • Run apicast locally with APICAST_POLICY_BATCHER_SHARED_MEMORY_SIZE set to 40m
THREESCALE_DEPLOYMENT_ENV=staging APICAST_LOG_LEVEL=debug APICAST_WORKER=1 APICAST_CONFIGURATION_LOADER=lazy APICAST_CONFIGURATION_CACHE=0  APICAST_POLICY_BATCHER_SHARED_MEMORY_SIZE="40m" THREESCALE_CONFIG_FILE=apicast-config.json ./bin/apicast
  • Stop the gateway
CTRL-C
  • Check that lua_shared_dict batched_reports is set to 40m
$ grep -nr "batched_reports" /tmp

/tmp/lua_PRpxLW:67:lua_shared_dict batched_reports 40m;
/tmp/lua_PRpxLW:68:lua_shared_dict batched_reports_locks 1m;

@tkan145 tkan145 force-pushed the THREESCALE-9537-batcher-policy-storage branch 6 times, most recently from 1ba4383 to 90e18eb Compare March 14, 2024 06:16
Member

@eguzki eguzki left a comment


Verification steps successful.

do you know what is the highest load a single gateway can handle?

That depends on the number of workers, policies, and available infra resources. Hard to tell. It needs to be measured for a given context and traffic type, e.g. credential length.

Question: is the shared dict shared across all the 3scale products? So if I have 20m, is it shared between me and other 3scale users?

Maybe I would add some documentation with your tests, showing what you can get with the default values of the policy and the new env var for several key sizes. Then the same for half/double of the default value of batch_report_seconds, and the same for half/double of the default value of the new env var.

Regarding

lua_shared_dict test 1m;
location = /t {
	content_by_lua_block{
		local rep = string.rep
		local dt = ngx.shared.test
		local val = rep("v", 15)
                local key = rep("k", 32)
		local i = 0
		while i < 200000 do
                    local ok, err = dict:safe_add("service_id:_" .. i ..",user_key:".. key ..",metric:hits", i)
                    if not ok then
		        break
		    end
		    i = i + 1
		end
		ngx.say(n, " key/value pairs inserted")
    }
}

dt local variable is not being used.
n variable is unknown to me as well.

gateway/conf.d/backend.conf Outdated
doc/parameters.md
@tkan145
Contributor Author

tkan145 commented Mar 22, 2024

Also fixed the test steps.

Question: The shared dict is shared for all the 3scale products? So if I have 20m, those are shared for me and other 3scale users?

Yes, the shared dict is shared between workers and all 3scale products, and perhaps users also.

Maybe I would add some documentation with your tests, showing what you can get with the default values of the policy and the new env var for several key sizes. Then the same for half/double of the default value of batch_report_seconds, and the same for half/double of the default value of the new env var.

Where do you think that doc should live? Inside the top-level doc or inside the policy?

@tkan145 tkan145 marked this pull request as ready for review March 22, 2024 07:25
@tkan145 tkan145 requested review from a team as code owners March 22, 2024 07:25
@eguzki
Member

eguzki commented Mar 22, 2024

the top-level doc or inside the policy

I would say in the specific readme for the batcher policy: https://github.com/3scale/APIcast/blob/master/gateway/src/apicast/policy/3scale_batcher/README.md

@tkan145 tkan145 requested a review from eguzki March 28, 2024 12:38
CHANGELOG.md Outdated
@@ -55,6 +55,8 @@ and this project adheres to [Semantic Versioning](http://semver.org/).

- Added `APICAST_CLIENT_REQUEST_HEADER_BUFFERS` variable to allow configure of the NGINX `client_request_header_buffers` directive: [PR #1446](https://github.com/3scale/APIcast/pull/1446), [THREESCALE-10164](https://issues.redhat.com/browse/THREESCALE-10164)

Added `APICAST_POLICY_BATCHER_SHARED_MEMORY_SIZE` variable to allow configure batcher policy share memory size. [PR #1452](https://github.com/3scale/APIcast/pull/1452), [THREESCALE-9537](https://issues.redhat.com/browse/THREESCALE-9537)
Contributor


Suggested change
Added `APICAST_POLICY_BATCHER_SHARED_MEMORY_SIZE` variable to allow configure batcher policy share memory size. [PR #1452](https://github.com/3scale/APIcast/pull/1452), [THREESCALE-9537](https://issues.redhat.com/browse/THREESCALE-9537)
Added the `APICAST_POLICY_BATCHER_SHARED_MEMORY_SIZE` variable to allow configuration of the batcher policy to share memory size. [PR #1452](https://github.com/3scale/APIcast/pull/1452), [THREESCALE-9537](https://issues.redhat.com/browse/THREESCALE-9537)

Contributor Author


updated

lua_shared_dict limiter 1m;

# This shared dictionaries are only used in the 3scale batcher policy.
# This is not ideal, but they'll need to be here until we allow policies to
Contributor


Suggested change
# This is not ideal, but they'll need to be here until we allow policies to
# These requirements will remain in place until we allow policy changes.

Contributor Author


updated

| | 512 | 15 | 40712 |

In practice, the actual number will depend on the size of the key/value pair, the
underlying OS architecture, memory segment sizes, etc... More details [here](https://blog.openresty.com/en/nginx-shm-frag/)
Contributor

@dfennessy dfennessy Apr 2, 2024


Suggested change
underlying OS architecture, memory segment sizes, etc... More details [here](https://blog.openresty.com/en/nginx-shm-frag/)
underlying operating system (OS) architecture and memory segment sizes...More details [here](https://blog.openresty.com/en/nginx-shm-frag/)

Contributor Author


updated

Contributor

@dfennessy dfennessy left a comment


I've added some suggestions.

If cache_handler exists, the policy will need to update_downtime_cache
regardless of the backend status code. Thus, instead of passing cache_handler
to other functions, it is much simpler to update the cache inside access().

In some cases, the batcher policy will run out of storage space (batched_reports)
and cause metrics to not be reported. This commit adds a new environment
variable to configure the batcher policy shared memory storage size.
@tkan145 tkan145 force-pushed the THREESCALE-9537-batcher-policy-storage branch from f1e0a96 to c08731c Compare April 3, 2024 00:19
@tkan145
Contributor Author

tkan145 commented Apr 3, 2024

Thanks @dfennessy. I will need your approval as well.

@dfennessy dfennessy self-requested a review April 3, 2024 08:27
Contributor

@dfennessy dfennessy left a comment


LGTM

@tkan145 tkan145 merged commit 9bcea54 into 3scale:master Apr 4, 2024
12 checks passed