Cache /api/v0/preferences and /api/v0/attribute_groups #8957

Open
CharlesNepote opened this issue Sep 5, 2023 · 4 comments
Labels
Attributes https://wiki.openfoodfacts.org/Product_Attributes 🚅 Performance product attributes

Comments

@CharlesNepote
Member

CharlesNepote commented Sep 5, 2023

Based on an analysis of 50 million nginx log lines, we have found that these two URLs account for 2.83% and 2.79% of all requests respectively (5.62% combined).

These two files are used to set up preferences. They are generated by Perl, without any database access. See:

It's very easy and efficient to cache them with nginx for a few dozen seconds (according to Stéphane, 1 minute should be fine).

We currently (2023-09) serve around 3,000 to 6,000 requests per minute. Caching 5.6% of these requests would spare the backend around 170 to 330 requests per minute. It would also help during traffic peaks.
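
As a quick check of the arithmetic above, here is a minimal sketch. `saved_per_minute` is a hypothetical helper name, and 5.62 is the combined share from the log analysis.

```shell
# Requests per minute that the cache would absorb, given the total traffic
# and the share (in percent) taken by the two cached URLs.
saved_per_minute() {
    # $1 = total requests per minute, $2 = cached share in percent
    awk -v total="$1" -v share="$2" 'BEGIN { printf "%.0f\n", total * share / 100 }'
}

saved_per_minute 3000 5.62   # low end of the traffic range
saved_per_minute 6000 5.62   # high end of the traffic range
```

This prints 169 and 337, in line with the rough figures quoted above. The real saving would be slightly lower, since the first request of each cache lifetime still reaches the backend.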

The nginx conf could be configured this way:

# ***
# * Cache
# *
# * - Introducing article: https://www.nginx.com/blog/nginx-caching-guide/
# * - Long article: https://www.nginx.com/blog/nginx-high-performance-caching/#BasicPrinciplesofContentCaching
# * - Firefox extension to debug http headers: https://addons.mozilla.org/en-US/firefox/addon/http-header-live/
# * 
# * Only two directives are needed to enable basic caching: proxy_cache_path (http{} level) and proxy_cache (server{} level).
# * proxy_cache_path directive sets the path and configuration of the cache, and the proxy_cache directive activates it.
# *   levels:    sets up a two-level directory hierarchy under /path/to/cache/. Having a large number of files in a
# *              single directory can slow down file access, so we recommend a two level directory hierarchy for most 
# *              deployments. If the levels parameter is not included, NGINX puts all files in the same directory.
# *   keys_zone: sets up a shared memory zone for storing the cache keys and metadata such as usage timers. 
# *              Having a copy of the keys in memory enables NGINX to quickly determine if a request is a HIT 
# *              or a MISS without having to go to disk, greatly speeding up the check. A 1MB zone can store 
# *              data for about 8,000 keys, so the 60MB zone configured below can store data for about 480,000 keys.
# *   inactive:  specifies how long an item can remain in the cache without being accessed. With inactive=1h below, a file that 
# *              has not been requested for 1 hour is automatically deleted from the cache by the cache manager process, 
# *              regardless of whether or not it has expired. The default value is 10 minutes (10m). Inactive content 
# *              differs from expired content. NGINX does not automatically delete content that has expired as defined 
# *              by a cache control header (Cache-Control:max-age=120 for example). Expired (stale) content is deleted 
# *              only when it has not been accessed for the time specified by inactive. When expired content is accessed, 
# *              NGINX refreshes it from the origin server and resets the inactive timer.
# *   max_size:  sets the upper limit of the size of the cache (200 MB below). It is optional; not specifying 
# *              a value allows the cache to grow to use all available disk space. When the cache size reaches the limit, 
# *              a process called the cache manager removes the files that were least recently used to bring the cache size back under the limit.
#
# You can check the directory size from time to time: du -sh /var/cache/nginx
proxy_cache_path  /var/cache/nginx  levels=1:2  keys_zone=cachezone:60m  inactive=1h  max_size=200m;

server {
    location ~ ^/api/v./(preferences|attribute_groups) {
        # Note: proxy_cache only applies to proxied responses, so this location also
        # needs a proxy_pass directive pointing at the backend (not shown here).

        # Activate the cache zone named "cachezone"
        proxy_cache             cachezone;

        # proxy_cache_valid sets which response codes are cached and for how long
        # (here: any response code, for 1 minute)
        proxy_cache_valid       any  1m;

        # proxy_cache_use_stale: delivers stale cached content when the origin is down.
        # "Additionally, the updating parameter permits using a stale cached response if it is 
        #  currently being updated. This allows minimizing the number of accesses to proxied servers 
        #  when updating cached data."
        proxy_cache_use_stale          updating error timeout http_500 http_502 http_503 http_504;

        # Adds an X-Cache-Status HTTP header in responses to clients: helps debugging the
        # cache.
        # https://www.nginx.com/blog/nginx-caching-guide/#Frequently-Asked-Questions-(FAQ)
        # Eg. X-Cache-Status: HIT
        add_header X-Cache-Status $upstream_cache_status;
    }
}
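
Once such a configuration is deployed, the X-Cache-Status header can be checked from the command line. The following is a minimal sketch: `cache_status` is a hypothetical helper, and the live `curl` call is left commented out so that the snippet runs offline on a canned response.

```shell
# Extract the X-Cache-Status value from raw HTTP response headers on stdin.
cache_status() {
    tr -d '\r' | awk -F': *' 'tolower($1) == "x-cache-status" { print $2 }'
}

# Usage against a live server (-o /dev/null discards the body,
# -D - dumps the response headers to stdout):
#   curl -s -o /dev/null -D - https://world.openfoodfacts.org/api/v0/preferences | cache_status

# Offline demonstration with a canned response:
printf 'HTTP/1.1 200 OK\r\nContent-Type: application/json\r\nX-Cache-Status: HIT\r\n\r\n' | cache_status
```

With a 1 minute validity, two requests sent a few seconds apart would be expected to show MISS (or EXPIRED) first and then HIT.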

To debug and analyze cache hits, it's possible to create a temporary, dedicated log:
(source: https://serverfault.com/a/912897)

# This directive needs to be placed in the http{} context of the nginx configuration
log_format cache_st '$remote_addr - $upstream_cache_status [$time_local]  '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';

And:

# The path needs to be adapted. These logs should be temporary: please do not keep them after the tests.
access_log   /var/log/nginx/domain.com.cache.log cache_st;

Then it's easy to get some stats about the cache and verify that it is working and efficient:
(source: https://serverfault.com/a/912897 )

HIT vs MISS vs BYPASS vs EXPIRED:
awk '{print $3}' cache.log | sort | uniq -c | sort -rn

MISS URLs:
awk '$3 ~ /MISS/ {print $7}' cache.log | sort | uniq -c | sort -rn

BYPASS URLs:
awk '$3 ~ /BYPASS/ {print $7}' cache.log | sort | uniq -c | sort -rn
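
The counts above can also be turned into a single hit ratio. This is a minimal sketch, assuming the cache_st log format defined earlier (cache status in the third field). `hit_ratio` is a hypothetical helper and the three log lines are made up for the demonstration.

```shell
# Print the percentage of requests served from the cache (status HIT),
# reading a log in the cache_st format from stdin.
hit_ratio() {
    awk '{ total++ } $3 == "HIT" { hits++ }
         END { if (total > 0) printf "%.2f\n", 100 * hits / total }'
}
# Usage on the real log: hit_ratio < cache.log

# Demonstration on three made-up log lines (2 HIT, 1 MISS):
hit_ratio <<'EOF'
203.0.113.5 - HIT [05/Sep/2023:10:00:01 +0000]  "GET /api/v0/preferences HTTP/1.1" 200 512 "-" "curl/8.0"
203.0.113.6 - MISS [05/Sep/2023:10:00:02 +0000]  "GET /api/v0/attribute_groups HTTP/1.1" 200 900 "-" "curl/8.0"
203.0.113.7 - HIT [05/Sep/2023:10:00:03 +0000]  "GET /api/v0/attribute_groups HTTP/1.1" 200 900 "-" "curl/8.0"
EOF
```

The demonstration prints 66.67. Only HIT is counted here; whether STALE or UPDATING responses should also count as served from the cache is a choice left open.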

Part of

@CharlesNepote CharlesNepote added ✨ Feature Features or enhancements to Open Food Facts server 🚅 Performance labels Sep 5, 2023
@teolemon
Member

teolemon commented Sep 6, 2023

@CharlesNepote the website makes those requests, right?
The app pings them once during setup, and then periodically, but I guess we cache the result.
@monsieurtanuki @g123k

@teolemon teolemon added product attributes Attributes https://wiki.openfoodfacts.org/Product_Attributes labels Sep 6, 2023
@alexgarel
Member

@CharlesNepote using https://nginx.org/en/docs/http/ngx_http_memcached_module.html could be far more efficient (we already have the memcached server)

It seems to be available with the nginx-extras package.

@CharlesNepote
Member Author

CharlesNepote commented Sep 12, 2023

I have done some further computations, based on 41 million requests from the nginx logs.

  1. Every minute, these files* are requested more than 16 times, and sometimes more than 360 times.
  2. With a cache lifetime of one minute, 99.21% of the requests for /api/v0/attribute_groups would be served from the cache.

*https://world.openfoodfacts.org/api/v0/preferences and https://world.openfoodfacts.org/api/v0/attribute_groups
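
One way to estimate such a hit ratio from existing logs is to replay the requests through a simulated cache with a fixed lifetime. This is only a sketch under stated assumptions, not necessarily how the figure above was computed: `simulate_ttl_cache` is a hypothetical helper, and the input format (`<unix_timestamp> <url>`, sorted by time) is made up and would have to be derived from the real access logs.

```shell
# Simulate a cache with a fixed lifetime over a request stream.
# stdin: one request per line as "<unix_timestamp> <url>", sorted by time
#        (hypothetical format, to be derived from the real access logs).
# $1:    cache lifetime in seconds.
simulate_ttl_cache() {
    awk -v ttl="$1" '
        {
            total[$2]++
            # HIT when the entry for this URL was stored less than ttl seconds ago
            if (($2 in stored) && ($1 - stored[$2] < ttl)) {
                hits[$2]++
            } else {
                stored[$2] = $1      # MISS: fetch from the backend and store
            }
        }
        END {
            for (url in total)
                printf "%s %.2f\n", url, 100 * hits[url] / total[url]
        }'
}

# Demonstration: 4 made-up requests with a 60 second lifetime (2 hits out of 4).
printf '%s\n' \
    '0 /api/v0/attribute_groups' \
    '10 /api/v0/attribute_groups' \
    '59 /api/v0/attribute_groups' \
    '60 /api/v0/attribute_groups' \
    | simulate_ttl_cache 60
```

The demonstration prints `/api/v0/attribute_groups 50.00`: the requests at 0 and 60 seconds miss, the two in between hit. With 16 or more requests per minute, as measured above, at most one request per minute would miss, which is why the ratio gets so close to 100%.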

This means that the cache would be very efficient.

@alexgarel I don't understand why using memcached would be far more efficient than reading the cache from the filesystem. It would also be more complicated. I feel that nginx directives without any dependency are more robust.

@alexgarel
Member

@CharlesNepote You are right about the fact that the files will be in cache. So you can implement it that way.

@teolemon teolemon moved this to To discuss and validate in 🍊 Open Food Facts Server issues Apr 23, 2024
@teolemon teolemon removed the ✨ Feature Features or enhancements to Open Food Facts server label Oct 18, 2024