Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache not being cleaned up while using high-threshold option #1226

Closed
vakaobr opened this issue Aug 16, 2023 · 13 comments · Fixed by #1233
Closed

Cache not being cleaned up while using high-threshold option #1226

vakaobr opened this issue Aug 16, 2023 · 13 comments · Fixed by #1233
Assignees
Labels
Milestone

Comments

@vakaobr
Copy link

vakaobr commented Aug 16, 2023

Which version of blobfuse was used?

2.0.5, 2.0.6-preview.1 and 2.1.0-preview.2

Which OS distribution and version are you using?

Ubuntu 22.04

If relevant, please share your mount command.

Not exactly the mount command, but my .yaml file used to mount the volume:

allow-other: true #|false <allow other users to access the mounted directory - used for FUSE and File Cache>

# Logger configuration
logging:
  type: base # syslog|silent|base <type of logger to be used by the system. silent = no logger, base = file based logger. Default - syslog>
  level: log_debug # log_off|log_crit|log_err|log_warning|log_info|log_trace|log_debug <log level. Default - log_warning>
  file-path: /var/log/logfile-blobfuse2.log #<path where log files shall be stored. Default - '$HOME/.blobfuse2/blobfuse2.log'>
  max-file-size-mb: 1024 #<maximum allowed size for each log file (in MB). Default - 512 MB>
  track-time: true # <track time taken by important operations>
components:
  - libfuse
  - file_cache
  - attr_cache
  - azstorage
libfuse:
  disable-writeback-cache: true
  ignore-open-flags: true
  max-fuse-threads: 64
  fuse-trace: false
  attribute-expiration-sec: 120
  entry-expiration-sec: 120
  negative-entry-expiration-sec: 240
# Disk cache related configuration
file_cache:
  # Required
  path: /cachedata/blobfusetmp
  # Optional
  policy: lru # <eviction policy to be engaged for cache eviction. lru = least recently used file to be deleted, lfu = least frequently used file to be deleted. Default - lru>
  timeout-sec: 1209600 # <default cache eviction timeout (in sec). Default - 120 sec> - Try to evict a file only after 14 days
  high-threshold: 5 # <% disk space consumed which triggers eviction. This parameter overrides 'timeout-sec' parameter and cached files will be removed even if they have not expired. Default - 80> - Start eviction with files from here instead of DVR
  low-threshold: 2 # <% disk space consumed which triggers eviction to stop when previously triggered by the high-threshold. Default - 60>
attr_cache:
  timeout-sec: 7200
azstorage:
  account-name: <sanitized> # <storage account name>
  account-key: <sanitized> # <storage account key>
  endpoint: <sanitized> # <storage account endpoint>
  container: <sanitizied> # <storage container name>
  update-md5: false
  validate-md5: false
  virtual-directory: true
  block-size-mb: 32
  max-concurrency: 8
  disable-compression: true
  sdk-trace: false # <enable storage sdk logging>

What was the issue encountered?

In a similar fashion of the #1114 the cache are not being cleaned up based on configuration parameters. The documentation states the parameter high-threshold overrides the value set to timeout-sec, but that's not happening.

Have you found a mitigation/solution?

Tried to use a compiled version from main (2.0.6-preview.1) and from 2.1.0-preview.2 with no luck (looks like the changes introduced here didn't fixed this issue)

@vakaobr vakaobr changed the title Cache not being cleaned up while using high-threshold Cache not being cleaned up while using high-threshold option Aug 16, 2023
@vibhansa-msft
Copy link
Member

Is your use-case involves a lot of rename operations? Recently we have found an issue with file-cache where rename leaves the destination file in cache for ever.

@vibhansa-msft vibhansa-msft self-assigned this Aug 16, 2023
@vibhansa-msft vibhansa-msft added this to the V2-2.1.0 milestone Aug 16, 2023
@vakaobr
Copy link
Author

vakaobr commented Aug 17, 2023

Hello @vibhansa-msft, actually no rename operations are involved, we're storing video files and our usecase is:

  • Keep the files on cache while those are being generated and/or requested from the storage account as long as possible (we're aiming for 2 weeks setting timeout-sec: 1209600 on file_cache: section)
  • Clean up the least recently used ones when storage hits a given number (85% on our case) going down to 75% using the high-threshold and low-threshold respectively

@vibhansa-msft
Copy link
Member

There can be potentially one of the two cases here:

  • These thresholds are not hit because 'du' command is not found by blobfuse binary.
  • Application is not closing the handles of the files, hence blobfuse will never be able to delete those.
    Can you enable log_debug and share the log files, that will help us root-cause this further.

@vakaobr
Copy link
Author

vakaobr commented Aug 17, 2023

Hey hey,

Thanks for the feedback. For the du binary, its on the default location of the OS (reachable via $PATH) so it should be fine:

root@my-vm:/var/log# which du
/usr/bin/du

Find the log attached, this are from my last tests yesterday (when I set the default timeout for 120 seconds to confirm it will trigger a deletion of the files) and it works, it don't work as expected when we hit the high threshold of disk utilisation (for the test, I've created several files using urandom and dd in a loop so I think the files was closed OK)

content-dir-blobfuse2.log

@vibhansa-msft
Copy link
Member

Can you share your latest config file. I do see one issue here. Though there are multiple files to be deleted from cache the eviction logic stops are deleting just one of them, hence the cache usage keep rising. This points to some mis-configuration of file-cache and I suspect you have set max-eviction to a very low number somewhere.

Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (11) ##
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (12) /cachedir/blobfusetmp/2023-08-16_15-40-42.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (13) /cachedir/blobfusetmp/2023-08-16_15-40-29.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (14) /cachedir/blobfusetmp/2023-08-16_15-40-16.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (15) /cachedir/blobfusetmp/2023-08-16_15-40-03.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (16) /cachedir/blobfusetmp/2023-08-16_15-39-50.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (17) /cachedir/blobfusetmp/2023-08-16_15-39-37.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (18) /cachedir/blobfusetmp/2023-08-16_15-39-24.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (19) /cachedir/blobfusetmp/2023-08-16_15-39-10.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (20) /cachedir/blobfusetmp/2023-08-16_15-38-55.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (21) /cachedir/blobfusetmp/2023-08-16_15-38-29.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (22) /cachedir/blobfusetmp/2023-08-16_15-38-16.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (23) /cachedir/blobfusetmp/2023-08-16_15-38-03.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (24) /cachedir/blobfusetmp/2023-08-16_15-37-50.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (25) /cachedir/blobfusetmp/2023-08-16_15-37-36.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (26) /cachedir/blobfusetmp/2023-08-16_15-37-23.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (27) /cachedir/blobfusetmp/2023-08-16_15-37-09.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (28) /cachedir/blobfusetmp/2023-08-16_15-36-56.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (29) /cachedir/blobfusetmp/2023-08-16_15-36-28.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (477)]:  ==> (30) /cachedir/blobfusetmp/2023-08-16_15-36-14.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (481)]: lruPolicy::printNodes : Ends
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (380)]: lruPolicy::deleteExpiredNodes : Starts
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (404)]: lruPolicy::DeleteExpiredNodes : Max deletion count hit
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_DEBUG [lru_policy.go (413)]: lruPolicy::deleteExpiredNodes : List generated 1 items
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_TRACE [lru_policy.go (321)]: lruPolicy::removeNode : /cachedir/blobfusetmp/2023-08-16_15-40-42.txt
Wed Aug 16 15:43:02 UTC 2023 : blobfuse2[246922] : LOG_TRACE [lru_policy.go (426)]: lruPolicy::deleteItem : Deleting /cachedir/blobfusetmp/2023-08-16_15-40-42.txt

@vibhansa-msft
Copy link
Member

Are you providing "max-eviction" anywhere in your config file or CLI parameters ?

@vibhansa-msft
Copy link
Member

In logs I also observe a lot of time config file is being modified while blobfuse is mounted. Is there any reason to do so, just suspecting some config issues here while you keep on updating the config file on the fly.

@vakaobr
Copy link
Author

vakaobr commented Aug 18, 2023

Are you providing "max-eviction" anywhere in your config file or CLI parameters ?

No, are relying on the default value (5000)

In logs I also observe a lot of time config file is being modified while blobfuse is mounted. Is there any reason to do so, just suspecting some config issues here while you keep on updating the config file on the fly.

Don't recall to did any changes on the config during while tests are running, but I can try to repeat it and upload new results later today

@vibhansa-msft
Copy link
Member

Yes we do provide it both as config and cli parameter. You shall not be setting it anywhere. These are tweaks for very specific usage and shall be used only after consulting with us.

@vakaobr
Copy link
Author

vakaobr commented Aug 18, 2023

Hello again,

So, I just re-run the tests.

Here our actual configuration:

allow-other: true #|false <allow other users to access the mounted directory - used for FUSE and File Cache>

# Logger configuration
logging:
  type: base # syslog|silent|base <type of logger to be used by the system. silent = no logger, base = file based logger. Default - syslog>
  level: log_debug # log_off|log_crit|log_err|log_warning|log_info|log_trace|log_debug <log level. Default - log_warning>
  file-path: /var/log/test-eviction-blobfuse2.log #<path where log files shall be stored. Default - '$HOME/.blobfuse2/blobfuse2.log'>
  max-file-size-mb: 1024 #<maximum allowed size for each log file (in MB). Default - 512 MB>
  track-time: true # <track time taken by important operations>
components:
  - libfuse
  - file_cache
  - attr_cache
  - azstorage
libfuse:
  disable-writeback-cache: true
  ignore-open-flags: true
  max-fuse-threads: 64
  fuse-trace: false
  attribute-expiration-sec: 120
  entry-expiration-sec: 120
  negative-entry-expiration-sec: 240
# Disk cache related configuration
file_cache:
  # Required
  path: /cachedir/blobfusetmp
  # Optional
  policy: lru # <eviction policy to be engaged for cache eviction. lru = least recently used file to be deleted, lfu = least frequently used file to be deleted. Default - lru>
  timeout-sec: 1209600 # <default cache eviction timeout (in sec). Default - 120 sec> - Try to evict a file only after 14 days
  high-threshold: 5 # <% disk space consumed which triggers eviction. This parameter overrides 'timeout-sec' parameter and cached files will be removed even if they have not expired. Default - 80> - Start eviction with files from here instead of DVR
  low-threshold: 2 # <% disk space consumed which triggers eviction to stop when previously triggered by the high-threshold. Default - 60>
attr_cache:
  timeout-sec: 7200
azstorage:
  account-name: <sanitized>
  account-key: <sanitized>
  endpoint: https://<sanitized>.blob.core.windows.net
  container: <sanitized>
  update-md5: false
  validate-md5: false
  virtual-directory: true
  block-size-mb: 32
  max-concurrency: 8
  disable-compression: true
  sdk-trace: false # <enable storage sdk logging>

The command we're using to mount the volume is:

blobfuse2 <targetdirectory> --config-file=<configfile> (so, no extra parameters from command line, relying 100% on the config file definition and/or default values)

The command we've used to run the tests is:

dd if=/dev/urandom of=<targetdirectory>/<date+time>.txt bs=2000M on a loop of 50 iterations. After the run, the disk usage on the directory was:

/dev/sdb1       503G   96G  382G  21% /cachedir

If the high-threshold was kicking in as expected, it should be cleaned up back to 2% of usage, as set on low-threshold

Log file is attached. Thanks for the support!

test-eviction-blobfuse2.log

@vakaobr
Copy link
Author

vakaobr commented Aug 24, 2023

@vibhansa-msft if you need more information, please let me know, thanks!

@vibhansa-msft
Copy link
Member

Thanks for sharing the logs. I was able to cross reference this with the code and 'du' is not working in your case because you have not set any cache limit for file-cache. When cache limit is not set blobfuse assumes entire disk is available for consumption and hence 'du' has no way to compute the % usage. 'du' is computing only the given directory size and if cache limit is set we compute the % usage and then match is against the threshold. If limit is not set then % computation is ignored and hence untill the timeout no eviction will hit.

@vibhansa-msft
Copy link
Member

Will try to clean up the code and fix this in next release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants