
Better caching in fetch #209

Merged
merged 2 commits into from
Dec 18, 2020


taldcroft
Member

@taldcroft taldcroft commented Dec 17, 2020

Description

This improves the caching in fetch in two ways:

  1. Add a timeout to the cache that maps a given (content, tstart, tstop) input to its row start/stop interval, so cached intervals expire instead of living for the whole session.
  2. Use a more robust cache key for caching the times arrays for a query. The previous key could lead to a failure if the archive was updated between successive queries.
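
The first item can be sketched with the standard library alone. This is a minimal illustration, not the implementation in fetch.py or in ska_helpers: the names `lru_cache_with_timeout`, `clock` and `lookup_rows` are hypothetical, and clearing the whole cache on expiry is an assumption chosen because it resets the hit/miss counters in the same way as the functional test output in this PR.

```python
import functools
import time


def lru_cache_with_timeout(maxsize=128, timeout=600.0, clock=time.monotonic):
    """LRU cache whose entire contents are dropped once ``timeout`` seconds pass.

    Hypothetical helper for illustration; ``clock`` is injectable for testing.
    """
    def decorator(func):
        cached = functools.lru_cache(maxsize=maxsize)(func)
        expires = clock() + timeout

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal expires
            if clock() >= expires:
                # Timeout reached: forget everything, which also resets
                # the hit/miss statistics reported by cache_info().
                cached.cache_clear()
                expires = clock() + timeout
            return cached(*args, **kwargs)

        # Expose the usual lru_cache introspection helpers.
        wrapper.cache_info = cached.cache_info
        wrapper.cache_clear = cached.cache_clear
        return wrapper

    return decorator


@lru_cache_with_timeout(maxsize=1000, timeout=10.0)
def lookup_rows(content, tstart, tstop):
    # Stand-in for an expensive lookup of the row range for a time interval.
    return (content, tstart, tstop)
```

Repeated calls with the same arguments are served from the cache until the timeout elapses; the first call after that recomputes the value, so a long-running session picks up changes at most `timeout` seconds late.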

The inspiration is an issue with the cheta archive where a persistent session was unable to see new updates in the data.
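
The description does not spell out the old and new cache keys, so the following is only an illustration of the general principle behind the second item, under the assumption that a key built from the rows actually read is the safer choice. The toy archive and every name here (`row_slice`, `fetch_with_cache`, `demo`) are hypothetical.

```python
def row_slice(archive, tstart, tstop):
    """Rows of ``archive`` whose time falls in [tstart, tstop)."""
    rows = [i for i, t in enumerate(archive["times"]) if tstart <= t < tstop]
    return slice(rows[0], rows[-1] + 1) if rows else slice(0, 0)


def fetch_with_cache(archive, cache, tstart, tstop, key_on_rows):
    """Return (times, values), reusing cached times when the cache key matches."""
    rows = row_slice(archive, tstart, tstop)
    # Two candidate keys: the rows actually read, or just the query bounds.
    key = (rows.start, rows.stop) if key_on_rows else (tstart, tstop)
    if cache.get("key") != key:
        cache["key"] = key
        cache["times"] = archive["times"][rows]
    # Values are always read fresh, so stale cached times can disagree with them.
    return cache["times"], archive["values"][rows]


def demo(key_on_rows):
    archive = {"times": [0, 1, 2, 3], "values": [10, 11, 12, 13]}
    cache = {}
    fetch_with_cache(archive, cache, 0, 100, key_on_rows)
    # Simulate an archive update between two identical queries.
    archive["times"] += [4, 5]
    archive["values"] += [14, 15]
    times, values = fetch_with_cache(archive, cache, 0, 100, key_on_rows)
    return len(times), len(values)
```

In this toy model, keying on the query bounds leaves the cached times shorter than the freshly read values after the update, while keying on the row slice forces the times to be re-read so the two arrays stay the same length.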

This requires sot/ska_helpers#20.

Testing

  • Passes unit tests on MacOS
  • Functional testing

Functional testing: 1

The fetch.py code was patched to have a 10-second timeout instead of the actual 10-minute timeout. Then this code was run:

from cheta import fetch
import time

t0 = time.time()

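def fetch_msid(msid, wait):
    """Fetch one day of ``msid``, print the interval cache stats, then sleep ``wait`` seconds."""
    dt = time.time() - t0  # Seconds elapsed since the script started
    print(f'dt={dt:.0f} {msid=} {wait=}')  # The ``=`` specifier needs Python 3.8+
    fetch.Msid(msid, '2020:001', '2020:002')
    print(fetch.get_interval.cache_info())
    print()
    time.sleep(wait)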

fetch_msid('tephin', 1)
fetch_msid('tephin', 1)
fetch_msid('tephin', 1)
fetch_msid('aopcadmd', 1)
fetch_msid('tephin', 8)
fetch_msid('tephin', 1)
fetch_msid('tephin', 1)
fetch_msid('tephin', 1)
fetch_msid('aopcadmd', 1)

Output

In [10]: run go                                                                                                                     
dt=0 msid='tephin' wait=1
CacheInfo(hits=0, misses=1, maxsize=1000, currsize=1)

dt=1 msid='tephin' wait=1
CacheInfo(hits=1, misses=1, maxsize=1000, currsize=1)

dt=2 msid='tephin' wait=1
CacheInfo(hits=2, misses=1, maxsize=1000, currsize=1)

dt=3 msid='aopcadmd' wait=1
CacheInfo(hits=2, misses=2, maxsize=1000, currsize=2)

dt=4 msid='tephin' wait=8
CacheInfo(hits=3, misses=2, maxsize=1000, currsize=2)

dt=12 msid='tephin' wait=1
CacheInfo(hits=0, misses=1, maxsize=1000, currsize=1)

dt=13 msid='tephin' wait=1
CacheInfo(hits=1, misses=1, maxsize=1000, currsize=1)

dt=14 msid='tephin' wait=1
CacheInfo(hits=2, misses=1, maxsize=1000, currsize=1)

dt=15 msid='aopcadmd' wait=1
CacheInfo(hits=2, misses=2, maxsize=1000, currsize=2)

Functional testing: 2

The fetch.py code was patched to have a 2-minute timeout instead of the actual 10-minute timeout. A local copy of the cheta archive was created that is a week out of date. Then this code was run:

(ska3-shiny) ➜  eng_archive git:(better-caching) ✗ ipython
Python 3.8.3 (default, Jul  2 2020, 11:26:31) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from cheta import fetch                                                                                                     
fetch: using ENG_ARCHIVE=/Users/aldcroft/git/eng_archive for archive path

In [2]: fetch.add_logging_handler()                                                                                                 

In [3]: dat = fetch.Msid('AACCCDPT', '2020:340', '2021:001')                                                                        
_get_data: Getting data for AACCCDPT between 2020:340:00:00:00.000 to 2021:001:00:00:00.000
_get_msid_data_from_cxc: Reading /Users/aldcroft/git/eng_archive/data/pcad5eng/TIME.h5
_get_msid_data_from_cxc: Reading /Users/aldcroft/git/eng_archive/data/pcad5eng/AACCCDPT.h5
_get_msid_data_from_cxc: Slicing AACCCDPT arrays [165:18726]

In [4]: from cxotime import CxoTime                                                                                                 

In [5]: CxoTime(dat.times[-1]).date                                                                                                 
Out[5]: '2020:345:03:58:32.755'

#########################################################
### In another window, update the local archive to the present
### using cheta_sync.
### Then fetch repeatedly at intervals.
#########################################################

In [6]: dat = fetch.Msid('AACCCDPT', '2020:340', '2021:001')                                                                        
_get_data: Getting data for AACCCDPT between 2020:340:00:00:00.000 to 2021:001:00:00:00.000
_get_msid_data_from_cxc: Using times_cache for pcad5eng 723513669.184 to 725846469.184
_get_msid_data_from_cxc: Reading /Users/aldcroft/git/eng_archive/data/pcad5eng/AACCCDPT.h5
_get_msid_data_from_cxc: Slicing AACCCDPT arrays [165:18726]

In [7]: CxoTime(dat.times[-1]).date                                                                                                 
Out[7]: '2020:345:03:58:32.755'

In [8]: dat = fetch.Msid('AACCCDPT', '2020:340', '2021:001')                                                                        
_get_data: Getting data for AACCCDPT between 2020:340:00:00:00.000 to 2021:001:00:00:00.000
_get_msid_data_from_cxc: Using times_cache for pcad5eng 723513669.184 to 725846469.184
_get_msid_data_from_cxc: Reading /Users/aldcroft/git/eng_archive/data/pcad5eng/AACCCDPT.h5
_get_msid_data_from_cxc: Slicing AACCCDPT arrays [165:18726]

In [9]: CxoTime(dat.times[-1]).date                                                                                                 
Out[9]: '2020:345:03:58:32.755'

In [10]: dat = fetch.Msid('AACCCDPT', '2020:340', '2021:001')                                                                       
_get_data: Getting data for AACCCDPT between 2020:340:00:00:00.000 to 2021:001:00:00:00.000
_get_msid_data_from_cxc: Using times_cache for pcad5eng 723513669.184 to 725846469.184
_get_msid_data_from_cxc: Reading /Users/aldcroft/git/eng_archive/data/pcad5eng/AACCCDPT.h5
_get_msid_data_from_cxc: Slicing AACCCDPT arrays [165:18726]

In [11]: CxoTime(dat.times[-1]).date                                                                                                
Out[11]: '2020:345:03:58:32.755'

In [12]: dat = fetch.Msid('AACCCDPT', '2020:340', '2021:001')                                                                       
_get_data: Getting data for AACCCDPT between 2020:340:00:00:00.000 to 2021:001:00:00:00.000
_get_msid_data_from_cxc: Using times_cache for pcad5eng 723513669.184 to 725846469.184
_get_msid_data_from_cxc: Reading /Users/aldcroft/git/eng_archive/data/pcad5eng/AACCCDPT.h5
_get_msid_data_from_cxc: Slicing AACCCDPT arrays [165:18726]

In [13]: dat = fetch.Msid('AACCCDPT', '2020:340', '2021:001')                                                                       
_get_data: Getting data for AACCCDPT between 2020:340:00:00:00.000 to 2021:001:00:00:00.000
_get_msid_data_from_cxc: Using times_cache for pcad5eng 723513669.184 to 725846469.184
_get_msid_data_from_cxc: Reading /Users/aldcroft/git/eng_archive/data/pcad5eng/AACCCDPT.h5
_get_msid_data_from_cxc: Slicing AACCCDPT arrays [165:18726]

In [14]: dat = fetch.Msid('AACCCDPT', '2020:340', '2021:001')                                                                       
_get_data: Getting data for AACCCDPT between 2020:340:00:00:00.000 to 2021:001:00:00:00.000
_get_msid_data_from_cxc: Using times_cache for pcad5eng 723513669.184 to 725846469.184
_get_msid_data_from_cxc: Reading /Users/aldcroft/git/eng_archive/data/pcad5eng/AACCCDPT.h5
_get_msid_data_from_cxc: Slicing AACCCDPT arrays [165:18726]

In [15]: dat = fetch.Msid('AACCCDPT', '2020:340', '2021:001')                                                                       
_get_data: Getting data for AACCCDPT between 2020:340:00:00:00.000 to 2021:001:00:00:00.000
_get_msid_data_from_cxc: Using times_cache for pcad5eng 723513669.184 to 725846469.184
_get_msid_data_from_cxc: Reading /Users/aldcroft/git/eng_archive/data/pcad5eng/AACCCDPT.h5
_get_msid_data_from_cxc: Slicing AACCCDPT arrays [165:18726]

In [16]: dat = fetch.Msid('AACCCDPT', '2020:340', '2021:001')                                                                       
_get_data: Getting data for AACCCDPT between 2020:340:00:00:00.000 to 2021:001:00:00:00.000
_get_msid_data_from_cxc: Using times_cache for pcad5eng 723513669.184 to 725846469.184
_get_msid_data_from_cxc: Reading /Users/aldcroft/git/eng_archive/data/pcad5eng/AACCCDPT.h5
_get_msid_data_from_cxc: Slicing AACCCDPT arrays [165:18726]

##############################################
### More than 2 minutes passed, cache gets cleared
##############################################

In [17]: dat = fetch.Msid('AACCCDPT', '2020:340', '2021:001')                                                                       
_get_data: Getting data for AACCCDPT between 2020:340:00:00:00.000 to 2021:001:00:00:00.000
_get_msid_data_from_cxc: Reading /Users/aldcroft/git/eng_archive/data/pcad5eng/TIME.h5       <======
_get_msid_data_from_cxc: Reading /Users/aldcroft/git/eng_archive/data/pcad5eng/AACCCDPT.h5
_get_msid_data_from_cxc: Slicing AACCCDPT arrays [165:41916]

In [18]: dat = fetch.Msid('AACCCDPT', '2020:340', '2021:001')                                                                       
_get_data: Getting data for AACCCDPT between 2020:340:00:00:00.000 to 2021:001:00:00:00.000
_get_msid_data_from_cxc: Using times_cache for pcad5eng 723513669.184 to 725846469.184
_get_msid_data_from_cxc: Reading /Users/aldcroft/git/eng_archive/data/pcad5eng/AACCCDPT.h5
_get_msid_data_from_cxc: Slicing AACCCDPT arrays [165:41916]

In [19]: CxoTime(dat.times[-1]).date                                                                                                
Out[19]: '2020:352:02:08:06.392'
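
The two numbers in the `Using times_cache` log lines appear to be the query bounds expressed in seconds since 1998-01-01 (TT), the Chandra time convention. A quick standard-library check of that arithmetic is sketched below; the helper names are hypothetical, and the fixed offset of 69.184 s (32.184 s plus 37 leap seconds) is only valid for dates from 2017 onward.

```python
from datetime import datetime, timedelta

CXC_EPOCH = datetime(1998, 1, 1)
# TT minus UTC for dates since 2017: 32.184 s plus 37 leap seconds.
TT_MINUS_UTC = 32.184 + 37


def cxcsec_to_utc(cxcsec):
    """Approximate UTC datetime for seconds since 1998-01-01 TT (2017 onward only)."""
    return CXC_EPOCH + timedelta(seconds=cxcsec - TT_MINUS_UTC)


def year_doy(dt):
    """Format a datetime in the year:day-of-year style used in the session above."""
    return dt.strftime("%Y:%j:%H:%M:%S")


print(year_doy(cxcsec_to_utc(723513669.184)))
print(year_doy(cxcsec_to_utc(725846469.184)))
```

Both values map back to the start and stop dates passed to `fetch.Msid` in the session, 2020:340 and 2021:001.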

@taldcroft taldcroft requested a review from mbaski December 17, 2020 15:03
@mbaski
Contributor

mbaski commented Dec 17, 2020

Looks good, but two questions:

  1. Why was 10 minutes chosen as the timeout?
  2. Is the cache used when getting data from MAUDE?

@taldcroft
Member Author

@mbaski

The 10 minutes was pretty arbitrary. The real use case in my mind is querying a bunch of MSIDs in the same content type in close succession, which would usually finish within 10 minutes. This is weighed against the update process that happens daily. If you have thoughts about a different value I'm open to that.

About MAUDE data, there is no caching there (at least within fetch).

@mbaski
Contributor

mbaski commented Dec 17, 2020

Sounds good Tom. Yeah, different use cases could need different timeouts - your use case sounds like a good driver to set it to 10 minutes.

@taldcroft taldcroft merged commit 79421b7 into master Dec 18, 2020
@taldcroft taldcroft deleted the better-caching branch December 18, 2020 16:51
@javierggt javierggt mentioned this pull request Mar 2, 2021