[BUG] whisper-fetch.py --drop incorrect timestamps #305

cdeil · 2020-11-12T10:00:07Z

I think there's a bug here:

Lines 67 to 69 in 8d21c56

    
    if options.drop:
        fcn = _DROP_FUNCTIONS.get(options.drop)
        values = [x for x in values if fcn(x)]

I was using whisper-fetch.py --drop nulls and got incorrect timestamps.

When dropping values, the timestamps also need to be adjusted to match, no?
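
A minimal sketch of the mismatch, not the actual whisper-fetch.py code: the function names below are hypothetical, and the only assumption is that a timestamp is derived from the start time, the step, and a value's position in the list. Filtering the values first shifts every later point to an earlier timestamp; pairing each value with its timestamp before filtering keeps them aligned.

```python
def timestamps_after_drop_buggy(from_time, step, values):
    """Filter first, then number the survivors: later timestamps drift."""
    kept = [v for v in values if v is not None]
    return [(from_time + i * step, v) for i, v in enumerate(kept)]


def timestamps_after_drop_fixed(from_time, step, values):
    """Attach timestamps first, then filter the (timestamp, value) pairs."""
    pairs = [(from_time + i * step, v) for i, v in enumerate(values)]
    return [(t, v) for t, v in pairs if v is not None]


values = [1.0, None, None, 4.0]
buggy = timestamps_after_drop_buggy(1000, 60, values)
fixed = timestamps_after_drop_fixed(1000, 60, values)
```

With these inputs the value 4.0 ends up at timestamp 1060 in the first version and at 1180 in the second.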


deniszh · 2020-11-12T10:50:55Z

Hi @cdeil ,

As far as I remember, that was not the original intention of the author in #57.
Other users have also noticed this, e.g. #250.

deniszh · 2020-11-12T10:53:13Z

TBH I think a proper implementation should not drop the point but replace it with 0 or the previous value instead. But then we should not call this function "drop", should we?
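
A sketch of what such a replacement could look like, assuming nulls arrive as `None`. The function name, the `mode` values and the `default` parameter are hypothetical and not part of whisper-fetch.py. Because the list keeps its length, the timestamps stay aligned.

```python
def fill_nulls(values, mode="previous", default=0.0):
    """Replace None entries instead of dropping them.

    mode="zero":     every None becomes `default`.
    mode="previous": every None becomes the last non-null value seen,
                     or `default` if there is none yet.
    """
    if mode not in ("zero", "previous"):
        raise ValueError(f"Unknown mode: {mode}")
    filled = []
    last = default
    for v in values:
        if v is None:
            filled.append(default if mode == "zero" else last)
        else:
            filled.append(v)
            last = v
    return filled
```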

piotr1212 · 2020-11-12T11:08:52Z

I don't see where it says this is intended. Returning the wrong timestamp does not make any sense to me, can't imagine why someone would want this.

cdeil · 2020-11-12T12:03:46Z

This cost me a day at work. I ran whisper-fetch.py --drop=nulls and got timestamps for 2010 - 2012, so I requested another data extract from a legacy system and spent time debugging why the extract didn't work properly there. In fact the data for 2018-2020 was already in the Whisper file, as it should be; I had just produced wrong timestamps because of this bug.

For now I'm changing my code to just call whisper.fetch() in my pipeline instead of whisper-fetch.py, put the timestamps and values into a pandas.Series and call dropna to drop irrelevant (t, val).

cdeil · 2020-11-12T12:09:02Z

Possible fix (completely untested): #306

cdeil · 2020-11-12T13:13:27Z

Another issue that I ran into is that here you're using the local time of the machine that I'm running the data processing on, which gave incorrect results in my case:

whisper/bin/whisper-fetch.py

Lines 86 to 89 in 8d21c56

    
    if options.time_format:
        timestr = time.strftime(options.time_format, time.localtime(t))
    else:
        timestr = time.ctime(t)

I'm now using this, I think that should be correct:

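import pandas as pd
import whisper

def read_whisper(path):
    # whisper.fetch returns ((fromTime, untilTime, step), values); times are epoch seconds
    (fromTime, untilTime, step), val = whisper.fetch(path, fromTime=0, archiveToSelect=None)
    # Epoch seconds are interpreted as UTC and then converted to the target time zone
    fromTimeStamp = pd.Timestamp(fromTime, unit="s", tz="Europe/Berlin")
    index = pd.date_range(
        start=fromTimeStamp,
        freq=pd.Timedelta(seconds=step),  # use the archive's step instead of hard-coding "H"
        periods=len(val),
    )
    data = {"val": val}
    return pd.DataFrame(data, index=index)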
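
For the formatting step in whisper-fetch.py quoted above, one way to avoid depending on the machine's local time zone is to format explicitly in UTC. This is a standard-library sketch, not a proposed patch; the function name and the default format string are hypothetical.

```python
from datetime import datetime, timezone


def format_timestamp_utc(t, time_format="%Y-%m-%d %H:%M:%S"):
    """Format epoch seconds in UTC, independent of the machine's time zone."""
    return datetime.fromtimestamp(t, tz=timezone.utc).strftime(time_format)


print(format_timestamp_utc(0))  # prints "1970-01-01 00:00:00"
```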

cdeil · 2020-11-13T10:31:42Z

I was getting incorrect data with whisper.fetch for archiveToSelect=60; I tried various fromTime values.

I see correct data with whisper-dump.py, but I don't want to write temp CSV files.

Wrote this, which seems to work fine:

import numpy as np
import pandas as pd
import whisper

def read_whisper_archive(path: str, archive_id: int) -> pd.DataFrame:
    """Whisper data read direct implementation with Numpy and Pandas"""
    infos = whisper.info(path)
    if archive_id < 0 or archive_id >= len(infos["archives"]):
        raise ValueError(f"Invalid archive_id = {archive_id}")

    # Each point is a big-endian (uint32 timestamp, float64 value) pair
    dtype = np.dtype([
        ("time", ">u4"),
        ("val", ">f8")
    ])

    archive = infos["archives"][archive_id]
    # Limit the read to this archive's points; without count, np.fromfile
    # reads on to the end of the file and picks up the archives that follow
    data = np.fromfile(path, dtype=dtype, count=archive["points"], offset=archive["offset"])
    # Drop slots whose timestamp is 0
    data = data[np.nonzero(data["time"])]
    # The astype is needed to avoid this error later on
    # ValueError: Big-endian buffer not supported on little-endian compiler
    df = pd.DataFrame(
        data={"val": data["val"].astype(float)},
        index=pd.to_datetime(data["time"], unit="s")
    )
    df = df.sort_index()
    return df

This should be much faster and more memory-efficient than the current whisper-dump.py, which uses Python types and lists, and also more convenient for use cases where people want to do data analysis directly, or dump to binary formats like Parquet / HDF5 / ....

Is the processing correct, i.e. is it guaranteed that non-filled values have time=0? And is the sorting by time at the end needed, or is the data already sorted in the file? The description at https://graphite.readthedocs.io/en/stable/whisper.html unfortunately doesn't explain where values are filled within the archive.
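
On the layout question, here is a sketch under an assumption that appears to match the update logic in whisper.py but should be verified against the source: each archive is treated as a circular buffer, the timestamp stored in the first slot acts as the base interval, and a point's slot is its distance from that base in steps, modulo the number of points. The function and variable names are hypothetical.

```python
def slot_index(timestamp, base_interval, seconds_per_point, points):
    """Slot of a timestamp in a circular archive (assumed layout, see above)."""
    interval = timestamp - (timestamp % seconds_per_point)  # align to the step
    point_distance = (interval - base_interval) // seconds_per_point
    return point_distance % points  # wrap around at the end of the archive


# Hypothetical archive: 4 slots, 60 s per point, first slot holds t=1200
base, step, points = 1200, 60, 4
slots = [slot_index(t, base, step, points) for t in (1200, 1260, 1380, 1440, 1500)]
```

If that assumption holds, newer points can wrap around and land before older ones in the file, which would make the final sort necessary, and slots that were never written would keep whatever the file was initialised with.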

Do you think it could make sense to add a function like this to this repo? The Numpy & Pandas import could be delayed to the function, i.e. it would be an optional dependency.
Alternatively I could just make a file whisper_pandas.py in my private GitHub account and share some functions there.
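
A sketch of the delayed-import idea, so that importing the main module never requires the optional packages. The helper name is hypothetical; a reader function would call it for "numpy" and "pandas" inside its body instead of importing them at module level.

```python
import importlib


def optional_import(name):
    """Import a module on first use and fail with a clear message if missing."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"{name} is required for this optional feature; please install it"
        ) from exc
```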

deniszh · 2020-11-13T10:44:44Z

Hi @cdeil,
I agree that Whisper has many use cases, but IMO using it for analytical purposes is not widely adopted.
That's why I don't think that including pandas or numpy as part of the library is a good idea.
But you can add your scripts to the contrib/ directory, which exists for exactly that reason.

stale · 2021-01-12T11:55:11Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

cdeil · 2021-07-19T21:36:17Z

Just in case someone finds this old thread and is looking for a Whisper file Pandas reader, check this out:
https://github.com/cdeil/whisper_pandas/blob/main/whisper_pandas.py

Of course, any feedback or contribution would be welcome.

Specifically, I'm not sure whether the data = data[data["time"] != 0] line and the sort_index call are valid.

It seems to work for my files, but the WhisperDB docs at https://graphite.readthedocs.io/en/latest/whisper.html unfortunately don't say how the file is initialised or where the points are inserted.

cdeil added the bug label Nov 12, 2020

cdeil mentioned this issue Nov 12, 2020

Fix whisper-fetch.py --drop timestamps #306

Open

stale bot added the stale label Jan 12, 2021

stale bot closed this as completed Jan 19, 2021
