Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] whisper-fetch.py --drop incorrect timestamps #305

Closed
cdeil opened this issue Nov 12, 2020 · 10 comments
Closed

[BUG] whisper-fetch.py --drop incorrect timestamps #305

cdeil opened this issue Nov 12, 2020 · 10 comments

Comments

@cdeil
Copy link
Contributor

cdeil commented Nov 12, 2020

I think there's a bug here:

if options.drop:
fcn = _DROP_FUNCTIONS.get(options.drop)
values = [x for x in values if fcn(x)]

I was using whisper-fetch.py --drop nulls and got incorrect timestamps.

When dropping values, the timestamps also need to be adjusted to match, no?

@cdeil cdeil added the bug label Nov 12, 2020
@deniszh
Copy link
Member

deniszh commented Nov 12, 2020

Hi @cdeil ,

As far as I remember it was not initial intention of author in #57
Other users also noticed that - e.g. #250

@deniszh
Copy link
Member

deniszh commented Nov 12, 2020

TBH I think proper implementation should not drop point but replace it with 0 or previous value instead. But then we should not call this function "drop", isn't it?

@piotr1212
Copy link
Member

I don't see where it says this is intended. Returning the wrong timestamp does not make any sense to me, can't imagine why someone would want this.

@cdeil
Copy link
Contributor Author

cdeil commented Nov 12, 2020

This cost me a day at work, because I ran whisper-fetch.py --drop=nulls and got timestamps for 2010 - 2012 and requested another data extract from a legacy system and was debugging why the extract doesn't work properly there. But really the data was for 2018-2020 as it should be already in the Whisper file, I just created wrong timestamps due to this bug.

For now I'm changing my code to just call whisper.fetch() in my pipeline instead of whisper-fetch.py, put the timestamps and values into a pandas.Series and call dropna to drop irrelevant (t, val).

@cdeil
Copy link
Contributor Author

cdeil commented Nov 12, 2020

Possible fix (completely untested): #306

@cdeil
Copy link
Contributor Author

cdeil commented Nov 12, 2020

Another issue that I ran into is that here you're using local time of the machine that I'm running the data processing on, which gave incorrect results in my case:

if options.time_format:
timestr = time.strftime(options.time_format, time.localtime(t))
else:
timestr = time.ctime(t)

I'm now using this, I think that should be correct:

def read_whisper(path):
    (fromTime, untilTime, step), val = whisper.fetch(path, fromTime=0, archiveToSelect=None)
    fromTimeStamp = pd.Timestamp(fromTime, unit="s", tz="Europe/Berlin")
    index = pd.date_range(
        start=fromTimeStamp,
        freq="H",
        periods=len(val),
    )
    data = {"val": val}
    return pd.DataFrame(data, index=index)

@cdeil
Copy link
Contributor Author

cdeil commented Nov 13, 2020

I was getting incorrect data with whisper.fetch for archiveToSelect=60, tried varies fromTime values.

I see correct data with whisper-dump.py, but I don't want to write temp CSV files.

Wrote this, which seems to work fine:

def read_whisper_archive(path: str, archive_id: int) -> pd.DataFrame:
    """Whisper data read direct implementation with Numpy and Pandas"""
    infos = whisper.info(path)
    if archive_id < 0 or archive_id >= len(infos["archives"]):
        raise ValueError(f"Invalid archive_id = {archive_id}")

    dtype = np.dtype([
        ("time", ">u4"),
        ("val", ">f8")
    ])

    offset = infos["archives"][archive_id]["offset"]
    data = np.fromfile(path, dtype=dtype, offset=offset)
    data = data[np.nonzero(data["time"])]
    # The astype is needed to avoid this error later on
    # ValueError: Big-endian buffer not supported on little-endian compiler
    df = pd.DataFrame(
        data={"val": data["val"].astype(float)},
        index=pd.to_datetime(data["time"], unit="s")
    )
    df = df.sort_index()
    return df

This should be much faster and memory-efficient than the current whisper-dump.py, which used Python types and lists, and also more convenient for use cases where people want to do data analysis directly, or dump to binary formats like Parquet / HDF5 / ....

Is the processing correct, i.e. is it guaranteed that non-filled values have time=0? and is the sorting by time at the end needed, or is this already the case in the file? the description at https://graphite.readthedocs.io/en/stable/whisper.html unfortunately doesn't explain where values are filled within the archive.

Do you think it could make sense to add a function like this to this repo? The Numpy & Pandas import could be delayed to the function, i.e. it would be an optional dependency.
Alternatively I could just make a file whisper_pandas.py in my private Github account share some functions there.

@deniszh
Copy link
Member

deniszh commented Nov 13, 2020

Hi @cdeil,
I agreed that Whisper has many use cases, but IMO using it for analytical purposes is not widely adopted.
That's why I do not think that including pandas or numpy as part of library is a good idea.
But you can add your scripts to contrib/ directory, which exist for exact that reason.

@stale
Copy link

stale bot commented Jan 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 12, 2021
@stale stale bot closed this as completed Jan 19, 2021
@cdeil
Copy link
Contributor Author

cdeil commented Jul 19, 2021

Just in case someone finds this old thread and is looking for a Whisper file Pandas reader, check this out:
https://github.com/cdeil/whisper_pandas/blob/main/whisper_pandas.py

Of course, any feedback or contribution would be welcome.

Specifically I'm not sure if the data = data[data["time"] != 0] line and sort_index is valid.

It seems to work for my files, but the WhisperDB docs at https://graphite.readthedocs.io/en/latest/whisper.html unfortunately don't say how the file is initialised or where the points are inserted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants