Dataframe with missing columns returned when load() returns empty #302

Closed
ttngu207 opened this issue Jan 9, 2024 · 4 comments
Labels: bug (Something isn't working), critical

Comments

@ttngu207
Contributor

ttngu207 commented Jan 9, 2024

When loading data with the aeon API (reader and load), if the returned pd.DataFrame is empty because no data was found in the specified time period, the empty DataFrame is missing columns.

To reproduce the bug

import pathlib
import pandas as pd
from aeon.io import api as io_api  # provides load()
from aeon.io import reader as io_reader  # provides the Csv reader

raw_data_dir = pathlib.Path("/ceph/aeon/aeon/data/raw/AEON4/social0.1")
chunk_start = "2023-11-27 11:47:59"
chunk_end = "2023-11-27 11:55:51"

stream = io_reader.Csv(pattern="Patch3_*", columns=["threshold", "offset", "rate"], extension="csv")
stream_data = io_api.load(
    root=raw_data_dir.as_posix(),
    reader=stream,
    start=pd.Timestamp(chunk_start),
    end=pd.Timestamp(chunk_end),
)

The specified time period has no data, so an empty DataFrame is the expected result. However, this empty df should have the same columns as specified in the reader (['threshold', 'offset', 'rate']).
Instead, the returned empty df only has the columns ['offset', 'rate']:

Empty DataFrame
Columns: [offset, rate]
Index: []

@ttngu207 added the bug (Something isn't working) label Jan 9, 2024
@jkbhagatio changed the title from "Dataframe with missing columns returned when api.load() returns empty" to "Dataframe with missing columns returned when load() returns empty" Feb 6, 2024
@jkbhagatio added this to the Social0.2 Ongoing milestone Feb 7, 2024
@ttngu207
Contributor Author

ttngu207 commented Feb 15, 2024

@jkbhagatio The issue seems to be in the Csv reader, specifically when the csv file exists but the file itself is empty (not sure why; corrupted?).

return pd.read_csv(file, header=0, names=self.columns, dtype=self.dtype, index_col=0)

In that case, index_col=0 uses the first name in self.columns as the index, so one column is dropped from the result.
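
A minimal sketch of that mechanism, outside of aeon. It assumes a header-only CSV held in memory as a stand-in for the data-less file (the file in this issue is reported as fully empty, which may behave differently depending on the pandas version), and the column names are taken from the reproduction above.

```python
import io

import pandas as pd

columns = ["threshold", "offset", "rate"]
# Assumption: a header-only CSV stands in for the file with no data rows.
header_only = "threshold,offset,rate\n"

# index_col=0 turns the first name into the index, leaving two columns.
with_index = pd.read_csv(io.StringIO(header_only), header=0, names=columns, index_col=0)

# index_col=None keeps all three names as regular columns.
without_index = pd.read_csv(io.StringIO(header_only), header=0, names=columns, index_col=None)

print(list(with_index.columns))     # ['offset', 'rate']
print(list(without_index.columns))  # ['threshold', 'offset', 'rate']
```

Both frames are empty; only the number of surviving columns differs.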

A fix could be

return pd.read_csv(
    file,
    header=0,
    names=self.columns,
    dtype=self.dtype,
    index_col=0 if file.stat().st_size else None,
)
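
The same size check can be tried in isolation. The sketch below is not the aeon reader: `read_csv_keep_columns` is a hypothetical standalone helper, and for a zero-byte file it builds the empty frame directly instead of calling `pd.read_csv`, so the result does not depend on how pandas treats an empty file.

```python
import pathlib
import tempfile

import pandas as pd


def read_csv_keep_columns(file, columns, dtype=None):
    """Hypothetical helper: read a CSV, keeping every column if the file is empty."""
    file = pathlib.Path(file)
    if file.stat().st_size == 0:
        # Zero-byte file: return an empty frame carrying all requested columns.
        return pd.DataFrame(columns=columns)
    # Non-empty file: same call as the reader quoted in this thread.
    return pd.read_csv(file, header=0, names=columns, dtype=dtype, index_col=0)


# Usage: an empty temporary file stands in for the empty csv in the dataset.
with tempfile.TemporaryDirectory() as tmp:
    empty_file = pathlib.Path(tmp) / "Patch3_empty.csv"
    empty_file.touch()
    empty_df = read_csv_keep_columns(empty_file, ["threshold", "offset", "rate"])

print(list(empty_df.columns))  # ['threshold', 'offset', 'rate']
```

Building the frame explicitly is a design choice made for this sketch; the proposed fix above keeps the single `pd.read_csv` call and only switches `index_col`.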

@jkbhagatio
Member

Also worth discussing with regard to this issue: should we find a way, on the acquisition side, to ensure empty files don't make their way into the dataset? @glopesdev

@jkbhagatio
Member

We should look into https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html to see whether setting the pandas arguments appropriately can handle this case of reading from an empty file.

@jkbhagatio
Member

We're happy with @ttngu207's solution in #336.
