Question about slicing a dataframe #253

Oggy16 · 2023-08-25T11:41:12Z

Hi all,

Today I have a question about slicing a dataframe. I only need a small window of data, and processing the entire dataframe feels like wasted computation, so I want to reduce it to only the last range of datapoints (in this case, three times the roll_period I use for linear regression).

Looking at the docs, the "get_data_by_idx" function looks like what I want, so I've written the following code:

unsigned long df_size = df.get_column<float>("last").size(); 
cerr << "Size: " << df_size << endl;
if (df_size > (roll_period * 3)) {
    df = df.get_data_by_idx<std::string, float>(
         Index2D<ULDataFrame::IndexType>{df_size - (roll_period * 3), df_size}
    ); 
}

It compiles fine, but when I run it I get an odd error. Once the condition is met, I get DataFrame::get_column(): ERROR: Cannot find column 'timestamp'. The "timestamp" column is valid in the original dataframe, and according to the docs get_data_by_idx should preserve all columns in the dataframe, so what am I doing wrong?
The only difference I can see from the hello_world example is that I am overwriting the original df variable with the new dataframe (in the interest of memory efficiency), but the copied dataframe slice should have all the columns anyway.

In addition, and more as a general question, is this the right way to get a sliding window? I betray my Python roots, as I've used the above pattern with pandas a lot in the past. Perhaps there is a better or more elegant way to do it using this library and C++?

hosseinmoein · 2023-08-25T14:26:07Z

Maintainer

I have limited access to computers for the next few days, so I can't tell what's going on. Experiment with different options, like assigning to a new DataFrame…

3 replies

Oggy16 Aug 26, 2023
Author

Hello,

No problem, thanks for letting me know. I will carry on by trial and error and see if I can find a solution.

hosseinmoein Aug 27, 2023
Maintainer

I cannot reproduce this problem. I ran the following code:

    StlVecType<unsigned long>  idx =
        { 123450, 123451, 123452, 123453, 123454, 123455, 123456, 123457, 123458, 123459, 123460, 123461, 123462, 123466,
          123467, 123468, 123469, 123470, 123471, 123472, 123473 };
    StlVecType<double>         d1 =
        { 2.5, 2.45, -0.65, -0.1, -1.1, 1.87, 0.98, 0.34, 1.56, -0.34, 2.3, -0.34, -1.9, 0.387, 0.123, 1.06, -0.65, 2.03, 0.4, -1.0, 0.59 };
    StlVecType<double>         d2 =
        { 0.2, 0.58, -0.60, -0.08, 0.05, 0.87, 0.2, 0.4, 0.5, 0.06, 0.3, -0.34, -0.9, 0.8, -0.4, 0.86, 0.01, 1.02, -0.02, -1.5, 0.2 };
    StlVecType<int>            i1 = { 22, 23, 24, 25, 99 };
    StlVecType<std::string>    strvec =
        { "zz", "bb", "cc", "ww", "ee", "ff", "gg", "hh", "ii", "jj", "kk",
          "ll", "mm", "nn", "oo", "AAA 1", "AAA 2", "AAA 3", "AAA 4", "AAA 5", "AAA 6" };
    MyDataFrame                 df;

    df.load_data(std::move(idx),
                 std::make_pair("Double Col1", d1),
                 std::make_pair("Double Col2", d2),
                 std::make_pair("Int Col", i1),
                 std::make_pair("String Col", strvec));
    df.write<std::ostream, double, int, std::string>(std::cout, io_format::csv2);
    std::cout << "\n\n\n";

    df = df.get_data_by_idx<double, int, std::string>(Index2D<unsigned long> { 123461UL, 123473UL });
    df.write<std::ostream, double, int, std::string>(std::cout, io_format::csv2);

It runs and prints as expected.

Oggy16 Aug 27, 2023
Author

During my tests I was able to reproduce the issue both with a separate dataframe and when overwriting the existing one. While reading further through the docs, I found the "df.get_data_by_loc" function, which accepts Python-style negative indexes.

As that pattern was familiar to me, I rewrote the section as follows:

unsigned long df_size = df.get_column<float>("last").size();
if (df_size > (roll_period + 1)) {
    df = df.get_data_by_loc<std::string, float>(Index2D<long>{-(roll_period + 1), -1});
}

Not only does this look more familiar to me, my issue has also gone away. The sliding window works well and, as I had expected, performance has improved a lot after this change.

I am not sure what my original issue was, but it was most likely me doing something incorrect due to unfamiliarity with the language. As such, I am happy to mark this as answered if you concur. Many thanks again for your assistance :-)

Answer selected by Oggy16
