Consider using object instead of string in index #418

Open
hagenw opened this issue Feb 14, 2024 · 0 comments
Labels
enhancement New feature or request question Further information is requested

hagenw commented Feb 14, 2024

It turns out that even in pandas 2.2.0 the new string dtype is still slower than object dtype for some tasks, and unfortunately one of them is indexing:

import pandas as pd

points = 1000000
data = [f"data-{n}" for n in range(points)]
# Compare label lookup speed for the three string representations
for dtype in ["object", "string", "string[pyarrow]"]:
    index = pd.Index([f"index-{n}" for n in range(points)], dtype=dtype)
    df = pd.DataFrame(data, index=index, dtype=dtype)
    print(dtype)
    # %timeit is an IPython magic, so run this in IPython or Jupyter
    %timeit df.loc['index-2000']

which returns

object
9.78 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
string
15.7 µs ± 36.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
string[pyarrow]
17.6 µs ± 66.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
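
For reference, the same kind of measurement can be run outside IPython with the standard `timeit` module. This is only a minimal sketch, not the exact benchmark above: the helper name `time_loc_lookup` and its default sizes are made up for illustration, and the numbers it prints will differ from the ones reported here.

```python
import timeit

import pandas as pd


def time_loc_lookup(dtype, points=100_000, repeats=1_000):
    """Return the mean time in microseconds of one label lookup with .loc.

    Hypothetical helper for illustration; points must be larger than 2000
    so that the label "index-2000" exists.
    """
    index = pd.Index([f"index-{n}" for n in range(points)], dtype=dtype)
    data = [f"data-{n}" for n in range(points)]
    df = pd.DataFrame(data, index=index, dtype=dtype)
    label = "index-2000"
    # Warm-up: the first lookup builds the index's internal lookup table
    df.loc[label]
    seconds = timeit.timeit(lambda: df.loc[label], number=repeats)
    return seconds / repeats * 1e6


if __name__ == "__main__":
    for dtype in ["object", "string", "string[pyarrow]"]:
        try:
            print(f"{dtype}: {time_loc_lookup(dtype):.2f} µs per lookup")
        except ImportError:
            # "string[pyarrow]" needs the optional pyarrow package
            print(f"{dtype}: skipped (pyarrow not installed)")
```

The warm-up call keeps the one-time cost of building the lookup table out of the measurement, which is roughly what `%timeit` achieves by repeating the statement many times.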

So we might consider switching back to storing the file index as object dtype, as we now do for the dependencies in audb (audeering/audb#371). The only problem is that in audb the change is hidden from the user, whereas here it would be a breaking change.
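
To illustrate why this would be visible to users, here is a rough sketch at the pandas level. The file names and the index name `file` are only placeholders, and this is not the actual implementation in this package: the values stay the same after the conversion, but any user code that checks the index dtype would see a different result.

```python
import pandas as pd

# Placeholder index, standing in for a file index stored as string dtype
index = pd.Index(["a.wav", "b.wav"], dtype="string", name="file")

# Converting to object dtype keeps the values and the name
index_obj = index.astype("object")

print(list(index) == list(index_obj))  # same values
print(index.name == index_obj.name)  # same name

# The dtype differs, which is what downstream code could notice
print(isinstance(index.dtype, pd.StringDtype))
print(isinstance(index_obj.dtype, pd.StringDtype))
```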

hagenw added the enhancement and question labels Feb 14, 2024