Fastest way to check is and object is int or float in one pass #36

Closed
argenisleon opened this issue Apr 29, 2020 · 14 comments

@argenisleon

Hi,
Is there a way to check if an object is an int or float in one pass?
Right now I am using .isint and .isfloat.
Any help?

@argenisleon argenisleon changed the title Fastest way to check is and object is int or float Fastest way to check is and object is int or float in one pass Apr 29, 2020
@argenisleon
Author

argenisleon commented Apr 30, 2020

Thanks for the fast response @SethMMorton
I was looking for something like is_int_or_float with a return of 0 for int or 1 for float.

Maybe it is something niche, but it would be helpful for filtering data types in columns.

@SethMMorton
Owner

To be pedantic, isreal is is_int_or_float - you're looking for is_int_xor_float 😄

No, there is currently no direct functionality for that. You could do int_xor_float = isint(x) + isfloat(x). It will be 0 if it's not a number, 1 if it's an int, and 2 if it's a float.
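For anyone who wants to try the sum trick without fastnumbers installed, here is a minimal sketch. `is_int_like` and `is_float_like` are hypothetical stand-ins, not the library's `isint`/`isfloat`, and they assume the float check also accepts integer-looking strings. Under that assumption the sum comes out as 2 for an integer string and 1 for a float string, so the code-to-type mapping is worth verifying against the real functions before relying on it.

```python
def is_int_like(x):
    """Hypothetical stand-in: True for ints and for strings that parse as int."""
    if isinstance(x, bool):
        return False
    if isinstance(x, int):
        return True
    if isinstance(x, str):
        try:
            int(x)
            return True
        except ValueError:
            return False
    return False


def is_float_like(x):
    """Hypothetical stand-in: True for floats and for strings that parse as float."""
    if isinstance(x, float):
        return True
    if isinstance(x, str):
        try:
            float(x)
            return True
        except ValueError:
            return False
    return False


def int_xor_float(x):
    # Booleans add as 0/1, so the sum encodes both checks in a single value.
    return is_int_like(x) + is_float_like(x)


codes = {value: int_xor_float(value) for value in ["abc", "12", "12.5"]}
```

With these stand-ins, `codes` maps the non-numeric string to 0, so a zero result can still be used as a quick "not a number" filter.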

@argenisleon
Author

argenisleon commented May 2, 2020

Thanks. I am just trying to gain the maximum speed I can get :)

@SethMMorton
Owner

SethMMorton commented May 4, 2020

If this is something you think would be useful, I would be very open to a PR. I don't think it would require adding any new algorithms, just a new top-level function utilizing existing code.

@argenisleon
Author

Thanks, @SethMMorton,

At the moment I have neither the bandwidth nor the C knowledge to tackle this, but I will be more than happy to collaborate in the future if I cannot get the other options to work.

@SethMMorton
Owner

Is the suggestion I made usable, or is this something you need? I'm curious, what application are you needing this for?

@argenisleon
Author

argenisleon commented May 8, 2020

Thanks @SethMMorton

I am the main developer of Bumblebee https://github.com/ironmussa/Bumblebee/tree/develop-3.0
I am trying to infer the data type of a column as fast as I can. To do that, I would like to apply a function to every element in a pandas Series (like an array) to find out whether it is an object, int, float, or null, and then count the occurrences of each data type.

The problem with the default approach in Dask is that it loads the data in chunks and tries to infer the datatype in every chunk. Sometimes it fails because different chunks result in different data types.

The final goal is to reduce memory usage by casting to the data type that best represents the data.
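The counting idea described above can be sketched roughly as follows. This is a pure-Python illustration, not Bumblebee's or Dask's code: `classify`, `count_types`, and `choose_dtype` are hypothetical names, plain lists stand in for the chunks, and `None` is assumed to be the only null marker.

```python
from collections import Counter


def classify(x):
    """Hypothetical per-element check: returns int, float, str, None, or type(x)."""
    if x is None:
        return None
    if isinstance(x, (int, float)) and not isinstance(x, bool):
        return type(x)
    if isinstance(x, str):
        for candidate in (int, float):
            try:
                candidate(x)
                return candidate
            except ValueError:
                continue
        return str
    return type(x)


def count_types(values):
    """Tally the detected type of every element in one pass."""
    return Counter(classify(v) for v in values)


def choose_dtype(counts):
    """Pick the narrowest type that can hold every non-null element."""
    seen = set(counts) - {None}
    if not seen:
        return None
    if seen == {int}:
        return int
    if seen <= {int, float}:
        return float
    return object


chunks = [["1", "2", None], ["3.5", "4"]]
total = Counter()
for chunk in chunks:
    # Merge per-chunk tallies so the dtype is decided once for the whole column.
    total.update(count_types(chunk))
column_dtype = choose_dtype(total)
```

Deciding the type from the merged tally, rather than per chunk, is what avoids the situation where one chunk looks like int and another like float.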

@SethMMorton
Owner

Doesn't pandas auto-infer the datatype for you? Or are you trying to infer the type of your dataset before inserting into the dataframe?

Either way, I can see the value of a function to tell the type, not just answer "is this a particular type"? I think I was a bit thrown off by the specificity of is_int_or_float - I think a better name would be something like detect_type, and instead of returning a number, it would return the actual python type int or float or whatnot.

Rough python equivalent put in terms of existing fastnumbers functionality:

from fastnumbers import isint, isfloat

def detect_type(x):
    if isint(x):
        return int
    elif isfloat(x):
        return float
    else:
        return None

Open questions:

  • If given a string that is non-numeric, should it return None as shown above, or return str?
  • If given something completely crazy, like a list, should it return None, list, or raise a TypeError?

@argenisleon
Author

Yes, pandas can infer the datatype. The problem is Dask because it is inferring the datatype in every chunk of data.

The code you wrote is exactly what I am doing right now :)
About your questions:

Q: If given a string that is non-numeric, should it return None as shown above, or return str?
A: Return str.

Q: If given something completely crazy, like a list, should it return None, list, or raise a TypeError?
A: Return list.
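Both answers above can be covered by a single fallback that returns `type(x)`. The sketch below only illustrates that idea and is not the implementation from the PR: `isint` and `isfloat` here are pure-Python stand-ins with the same names as the fastnumbers functions, and `_parses_as` is a hypothetical helper.

```python
def _parses_as(kind, x):
    """Hypothetical helper: True if kind(x) succeeds."""
    try:
        kind(x)
        return True
    except (TypeError, ValueError):
        return False


def isint(x):
    # Stand-in, not the fastnumbers function.
    if isinstance(x, str):
        return _parses_as(int, x)
    return isinstance(x, int) and not isinstance(x, bool)


def isfloat(x):
    # Stand-in, not the fastnumbers function.
    if isinstance(x, str):
        return _parses_as(float, x)
    return isinstance(x, float)


def detect_type(x):
    if isint(x):
        return int
    elif isfloat(x):
        return float
    else:
        # Fallback: str for non-numeric strings, list for lists, and so on.
        return type(x)


results = [detect_type(v) for v in ["7", "7.5", "seven", [7]]]
```

Because the int check runs first, an integer-looking string is reported as int even though it would also parse as a float.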

@SethMMorton
Owner

@argenisleon I have created a PR for this at #38. Can you please review? At the very least, please review the following:

@argenisleon
Author

Sure @SethMMorton, I will review this today.

@SethMMorton
Owner

Closed by #38

@SethMMorton
Owner

FYI - this was released as part of fastnumbers 3.1.0
