Add a decimal parameter to read_csv / scan_csv #6698

Closed
Bebio95 opened this issue Feb 6, 2023 · 17 comments · Fixed by #15774
Labels: accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)

Comments

@Bebio95

Bebio95 commented Feb 6, 2023

Problem description

As a French user of Polars, it would be very convenient to have a decimal parameter (as in pandas) to specify the decimal separator (',' for France, and I believe in Germany too) and obtain the desired DataFrame directly, without being forced to use str.replace on every import.
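
(For illustration, a minimal sketch of the str.replace workaround described above, assuming a hypothetical data.csv that uses ';' as the column separator and ',' as the decimal mark; the file name and column name are made up, and the separator keyword is that of recent Polars versions.)

```python
import polars as pl

# Hypothetical file: columns separated by ';', decimals written with ','.
df = pl.read_csv("data.csv", separator=";")

# Columns such as "price" arrive as strings ("12,5"), so each one has to be
# fixed by hand: swap the decimal mark, then cast to a float dtype.
df = df.with_columns(
    pl.col("price").str.replace(",", ".").cast(pl.Float64)
)
```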

Bebio95 added the enhancement label on Feb 6, 2023
@igmriegel
Contributor

As a LatAm user, I come across the same necessity. I recently made a PR to allow passing this type of parsing instruction to pyarrow through our Polars API; the only limitation is that pyarrow is not available for use in the lazy API with scan_csv.

@alexander-beedie
Collaborator

alexander-beedie commented Feb 6, 2023

I'm curious what the CSV files look like - if they are using a comma inside the number, presumably they must use a different (non-comma) separator? (TAB, perhaps?) Or are all the numeric values typically double-quoted instead?

For example...

colx\tcoly
1,234\t4,567

...or:

colx,coly
"1,234","4,567"

@Bebio95
Author

Bebio95 commented Feb 6, 2023

...

I can answer for France, where the semicolon is used as the separator.

@igmriegel
Contributor

igmriegel commented Feb 6, 2023

The standard file separator for us is the semicolon, so the text would look like:

colx;coly
10,20;4,5

Some systems sometimes even use the pipe as a separator, but that is not a problem:

colx|coly
10,20|4,5

The important detail is that the numbers are not wrapped in quotes and the comma is the decimal separator, therefore:
LatAm -> 500.432,98 (probably French too)
US -> 500,432.98

We just swap the dot and the comma.

Edit: https://en.wikipedia.org/wiki/Decimal_separator#Examples_of_use

A wiki table that shows the common radix point patterns across the globe.

@igmriegel
Contributor

@Bebio95
For now you could use pyarrow to import your data and then use polars.from_arrow to convert it to a Polars DataFrame:
https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.from_arrow.html#

The pyarrow parsing options:
https://arrow.apache.org/docs/python/csv.html#customized-conversion
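
(A rough sketch of that pyarrow route, assuming a pyarrow version whose ConvertOptions supports decimal_point and a hypothetical data.csv delimited by ';'; the file path is illustrative.)

```python
import pyarrow.csv as pa_csv
import polars as pl

# Let pyarrow do the parsing, telling it about the ';' delimiter
# and the ',' decimal mark (ConvertOptions.decimal_point).
table = pa_csv.read_csv(
    "data.csv",
    parse_options=pa_csv.ParseOptions(delimiter=";"),
    convert_options=pa_csv.ConvertOptions(decimal_point=","),
)

# Convert the Arrow table to a Polars DataFrame (eager path only;
# this does not help with scan_csv / the lazy API).
df = pl.from_arrow(table)
```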

@alexander-beedie
Collaborator

alexander-beedie commented Feb 7, 2023

Ok, looks much as I expected, thanks; I think I can add this facility into the polars-native (Rust) parser at very little cost, but probably not until the weekend 👍

alexander-beedie self-assigned this on Feb 11, 2023
@danilogalisteu

Hello,
For the same reasons, it would be nice to have a thousands argument (optional; it would sometimes be set to '.' by European and LatAm users, and to ',' by English-speaking users).
This value is often present due to formatting, like in the example above #6698 (comment), and the character could just be dropped before the actual parsing, roughly as in the sketch below.
Thanks!
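
(A small sketch of that pre-parse cleanup in Polars, assuming a string column "amount" with values like "500.432,98", where '.' is the thousands separator and ',' the decimal mark; the column name and values are made up.)

```python
import polars as pl

df = pl.DataFrame({"amount": ["500.432,98", "1.234,50"]})

df = df.with_columns(
    pl.col("amount")
    .str.replace_all(".", "", literal=True)  # drop the thousands separator -> "500432,98"
    .str.replace(",", ".")                   # swap the decimal mark        -> "500432.98"
    .cast(pl.Float64)
)
```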

@bjornasm

Is there any progress on this? It would be great to be able to specify polars.read_csv('foo.csv', decimal_char=",") for us Europeans. Every column with decimal values currently defaults to string instead of float.

@igmriegel
Contributor

That PR only aids the write function; we need these parameters on the read and scan functions too...
#7806

@JavierRojas14

I would appreciate this feature being added too!

@Julian-J-S
Contributor

I would love to see this too!
In my experience, CSVs with a comma as the decimal separator (and usually a semicolon as the column separator) are unfortunately very common :D

@LucasBou

It seems that the fast_float crate is used to parse to floating point types, but it does not support specifying a different decimal separator.
The lexical crate, which is already used to parse to integer types, does support it through its lexical::parse_with_options method.
Though I guess using lexical instead of fast_float could incur a performance hit.
I would be willing to work on implementing this feature.

@Wainberg
Contributor

@alexander-beedie just going thru my list of CSV issues - did you end up getting this to work? fast_float::parse doesn't seem to support custom decimal separators unfortunately. As a related issue, it would be nice to support thousands separators for both int (e.g. "10,000") and float (e.g. "10,000.5"), with the ability to customize both the thousands separator and the decimal separator.

@alexander-beedie
Collaborator

alexander-beedie commented Jan 15, 2024

@alexander-beedie just going thru my list of CSV issues - did you end up getting this to work? fast_float::parse doesn't seem to support custom decimal separators unfortunately.

I was planning to as our previous float parser did support this; the newer SIMD parser unfortunately does not, so this is currently stuck until such support can be added 😓

@jqnatividad
Contributor

Hi @alexander-beedie, just wondering what the "newer" SIMD parser is. Is it another crate, or does Pola.rs have its own CSV parser implementation?

@alexander-beedie
Collaborator

alexander-beedie commented Mar 25, 2024

Hi @alexander-beedie, just wondering what the "newer" SIMD parser is. Is it another crate, or does Pola.rs have its own CSV parser implementation?

We have our own CSV parser, but this is referring to the SIMD string→float parsing library that the CSV parser calls when handling float-like strings.

@kevinw26

I think it would also be really nice to see a thousands parameter, just like pd.read_csv(file, thousands=','). Lots of SAS-exported CSVs still include thousands separators.
