Add a decimal parameter to read_csv / scan_csv #6698

Closed
Bebio95 opened this issue Feb 6, 2023 · 17 comments · Fixed by #15774
Labels: accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)

Comments

@Bebio95

Bebio95 commented Feb 6, 2023

Problem description

As a French user of Polars, it would be very convenient to have a decimal parameter (as in pandas) to specify the decimal separator (',' for France, and I believe in Germany too) and obtain the desired DataFrame directly, without being forced to use str.replace on every import.
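
(For illustration, a minimal sketch of the str.replace workaround described above, assuming a hypothetical data.csv that uses ';' as the column separator and ',' as the decimal mark; the file name and column name are made up, and the separator keyword is that of recent Polars versions.)

```python
import polars as pl

# Hypothetical file: columns separated by ';', decimals written with ','.
df = pl.read_csv("data.csv", separator=";")

# Columns such as "price" arrive as strings ("12,5"), so each one has to be
# fixed by hand: swap the decimal mark, then cast to a float dtype.
df = df.with_columns(
    pl.col("price").str.replace(",", ".").cast(pl.Float64)
)
```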

Bebio95 added the enhancement label on Feb 6, 2023
@igmriegel
Contributor

As a LatAm user, I come across the same necessity. I recently made a PR to allow passing this type of parsing instruction to pyarrow through our Polars API; the only limitation is that pyarrow is not available for use in the lazy API with scan_csv.

@alexander-beedie
Collaborator

alexander-beedie commented Feb 6, 2023

I'm curious what the CSV files look like - if they are using a comma inside the number, presumably they must use a different (non-comma) separator? (TAB, perhaps?) Or are all the numeric values typically double-quoted instead?

For example...

colx\tcoly
1,234\t4,567

...or:

colx,coly
"1,234","4,567"

@Bebio95
Author

Bebio95 commented Feb 6, 2023

...

I can answer for France, where the semicolon is used as the separator.

@igmriegel
Contributor

igmriegel commented Feb 6, 2023

The standard file separator for us is the semicolon, so the text would look like:

colx;coly
10,20;4,5

Some systems sometimes even use the pipe as a separator, but that is not a problem:

colx|coly
10,20|4,5

The important detail is that the numbers are not wrapped in quotes and the comma is the decimal separator, therefore:
LatAm -> 500.432,98 (probably French too)
US -> 500,432.98

We just swap the dot and the comma.

Edit: https://en.wikipedia.org/wiki/Decimal_separator#Examples_of_use

A wiki table that shows the common radix point patterns across the globe.

@igmriegel
Contributor

@Bebio95
For now you could use pyarrow to import your data and then use polars.from_arrow to convert it to a Polars DataFrame:
https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.from_arrow.html#

The pyarrow parsing options:
https://arrow.apache.org/docs/python/csv.html#customized-conversion
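
(A rough sketch of that pyarrow route, assuming a pyarrow version whose ConvertOptions supports decimal_point and a hypothetical data.csv delimited by ';'; the file path is illustrative.)

```python
import pyarrow.csv as pa_csv
import polars as pl

# Let pyarrow do the parsing, telling it about the ';' delimiter
# and the ',' decimal mark (ConvertOptions.decimal_point).
table = pa_csv.read_csv(
    "data.csv",
    parse_options=pa_csv.ParseOptions(delimiter=";"),
    convert_options=pa_csv.ConvertOptions(decimal_point=","),
)

# Convert the Arrow table to a Polars DataFrame (eager path only;
# this does not help with scan_csv / the lazy API).
df = pl.from_arrow(table)
```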

@alexander-beedie
Collaborator

alexander-beedie commented Feb 7, 2023

Ok, looks much as I expected, thanks; I think I can add this facility into the polars-native (Rust) parser at very little cost, but probably not until the weekend 👍

alexander-beedie self-assigned this on Feb 11, 2023
@danilogalisteu

Hello,
For the same reasons, it would be nice to have a thousands argument (optional; it would sometimes be set to '.' by European and LatAm users, and to ',' by English-speaking users).
This value is often present due to formatting, like in the example above #6698 (comment), and the character could just be dropped before the actual parsing, roughly as in the sketch below.
Thanks!
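
(A small sketch of that pre-parse cleanup in Polars, assuming a string column "amount" with values like "500.432,98", where '.' is the thousands separator and ',' the decimal mark; the column name and values are made up.)

```python
import polars as pl

df = pl.DataFrame({"amount": ["500.432,98", "1.234,50"]})

df = df.with_columns(
    pl.col("amount")
    .str.replace_all(".", "", literal=True)  # drop the thousands separator -> "500432,98"
    .str.replace(",", ".")                   # swap the decimal mark        -> "500432.98"
    .cast(pl.Float64)
)
```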

@bjornasm

Is there any progress on this? It would be great to be able to specify polars.read_csv('foo.csv', decimal_char=",") for us Europeans. Every column with decimal values currently defaults to string instead of float.

@igmriegel
Contributor

That PR only aids the write function; we need these parameters on the read and scan functions too...
#7806

@JavierRojas14

I would appreciate this feature being added too!

@Julian-J-S
Contributor

I would love to see this too!
In my experience, CSVs with a comma as the decimal separator (and usually a semicolon as the column separator) are unfortunately very common :D

@LucasBou

It seems that the fast_float crate is used to parse to floating point types, but it does not support specifying a different decimal separator.
The lexical crate, which is already used to parse to integer types, does support it through its lexical::parse_with_options method.
Though I guess using lexical instead of fast_float could incur a performance hit.
I would be willing to work on implementing this feature.

@Wainberg
Contributor

@alexander-beedie just going thru my list of CSV issues - did you end up getting this to work? fast_float::parse doesn't seem to support custom decimal separators unfortunately. As a related issue, it would be nice to support thousands separators for both int (e.g. "10,000") and float (e.g. "10,000.5"), with the ability to customize both the thousands separator and the decimal separator.

@alexander-beedie
Collaborator

alexander-beedie commented Jan 15, 2024

@alexander-beedie just going thru my list of CSV issues - did you end up getting this to work? fast_float::parse doesn't seem to support custom decimal separators unfortunately.

I was planning to as our previous float parser did support this; the newer SIMD parser unfortunately does not, so this is currently stuck until such support can be added 😓

@jqnatividad
Contributor

Hi @alexander-beedie, just wondering what the "newer" SIMD parser is. Is it another crate, or does Pola.rs have its own CSV parser implementation?

@alexander-beedie
Collaborator

alexander-beedie commented Mar 25, 2024

Hi @alexander-beedie, just wondering what the "newer" SIMD parser is. Is it another crate, or does Pola.rs have its own CSV parser implementation?

We have our own CSV parser, but this is referring to the SIMD string→float parsing library that the CSV parser calls when handling float-like strings.

@kevinw26

I think it would also be really nice to see a thousands parameter, just like pd.read_csv(file, thousands=','). Lots of SAS-exported CSVs still include thousands separators.
