read_fwf docs #49832
…pace, and how to override. Enhances GH-issue-16950 pandas-dev#16950
if it is not spaces (e.g., '~').
Default are space and tab characters.
Used to specify the character(s) to strip from start and end of every field.
To preserve whitespace, set to a character that does not exist in the data,
This sounds like an anti-pattern; you should not use the function at all when you want to read the whole data as one column.
Hi @phofl,
I think there's a misunderstanding.
This PR is merely documenting the current anti-patterns that exist in read_fwf:
- Input file contains 172 fields / columns, precisely defined by colspecs.
- File is read into DataFrame
- Data is mangled - white space is stripped
- To preserve white space, a delimiter field is also required, and its value must be something that will never appear at start or end of any field
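For reference, the steps listed above can be reproduced with a small sketch. The sample data and colspecs are made up for illustration, and this only shows the behaviour as described in this thread rather than any guaranteed contract:

```python
import io

import pandas as pd

# Made-up two-column fixed-width sample. The padding inside the first
# column is meaningful data that should survive the read.
DATA = " ab  12\n cd  34\n"
COLSPECS = [(0, 5), (5, 7)]


def read_sample(**kwargs):
    """Read the sample with read_fwf, keeping every field as a string."""
    return pd.read_fwf(
        io.StringIO(DATA), colspecs=COLSPECS, header=None, dtype=str, **kwargs
    )


# Default call: leading/trailing spaces and tabs are stripped from each field.
stripped = read_sample()

# Workaround: pass a delimiter character that never occurs in the data.
preserved = read_sample(delimiter="~")

print(repr(stripped.iloc[0, 0]))
print(repr(preserved.iloc[0, 0]))
```

With the behaviour described here, the first value should come back without its padding and the second with the padding intact.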
Anti-patterns observed:
- In a fixed-width data file, data should not be changed unless explicitly requested
- Fixed-width files do not have delimiters, rather colspecs
read_csv will preserve white space.
I think parts of read_fwf were designed to handle tabular, human-readable data, not flat database files. For reading tabular data, read_table seems the appropriate tool, IMHO.
TL;DR: This PR attempts to accurately describe the current behaviour, as #16772 shows people are still confused by it and #16950 didn't address readers.py.
Also, thank you @jbrockmendel for labelling this as Docs! I could not figure out how to do that myself.
I'd rather fix this instead of documenting a workaround then
Happy to hear a fix is preferred. Working on that now.
Expecting controversy by breaking current default behaviour, but will clearly document how to achieve the current behaviour of stripping white space. Am inclined to also mention read_table as a potential solution for some users.
If anyone is using delimiter="~" as is mentioned as an example in the documentation, planning to continue to support such usage, but thinking to raise FutureWarning if the delimiter keyword is used.
Is this reasonable / acceptable pandas policy?
Should I amend doc/source/whatsnew/v1.5.3.rst or doc/source/whatsnew/v2.0.0.rst?
Thank you for your help with all the issues with attempting a successful first PR!
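The FutureWarning idea above could be sketched roughly as follows. This is only an illustration: the helper name and the message text are hypothetical and not taken from pandas, and the real change would live inside read_fwf itself.

```python
import warnings


def check_delimiter_kwarg(delimiter=None):
    """Hypothetical helper: warn when a caller passes ``delimiter``.

    The value is returned unchanged, so existing usage such as
    delimiter="~" keeps working while users are nudged to migrate.
    """
    if delimiter is not None:
        warnings.warn(
            # Hypothetical wording, for illustration only.
            "Passing 'delimiter' to read_fwf may be deprecated in a future "
            "version.",
            FutureWarning,
            stacklevel=2,
        )
    return delimiter
```

Returning the value unchanged keeps the sketch backwards compatible: only a warning is added, and no behaviour changes.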
2.0, only regression fixes are backported
So to summarise: If a delimiter is passed and the character is present: What happens in this case? If a delimiter is passed and does not exist, all whitespaces are preserved, correct?
Correct on both counts.
- If a delimiter is passed and the character is present, it is stripped from start & end of every field.
- If a delimiter is passed (that is not a space char), then whitespaces are preserved.
Assigning default value(s) to delimiter:
https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/python_parser.py#L1168
Stripping delimiter(s) from each field (thus also removes \n\r from each line):
https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/python_parser.py#L1267
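As a rough model of what the two linked locations are described as doing, here is a simplified, standalone sketch. It is not a copy of the pandas source; the function and constant names are hypothetical, and the default character set is taken from the description in this thread (space and tab, plus the line endings):

```python
# Assumed default strip set: line endings plus tab and space.
DEFAULT_STRIP_CHARS = "\n\r\t "


def strip_chars_for(delimiter=None):
    """Return the characters stripped from both ends of every field."""
    # A user-supplied delimiter replaces space/tab; line endings always go.
    return "\r\n" + delimiter if delimiter else DEFAULT_STRIP_CHARS


def split_fixed_width(line, colspecs, delimiter=None):
    """Slice ``line`` by ``colspecs`` and strip each field's edges."""
    chars = strip_chars_for(delimiter)
    return [line[start:stop].strip(chars) for start, stop in colspecs]


colspecs = [(0, 5), (5, 7)]
print(split_fixed_width(" ab  12\n", colspecs))                 # whitespace stripped
print(split_fixed_width(" ab  12\n", colspecs, delimiter="~"))  # whitespace kept
print(split_fixed_width("~~ab~12\n", colspecs, delimiter="~"))  # "~" filler stripped
```

In this model, passing any non-space delimiter drops space and tab from the strip set, which is why whitespace ends up preserved as a side effect.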
Is the fact that we strip the character from the end of the fields documented anywhere? If no, we can definitely deprecate. This sounds odd
It is mentioned obliquely:
https://github.com/pandas-dev/pandas/blob/main/doc/source/user_guide/io.rst
The function parameters to read_fwf are largely the same as read_csv with two extra parameters, and a different usage of the delimiter parameter:
...
delimiter: Characters to consider as filler characters in the fixed-width file. Can be used to specify the filler character of the fields if it is not spaces (e.g., '~').
This is confusing, as for flat files any use of delimiters is unexpected since colspecs are defined (or, inferred - need to check the use of delimiters here).
In a flat file, there are no "filler character[s]", hence confusion.
Later, among the examples, is this:
The parser will take care of extra white spaces around the columns so it's ok to have extra separation between the columns in the file.
All examples are using tabular (human-readable) data.
Some conflation between read_table and read_fwf, IMHO.
See mentions at #16772, from 2017, and follow up questions still in 2022.
I'd rather fix this instead of documenting a workaround then
I think I've come up with a solution that
- causes minimal disruption to users depending on existing behaviour
- clearly documents existing behaviour
- adds 2 options to give finer-grained control over the whitespace handling in read_fwf
A newer PR can be found at: #51569
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Thanks for the pull request, but it appears to have gone stale. Additionally, if a fix is required, it should probably be discussed in an issue before moving forward, so closing in favor of a future issue.
…ce' (default=True) and 'whitespace_chars' (default=[space] and [tab] chars). Deprecation warning for 'delimiter'. See pandas-dev#49832 (comment) Signed-off-by: Ronald Barnes <ron@ronaldbarnes.ca>
* 'keep_whitespace' (default=True)
* 'whitespace_chars' (default=[space] and [tab] chars)

See: pandas-dev#49832 (comment)
https://stackoverflow.com/questions/72235501/python-pandas-read-fwf-strips-white-space
https://stackoverflow.com/questions/57012437/pandas-read-fwf-removes-white-space

* changes in pandas/io/parsers/readers.py:
  _fwf_defaults()
  read_fwf()
* pandas/io/parsers/python_parsers.py
  FixedWidthReader
    __init__
    __next__
  FixedWidthFieldParser
    __init__
    _make_reader

Signed-off-by: Ronald Barnes <ron@ronaldbarnes.ca>
Enhanced documentation on read_fwf: clarifies that whitespace is stripped by default and how to override this by setting delimiter.
pandas/io/parsers/readers.py had no mention of the delimiter option.