read_fwf docs #49832

RonaldBarnes · 2022-11-22T06:23:20Z

Enhances Clarify whitespace behavior in read_fwf documentation (#16772) #16950
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.

Enhanced documentation on read_fwf: clarifies whitespace is stripped by default and how to override via setting delimiter.

pandas/io/parsers/readers.py had no mention of delimiter option.
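For readers who want to see the behaviour being documented, here is a minimal sketch. The sample data and column positions are made up for illustration, and `delimiter="~"` is used only because `~` does not occur in the data.

```python
import io

import pandas as pd

# Two fixed-width fields of four characters each, padded with spaces.
data = " ab  cd \n ef  gh \n"
colspecs = [(0, 4), (4, 8)]

# Default: leading and trailing whitespace is stripped from every field.
stripped = pd.read_fwf(io.StringIO(data), colspecs=colspecs, header=None, dtype=str)

# Workaround: a delimiter that never appears in the data leaves the padding intact.
kept = pd.read_fwf(
    io.StringIO(data), colspecs=colspecs, header=None, dtype=str, delimiter="~"
)

print(stripped.iloc[0].tolist())
print(kept.iloc[0].tolist())
```

The first call should return the bare values, while the second should return each field with its surrounding spaces still in place.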


phofl · 2022-11-22T15:20:55Z

doc/source/user_guide/io.rst

-  if it is not spaces (e.g., '~').
+  Default are space and tab characters.
+  Used to specify the character(s) to strip from start and end of every field.
+  To preserve whitespace, set to a character that does not exist in the data,


This sounds like an anti-pattern; you should not use the function at all when you want to read the whole data as one column.

Hi @phofl,

I think there's a misunderstanding.

This PR is merely documenting the current anti-patterns that exist in read_fwf:

Input file contains 172 fields / columns, precisely defined by colspecs.

File is read into DataFrame

Data is mangled - white space is stripped

To preserve white space, a delimiter field is also required, and its value must be something that will never appear at start or end of any field

Anti-patterns observed:

In a fixed-width data file, data should not be changed unless explicitly requested

Fixed-width files do not have delimiters, rather colspecs

read_csv will preserve white space

I think parts of read_fwf were designed to handle tabular, human readable data, not flat database files. For reading tabular data, read_table seems the appropriate tool, IMHO.

TL;DR: This PR is attempting to accurately describe the current behaviour, as #16772 shows people are still confused by it and #16950 didn't address readers.py.

Also, thank you @jbrockmendel for labelling this as Docs! I could not figure out how to do that myself.

I'd rather fix this instead of documenting a workaround then

Happy to hear a fix is preferred. Working on that now.

Expecting controversy by breaking current default behaviour but will clearly document how to achieve current behaviour of stripping white space. Am inclined to also mention read_table as a potential solution for some users.

If anyone is using delimiter="~" as is mentioned as an example in the documentation, planning to continue to support such usage, but thinking to raise FutureWarning if delimiter keyword is used.

Is this reasonable / acceptable pandas policy?
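As a rough illustration of the proposal, the sketch below keeps honouring `delimiter` while warning about it. `resolve_strip_chars`, its default, and the warning text are hypothetical names made up for this example; none of this is pandas code.

```python
import warnings

# Spaces and tabs, matching the default described in the docs change.
DEFAULT_STRIP_CHARS = " \t"


def resolve_strip_chars(delimiter=None):
    """Hypothetical helper: return the characters to strip from each field."""
    if delimiter is None:
        return DEFAULT_STRIP_CHARS
    # Existing callers (e.g. delimiter="~") keep working but are told to migrate.
    warnings.warn(
        "'delimiter' is deprecated for fixed-width parsing (illustrative message)",
        FutureWarning,
        stacklevel=2,
    )
    return delimiter
```

Calling `resolve_strip_chars("~")` still returns `"~"`, so existing behaviour is unchanged apart from the warning.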

Should I amend doc/source/whatsnew/v1.5.3.rst or doc/source/whatsnew/v2.0.0.rst?

Thank you for your help with all the issues with attempting a successful first PR!

2.0, only regression fixes are backported

So to summarise: If a delimiter is passed and the character is present: What happens in this case? If a delimiter is passed and does not exist, all whitespaces are preserved, correct?

Correct on both counts.

If a delimiter is passed and the character is present, it is stripped from start & end of every field.

If a delimiter is passed (that is not a space char), then whitespaces are preserved.

Assigning default value(s) to delimiter:
https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/python_parser.py#L1168

Stripping delimiter(s) from each field (thus also removes \n\r from each line):
https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/python_parser.py#L1267
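The two answers above can be modelled in a few lines. This is a simplified sketch of the slice-then-strip behaviour described in this thread, not a copy of the linked pandas code; `split_fields` and the exact strip sets are assumptions for illustration.

```python
def split_fields(line, colspecs, delimiter=None):
    """Slice a line by half-open (start, stop) spans, then strip each field."""
    if delimiter:
        # A custom delimiter replaces whitespace as the characters to strip;
        # line endings are still removed.
        strip_chars = "\r\n" + delimiter
    else:
        # Assumed default: spaces, tabs and line endings.
        strip_chars = " \t\r\n"
    return [line[start:stop].strip(strip_chars) for start, stop in colspecs]


line = "~ab~ cd \n"
colspecs = [(0, 4), (4, 8)]

default_fields = split_fields(line, colspecs)
tilde_fields = split_fields(line, colspecs, delimiter="~")
```

With the default, the second field loses its padding and the first keeps its `~` characters. With `delimiter="~"`, the `~` characters are stripped from the first field and the spaces in the second field are preserved.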

Is the fact that we strip the character from the end of the fields documented anywhere? If no, we can definitely deprecate. This sounds odd

It is mentioned obliquely:

https://github.com/pandas-dev/pandas/blob/main/doc/source/user_guide/io.rst

The function parameters to read_fwf are largely the same as read_csv with two extra parameters, and a different usage of the delimiter parameter:
...
delimiter: Characters to consider as filler characters in the fixed-width file. Can be used to specify the filler character of the fields if it is not spaces (e.g., '~').

This is confusing, as for flat files any use of delimiters is unexpected since colspecs are defined (or, inferred - need to check the use of delimiters here).

In a flat file, there are no "filler character[s]", hence confusion.

Later, among the examples, is this:

The parser will take care of extra white spaces around the columns so it's ok to have extra separation between the columns in the file.

All examples are using tabular (human-readable) data.

Some conflation between read_table and read_fwf, IMHO.

See mentions at #16772, from 2017, and follow up questions still in 2022.
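On the open question of how delimiters interact with inferred colspecs, one plausible reading is that any position holding a non-filler character in some sampled row counts as part of a column. The sketch below models that idea under stated assumptions; `infer_colspecs` is an illustrative function, not the pandas implementation.

```python
def infer_colspecs(rows, delimiter=" \t"):
    """Guess half-open (start, stop) column spans from sample rows."""
    width = max(len(row) for row in rows)
    # One extra slot so a column touching the right edge still gets a closing edge.
    mask = [0] * (width + 1)
    for row in rows:
        for i, char in enumerate(row):
            if char not in delimiter:
                mask[i] = 1
    # An edge is any position where the mask differs from its left neighbour.
    edges = [i for i in range(width + 1) if mask[i] != (mask[i - 1] if i else 0)]
    return list(zip(edges[::2], edges[1::2]))


space_padded = infer_colspecs(["ab  cd", "a   cde"])
tilde_padded = infer_colspecs(["ab~~cd", "a~~~cde"], delimiter="~")
one_column = infer_colspecs(["ab~~cd", "a~~~cde"])
```

Under this model, rows padded with `~` collapse into a single column unless `delimiter="~"` is given, which may explain why the filler character matters most for tabular, human-readable input.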

I'd rather fix this instead of documenting a workaround then

I think I've come up with a solution that

causes minimal disruption to users depending on existing behaviour

clearly documents existing behaviour

adds 2 options to give finer-grained control over the whitespace handling in read_fwf

A newer PR can be found at: #51569
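To make the two proposed options concrete, here is one guess at how they might behave. The option names and defaults follow the commit message recorded in this thread; the semantics and the `parse_line` function are assumptions for illustration and may differ from the newer PR.

```python
def parse_line(line, colspecs, keep_whitespace=True, whitespace_chars=" \t"):
    """Hypothetical model of the proposed keep_whitespace / whitespace_chars options."""
    # Drop the line ending first so it never leaks into the last field.
    body = line.rstrip("\r\n")
    fields = [body[start:stop] for start, stop in colspecs]
    if keep_whitespace:
        # Return each field exactly as it appears in the file.
        return fields
    # Otherwise strip only the characters the caller listed.
    return [field.strip(whitespace_chars) for field in fields]


raw = parse_line(" ab  cd \n", [(0, 4), (4, 8)])
trimmed = parse_line(" ab  cd \n", [(0, 4), (4, 8)], keep_whitespace=False)
```

In this reading, `keep_whitespace=False` reproduces the current stripping behaviour, and `whitespace_chars` takes over the role that `delimiter` plays today.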

github-actions · 2022-12-29T00:05:09Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

mroeschke · 2023-01-04T01:46:32Z

Thanks for the pull request, but it appears to have gone stale. Additionally, if a fix is required, it should probably be discussed in an issue before moving forward, so closing in favor of a future issue.

…ce' (default=True) and 'whitespace_chars' (default=[space] and [tab] chars). Deprecation warning for 'delimiter'. See pandas-dev#49832 (comment) Signed-off-by: Ronald Barnes <ron@ronaldbarnes.ca>

* 'keep_whitespace' (default=True)
* 'whitespace_chars' (default=[space] and [tab] chars)

See: pandas-dev#49832 (comment)
https://stackoverflow.com/questions/72235501/python-pandas-read-fwf-strips-white-space
https://stackoverflow.com/questions/57012437/pandas-read-fwf-removes-white-space

* changes in pandas/io/parsers/readers.py: _fwf_defaults() read_fwf()
* pandas/io/parsers/python_parsers.py FixedWidthReader __init__ __next__ FixedWidthFieldParser __init__ _make_reader

Signed-off-by: Ronald Barnes <ron@ronaldbarnes.ca>

RonaldBarnes and others added 6 commits November 21, 2022 21:30

Updated documentation indicating default behaviour is to strip whites…

2bfa90a

…pace, and how to override. Enhances GH-issue-16950 pandas-dev#16950

Merge branch 'pandas-dev:main' into read_fwf_docs

1823753

Fix failed Sphinx lint issue.

f297d99

Merge branch 'read_fwf_docs' of github.com:RonaldBarnes/pandas into r…

49f24d5

…ead_fwf_docs

Added delimiter to _fwf_defaults.

a0304a7

Changed comment from ## to # per flake8.

7adb89d

phofl reviewed Nov 22, 2022

jbrockmendel added the Docs label Nov 22, 2022

RonaldBarnes and others added 3 commits November 22, 2022 22:02

Merge branch 'pandas-dev:main' into read_fwf_docs

62b8125

Delimiters used by colspecs='infer'

ab111c7

Merge branch 'pandas-dev:main' into read_fwf_docs

9504520

github-actions bot added the Stale label Dec 29, 2022

mroeschke closed this Jan 4, 2023

RonaldBarnes mentioned this pull request Jan 27, 2023

Read fwf try2 #51018

Closed

5 tasks

mroeschke mentioned this pull request Jul 7, 2023

Add keep_whitespace and whitespace_chars to read_fwf #51577

Closed

5 tasks


phofl Nov 22, 2022

RonaldBarnes Nov 23, 2022

phofl Nov 23, 2022

RonaldBarnes Nov 28, 2022

phofl Nov 28, 2022

RonaldBarnes Nov 28, 2022

phofl Nov 28, 2022

RonaldBarnes Nov 28, 2022

RonaldBarnes Mar 16, 2023
