Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read fwf try2 #51018

Closed
wants to merge 0 commits into from
Closed

Conversation

RonaldBarnes
Copy link

@RonaldBarnes RonaldBarnes commented Jan 27, 2023

Addresses "anti-patterns" discussed in #49832 where fixed width files have whitespace removed from fields by default, and require delimiter option to override this behaviour.

Specifically, @phofl commented here: #(#49832 (comment)) about preferring a fix to documenting current behaviour.

Fixed width files shouldn't have delimiters, rather colspecs or widths. Seems to be some conflation between, say, read_table and read_fwf (and even read_csv).

This PR shouldn't interfere with colspecs="infer", but IMHO that usage should be discouraged with fixed width files; sounds like a job for read_table or read_csv.

Feedback welcome, thanks.

@RonaldBarnes
Copy link
Author

Looking at the errors thrown by pytest pandast (I now know how to run that pre-commit), I could use some guidance from the maintainers before continuing.

Many test results depend on stripping all whitespace from tabular data without explicitly specifying such behaviour.

  • Should I change the test parameters to get the same results?
  • Change the results to match the proposed new behaviour?
  • Change the data to non-tabular data?
  • Move the tests into read_table (there is no existing test file for read_table)?

What are the maintainers' thoughts on read_fwf being so intertwined with the concept of delimiters?

In my background and recent use case, a fixed width file possibly came from a tape, likely does not even have line terminators, may contain packed data,...

In this scenario, leading whitespace within fields is critical, delimiters are unheard of, and visually / programmatically inferring field / column boundaries wouldn't be possible.

Hence, my opinion is that read_table should be encouraged for many scenarios the documentation are actually using read_fwf for.

Thanks!

@RonaldBarnes
Copy link
Author

I've given this PR further thought, and it'd be nice to also accept a tuple(bool, bool) for keep_whitespace for indicating a (left,right) stripping of whitespace.

I'll hold off on developing this feature 'til I hear feedback on the PR and the whole delimiters vs fixed width files situation.

Thank you.

@mroeschke mroeschke requested a review from phofl January 31, 2023 21:29
@phofl
Copy link
Member

phofl commented Jan 31, 2023

Not totally sure how to fix. My understanding is that delimiter should be the number of whitespaces that separate the fields to preserve additional whitespaces that are not part of the filed-delimiter?

Why exactly does this not work right now?

@mroeschke mroeschke added the IO Data IO issues that don't fit into a more specific label label Feb 1, 2023
@RonaldBarnes
Copy link
Author

My understanding is that delimiter should be the number of whitespaces that separate the fields to preserve additional whitespaces that are not part of the filed-delimiter?

That's how it works, but if there are delimiters of any kind, then the file is either a table (delimiters are spaces) or a form of CSV. The file may have columns / fields and lines / records of a fixed width, but it's not a Fixed Width File™.

Why exactly does this not work right now?

Because read_fwf changes the data as it reads it in and that missing data cannot be recovered (leading spaces).

In a Fixed Width File™, leading spaces are not simply padding, they're important - like in a Python script file.

In this anonymized and trimmed fixed-width data, there are several fields, but no delimiters:

00000013753x arxb91 ...
00000021308xjrxb104 ...
00000037098xjrxb105 ...
00000048336xjrxb106 ...
00000059290xjrxb108 ...
00000067863xjrxb109 ...

The default behaviour for read_fwf will trim delimiter (default: \r\n\t ) from each field, unless a delimiter is set to a character that (hopefully) does not exist at a field boundary within the data (\r or \n are safe bets).

An "anti-pattern", IMHO.

Also, this isn't mentioned at all in the API reference.

It's mentioned obliquely in the user guide, but is confusing:

delimiter: Characters to consider as filler characters in the fixed-width file. Can be used to specify the filler character of the fields if it is not spaces (e.g., ‘~’).

I read that, not thinking of potential leading whitespace as "filler" and didn't realize data was being mangled.

This is fairly clear, though further down the page:

The parser will take care of extra white spaces around the columns so it’s ok to have extra separation between the columns in the file.

It appears others have similar expectations:

https://stackoverflow.com/questions/72235501/python-pandas-read-fwf-strips-white-space

https://stackoverflow.com/questions/57012437/pandas-read-fwf-removes-white-space

Not totally sure how to fix.

Agree - it's tricky.

I don't think this will be a popular idea, but fixed width files and tables are distinctly different and there already exists a method for reading tables; users ought to be directed to the proper tool.

Merely documenting #49832 this anti-pattern of delimiter + colspecs doesn't get to the root of the problem.

I suggest people reading data where the file may be of a fixed width, but is basically tabular data should be using read_table.

Perhaps delimiter in the context of read_fwf can be retained for colspecs="infer", but its other functionality should be moved to, say, keep_whitespace.

Thank you for looking at this issue.

@RonaldBarnes
Copy link
Author

I messed up a git merge and closed this PR.

I shall re-do the work, create a new issue & PR from a fresh clone of pandas and start fresh.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pd.read_fwf removes leading and trailing whitespace
3 participants