Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why does agate use lineterminator LF instead of Python's default CRLF? #669

Closed
jpmckinney opened this issue Feb 11, 2017 · 6 comments
Closed

Comments

@jpmckinney
Copy link
Member

wireservice/csvkit#792

@onyxfish
Copy link
Collaborator

I don't recall the exact reasoning for this, but as with all the output formatting it was done to maximize compatibility with something or other. I'm not particularly persuaded by that RFC because it only "documents the format that seems to be followed by most implementations." Also, this is the first complaint I've heard about it.

That being said, I'm always open to being wrong if somebody can show me a good example of this breaking things. (And demonstrate that CRLF won't break things elsewhere!)

@jpmckinney
Copy link
Member Author

^ @CharlesNepote

@CharlesNepote
Copy link

Firstly, I think one tool shouldn't modify something without telling the user. I can understand that if you have [LF] in input you should have [LF] in output. But if I have [CRLF] in input, why should I have [LF] output? At least, you should tell the user of this transformation, even if you don't explain it. Some old softwares, specially in the Windows world, are not able to read correctly documents with [LF]: NotePad, for example, is often used to open CSV files by beginners.

Secondly, I know that the RFC "documents the format that seems to be followed by most implementations" but why modify the end-of-line when people would like to follow the only "specification" that exist. Some people wants to follow the RFC, even if it's not really a clear specification. I'm currently writing a document for new open data publishers. CSV Lint, which I recommand to them, is yelling when we have [lf] end of lines:

Structural problem: Non-standard Line Breaks on row 1

Your CSV appears to use LF line-breaks. While this will be fine in most cases, RFC 4180 specifies that CSV files should use CR-LF (a carriage-return and line-feed pair, e.g. \r\n). This may be labelled as "Windows line endings" on some systems.

This make difficult to promote csvclean because I have to provide them a horrible command line as a workaround:
csvclean file.csv; csvformat -M $'\r\n' file_out.csv > tmp.csv && mv tmp.csv file_out.csv

Third point:

(And demonstrate that CRLF won't break things elsewhere!)

I'm not able to demonstrate that CRLF won't break things elsewhere. Don't you have unit tests and real life CSV files to give a try? I'm not a programmer but if you want to, I can build real life CSV test files with different sort of end-of-lines, encodings and so on.

@onyxfish
Copy link
Collaborator

@CharlesNepote Hey Charles.

Firstly, I think one tool shouldn't modify something without telling the user.

You are welcome to your beliefs, but this runs directly contrary to the stated intention of csvkit. The relevant documentation is in the Principles of Development. In short, we always prefer a uniform output format. This is the only way we can effectively support piping between the tools. (And it's also just good practice to encourage people to use common output formats.)

Some people wants to follow the RFC, even if it's not really a clear specification.

I get that. And I even encourage it. However, that spec existed when I created csvkit and I saw no evidence anyone took it seriously. Perhaps its better supported today than it was. In either case, the vast majority of software will parse either LF or CRLF correctly.

Some people wants to follow the RFC, even if it's not really a clear specification.

I agree this is annoying and I'll consider changing the default in the next major release, however, this is the first complaint I've heard so I'm skeptical it's a common use-case. CSVLint could also add a flag to ignore the line endings...

I'm not able to demonstrate that CRLF won't break things elsewhere. Don't you have unit tests and real life CSV files to give a try?

Yeah, tons of them, but what you're describing isn't a unit testing issue, it's an integration issue. csvkit is working as intended.

If others care to weigh in on this issue, I'm all ears.

@CharlesNepote
Copy link

Thanks for your answers.

Firstly, I think one tool shouldn't modify something without telling the user.

You are welcome to your beliefs, but this runs directly contrary to the stated intention of csvkit. The relevant documentation is in the Principles of Development. In short, we always prefer a uniform output format. This is the only way we can effectively support piping between the tools. (And it's also just good practice to encourage people to use common output formats.)

Ok, I understand quite well and this point of view has also clear advantages. At least there might be a small documentation and UI issue:

If you agree, I'll open a generic bug report in csvkit: "Better explain common transformations in each csvkit tool". And I'll try to make more precise suggestions.

That said, I'm still a bit disappointed that two great and so useful tools such as csvkit and CSVLint doesn't share the same ideas... That's life ;)

Maybe there are more tools that tend to strictly follow the RFC but I can't say. I'll try to investigate. Maybe you're right and this issue only affects a few tools such as NotePad.

CSVLint could also add a flag to ignore the line endings...

I don't think it would be good: CSVLint UI is probably better with as less options as possible. At least they could write a better error message not to frighten users.

@jpmckinney
Copy link
Member Author

@CharlesNepote Yes, please open a documentation issue on csvkit. If you can also open a pull request, we will be able to make the changes more quickly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants