Why does agate use lineterminator LF instead of Python's default CRLF? #669

jpmckinney · 2017-02-11T01:16:13Z

onyxfish · 2017-02-27T14:47:56Z

I don't recall the exact reasoning for this, but as with all the output formatting it was done to maximize compatibility with something or other. I'm not particularly persuaded by that RFC because it only "documents the format that seems to be followed by most implementations." Also, this is the first complaint I've heard about it.

That being said, I'm always open to being wrong if somebody can show me a good example of this breaking things. (And demonstrate that CRLF won't break things elsewhere!)

jpmckinney · 2017-02-27T15:19:30Z

^ @CharlesNepote

CharlesNepote · 2017-02-28T09:28:10Z

Firstly, I think one tool shouldn't modify something without telling the user. I can understand that if you have [LF] in input you should have [LF] in output. But if I have [CRLF] in input, why should I have [LF] output? At least, you should tell the user of this transformation, even if you don't explain it. Some old softwares, specially in the Windows world, are not able to read correctly documents with [LF]: NotePad, for example, is often used to open CSV files by beginners.

Secondly, I know that the RFC "documents the format that seems to be followed by most implementations" but why modify the end-of-line when people would like to follow the only "specification" that exist. Some people wants to follow the RFC, even if it's not really a clear specification. I'm currently writing a document for new open data publishers. CSV Lint, which I recommand to them, is yelling when we have [lf] end of lines:

Structural problem: Non-standard Line Breaks on row 1

Your CSV appears to use LF line-breaks. While this will be fine in most cases, RFC 4180 specifies that CSV files should use CR-LF (a carriage-return and line-feed pair, e.g. \r\n). This may be labelled as "Windows line endings" on some systems.

This make difficult to promote csvclean because I have to provide them a horrible command line as a workaround:
csvclean file.csv; csvformat -M $'\r\n' file_out.csv > tmp.csv && mv tmp.csv file_out.csv

Third point:

(And demonstrate that CRLF won't break things elsewhere!)

I'm not able to demonstrate that CRLF won't break things elsewhere. Don't you have unit tests and real life CSV files to give a try? I'm not a programmer but if you want to, I can build real life CSV test files with different sort of end-of-lines, encodings and so on.

onyxfish · 2017-02-28T14:29:25Z

@CharlesNepote Hey Charles.

Firstly, I think one tool shouldn't modify something without telling the user.

You are welcome to your beliefs, but this runs directly contrary to the stated intention of csvkit. The relevant documentation is in the Principles of Development. In short, we always prefer a uniform output format. This is the only way we can effectively support piping between the tools. (And it's also just good practice to encourage people to use common output formats.)

Some people wants to follow the RFC, even if it's not really a clear specification.

I get that. And I even encourage it. However, that spec existed when I created csvkit and I saw no evidence anyone took it seriously. Perhaps its better supported today than it was. In either case, the vast majority of software will parse either LF or CRLF correctly.

Some people wants to follow the RFC, even if it's not really a clear specification.

I agree this is annoying and I'll consider changing the default in the next major release, however, this is the first complaint I've heard so I'm skeptical it's a common use-case. CSVLint could also add a flag to ignore the line endings...

I'm not able to demonstrate that CRLF won't break things elsewhere. Don't you have unit tests and real life CSV files to give a try?

Yeah, tons of them, but what you're describing isn't a unit testing issue, it's an integration issue. csvkit is working as intended.

If others care to weigh in on this issue, I'm all ears.

CharlesNepote · 2017-02-28T17:58:31Z

Thanks for your answers.

Firstly, I think one tool shouldn't modify something without telling the user.

You are welcome to your beliefs, but this runs directly contrary to the stated intention of csvkit. The relevant documentation is in the Principles of Development. In short, we always prefer a uniform output format. This is the only way we can effectively support piping between the tools. (And it's also just good practice to encourage people to use common output formats.)

Ok, I understand quite well and this point of view has also clear advantages. At least there might be a small documentation and UI issue:

"Quoting is done with double-quotes and only when required. Fields are delimited with commas. Rows are terminated with Unix line endings (“\n”)." I red that weeks ago but forget it. Telling that only once at the end of the documentation is not sufficient. We should tell it again, at least in the csvclean section. At the end of each command documentation there is "See also: Arguments common to all tools". If it's the case, we could add a "See also: Transformations common to all tools".
In the csvclean report I would tell all the changes made: charset encoding, delimiter, end-of-line, etc.
Personally I would also explain all the transformations directly in the command line response.

If you agree, I'll open a generic bug report in csvkit: "Better explain common transformations in each csvkit tool". And I'll try to make more precise suggestions.

That said, I'm still a bit disappointed that two great and so useful tools such as csvkit and CSVLint doesn't share the same ideas... That's life ;)

Maybe there are more tools that tend to strictly follow the RFC but I can't say. I'll try to investigate. Maybe you're right and this issue only affects a few tools such as NotePad.

CSVLint could also add a flag to ignore the line endings...

I don't think it would be good: CSVLint UI is probably better with as less options as possible. At least they could write a better error message not to frighten users.

jpmckinney · 2017-02-28T18:24:50Z

@CharlesNepote Yes, please open a documentation issue on csvkit. If you can also open a pull request, we will be able to make the changes more quickly.

…ter, closes #669

jpmckinney mentioned this issue Feb 11, 2017

Do not convert [cr][lf] end of line to [lf] wireservice/csvkit#792

Closed

onyxfish added documentation priority-low labels Feb 27, 2017

jpmckinney removed the priority-low label Oct 4, 2023

jpmckinney closed this as completed in d3bf621 Oct 18, 2023

jpmckinney added a commit that referenced this issue Oct 18, 2023

docs: Document that the lineterminator defaults to the newline charac…

11ea63e

…ter, closes #669

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Why does agate use lineterminator LF instead of Python's default CRLF? #669

Why does agate use lineterminator LF instead of Python's default CRLF? #669

jpmckinney commented Feb 11, 2017

onyxfish commented Feb 27, 2017

jpmckinney commented Feb 27, 2017

CharlesNepote commented Feb 28, 2017

onyxfish commented Feb 28, 2017

CharlesNepote commented Feb 28, 2017

jpmckinney commented Feb 28, 2017

Why does agate use lineterminator LF instead of Python's default CRLF? #669

Why does agate use lineterminator LF instead of Python's default CRLF? #669

Comments

jpmckinney commented Feb 11, 2017

onyxfish commented Feb 27, 2017

jpmckinney commented Feb 27, 2017

CharlesNepote commented Feb 28, 2017

onyxfish commented Feb 28, 2017

CharlesNepote commented Feb 28, 2017

jpmckinney commented Feb 28, 2017