Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix parsing CRLF files, part 2 #4

Merged
merged 1 commit into from
Sep 24, 2021
Merged

fix parsing CRLF files, part 2 #4

merged 1 commit into from
Sep 24, 2021

Conversation

mvdan
Copy link
Contributor

@mvdan mvdan commented Sep 13, 2021

(see commit message)

@mvdan
Copy link
Contributor Author

mvdan commented Sep 13, 2021

FYI @warpfork @rvagg. Before this fix, Windows with CRLF on ipld-prime failed:

--- FAIL: TestSpecFixtures (0.00s)
    --- FAIL: TestSpecFixtures/single-node (0.00s)
        spec_test.go:46: failed to parse fixture data: Invalid byte while expecting start of value: 0x70

because we'd try to dagjson decode the wrong bytes, "plied on.\r\n\r\n".

@rvagg
Copy link

rvagg commented Sep 14, 2021

ooof, nasty
this is why in testmark.js I did a global replace of /\r?\n/ with \n in the "original" document even before parsing so that offsets were properly preserved

trying to do it on a per-line basis may also have some interesting interactions with Patch because you're likely going to get a mix of crlf and lf of the output data. I've just documented that "your original document is stripped of crlf entirely, so round-trip through testmark either for test fixtures of patch will always result in a unix format text file"

maybe the safest fix is to do that global replace before line split here too? https://github.com/rvagg/testmark.js/pull/3/files#diff-3a20575d9588a8fc87b3e20b46ac1213b1b38af7497cb9f0753160165ccb02c7

@mvdan
Copy link
Contributor Author

mvdan commented Sep 14, 2021

I'll leave the actual fix up to Eric, I'm not familiar with the architecture of this whole library. Having Patch do a "dos2unix" might not be a good idea, I'm not sure. My only intuition is that doing a search-and-replace on the whole file is somewhat unnecessary if we can just strip the CR at every line we read.

@warpfork
Copy link
Owner

I think this probably looks right, but I'm also nervous enough about it that I'm gonna go backfill some tests onto the master branch and then might ask you to rebase so we get to make sure those don't break.

I don't recall if our last round of hacks for dealing with \r\n involved taking all the \r's out of the hunk body, but if it doesn't, I kinda reckon it should. Every time I've written a test lately it's been assuming "\n" somewhere in the code, too, so if the fixture body was different because somebody checked out files on windows and let git put \r's in, it would... probably not be what a typical end-user wants, in my figuring.

FWIW, I'm pretty sure Patch is already defacto turning everything into \n-only. (I think. I haven't tested it. I don't remember making a special effort one way or the other, but its reckoning is fundamentally based on lines []string, so.) If it doesn't yet, I think it's probably fine if it does.

@warpfork
Copy link
Owner

warpfork commented Sep 20, 2021

(((Opinions incoming:

My holistic approach to this is grounded in "if you're using files with \r\n endings, you're wrong, and we'll try to make your life slightly less bad, but you're still in the wrong, and there's only so much we can do".

That's not just blind Windows hostility. I developed on Windows for longer in my life than I care to admit, and even a decade ago already, I switched everything in my environment on Windows to use LF-only. It's just the least wrong approach.

Once you start letting tools put CR's back into things (as git will do on checkouts!) all sanity has gone out the window and all actions remaining available are those of damage minimization.

When it comes to damage minimization, I'm willing to merge PRs with that as a goal. (Especially for the read path.) It's just even more important to hew to "simplest thing that works" when it comes to damage minimization code than it is with any other code.

End of Opinions.)))

@mvdan
Copy link
Contributor Author

mvdan commented Sep 20, 2021

I think that's the same approach all other languages take. For example, the Go parser accepts CRLF (it treats CR as whitespace), but when printing/formatting via e.g. gofmt, it just uses LF newlines.

@warpfork
Copy link
Owner

There's now some new tests on the tip of master. I had to put a Skip in the ones for CRLF -- it was broken, but, I imagine it's probably going to be fixed by this :) Can you rebase to include that and see if it flies?

I hadn't noticed we also used len(line) to work out offsets.
Since those are offsets into the input file,
they must use the original line length before trimming.
@mvdan
Copy link
Contributor Author

mvdan commented Sep 20, 2021

Done. The parse works now; the test just fails at a later time with its sanity checks. I'll leave that to you as I'm not sure what should happen.

@warpfork
Copy link
Owner

\r chars in the Hunk.Body? Yeah, let's just remove those.

We can choose between having something that's slightly closer to "binary safe" (but breaks when used on Windows with default git checkout config), or, we can have something that isn't "binary safe", but actually works in extremely common scenarios.

Let's do the latter.

... ugh, this means not being able to just return the whole sub-slice.

..... UGH, it means we even have to look for those bytes and waste time doing it, to decide if we can return a subslice or not, wasting time even for non-Windows users who are basically never going to be having this problem. (Minuscule time, in practice, but, just, ow.)

Despite what I said about not being hostile to Windows, a minute ago, my emotional state is.... turning. 😠

@mvdan
Copy link
Contributor Author

mvdan commented Sep 20, 2021

This is a plaintext file format for test files. I don't think stripping certain whitespace characters is extraordinary :)

@mvdan
Copy link
Contributor Author

mvdan commented Sep 23, 2021

Friendly nudge. I don't have strong opinions here, but we should have a fix :) I'm happy to change the implementation here if Eric wants me to. I'm blocked on continuing work on portable CI for ipld-prime until then.

@rvagg
Copy link

rvagg commented Sep 24, 2021

seems fine to me, I'd still prefer just stripping them entirely and globally before entering processing but this seems to have roughly the same effect in the end

@warpfork
Copy link
Owner

Yeah, okay. I'll merge this, and then I'll do the remaining normalization patches to be able to remove the skip from the tests on master shortly. Here we go.

@warpfork warpfork merged commit 9dc11b3 into warpfork:master Sep 24, 2021
@aschmahmann aschmahmann mentioned this pull request Dec 1, 2021
80 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants