split verb doesn't support TSV records with newline/tab in field values #1357
Comments
I believe I've found the solution to this: don't mix versions of mlr. The data was originally in CSV format, but I needed TSV, so I converted it all with the command… mlr --c2t cat all_comments.csv > all_comments.tsv That gave me a TSV with values as shown in the issue description. I used mlr 6.0.0. A little while ago I realized that the newer mlr 6.8.0 might do the conversion differently, so I ran it again and found that it does. The line in question now looks like this…
And when I use the data with lines of that format, all the verbs seem to work well. I guess this issue could be closed, but left around as a lesson to others.
Yeah, I should've realized something was up when mlr 6.0.0 produced a TSV containing newline and tab characters. Shortly before that, I had read that those characters aren't allowed in TSV field values and should be replaced.
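To make the idea of replacing those characters concrete, here is a minimal Python sketch of a CSV-to-TSV conversion that escapes them. The escape mapping (backslash-t for tab, backslash-n for newline, and so on) is an assumption based on a common TSV convention, not something quoted from this thread, and the names `escape_tsv_value` and `csv_to_tsv` are hypothetical. It is not Miller's implementation.

```python
import csv
import io

# Assumed escaping convention (not taken from the thread): backslash, tab,
# newline and carriage return each become a two-character escape sequence.
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}


def escape_tsv_value(value: str) -> str:
    """Return value with backslash, tab, newline and CR escaped."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def csv_to_tsv(csv_text: str) -> str:
    """Convert CSV text to TSV text with one physical line per record."""
    # newline="" leaves embedded line breaks for the csv module to handle.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    lines = ["\t".join(escape_tsv_value(v) for v in row) for row in reader]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample: a quoted CSV field with a raw tab and newline.
    sample = 'Course,Criterion\nBio,"first\tpart\nsecond part"\n'
    print(csv_to_tsv(sample), end="")
```

With values escaped this way, every record occupies exactly one line and contains exactly one tab per field boundary, so a line-oriented TSV reader has nothing to misinterpret.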
I'm working with TSV data that contains newline and tab characters in some fields of some records. For example, here's the header and one line of the data that illustrates the problem…
Notice that the Criterion field's value contains newlines and tab characters. The newlines are fairly easy to see, and the two tab characters immediately follow the bullet point characters, "•". The field's values are enclosed in double quotes, as shown here.

This is line 4 (as it will be referenced later) of a file that contains 800k–900k records. I wanted to use split to break the file into many files based on the values of the Course and CourseID fields. Like this…

However, when I run that command, I get the error…
It appears that Miller's split sees the newline in that Criterion field, thinks it's the end of the record, and reports that it didn't receive enough columns for that record.

As an experiment, if I remove the newline characters from that record, I get a different error:
Since there are two tab characters, I can see where the mistake comes from. If I further experiment by removing the tab characters, the split command works fine with that line, but it complains about the same problem in another record.

I'm very surprised by this because my main reason for using Miller for processing the TSV file is that its other verbs handle those newlines without problem.
For example, I've used the filter and count verbs with this data and they've worked correctly, AFAICT.

Update: I was mistaken. The other verbs have a problem with the newline and tabs, too. When I started working with this data, I was using mlr 6.0.0. However, I had to upgrade to a newer version of mlr (6.1.0 or newer) to get the split verb. Since I've upgraded, my script using mlr no longer works with this data.

Is there something wrong with split? Is there some other verb I should use to improve the handling of the data?
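The two failure modes described in this report (too few columns after a raw newline, too many after raw tabs) can be reproduced with a short Python sketch that treats every physical line as one record and every tab as a field separator. That line-oriented behaviour is an assumption for illustration, not a description of Miller's actual reader. The function name `find_ragged_lines` and the sample values are hypothetical; only the Course, CourseID and Criterion column names come from the report.

```python
from typing import List, Tuple


def find_ragged_lines(tsv_text: str) -> List[Tuple[int, int, int]]:
    """Return (line_number, expected, actual) for each physical line
    whose tab-separated field count differs from the header's."""
    lines = tsv_text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the empty piece after a trailing newline
    if not lines:
        return []
    expected = len(lines[0].split("\t"))
    problems = []
    for number, line in enumerate(lines[1:], start=2):
        actual = len(line.split("\t"))
        if actual != expected:
            problems.append((number, expected, actual))
    return problems


if __name__ == "__main__":
    # Hypothetical data: one value holds a raw newline, another a raw tab.
    sample = (
        "Course\tCourseID\tCriterion\n"
        'Bio\t101\t"first\nsecond"\n'
        'Chem\t202\t"a\tb"\n'
        "Math\t303\tok\n"
    )
    for number, expected, actual in find_ragged_lines(sample):
        print(f"line {number}: expected {expected} fields, got {actual}")
```

Running a check like this over the file before handing it to any verb would point at the records that still carry raw newlines or tabs and need to be re-converted.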