split verb doesn't support TSV records with newline/tab in field values #1357

Closed
lsloan opened this issue Aug 19, 2023 · 3 comments

lsloan commented Aug 19, 2023

I'm working with TSV data that contains newline and tab characters in some fields of some records. For example, here's the header and one line of the data that illustrates the problem…

Course	CourseID	Prompt	PromptID	Author	AuthorID	Reviewer	ReviewerID	Criterion	CriterionID	Comment	CommentID	CommentTimeUTC
STATS 250 SP 18	221366	Write To Learn #1 Initial Submission	515272	author_name	421341	reviewer_name	372296	"The memorandum was to include a proposed data analysis plan, that is, a description of how the data will be summarized, both graphically and numerically, to help address the two study questions. Recall the two study questions were: 

	Which set of instructions are better, on average, in terms of completing the task more quickly? 

	Is there a difference, on average, in the time to complete the task for faculty versus students versus staff?

What aspects of the analysis plan were described well? Do the summaries proposed align with the type of data that is being recorded? Do they generally align well with the corresponding study question? How might the descriptions be improved?"	13	I think that the graphs you included in the data analysis report do a really good job of allowing the director to really analyze the study and be able to visualize the variables in play. I think they align well with the study questions and have good descriptions. overall good job. 	763	2018-05-14 21:32:02.390679

Notice that the Criterion field's value contains newlines and tab characters. The newlines are fairly easy to see; the two tab characters immediately follow each bullet point character, "•". The field's value is enclosed in double quotes, as shown here.

This is line 4 (as it will be referenced later) of a file that contains 800k–900k records. I wanted to use split to break the file into many files based on the values of the Course and CourseID fields. Like this…

mlr --tsv --from all_comments.tsv split -g Course,CourseID

However, when I run that command, I get the error…

mlr: mlr: TSV header/data length mismatch 13 != 9 at filename mpr_all_comments.tsv line  4.

It appears that Miller's split sees the newline in that Criterion field, thinks it's the end of the record, and reports that it didn't receive enough columns for that record.

As an experiment, if I remove the newline characters from that record, I get a different error:

mlr: mlr: TSV header/data length mismatch 13 != 15 at filename mpr_all_comments.tsv line  4.

Since there are two tab characters, I can see where the mistake comes from. If I further experiment by removing the tab characters, the split command works fine with that line, but it complains about the same problem in another record.
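The two mismatches can be reproduced with a small sketch. It assumes a reader that splits on raw newlines and tabs without honoring the double quotes, which is what the error messages suggest; the 4-column header and record are made up for illustration and are not the real data.

```python
# Made-up 4-column data; the third field is double-quoted and holds
# one raw newline and one raw tab, like the Criterion field above.
HEADER = "Course\tCourseID\tCriterion\tCommentID"
RECORD = 'STATS 250\t221366\t"Questions:\n•\tWhich is faster?"\t763'


def naive_field_counts(text: str) -> list[int]:
    """Split on newlines, then on tabs, ignoring quotes entirely."""
    return [len(line.split("\t")) for line in text.split("\n")]


expected = len(HEADER.split("\t"))
with_newline = naive_field_counts(RECORD)
without_newline = naive_field_counts(RECORD.replace("\n", " "))

print(expected)         # number of header columns
print(with_newline)     # field count per physical line of the record
print(without_newline)  # field count once the newline is removed
```

With the newline in place, the first physical line ends early and comes up short of the header count; with the newline removed, the raw tab inside the value adds one extra field. That is the same pattern as 13 != 9 and 13 != 15 above, where two raw tabs add two extra fields.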

I'm very surprised by this because my main reason for using Miller to process the TSV file is that its other verbs handle those newlines without a problem. For example, I've used the filter and count verbs with this data and they've worked correctly, AFAICT.

Update: I was mistaken. The other verbs have a problem with the newlines and tabs, too. When I started working with this data, I was using mlr 6.0.0. However, I had to upgrade to a newer version of mlr (6.1.0 or newer) to get the split verb. Since upgrading, my script using mlr no longer works with this data.

Is there something wrong with split? Is there some other verb I should use to improve the handling of the data?

lsloan (Author) commented Aug 19, 2023

I believe I've found the solution to this: don't mix versions of mlr.

The data originally had been in CSV format, but I needed TSV. So I converted it all with the command…

mlr --c2t cat all_comments.csv > all_comments.tsv

That gave me a TSV with values as shown above. I used mlr 6.0.0.

I realized a little while ago that maybe the newer mlr 6.8.0 would do it differently. So I ran the conversion again and found that was the case. The line in question now looks like this…

STATS 250 SP 18	221366	Write To Learn #1 Initial Submission	515272	author_name	421341	reviewer_name	372296	The memorandum was to include a proposed data analysis plan, that is, a description of how the data will be summarized, both graphically and numerically, to help address the two study questions. Recall the two study questions were: \n\n•\tWhich set of instructions are better, on average, in terms of completing the task more quickly? \n\n•\tIs there a difference, on average, in the time to complete the task for faculty versus students versus staff?\n\nWhat aspects of the analysis plan were described well? Do the summaries proposed align with the type of data that is being recorded? Do they generally align well with the corresponding study question? How might the descriptions be improved?	13	I think that the graphs you included in the data analysis report do a really good job of allowing the director to really analyze the study and be able to visualize the variables in play. I think they align well with the study questions and have good descriptions. overall good job. 	763	2018-05-14 21:32:02.390679

And when I use the data with lines of that format, all the verbs seem to work well.
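For anyone consuming the escaped format outside Miller, here is a minimal sketch of decoding one such line. It handles only the two escapes visible in the line above (`\n` and `\t`); whether other escapes exist is not covered here, and `unescape_tsv_value` is a hypothetical helper name, not part of Miller.

```python
def unescape_tsv_value(value: str) -> str:
    """Turn the two-character escapes \\n and \\t back into raw characters.

    Assumption: only these two escapes occur. A literal backslash followed
    by 'n' or 't' in the data would be decoded incorrectly by this sketch.
    """
    return value.replace("\\n", "\n").replace("\\t", "\t")


# Made-up 4-column line in the escaped style: no raw tabs or newlines
# appear inside any value, so splitting on tabs is safe.
line = "STATS 250\t221366\tQuestions:\\n\\n•\\tWhich is faster?\t763"

fields = line.split("\t")
criterion = unescape_tsv_value(fields[2])

print(len(fields))
print(criterion)
```

Because the separators never appear raw inside a value, each record stays on one physical line and the field count always matches the header.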

I guess this issue could be closed, but left around as a lesson to others.

johnkerl self-assigned this Aug 19, 2023

johnkerl (Owner) commented Aug 19, 2023

@lsloan this makes sense.

@aborruso found that 6.0.0 wasn't doing true TSV and that was fixed in 6.1.0:

Thank you for the analysis!! :)

lsloan (Author) commented Aug 21, 2023

Yeah, I should've realized something was up when mlr 6.0.0 produced a TSV containing newline and tab characters. Shortly before that, I had read those weren't allowed and should be replaced with \n and \t. However, I thought maybe that wasn't a very strict rule and mlr was simply using a less-preferred format of TSV. I'm very glad you and @aborruso corrected it.
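The fix in this thread was to re-run the conversion from the original CSV with a newer mlr. If only the old quoted-style TSV were available, a rough alternative would be to rewrite it with Python's `csv` module. This is a sketch under assumptions: it presumes the old file follows CSV quoting rules with a tab delimiter, it escapes only tabs and newlines, and `requote_to_escaped` is a hypothetical helper name.

```python
import csv
import io


def escape_tsv_value(value: str) -> str:
    """Replace raw tabs and newlines with the two-character escapes."""
    return value.replace("\t", "\\t").replace("\n", "\\n")


def requote_to_escaped(text: str) -> str:
    """Read CSV-style quoted TSV text and return escape-style TSV text."""
    # newline="" leaves embedded newlines untouched for the csv reader.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t", quotechar='"')
    lines = ["\t".join(escape_tsv_value(v) for v in row) for row in reader]
    return "\n".join(lines) + "\n"


# Made-up 3-column sample in the old quoted style.
old_style = (
    "Course\tCriterion\tCommentID\n"
    'STATS 250\t"Questions:\n•\tWhich is faster?"\t763\n'
)
new_style = requote_to_escaped(old_style)
print(new_style)
```

For a file of 800k–900k records, the same reader could be pointed at an open file handle and the rows written out one at a time instead of being joined in memory.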
