Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make TSV finally true TSV #923

Merged
merged 2 commits into from
Feb 6, 2022
Merged

Make TSV finally true TSV #923

merged 2 commits into from
Feb 6, 2022

Conversation

johnkerl
Copy link
Owner

@johnkerl johnkerl commented Feb 6, 2022

Re-using CSV code, with comma replaced by tab, has never been a good idea due to issues with handling double quotes.

Here we finally make TSV its own format. See also https://miller.readthedocs.io/en/latest/file-formats/#csvtsvasvusvetc.

This is for #922 and #438.

@johnkerl johnkerl merged commit 66c4a07 into main Feb 6, 2022
@johnkerl johnkerl mentioned this pull request Feb 6, 2022
@aborruso
Copy link
Contributor

aborruso commented Feb 6, 2022

Very proud of it (this after the JSONlines).

I will test it, in the next hours.

Thank you very much

@johnkerl johnkerl deleted the tsv-redux branch February 6, 2022 15:53
@masgo
Copy link

masgo commented Mar 11, 2022

I prefer to use tsvlite because of it simplicity. My definition of tsvlite is: fields can contain everything apart from TAB or newline chars. TAB is the field separator. Everything in between is part of the field (i.e., not quoting, no escape characters, nothing special). This makes the file format easy to output and input from/to different systems. The limitation that fields can not contain tabs and newlines was never a big issue.

I have some issues with the new TSV implementation:
First of all, the --tsvlite flag is gone, so I need to change my scripts. Would have preferred if the flag was kept for backward compatibility.

The bigger issue is the new handling of \t \n \ . Is there a way to disable it? For example I might have files which contain Windows file paths with backslashes. This causes some quite unexpected behaviour.

$ cat files.tsv
id      path
1       c:\test\new\file1
2       c:\test\new\file2
3       c:\test\new\file3

$ ./mlr-6.1.0 --tsv put '$test="test"' files.tsv
id      path    test
1       c:\test\new\\file1      test
2       c:\test\new\\file2      test
3       c:\test\new\\file3      test

$ ./mlr-6.1.0 --tsv put '$test=gsub($path,"\\","/")' files.tsv
id      path    test
1       c:\test\new\\file1      c:\test\new/file1
2       c:\test\new\\file2      c:\test\new/file2
3       c:\test\new\\file3      c:\test\new/file3

$ ./mlr-6.0.0 --tsv put '$test=gsub($path,"\\","/")' files.tsv
id      path    test
1       c:\test\new\file1       c:/test/new/file1
2       c:\test\new\file2       c:/test/new/file2
3       c:\test\new\file3       c:/test/new/file3

@johnkerl
Copy link
Owner Author

@masgo ok -- i will restore tsvlite -- my apologies!! and, thanks regarding windows paths -- i had not taken that into account :(

@johnkerl johnkerl mentioned this pull request Mar 11, 2022
@masgo
Copy link

masgo commented Mar 12, 2022

Thank you. When I read the changes, I also did not think of it. Just did a search&replace for tsvlite -> tsv and run my script. Everything looked normal. Turns out there a handful \t in a file with ~250.000 lines. I had actually forgot that the initial input was windows paths and I was doing gsub($path,"\\","/")' since I run the scripts on Linux.

Overall I think it's a great idea to have the \x replacement, as long as it is optional. Maybe a separate function like escape(string) and unescape(string) would be nice.

@johnkerl
Copy link
Owner Author

johnkerl commented Mar 14, 2022

@masgo for the moment (as a temporary workaround) --csvlite --fs tab still exists:

mlr --csvlite --fs tab put '$test="test"' files.tsv
id  path  test
1 c:\test\new\file1 test
2 c:\test\new\file2 test
3 c:\test\new\file3 test

mlr --csvlite --fs tab put '$test=gsub($path,"\\","/")' files.tsv
id  path  test
1 c:\test\new\file1 c:/test/new/file1
2 c:\test\new\file2 c:/test/new/file2
3 c:\test\new\file3 c:/test/new/file3

mlr --csvlite --fs tab put '$test=gsub($path,"\\","/")' files.tsv
id  path  test
1 c:\test\new\file1 c:/test/new/file1
2 c:\test\new\file2 c:/test/new/file2
3 c:\test\new\file3 c:/test/new/file3

@BEFH
Copy link
Contributor

BEFH commented Oct 6, 2022

Since this PR, mlr is silently truncating some files. I do not know which files, and cannot share the one file I know is causing the issue because of data security issues. Here's the output:

original file

$ wc -l temp.tsv
308170 temp.tsv

tsv filetype

$ mlr --tsv cat temp.tsv | wc -l
20408

csv filetype with tab separator

$ mlr --csv --fs "\t" cat temp.tsv | wc -l
308170

version 6.0.0

$ mlr --tsv cat temp.tsv | wc -l
308170

Even if my file is somehow invalid, this probably should throw an error instead of silently truncating the file. Is there something I should be looking for in the file so I can make a reprex for you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants