Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clean-whitespace verb has no effect when followed by put verb #1461

Closed
OpenFoam-User opened this issue Dec 28, 2023 · 13 comments · Fixed by #1464
Closed

clean-whitespace verb has no effect when followed by put verb #1461

OpenFoam-User opened this issue Dec 28, 2023 · 13 comments · Fixed by #1464
Assignees

Comments

@OpenFoam-User
Copy link

OpenFoam-User commented Dec 28, 2023

Hello,

Here is a csv file that contains one space after each comma:

$ cat test.csv
a, b, c, d
1, 2, 3, 4
5, 6, 7, 8

If I use the following command:

mlr --csv clean-whitespace then put '$f=5; $f+=$b' test.csv

I get error in the new column f as shown below:

a,b,c,d,f
1,2,3,4,(error)
5,6,7,8,(error)

However, if I manually remove the space after each comma manually from the file it works. Or if I use two miller commands as follows:

$ mlr --csv clean-whitespace test.csv | mlr --csv put '$f=5; $f+=$b;'
a,b,c,d,f
1,2,3,4,7
5,6,7,8,11

Is this a bug?

@aborruso
Copy link
Contributor

In your file, all cells are strings, because you have those spaces.

Then you should force to set all to strings:

mlr --csv -S clean-whitespace c.csv

to get

a,b,c,d
1,2,3,4
5,6,7,8

@aborruso
Copy link
Contributor

To run your put verb

mlr --csv -S clean-whitespace then put '$f=5;$f+=int($b)' c.csv

@OpenFoam-User
Copy link
Author

Hi @aborruso
Thank you for your reply. Sorry I don't understand your reply. I am not looking for strings. Please see my updated command that works and the expected result.

Thank you

@aborruso
Copy link
Contributor

aborruso commented Dec 28, 2023

Thank you for your reply. Sorry I don't understand your reply. I am not looking for strings. Please see my updated command that works and the expected result.

If you run the command in one shot, without exit from it, for miller those fields that contain spaces are strings.

If you run

mlrgo --csv clean-whitespace then put '$type_b=typeof($b)' c.csv

you get

a,b,c,d,type_b
1,2,3,4,string
5,6,7,8,string

If exit from it and pass the cleaned output via stdout again to mlr, those strings have become numbers and so everything works.

@OpenFoam-User
Copy link
Author

OpenFoam-User commented Dec 28, 2023

Thank you for the detailed answer. Now I understand. But this behaviour is not documented in the documentations. As a beginner I spent 5 hours to narrow down the issue to clean-whitespace and put command.

@aborruso
Copy link
Contributor

I generally, when I work with CSVs, which is a format with so many flaws, the first thing I do is normalize them. So, with a file like that I usually first apply stand-alone clean-whitespace.

I, for example, did not realize that only the "a" field was without spaces and I was making casting errors.

@aborruso
Copy link
Contributor

If now this issue is ok for you, please close it

@OpenFoam-User
Copy link
Author

If now this issue is ok for you, please close it

Yes, it's clear for me, thanks to your explanation. But I suspect this would be clear for new users. Probably, before closing the issue one should clarify this in the documentation or probably provide a way to force clean-whitespace to write its output to stdout before the next then.

@aborruso
Copy link
Contributor

Yes, it's clear for me, thanks to your explanation. But I suspect this would be clear for new users. Probably, before closing the issue one should clarify this in the documentation or probably provide a way to force clean-whitespace to write its output to stdout before the next then.

You could open a PR to propose a change in the documentation.

@johnkerl
Copy link
Owner

johnkerl commented Jan 1, 2024

@OpenFoam-User @aborruso there's a bit more going on here.

$ cat 1461.sh
cat test.csv
echo
mlr --csv clean-whitespace then put '$e = $d + 5; $t = typeof($d)' test.csv
echo
mlr --csv clean-whitespace test.csv | mlr --csv put '$e = $d + 5; $t = typeof($d)'
$ sh 1461.sh
a, b, c, d
1, 2, 3, 4
5, 6, 7, 8

a,b,c,d,e,t
1,2,3,4,(error),string
5,6,7,8,(error),string

a,b,c,d,e,t
1,2,3,4,9,int
5,6,7,8,13,int
  • One issue is that with space after comma, the field " 4" is type-inferred as string -- whereas without space after the comma, "4" is type-inferred as int.
    • One might argue that this is a design flaw (going all the way back to Miller 1). Maybe Miller should try to strip leading or trailing whitespace before type-inferring.
  • The other issue is that the clean-whitespace verb takes strings to strings.
    • This means that with then and no |, the string " 4" becomes the string "4", and adding 5 to a string is an error.
    • Whereas with the | rather than the then, the string "4" is type-inferred as int.
    • One might argue that clean-whitespace, after having done its job, should re-infer types.

I agree that at very least this needs some documentation -- however, I'm not sure that's enough. I'll think on this a bit.

@johnkerl
Copy link
Owner

johnkerl commented Jan 1, 2024

Of the two suggestions above, namely:

One might argue that this is a design flaw (going all the way back to Miller 1). Maybe Miller should try to strip leading or trailing whitespace before type-inferring.

One might argue that clean-whitespace, after having done its job, should re-infer types.

I think the first one is perhaps too disruptive. I think that if I were starting afresh, this might be the best option. However, I suspect (without proof or data) such a change might break existing scripts and/or expectations for people.

The second one seems elegant and is implemented in PR #1464.

@OpenFoam-User
Copy link
Author

The second one seems elegant and is implemented in PR #1464.

Perfect! I have tried the PR on the test file above, and it works as expected.

Many thanks

@aborruso
Copy link
Contributor

aborruso commented Jan 2, 2024

A personal note. I think Miller is really fantastic for these reasons:

  • it works great and it has important and useful features, which all together are hardly found in other products;
  • it is very well documented;
  • @johnkerl is a very good developer and a very generous person

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants