Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added 'column numbers everywhere' to csvcut #1209

Closed
wants to merge 7 commits into from

Conversation

athalhammer
Copy link

@athalhammer athalhammer commented Sep 26, 2023

I've run frequently into the problem that I'm facing wide CSV files where there is some interesting string/text somewhere in a long line of empty fields or numbers.

Example:

In which columns is ipsum (first row, after header) actually showing up?

field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22,field23,field24,field25,field26,field27,field28,field29,field30,field31,field32,field33,field34,field35,field36,field37,field38,field39,field40,field41,field42,field43,field44,field45,field46,field47,field48,field49,field50,field51,field52,field53,field54,field55,field56,field57,field58,field59,field60,field61,field62,field63,field64,field65,field66,field67,field68,field69,field70,field71,field72,field73,field74,field75
,,,,"lorem",,,,,,,,,,3,,,,,,,,,,43,,,,,,,,,,,,,,,0.95,,34,,,"evil, ipsum, with commas,",,,,,,,,,,,,,,,,,,,,,"ipsum",,,,,,334,,ipsum,4

We want to approach this without manually counting the commas.

The new addition provides this and in combination with grep the result will be highlighted:

$ csvcut -N test.csv | grep -e "\(,\"[0-9]\+~.*ipsum.*\"\)\|,[0-9]\+~[^,]*ipsum[^,]*," -i
1~,2~,3~,4~,5~lorem,6~,7~,8~,9~,10~,11~,12~,13~,14~,15~3,16~,17~,18~,19~,20~,21~,22~,23~,24~,25~43,26~,27~,28~,29~,30~,31~,32~,33~,34~,35~,36~,37~,38~,39~,40~0.95,41~,42~34,43~,44~,"45~evil, ipsum, with commas,",46~,47~,48~,49~,50~,51~,52~,53~,54~,55~,56~,57~,58~,59~,60~,61~,62~,63~,64~,65~,66~ipsum,67~,68~,69~,70~,71~,72~334,73~,74~ipsum,75~4

$ csvcut -c 45,66,74 test.csv 
field45,field66,field74
"evil, ipsum, with commas,",ipsum,ipsum

This is not a breaking change as it keeps all columns where they are. It alters all the fields but it's sole purpose is for exploratory reasons.

Also, please let me know if you know a better or any solution how the above can be achieved without this modification.

@athalhammer athalhammer force-pushed the master branch 4 times, most recently from 04189d6 to a8eaadb Compare September 26, 2023 21:06
@athalhammer athalhammer changed the title added 'line numbers everywhere' to csvcut added 'column numbers everywhere' to csvcut Sep 26, 2023
@athalhammer
Copy link
Author

athalhammer commented Oct 1, 2023

@jpmckinney I put it for csvcut as it was the most straightforward code change. Maybe agate would also be a good place to put it but it probably triggers more code changes for a feature that is not frequently sought after (see https://stackoverflow.com/questions/55253173/return-the-column-name-for-a-column-containig-a-text-value)

@athalhammer
Copy link
Author

@jpmckinney fixed the flake8 issue

@athalhammer
Copy link
Author

@jpmckinney for some reason SQLAlchemy 2.x.x is installed in the runner - that triggers the job fails. In the previous jobs it was SQLAlchemy 1.4.x

https://github.com/wireservice/csvkit/actions/runs/6371290078/job/17361039598#step:4:186

@athalhammer
Copy link
Author

culprit here: wireservice/agate-sql#40

@athalhammer
Copy link
Author

@jpmckinney I pinned agate-sql to <0.6.0 to make the tests run for this PR. I believe we should address the migration to agate-sql 0.6.0 and SQLAlchemy>=2.x.x in a separate PR.

@coveralls
Copy link

Coverage Status

coverage: 88.239% (-0.2%) from 88.443% when pulling 61eed27 on athalhammer:master into fa9e0db on wireservice:master.

@coveralls
Copy link

Coverage Status

coverage: 88.362% (-0.08%) from 88.443% when pulling 61eed27 on athalhammer:master into fa9e0db on wireservice:master.

@athalhammer
Copy link
Author

Sorry @jpmckinney , should have added a test on this feature earlier. Coverage should go up again with the last commit

@jpmckinney
Copy link
Member

Hmm, what about putting this in csvlook?

@athalhammer
Copy link
Author

athalhammer commented Oct 4, 2023

I think having the option to combine it with large tables is intriguing, e.g. csvcut -c 700-800 -N ... The output can always be piped to csvlook.

One way how I have come to love this feature is by "collecting" relevant columns in the following way:

$ csvcut -C 45 -N test.csv | grep -i ipsum | head | grep -i ipsum

In that way I can quickly collect all relevant "ipsum" columns in a long and wide table and eventually just leave the -N away and switch -C for -c

(see above)

$ csvcut -c 45,66,74 test.csv

@jpmckinney
Copy link
Member

jpmckinney commented Oct 4, 2023

As implemented, this seems more like a new utility – like csvprefix – than a logical addition to csvcut. (Changing prefix to cut is the only additional work when switching between commands.) However, csvkit isn't accepting new utilities.

For your example, you can achieve the same with existing tools:

$ csvgrep -m ipsum -c 1-75 -a test.csv | tail +2 | csvgrep -n | grep ipsum
 45: evil, ipsum, with commas,
 66: ipsum
 74: ipsum

@athalhammer
Copy link
Author

athalhammer commented Oct 4, 2023

I agree with you, my example was not comprehensive enough (single line only):

Let's extend to the following (still not big enough, but I hope it transmits the point)

$ cat test.csv 
field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22,field23,field24,field25,field26,field27,field28,field29,field30,field31,field32,field33,field34,field35,field36,field37,field38,field39,field40,field41,field42,field43,field44,field45,field46,field47,field48,field49,field50,field51,field52,field53,field54,field55,field56,field57,field58,field59,field60,field61,field62,field63,field64,field65,field66,field67,field68,field69,field70,field71,field72,field73,field74,field75
,,,,"lorem",,,,,,,,,,3,,,,,,,,,,43,,,,,,,,,,,,,,,0.95,,34,,,"evil, ipsum, with commas,",,,,,,,,,,,,,,,,,,,,,"ipsum",,,,,,334,,ipsum,4
,,,,"ipsum",,,,,,,,,,4,,,,,,,,,,41,,,,,,,,,,,,,,,0.98,,34,,,,,lorem,,,lorem,,,,ipsum,,,,,,,,,,,,,,,,,,3,,lorem,5

Then I can see in one "glimpse" (without working through the data line by line with tail +x) which numbers pop up in the first x (head -n x) lines:

$ csvcut -C 45 -N test.csv | grep -i ipsum | head | grep -e "\(,\"[0-9]\+~.*ipsum.*\"\)\|,[0-9]\+~[^,]*ipsum[^,]*," -i
1~,2~,3~,4~,5~lorem,6~,7~,8~,9~,10~,11~,12~,13~,14~,15~3,16~,17~,18~,19~,20~,21~,22~,23~,24~,25~43,26~,27~,28~,29~,30~,31~,32~,33~,34~,35~,36~,37~,38~,39~,40~0.95,41~,42~34,43~,44~,46~,47~,48~,49~,50~,51~,52~,53~,54~,55~,56~,57~,58~,59~,60~,61~,62~,63~,64~,65~,66~ipsum,67~,68~,69~,70~,71~,72~334,73~,74~ipsum,75~4
1~,2~,3~,4~,5~ipsum,6~,7~,8~,9~,10~,11~,12~,13~,14~,15~4,16~,17~,18~,19~,20~,21~,22~,23~,24~,25~41,26~,27~,28~,29~,30~,31~,32~,33~,34~,35~,36~,37~,38~,39~,40~0.98,41~,42~34,43~,44~,46~,47~lorem,48~,49~,50~lorem,51~,52~,53~,54~ipsum,55~,56~,57~,58~,59~,60~,61~,62~,63~,64~,65~,66~,67~,68~,69~,70~,71~,72~3,73~,74~lorem,75~5

Then I can easily add 66,74,5,54 to my -C and I execute again (on a file with > 1000s of lines) to see whether I caught the "bulk of columns where 'ipsum' may show up". Empty result basically means: you have all the "ipsums" and then I can continue working with some "lorems" etc. to make sure I have identified all the columns where "a user gave negative feedback" (or something like that)

@athalhammer
Copy link
Author

athalhammer commented Oct 4, 2023

Maybe it fits more naturally to csvgrep as a parameter which indicates: "match any character but remove columns that don't have a match in any line"... But then you want be able to specify e.g. the primary key column and others to be always there (even though there is no match). It will also work only with -a, so something like:

csvgrep -a -m ipsum -c 1-75 --remove-non-matching-except 1,2

@jpmckinney
Copy link
Member

jpmckinney commented Oct 4, 2023

csvgrep streams the data, so it can't remove non-matching columns.

To work on multiple lines, you can add xargs. I use \x1e (record separator character) in case the cells contain newlines.

$ csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c 'echo _ | csvcut -n' | grep ipsum
 46:  ipsum
 69: ipsum
 77: ipsum
  5: ipsum
 54: ipsum

@athalhammer
Copy link
Author

athalhammer commented Oct 4, 2023

okay, point taken, so the conclusion would be to add a new tool named csvprefix that has the option to use column names or numbers as prefixes in cells and maybe some user-provided string arguments and a separator parameter (default ~) and maybe a "--include-header" parameter?

csvprefix --column-numbers  --include-header test.csv
...

csvprefix --column-numbers --separator "=" test.csv
...

csvprefix --column-names test.csv

csvprefix --prefix "asdf" test.csv
...

@jpmckinney
Copy link
Member

Is there an issue with using xargs? csvkit's philosophy is to compose commands with pipes and to not invent new commands if existing ones are fine.

csvkit is in maintenance mode, so only existing commands are being maintained. Additional commands would have to be created as independent Python packages.

@athalhammer
Copy link
Author

Hm, I'm not an expert and probably there's a way to fix it but I could not get it run with:

$ cat test.csv 
field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22,field23,field24,field25,field26,field27,field28,field29,field30,field31,field32,field33,field34,field35,field36,field37,field38,field39,field40,field41,field42,field43,field44,field45,field46,field47,field48,field49,field50,field51,field52,field53,field54,field55,field56,field57,field58,field59,field60,field61,field62,field63,field64,field65,field66,field67,field68,field69,field70,field71,field72,field73,field74,field75
,,,,"lorem",,,,,,,,,,3,,,,,,,,,,43,,,,,,,,,,,,,,,0.95,,34,,,"evil, ipsum, with commas,",,,,,,,,,,,,,,,,,,,,,"ipsum",,,,,,334,,ipsum,4
,,,,"ipsum",,,,,,,,,,4,,,,,,,,,,41,,,,"fdsa

 ipsum",,,,,,,,,,,0.98,,34,,,,,lorem,,,lorem,,,,ipsum,,,,,,,,,,,,,,,,,,3,,lorem,5

At the same time, it revealed that it may not be the best idea either as it tries to execute content as code:

$ csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c 'echo _ | csvcut -n' | grep ipsum
 46:  ipsum
 69: ipsum
 77: ipsum
,,,,ipsum,,,,,,,,,,4,,,,,,,,,,41,,,,fdsa
sh: 3: ipsum,,,,,,,,,,,0.98,,34,,,,,lorem,,,lorem,,,,ipsum,,,,,,,,,,,,,,,,,,3,,lorem,5: not found
StopIteration: 

@jpmckinney
Copy link
Member

You just need to quote the replstr in xargs:

csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c 'echo "_" | csvcut -n' | grep ipsum

@athalhammer
Copy link
Author

That again returns invalid results for me

$ csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c 'echo "_" | csvcut -n' | grep ipsum
 46:  ipsum
 69: ipsum
 77: ipsum
  5: ipsum

(the file only has 75 columns)

@jpmckinney
Copy link
Member

Ah, yeah, need to use single quotes:

csvgrep -m ipsum -c 1-75 -a test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -I_ sh -c "echo '_' | csvcut -n" | grep ipsum

@jpmckinney
Copy link
Member

jpmckinney commented Oct 4, 2023

Even more robust:

csvgrep -m ipsum -a -c 1- test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -n1 sh -c 'echo $1 | csvcut -n' sh | grep ipsum

@athalhammer
Copy link
Author

athalhammer commented Oct 4, 2023

The last one does the trick. Thanks for adding it to the docs - I will need to look this up now every time.

Still believe that the -N option (for the 10 loc it adds) was more elegant, has probably many more use cases (in addition to the examined one), and is not completely off to hang out in csvcut (one could ask why -n is there and not in some csvinfo tool, too)

@jpmckinney
Copy link
Member

This would have been better as an issue, but I'll close this PR for now, and perhaps if there is any interest from other users they will still find the PR discussion.

@jpmckinney jpmckinney closed this Oct 4, 2023
@athalhammer
Copy link
Author

I agree with you, @jpmckinney, that this would have been better as an issue. Thanks for your time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants