BUG extsort CSV MODE issues #2391

Closed
datatraveller1 opened this issue Dec 29, 2024 · 6 comments · Fixed by #2412
Labels
bug Something isn't working

Comments

@datatraveller1

datatraveller1 commented Dec 29, 2024


Describe the bug
I want to sort a CSV file with extsort in CSV MODE, but sometimes I either get the message
"io error: invalid record index 18446744073709551615 (there are 16 records)" or the file is sorted incorrectly.

To Reproduce
Steps to reproduce the behavior:
Input file test_ids.csv:

pnm,tc_id,pc_id
405,139280,9730000630075
405,139281,9730000630075
131,139282862,9730065908379
138,139282863,9730065908379
138,139282864,9730065908379
405,139282865,9730065908379
138,139282866,9730065908379
138,139282867,9730065908379
138,139282868,9730065908379
138,139282869,9730065908379
138,139282870,9730065908379
138,139282871,9730065908379
252,139282,9730000630075
241,139283,9730000630075
272,139284,9730000630075
273,139285,9730000630075

Commands:

qsv index test_ids.csv
qsv extsort --select tc_id test_ids.csv sorted.csv

=> io error: invalid record index 18446744073709551615 (there are 16 records)

Expected behavior
No error.

Desktop (please complete the following information):

  • OS: Windows 11 64bit
  • qsv Version : qsv 1.0.0-mimalloc-apply;fetch;foreach;geocode;Luau 0.653;prompt;to;polars-0.44.2-31b7bb9;self_update-8-8;4.75 GiB-1.74 GiB-1.44 GiB-5.94 GiB (x86_64-pc-windows-msvc compiled with Rust 1.83) prebuilt

Additional context
In other cases with big files the extsort command works, but
qsv dedup --select tc_id --sorted sorted.csv | qsv select tc_id -o out.csv
shows an error:
Aborting! Input not sorted! ByteRecord(["138" ... is greater than ByteRecord([" ...
=> extsort seems to sort wrongly in these cases.

@jqnatividad
Collaborator

jqnatividad commented Jan 2, 2025

Happy New Year and thanks for the detailed report @datatraveller1 .

However, I can't seem to reproduce your issue given the commands above:

qsv extsort --select tc_id test_ids.csv sorted.csv
qsv table sorted.csv
pnm  tc_id      pc_id
405  139280     9730000630075
405  139281     9730000630075
131  139282862  9730065908379
138  139282863  9730065908379
138  139282864  9730065908379
405  139282865  9730065908379
138  139282866  9730065908379
138  139282867  9730065908379
138  139282868  9730065908379
138  139282869  9730065908379
138  139282870  9730065908379
138  139282871  9730065908379
252  139282     9730000630075
241  139283     9730000630075
272  139284     9730000630075
273  139285     9730000630075

As to the error in your Additional context:

qsv dedup --select tc_id --sorted sorted.csv | qsv select tc_id -o out.csv
Aborting! Input not sorted! ByteRecord(["138", "139282871", "9730065908379"]) is greater than ByteRecord(["252", "139282", "9730000630075"])

That is because you sorted on the second column, tc_id, so pnm, the first column, is not sorted:

qsv sort check sorted.csv
not sorted

jqnatividad added a commit that referenced this issue Jan 2, 2025
@datatraveller1
Author

datatraveller1 commented Jan 2, 2025

Hi @jqnatividad Thank you very much and a happy new year, too!
I think my additional context was a bit misleading, because it doesn't apply to the example (I'm sorry about that).

However, don't you get the io error: invalid record index message?
I get it if I copy the content above (Input file test_ids.csv) into test_ids.csv and call:

C:\test\qsv>test.bat

C:\test\qsv>qsv index test_ids.csv

C:\test\qsv>qsv extsort --select tc_id test_ids.csv sorted.csv
io error: invalid record index 18446744073709551615 (there are 16 records)

C:\test\qsv>

If you don't get the error, maybe it is a MS Windows issue with the qsv index command?

@datatraveller1
Author

Hi @jqnatividad, I have now found out:
The qsv index error happens with test_ids.csv if the EOL is CRLF; with LF it works correctly.
Can you please adjust the index command so it also works for CRLF (which is the default on MS Windows)?

@jqnatividad
Collaborator

Hi @datatraveller1 ,
Can you send me a sample file I can test with? I don't have ready access to Windows.

Also, I'd be interested to know what generates the CSV that's causing qsv index to fail... What does qsv validate report?

The csv crate is supposed to handle this transparently - https://docs.rs/csv/latest/csv/enum.Terminator.html

@datatraveller1
Author

Hi @jqnatividad,

I have attached the file with "Attach files":
test_ids.csv

If this doesn't work, I think you can also simply use sed to replace the LF with CRLF:
sed 's/$/\r/' filename > newfile

qsv validate test_ids.csv
shows Valid: 3 Columns: ("pnm", "tc_id", "pc_id"); Records: 16; Delimiter: ,

I'm not sure what fails where.
qsv index throws no error, so maybe qsv extsort evaluates the index incorrectly.

All that said, for what I wanted to achieve I now use the following (without needing index or extsort):
qsv dedup --select tc_id test_ids.csv | qsv select tc_id -o out.csv
This command works properly even for big files.

jqnatividad added the bug label Jan 5, 2025
@jqnatividad
Collaborator

Thanks for the sample file @datatraveller1 .

I can now reproduce it and confirm it's an underflow bug. The large number should have tipped me off...
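
For context on why the number is a giveaway: 18446744073709551615 is exactly `u64::MAX` (2^64 - 1), which is what an unsigned 64-bit value wraps to when 1 is subtracted from 0. The sketch below only illustrates that arithmetic. The function names and the "position minus one" conversion are hypothetical and are not taken from the qsv source; where the underflow actually occurs in extsort is not stated in this thread.

```rust
/// Hypothetical helper (not qsv code): turns a 1-based position into a
/// 0-based record index using wrapping arithmetic. A position of 0
/// silently wraps around to u64::MAX.
fn record_index_wrapping(position: u64) -> u64 {
    position.wrapping_sub(1)
}

/// Hypothetical safer variant: returns None instead of wrapping, so the
/// caller has to handle the out-of-range case explicitly.
fn record_index_checked(position: u64) -> Option<u64> {
    position.checked_sub(1)
}

fn main() {
    // 0 - 1 wraps to the value seen in the error message.
    assert_eq!(record_index_wrapping(0), 18_446_744_073_709_551_615);
    assert_eq!(record_index_wrapping(0), u64::MAX);

    // The checked variant reports the problem instead of producing a bogus index.
    assert_eq!(record_index_checked(0), None);
    assert_eq!(record_index_checked(16), Some(15));

    println!("wrapped index: {}", record_index_wrapping(0));
}
```

In general, `checked_sub` (or `saturating_sub`, when clamping to 0 is acceptable) is a common way to guard this kind of index arithmetic in Rust.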
