Sort CDXJ output before printing/exporting #99

Closed
machawk1 opened this issue Feb 7, 2017 · 5 comments

@machawk1
Member

machawk1 commented Feb 7, 2017

From #96, the CDXJ format is required(?) to be sorted, but IPWB does not currently sort its output. @ibnesayeed recommended using the LC_ALL environment variable, but something within Python would be preferable.

@machawk1
Member Author

machawk1 commented Feb 7, 2017

Assuming the CDXJ lines are held in a list (which they currently are not), the following can be used to sort them prior to printing:

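import locale

# Switch to the "C" locale so collation follows plain character codes, as with LC_ALL=C
locale.setlocale(locale.LC_ALL, "C")
# key=locale.strxfrm works on Python 2 and 3; sort(cmp=locale.strcoll) is Python 2 only
cdxLines.sort(key=locale.strxfrm)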

A test case should be written that currently fails with the output generated at this time.
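A minimal sketch of the kind of check such a test could make, assuming the indexer output is available as a list of text lines. `is_c_sorted` is a hypothetical helper name, not something that exists in IPWB, and comparing UTF-8 bytes is an assumption about what "sorted" should mean here (the same order `LC_ALL=C sort` produces).

```python
def is_c_sorted(lines):
    """Return True if every line is >= the previous one in byte-value order."""
    # Compare UTF-8 bytes rather than text so the result does not
    # depend on the locale of the machine running the test.
    encoded = [line.encode("utf-8") for line in lines]
    return all(a <= b for a, b in zip(encoded, encoded[1:]))
```

A test could then assert `is_c_sorted(output_lines)` on the indexer output, which would fail for as long as the output is unsorted.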

@ibnesayeed
Member

I would strongly oppose this approach. Adding sorting to the indexer is not a good idea and may not be desired in some cases. The current approach allows us to process one record of the WARC file at a time and immediately spit the corresponding CDX record into a file. If we instead store the CDX output for the entire WARC file in a list, sort the list, and then write it to a file, it will require a lot of memory for bigger collections. Additionally, if the indexes are to be stored in some key-value database, such sorting may be unnecessary.

A better approach would be to keep writing the CDX the way it is done right now, then, at the end of the script, read the newly written CDX file and sort it. However, this approach has the same limitation as the one you are proposing: only the output for the WARC file being processed will be sorted, and there will be no built-in way to merge it with any existing CDX file.
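A rough sketch of that post-write step, assuming the index has already been written to a file on disk. `sort_file_in_place` is a hypothetical name, and this version still reads the whole file into memory for the sort, so it only moves the memory cost out of the record-by-record indexing loop rather than removing it.

```python
def sort_file_in_place(path):
    """Re-read a finished index file and rewrite it in byte-value order."""
    # Read as bytes so the ordering matches LC_ALL=C regardless of locale.
    with open(path, "rb") as f:
        lines = f.read().splitlines()
    lines.sort()
    with open(path, "wb") as f:
        for line in lines:
            f.write(line + b"\n")
```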

If you really want to avoid the platform-specific sorting tools, then you can just write a separate wrapper script that reads from STDIN, sorts the lines, and spits the output to STDOUT. Then use pipes as usual.

@machawk1
Member Author

machawk1 commented Feb 7, 2017

@ibnesayeed That would be a useful sub-module, potentially for other packages as well, but it might already exist embedded in another package like pywb.

@machawk1
Member Author

machawk1 commented Feb 9, 2017

@ibnesayeed Currently, if multiple WARCs are passed to the indexer, the resulting CDXJ represents the whole collection of WARCs. We still have duplicates (#92) and the records are not ordered correctly (this ticket), but having another script read in the result, sort it, and de-dupe it would be better than adding yet another task to the indexer. Thoughts?

Also, should writing to a CDXJ file be extracted to this hypothetical script, too? I imagine the indexer script would simply generate the representative data but not deal with sorting, de-duping, and file I/O.

@ibnesayeed
Member

That script should be independent of what it is sorting. The input can be a CDXJ file or something else. The script should only care about providing a cross-platform way of uniquely sorting the lines of a text stream as the LC_ALL=C locale would. The script should read from STDIN and write to STDOUT. The user can decide whether to redirect the output to a file or pass it to the next step for further processing.
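A minimal sketch of such a filter, assuming Python 3 and assuming that comparing raw bytes is an acceptable stand-in for `LC_ALL=C` collation. The names `sort_unique` and `sort_stream` are hypothetical, and the whole stream is held in memory while sorting.

```python
import sys


def sort_unique(lines):
    """Return the distinct lines in byte-value order."""
    return sorted(set(lines))


def sort_stream(stdin=None, stdout=None):
    """Read lines from a binary stream, write them back sorted and de-duplicated."""
    # Default to the raw byte streams so ordering is independent of the locale.
    if stdin is None:
        stdin = sys.stdin.buffer
    if stdout is None:
        stdout = sys.stdout.buffer
    # Strip line endings so a missing final newline cannot create a false duplicate.
    lines = [line.rstrip(b"\r\n") for line in stdin]
    for line in sort_unique(lines):
        stdout.write(line + b"\n")
```

Saved as a script (say, a hypothetical `sortuniq.py`) that calls `sort_stream()` under the usual `if __name__ == "__main__":` guard, it could sit at the end of a pipe, with the output redirected to a file or fed to the next step.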
