Sort CDXJ output before printing/exporting #99
Comments
Assuming a list (which it currently is not), the following can be used to sort it prior to printing:

    import locale
    locale.setlocale(locale.LC_ALL, "C")
    # cmp= is Python 2 only; list.sort() no longer accepts it in Python 3
    cdxLines.sort(cmp=locale.strcoll)

A test case should be written that currently fails against the output generated at this time.
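For reference, a minimal sketch of the same idea that also runs on Python 3, without touching the process locale. It assumes the goal is plain byte ordering (what `LC_ALL=C sort` gives), so it sorts on the UTF-8 bytes of each line. The function name `sort_cdxj_lines` and the sample records are made up for illustration and are not taken from IPWB.

```python
def sort_cdxj_lines(lines):
    """Return the lines in byte order, independent of the active locale."""
    # Comparing the UTF-8 encoded form gives a plain byte-wise ordering
    return sorted(lines, key=lambda line: line.encode("utf-8"))


# Hypothetical CDXJ-like records, for illustration only
cdx_lines = [
    "org,example)/b 20160101000000 {}",
    "com,example)/ 20160101000000 {}",
    "org,example)/a 20150101000000 {}",
]

for line in sort_cdxj_lines(cdx_lines):
    print(line)
```

This still needs every line in memory at once, so it only fits the case where the output is already held in a list.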
I would strongly oppose this approach. Adding sorting to the indexer is not a good idea and may not be desired in some cases. The current approach allows us to process one record of the WARC file at a time and immediately spit the corresponding CDX record into a file. If we start storing the CDX output in a list for the entire WARC file, then sort the list and write it to a file, it will require a lot of memory for bigger collections. Additionally, if the indexes are to be stored in some key-value database, such sorting may be unnecessary. A better approach would be to perform the CDX writing the way it is done right now, then at the end of the script, read the newly written CDX file and sort it. However, this approach also has the limitation (same as the approach you are proposing) that only the WARC file being processed will be sorted and there will be no built-in way to merge with any existing CDX file. If you really want to avoid the platform-specific sorting tools, then you can just write a separate wrapper script that reads from STDIN, sorts the lines, and spits the output to STDOUT. Then use pipes as usual.
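A rough sketch of what such a wrapper could look like, assuming the whole input fits in memory and that byte-wise ordering is what is wanted. The name `sort_unique` is hypothetical. It works on binary streams, so the result does not depend on the platform locale or default encoding.

```python
import io


def sort_unique(in_stream, out_stream):
    """Read lines from a binary stream, write them back sorted and de-duplicated."""
    # A set drops exact duplicate lines; bytes compare byte by byte
    lines = set(in_stream.read().splitlines())
    for line in sorted(lines):
        out_stream.write(line + b"\n")


# Small in-memory demonstration with made-up input
demo_out = io.BytesIO()
sort_unique(io.BytesIO(b"b\na\nb\n"), demo_out)
print(demo_out.getvalue())
```

In an actual script on Python 3, the call would presumably be `sort_unique(sys.stdin.buffer, sys.stdout.buffer)`, so it can sit at the end of a pipe after the indexer.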
@ibnesayeed That would be a useful sub-module, potentially for other packages as well, though it might already exist embedded in another package such as pywb.
@ibnesayeed Currently, if multiple WARCs are passed to the indexer, the resulting CDXJ is representative of the collection of WARCs. We still have duplicates (#92) and they're not ordered correctly (this ticket), but having another script read in the result, sort, and de-dupe it would be better than adding yet another task to the indexer. Thoughts? Also, should writing to a CDXJ file be extracted to this hypothetical script, too? I imagine the indexer script simply generating the representative data but not dealing with sorting, de-duping, and file I/O.
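If the per-WARC outputs (or an existing CDXJ file) were each already sorted, the hypothetical script would not have to re-sort everything. One possible sketch, using `heapq.merge` from the standard library to combine sorted inputs lazily and skip adjacent duplicates. The name `merge_sorted_unique` and the sample lines are assumptions for illustration, and every input must already be sorted in the same order.

```python
import heapq


def merge_sorted_unique(*sorted_inputs):
    """Lazily merge already-sorted iterables of lines, dropping exact duplicates."""
    previous = None
    for line in heapq.merge(*sorted_inputs):
        # Duplicates end up adjacent after the merge, so one look-back suffices
        if line != previous:
            yield line
        previous = line


# Made-up sorted inputs standing in for two CDXJ files
first = ["com,example)/ 2015", "org,example)/ 2016"]
second = ["com,example)/ 2015", "net,example)/ 2014"]

for line in merge_sorted_unique(first, second):
    print(line)
```

Because the merge is lazy, open file objects could be passed in place of the lists, and only one line per input would be held in memory at a time.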
That script should be independent of what it is sorting. It can be a CDXJ file or something else. The script should only care about providing a cross-platform way of uniquely sorting lines of a text stream with
From #96, the CDXJ format is required(?) to be sorted, but IPWB does not accomplish this. @ibnesayeed recommended using the LC_ALL flag, but something within Python would be preferable.