Sort CDXJ output before printing/exporting #99

machawk1 · 2017-02-07T14:43:52Z

From #96, the CDXJ format is required(?) to be sorted, but IPWB does not do this. @ibnesayeed recommended using the LC_ALL flag, but something within Python would be preferable.

machawk1 · 2017-02-07T17:30:25Z

Assuming a list (which it is currently not), the following can be used to sort it prior to printing:

import locale
locale.setlocale(locale.LC_ALL, "C")
cdxLines.sort(cmp=locale.strcoll)
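
Note that the `cmp=` argument to `list.sort` exists only in Python 2. The following is a minimal sketch of an alternative that should also run on Python 3 and avoids changing the process-wide locale. It assumes that the ordering wanted here is the plain byte-value ordering that `LC_ALL=C sort` produces; the function name `sort_cdxj_lines` and the sample lines are hypothetical and only for illustration.

```python
def sort_cdxj_lines(lines):
    """Return the lines sorted by their UTF-8 byte values.

    Assumption: byte-value order is the order LC_ALL=C sorting gives.
    """
    return sorted(lines, key=lambda line: line.encode("utf-8"))


# Hypothetical sample records, not real IPWB output.
cdx_lines = [
    "com,example)/b 20170101000000 {}",
    "com,example)/B 20170101000000 {}",
    "com,example)/a 20170101000000 {}",
]

for line in sort_cdxj_lines(cdx_lines):
    print(line)
```

In byte order, uppercase ASCII letters sort before lowercase ones, so the `/B` record comes first here. Like the snippet above, this still needs all lines in memory at once.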

A test case should be written that currently fails against the output generated at this time.

ibnesayeed · 2017-02-07T21:58:10Z

I would strongly oppose this approach. Adding sorting in the indexer is not a good idea and may not be desired in some cases. The current approach allows us to process one record of the WARC file at a time and immediately spit the corresponding CDX record into a file. If we start storing the CDX output in a list for the entire WARC file, then sort the list and write it to a file, it will require a lot of memory for bigger collections. Additionally, if the indexes are to be stored in some key-value database, such sorting may be unnecessary.

A better approach would be to perform the CDX writing the way it is done right now, then at the end of the script, read the newly written CDX file and sort it. However, this approach also has the limitation (same as the approach you are proposing) that only the WARC file being processed will be sorted and there will be no built-in way to merge with any existing CDX file.
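
For the merging limitation mentioned above, one possible sketch uses `heapq.merge` from the Python standard library, which lazily merges inputs that are already sorted. It assumes both files are already in byte order and that every line ends with a newline; the function name `merge_cdxj` and the path parameters are hypothetical, not part of IPWB.

```python
import heapq


def merge_cdxj(existing_path, new_path, out_path):
    """Merge two already byte-sorted files into one sorted file.

    Files are opened in binary mode so lines compare by byte value.
    heapq.merge reads lazily, so neither file is loaded whole.
    """
    with open(existing_path, "rb") as existing, \
            open(new_path, "rb") as new, \
            open(out_path, "wb") as out:
        out.writelines(heapq.merge(existing, new))
```

This does not remove duplicate lines and does not verify that the inputs are actually sorted; if they are not, the output will not be sorted either.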

If you really want to avoid the platform-specific sorting tools, then you can just write a separate wrapper script that reads from STDIN, sorts the lines, and writes the output to STDOUT. Then use pipes as usual.

machawk1 · 2017-02-07T22:00:28Z

@ibnesayeed That would be a useful sub-module, potentially for other packages as well, though it might already exist embedded in another package such as pywb.

machawk1 · 2017-02-09T02:35:16Z

@ibnesayeed Currently, if multiple WARCs are passed to the indexer, the resulting CDXJ is representative of the collection of WARCs. We still have duplicates (#92) and they're not ordered correctly (this ticket), but having another script read in the result, sort, and de-dupe it would be better than adding yet-another-task to the indexer. Thoughts?

Also, should writing to a CDXJ be extracted to this hypothetical script, too? I imagine the indexer script simply generating the representative data but not dealing with sorting, de-duping, and file I/O.

ibnesayeed · 2017-02-09T13:23:36Z

That script should be independent of what it is sorting. It can be a CDXJ file or something else. The script should only care about providing a cross-platform way of uniquely sorting lines of a text stream with the LC_ALL=C locale. The script should read from STDIN and output to STDOUT. The user can decide whether to redirect the output to a file or pipe it to the next step for further processing.
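
A minimal sketch of such a filter is shown here, assuming that comparing raw bytes is an acceptable stand-in for the LC_ALL=C ordering. The names `sort_unique` and `main` are hypothetical. Streams are handled as bytes so that no locale or text decoding is involved.

```python
import sys


def sort_unique(lines):
    """Return the distinct lines in ascending byte order."""
    return sorted(set(lines))


def main(stdin=None, stdout=None):
    """Read lines from stdin, write them sorted and de-duplicated to stdout."""
    if stdin is None:
        stdin = sys.stdin.buffer  # binary view of STDIN (Python 3)
    if stdout is None:
        stdout = sys.stdout.buffer  # binary view of STDOUT (Python 3)
    # Iterating a binary stream yields lines with their newline kept.
    stdout.writelines(sort_unique(stdin))
```

Calling `main()` at the end of the file would turn this into a filter usable in a pipeline. It holds the whole input in memory, so it would not suit inputs larger than available RAM.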

machawk1 added the enhancement label Feb 7, 2017

machawk1 added this to the 1.X (Refined, less buggy implementation) milestone Feb 7, 2017

machawk1 mentioned this issue Feb 7, 2017

Allow CDXJ piping from the indexer to replay #96

Closed

machawk1 mentioned this issue Feb 8, 2017

Secondary resources in CDXJ are not resolved by replay system #102

Closed

machawk1 mentioned this issue Feb 15, 2017

Upgrade the CDXJ search to use binsearch instead of linear #103

Closed

machawk1 closed this as completed in 3e49887 Feb 18, 2017

machawk1 mentioned this issue Jan 15, 2019

Unsupported locale on Android #602

Closed
