Releases · bitdruid/python-wayback-machine-downloader
1.5.0
#20 made me aware of an issue with very large queries to the cdx server. The snapshots received can easily cause the system to run out of memory, resulting in a crash.
So besides some other changes, there is work in progress to reduce memory usage (in exchange for some I/O):
- Results from the cdx server are now streamed into a .cdx file instead of system memory
- Added a warning (abortable) if the amount of snapshots is on the larger scale
- Added a progress indication for the download status of the cdx-query
- Sometimes the JSON results of the cdx server were not transferred completely and an exception was thrown. Such faulty JSON data is now handled appropriately (also #20)
- Added command `--limit` to specify the maximum amount of snapshots to query
- Removed command `--debug`; the error log will now always be written
- #19 fixed an error with timestamp extraction from urls
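The lenient JSON handling mentioned above could be sketched like this: a small helper (the name `parse_cdx_lenient` is hypothetical, not the project's actual code) that, when a cdx JSON array arrives truncated, drops the incomplete trailing row and re-closes the array instead of raising.

```python
import json

def parse_cdx_lenient(raw: str) -> list:
    """Parse a cdx server JSON response, tolerating a truncated tail.

    The cdx server returns a JSON array of rows. If the transfer was cut
    off mid-row, json.loads raises; here we fall back to the last fully
    received row and close the array ourselves. (Illustrative sketch.)
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        cut = raw.rfind("],")  # end of the last complete row
        if cut == -1:
            return []          # not even one complete row arrived
        return json.loads(raw[:cut + 1] + "]")
```

A truncated response such as `'[["urlkey","timestamp"],["com,exam'` then parses to just the complete first row instead of crashing the whole query.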
Feel free to submit bug reports at any time!
1.4.0
- #17: `--delay` command can now be used to set a delay between GET requests for each worker
- `--log` command can now be used to write a logfile (also works with `--verbosity progress`)
- changes in the logic of `--verbosity` for a future loglevel setting to manage more/less output
- more changes in argument-handling
- minor fixes / cleanups
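A per-worker delay as described above could look roughly like this (a sketch under assumed names; `fetch` stands in for the real HTTP request, and `download_worker` is not the project's actual function):

```python
import time

def download_worker(urls, delay=0.0, fetch=lambda u: u):
    """Process urls sequentially, sleeping `delay` seconds between GETs.

    Illustrative sketch of a --delay option: each worker waits between
    its own requests, so N workers spread load rather than bursting.
    """
    results = []
    for i, url in enumerate(urls):
        if delay and i:            # no sleep before the first request
            time.sleep(delay)
        results.append(fetch(url))
    return results
```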
1.3.0
1.2.0
- fixed errors if snapshots collide path<->file #3 #4
- fixed errors where a picture was stored as index.html #7
- added url-encoding #4
- prevent redirect loops #4
- fixed SIGINT KeyboardInterrupt prevents csv-file from generating #8
- added custom exception handler
- added `--debug` to log exceptions into an error-log and print out the full traceback instead of a shortened one
- replaced batch-lists with a queue for workers #9
- added some cdx-queries from example.com to test
- added `--cdxbackup` and `--cdxinject` to either store a cdx query for later use or use a backup
- added `--skip` -> an existing csv-file will be used to check for already downloaded snapshots
- changed user-agent to give archive.org the possibility to know who is scraping #11
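The switch from batch-lists to a queue can be sketched as follows (names are illustrative, not the project's actual code): instead of pre-splitting the snapshot list into fixed batches, all workers pull from one shared queue, so a fast worker is never idle while a slow one finishes its batch.

```python
import queue
import threading

def run_workers(snapshots, num_workers, handle):
    """Distribute snapshots to workers via a shared queue.

    Each thread repeatedly takes the next snapshot until the queue is
    empty, giving natural load balancing. (Illustrative sketch.)
    """
    q = queue.Queue()
    for snap in snapshots:
        q.put(snap)

    def worker():
        while True:
            try:
                snap = q.get_nowait()
            except queue.Empty:
                return          # queue drained: this worker is done
            handle(snap)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```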
1.0.3
1.0.2
- fixed paths for win #2 #1:
- stripping ports from domain (:80 :443 ...) to prevent WinError
- stripping mailto-prefixes to prevent WinError
- changed url-parsing to prevent the case where subdir==filename caused WinError
- url-encoded spaces in filenames are now decoded #1
- clarified current-path structure in readme; changes may come in the future #1
- optimized the parsing of cdx-query to keep inside a requested path
- increased performance of collection-creation for very large requests
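The Windows path fixes above (stripping ports and mailto-prefixes, decoding url-encoded spaces) could be sketched as one sanitizing step; the function name and exact behavior here are illustrative, not the project's actual code:

```python
from urllib.parse import unquote, urlparse

def sanitize_for_windows(url: str) -> str:
    """Turn a snapshot url into a Windows-safe path fragment.

    Strips a mailto: prefix, drops the port from the host (":" is
    invalid in Windows filenames), and decodes %20 into spaces.
    (Illustrative sketch.)
    """
    if url.startswith("mailto:"):
        url = url[len("mailto:"):]
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""   # .hostname already drops :80 / :443
    path = unquote(parsed.path)    # "%20" -> " "
    return host + path
```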
first release
Changes to beta:
- `--worker` changed to `--workers`
- `--csv` appends the requested url to the filename to prevent overwriting
- cleanup README
- cleanup HELP