Releases: bitdruid/python-wayback-machine-downloader

1.5.0

24 Aug 21:51

#20 made me aware of an issue with very large queries to the cdx server. The snapshots received can easily cause the system to run out of memory, resulting in a crash.

So, besides some other changes, this release contains work in progress to reduce memory usage (in exchange for some additional I/O):

  • Results from the cdx server are now streamed into a .cdx file instead of system memory
  • Added a warning (abortable) if the number of snapshots is very large
  • Added a progress indication for the download status of the cdx-query
  • Sometimes the json results of the cdx server were not transferred completely and an exception was thrown. This faulty json data should now be handled appropriately (also #20)
  • Added command --limit to specify the maximum number of snapshots to query
  • Removed command --debug; the error-log is now always written
  • #19 fixed an error with timestamp extraction from urls
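
To illustrate the idea behind the streaming and the handling of incomplete json, here is a minimal sketch. It is not the code used by this tool. It assumes the cdx server sends one json row per line inside one outer array, and the names stream_to_file and iter_rows are made up for this example. Chunks are written to disk as they arrive, and rows are then parsed line by line, so a truncated last row is skipped instead of raising an exception.

```python
import json
from typing import Iterable, Iterator, List


def stream_to_file(chunks: Iterable[bytes], path: str) -> int:
    """Write each chunk to disk as it arrives; return the bytes written."""
    written = 0
    with open(path, "wb") as fh:
        for chunk in chunks:
            fh.write(chunk)
            written += len(chunk)
    return written


def iter_rows(path: str) -> Iterator[List[str]]:
    """Yield one parsed row per line, skipping lines that are not valid json.

    Assumption: the file looks like
        [["urlkey","timestamp"],
        ["com,example)/","20200101000000"],
        ...]
    """
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            text = line.strip().rstrip(",")
            # Drop the bracket of the outer array on the first / last line.
            if text.startswith("[["):
                text = text[1:]
            if text.endswith("]]"):
                text = text[:-1]
            try:
                row = json.loads(text)
            except json.JSONDecodeError:
                continue  # incomplete or faulty row, e.g. a cut-off transfer
            if isinstance(row, list) and row:
                yield row
```

Only one line is held in memory at a time while reading, which is the trade of memory for I/O described above.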

Feel free to submit bug reports at any time!

1.4.0

25 Jul 15:38
  • #17 --delay command can now be used to set a delay between GET requests for each worker
  • --log command can now be used to write a logfile (also works with --verbosity progress)
  • changed the logic of --verbosity in preparation for a future loglevel setting to manage more/less output
  • more changes in argument-handling
  • minor fixes / cleanups
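
As a rough sketch of what a per-worker delay means (not the tool's actual code): one worker pauses between its consecutive GET requests. The function name fetch_with_delay and its parameters are hypothetical, and the fetch callable stands in for the real HTTP request.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def fetch_with_delay(
    urls: Iterable[str],
    fetch: Callable[[str], T],
    delay: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Fetch urls one after another, sleeping `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        # No pause before the first request, only between consecutive ones.
        if index > 0 and delay > 0:
            sleep(delay)
        results.append(fetch(url))
    return results
```

With several workers, each one would run its own loop like this, so the delay applies per worker and not globally.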

1.3.0

29 Jun 11:41
  • fixed requirements for win
  • fixed csv for win
  • added "auto" mode
  • fixed some minor bugs

1.2.0

08 Jun 14:57
  • fixed errors if snapshots collide path<->file #3 #4
  • fixed errors where a picture was stored as index.html #7
  • added url-encoding #4
  • prevent redirect loops #4
  • fixed SIGINT (KeyboardInterrupt) preventing the csv-file from being generated #8
  • added custom exception handler
  • added --debug to log exceptions into an error-log and print out the full traceback instead of a shortened one
  • replaced batch-lists with queue for workers #9
  • added some cdx-queries from example.com to test
  • added --cdxbackup and --cdxinject to either store a cdx query for later use or use a backup
  • added --skip -> an existing csv-file will be used to check for already downloaded snapshots
  • changed user-agent so that archive.org can identify who is scraping #11
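
The switch from batch-lists to a queue (#9) can be pictured with the following sketch. It is an illustration under assumptions, not the tool's implementation: run_workers and handle are hypothetical names. Instead of giving every worker a fixed slice of snapshots up front, all workers pull the next item from one shared queue, so a worker that finishes early simply takes more work.

```python
import queue
import threading
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_workers(items: Iterable[T], handle: Callable[[T], R], workers: int = 4) -> List[R]:
    """Process items with several threads that share a single job queue."""
    jobs: "queue.Queue[T]" = queue.Queue()
    for item in items:
        jobs.put(item)

    results: List[R] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                item = jobs.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            outcome = handle(item)
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

The results arrive in completion order, not input order, and the sketch does no error handling: an exception in handle would end that worker's thread.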

1.0.3

03 Jun 18:27

fixes #3

v1.0.2

31 May 07:44
  • fixed paths for win #2 #1:
    • stripping ports from domain (:80 :443 ...) to prevent WinError
    • stripping mailto-prefixes to prevent WinError
    • changed url-parsing to prevent the case where subdir==filename caused WinError
  • url-encoded spaces in filenames are now decoded #1
  • clarified current-path structure in readme - changes may come in the future #1
  • optimized the parsing of the cdx-query to stay inside the requested path
  • increased performance of collection-creation for very large requests

first release

22 Apr 07:10

Changes since the beta:

  • --worker changed to --workers.
  • --csv appends requested url to filename to prevent overwriting
  • cleanup README
  • cleanup HELP