Releases · bitdruid/python-wayback-machine-downloader

24 Aug 21:51

bitdruid

1.5.0

78c6535

1.5.0 Latest

#20 made me aware of an issue with very large queries to the cdx server. The snapshots received can easily cause the system to run out of memory, resulting in a crash.

So, besides some other changes, there is work in progress to reduce memory usage (in exchange for some additional I/O).

Results from the cdx server are now streamed into a .cdx file instead of being held in system memory
Added a warning (abortable) if the number of snapshots is very large
Added a progress indicator for the download of the cdx-query
Sometimes the JSON results of the cdx server were not transferred completely and an exception was thrown. This faulty JSON data should now be handled appropriately (also #20)
Added command --limit to specify the maximum number of snapshots to query
Removed command --debug; the error-log is now always written
#19 fixed an error with timestamp extraction from URLs
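
The streaming and faulty-JSON handling described above can be illustrated roughly as follows. This is only a sketch, not the project's actual code: the function names `stream_to_file` and `iter_rows` are hypothetical, and it assumes the cdx server returns a JSON array with one row per line, so that a cut-off transfer only damages the last line.

```python
import json
from typing import Iterable, Iterator, List, Optional


def stream_to_file(chunks: Iterable[bytes], path: str) -> int:
    """Write response chunks straight to disk; return the number of bytes written.

    `chunks` is assumed to be any iterable of bytes, for example the
    chunk iterator of a streamed HTTP response.
    """
    written = 0
    with open(path, "wb") as fh:
        for chunk in chunks:
            fh.write(chunk)
            written += len(chunk)
    return written


def iter_rows(path: str, limit: Optional[int] = None) -> Iterator[List[str]]:
    """Yield one parsed row at a time, skipping lines that are not valid JSON."""
    count = 0
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            text = line.strip().rstrip(",")
            # The first row carries the opening bracket of the outer array,
            # the last row carries its closing bracket.
            if text.startswith("[["):
                text = text[1:]
            if text.endswith("]]"):
                text = text[:-1]
            if not text:
                continue
            try:
                row = json.loads(text)
            except json.JSONDecodeError:
                # Incomplete transfer: ignore the damaged line.
                continue
            if not isinstance(row, list) or not row:
                continue
            yield row
            count += 1
            if limit is not None and count >= limit:
                return
```

Because rows are read lazily from the file, only one line is held in memory at a time; the optional `limit` here merely mimics the idea behind --limit on the reading side.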

Feel free to submit bug reports at any time!


25 Jul 15:38

bitdruid

1.4.0

b19b22a

1.4.0

#17 --delay command can now be used to set a delay between GET requests for each worker
--log command can now be used to write a logfile (also works with --verbosity progress)
changed the logic of --verbosity in preparation for a future loglevel setting to manage more/less output
more changes in argument-handling
minor fixes / cleanups
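
As a rough illustration of what a per-worker delay between GET requests means, here is a minimal sketch. The function `run_worker` and its parameters are hypothetical and not taken from the project; `fetch` stands in for whatever performs the actual request.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def run_worker(
    urls: Iterable[str],
    fetch: Callable[[str], T],
    delay: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        # Pause before every request except the first one.
        if index > 0 and delay > 0:
            sleep(delay)
        results.append(fetch(url))
    return results
```

Since each worker sleeps on its own, the overall request rate in this sketch still grows with the number of workers.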


29 Jun 11:41

bitdruid

1.3.0

f81afba

1.3.0

fixed requirements for win
fixed csv for win
added "auto" mode
fixed some minor bugs


08 Jun 14:57

bitdruid

1.2.0

f0966d0

1.2.0

fixed errors if snapshots collide path<->file #3 #4
fixed errors where a picture was stored as index.html #7
added url-encoding #4
prevent redirect loops #4
fixed SIGINT (KeyboardInterrupt) preventing the csv-file from being generated #8
added custom exception handler
added --debug to log exceptions into an error-log and print the full traceback instead of a shortened one
replaced batch-lists with queue for workers #9
added some cdx-queries from example.com to test
added --cdxbackup and --cdxinject to either store a cdx query for later use or use a backup
added --skip -> an existing csv-file will be used to check for already downloaded snapshots
changed user-agent to give archive.org the possibility to know who is scraping #11
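
The switch from batch-lists to a queue for workers (#9) can be pictured with the following sketch. It is an assumption about the general pattern, not the project's implementation; `download_all` and `handle` are hypothetical names.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List, Tuple


def download_all(
    snapshots: Iterable[Any],
    handle: Callable[[Any], Any],
    workers: int = 4,
) -> List[Tuple[Any, Any]]:
    """Let `workers` threads pull items from one shared queue until it is empty."""
    jobs: "queue.Queue[Any]" = queue.Queue()
    for snapshot in snapshots:
        jobs.put(snapshot)

    results: List[Tuple[Any, Any]] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                item = jobs.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            try:
                outcome = handle(item)
            except Exception as exc:
                # Record the failure instead of killing the worker.
                outcome = exc
            with lock:
                results.append((item, outcome))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Compared with pre-computed batches, a shared queue keeps every worker busy until the last item is taken, even when some downloads take much longer than others.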


03 Jun 18:27

bitdruid

1.0.3

9ac5c53

1.0.3

fixes #3


31 May 07:44

bitdruid

1.0.2

e5129b6

v1.0.2

fixed paths for win #2 #1:
- stripping ports from domain (:80 :443 ...) to prevent WinError
- stripping mailto-prefixes to prevent WinError
- changed url-parsing to prevent the case where subdir==filename caused WinError
url-encoded spaces in filenames are now decoded #1
clarified current-path structure in readme - changes may come in the future #1
optimized the parsing of cdx-query to keep inside a requested path
increased performance of collection-creation for very large requests
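
The path fixes listed above (stripping ports and mailto-prefixes, decoding url-encoded spaces) could look roughly like this. It is a sketch using only the standard library; `strip_mailto` and `local_path` are hypothetical helpers, and the project's real path layout may differ.

```python
from urllib.parse import unquote, urlsplit


def strip_mailto(value: str) -> str:
    """Drop a leading 'mailto:' so the colon never ends up in a file name."""
    prefix = "mailto:"
    return value[len(prefix):] if value.lower().startswith(prefix) else value


def local_path(url: str) -> str:
    """Turn a snapshot URL into a relative path without port or %-escapes."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    # .hostname excludes the port (":80", ":443", ...), which is what
    # would otherwise put a colon into a Windows path.
    host = parts.hostname or ""
    # Decode %20 and other escapes so file names are human readable.
    path = unquote(parts.path).lstrip("/")
    return f"{host}/{path}" if path else host
```

For example, `local_path("http://example.com:80/my%20file.html")` gives `example.com/my file.html`.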


22 Apr 07:10

bitdruid

1.0.1

94e49e4

first release

Changes since beta:

--worker changed to --workers.
--csv appends requested url to filename to prevent overwriting
cleanup README
cleanup HELP
