APIs
Each API has the following usage limits (thresholds). If you start receiving the HTTP response status 429 Too Many Requests, check whether you are exceeding these limits:
- Arquivo.pt API (Full-text & URL search): 250 requests/minute from the same IP address
- Image Search API v1.1: 400 requests/minute from the same IP address
- CDX-server API (URL search): 250 requests/minute from the same IP address
- Memento API (URL search): 400 requests/minute from the same IP address
- Training module on Automatic processing of information preserved from the Web (module C)
- Tutorial in Python about how to explore the Arquivo.pt API
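To stay under the per-minute limits listed above, a client can space its requests out rather than send them in bursts. The following is a minimal sketch, not part of the Arquivo.pt tooling: the `Throttle` class, its parameters, and the dictionary keys are hypothetical names chosen for illustration, and only the numeric limits come from this page.

```python
import time


class Throttle:
    """Keep consecutive requests at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / per_minute  # seconds between requests
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed, then reserve a slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


# Limits copied from the list above; the keys are illustrative labels only.
LIMITS_PER_MINUTE = {
    "fulltext": 250,
    "image_search": 400,
    "cdx": 250,
    "memento": 400,
}
```

Calling `throttle.wait()` before each request, with `throttle = Throttle(LIMITS_PER_MINUTE["cdx"])`, would pause about 0.24 seconds between CDX-server calls. Because the limits apply per IP address, several processes sharing one address would need to share the budget.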
If you need to download a large amount of web-archived resources, such as all the URLs archived from a large website over time, we suggest the following methodology:
- Analyse the Arquivo.pt collections so that you may choose those which may contain the most interesting web-archived data for your use case. If you have any doubt, contact us.
- Download the CDXJ index files (what is CDXJ?) of the Arquivo.pt collections you selected to process. To find them, look up each collection under "column A: Collection ID" and its index file under "column H: Collection CDXJ File".
- Create a list of selected URLs to be downloaded, extracted from the CDXJ index files. For example, the following Linux command uses grep to select the HTML pages that were successfully archived (status: 200, mime: text/html) and counts them with wc:
> grep '"status": "200"' EAWP5.cdxj | grep '"mime": "text/html' | wc
- Download the web-archived resources for the list of selected URLs from Arquivo.pt, either by using the above APIs or by building links that access the web-archived resources directly. These links are shown under "Technical details" in the Options menu at the top right of a web-archived page. For instance, for the URL http://publico.pt/ with timestamp 20120201160355 extracted from the CDXJ index file, build the following links to download the:
- original file of the web-archived page (loses replay quality because the original internal links are not rewritten to reference web-archived images or stylesheets); notice the suffix id_ appended after the timestamp: https://arquivo.pt/noFrame/replay/20120201160355id_/http://publico.pt/
- web-archived page without the Arquivo.pt UI frame (internal links are rewritten to reference web-archived resources): https://arquivo.pt/noFrame/replay/20120201160355/http://publico.pt/
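The selection and link-building steps above can be sketched in Python. This is an illustration under stated assumptions, not the official procedure: it assumes each CDXJ line has the layout `<key> <timestamp> <JSON object>`, that the JSON object carries `status` and `mime` as strings (as the grep patterns suggest), and that it carries a `url` field. All function names are hypothetical.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (timestamp, fields).

    Assumed layout: '<key> <timestamp> <JSON object>'.
    """
    _key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return timestamp, json.loads(payload)


def select_html_captures(lines):
    """Yield (timestamp, url) for captures with status 200 and an HTML MIME type."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        timestamp, fields = parse_cdxj_line(line)
        is_ok = fields.get("status") == "200"
        is_html = fields.get("mime", "").startswith("text/html")
        if is_ok and is_html:
            yield timestamp, fields["url"]  # "url" field name is an assumption


def replay_links(timestamp, url):
    """Build the two noFrame replay links described above."""
    base = "https://arquivo.pt/noFrame/replay"
    return {
        # id_ suffix: original file, internal links not rewritten
        "original": f"{base}/{timestamp}id_/{url}",
        # no suffix: internal links rewritten to web-archived resources
        "rewritten": f"{base}/{timestamp}/{url}",
    }
```

Since `select_html_captures` accepts any iterable of lines, an open file handle for a downloaded index such as EAWP5.cdxj can be passed to it directly, which avoids loading a large index into memory.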
The direct-access endpoints have the usage limits listed below. A client that exceeds a limit will receive the error "HTTP 429 Too Many Requests" and should decrease its download rate.
- https://arquivo.pt/wayback: 200 requests/minute from the same IP address
- https://arquivo.pt/noFrame/replay: 200 requests/minute from the same IP address
- https://arquivo.pt/noFrame/patching/record: 200 requests/minute from the same IP address
- https://arquivo.pt/save/now/record: 200 requests/minute from the same IP address
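One way to decrease the download rate when a 429 arrives is to pause and retry with growing delays. The sketch below uses only the Python standard library. The function names, the retry count, and the delay values are arbitrary assumptions for illustration, since this page says only that the client should slow down.

```python
import time
import urllib.error
import urllib.request


def http_get(url):
    """Fetch a URL and return (status_code, body_bytes)."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; hand the status code back to the caller.
        return err.code, b""


def fetch_with_backoff(url, fetch=http_get, sleep=time.sleep,
                       max_attempts=5, initial_delay=1.0):
    """Retry on HTTP 429, doubling the pause after each rejected attempt."""
    delay = initial_delay
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(delay)   # back off before trying again
            delay *= 2     # 1 s, 2 s, 4 s, ... (assumed schedule)
    return 429, b""        # still rate limited after all attempts
```

The `fetch` and `sleep` parameters are injectable so the retry logic can be exercised without network access. In a bulk download, this would typically be combined with a steady pace below 200 requests/minute so that 429 responses stay rare.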
If you have any trouble using our APIs, please contact us so that we can try to help you.
Short link to this page: arquivo.pt/api