Releases: medialab/hyphe
Early 2024
Back-to-school papercuts
ChangeLog:
- Add a button to export metadata from all pages of a webentity (#318)
- Explicitly separate startpages warnings regarding redirected pages and faulty ones (#379)
- Allow setting a specific User-Agent per crawl within the web interface (#461)
- Display hints on the meaning of the different possible statuses of a crawl (#474)
- Highlight corresponding webentities when hovering over a status or a tag in the network legend (#459)
- Switch the User-Agents list used within crawls to rely on https://www.useragents.me/ (#453)
- Various improvements (cleaner backend logs, removal of empty traphs directories (#475), updated heuristics for the webentity links calculation rhythm, visual fixes (#476, #477))
Hot Summer '23
ChangeLog:
- Migrated WELinks caching to (working) files instead of mongo to handle huge corpuses
- Allow setting the archives pass as an ENV variable for Docker instances
- Display the time required by links indexation on the overview
Summer '23
ChangeLog:
- Added handling of more webarchives as sources (Arquivo.pt + INA DLWeb) + fixed various webarchives frontend info (#469, #471,
- Added a corpus setting "ignore internal links" to crawl without recording links within the currently crawled webentity, in order to drastically speed up indexation of entities with crazy amounts of links. This comes at a cost in terms of functionality: the network of internal pages is then not available, and entities that are split after a crawl will need to be recrawled (cf #371, #378, #433)
- Better handle frontend warning on pending actions when trying to close a tab (#465, #466)
- Minor fixes (#448, #460, #467, #468, #470, 50d97e8, 85decf2)
Better, faster, stronger traph, there it is!
ChangeLog:
- Switched to the new, breaking version 2.1 of hyphe-traph, which should help speed up indexation on big networks, but requires rebuilding corpuses from scratch
- Made iterator traph calls less frequent so that quick user actions take priority
- Fixed stack on calling empty callback in List Webentities
- Upgraded urllib3 to handle SSL deprecation
- Froze dependencies to maintain python2.7 compat
Summer '22
ChangeLog:
- Upgraded User Agents list
- Added extra default WebEntity CreationRules for Github, Instagram, TikTok, Reddit and a bunch of blog platforms
- Added perma.cc to list of default autofollowlinks
- Diverse fixes and extra features for webarchives (links to archive permalinks, etc.)
- Minor bugfixes
Spring '22
ChangeLog:
- Added a distinction between successful and errored crawled pages to identify Suspicious crawls (#425)
- Fixed frontend compatibility within Hyphe-Browser (medialab/hyphe-browser#212)
- Fixed WebArchives crawling interface (#431) and behavior from BNF's archives (#426)
- Improved network page's interaction using latest sigma.js v2.2 (node highlight etc & #367)
- Allowed frontend to automatically restart a closed corpus when reopening the frontend directly on a specific corpus link (#440)
- Allowed checking contiguous checkboxes in the frontend's lists of webentities using the Shift key (#438)
- Allowed tuning the frontend's header color from the config (#430)
- Published Hyphe on Zenodo & Software Heritage
- Minor fixes (#397, #388, #432, #429, #437, #343, #341, #444, #325)
Robots sensitive crawls (stabilized)
ChangeLog:
- Fixed environment variable OBEY_ROBOTS for Docker instance
- Added explanation helpers in frontend
- Fixed undeletable corpora
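
For readers unfamiliar with what a "robots sensitive" crawl implies, here is a minimal, generic sketch of a robots.txt check using Python's standard library. It is not Hyphe's actual implementation, and it does not show how OBEY_ROBOTS is read; the function name `is_allowed`, the sample rules and the user agent string are hypothetical and only serve the illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt content lets user_agent fetch url.

    Hypothetical helper for illustration only, not part of Hyphe.
    """
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network call is made
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules: everything under /private/ is off limits for every crawler
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("https://example.org/public/page",
                "https://example.org/private/page"):
        print(url, is_allowed(SAMPLE_ROBOTS, "sketch-crawler", url))
```

A crawler that obeys robots.txt would skip any URL for which such a check returns False, which is why some start pages may yield no crawled page at all when the option is enabled.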
Robots sensitive crawls
WebArchives powered crawls
ChangeLog:
- Allow starting crawls on Web Archives to browse webentities as they were in the past, including ones that have since disappeared or been modified (#372)
- Allow setting up advanced individual crawl settings (using a specific cookie, adjusting the depth, using a web archive...)
- Allow displaying only crawled pages in a webentity's webpages list
- Upgraded fake user agents dependency for more recent UAs
- Add an API route to collect the content of a crawled webentity's webpages as clear text instead of zipped base64
- Minor fixes (#397, #416, #418, 8b8f73f, 3b48755, 6aea48a, f3c1e85, e97b9d0, b05d470, 01aac8a, ...)
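
To illustrate why a clear-text route is convenient, the sketch below shows what a client typically has to do with "zipped base64" content. It assumes that "zipped" means zlib compression and that the text is UTF-8; both are assumptions, as is every name in the code, and none of this is taken from Hyphe's API.

```python
import base64
import zlib


def encode_page(text):
    """Compress text with zlib, then base64-encode it (assumed storage format)."""
    return base64.b64encode(zlib.compress(text.encode("utf-8"))).decode("ascii")


def decode_page(blob):
    """Reverse encode_page: base64-decode, then zlib-decompress to clear text."""
    return zlib.decompress(base64.b64decode(blob)).decode("utf-8")


if __name__ == "__main__":
    html = "<html><body>Hello Hyphe</body></html>"
    blob = encode_page(html)
    # The round trip gives back the original page content
    print(decode_page(blob) == html)
```

With a clear-text route, this decoding step is no longer needed on the client side.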