Main Features of the Arquivo.pt full text search engine
The main goal of the Arquivo.pt web archive is to preserve and provide access to web content that is no longer available online on its original websites. Arquivo.pt was formerly known as the Portuguese Web Archive (PWA).
During the development of the Arquivo.pt Information Retrieval system we faced limitations in search speed, quality of results, scalability and usability. To cope with these, we modified the Archive-access project to support our web archive IR requirements. The Nutchwax, Nutch and Wayback code was adapted to meet these requirements.
Several optimizations were added, such as simplifications in the way document versions are searched, and several bottlenecks were resolved. The Arquivo.pt search engine is a public service and a research platform for web archiving. Like its predecessor Nutch, it runs on Hadoop clusters for distributed computing following the map-reduce paradigm. Its major features include fast full-text search, URL search, phrase search, faceted search (date, format, site), and sorting by relevance and date.
The Arquivo.pt search engine is highly scalable and its architecture is flexible enough to allow different configurations to be deployed in response to different needs.
The software is based on Nutchwax 0.11.0 and Wayback 1.2.1 and also uses Hadoop clusters for distributed computing following the map-reduce paradigm.
The principal features of the Arquivo.pt search engine are:
- Full-text search
- URL search
- Phrase search
- Faceted search (date, format, site)
- Sorting by relevance and date
- Graceful handling of more than 180M documents spanning several years
- A full-text search user interface designed for Web archives
During the development of the Arquivo.pt IR (information retrieval) system we faced limitations in search speed, quality of results, scalability and usability that we needed to solve.
The improvements of the Arquivo.pt search engine over the original code are:
- improved response speed
- added cache mechanisms for highly requested data at runtime
- added distributed and replicated indexes to parallelize load and scale out the system
- improved search results' quality
- added new ranking algorithms, including time-aware algorithms
- improved the usability and readability of the interface, especially the search result page
- improved the search interface to allow users to restrict searches by time
- added detection of URLs in full-text search queries for contextual search results
- implemented several optimizations and removed bottlenecks
This list should grow as we further improve our software.
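As an illustration of the runtime cache mentioned in the list above, here is a minimal Python sketch rather than the actual Arquivo.pt code (which builds on Nutchwax and Wayback). The names `load_timestamp_from_index`, `cached_timestamp` and the sample timestamps are hypothetical; the point is only that repeated requests for the same document are answered from memory instead of the index.

```python
from functools import lru_cache

# Hypothetical stand-in for an expensive index lookup.
# LOOKUPS counts how many times the index is actually read.
LOOKUPS = {"count": 0}
_FAKE_INDEX = {1: "20080315120000", 2: "20101102093000"}  # doc id -> timestamp (made-up data)


def load_timestamp_from_index(doc_id: int) -> str:
    """Simulate reading a document's capture timestamp from the index."""
    LOOKUPS["count"] += 1
    return _FAKE_INDEX[doc_id]


@lru_cache(maxsize=100_000)  # cache size is an arbitrary assumption
def cached_timestamp(doc_id: int) -> str:
    """Return a document's timestamp, serving repeated requests from memory."""
    return load_timestamp_from_index(doc_id)


# Five requests, but only two distinct documents: the index is read twice.
for doc_id in (1, 2, 1, 1, 2):
    cached_timestamp(doc_id)
print(LOOKUPS["count"])  # 2
```

A least-recently-used policy is one reasonable choice for "highly requested" data, since popular entries stay in memory while rarely used ones are evicted.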
This section further details the improvements previously cited. They are divided according to our three main development axes: performance, quality of results, and UI/Usability.
Several performance improvements were needed to achieve our objective of returning more than 90% of queries in under 5 seconds:
- pruning indexes to reduce their size and sorting them by an importance measure to return good results without needing to read the full index entries (i.e. posting lists).
- adding cache mechanisms for highly requested data at runtime, such as documents’ timestamps or index statistics.
- distributing and replicating indexes in multiple servers to parallelize load and scale-out the system.
- reducing communication by aggregating requests.
- providing direct access to archived documents by encapsulating index information in URLs.
To improve the quality of results, the work included:
- engineering new ranking algorithms: query-term based, term-distance based, URL based, and web-graph based.
- allowing the ranking model to be chosen at runtime or configured in XML.
- logging and mining of users' search behaviors and patterns.
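To make the idea of choosing a ranking model at runtime more concrete, the following Python sketch registers two toy scoring functions, one query-term based and one term-distance based, and selects between them by name. It is only an illustration under simplifying assumptions: the scoring formulas, the `RANKING_MODELS` registry and the `search` function are hypothetical and are not the algorithms used by Arquivo.pt.

```python
from typing import Callable, Dict, List

Doc = List[str]  # a document represented as a list of lowercase tokens


def term_frequency_score(query: List[str], doc: Doc) -> float:
    """Query-term based: share of the document's tokens that are query terms."""
    if not doc:
        return 0.0
    return sum(doc.count(term) for term in query) / len(doc)


def term_distance_score(query: List[str], doc: Doc) -> float:
    """Term-distance based: higher when the first two query terms occur close together."""
    if len(query) < 2:
        return term_frequency_score(query, doc)
    pos_a = [i for i, tok in enumerate(doc) if tok == query[0]]
    pos_b = [i for i, tok in enumerate(doc) if tok == query[1]]
    if not pos_a or not pos_b:
        return 0.0
    closest = min(abs(a - b) for a in pos_a for b in pos_b)
    return 1.0 / (1 + closest)


# Registry of available models; the key could equally come from a configuration file.
RANKING_MODELS: Dict[str, Callable[[List[str], Doc], float]] = {
    "term_frequency": term_frequency_score,
    "term_distance": term_distance_score,
}


def search(query: str, docs: Dict[str, Doc], model: str = "term_frequency") -> List[str]:
    """Return document ids ordered by the score of the model selected at runtime."""
    score = RANKING_MODELS[model]
    terms = query.lower().split()
    return sorted(docs, key=lambda doc_id: score(terms, docs[doc_id]), reverse=True)


docs = {
    "a": "web web web portal with a large archive".split(),
    "b": "the portuguese web archive preserves pages".split(),
}
print(search("web archive", docs, model="term_frequency"))  # ['a', 'b']
print(search("web archive", docs, model="term_distance"))   # ['b', 'a']
```

The two models disagree on this small example: document "a" repeats the query terms more often, while document "b" contains them side by side.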
Several usability analyses and tests were conducted to drive the development of the PWA search engine user interface. The tests performed by the HCIM research group showed that the system achieved an overall average user satisfaction of 70%.
The original interface lacked needed features and was confusing to our users. Thus, changes and adjustments were made, such as:
- creation of the advanced search interface.
- bilingual support (in Portuguese and English).
- development of the search suggestions/corrections feature.
- improvement of the search interface to allow users to restrict searches by time using either text fields or datepicker widgets. Restricting date ranges in the original interface required an understanding of the inner workings of the search engine.
- customization of jQuery's datepicker to work as users expect for Web archive use cases.
- detection of URLs in full-text search queries for contextual search results. This way we can use a single text field for search queries. The search engine distinguishes URLs from full-text queries and presents the relevant search result page for each case.
- merging of the Nutchwax (full-text search) and Wayback (URL search) interfaces so text and URL search have a consistent design.
- change of the layout of the search result page so it has a more familiar structure, based on current search engines, and a similar look and feel.
- improvement of the readability and usability of the search result page by changing several aspects such as typeface, font size, color, etc.
- optimization of the design of the search engine pages so they are smaller in size and load more quickly.
- fixing of problems with the top banner on archived pages, where the archived page styling interfered with the banner styling.
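The single search field described in the list above needs some way to tell a URL from a text query. The Python sketch below shows one simple heuristic, assuming that a URL is a whitespace-free string with an optional scheme and a dotted host name. The pattern and the `classify_query` function are hypothetical illustrations, not the detection logic actually used by Arquivo.pt.

```python
import re

# Assumed heuristic: optional http(s) scheme, dotted host name ending in a
# letters-only suffix, optional port and path, and no whitespace anywhere.
URL_PATTERN = re.compile(
    r"^(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?::\d+)?(?:/\S*)?$",
    re.IGNORECASE,
)


def classify_query(query: str) -> str:
    """Return "url" if the query looks like a URL, otherwise "fulltext"."""
    return "url" if URL_PATTERN.match(query.strip()) else "fulltext"


print(classify_query("http://arquivo.pt/"))      # url
print(classify_query("www.example.pt/noticias"))  # url
print(classify_query("portuguese web archive"))   # fulltext
```

A query classified as "url" would be sent to the URL search result page and anything else to the full-text result page. A heuristic like this can misclassify edge cases (for example, a bare file name such as "index.html"), so a real implementation would likely need additional rules.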
Check our publications page