
Session raw discussion 2

jrault edited this page Dec 21, 2012 · 1 revision

Table of Contents

methodology

crawl scenario

ediaspora atlas

  1. get web entity
  2. crawl at distance 1
    • Q: redefinition of distance and depth in relation to the web entity. Limiting the crawl to distance 1 is a good idea to keep things manageable.
    • Q: what is the default behavior, how are exceptions defined, and in what scope?
  3. get seed in and seed out to decide whether the web resources discovered at distance 1 co-link the current entity
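
The three steps above can be sketched as follows. This is only an illustration, not the tool's implementation: it assumes a web entity is a set of page URLs, that the link graph is already available as an in-memory dictionary (a hypothetical `LinkGraph`), and that "co-link" means a discovered page links back to the entity. All names and example URLs are made up.

```python
from typing import Dict, Set

# Hypothetical in-memory link graph: page URL -> set of URLs it links to.
LinkGraph = Dict[str, Set[str]]


def discover_at_distance_one(entity_pages: Set[str], links: LinkGraph) -> Set[str]:
    """Return pages linked from the entity's pages that lie outside the entity."""
    discovered: Set[str] = set()
    for page in entity_pages:
        discovered |= links.get(page, set())
    return discovered - entity_pages


def co_linking(entity_pages: Set[str], links: LinkGraph) -> Set[str]:
    """Return discovered pages that also link back to the entity (assumed meaning of co-link)."""
    discovered = discover_at_distance_one(entity_pages, links)
    return {p for p in discovered if links.get(p, set()) & entity_pages}


# Toy data: b.example links back to the entity, c.example does not.
graph: LinkGraph = {
    "a.example/": {"a.example/about", "b.example/", "c.example/"},
    "a.example/about": {"a.example/"},
    "b.example/": {"a.example/"},
    "c.example/": {"d.example/"},
}
entity = {"a.example/", "a.example/about"}
```

Keeping the crawl at distance 1 means `d.example/` is never visited here; it only appears as an outgoing link of a discovered page.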

HCI proposal

  1. import
    1. URL
      • rules to create web entities from URL with/without heuristics and then refine
      • extract stem and consider it as a web entity
      • extract stem and consider it as a web entity AND ask for pages
    2. web entities: a grammar has to be defined, but a stem string might be a good idea.
    3. codebook
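
The "extract stem and consider it as a web entity" rule could look like the sketch below. The notes do not define "stem", so this assumes, purely for illustration, that the stem is the lowercased host name without a leading `www.`; `stem_from_url` and `group_by_stem` are hypothetical helper names.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def stem_from_url(url: str) -> str:
    """Assumed definition of a stem: the host name, minus a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def group_by_stem(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Create one candidate web entity per stem, listing the imported URLs under it."""
    entities: Dict[str, List[str]] = {}
    for url in urls:
        entities.setdefault(stem_from_url(url), []).append(url)
    return entities
```

A heuristic-free rule like this gives a first cut that the user can then refine, for example by splitting one stem into several entities or by asking for specific pages.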

various features

  • focused crawling is a special feature with a prospective aim. This task will require a distance greater than 1.
  • history of exploration or traceability - time of day of each crawl, the order in which pages were crawled, which definitions and redefinitions of web entities occurred, etc.
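
One possible shape for the exploration history is an append-only event log, sketched below. The event kinds (`"crawl"`, `"redefine"`) and the class names are assumptions made for this example; the notes only say that time, crawl order, and entity redefinitions should be traceable.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class HistoryEvent:
    order: int           # position in the exploration, starting at 0
    timestamp: datetime  # when the event was recorded (UTC)
    kind: str            # hypothetical kinds: "crawl" or "redefine"
    detail: str          # page URL or description of the entity change


@dataclass
class ExplorationHistory:
    events: List[HistoryEvent] = field(default_factory=list)

    def record(self, kind: str, detail: str) -> HistoryEvent:
        """Append an event; earlier events are never modified."""
        event = HistoryEvent(len(self.events), datetime.now(timezone.utc), kind, detail)
        self.events.append(event)
        return event

    def of_kind(self, kind: str) -> List[HistoryEvent]:
        """Return events of one kind, in the order they happened."""
        return [e for e in self.events if e.kind == kind]
```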

user interface

  • hide technical details of the crawling and expert terminology
  • possibly create a branch offering a tool for crawling experts that matches their expectations
  • slider to define the web entity

user actions

  • HCI is live crawling software: it requires human intervention at each step
  • each step is defined and planned in a task manager
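
The two points above, planned steps plus a human decision before each one, can be sketched as a small queue. This is a hypothetical design, not HCI's task manager: `TaskManager`, `plan`, and `run_next` are invented names, and the `confirm` callback stands in for whatever the interface would ask the user.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

# A step is a name shown to the user plus the action to run once confirmed.
Step = Tuple[str, Callable[[], str]]


class TaskManager:
    def __init__(self) -> None:
        self._planned: Deque[Step] = deque()
        self.done: List[str] = []

    def plan(self, name: str, action: Callable[[], str]) -> None:
        """Add a step to the end of the plan."""
        self._planned.append((name, action))

    def pending(self) -> List[str]:
        """Names of the steps still waiting for confirmation."""
        return [name for name, _ in self._planned]

    def run_next(self, confirm: Callable[[str], bool]) -> bool:
        """Run the next step only if the user confirms it; return True if it ran."""
        if not self._planned:
            return False
        name, action = self._planned[0]
        if not confirm(name):
            return False  # the step stays planned until the user agrees
        self._planned.popleft()
        self.done.append(action())
        return True
```

A declined step stays at the head of the queue, so nothing runs without an explicit human decision.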