API documentation

Hyphe relies on a JsonRPC API that can be controlled easily through the web interface or called directly from a JsonRPC client.

Note: because the API relies on the JSON-RPC protocol, its methods are not easy to test from a browser (arguments have to be sent through POST), but you can test them directly from the command line using the dedicated tools; see the Developers' documentation.

Data & Query format

The current JSON-RPC 1.0 implementation requires arguments to be provided as an ordered array of the method's arguments. Calling with named arguments is possible but not well handled and not recommended until we migrate to REST.
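As an illustration of the ordered-array convention, here is a minimal sketch of how a request body could be assembled. The `method`/`params`/`id` envelope is the standard JSON-RPC 1.0 one; the helper name `build_request`, the incrementing id scheme and the corpus id `my-corpus` are assumptions made for the example, not part of Hyphe.

```python
import json
from itertools import count

# Incrementing request ids: an arbitrary choice for this sketch.
_request_ids = count(1)


def build_request(method, *params):
    """Return a JSON-RPC 1.0 request body with positional params.

    Hypothetical helper: arguments are sent as an ordered array,
    in the order documented for each method.
    """
    return json.dumps({
        "method": method,
        "params": list(params),
        "id": next(_request_ids),
    })


# declare_page takes url first, then corpus.
print(build_request("declare_page", "http://example.com/", "my-corpus"))
```

The resulting string would then be sent through POST to wherever your Hyphe instance exposes its API.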

The API will always answer as such:

  • Success:
{
  "code": "success",
  "result": "<The actual expected result, possibly an object, an array, a number, a string, ...>"
}
  • Error:
{
  "code": "fail",
  "message": "<A string describing the possible cause of the error.>"
}
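A client will typically want to turn these two answer shapes into either a plain value or an exception. The sketch below works on the already decoded answer object shown above; `HypheError` and `unwrap` are hypothetical names, and where exactly this object sits inside the JSON-RPC envelope is not specified here, so extracting it is left to the caller.

```python
class HypheError(Exception):
    """Raised when the API answers with code "fail" (hypothetical name)."""


def unwrap(answer):
    """Return the "result" of a successful answer, raise otherwise."""
    if answer.get("code") == "success":
        return answer.get("result")
    # Anything else is treated as a failure; "message" describes the cause.
    raise HypheError(answer.get("message", "unknown error"))


print(unwrap({"code": "success", "result": "ready"}))
```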

Summary

  • Default API commands (no namespace)
  • Commands for namespace: "crawl."
    • deploy_crawler
    • delete_crawler
    • cancel_all
    • start
    • cancel
    • get_job_logs
  • Commands for namespace: "store."
    • DEFINE WEBENTITIES
      • get_lru_definedprefixes
      • declare_webentity_by_lruprefix_as_url
      • declare_webentity_by_lru
      • declare_webentity_by_lrus_as_urls
      • declare_webentity_by_lrus
    • EDIT WEBENTITIES
      • basic_edit_webentity
      • rename_webentity
      • set_webentity_status
      • set_webentities_status
      • set_webentity_homepage
      • add_webentity_lruprefixes
      • rm_webentity_lruprefix
      • add_webentity_startpages
      • add_webentity_startpage
      • rm_webentity_startpages
      • rm_webentity_startpage
      • merge_webentity_into_another
      • merge_webentities_into_another
      • delete_webentity
    • RETRIEVE AND SEARCH WEBENTITIES
      • get_webentity
      • get_webentity_by_lruprefix
      • get_webentity_by_lruprefix_as_url
      • get_webentity_for_url
      • get_webentity_for_url_as_lru
      • get_webentities
      • search_webentities
      • wordsearch_webentities
      • get_webentities_by_status
      • get_webentities_by_name
      • get_webentities_by_tag_value
      • get_webentities_by_tag_category
      • get_webentities_mistagged
      • get_webentities_uncrawled
      • get_webentities_page
      • get_webentities_ranking_stats
    • TAGS
      • rebuild_tags_dictionary
      • add_webentity_tag_value
      • add_webentities_tag_value
      • rm_webentity_tag_value
      • rm_webentities_tag_value
      • edit_webentity_tag_value
      • get_tags
      • get_tag_namespaces
      • get_tag_categories
      • get_tag_values
    • PAGES, LINKS AND NETWORKS
      • get_webentity_pages
      • paginate_webentity_pages
      • get_webentity_mostlinked_pages
      • get_webentity_subwebentities
      • get_webentity_parentwebentities
      • get_webentity_pagelinks_network
      • paginate_webentity_pagelinks_network
      • get_webentity_referrers
      • get_webentity_referrals
      • get_webentity_ego_network
      • get_webentities_network
    • CREATION RULES
      • get_default_webentity_creationrule
      • get_webentity_creationrules
      • delete_webentity_creationrule
      • add_webentity_creationrule
      • simulate_creationrules_for_urls
      • simulate_creationrules_for_lrus
    • VARIOUS
      • trigger_links_build
      • get_webentities_stats

Default API commands (no namespace)

CORPUS HANDLING

  • test_corpus:
    • corpus (optional, default: "--hyphe--")

Returns the current status of a corpus: "ready"/"starting"/"missing"/"stopped"/"error".

  • list_corpus:
    • light (optional, default: true)

Returns the list of all existing corpora with metas.

  • get_corpus_options:
    • corpus (optional, default: "--hyphe--")

Returns detailed settings of a corpus.

  • set_corpus_options:
    • corpus (optional, default: "--hyphe--")
    • options (optional, default: null)

Updates the settings of a corpus according to the keys/values provided in options as a json object respecting the settings schema visible by querying get_corpus_options. Returns the detailed settings.

  • create_corpus:
    • name (optional, default: "--hyphe--")
    • password (optional, default: "")
    • options (optional, default: {})

Creates a corpus with the chosen name and optional password and options (as a json object see set/get_corpus_options). Returns the corpus generated id and status.

  • start_corpus:
    • corpus (optional, default: "--hyphe--")
    • password (optional, default: "")

Starts an existing corpus possibly password-protected. Returns the new corpus status.

  • stop_corpus:
    • corpus (optional, default: "--hyphe--")

Stops an existing and running corpus. Returns the new corpus status.

  • get_corpus_tlds:
    • corpus (optional, default: "--hyphe--")

Returns the tree of TLDs rules built from Mozilla's list at the creation of corpus.

  • backup_corpus:
    • corpus (optional, default: "--hyphe--")

Saves locally on the server in the archive directory a timestamped backup of corpus including 4 json backup files of all webentities/links/crawls and corpus options.

  • ping:
    • corpus (optional, default: null)
    • timeout (optional, default: 3)

Tests during timeout seconds whether an existing corpus is started. Returns "pong" on success or the corpus status otherwise.
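Since ping answers "pong" or the corpus status, it can be wrapped in a small polling loop when a corpus takes longer than one timeout to start. This is only a sketch: `call(method, *params)` is a hypothetical transport function returning the decoded "result" value, and the number of attempts and the pause between them are arbitrary.

```python
import time


def wait_until_started(call, corpus, attempts=5, pause=1.0, timeout=3):
    """Poll ping until it answers "pong"; return the last answer seen.

    `call` is a hypothetical function sending one request and
    returning its decoded result.
    """
    answer = None
    for attempt in range(attempts):
        # ping takes corpus first, then timeout (in seconds).
        answer = call("ping", corpus, timeout)
        if answer == "pong":
            break
        if attempt < attempts - 1:
            time.sleep(pause)
    return answer
```

A return value other than "pong" is the last corpus status reported, which the caller can log or act upon.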

  • reinitialize:
    • corpus (optional, default: "--hyphe--")

Completely resets a corpus by cancelling all crawls and emptying the Traph and Mongo data.

  • destroy_corpus:
    • corpus (optional, default: "--hyphe--")

Backs up, resets, then permanently deletes a corpus and anything associated with it.

  • force_destroy_corpus:
    • corpus (optional, default: "--hyphe--")

Completely and permanently deletes a corpus without restarting it (backup may be less complete).

  • clear_all:
    • except_corpus_ids (optional, default: [])

Resets Hyphe completely: starts then resets and destroys all existing corpora one by one except for those whose ID is given in except_corpus_ids.

CORE AND CORPUS STATUS

  • get_status:
    • corpus (optional, default: "--hyphe--")

Returns global metadata on Hyphe's status and specific information on a corpus.

BASIC PAGE DECLARATION (AND WEBENTITY CREATION)

  • declare_page:
    • url (mandatory)
    • corpus (optional, default: "--hyphe--")

Indexes a url into a corpus. Returns the (newly created or not) associated WebEntity.

  • declare_pages:
    • list_urls (mandatory)
    • corpus (optional, default: "--hyphe--")

Indexes a bunch of urls given as an array in list_urls into a corpus. Returns the (newly created or not) associated WebEntities.

BASIC CRAWL METHODS

  • listjobs:
    • list_ids (optional, default: null)
    • from_ts (optional, default: null)
    • to_ts (optional, default: null)
    • light (optional, default: false)
    • corpus (optional, default: "--hyphe--")

Returns the list and details of all "finished"/"running"/"pending" crawl jobs of a corpus. Optionally returns only the jobs whose id is given in an array of list_ids and/or that were created after timestamp from_ts or before to_ts. Set light to true to get only essential metadata for heavy queries.
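Because arguments are positional, setting a late optional argument such as corpus means filling every earlier one with its default. The sketch below shows one way to do this; `positional_args` is a hypothetical helper, the parameter table is copied from the listjobs entry above, and dropping trailing defaults assumes the server applies them itself.

```python
# Parameter order and defaults copied from the listjobs entry above.
LISTJOBS_PARAMS = [
    ("list_ids", None),
    ("from_ts", None),
    ("to_ts", None),
    ("light", False),
    ("corpus", "--hyphe--"),
]


def positional_args(spec, **given):
    """Turn keyword arguments into the ordered array the API expects.

    Defaults preceding an explicitly given argument are kept to
    preserve positions; trailing defaults are dropped.
    """
    unknown = set(given) - {name for name, _ in spec}
    if unknown:
        raise TypeError(f"unknown argument(s): {sorted(unknown)}")
    args = [given.get(name, default) for name, default in spec]
    last = max(
        (i for i, (name, _) in enumerate(spec) if name in given),
        default=-1,
    )
    return args[: last + 1]


print(positional_args(LISTJOBS_PARAMS, light=True, corpus="my-corpus"))
# [None, None, None, True, 'my-corpus']
```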

  • propose_webentity_startpages:
    • webentity_id (mandatory)
    • startmode (optional, default: "default")
    • categories (optional, default: false)
    • save_startpages (optional, default: false)
    • corpus (optional, default: "--hyphe--")

Returns a list of suggested startpages to crawl an existing WebEntity defined by its webentity_id using the "default" startmode defined for the corpus or one or an array of either the WebEntity's preset "startpages", "homepage" or "prefixes" or most seen "pages-". Returns them categorised by type of source if categories is set to true. Will save them into the WebEntity if save_startpages is true.

  • crawl_webentity:
    • webentity_id (mandatory)
    • depth (optional, default: 0)
    • phantom_crawl (optional, default: false)
    • status (optional, default: "IN")
    • proxy (optional, default: null)
    • cookies_string (optional, default: null)
    • user_agent (optional, default: null)
    • phantom_timeouts (optional, default: {})
    • webarchives (optional, default: {})
    • corpus (optional, default: "--hyphe--")

Schedules a crawl for a corpus for an existing WebEntity defined by its webentity_id with a specific crawl depth [int]. Optionally use PhantomJS by setting phantom_crawl to "true" and adjust specific phantom_timeouts as a json object with possible keys timeout/ajax_timeout/idle_timeout. Simultaneously sets the WebEntity's status to "IN" or optionally to another valid status ("undecided"/"out"/"discovered"). Optionally add an HTTP proxy specified as "domain_or_IP:port". Also optionally add a known cookies_string with auth rights to a protected website and/or a specific user_agent. Optionally use some webarchives by defining a json object with keys date/days_range/option, the latter being one of ""/"web.archive.org"/"archivesinternet.bnf.fr". Will use the WebEntity's startpages if it has any, or otherwise the corpus' "default" startmode heuristic as defined in propose_webentity_startpages (use crawl_webentity_with_startmode to apply a different heuristic, see details in propose_webentity_startpages).
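With ten positional arguments, it can help to assemble the array for crawl_webentity through a small function that mirrors the documented order and defaults. This is a sketch only: the function name is hypothetical, the webentity id `42` and the timeout value are placeholders, and the check on phantom_timeouts keys simply reflects the three keys listed above.

```python
def crawl_webentity_params(webentity_id, depth=0, phantom_crawl=False,
                           status="IN", proxy=None, cookies_string=None,
                           user_agent=None, phantom_timeouts=None,
                           webarchives=None, corpus="--hyphe--"):
    """Return the ordered argument array for crawl_webentity."""
    allowed = {"timeout", "ajax_timeout", "idle_timeout"}
    phantom_timeouts = dict(phantom_timeouts or {})
    unexpected = set(phantom_timeouts) - allowed
    if unexpected:
        raise ValueError(f"unknown phantom_timeouts keys: {sorted(unexpected)}")
    # Same order as the parameter list documented above.
    return [webentity_id, depth, phantom_crawl, status, proxy,
            cookies_string, user_agent, phantom_timeouts,
            dict(webarchives or {}), corpus]


# Placeholder values for illustration only.
print(crawl_webentity_params(42, depth=1, phantom_crawl=True,
                             phantom_timeouts={"timeout": 60},
                             corpus="my-corpus"))
```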

  • crawl_webentity_with_startmode:
    • webentity_id (mandatory)
    • depth (optional, default: 0)
    • phantom_crawl (optional, default: false)
    • status (optional, default: "IN")
    • startmode (optional, default: "default")
    • proxy (optional, default: null)
    • cookies_string (optional, default: null)
    • user_agent (optional, default: null)
    • phantom_timeouts (optional, default: {})
    • webarchives (optional, default: {})
    • save_startpages (optional, default: false)
    • corpus (optional, default: "--hyphe--")

Schedules a crawl for a corpus for an existing WebEntity defined by its webentity_id with a specific crawl depth [int]. Optionally use PhantomJS by setting phantom_crawl to "true" and adjust specific phantom_timeouts as a json object with possible keys timeout/ajax_timeout/idle_timeout. Simultaneously sets the WebEntity's status to "IN" or optionally to another valid status ("undecided"/"out"/"discovered"). Optionally add an HTTP proxy specified as "domain_or_IP:port". Also optionally add a known cookies_string with auth rights to a protected website and/or a specific user_agent. Optionally define the startmode strategy differently from the corpus' "default" one (see details in propose_webentity_startpages). Optionally use some webarchives by defining a json object with keys date/days_range/option, the latter being one of ""/"web.archive.org"/"archivesinternet.bnf.fr".

  • get_webentity_jobs:
    • webentity_id (mandatory)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus the crawl jobs that have run for a specific WebEntity defined by its webentity_id.

  • cancel_webentity_jobs:
    • webentity_id (mandatory)
    • corpus (optional, default: "--hyphe--")

Cancels for a corpus all running or pending crawl jobs that were booked for a specific WebEntity defined by its webentity_id.

  • get_webentity_logs:
    • webentity_id (mandatory)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus crawl activity logs on a specific WebEntity defined by its webentity_id.

HTTP LOOKUP METHODS

  • lookup_httpstatus:
    • url (mandatory)
    • timeout (optional, default: 30)
    • corpus (optional, default: "--hyphe--")

Tests a url for timeout seconds using a corpus specific connection (possible proxy for instance). Returns the url's HTTP code.

  • lookup:
    • url (mandatory)
    • timeout (optional, default: 30)
    • corpus (optional, default: "--hyphe--")

Tests a url for timeout seconds using a corpus specific connection (possible proxy for instance). Returns a boolean indicating whether lookup_httpstatus returned HTTP code 200 or a redirection code (301/302/...).

Commands for namespace: "crawl."

  • deploy_crawler:
    • corpus (optional, default: "--hyphe--")

Prepares and deploys on the ScrapyD server a spider (crawler) for a corpus.

  • delete_crawler:
    • corpus (optional, default: "--hyphe--")

Removes from the ScrapyD server an existing spider (crawler) for a corpus.

  • cancel_all:
    • corpus (optional, default: "--hyphe--")

Stops all "running" and "pending" crawl jobs for a corpus.

Cancels all current crawl jobs running or planned for a corpus and empties related Mongo data.

  • start:
    • webentity_id (mandatory)
    • starts (mandatory)
    • follow_prefixes (mandatory)
    • nofollow_prefixes (mandatory)
    • follow_redirects (optional, default: null)
    • depth (optional, default: 0)
    • phantom_crawl (optional, default: false)
    • phantom_timeouts (optional, default: {})
    • download_delay (optional, default: 1)
    • proxy (optional, default: null)
    • cookies_string (optional, default: null)
    • user_agent (optional, default: null)
    • webarchives (optional, default: {})
    • corpus (optional, default: "--hyphe--")

Starts a crawl for a corpus defining finely the crawl options (mainly for debug purposes):

  • a webentity_id associated with the crawl
  • a list of starts urls to start from
  • a list of follow_prefixes to know which links to follow
  • a list of nofollow_prefixes to know which links to avoid
  • a depth corresponding to the maximum number of clicks done from the start pages
  • phantom_crawl set to "true" to use PhantomJS for this crawl and optional phantom_timeouts as an object with keys among timeout/ajax_timeout/idle_timeout
  • a download_delay corresponding to the time in seconds spent between two requests by the crawler.
  • an HTTP proxy specified as "domain_or_IP:port"
  • a known cookies_string with auth rights to a protected website
  • a specific user_agent
  • optionally some webarchives, defined as a json object with keys date/days_range/option, the latter being one of ""/"web.archive.org"/"archivesinternet.bnf.fr".

  • cancel:
    • job_id (mandatory)
    • corpus (optional, default: "--hyphe--")

Cancels a crawl of id job_id for a corpus.

  • get_job_logs:
    • job_id (mandatory)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus activity logs of a specific crawl with id job_id.
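Commands of a namespace are presumably addressed by prefixing the command name with the namespace shown in the section title, for instance `crawl.get_job_logs`. The sketch below builds such a request body; `namespaced_request` is a hypothetical helper, and the job id and corpus id are placeholders.

```python
import json


def namespaced_request(namespace, command, params, request_id=1):
    """Build a JSON-RPC 1.0 body for a command of a given namespace.

    Assumes the method name is the namespace ("crawl." or "store.")
    followed by the command; an empty namespace means a default command.
    """
    method = f"{namespace.rstrip('.')}.{command}" if namespace else command
    return json.dumps({
        "method": method,
        "params": list(params),
        "id": request_id,
    })


# get_job_logs takes job_id first, then corpus (placeholder values).
print(namespaced_request("crawl.", "get_job_logs", ["<job_id>", "my-corpus"]))
```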

Commands for namespace: "store."

DEFINE WEBENTITIES

  • get_lru_definedprefixes:
    • lru (mandatory)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus a list of all possible LRU prefixes shorter than lru and already attached to WebEntities.

  • declare_webentity_by_lruprefix_as_url:
    • url (mandatory)
    • name (optional, default: null)
    • status (optional, default: null)
    • startpages (optional, default: [])
    • lruVariations (optional, default: true)
    • tags (optional, default: {})
    • corpus (optional, default: "--hyphe--")

Creates for a corpus a WebEntity defined for the LRU prefix given as a url and optionally for the corresponding http/https and www/no-www variations if lruVariations is true. Optionally set the newly created WebEntity's name, status ("in"/"out"/"undecided"/"discovered") and list of startpages. Returns the newly created WebEntity.

  • declare_webentity_by_lru:
    • lru_prefix (mandatory)
    • name (optional, default: null)
    • status (optional, default: null)
    • startpages (optional, default: [])
    • lruVariations (optional, default: true)
    • tags (optional, default: {})
    • corpus (optional, default: "--hyphe--")

Creates for a corpus a WebEntity defined for a lru_prefix and optionally for the corresponding http/https and www/no-www variations if lruVariations is true. Optionally set the newly created WebEntity's name, status ("in"/"out"/"undecided"/"discovered") and list of startpages. Returns the newly created WebEntity.

  • declare_webentity_by_lrus_as_urls:
    • list_urls (mandatory)
    • name (optional, default: null)
    • status (optional, default: null)
    • startpages (optional, default: [])
    • lruVariations (optional, default: true)
    • tags (optional, default: {})
    • corpus (optional, default: "--hyphe--")

Creates for a corpus a WebEntity defined for a set of LRU prefixes given as URLs under list_urls and optionally for the corresponding http/https and www/no-www variations if lruVariations is true. Optionally set the newly created WebEntity's name, status ("in"/"out"/"undecided"/"discovered") and list of startpages. Returns the newly created WebEntity.

  • declare_webentity_by_lrus:
    • list_lrus (mandatory)
    • name (optional, default: null)
    • status (optional, default: "")
    • startpages (optional, default: [])
    • lruVariations (optional, default: true)
    • tags (optional, default: {})
    • corpus (optional, default: "--hyphe--")

Creates for a corpus a WebEntity defined for a set of LRU prefixes given as list_lrus and optionally for the corresponding http/https and www/no-www variations if lruVariations is true. Optionally set the newly created WebEntity's name, status ("in"/"out"/"undecided"/"discovered") and list of startpages. Returns the newly created WebEntity.

EDIT WEBENTITIES

  • basic_edit_webentity:
    • webentity_id (mandatory)
    • name (optional, default: null)
    • status (optional, default: null)
    • homepage (optional, default: null)
    • corpus (optional, default: "--hyphe--")

Changes for a corpus at once the name, status and homepage of a WebEntity defined by webentity_id.

  • rename_webentity:
    • webentity_id (mandatory)
    • new_name (mandatory)
    • corpus (optional, default: "--hyphe--")

Changes for a corpus the name of a WebEntity defined by webentity_id to new_name.

  • set_webentity_status:
    • webentity_id (mandatory)
    • status (mandatory)
    • corpus (optional, default: "--hyphe--")

Changes for a corpus the status of a WebEntity defined by webentity_id to status (one of "in"/"out"/"undecided"/"discovered").

  • set_webentities_status:
    • webentity_ids (mandatory)
    • status (mandatory)
    • corpus (optional, default: "--hyphe--")

Changes for a corpus the status of a set of WebEntities defined by a list of webentity_ids to status (one of "in"/"out"/"undecided"/"discovered").

  • set_webentity_homepage:
    • webentity_id (mandatory)
    • homepage (optional, default: "")
    • corpus (optional, default: "--hyphe--")

Changes for a corpus the homepage of a WebEntity defined by webentity_id to homepage.

  • add_webentity_lruprefixes:
    • webentity_id (mandatory)
    • lru_prefixes (mandatory)
    • corpus (optional, default: "--hyphe--")

Adds for a corpus a list of lru_prefixes (or a single one) to a WebEntity defined by webentity_id.

  • rm_webentity_lruprefix:
    • webentity_id (mandatory)
    • lru_prefix (mandatory)
    • corpus (optional, default: "--hyphe--")

Removes for a corpus a lru_prefix from the list of prefixes of a WebEntity defined by webentity_id. Will delete the WebEntity if it ends up with no LRU prefix left.

  • add_webentity_startpages:
    • webentity_id (mandatory)
    • startpages_urls (mandatory)
    • corpus (optional, default: "--hyphe--")

Adds for a corpus a list of startpages_urls to the list of startpages to use when crawling the WebEntity defined by webentity_id.

  • add_webentity_startpage:
    • webentity_id (mandatory)
    • startpage_url (mandatory)
    • corpus (optional, default: "--hyphe--")

Adds for a corpus a startpage_url to the list of startpages to use when crawling the WebEntity defined by webentity_id.

  • rm_webentity_startpages:
    • webentity_id (mandatory)
    • startpages_urls (mandatory)
    • corpus (optional, default: "--hyphe--")

Removes for a corpus a list of startpages_urls from the list of startpages to use when crawling the WebEntity defined by webentity_id.

  • rm_webentity_startpage:
    • webentity_id (mandatory)
    • startpage_url (mandatory)
    • corpus (optional, default: "--hyphe--")

Removes for a corpus a startpage_url from the list of startpages to use when crawling the WebEntity defined by webentity_id.

  • merge_webentity_into_another:
    • old_webentity_id (mandatory)
    • good_webentity_id (mandatory)
    • include_tags (optional, default: false)
    • include_home_and_startpages_as_startpages (optional, default: false)
    • include_name_and_status (optional, default: false)
    • corpus (optional, default: "--hyphe--")

Merges for a corpus two WebEntities by deleting the WebEntity defined by old_webentity_id and adding all of its LRU prefixes to the one defined by good_webentity_id. Optionally set include_tags and/or include_home_and_startpages_as_startpages and/or include_name_and_status to "true" to also add the tags and/or startpages and/or name and status to the resulting merged WebEntity.

  • merge_webentities_into_another:
    • old_webentity_ids (mandatory)
    • good_webentity_id (mandatory)
    • include_tags (optional, default: false)
    • include_home_and_startpages_as_startpages (optional, default: false)
    • corpus (optional, default: "--hyphe--")

Merges for a corpus a set of WebEntities by deleting the WebEntities defined by a list of old_webentity_ids and adding all of their LRU prefixes to the one defined by good_webentity_id. Optionally set include_tags and/or include_home_and_startpages_as_startpages to "true" to also add the tags and/or startpages to the resulting merged WebEntity.

  • delete_webentity:
    • webentity_id (mandatory)
    • corpus (optional, default: "--hyphe--")

Removes from a corpus a WebEntity defined by webentity_id (mainly for advanced debug use).

RETRIEVE AND SEARCH WEBENTITIES

  • get_webentity:
    • webentity_id (mandatory)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus a WebEntity defined by its webentity_id.

  • get_webentity_by_lruprefix:
    • lru_prefix (mandatory)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus the WebEntity having lru_prefix as one of its LRU prefixes.

  • get_webentity_by_lruprefix_as_url:
    • url (mandatory)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus the WebEntity having one of its LRU prefixes corresponding to the LRU given under the form of a url.

  • get_webentity_for_url:
    • url (mandatory)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus the WebEntity to which a url belongs (meaning starting with one of the WebEntity's prefix and not another).

  • get_webentity_for_url_as_lru:
    • lru (mandatory)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus the WebEntity to which a url given under the form of a lru belongs (meaning starting with one of the WebEntity's prefix and not another).

  • get_webentities:
    • list_ids (optional, default: [])
    • sort (optional, default: null)
    • count (optional, default: 100)
    • page (optional, default: 0)
    • light (optional, default: false)
    • semilight (optional, default: false)
    • light_for_csv (optional, default: false)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus all existing WebEntities or only the WebEntities whose id is among list_ids. Results will be paginated, with count the number of results per page and page the number of the desired page of results. Returns all results at once if list_ids is provided or count is -1; otherwise results will include metadata on the request including the total number of results and a token to be reused to collect the other pages via get_webentities_page. Other possible options include:

  • order the results with sort by inputting a field or list of fields as named in the WebEntities returned objects; optionally prefix a sort field with a "-" to revert the sorting on it; for instance: ["-indegree", "name"] will order by maximum indegree first then by alphabetic order of names;
  • set light or semilight or light_for_csv to "true" to collect lighter data with less WebEntities fields.
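To make the sort convention concrete, the following sketch reproduces it on the client side for a list of WebEntity-like dictionaries. It is only an emulation of the described ordering, not the server's implementation; `sort_webentities` and the sample data are made up, and details such as case sensitivity on the server are not covered.

```python
def sort_webentities(webentities, sort_fields):
    """Order dicts following the sort convention described above.

    A leading "-" on a field name reverses the order for that field.
    """
    if isinstance(sort_fields, str):
        sort_fields = [sort_fields]
    ordered = list(webentities)
    # Apply keys from least to most significant: sorted() is stable.
    for field in reversed(sort_fields):
        descending = field.startswith("-")
        key = field[1:] if descending else field
        ordered = sorted(ordered, key=lambda we, k=key: we[k],
                         reverse=descending)
    return ordered


sample = [
    {"name": "b", "indegree": 3},
    {"name": "a", "indegree": 3},
    {"name": "c", "indegree": 10},
]
print([we["name"] for we in sort_webentities(sample, ["-indegree", "name"])])
# ['c', 'a', 'b']
```
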
  • search_webentities:
    • allFieldsKeywords (optional, default: [])
    • fieldKeywords (optional, default: [])
    • sort (optional, default: null)
    • count (optional, default: 100)
    • page (optional, default: 0)
    • light (optional, default: false)
    • semilight (optional, default: true)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities matching a specific search using the allFieldsKeywords and fieldKeywords arguments. Returns all results at once if count == -1; otherwise results will be paginated with count results per page, using page as index of the desired page. Results will include metadata on the request including the total number of results and a token to be reused to collect the other pages via get_webentities_page.

  • allFieldsKeywords should be a string or list of strings to search in all textual fields of the WebEntities ("name", "lru prefixes", "startpages" & "homepage"). For instance ["hyphe", "www"]
  • fieldKeywords should be a list of 2-element arrays giving first the field to search into, then the searched value or, optionally for the field "indegree", an array of minimum and maximum values to search between (notes: this does not work with undirected_degree and outdegree; only exact values will be matched when querying on the status field). For instance: [["name", "hyphe"], ["indegree", [3, 1000]]]
  • see description of sort, light and semilight in get_webentities above.
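The nested shape of fieldKeywords is easy to get wrong by hand, so a small builder can help. The sketch below is a hypothetical helper that produces the list of 2-element arrays described above and rejects ranges on fields other than "indegree", since that is the only field for which a range is documented.

```python
def field_keywords(**criteria):
    """Build the fieldKeywords argument from keyword criteria.

    A (min, max) pair is only accepted for the "indegree" field.
    """
    pairs = []
    for field, value in criteria.items():
        if isinstance(value, (tuple, list)):
            if field != "indegree" or len(value) != 2:
                raise ValueError(
                    "ranges are only documented for indegree as [min, max]"
                )
            value = list(value)
        pairs.append([field, value])
    return pairs


print(field_keywords(name="hyphe", indegree=(3, 1000)))
# [['name', 'hyphe'], ['indegree', [3, 1000]]]
```
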
  • wordsearch_webentities:
    • allFieldsKeywords (optional, default: [])
    • fieldKeywords (optional, default: [])
    • sort (optional, default: null)
    • count (optional, default: 100)
    • page (optional, default: 0)
    • light (optional, default: false)
    • semilight (optional, default: true)
    • corpus (optional, default: "--hyphe--")

Same as search_webentities except that the search only matches exact full words and that the allFieldsKeywords query also searches tag values.

  • get_webentities_by_status:
    • status (mandatory)
    • sort (optional, default: null)
    • count (optional, default: 100)
    • page (optional, default: 0)
    • light (optional, default: false)
    • semilight (optional, default: true)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities having their status equal to status (one of "in"/"out"/"undecided"/"discovered"). Results are paginated and will include a token to be reused to collect the other pages via get_webentities_page: see search_webentities for explanations on sort, count and page.

  • get_webentities_by_name:
    • name (mandatory)
    • sort (optional, default: null)
    • count (optional, default: 100)
    • page (optional, default: 0)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities having their name equal to name. Results are paginated and will include a token to be reused to collect the other pages via get_webentities_page: see search_webentities for explanations on sort, count and page.

  • get_webentities_by_tag_value:
    • value (mandatory)
    • namespace (optional, default: null)
    • category (optional, default: null)
    • sort (optional, default: null)
    • count (optional, default: 100)
    • page (optional, default: 0)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities having at least one tag in any namespace/category equal to value. Results are paginated and will include a token to be reused to collect the other pages via get_webentities_page: see search_webentities for explanations on sort, count and page.

  • get_webentities_by_tag_category:
    • namespace (mandatory)
    • category (mandatory)
    • sort (optional, default: null)
    • count (optional, default: 100)
    • page (optional, default: 0)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities having at least one tag in a specific category for a specific namespace. Results are paginated and will include a token to be reused to collect the other pages via get_webentities_page: see search_webentities for explanations on sort, count and page.

  • get_webentities_mistagged:
    • status (optional, default: 'IN')
    • missing_a_category (optional, default: false)
    • multiple_values (optional, default: false)
    • sort (optional, default: null)
    • count (optional, default: 100)
    • page (optional, default: 0)
    • light (optional, default: false)
    • semilight (optional, default: false)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities of status status with no tag of the namespace "USER", or with multiple tags for some USER categories if multiple_values is true, or with no tag for at least one existing USER category if missing_a_category is true. Results are paginated and will include a token to be reused to collect the other pages via get_webentities_page: see search_webentities for explanations on sort, count and page.

  • get_webentities_uncrawled:
    • sort (optional, default: null)
    • count (optional, default: 100)
    • page (optional, default: 0)
    • light (optional, default: false)
    • semilight (optional, default: false)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus all IN WebEntities which have no crawl job associated with them. Results are paginated and will include a token to be reused to collect the other pages via get_webentities_page: see search_webentities for explanations on sort, count and page.

  • get_webentities_page:
    • pagination_token (mandatory)
    • n_page (mandatory)
    • idNamesOnly (optional, default: false)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus the page number n_page of WebEntities corresponding to the results of a previous query run using any of the get_webentities or search_webentities methods, using the returned pagination_token. Returns only an array of [id, name] arrays if idNamesOnly is true.
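Collecting the remaining pages of a paginated query could look like the sketch below. Several points are assumptions and should be checked against real answers: `call(method, *params)` is a hypothetical transport function returning the decoded result, the page content is assumed to live under a "webentities" key (adjust `extract` otherwise), the method is assumed to be addressed as `store.get_webentities_page`, and iteration simply stops at the first empty page.

```python
def iter_remaining_pages(call, token, corpus, first_page=1,
                         extract=lambda page: page["webentities"]):
    """Yield WebEntities page after page until an empty page is met.

    `call` and the default `extract` are hypothetical; page 0 is
    assumed to have been returned by the initial query.
    """
    n_page = first_page
    while True:
        # Arguments: pagination_token, n_page, idNamesOnly, corpus.
        page = call("store.get_webentities_page", token, n_page, False, corpus)
        items = extract(page)
        if not items:
            return
        yield from items
        n_page += 1
```

Being a generator, it only requests a new page when the caller has consumed the previous one.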

  • get_webentities_ranking_stats:
    • pagination_token (mandatory)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus histogram data on the indegrees of all WebEntities matching a previous query run using any of the get_webentities or search_webentities methods, using the returned pagination_token.

TAGS

  • rebuild_tags_dictionary:
    • corpus (optional, default: "--hyphe--")

Administrative function to regenerate for a corpus the dictionary of tag values used by autocompletion features; mostly a debug function which should not be used in most cases.

  • add_webentity_tag_value:
    • webentity_id (mandatory)
    • namespace (mandatory)
    • category (mandatory)
    • value (mandatory)
    • corpus (optional, default: "--hyphe--")

Adds for a corpus a tag namespace:category=value to a WebEntity defined by webentity_id.

  • add_webentities_tag_value:
    • webentity_ids (mandatory)
    • namespace (mandatory)
    • category (mandatory)
    • value (mandatory)
    • corpus (optional, default: "--hyphe--")

Adds for a corpus a tag namespace:category=value to a bunch of WebEntities defined by a list of webentity_ids.

  • rm_webentity_tag_value:
    • webentity_id (mandatory)
    • namespace (mandatory)
    • category (mandatory)
    • value (mandatory)
    • corpus (optional, default: "--hyphe--")

Removes for a corpus a tag namespace:category=value associated with a WebEntity defined by webentity_id if it is set.

  • rm_webentities_tag_value:
    • webentity_ids (mandatory)
    • namespace (mandatory)
    • category (mandatory)
    • value (mandatory)
    • corpus (optional, default: "--hyphe--")

Removes for a corpus a tag namespace:category=value from a bunch of WebEntities defined by a list of webentity_ids.

  • edit_webentity_tag_value:
    • webentity_id (mandatory)
    • namespace (mandatory)
    • category (mandatory)
    • old_value (mandatory)
    • new_value (mandatory)
    • corpus (optional, default: "--hyphe--")

Replaces for a corpus a tag namespace:category=old_value with a tag namespace:category=new_value for the WebEntity defined by webentity_id if it is set.

  • get_tags:
    • namespace (optional, default: null)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus a tree of all existing tags of the webentities hierarchised by namespaces and categories. Optionally limits to a specific namespace.

  • get_tag_namespaces:
    • corpus (optional, default: "--hyphe--")

Returns for a corpus a list of all existing namespaces of the webentities tags.

  • get_tag_categories:
    • namespace (optional, default: null)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus a list of all existing categories of the webentities tags. Optionally limits to a specific namespace.

  • get_tag_values:
    • namespace (optional, default: null)
    • category (optional, default: null)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus a list of all existing values in the webentities tags. Optionally limits to a specific namespace and/or category.
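
To illustrate how the tree returned by get_tags might be consumed, here is a minimal Python sketch. It assumes the result is a nested object going from namespace to category to values, where the values are any iterable (a list, or an object keyed by value). This shape is an assumption and is not specified above, and the sample data is invented.

```python
def flatten_tags(tree):
    """Yield (namespace, category, value) triples from a nested tag tree."""
    for namespace, categories in sorted(tree.items()):
        for category, values in sorted(categories.items()):
            for value in values:
                yield namespace, category, value


# Invented sample mimicking the assumed shape of the get_tags result.
sample_tree = {"USER": {"country": ["France", "Italy"], "type": ["blog"]}}
triples = list(flatten_tags(sample_tree))
```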

PAGES, LINKS AND NETWORKS

  • get_webentity_pages:
    • webentity_id (mandatory)
    • onlyCrawled (optional, default: true)
    • corpus (optional, default: "--hyphe--")

Warning: this method can be very slow on webentities with many pages, prefer paginate_webentity_pages whenever possible. Returns for a corpus all indexed Pages fitting within the WebEntity defined by webentity_id. Optionally limits the results to Pages which were actually crawled by setting onlyCrawled to "true".

  • paginate_webentity_pages:
    • webentity_id (mandatory)
    • count (optional, default: 5000)
    • pagination_token (optional, default: null)
    • onlyCrawled (optional, default: false)
    • include_page_metas (optional, default: false)
    • include_page_body (optional, default: false)
    • body_as_plain_text (optional, default: false)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus count indexed Pages, alphabetically ordered, fitting within the WebEntity defined by webentity_id, along with a pagination_token to reuse to collect the following pages. Optionally limits the results to Pages which were actually crawled by setting onlyCrawled to "true". Also optionally returns complete page metadata (http status, body size, content_type, encoding, crawl timestamp and crawl depth) when include_page_metas is set to "true". Additionally returns the page's zipped body encoded in base64 when include_page_body is "true" (only possible when Hyphe is configured with store_crawled_html_content set to "true"); setting body_as_plain_text to "true" decodes and unzips these bodies to return them as plain text.
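
When body_as_plain_text is left to "false", the client has to decode the body itself. The following Python sketch shows one way to do it, assuming that "zipped" means zlib or gzip compression (the exact compression format is not stated above) and that the page encoding is known. The decode_page_body helper is hypothetical and the sample content is made up.

```python
import base64
import zlib


def decode_page_body(encoded_body, encoding="utf-8"):
    """Decode a base64 string holding a compressed page body into text."""
    raw = base64.b64decode(encoded_body)
    # MAX_WBITS | 32 lets zlib auto-detect a zlib or a gzip header.
    html = zlib.decompress(raw, zlib.MAX_WBITS | 32)
    return html.decode(encoding, errors="replace")


# Round trip on made-up content, compressed the way the sketch assumes.
sample = base64.b64encode(zlib.compress(b"<html>hello</html>")).decode("ascii")
text = decode_page_body(sample)
```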

  • get_webentity_mostlinked_pages:
    • webentity_id (mandatory)
    • npages (optional, default: 20)
    • max_prefix_distance (optional, default: null)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus the npages (defaults to 20) most linked Pages indexed that fit within the WebEntity defined by webentity_id and optionally at a maximum depth of max_prefix_distance.

  • get_webentity_subwebentities:
    • webentity_id (mandatory)
    • light (optional, default: false)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus all sub-webentities of a WebEntity defined by webentity_id (meaning webentities having at least one LRU prefix starting with one of the WebEntity's prefixes).

  • get_webentity_parentwebentities:
    • webentity_id (mandatory)
    • light (optional, default: false)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus all parent-webentities of a WebEntity defined by webentity_id (meaning webentities having at least one LRU prefix which is the beginning of one of the WebEntity's prefixes).

  • get_webentity_pagelinks_network:
    • webentity_id (optional, default: null)
    • include_external_links (optional, default: false)
    • corpus (optional, default: "--hyphe--")

Warning: this method can be very slow on webentities with many pages or links, prefer paginate_webentity_pagelinks_network whenever possible. Returns for a corpus the list of all internal NodeLinks of a WebEntity defined by webentity_id. Optionally adds external NodeLinks (the frontier) by setting include_external_links to "true". Will not return much of anything if the corpus was configured with ignore_internal_links set to "true".

  • paginate_webentity_pagelinks_network:
    • webentity_id (optional, default: null)
    • count (optional, default: 10)
    • pagination_token (optional, default: null)
    • include_external_outlinks (optional, default: false)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus internal page links for count source pages of a WebEntity defined by webentity_id, along with a pagination_token to reuse to collect the following links. Optionally adds external NodeLinks (the frontier) by setting include_external_outlinks to "true". Will not return much of anything if the corpus was configured with ignore_internal_links set to "true".
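
The token-based pagination could be driven by a loop such as the Python sketch below. Several things here are assumptions: call(method, params) is a hypothetical helper that sends the JSON-RPC request and returns the API answer as a dict of the form {"code": ..., "result": ...}, the method is assumed to live under the "store." namespace, and the result is assumed to expose the batch under a "links" key and the next token under a "token" key.

```python
def iterate_pagelinks(call, webentity_id, corpus, count=10):
    """Yield page links batch after batch until no token is returned."""
    token = None
    while True:
        # Ordered arguments: webentity_id, count, pagination_token,
        # include_external_outlinks, corpus.
        answer = call(
            "store.paginate_webentity_pagelinks_network",
            [webentity_id, count, token, False, corpus],
        )
        if answer["code"] != "success":
            raise RuntimeError(answer.get("message", "unknown error"))
        result = answer["result"]
        # Assumed keys: "links" (current batch) and "token" (next batch).
        for link in result.get("links", []):
            yield link
        token = result.get("token")
        if not token:
            break
```

Passing the transport as a function keeps the loop independent from the JSON-RPC client actually used, and the same pattern would apply to paginate_webentity_pages under similar assumptions.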

  • get_webentity_referrers:
    • webentity_id (optional, default: null)
    • count (optional, default: 100)
    • page (optional, default: 0)
    • light (optional, default: true)
    • semilight (optional, default: false)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities with known links to webentity_id ordered by decreasing link weight. Results are paginated and will include a token to be reused to collect the other entities via get_webentities_page: see search_webentities for explanations on count and page.

  • get_webentity_referrals:
    • webentity_id (optional, default: null)
    • count (optional, default: 100)
    • page (optional, default: 0)
    • light (optional, default: true)
    • semilight (optional, default: false)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities with known links from webentity_id ordered by decreasing link weight. Results are paginated and will include a token to be reused to collect the other entities via get_webentities_page: see search_webentities for explanations on count and page.

  • get_webentity_ego_network:
    • webentity_id (optional, default: null)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus a list of all weighted links between webentities linked to webentity_id.

  • get_webentities_network:
    • include_links_from_OUT (optional, default: INCLUDE_LINKS_FROM_OUT)
    • include_links_from_DISCOVERED (optional, default: INCLUDE_LINKS_FROM_DISCOVERED)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus the list of all aggregated weighted links between WebEntities.
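
As an example of what can be done with such a list, the Python sketch below sums the weights of incoming links per WebEntity. It assumes each link is a [source_id, target_id, weight] triple, which is not specified above, and the sample links are invented.

```python
from collections import defaultdict


def weighted_indegrees(links):
    """Sum incoming link weights per target WebEntity id."""
    totals = defaultdict(int)
    for source, target, weight in links:
        totals[target] += weight
    return dict(totals)


# Invented links following the assumed [source, target, weight] shape.
sample_links = [[1, 2, 3], [3, 2, 1], [2, 1, 5]]
indegrees = weighted_indegrees(sample_links)
```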

CREATION RULES

  • get_default_webentity_creationrule:
    • corpus (optional, default: "--hyphe--")

Returns for a corpus the default WebEntityCreationRule.

  • get_webentity_creationrules:
    • lru_prefix (optional, default: null)
    • corpus (optional, default: "--hyphe--")

Returns for a corpus all existing WebEntityCreationRules or only one set for a specific lru_prefix.

  • delete_webentity_creationrule:
    • lru_prefix (mandatory)
    • corpus (optional, default: "--hyphe--")

Removes from a corpus an existing WebEntityCreationRule set for a specific lru_prefix.

  • add_webentity_creationrule:
    • lru_prefix (mandatory)
    • regexp (mandatory)
    • corpus (optional, default: "--hyphe--")

Adds to a corpus a new WebEntityCreationRule set for a lru_prefix to a specific regexp or one of "subdomain"/"subdomain-N"/"domain"/"path-N"/"prefix+N"/"page", N being an integer. It will immediately be applied to past crawls.
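
A client may want to tell the preset names apart from custom regular expressions before sending the rule. The Python sketch below does so with a pattern derived from the list of presets above. It assumes N is a non-negative integer written in decimal digits, and the is_preset_rule helper is hypothetical.

```python
import re

# Matches the preset names listed above; anything else is a custom regexp.
PRESET_RULE = re.compile(
    r"^(subdomain(-\d+)?|domain|path-\d+|prefix\+\d+|page)$"
)


def is_preset_rule(regexp):
    """Return True when the value is one of the preset rule names."""
    return PRESET_RULE.match(regexp) is not None


candidates = ["domain", "subdomain-2", "path-1", "prefix+3", "path", "(s:[a-z]+\\|)"]
checks = {name: is_preset_rule(name) for name in candidates}
```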

  • simulate_creationrules_for_urls:
    • pageURLs (mandatory)
    • corpus (optional, default: "--hyphe--")

Returns an object giving for each URL of pageURLs (single string or array) the prefix of the theoretical WebEntity the URL would be attached to within a corpus following its specific WebEntityCreationRules.

  • simulate_creationrules_for_lrus:
    • pageLRUs (mandatory)
    • corpus (optional, default: "--hyphe--")

Returns an object giving for each LRU of pageLRUs (single string or array) the prefix of the theoretical WebEntity the LRU would be attached to within a corpus following its specific WebEntityCreationRules.

VARIOUS

  • trigger_links_build:
    • corpus (optional, default: "--hyphe--")

Will initiate a links calculation update (useful especially when a corpus crashed during the links calculation and no more crawls are scheduled).

  • get_webentities_stats:
    • corpus (optional, default: "--hyphe--")

Returns for a corpus a set of statistics on the distribution of WebEntities by status, sampled every 5 minutes.