Query Setup. Generic

HenriekeKotthoff edited this page May 13, 2022 · 20 revisions

While the other modules are tailor-made for specific APIs (e.g. YouTube, Twitter), the Generic Module can be used to fetch data from other APIs, to upload and download files, and to scrape HTML data. Within the Generic Module you can use presets (e.g. for subreddit posts or CrowdTangle). The presets that ship with Facepager provide a starting point; you can also edit them and create your own.

Layout of the Generic Module

Requesting data on the web (whether for downloading and uploading media files, web scraping or interacting with APIs) typically involves four different parts:

  • The URL is the address of a specific resource on the web. In the URL, the protocol ("https://") is followed by the host name, a path and query parameters. In Facepager, this corresponds to the inputs for base path, resource and parameters.
  • The header defines additional information such as authorization data that is not included in the URL.
  • The method indicates the type of intended interaction, e.g. requesting or sending data.
  • The body contains data you send to the server. Whether you can send data depends on the selected method. In Facepager, a payload field for defining the request body is shown when you select the post, put or patch method.

The Generic Module allows you to assemble these different parts, send them to a web server, and process the response. Facepager provides assistance with some standard tasks when compiling requests, such as pagination and authorization. Further, you can define how to process the response of the server and whether you want to save the downloaded data in the database or in a download folder on your disk.
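
As a rough illustration of how the four parts fit together, the following sketch assembles a request with Python's standard library without sending it. The function `build_request`, the host `api.example.com` and all parameter values are made up for this example; this is not how Facepager builds requests internally.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

def build_request(base_path, resource, params, headers, method="GET", payload=None):
    """Assemble URL, headers, method and body into one request object."""
    url = base_path + resource              # the resource is appended to the base path
    if params:
        url += "?" + urlencode(params)      # name-value pairs become the query string
    body = None
    # A body is only attached for methods that send data.
    if payload is not None and method in ("POST", "PUT", "PATCH"):
        body = json.dumps(payload).encode("utf-8")
    return Request(url, data=body, headers=headers, method=method)

# Hypothetical API and values, for illustration only.
request = build_request(
    base_path="https://api.example.com/v1",
    resource="/posts",
    params={"q": "Facepager", "limit": 10},
    headers={"Accept": "application/json"},
)
print(request.get_method(), request.full_url)
```

Passing the object to `urllib.request.urlopen` would send the request and return the server's response.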

Building the URL

The base path is used to define the URL of the requested data (e.g. the URL of an API or a website). The resource value is simply appended to the base path. Splitting the URL into these two fields is not required. By convention, the fixed parts of an API or website URL go into the base path field, and the path components that vary from request to request go into the resource field. You can also use placeholders to refer to node data. For an overview see the help section about URLs, Placeholders, Nodes and Keys.

By using parameters you define the name-value pairs in the query string. For example, the query string of the URL https://gdata.youtube.com/feeds/api/videos?q=Facepager&alt=json&v=2 is ?q=Facepager&alt=json&v=2. You don’t need to use the parameter fields; the full URL, including the query string, may be entered into the base path field as well. The parameter field is just a convenient way to assemble query strings and to define custom placeholders.
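
To make the relation between a URL and its name-value pairs concrete, here is a small sketch using Python's standard library. It takes the example URL from above apart and puts it back together. It only illustrates the concept and says nothing about Facepager's own code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

url = "https://gdata.youtube.com/feeds/api/videos?q=Facepager&alt=json&v=2"

parts = urlsplit(url)
params = parse_qsl(parts.query)   # list of (name, value) pairs

print(parts.path)    # /feeds/api/videos
print(parts.query)   # q=Facepager&alt=json&v=2
print(params)        # [('q', 'Facepager'), ('alt', 'json'), ('v', '2')]

# Assembling the pairs again yields the original URL.
rebuilt = f"{parts.scheme}://{parts.netloc}{parts.path}?{urlencode(params)}"
print(rebuilt == url)
```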

In the parameter settings, custom placeholders can be defined as name-value pairs. For example, if you put <searchterm> into the left side and <Object ID> into the right, you can use your custom placeholder <searchterm> in all other fields. It will be replaced by the placeholder <Object ID>, which in turn is replaced by the Object ID value of the processed node. Organizing the settings with custom placeholders can keep your setup clean: you can change the placeholder value in one place by changing the right-hand value.
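
The two-step replacement can be sketched as follows. The function `resolve`, the dictionary layout and the limit on replacement rounds are assumptions made for this illustration; Facepager's actual placeholder handling may differ.

```python
import re

PLACEHOLDER = re.compile(r"<([^<>]+)>")

def resolve(template, custom, node, max_depth=5):
    """Replace <name> placeholders, first from custom placeholders, then from node data."""
    def substitute(match):
        name = match.group(1)
        if name in custom:
            return custom[name]
        if name in node:
            return str(node[name])
        return match.group(0)   # unknown placeholders are left untouched

    # Repeat so that a custom placeholder can point to another placeholder.
    for _ in range(max_depth):
        replaced = PLACEHOLDER.sub(substitute, template)
        if replaced == template:
            break
        template = replaced
    return template

custom = {"searchterm": "<Object ID>"}   # left side: name, right side: value
node = {"Object ID": "Facepager"}        # hypothetical node data

print(resolve("/search?q=<searchterm>", custom, node))   # /search?q=Facepager
```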

Sending headers

In addition to a URL you can send so-called headers to a server. Headers are used to set the preferred language, to send cookies or the user agent, and for many other purposes. See https://en.wikipedia.org/wiki/List_of_HTTP_header_fields for a general explanation of headers.

To define headers, you can copy them from the developer tools of your web browser. For example, in Mozilla Firefox or Chrome, you can hit F12 on your keyboard, navigate to the "Network" tab, click on one of the requests and inspect the header data in the detail view that opens.

If you want your query to identify itself as coming from a particular web browser, you can copy the user-agent value from the header data.

Selecting a method

Interaction with a resource takes place via so-called HTTP verbs (methods). Facepager supports six of these methods:

  • GET for receiving data from a specified resource, e.g. a Wikipedia page.
  • HEAD is similar to GET, but doesn't download the response body. The method can be used to check whether a webpage exists.
  • POST for uploading data to a platform, e.g. to Google Cloud Storage.
  • PUT is similar to POST; it is usually used to update or create data on a website.
  • PATCH for updating and modifying existing resources.
  • DELETE to remove the requested resource.

Usually, you will use GET for collecting data.

Assistance for pagination and authorization

Sometimes the results of a request are split across multiple pages. For example, search results are typically presented in chunks of 10 or more results. To collect all of them, you need to page through the results with successive requests. There are two common pagination approaches. Offset-based pagination defines parameters for a page number or offset that you can use to select specific pages in the query. Cursor-based pagination sets a cursor (sometimes called a continuation token) that indicates the starting point of the request. Cursor-based pagination is more efficient for large databases on the API provider's side: while offset-based pagination has to step through all preceding records to find the start of the requested page, cursor-based pagination just starts at the given value.
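
The difference between the two approaches can be sketched with an in-memory list that stands in for an API provider's database. The records, function names and page size are invented for this example.

```python
# 25 made-up records standing in for a provider's database.
RECORDS = [{"id": i, "title": f"post {i}"} for i in range(1, 26)]

def page_by_offset(page, page_size=10):
    """Offset-based: the client asks for page number `page` (1-based)."""
    start = (page - 1) * page_size
    return RECORDS[start:start + page_size]

def page_by_cursor(cursor=None, page_size=10):
    """Cursor-based: return records after `cursor` and the cursor for the next page."""
    # A real database would use an index to jump to the cursor
    # instead of filtering the whole list as done here.
    items = [r for r in RECORDS if cursor is None or r["id"] > cursor][:page_size]
    has_more = bool(items) and items[-1]["id"] < RECORDS[-1]["id"]
    next_cursor = items[-1]["id"] if has_more else None
    return items, next_cursor

# Follow the cursor until the provider signals that no further page exists.
collected, cursor = [], None
while True:
    items, cursor = page_by_cursor(cursor)
    collected.extend(items)
    if cursor is None:
        break

print(len(page_by_offset(3)), len(collected))
```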

To download multiple pages, Facepager provides three modes of paging parameters:

  • Key can be used if each page of the response contains data about how to access the next page (a page number or a cursor). The value is extracted from each response using the paging key, then added as the query parameter you define as param.
  • Count simply counts up the page number and appends it in the query parameter you define as param.
  • URL is used when the response contains a link to the next page.

If you use a paging key (in either the Key or the URL mode), you can choose a stop key that stops fetching data as soon as the given key is present but empty or false. Usually you don't need a stop key, since fetching will stop anyway when the paging key is empty.

For count you can define a start value (first page or offset number), a step (amount to increase for each page) and maximum pages (the maximum number of pages to fetch).

If successful, each of the procedures results in a new URL. Facepager issues new requests with these URLs as long as the maximum number of pages is not exceeded.
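
How each of the three modes could arrive at the next URL is sketched below. The helper names, the dotted key notation for nested values and the example responses are assumptions made for this illustration; they are not taken from Facepager's code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def set_param(url, name, value):
    """Return `url` with the query parameter `name` set to `value`."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query[name] = str(value)
    return urlunsplit(parts._replace(query=urlencode(query)))

def get_key(data, key):
    """Look up a dotted key such as 'paging.cursor' in nested dictionaries."""
    for part in key.split("."):
        if not isinstance(data, dict) or part not in data:
            return None
        data = data[part]
    return data

def next_url_by_key(url, response, paging_key, param):
    """Key mode: take a value from the response and add it as a query parameter."""
    value = get_key(response, paging_key)
    return set_param(url, param, value) if value else None   # empty value: stop

def urls_by_count(url, param, start=1, step=1, max_pages=3):
    """Count mode: count up the parameter value, one URL per page."""
    return [set_param(url, param, start + i * step) for i in range(max_pages)]

def next_url_by_url(response, paging_key):
    """URL mode: the response already contains the link to the next page."""
    return get_key(response, paging_key) or None

base = "https://api.example.com/items?limit=10"   # hypothetical API
print(next_url_by_key(base, {"paging": {"cursor": "abc"}}, "paging.cursor", "after"))
print(urls_by_count(base, "page", start=1, step=1, max_pages=3))
print(next_url_by_url({"links": {"next": "https://api.example.com/items?page=2"}}, "links.next"))
```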

Authorization and authentication

Some APIs and websites require you to authenticate yourself by logging in. Facepager assists you with several scenarios, such as OAuth (Open Authorization) and cookies, using the settings and the login button.

Typically, by authenticating you get an access token or a cookie that works like a password. In every request to the server, you send the token or cookie either as a query parameter or as a header value. The specific procedure depends on the API or website; the server will only respond with data if you are authorized.
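
The two ways of sending a token can be sketched like this. The parameter name `access_token` and the `Bearer` header scheme are common conventions but depend on the API; the host and the token value are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

TOKEN = "my-secret-token"   # placeholder, normally obtained by logging in

# Variant 1: the token travels in the query string.
as_param = Request("https://api.example.com/me?" + urlencode({"access_token": TOKEN}))

# Variant 2: the token travels in a header and stays out of the URL.
as_header = Request(
    "https://api.example.com/me",
    headers={"Authorization": f"Bearer {TOKEN}"},
)

print(as_param.full_url)
print(as_header.get_header("Authorization"))
```

Sending the token as a header keeps it out of URLs that may end up in logs or in the browser history, which is one reason many APIs prefer this variant.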

Processing the response

After sending a request to a server you have to decide where to store the response. You can store the collected data in the Facepager database, in files or in both.

In the response field, you can choose between several file formats:

  • JSON - Choose JSON (JavaScript Object Notation) if the API returns JSON data. This is the default option. Facepager stores the JSON data in the node.
  • xml - If the API returns XML data, it will be parsed and converted to JSON.
  • html - The returned data will be parsed as HTML and converted to JSON.
  • text - Instead of parsing the data, you can store the raw text of the server's response. You will find the response in the text key of a node (see the Detail View). This is useful for web scraping: you can process the text afterwards using the built-in parsers and the modifier syntax.
  • links - The response is parsed as XML and all links are extracted. You will find them in the links key of the node. If you want to slice the links into different nodes, which is highly recommended, set the key to extract to links and the key for Object ID to url. This setting is useful for web crawling, i.e. if you want to follow the links in a web page.
  • file - The data will be downloaded to files. Specify the folder and a filename.
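
To give an idea of what extracting links involves, here is a minimal sketch based on Python's built-in HTML parser. It collects the targets of all `<a>` elements into a list stored under a `links` key, with one `url` entry per link. The class name and the sample page are invented; Facepager's own parser and output structure may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href targets of all <a> elements as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Relative links are resolved against the page URL.
                self.links.append({"url": urljoin(self.base_url, href)})

page = '<p><a href="/about">About</a> <a href="https://example.org/x">X</a></p>'
collector = LinkCollector("https://example.com/start")
collector.feed(page)

node = {"links": collector.links}
print(node)
```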

Whatever you choose, at least one new node with some basic information is stored in the database. Further, regardless of your selection, if you provide a download folder and set a filename value (e.g. <Object ID>), the response will be saved to your disk.

Before the response is stored in the database, you can extract data and slice it into smaller chunks by setting the key to extract field. Facepager will extract the data addressed by the key and create a child node. If the addressed data consists of a list, one child node is created for every list item. The remaining data will be put into an additional child node of the type "offcut". To find the right key, leave the field blank and send a request. Then inspect the Data View (top right) to find the key that contains the data you are looking for.
If your data contains unique IDs for every node, you can define the corresponding key for object ID. The Object ID is shown in the first column for each node.
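
The slicing step can be pictured with the following sketch, which handles top-level keys only. The function `slice_response` and the node fields `objectid`, `type` and `response` are hypothetical names chosen for this illustration and do not describe Facepager's database layout.

```python
import copy

def slice_response(response, extract_key, object_id_key=None):
    """Split a response into one child node per extracted item plus an offcut node."""
    remaining = copy.deepcopy(response)
    extracted = remaining.pop(extract_key, None)

    # A list yields one child per item, any other value yields a single child.
    if extracted is None:
        items = []
    elif isinstance(extracted, list):
        items = extracted
    else:
        items = [extracted]

    children = []
    for item in items:
        object_id = None
        if object_id_key and isinstance(item, dict):
            object_id = item.get(object_id_key)
        children.append({"objectid": object_id, "type": "data", "response": item})

    # Everything that was not extracted ends up in the offcut node.
    children.append({"objectid": None, "type": "offcut", "response": remaining})
    return children

response = {
    "data": [{"id": "a1", "text": "first"}, {"id": "b2", "text": "second"}],
    "paging": {"cursor": "xyz"},
}
for child in slice_response(response, extract_key="data", object_id_key="id"):
    print(child["type"], child["objectid"])
```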
