
GSoC 2021 Work Product Submission

krishna edited this page Aug 18, 2021 · 2 revisions

Mentors: @afeena, @mzfr

Hello there! Over the summer of 2021, I worked on improving the cloning and serving capabilities of Snare.

Specifically, I worked on introducing headless cloning through Pyppeteer, upgrading aiohttp to a version compatible with Tanner and adding support for newer versions of Python. Apart from these, I also worked on a few issues that helped make Snare more complete and provide a better overall user experience.

Headless cloning

In some cases, the classic method of curl-ing or requests.get-ing a page might not give us the complete webpage. This can happen for a variety of reasons - User-Agent or viewport checks, lazy loading, or AJAX calls fired on cursor movement. Even though the User-Agent can be spoofed, that still leaves a few issues that cannot be solved the conventional way. Enter headless browsing.

In a nutshell, headless browsing means using an actual browser instance, without a GUI, whose actions can be programmed and automated. Selenium is one such battle-tested tool for browser automation, and we initially chose it for that reason. However, it later struck us that the entirety of Snare runs asynchronously while Selenium is meant to be run synchronously. Hence, we shifted to Pyppeteer, the Python port of JavaScript's Puppeteer, which works asynchronously and fit in very well.

Headless cloning can now be enabled by adding the `--headless` flag to the cloner call, like `clone --target http://example.com --path example-site --headless`.

Link to PR: #294

Architectural redesign

To incorporate headless cloning, the data-fetching logic had to be split into a separate function, but that was not all. We had too many functions serving different purposes under a single class; this called for separate classes.

There were two ways to proceed:

  1. Keep the Cloner class as is and introduce a HeadlessCloner class that overrides the fetch_data method.
  2. Separate the core functionality of the cloner into BaseCloner, an abstract class, define fetch_data in SimpleCloner and HeadlessCloner, and provide a common interface through CloneRunner.

We collectively decided to proceed with the second approach for the sake of cleaner design.

Retrying URLs

Headless cloning brought along a few challenges, one of them being request failures (from timeouts, for example). Initially, quite a number of requests errored out, resulting in pages not being scraped. To tackle this, a try_count key was added to the URL item (a dictionary) in the URLs queue. If fetching the data fails, the same URL is added to the queue again with try_count increased by 1. A single URL is tried a maximum of three times before it is discarded.

```python
url_item = {
    "url": "example.com",
    "level": 0,
    "try_count": 1
}
```

A change as small and simple as this increased the number of pages cloned and made the cloning process more reliable.
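A minimal sketch of the retry loop, assuming a hypothetical fetch coroutine and the three-try limit described above (not Snare's actual code):

```python
import asyncio

MAX_TRY_COUNT = 3  # a URL is tried at most three times


async def process_queue(queue, fetch):
    """Drain the URL queue, re-enqueueing failed items (illustrative sketch)."""
    results = {}
    while not queue.empty():
        item = await queue.get()
        try:
            results[item["url"]] = await fetch(item["url"])
        except Exception:
            if item["try_count"] < MAX_TRY_COUNT:
                # Put the same URL back with an incremented try_count.
                await queue.put({**item, "try_count": item["try_count"] + 1})
            # Otherwise the URL is discarded.
    return results
```

A URL whose fetch fails transiently (say, two timeouts) still ends up cloned on the third attempt instead of being lost.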

Link to PR: #298

aiohttp and Python 3.9

Two of the most daunting tasks in my proposal were upgrading the aiohttp library to v3.7.4, the same version used by Tanner, and adding support for newer versions of Python - v3.8 and v3.9. As encountered in #244, Snare served empty pages or raised connection reset errors, with the root cause being the Python and aiohttp versions. To everyone's relief, the task turned out to be very easy, as Snare worked out of the box with Python 3.9 and aiohttp v3.7.4. 😄

Error handling

While testing, I came across a strange issue where the meta info was not written into meta.json when a KeyboardInterrupt was raised. After some research and help from my mentors, we identified the issue to be with exception handling in asyncio event loops: run_until_complete propagates an exception from the point where it is raised to the point where the loop run is invoked. This meant the keyboard interrupt could not be handled within Cloner.

To overcome this issue, a close method was introduced in the CloneRunner class to close all open connections and write the meta info.
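A minimal sketch of the pattern, with a stand-in run coroutine that simulates Ctrl-C mid-crawl; only the class and method names follow the write-up, the bodies are illustrative:

```python
import asyncio


class CloneRunner:
    def __init__(self):
        self.meta = {}
        self.closed = False

    async def run(self):
        # Stand-in for the crawl loop: simulate Ctrl-C arriving mid-crawl.
        # The exception propagates out of run_until_complete to the caller.
        raise KeyboardInterrupt

    async def close(self):
        # Close all open connections and write the meta info here.
        self.closed = True


loop = asyncio.new_event_loop()
runner = CloneRunner()
try:
    loop.run_until_complete(runner.run())
except KeyboardInterrupt:
    # The interrupt surfaces here, outside the coroutine, so cleanup
    # has to happen at the call site rather than inside Cloner.
    pass
finally:
    loop.run_until_complete(runner.close())
    loop.close()
```

Putting the cleanup in close and calling it from a finally block guarantees the meta info is written even when the crawl is interrupted.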

Link to PRs:

Redirects

In cases where sites redirected, cloner had a tough time fixing and following links. For example, there were a lot of problems with broken relative links when the home URL was shifted. To support redirects, the final URL is compared with the requested URL and, if they differ, a key is added to the meta info accordingly. When serving, if a "redirect" key is present, a 302 is raised and the visitor is redirected to the new URL.

For example, if / redirected to /new/home/, meta.json would look like this:

```json
{
  "/": {
    "redirect": "/new/home/"
  },
  "/new/home/": {
    "hash": "abc123",
    "headers": [
      {
        "Server": "ABC"
      }
    ]
  }
}
```
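A hypothetical helper (resolve_request is not Snare's actual function) showing how such a meta entry could drive the response:

```python
def resolve_request(meta, path):
    """Return (status, location_or_hash) for a requested path."""
    entry = meta.get(path, {})
    if "redirect" in entry:
        # A 302 sends the client on to the new URL.
        return 302, entry["redirect"]
    if "hash" in entry:
        # The hash identifies the stored page content to serve.
        return 200, entry["hash"]
    return 404, None
```

With the meta.json above, a request for / yields a 302 to /new/home/, which then serves the stored page abc123.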

Link to PRs:

Fingerprinting

Snare claims to be an Nginx web server on the outside while aiohttp runs under the hood. To solidify this claim, it was crucial to make sure the Snare web server did not leak the Server header. Basic fingerprinting methods involve checking the order of response headers and sending malformed requests to trigger various exceptions.

Though a 400 exception cannot yet be completely handled in aiohttp, 302, 404 and 500 responses now send proper headers. Additionally, Snare has been configured to drop the Server header altogether if it would expose the aiohttp server banner.
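A sketch of the header-scrubbing idea; scrub_headers is a hypothetical name, not Snare's actual function:

```python
def scrub_headers(headers):
    """Drop the Server header when it would reveal the aiohttp banner."""
    return {
        name: value
        for name, value in headers.items()
        if not (name.lower() == "server" and "aiohttp" in value.lower())
    }
```

Dropping the header entirely is safer than rewriting it, since a mismatched banner elsewhere would itself be a fingerprinting signal.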

Link to PR: #308

Unit Testing

The architectural changes to accommodate headless cloning required the tests to be partially rewritten. I learnt a lot about writing proper tests from my mentors during this period.

Link to PR: #304

Bug fixes and minor improvements

CSS validation by cloner now properly logs errors and warnings into the log file instead of stdout. This was done to reduce the visual clutter while running cloner.

Link to PR: #297

There was an issue with the Transfer-Encoding header while serving webpages with Snare. Websites can opt to transfer data in chunks so that data from various sources reaches the viewer reliably. However, when data is sent in chunks, the Content-Length header must not be present, as all the relevant info for the transfer is carried in the chunks themselves. Since cloner aggregates all of the site data into a single file, the Transfer-Encoding header is now dropped when serving.
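A sketch of the fix under the same reasoning; fix_transfer_headers is a hypothetical name, and recomputing Content-Length is an assumption about the companion step, not necessarily what Snare does:

```python
def fix_transfer_headers(headers, body):
    """Drop Transfer-Encoding for a fully-buffered body (illustrative sketch)."""
    cleaned = {
        name: value
        for name, value in headers.items()
        if name.lower() != "transfer-encoding"
    }
    # With chunking gone, a plain Content-Length describes the body instead.
    cleaned["Content-Length"] = str(len(body))
    return cleaned
```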

Since newer versions of libraries might contain crucial vulnerability fixes, it is always good to update them, but this is a hassle for developers and maintainers. Writing requirements.txt without version specifications can pull in breaking changes from newer major versions, while pinning exact versions prevents minor updates and bug fixes. To establish a middle ground, the compatible-release (~=) specifier can be used. Refer to the description of the PR below for a better explanation.
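For example, a requirements.txt using the compatible-release specifier (version numbers illustrative, not Snare's actual pins) would look like:

```text
# requirements.txt
aiohttp~=3.7.4    # allows 3.7.5, 3.7.6, ... but not 3.8.0
pyppeteer~=0.2.5  # allows patch releases within 0.2.x
```

`~=3.7.4` is equivalent to `>=3.7.4, ==3.7.*`, so bug fixes flow in automatically while breaking releases stay out.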

Link to PR: #306

Documentation

Documentation is the backbone of any software. @mzfr suggested the use of docstrings, similar to what had been done in Tanner. Sphinx-format docstrings can be used to autogenerate developer documentation, since Snare's documentation is also built with Sphinx.

Since we have moved past Python 3.5, type hints have also been added.
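A hypothetical example (the function is illustrative, not from Snare) combining a Sphinx-format docstring with type hints:

```python
def get_headers(page: dict) -> list:
    """Extract response headers from a page's meta entry.

    :param page: meta.json entry for a single page
    :return: list of header dictionaries, empty if none were recorded
    """
    return page.get("headers", [])
```

Sphinx's autodoc can then render the :param: and :return: fields directly into the developer documentation.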

Link to PRs:

Future

In this 10-week period, there were a few ideas that we discussed but could not proceed with. One such idea was framework integration.

Framework integration

Currently, given a website, Snare clones and serves it, working in tandem with Tanner. The idea here is to leverage Snare's capabilities to communicate with Tanner and prepare responses, and integrate it into another website's source.

At the moment, Flask and Django are good candidates for integration since Snare is written in Python.

This idea is in its infancy and thus, an approach can be decided only after a healthy amount of discussion. Please visit Snare's issues section for further discussion on this.