04 BlackWidow

Introduction

BlackWidow is a .NET library based on SharpGrabber. Rather than relying on .NET assemblies, BlackWidow executes scripts written specifically for grabbing.

Why use BlackWidow?

BlackWidow gives you the following advantages over the traditional NuGet package approach:

Always Up-to-date: The scripts are kept up-to-date at runtime, so the functionality of the host application won't break as the sources change - at least not for long!
ECMAScript Support: Supports JavaScript/ECMAScript out of the box.
Easy Maintenance: JavaScript is darn easy to write and understand! This helps contributors quickly write new grabbers or fix existing ones.
Secure: The scripts are executed in a sandbox environment, and they only have access to what the BlackWidow API exposes to them.
Highly Customizable: Almost everything is open for extension or replacement. Make new script interpreters, custom grabber repositories, or roll your own interpreter APIs.

How does it work?

To understand how BlackWidow works, we first need to introduce Grabber Repositories. A Grabber Repository is a collection of scripts written specifically for grabbing, each with its own identifier, version information, and metadata. Grabber Repositories can be stored in and read from various sources, such as a local disk or a GitHub repository.

The BlackWidow library provides a service that takes two repositories: a local one and a remote one. Since grabbers are always loaded locally, the service tries to download new scripts from the remote repository and save them to the local one. It also watches the local repository for changes; whenever a script is added or changed, it reads the JavaScript files, loads them into memory, and registers them as implementations of IGrabber.
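The download step described above amounts to comparing two script listings. The following is a minimal sketch of that idea in JavaScript, not BlackWidow's actual implementation: the planSync function, the { id, version } entry shape, and the plain inequality check on version strings are all assumptions made for illustration.

```javascript
// Hypothetical sketch: decide which scripts to download from the remote repository.
// The entry shape { id, version } is an assumption, not BlackWidow's real feed format.
function planSync(localEntries, remoteEntries) {
  // Index the local scripts by identifier for quick lookup.
  const localById = new Map(localEntries.map(entry => [entry.id, entry]))

  // A script needs downloading when it is missing locally or its version differs.
  return remoteEntries
    .filter(remote => {
      const local = localById.get(remote.id)
      return !local || local.version !== remote.version
    })
    .map(remote => remote.id)
}

const localScripts = [
  { id: 'example-a', version: '1.0' },
  { id: 'example-b', version: '1.0' },
]
const remoteScripts = [
  { id: 'example-a', version: '1.0' }, // unchanged
  { id: 'example-b', version: '1.1' }, // updated remotely
  { id: 'example-c', version: '1.0' }, // new script
]

// Identifiers of the scripts that would be fetched and saved locally.
console.log(planSync(localScripts, remoteScripts))
```

Because grabbers are only ever loaded from the local repository, a plan like this only decides what to copy; the file watcher on the local side is what picks the new files up.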

Installation

Install the NuGet package.

Install-Package SharpGrabber.BlackWidow -Version 1.1

Official Grabber Repository

This GitHub repository provides a collection of grabber scripts. You can find all of the officially available grabber scripts here. All contributions to the grabber scripts should be made in this directory and merged through pull requests.

Usage

1. Create the BlackWidow service

var service = await BlackWidowBuilder.New()
                .SetScriptHost(ScriptHost = new())
                .ConfigureInterpreterService(icfg => icfg.AddJint())
                .ConfigureLocalRepository(cfg => cfg.UsePhysical(@"blackwidow/repo"))
                .ConfigureRemoteRepository(cfg => cfg.UseOfficial())
                .BuildAsync();

SetScriptHost sets the IScriptHost, which is an object with callback methods that are called directly from the script, such as console.log.
ConfigureInterpreterService configures the BlackWidow service to add JavaScript support using Jint.
Using ConfigureLocalRepository and UsePhysical, we specify a local path relative to the current directory.
Using ConfigureRemoteRepository and UseOfficial we declare that our remote source-of-truth is actually the official SharpGrabber GitHub repository.

2. Use the dynamic grabber

The BlackWidow service provides an IGrabber instance. This special implementation of IGrabber is dynamic, meaning that it keeps an always up-to-date internal list of grabbers, each associated with a local script.

// while building the Multi-Grabber
var grabber = GrabberBuilder.New()
               .UseDefaultServices()
               .Add(service)
               .Build();

Writing Scripts

This section describes how one can write grabber scripts in JavaScript.

Basics

A grabber script should expose two methods to its BlackWidow host, named supports and grab - as if implementing IGrabber. The host defines a variable named grabber in the global scope. The script should set both methods on this object.

grabber.supports (url: string): bool

This method tests if this grabber supports the input URL.

grabber.supports = url => /^https?:\/\/(www\.)?example\.com/.test(url)
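The one-line pattern above also accepts hosts that merely begin with example.com, such as example.com.evil.test. A slightly stricter variant is sketched below. Outside BlackWidow there is no grabber global, so the first line creates a stand-in object; that stand-in and the type check on url are additions for illustration, not part of the BlackWidow API.

```javascript
// Stand-in for the `grabber` global that the BlackWidow host would normally define.
const grabber = {}

// Accept http or https, an optional "www." prefix, and require that the host
// name ends right after "example.com" (followed by /, ?, #, a port, or nothing).
const pattern = /^https?:\/\/(www\.)?example\.com([\/?#:]|$)/i

// Guard against non-string input so the method always answers with a boolean.
grabber.supports = url => typeof url === 'string' && pattern.test(url)

console.log(grabber.supports('https://www.example.com/watch/1'))
console.log(grabber.supports('https://example.com.evil.test/'))
```

Keeping supports cheap and free of side effects is a sensible design choice here, since a host that holds many grabbers may need to ask each of them about the same URL.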

grabber.grab (request: GrabRequest, result: GrabResult): bool

This method processes the request by grabbing information from the provided URL, and updates result with what it finds.

grabber.grab = (request, result) => {
    const url = request.url
    result.title = '<title of the video!>'
    result.grab('info', {
        author: 'Some guy',
        length: 60, // 1m
    })
    // ...
}
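To try a grab method outside the host, the request and result objects can be replaced with simple test doubles. The sketch below uses only the members shown above (request.url, result.title, and result.grab). The makeFakeResult helper, the stand-in grabber object, and returning true to signal success are assumptions for illustration; the real GrabRequest and GrabResult types may expose more than this.

```javascript
// Stand-in for the `grabber` global normally provided by the BlackWidow host.
const grabber = {}

grabber.grab = (request, result) => {
  const url = request.url
  result.title = `Video at ${url}`
  result.grab('info', {
    author: 'Some guy',
    length: 60, // 1m
  })
  // Assumption: `true` tells the host that grabbing succeeded.
  return true
}

// Hypothetical test double that records what the script grabbed.
function makeFakeResult() {
  const grabbed = []
  return {
    title: null,
    grabbed,
    grab(type, data) {
      grabbed.push({ type, data })
    },
  }
}

const request = { url: 'https://example.com/watch/1' }
const result = makeFakeResult()
const ok = grabber.grab(request, result)

console.log(ok, result.title, result.grabbed)
```

Recording calls in a plain array makes it easy to check what a script produced without running the BlackWidow service.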

Work in progress
