CocoCrawler 🥥

CocoCrawler is an easy-to-use web crawler, scraper, and parser in C#. By combining PuppeteerSharp and AngleSharp it brings the best of both worlds and merges them into an easy-to-use API.

It provides a simple API to get started:

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://old.reddit.com/r/csharp", pageOptions => pageOptions
        .ExtractList(containersSelector: "div.thing.link.self", [
            new("Title","a.title"),
            new("Upvotes", "div.score.unvoted"),
            new("Datetime", "time", "datetime"),
            new("Total Comments","a.comments"),
            new("Url","a.title", "href")
        ])
        .AddPagination("span.next-button > a")
        .ConfigurePageActions(options => // Only to show the possibilities; not needed to run the sample
        {
            options.ScrollToEnd();
            options.Wait(2000);
            // options.Click("span.next-button > a");
        })
        .AddOutputToConsole()
        .AddOutputToCsvFile("results.csv")
    )
    .ConfigureEngine(options =>
    {
        options.UseHeadlessMode(false);
        options.PersistVisitedUrls();
        options.WithLoggerFactory(loggerFactory);
        options.WithCookies([
            new("auth-cookie", "l;alqpekcoizmdfugnvkjgvsaaprufc", "thedomain.com")
        ]);
    })
    .BuildAsync(cancellationToken);

await crawlerEngine.RunAsync(cancellationToken);

This example starts at https://old.reddit.com/r/csharp, scrapes all the posts, then continues to the next page and scrapes it again, page after page. Everything scraped is output to the console and a CSV file.

With this library it's easy to

  • Scrape Single Page Apps
  • Scrape Listings
  • Add pagination
  • As an alternative to lists, open each post, scrape the page, and continue with pagination
  • Scrape multiple pages in parallel
  • Add custom outputs
  • Customize Everything

Scraping pages

Each added Page (a Page is a single URL job) can have Tasks attached to it. For each Page it's possible to:

.ExtractObject(...)

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://github.com/", pageOptions => pageOptions
        .ExtractObject([
            new(Name: "Title", Selector: "div.title > a > span"),
            new(Name: "Description", Selector: "div.title > a > span"),
        ]))
    .BuildAsync(cancellationToken);

This scrapes the title and description of the page and outputs them.

.ExtractList(...)

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://github.com/", pageOptions => pageOptions
        .ExtractList(containersSelector: "div > div.repos", [
            new(Name: "Title", Selector: "div.title > a > span"),
            new(Name: "Description", Selector: "div.title > a > span"),
        ]))
    .BuildAsync(cancellationToken);

ExtractList scrapes a list of objects. The containersSelector is the selector for the container that holds the objects, and all selectors after that are relative to the container. Each object in the list is individually sent to the output.

.OpenLinks(...)

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://github.com/", pageOptions => pageOptions
        .OpenLinks(linksSelector: "div.example-link-to-repose", subPage => subPage
            .ExtractObject([
                new("Title","div.sitetable.linklisting a.title"),
            ])))
    .BuildAsync(cancellationToken);

OpenLinks opens each link matched by linksSelector and scrapes that page. It's usually combined with .ExtractObject(...) and .AddPagination(...). linksSelector expects a list of a tags. It's also possible to chain multiple .OpenLinks(...) calls, as in the sketch below.
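For instance, a chained OpenLinks setup could look like the following sketch; the selectors (a.example-category-link, a.example-item-link, h1.example-title) are hypothetical placeholders, not selectors from a real site:

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://github.com/", pageOptions => pageOptions
        // First level: open every category link found on the start page.
        .OpenLinks("a.example-category-link", categoryPage => categoryPage
            // Second level: on each category page, open every item link and scrape it.
            .OpenLinks("a.example-item-link", itemPage => itemPage
                .ExtractObject([
                    new("Title", "h1.example-title"),
                ]))))
    .BuildAsync(cancellationToken);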

.AddPagination(...)

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://github.com/", pageOptions => pageOptions
        .ExtractList(containersSelector: "div > div.repos", [
            new(Name: "Title", Selector: "div.title > a > span"),
            new(Name: "Description", Selector: "div.title > a > span"),
        ])
        .AddPagination("span.next-button > a"))
    .BuildAsync(cancellationToken);

AddPagination adds pagination to the page. It expects a selector that points to the next page, usually the Next button.

Multiple Pages

It's possible to add multiple pages to scrape with the same Tasks.

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPages(["https://old.reddit.com/r/csharp", "https://old.reddit.com/r/dotnet"], pageOptions => pageOptions
        .OpenLinks("div.thing.link.self a.bylink.comments", subPageOptions =>
        {
            subPageOptions.ExtractObject([
                new("Title", "div.sitetable.linklisting a.title"),
                new("Url", "div.sitetable.linklisting a.title", "href"),
                new("Upvotes", "div.sitetable.linklisting div.score.unvoted"),
                new("Top comment", "div.commentarea div.entry.unvoted div.md"),
            ]);
            subPageOptions.ConfigurePageActions(ops =>
            {
                ops.ScrollToEnd();
                ops.Wait(4000);
            });
        })
        .AddPagination("span.next-button > a"))
    .BuildAsync(cancellationToken);

await crawlerEngine.RunAsync(cancellationToken);

This example starts at https://old.reddit.com/r/csharp and https://old.reddit.com/r/dotnet, opens each post, and scrapes the title, URL, upvotes, and top comment. It also scrolls to the end of the page and waits 4 seconds before scraping. Then it continues to the next page via pagination.

PageActions - A way to interact with the browser

Page Actions are a way to interact with the browser, for example to click away popups or scroll to the bottom of the page. They can be added to each page. The following actions are available:

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://github.com/", pageOptions => pageOptions
        .ExtractList(containersSelector: "div > div.repos", [
            new(Name: "Title", Selector: "div.title > a > span"),
            new(Name: "Description", Selector: "div.title > a > span"),
        ])
        .ConfigurePageActions(ops =>
        {
            ops.ScrollToEnd();
            ops.Click("button#load-more");
            ops.Wait(4000);
        }))
    .BuildAsync(cancellationToken);

Outputs

It's possible to add multiple outputs to the engine. The following outputs are available:

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://github.com/", pageOptions => pageOptions
        .OpenLinks(linksSelector: "div.example-link-to-repose", subPage => subPage
            .ExtractObject([
                new("Title", "div.sitetable.linklisting a.title"),
            ]))
        .AddOutputToConsole()
        .AddOutputToCsvFile("results.csv"))
    .BuildAsync(cancellationToken);

You can add your own output by implementing the ICrawlOutput interface.

public interface ICrawlOutput
{
    Task Initiaize(CancellationToken cancellationToken);
    Task WriteAsync(JObject jObject, CancellationToken cancellationToken);
}

Initialize is called once before the engine starts. WriteAsync is called for each object that is scraped.
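As a minimal sketch (not part of the library), the MyCustomOutput class used in the next example could append each scraped object as a JSON line to a file. It implements the interface exactly as declared above, assumes JObject is Newtonsoft.Json.Linq's JObject, and the file name is arbitrary:

using Newtonsoft.Json.Linq;

public class MyCustomOutput : ICrawlOutput
{
    private const string FilePath = "results.jsonl";

    // Called once before the engine starts; begin with an empty file.
    public Task Initiaize(CancellationToken cancellationToken)
    {
        File.WriteAllText(FilePath, string.Empty);
        return Task.CompletedTask;
    }

    // Called for each scraped object; append it as a single JSON line.
    public async Task WriteAsync(JObject jObject, CancellationToken cancellationToken)
    {
        var line = jObject.ToString(Newtonsoft.Json.Formatting.None) + Environment.NewLine;
        await File.AppendAllTextAsync(FilePath, line, cancellationToken);
    }
}

If the engine runs multiple browser tabs in parallel, WriteAsync may be called concurrently, so a production output would likely add synchronization around the file append; that is omitted here for brevity.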

At the Page level it's possible to add custom outputs:

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("", p => p.AddOutput(new MyCustomOutput()))
    .BuildAsync(cancellationToken);

Configuring the Engine

Cookies

It's possible to add cookies to all requests:

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage(...)
    .ConfigureEngine(options =>
    {
        options.WithCookies([
            new("auth-cookie", "l;alqpekcoizmdfugnvkjgvsaaprufc", "thedomain.com"),
            new("Cookie2", "def", "localhost")
        ]);
    })
    .BuildAsync(cancellationToken);

Setting the User Agent

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage(...)
    .ConfigureEngine(options =>
    {
        options.WithUserAgent("linux browser - example user agent");
    })
    .BuildAsync(cancellationToken);

The default User Agent is that of a Chrome browser.

Ignoring URLs

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage(...)
    .ConfigureEngine(options =>
    {
        options.WithIgnoreUrls(["https://example.com", "https://example2.com"]);
    })    
    .BuildAsync(cancellationToken);

Stopping the engine

The engine stops when either:

  • The total number of pages to crawl is reached.
  • 2 minutes have passed since the last job was added.
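Independently of these conditions, the cancellation token passed to BuildAsync and RunAsync can be used to stop the crawl early. A minimal sketch with a timeout-based CancellationTokenSource (the 10-minute value is arbitrary):

// Cancel the crawl after at most 10 minutes.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(10));

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage(...) // page setup elided, as in the examples above
    .BuildAsync(cts.Token);

await crawlerEngine.RunAsync(cts.Token);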

Persisting visited pages

It's possible to persist visited pages to a file. Once persisted, the engine will skip those pages on the next run.

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage(...)
    .ConfigureEngine(options =>
    {
        options.PersistVisitedUrls();
    })
    .BuildAsync(cancellationToken);

Other notable options

The engine can be configured with the following options:

  • UseHeadlessMode(bool headless): Whether the browser should run headless.
  • WithLoggerFactory(ILoggerFactory loggerFactory): The logger factory to use, to enable logging.
  • TotalPagesToCrawl(int total): The total number of pages to crawl.
  • WithParallelismDegree(int parallelismDegree): The number of browser tabs it can open in parallel.
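As a sketch combining these options (the values are arbitrary), including a logger factory created with Microsoft.Extensions.Logging (AddConsole comes from the Microsoft.Extensions.Logging.Console package):

using Microsoft.Extensions.Logging;

using var loggerFactory = LoggerFactory.Create(builder => builder.AddConsole());

var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage(...) // page setup elided
    .ConfigureEngine(options =>
    {
        options.UseHeadlessMode(true);            // run the browser without a visible window
        options.WithLoggerFactory(loggerFactory); // enable logging
        options.TotalPagesToCrawl(500);           // stop after 500 pages
        options.WithParallelismDegree(5);         // up to 5 browser tabs in parallel
    })
    .BuildAsync(cancellationToken);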

Extensibility

The library is designed to be extensible. It's possible to add custom IParser, IScheduler, IVisitedUrlTracker and ICrawler implementations.

Using the engine builder, it's possible to add custom implementations:

.ConfigureEngine(options =>
{
    options.WithCrawler(new MyCustomCrawler());
    options.WithScheduler(new MyCustomScheduler());
    options.WithParser(new MyCustomParser());
    options.WithVisitedUrlTracker(new MyCustomVisitedUrlTracker());
})

Interface            Description
IParser              Uses AngleSharp by default. If you want to use something other than CSS selectors, override this.
IVisitedUrlTracker   The default is an in-memory tracker; persisting to a file is also possible. Both options are available in the library.
IScheduler           Holds the current jobs.