Wikipedia scraper and API that serves historical battles, commanders and their factions.

sasalatart/battles-and-commanders

Battles and Commanders

About

Battles and Commanders is a Wikipedia scraper and API that serves historical battles, commanders and their factions. You can try the API and read its documentation here. This project was built with Go and Postgres.

This project is still a work in progress. Keep in mind that scrapers are brittle: they depend on the HTML structure of the pages they scrape, so if Wikipedia changes its markup, this scraper may stop working until it is updated.

Development setup

This project requires Docker and docker-compose to be installed.

In development, Docker is used together with Make to let you spin up your local environment without worrying about complex commands and installing dependencies. You may still run the code natively with Go without Docker, although you will lose some features such as auto-reload. For a list and description of all the available Make commands, just run make help.

Most settings can be changed by editing config/config.yaml, although you will probably not need to. You might, however, want to override some of them, such as the database password.

Scraper

# Run the scraper inside a Docker container
$ make scrape

The resulting data.json file at the root dir of this project will contain normalized battles, factions and commanders. You may use this file for seeding the API (see next section), or for some other project.
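If you want to reuse data.json in another project, a first step is to check what it contains. The sketch below is not part of this project: it assumes only that the top-level value of data.json is a JSON object, and it lists that object's keys without assuming anything else about the schema. The helper name topLevelKeys is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"sort"
)

// topLevelKeys decodes a JSON object without assuming its schema and
// returns its top-level keys in sorted order. It returns an error if the
// input is not valid JSON or if the top-level value is not an object.
func topLevelKeys(raw []byte) ([]string, error) {
	var doc map[string]json.RawMessage
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil, err
	}
	keys := make([]string, 0, len(doc))
	for k := range doc {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys, nil
}

func main() {
	// Assumption: data.json sits in the current directory, as produced by "make scrape".
	raw, err := os.ReadFile("data.json")
	if err != nil {
		fmt.Println("could not read data.json:", err)
		return
	}
	keys, err := topLevelKeys(raw)
	if err != nil {
		fmt.Println("could not decode data.json:", err)
		return
	}
	fmt.Println(keys)
}
```

Decoding into json.RawMessage defers parsing of the nested values, so you can inspect the keys first and then write typed structs for the parts you need.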

API

# Turn on the API (http://localhost:3000) and a Postgres container (port 14000). The API has
# auto-reload configured
$ make dev_up

# Run the seeder (just needed once). Alternatively, you may run "make dev_seed_local" if you have
# the scraper results file in the root dir of this project
$ make dev_seed_url

# (Optional) remove Docker containers and volumes created by "make dev_up"
$ make dev_destroy

Installing for use with your own Go projects

Some of the functionality shared by the scraper and the API lives in the pkg dir and can be used outside this project. To install it, run:

$ go get github.com/sasalatart/batcoms

Some usage examples include:

  1. Scraping a list of potential battles (names and urls only, false positives may be included):

    package main

    import (
       "fmt"

       "github.com/sasalatart/batcoms/pkg/logger"
       "github.com/sasalatart/batcoms/pkg/scraper/list"
    )

    func main() {
       loggerService := logger.NewDiscard() // Or anything that implements logger.Interface
       potentialBattles := list.Scrape(loggerService)
       // Print the results so the variable is used (Go rejects unused variables)
       fmt.Println(potentialBattles)
    }

  2. Scraping specific battles:

    package main

    import (
       "fmt"
       "log"

       "github.com/sasalatart/batcoms/pkg/logger"
       "github.com/sasalatart/batcoms/pkg/scraper/battles"
    )

    func main() {
       loggerService := logger.NewDiscard() // Or anything that implements logger.Interface
       scraperService := battles.NewScraper(loggerService)

       austerlitz, err := scraperService.ScrapeOne("https://en.wikipedia.org/wiki/Battle_of_Austerlitz")
       if err != nil {
          log.Fatal(err)
       }
       fmt.Printf("%+v\n", austerlitz) // Do something with the normalized Battle of Austerlitz...

       actium, err := scraperService.ScrapeOne("https://en.wikipedia.org/wiki/Battle_of_Actium")
       if err != nil {
          log.Fatal(err)
       }
       fmt.Printf("%+v\n", actium) // Do something with the normalized Battle of Actium...

       // Each battle contains normalized data (ids of factions and commanders), so we export everything
       data := scraperService.Data()
       // Do something with data.BattlesByID, data.FactionsByID and/or data.CommandersByID...
       _ = data
    }

  3. Cleaning scraped text:

    package main

    import (
       "fmt"

       "github.com/sasalatart/batcoms/pkg/strclean"
    )

    func main() {
       input := "Soviet victory:[1]\n\nDestruction of the German 6th Army"
       output := strclean.Apply(input) // "Soviet victory: Destruction of the German 6th Army"
       fmt.Println(output)
    }

  4. Parsing Wikipedia's Info Box date text (accuracy improvements are still WIP...):

    package main

    import (
       "fmt"
       "log"

       "github.com/sasalatart/batcoms/pkg/dates"
    )

    func main() {
       d1 := "January-March 309 B.C."
       parsed1, err := dates.Parse(d1)
       if err != nil {
          log.Fatal(err)
       }
       // parsed1 holds:
       // []dates.Historic{
       //   dates.Historic{ Year: 309, Month: 1, Day: 0, IsBCE: true },
       //   dates.Historic{ Year: 309, Month: 3, Day: 0, IsBCE: true },
       // }
       fmt.Printf("%+v\n", parsed1) // Do something with parsed1...

       d2 := "July 6, 1950; 69 years ago (1950-07-06)"
       parsed2, err := dates.Parse(d2)
       if err != nil {
          log.Fatal(err)
       }
       // parsed2 holds:
       // []dates.Historic{dates.Historic{ Year: 1950, Month: 7, Day: 6, IsBCE: false }}
       fmt.Printf("%+v\n", parsed2) // Do something with parsed2...
    }
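
Once dates are parsed, you will often want to display them. The following is a minimal sketch, not part of the pkg/dates package: it defines a local Historic struct that mirrors the four fields shown in the example output above (the int field types are an assumption), and it assumes that a zero Month or Day means that part of the date is unknown, as Day: 0 suggests for "January-March 309 B.C.".

```go
package main

import (
	"fmt"
	"time"
)

// Historic is a local stand-in that mirrors the fields shown in the
// dates.Parse example output. The field types are an assumption.
type Historic struct {
	Year  int
	Month int
	Day   int
	IsBCE bool
}

// String renders the date with as much precision as is available.
// Assumption: a zero Month or Day means "unknown".
func (h Historic) String() string {
	era := "CE"
	if h.IsBCE {
		era = "BCE"
	}
	switch {
	case h.Month == 0:
		return fmt.Sprintf("%d %s", h.Year, era)
	case h.Day == 0:
		return fmt.Sprintf("%s %d %s", time.Month(h.Month), h.Year, era)
	default:
		return fmt.Sprintf("%s %d, %d %s", time.Month(h.Month), h.Day, h.Year, era)
	}
}

func main() {
	fmt.Println(Historic{Year: 309, Month: 3, IsBCE: true}) // March 309 BCE
	fmt.Println(Historic{Year: 1950, Month: 7, Day: 6})     // July 6, 1950 CE
}
```

Implementing fmt.Stringer keeps the formatting rule in one place, so any fmt call prints the date consistently.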

Testing

# Shell 1: Turn on the API in test mode (http://localhost:8888) and a Postgres container (port 14001)
$ make test_up

# Shell 2: Run the actual tests inside the container
$ make test

Just like when running the API in dev mode, you may run the make test_destroy command to remove Docker containers and volumes created for running tests.

Credits

Special thanks to Wikipedia and the content creators who have provided the historical data scraped and served by this app, which is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.

License

Copyright (c) 2020, Sebastián Salata Ruiz-Tagle

Battles and Commanders is MIT licensed.
