Releases: fmacpro/horseman-article-parser

0.9.0

14 Nov 18:53
e53e187
  • Allows passing of rules for returning an article's title & content. This is useful when
    the parser is unable to return the desired title or content, e.g.
rules: [
  {
    host: 'www.bbc.co.uk',
    content: () => {
      var j = window.$
      j('article section, article figure, article header').remove()
      return j('article').html()
    }
  },
  {
    host: 'www.youtube.com',
    title: () => {
      return window.ytInitialData.contents.twoColumnWatchNextResults.results.results.contents[0].videoPrimaryInfoRenderer.title.runs[0].text
    },
    content: () => {
      return window.ytInitialData.contents.twoColumnWatchNextResults.results.results.contents[1].videoSecondaryInfoRenderer.description.runs[0].text
    }
  }
]
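
As a rough illustration of how a rule list like the one above could be matched against a page, the sketch below picks the first rule whose `host` equals the URL's hostname. The `findRule` helper and the stubbed rule functions are hypothetical and are not the package's actual lookup logic; in the real rules the functions read from the page's `window`.

```javascript
// Hypothetical helper: return the first rule whose `host` matches the URL's hostname.
// Illustration only; the package's own matching may differ.
function findRule (rules, url) {
  const { hostname } = new URL(url)
  return rules.find(rule => rule.host === hostname) || null
}

// Stub rules: plain strings stand in for values read from the page.
const rules = [
  { host: 'www.bbc.co.uk', content: () => 'custom content' },
  { host: 'www.youtube.com', title: () => 'custom title', content: () => 'custom description' }
]

const rule = findRule(rules, 'https://www.youtube.com/watch?v=abc')

// A rule may define `title`, `content`, or both, so fall back when one is missing.
const title = rule && rule.title ? rule.title() : 'parser default title'
```

Matching on the exact hostname keeps the sketch simple; a rule for `www.bbc.co.uk` would not apply to `bbc.co.uk` under this assumption.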

0.8.54

31 Aug 19:37
b28a125
  • get site icon url

0.8.53

06 Aug 19:30
697034c
  • BBC article scraping fixed
  • Dependencies updated

0.8.52

07 Jan 19:21
4bde87f
  • sidebar keyword removed from unlikely candidates regex & handled unexpected redirects ( fixes #47 )
  • article body identification rules (regexes) moved to options
  • exposed original html of document on response object ( #48 )
  • dependency security updates
  • amended the default puppeteer.goto waitUntil option to be networkidle2 rather than domcontentloaded
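
If the previous behaviour is preferred, the option can presumably be set back explicitly. The fragment below is a sketch that assumes the setting lives at `options.puppeteer.goto.waitUntil`, as the wording of the note suggests; the URL is a placeholder.

```javascript
// Assumption: the option path is options.puppeteer.goto.waitUntil.
const options = {
  url: 'https://www.example.com/article', // placeholder URL
  puppeteer: {
    goto: {
      // 'domcontentloaded' was the previous default; 'networkidle2' is the new one
      waitUntil: 'domcontentloaded'
    }
  }
}
```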

0.8.51

07 Aug 23:54
28d204f
  • dependencies updated

0.8.5

09 Jul 19:32
  • Allow compromise plugins to be passed in
  • Update docs

Compromise is the natural language processor that allows horseman-article-parser to return
topics, e.g. people, places & organisations. You can now pass custom plugins to compromise
to modify or add to the word lists, like so:

// add some names
let testPlugin = function(Doc, world) {
  world.addWords({
    'rishi': 'FirstName',
    'sunak': 'LastName',
  })
}

const options = {
  url: 'https://www.theguardian.com/commentisfree/2020/jul/08/the-guardian-view-on-rishi-sunak-right-words-right-focus-wrong-policies',
  enabled: ['lighthouse', 'screenshot', 'links', 'sentiment', 'entities', 'spelling', 'keywords'],
  nlp: {
    plugins: [testPlugin]
  }
}

This allows us to match, for example, names that are not in the base compromise word lists.

0.8.4

08 Jul 17:12
8a8fc35
  • Removed title manipulation logic

The title manipulation isn't good enough. I think this is better done, if required, in the application using the package, where logic specific to the site being crawled can be applied.

0.8.3

07 Jul 23:41
209934c
  • Refactor title processing

Title processing is now off by default and can be turned on. It can also be configured, as below:

var options = {
  title: {
    useBestTitlePart: true, // true turns on the title processing
    commonSeparatingCharacters: [' | ', ' _ ', ' - ', '«', '»', ' — ', ' — ', ' – '],
    minimumTitlePartLength: 10
  }
}
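
To make the three settings concrete, here is a rough sketch of how a "best title part" could be chosen: split the title on each separator, discard parts shorter than the minimum length, and keep the longest remaining part. The `bestTitlePart` function and the longest-part heuristic are assumptions for illustration, not the package's actual implementation.

```javascript
// Hypothetical sketch of "best title part" selection; not the package's real code.
function bestTitlePart (title, options) {
  const { useBestTitlePart, commonSeparatingCharacters, minimumTitlePartLength } = options
  if (!useBestTitlePart) return title // processing is off unless enabled

  // Split on every configured separator in turn.
  let parts = [title]
  for (const separator of commonSeparatingCharacters) {
    parts = parts.flatMap(part => part.split(separator))
  }

  // Keep only parts that meet the minimum length.
  const candidates = parts
    .map(part => part.trim())
    .filter(part => part.length >= minimumTitlePartLength)
  if (candidates.length === 0) return title

  // Assumption: the longest part is the most likely headline.
  return candidates.reduce((best, part) => (part.length > best.length ? part : best))
}

const titleOptions = {
  useBestTitlePart: true,
  commonSeparatingCharacters: [' | ', ' - '],
  minimumTitlePartLength: 10
}

const result = bestTitlePart('Budget 2020: what it means for you | Politics | Example News', titleOptions)
```

With these inputs, "Politics" is dropped for being under ten characters and the headline wins over "Example News" on length.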

0.8.2

07 Jul 18:53
0d7cf5a
  • Improve title handling

0.8.1

27 Jun 11:43
353db9e
  • use latest puppeteer
  • utilise workaround for stealth plugin compatibility
  • fixed .close causing UnhandledPromiseRejectionWarning: Error: WebSocket is not open: readyState 2 (CLOSING)
  • ignore content security policy