Releases · bnjmacdonald/hansardparser

Implements a dockerized plenaryparser that parses a TXT or PDF Hansard transcript received via an HTTP POST request.

The plenaryparser invokes two sub-tasks: hansard_line_type4 and hansard_line_speaker_span.

The hansard_line_type4 task predicts whether a line in a Hansard transcript is a "header", "speech", "scene", or "garbage" line. If hansard_line_type4=="rule", the prediction is made via regular expressions. If hansard_line_type4=="supervised", the prediction is made via a served TensorFlow model.
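
As a rough illustration of the "rule" setting, the sketch below classifies a line into one of the four types using regular expressions. The patterns and the function name `line_type4` are assumptions made up for this example; they are not the expressions used by the parser itself.

```python
import re

# Hypothetical patterns, chosen only for illustration.
HEADER_RE = re.compile(r"^[A-Z][A-Z\s,.'-]+$")   # line written entirely in capitals
SCENE_RE = re.compile(r"^\(.*\)$")               # line wrapped in parentheses
GARBAGE_RE = re.compile(r"^[\W\d_]*$")           # empty, or only digits/punctuation


def line_type4(line: str) -> str:
    """Return "header", "speech", "scene", or "garbage" for one transcript line."""
    text = line.strip()
    if GARBAGE_RE.match(text):
        return "garbage"
    if SCENE_RE.match(text):
        return "scene"
    if HEADER_RE.match(text):
        return "header"
    # Anything that matches no rule is treated as speech.
    return "speech"
```

The rules are checked from most to least specific, and "speech" is the fallback, so every line receives exactly one label.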

The hansard_line_speaker_span task labels each character in a Hansard transcript line using BIO tagging, where the "B" tag marks the first character of a speaker name in the line, the "I" tag marks each subsequent character in the speaker name, and the "O" tag marks a character that is not part of the speaker name. If hansard_line_speaker_span=="rule", the prediction is made via regular expressions. If hansard_line_speaker_span=="supervised", the prediction is made via a served TensorFlow model. If hansard_line_speaker_span=="hybrid", a served TensorFlow model first makes a binary prediction of whether the line contains a speaker name; if it does, the BIO tags are then assigned using regular expressions. The "hybrid" approach saves a great deal of inference time.
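
A minimal sketch of character-level BIO tagging may help make this concrete. The speaker pattern, the function names, and the `contains_speaker` callable (a stand-in for the served model's binary prediction) are all assumptions for illustration, not the parser's actual code.

```python
import re
from typing import Callable

# Hypothetical pattern: an honorific followed by capitalized name words
# at the start of the line.
SPEAKER_RE = re.compile(
    r"^\s*((?:Mr|Mrs|Ms|Dr|Hon)\.? [A-Z][\w'-]*(?: [A-Z][\w'-]*)*)"
)


def bio_tags(line: str) -> str:
    """Rule-based tagging: return one tag ("B", "I", or "O") per character."""
    tags = ["O"] * len(line)
    match = SPEAKER_RE.match(line)
    if match:
        start, end = match.span(1)
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return "".join(tags)


def hybrid_bio_tags(line: str, contains_speaker: Callable[[str], bool]) -> str:
    """Hybrid tagging: run the regex only when the binary check says yes."""
    if not contains_speaker(line):
        return "O" * len(line)
    return bio_tags(line)
```

For the line `Mr. Kamau: I beg to move.`, `bio_tags` marks `M` as "B", the remaining eight characters of `Mr. Kamau` as "I", and everything from the colon onward as "O".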

Dockerized supervised- and rule-based plenaryparser

initial release