Skip to content

Latest commit

 

History

History
63 lines (48 loc) · 1.96 KB

README.md

File metadata and controls

63 lines (48 loc) · 1.96 KB

finnish-parliament-scripts

Scripts for retrieving and aligning speech and meeting transcripts from the web portal of the Parliament of Finland (https://www.eduskunta.fi)

Dependencies:

  • sox
  • avconv
  • sclite
  • python3
  • python3-lxml
  • wget

ASR system is also required to produce first-pass hypotheses

Download videos and meeting transcripts and save into DATA-FOLDER:

retrieve/retrieve_sessions.py DATA-FOLDER

Four different files will be saved for each session:

  • *.mp4 - video of the session
  • *.wav - audio file stored in wav-format (16kHz,mono)
  • *.transcript - meeting transcript with speaker information for each paragraph
  • *.metadata - metadata file containing date information and links to the original video and meeting transcript

EDIT: Currently the retrieval of the meeting transcripts fails because the publishing format has changed.

Produce first-pass recognition output with an ASR system (preferably train a biased LM with the meeting transcripts).

Store recognition output in the following format:

  • start-time-in-seconds end-time-in-seconds word

Align the first-pass output with the meeting transcript using sclite:

align/asr_align_2_elan.py asr-output transcript-file metadata-filename elan-filename

The output is in the Elan EAF-format.

Test the alignment script with example files:

align/asr_align_2_elan.py test/session_79_2008.asr test/session_79_2008.transcript test/session_79_2008.metadata test/session_79_2008.eaf

Extract individual speech segments from a list of EAF-files:

extract/elan_wav_extractor.py eaf-list wav-segment-dir

Stores both audio file (.wav) and transcript (.trn)

Extract individual speech segments from a list of metadata files:

extract/corpus_extractor.py metadata-file-list 

Stores audio file (.wav)

André Mansikkaniemi, andre.mansikkaniemi@aalto.fi