Swedish parliamentary proceedings --- 1867--today --- v2024.09.13

Westac Project, 2020--2024 | Swerik Project, 2023--2025

The data set

The full data set consists of multiple parts, which are version-controlled independently of each other. For convenience, the most up-to-date versions of these data sets are zipped and made available as a package on the release page whenever there is an update. These components are:

records_vX.X.X.zip (swerik-project/riksdagen-records/) -- Parliamentary records (riksdagens protokoll) from 1867 until today in the Parla-clarin format
persons_vX.X.X.zip (swerik-project/riksdagen-persons/) -- Comprehensive list of members of parliament, ministers and governments during this period + associated metadata (mandate periods, party info, etc)
dumps_v20XX.XX.XX.zip -- various files containing merged / filtered / wrangled (meta)data
[coming soon] -- An annotated catalog of motions submitted to the parliament with linked metadata
[coming soon] -- An annotated catalog of interpellation questions submitted to the government and interpellation debates within the parliament

Version compatibility

The table below records the semantically versioned repositories that are known to be compatible at the time of each dated release here:

Dated Release	Repository Versions
v2024.09.13	pyriksdagen: v1.4.0; riksdagen-persons: v1.1.0; riksdagen-records: v1.2.0
v2024.06.19	pyriksdagen: v1.2.0; riksdagen-persons: v1.1.0; riksdagen-records: v1.1.0
v2024.04.26	pyriksdagen: v1.2.0; riksdagen-persons: v1.0.0; riksdagen-records: v1.0.0
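
When pinning versions in a script, the table can be expressed as a simple lookup. The following is a minimal sketch with the values copied from the table; the `COMPATIBLE` dictionary and the `compatible_versions` helper are illustrative names and not part of pyriksdagen.

```python
# Known-compatible repository versions per dated release (copied from the table).
COMPATIBLE = {
    "v2024.09.13": {"pyriksdagen": "v1.4.0", "riksdagen-persons": "v1.1.0", "riksdagen-records": "v1.2.0"},
    "v2024.06.19": {"pyriksdagen": "v1.2.0", "riksdagen-persons": "v1.1.0", "riksdagen-records": "v1.1.0"},
    "v2024.04.26": {"pyriksdagen": "v1.2.0", "riksdagen-persons": "v1.0.0", "riksdagen-records": "v1.0.0"},
}


def compatible_versions(release):
    """Return the repository versions recorded as compatible with a dated release."""
    try:
        return COMPATIBLE[release]
    except KeyError:
        raise ValueError(f"no compatibility record for release {release!r}") from None


print(compatible_versions("v2024.06.19")["pyriksdagen"])
```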

Basic use

The most recent version of the data can be found here. It has the following structure:

Annual Parliamentary record (protocol) files organized in subdirectories according to parliament years
Structured metadata on members of parliament, ministers, and governments

Archives (.zip files) can be downloaded, extracted, and used in whatever way suits your needs. We offer some examples and tools for working with the corpus in Python and R.
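
Extraction needs nothing beyond the Python standard library. The sketch below assumes an archive has already been downloaded; the file name `records_vX.X.X.zip` is the placeholder used above and the target directory `corpus` is a hypothetical choice, so substitute your own.

```python
import zipfile
from pathlib import Path


def extract_archive(archive, target_dir):
    """Extract a downloaded release archive and return the names of its members."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        return zf.namelist()


if __name__ == "__main__":
    # Placeholder file name; replace it with the archive you downloaded.
    archive = Path("records_vX.X.X.zip")
    if archive.exists():
        names = extract_archive(archive, Path("corpus"))
        print(f"extracted {len(names)} entries")
```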

Pyriksdagen: a Python module

Pyriksdagen is a Python module developed in parallel with the corpus and designed specifically for working with it. It can be installed from PyPI in the ordinary way:

(venv) ~$ pip install pyriksdagen

A simple workflow is demonstrated in this Google Colab notebook.
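
Because the records are TEI XML, they can also be read with any XML library. The sketch below uses only the standard library rather than pyriksdagen, and assumes that utterances are encoded as TEI `u` elements with a `who` attribute and `seg` children, as is conventional in Parla-CLARIN. The sample document is invented for illustration; real records contain considerably more structure.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented miniature record, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <note type="speaker">Herr EXAMPLE:</note>
    <u who="person-1"><seg>First segment.</seg><seg>Second segment.</seg></u>
    <u who="unknown"><seg>Another utterance.</seg></u>
  </div></body></text>
</TEI>"""


def iter_utterances(root):
    """Yield (speaker id, text) for every <u> element below root."""
    for u in root.iterfind(".//tei:u", TEI_NS):
        segs = ["".join(seg.itertext()) for seg in u.iterfind("tei:seg", TEI_NS)]
        # Join the segments and collapse runs of whitespace.
        yield u.get("who"), " ".join(" ".join(segs).split())


root = ET.fromstring(SAMPLE)
for who, text in iter_utterances(root):
    print(who, "->", text)
```

For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring(SAMPLE)`.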

rcr: an R module

rcr is an R package for working with the corpus. To install it, run:

library(remotes)
remotes::install_github('swerik-project/rcr')

As a first step, we point to the directory where the corpus files are stored.

set_riksdag_corpora_path("[THE PATH TO THE CORPORA HERE]")

To extract speeches, we use extract_speeches_from_records(). Below is an example that assumes that the corpora path has been set and extracts the speeches from three different records.

fps <-
  c("protocols/1896/prot-1896--ak--042.xml",
    "protocols/1951/prot-1951--fk--029.xml",
    "protocols/1975/prot-1975--036.xml")
sp <- extract_speeches_from_records(fps)
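
The record paths above encode a year, an optional chamber code and a running number. The Python sketch below parses them, assuming the naming pattern generalizes from the three example paths; the regular expression and the `parse_record_path` name are illustrative and not part of rcr or pyriksdagen. The codes `ak` and `fk` presumably stand for Andra kammaren and Första kammaren, the two chambers of the bicameral Riksdag, which would explain why the 1975 file has no chamber part.

```python
import re
from typing import NamedTuple, Optional


class RecordId(NamedTuple):
    year: str
    chamber: Optional[str]  # e.g. "ak" or "fk"; None when the name has no chamber part
    number: int


# Pattern inferred from the three example paths; other naming variants may exist.
_PATTERN = re.compile(
    r"prot-(?P<year>\d+)--(?:(?P<chamber>[a-z]+)--)?(?P<number>\d+)\.xml$"
)


def parse_record_path(path):
    """Split a record file path into year, chamber and running number."""
    m = _PATTERN.search(path)
    if m is None:
        raise ValueError(f"unrecognized record file name: {path!r}")
    return RecordId(m["year"], m["chamber"], int(m["number"]))


for p in ["protocols/1896/prot-1896--ak--042.xml",
          "protocols/1975/prot-1975--036.xml"]:
    print(parse_record_path(p))
```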

Design choices of the project

The Riksdagen corpus is released iteratively, as the corpus is continuously curated and expanded. Semantic versioning is used for the whole corpus, following the established major-minor-patch practices as they apply to data. For each major and minor release, a battery of unit tests is run and a statistical sample is drawn, annotated and quantitatively evaluated to ensure the integrity and quality of the updated data. Errors are fixed as they are detected, in order of priority. Moreover, the edit history is kept as a traceable git repository.

While the contents of the corpus will change due to curation and expansion, we aim to keep the deliverable API, the corpus/ folder, as stable as possible. This means we avoid relocating files or folders, changing formats, changing columns in metadata files, or any other changes that might break downstream scripts. Conversely, files outside the corpus/ folder are internal to the project. End users may find utility in them but we make no effort to keep them consistent.

The data in the corpus is delivered as TEI XML files to follow established practices. The metadata is delivered as CSV files, following a normal form database structure while allowing for a legible git history. A more detailed description of the data and metadata structure and formats can be found in the README files in the corpus/ folder.

Documentation

Documentation and example usage of Pyriksdagen and rcr can be found in their respective repositories. Additionally, some documentation about the curation process can be found in the scripts repository.

Descriptive statistics at a glance

Currently, we have an extensive set of Parliamentary Records (Riksdagens Protokoll) from 1867 until now. We are in the process of preparing Motions for inclusion in the corpus and other document types will follow.

	v2024.09.13	v2024.06.19	v2024.04.26
Corpus size (GB)	11.21	11.17	11.06
Number of parliamentary records	17935	17935	17800
Total parliamentary record pages*	1067858	1067858	1056361
Total parliamentary record speeches	1033991	1034498	1022014
Total parliamentary record words	455943546	450383213	446349968
Number of Motions	0	0	0
Total motion pages	0	0	0
Total motion words	0	0	0
Number of people with MP role	5975	5975	5975
Number of people with minister role	546	546	546

* Digital original parliamentary records for some years in the 1990s are not paginated and thus do not contribute to the page count. See also §Number of Pages in Parliamentary Records.
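
A few derived quantities follow directly from the table. The worked example below uses only figures copied from the v2024.09.13 and v2024.04.26 columns; the helper names are illustrative.

```python
# Figures copied from the table above.
stats = {
    "v2024.09.13": {"records": 17935, "speeches": 1033991, "words": 455943546},
    "v2024.04.26": {"records": 17800, "speeches": 1022014, "words": 446349968},
}


def words_per_speech(release):
    """Average number of words per speech in a release."""
    s = stats[release]
    return s["words"] / s["speeches"]


def relative_growth(field, old, new):
    """Relative change of a figure between two releases."""
    return (stats[new][field] - stats[old][field]) / stats[old][field]


print(f"{words_per_speech('v2024.09.13'):.0f} words per speech on average")
print(f"{relative_growth('words', 'v2024.04.26', 'v2024.09.13'):.1%} more words than in v2024.04.26")
```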

Parliamentary Records over time

This section plots information about the parliamentary records from the riksdagen-records repository v1.2.0.

Number of Parliamentary Records

Number of Pages in Parliamentary Records

Number of Speeches in Parliamentary Records

Number of Words in Parliamentary Records

Quality assessment

Speech-to-speaker mapping

We check how many speakers in the parliamentary records our algorithms identify in each release. The figures are from the riksdagen-records repository v1.2.0.

Correct number of MPs over time

We check the number of MPs with a mandate on a given day against the baseline number of MPs that we know should be sitting in parliament. The data are from the riksdagen-persons repository v1.1.0.

This plot illustrates the mean daily number of MPs in the metadata compared to the baseline.

For more granularity, the plot below shows a box plot distribution of the daily number of MPs in each year against the baseline; the boxes are mostly not visible, as they lie tightly underneath the mean line (red). Colored dots represent outlier days.
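
The check can be pictured as counting, for a given day, the mandate periods that cover it and comparing that count with the expected number of seats. The Python sketch below uses invented mandate periods and an invented baseline; the tuple layout is hypothetical and does not reflect the actual riksdagen-persons metadata schema.

```python
from datetime import date

# Hypothetical mandate periods: (person id, start, end), with end inclusive.
mandates = [
    ("p1", date(1900, 1, 1), date(1902, 12, 31)),
    ("p2", date(1900, 1, 1), date(1901, 6, 30)),
    ("p3", date(1901, 7, 1), date(1902, 12, 31)),
]


def mps_on(day, mandates):
    """Return the set of people holding a mandate on the given day."""
    return {pid for pid, start, end in mandates if start <= day <= end}


def deviation_from_baseline(day, mandates, baseline):
    """Positive: more MPs in the metadata than expected; negative: fewer."""
    return len(mps_on(day, mandates)) - baseline


print(deviation_from_baseline(date(1901, 1, 15), mandates, baseline=2))
```

A deviation of zero on every day would mean the metadata matches the baseline exactly; nonzero days are the ones worth inspecting.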

Segment classification

The parliamentary records are subdivided into various components, including utterances, notes, and speaker introductions. As of the riksdagen-records repository v1.0.0, the segment classification accuracy was 0.9499.

OCR accuracy

As of v1.0.0 of the riksdagen-records corpus, the cumulative character error rate is 0.0311 and the word error rate is 0.0869, i.e., roughly 3 per cent of the characters and 9 per cent of the words are incorrect due to OCR errors.
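
Character and word error rates are conventionally defined as the edit distance between the OCR output and a reference transcription, divided by the length of the reference. The Python sketch below illustrates that definition; it is not necessarily how the project's own evaluation is implemented, and the example strings are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, ocr_output):
    """Character error rate: character-level edits per reference character."""
    return edit_distance(reference, ocr_output) / len(reference)


def wer(reference, ocr_output):
    """Word error rate: word-level edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, ocr_output.split()) / len(ref_words)


ref = "riksdagens protokoll"
ocr = "riksdagcns protokoll"  # one misread character
print(cer(ref, ocr), wer(ref, ocr))
```

A single misread character counts once towards the character error rate but makes the whole word wrong, which is why the word error rate is typically the higher of the two. The sketch assumes a non-empty reference.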

Participate!

If you would like to participate in the curation or quality control of data contained in the Swedish Parliament Corpus, please get in touch!

Acknowledgement of support

Westac funding: Vetenskapsrådet 2018-0606
Swerik funding: Riksbankens Jubileumsfond IN22-0003

Last update: 2024-09-13, 06:56:43
