Transcription D-LUCEA

This GitHub repository contains the code used to transcribe parts of the Database of the Longitudinal Utrecht Collection of English Accents (D-LUCEA) with OpenAI's Whisper speech-to-text models. The main aim was to transcribe only the relevant parts of each recording. The repository also includes scripts that process the transcripts to make them easier to use in the subsequent corpus analysis using regular expressions in Python.

Getting Started

Clone this repository to get a copy on your PC.

git clone https://github.com/UtrechtUniversity/transcription-d-lucea.git

Prerequisites

To install and run this project, you need the following:

ffmpeg
pandas
librosa
soundfile
whisper
whisperX

We recommend creating the transcripts on a GPU-powered computer (or server). With a recent GPU, the state-of-the-art Whisper models can produce transcripts in a few seconds. At Utrecht University you can get access to a virtual research environment (VRE) with a GPU via Research Data Management (RDM) Support.

Installation

We recommend using Conda to create a Python environment for this project.
First, follow the setup instructions for WhisperX to create an environment.

Next, install Whisper and ffmpeg by following the setup instructions for Whisper.

Lastly, install librosa and soundfile using pip:

pip install librosa
pip install soundfile

On Linux, you also need to install libsndfile for soundfile to work (the command below is for Debian and Ubuntu):

sudo apt-get install libsndfile1
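
After installation, it can help to confirm that everything is visible from the active environment. The following is a minimal sketch, not part of the repository: it assumes the Python import names `pandas`, `librosa`, `soundfile`, `whisper` and `whisperx`, and that ffmpeg should be reachable on the PATH. The function name `check_prerequisites` is hypothetical.

```python
import importlib.util
import shutil

# Assumed import names; the pip package "openai-whisper" is imported as "whisper".
PYTHON_MODULES = ["pandas", "librosa", "soundfile", "whisper", "whisperx"]


def check_prerequisites(modules=PYTHON_MODULES):
    """Return a dict mapping each prerequisite to True (found) or False (missing)."""
    # find_spec locates a module without importing it, so this stays fast.
    status = {name: importlib.util.find_spec(name) is not None for name in modules}
    # ffmpeg is a command-line tool, so look for it on the PATH instead.
    status["ffmpeg"] = shutil.which("ffmpeg") is not None
    return status


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name:10s} {'ok' if found else 'MISSING'}")
```

Any entry reported as MISSING points to an installation step above that still needs to be completed in this environment.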

Project structure

When reading data files, the scripts assume the project structure below. They use relative paths, so they should run out of the box if you use the same structure:

.
├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt
├── src              <- main folder for all source code
│   └── trim_and_transcribe.ipynb
├── data             <- All project data, not published in git
│   ├── audio_files
│   ├── trimmed_audio
│   └── Recordings.xlsx         
└── output
    └── transcripts
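
The layout above can be expressed with relative paths as in the sketch below. This is an illustration under assumptions, not the notebook's actual code: the function names `project_paths` and `prepare_folders` are hypothetical, and it is assumed that `trimmed_audio` and `transcripts` are folders the scripts write to, while `audio_files` and `Recordings.xlsx` are inputs you provide.

```python
from pathlib import Path


def project_paths(root):
    """Build the data and output paths from the project root (layout as in the tree above)."""
    root = Path(root)
    return {
        "audio_files": root / "data" / "audio_files",
        "trimmed_audio": root / "data" / "trimmed_audio",
        "recordings": root / "data" / "Recordings.xlsx",
        "transcripts": root / "output" / "transcripts",
    }


def prepare_folders(paths):
    """Create the folders that are written to and return the names of missing inputs."""
    # Assumption: these two folders hold generated files, so creating them is safe.
    paths["trimmed_audio"].mkdir(parents=True, exist_ok=True)
    paths["transcripts"].mkdir(parents=True, exist_ok=True)
    # Inputs are only checked, never created.
    return [key for key in ("audio_files", "recordings") if not paths[key].exists()]


# Example, assuming the code runs from src/ so the project root is one level up:
# missing = prepare_folders(project_paths(".."))
# print("Missing inputs:", missing or "none")
```

Because the data folder is not published in git, a check like this makes it clear which inputs still have to be copied into place before running the notebook.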

Usage

The Jupyter notebook used for creating transcripts, trim_and_transcribe.ipynb, can be found in the src/ folder.
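
For orientation, the general idea of trimming a recording to its relevant part and transcribing only that part might look like the sketch below. It is not taken from the notebook: the function names, the example file name `speaker01.wav`, the time stamps and the choice of the `base` Whisper model are all assumptions made for illustration. It uses `librosa.load` with `offset` and `duration`, `soundfile.write`, and Whisper's `load_model` and `transcribe`.

```python
from pathlib import Path


def trim_audio(in_path, out_path, start_s, end_s):
    """Cut the span from start_s to end_s (in seconds) out of in_path and save it."""
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    # Imported lazily so the lightweight helpers work without the audio libraries.
    import librosa
    import soundfile as sf

    # sr=None keeps the file's native sampling rate instead of resampling.
    audio, sr = librosa.load(str(in_path), sr=None, offset=start_s,
                             duration=end_s - start_s)
    sf.write(str(out_path), audio, sr)
    return out_path


def transcribe(audio_path, model_name="base"):
    """Transcribe one audio file with Whisper and return the plain text."""
    import whisper

    model = whisper.load_model(model_name)  # "base" is an assumed model size
    result = model.transcribe(str(audio_path))
    return result["text"].strip()


def transcript_path(audio_path, out_dir):
    """Map an audio file to a .txt file with the same stem in out_dir."""
    return Path(out_dir) / (Path(audio_path).stem + ".txt")


# Example with hypothetical file names and time stamps, run from src/:
# trimmed = trim_audio("../data/audio_files/speaker01.wav",
#                      "../data/trimmed_audio/speaker01.wav", 30.0, 95.0)
# text = transcribe(trimmed)
# transcript_path(trimmed, "../output/transcripts").write_text(text, encoding="utf-8")
```

Trimming before transcription keeps the model from spending time on parts of the recording that are not needed for the analysis, which matches the main aim described at the top of this README.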

About the Project

Date: Month Year

Researcher(s):

Name of researcher 1 (researcher.1@uu.nl)
Name of researcher 2 (researcher.2@uu.nl)

Research Software Engineer(s):

Name of RSE 1 (rse.1@uu.nl)
Name of RSE 2 (rse.2@uu.nl)

License

The code in this project is released under MIT license.

Attribution and academic use

Contributing

Contributions are what make the open-source community an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

To contribute:

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request
