VitalRecordAnalyzer

This repository hosts the code of our paper French vital records data gathering and analysis through image processing and machine learning algorithms.

Abstract: Vital records are rich in meaningful historical data concerning city as well as countryside inhabitants, which can be used, among other things, to study former populations and thus reveal the social, economic and demographic characteristics of those populations. However, such studies face one main difficulty in collecting the data they need, since most of these records are scanned documents that require a manual transcription step before all the data can be gathered and exploited from a historical point of view. This step consequently slows down historical research and is an obstacle to a better knowledge of population habits depending on social conditions. Therefore, in this paper, we present a modular and self-sufficient analysis pipeline, using state-of-the-art algorithms and mostly independent of the document layout, that aims to automate this data extraction process.

If you find this code useful and use it in your own work, please cite the following paper:

@article{plateauholleville:hal-03189188,
  TITLE = {{French vital records data gathering and analysis through image processing and machine learning algorithms}},
  AUTHOR = {Plateau-Holleville, Cyprien and Bonnot, Enzo and Gechter, Franck and Heyberger, Laurent},
  URL = {https://hal.archives-ouvertes.fr/hal-03189188},
  JOURNAL = {{Journal of Data Mining and Digital Humanities}},
  PUBLISHER = {{Episciences.org}},
  VOLUME = {2021},
  YEAR = {2021},
  MONTH = Jul,
  DOI = {10.46298/jdmdh.7327},
  KEYWORDS = {Handwritten Text Recognition ; Machine Learning ; Optical Character Recognition ; Historical Data},
  HAL_ID = {hal-03189188},
  HAL_VERSION = {v3},
}

Features

DatasetBuilder

  • Automatic segmentation of a text sample image
  • GUI to help the user transcribe image text into ASCII quickly
    • Shortcuts (Enter to validate the transcription, Escape to skip the current image)
  • Produces a JSON file containing the transcribed text and the segmented fragments in the selected folder

CivilRegistryAnalyser

  • Automatic segmentation of a text sample image
  • Text recognition on the segmented fragments
  • Text feature extraction with NLP

Installation

Dependencies

  • OpenCV (tested with >= 4.0.1)
  • Qt (tested with >= 5.11.2)
  • Python 3

You might also need CUDA if you want to use TensorFlow's GPU version.

In order to use a conda environment inside the software, you need to set the following environment variables (a quick sanity-check sketch follows the list):

  • PYTHONHOME = <PathToAnacondaInstallation>\Anaconda\envs\<TheEnvironmentYouWantToUse>
  • PYTHONPATH =
    • <PathToAnacondaInstallation>\Anaconda\envs\<TheEnvironmentYouWantToUse>\DLLs\
    • <PathToAnacondaInstallation>\Anaconda\envs\<TheEnvironmentYouWantToUse>\Lib\
    • <PathToAnacondaInstallation>\Anaconda\envs\<TheEnvironmentYouWantToUse>\Lib\site-packages
  • Add the following paths in your PATH :
    • <PathToAnacondaInstallation>\Anaconda\envs\<TheEnvironmentYouWantToUse>\Library\bin
    • <PathToAnacondaInstallation>\Anaconda\envs\<TheEnvironmentYouWantToUse>\Scripts
    • <PathToAnacondaInstallation>\Anaconda\envs\<TheEnvironmentYouWantToUse>\Library
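
The following sketch is not part of the repository; it is a quick way to check, from the conda environment you plan to use, which concrete paths the variables above should contain. Run it with that environment's Python interpreter.

# Sanity-check sketch: print the paths that PYTHONHOME and PYTHONPATH should point to.
import os
import sys

print("PYTHONHOME should be:", sys.prefix)
print("PYTHONPATH entries should include:")
for sub in ("DLLs", "Lib", os.path.join("Lib", "site-packages")):
    print("  ", os.path.join(sys.prefix, sub))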

Clone and Build

git clone --recurse-submodules -j8 https://github.com/PlathC/CivilRegistryAnalyser.git
cd CivilRegistryAnalyser
pip install -r requirements.txt

mkdir build && cd build
cmake ..
cmake --build .

Assets

For any dataset other than the one created with the DatasetBuilder, you can follow this tutorial

Dataset transformation

Once the image transcription is finished and you have enough data, create a raw folder in the py folder and a civil folder inside the raw folder. Once these folders are created, put all the folders created by the DatasetBuilder and the transcription.json file in the civil folder. Then run transform.py, which will create the HDF5 file that is read during training. You will find this file in the data folder that has been created.
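
As an illustration, here is a small helper sketch (not part of the repository) that prepares this folder layout before running transform.py. The location of the DatasetBuilder output is a placeholder; only py/raw/civil and transcription.json come from the description above.

# Helper sketch: copy the DatasetBuilder output into py/raw/civil before running transform.py.
import shutil
from pathlib import Path

dataset_builder_output = Path("datasetbuilder_output")  # hypothetical path to the DatasetBuilder results
civil = Path("py") / "raw" / "civil"
civil.mkdir(parents=True, exist_ok=True)

# Copy every folder produced by the DatasetBuilder, plus transcription.json, into the civil folder.
for entry in dataset_builder_output.iterdir():
    if entry.is_dir():
        shutil.copytree(entry, civil / entry.name, dirs_exist_ok=True)
shutil.copy2(dataset_builder_output / "transcription.json", civil / "transcription.json")
# transform.py can then be run to build the HDF5 file in the newly created data folder.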

Model training

To train the model on the dataset you have just created, simply run training.py.

If, when training your model on the dataset created with our dataset builder, you get an error such as No valid path found for the loss calculator, go to the definition of the ctc_batch_cost() function used at line 261 in ctc_loss_lambda_func() in src/network/model.py. Inside ctc_batch_cost(), add the parameter ignore_longer_outputs_than_inputs=True to the ctc.ctc_loss() call, which should be on lines 5763-5764.
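
If you prefer not to edit the installed Keras sources, the sketch below shows the equivalent fix as a local replacement for ctc_batch_cost that forwards ignore_longer_outputs_than_inputs=True. It assumes a TensorFlow backend and is only a sketch of the idea, not code from the repository.

# Sketch of a ctc_batch_cost variant that tolerates outputs longer than inputs.
import tensorflow as tf
from tensorflow.keras import backend as K

def ctc_batch_cost_ignore_longer(y_true, y_pred, input_length, label_length):
    label_length = tf.cast(tf.squeeze(label_length, axis=-1), tf.int32)
    input_length = tf.cast(tf.squeeze(input_length, axis=-1), tf.int32)
    sparse_labels = tf.cast(K.ctc_label_dense_to_sparse(y_true, label_length), tf.int32)
    # CTC expects time-major log-probabilities.
    y_pred = tf.math.log(tf.transpose(y_pred, perm=[1, 0, 2]) + K.epsilon())
    loss = tf.compat.v1.nn.ctc_loss(labels=sparse_labels,
                                    inputs=y_pred,
                                    sequence_length=input_length,
                                    ignore_longer_outputs_than_inputs=True)
    return tf.expand_dims(loss, 1)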

Model testing

Once the training is finished, you can run tests to check whether the trained model performs well. To do so, just run test.py. If you want to replace the model used in the CivilRegistryAnalyser, replace the checkpoint_weights.hdf5 file in the py folder with the one created in output/civil/flor.
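
As an example, the weight replacement can be done with a couple of lines of Python; the two paths below are the ones mentioned above.

# Sketch: swap the newly trained weights into the py folder used by CivilRegistryAnalyser.
import shutil
from pathlib import Path

new_weights = Path("output") / "civil" / "flor" / "checkpoint_weights.hdf5"
shutil.copy2(new_weights, Path("py") / "checkpoint_weights.hdf5")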

EAST model

For text segmentation, we used the EAST (An Efficient and Accurate Scene Text Detector) model with a dataset that we created. All the files used for the EAST model and its training can be found in the py/EAST folder. To train the model, use multigpu_train.py and provide the dataset path in icdar.py through the training_data_path flag. Once the model has been trained, the graph can be frozen with convert_graph.py and then used with OpenCV to run text detection through the eval_frozen_graph.py file.
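
For illustration, here is a minimal sketch of running a frozen EAST graph through OpenCV's DNN module, in the spirit of eval_frozen_graph.py. The file names and the output layer names are assumptions based on the reference EAST graph and may differ for a graph frozen with convert_graph.py.

# Sketch: text detection with a frozen EAST graph through OpenCV's DNN module.
import cv2

net = cv2.dnn.readNet("frozen_east_model.pb")   # hypothetical path to the frozen graph
image = cv2.imread("record_page.jpg")           # hypothetical input scan
height, width = image.shape[:2]
# EAST expects input dimensions that are multiples of 32.
new_w, new_h = (width // 32) * 32, (height // 32) * 32

blob = cv2.dnn.blobFromImage(image, 1.0, (new_w, new_h),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
# Score map and rotated-box geometry, using the output names of the reference EAST graph.
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                "feature_fusion/concat_3"])
# scores[0, 0] holds per-cell text confidences; geometry can be decoded into boxes
# and filtered with non-maximum suppression.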

Based on

Contributors

Enzo Bonnot

Cyprien Plateau--Holleville
