This project explored and extracted trends from historical reporting on climate events in the Gazette de Lausanne, a daily newspaper whose archive comprises over 4 million articles spanning 1798 to 1990. Given the scale of the data, we leveraged natural language processing (NLP) techniques to process it efficiently.
The core of the project was the development of a purpose-built, manually annotated dataset and the creation of a tailored language model (LM), CatastroBERT. This LM identified approximately 15,000 pertinent articles, demonstrating not only a high degree of precision and efficiency but also an ability to generalize and detect extreme weather events in years not covered by its training data. This robustness underscores CatastroBERT's potential for a wide array of future research applications.
CatastroBERT and its experimental multilingual variant, CatastroBERT-M, are now available on HuggingFace for future research. While CatastroBERT-M shows promise, it may require further tuning to perform well across languages. This project provides tools and insights for ongoing and future research in the field.
See the notebook for a detailed example of how to use CatastroBERT on larger datasets.
Before you begin, make sure you have Python installed on your system. This script was tested with Python 3.8 and above.
It is recommended to use a virtual environment for Python projects to manage dependencies effectively. You can set up a virtual environment as follows:
python -m venv myenv
source myenv/bin/activate # On Windows use `myenv\Scripts\activate`
Then install the required packages with the following command:
pip install transformers torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "epfl-dhlab/CatastroBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Run on the GPU if one is available, otherwise on the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
model.eval()
def predict(text):
    # Tokenize the text, truncating to the model's 512-token limit
    inputs = tokenizer(
        text,
        add_special_tokens=True,
        padding=True,
        max_length=512,
        truncation=True,
        return_tensors='pt'
    )
    ids = inputs['input_ids'].to(device)
    mask = inputs['attention_mask'].to(device)

    # Get predictions
    with torch.no_grad():
        outputs = model(ids, attention_mask=mask)
        logits = outputs.logits

    # Apply the sigmoid function to turn the logit into a probability
    probs = torch.sigmoid(logits).cpu().numpy()

    # Return the probability of the positive class (1)
    return probs[0][0]
# Example usage ("A violent hurricane passed over Lausanne last night.")
text = "Un violent ouragan est passé cette nuit sur Lausanne."
print(f"Prediction: {predict(text)}")