.. image:: https://img.shields.io/badge/OS-Climate-blue
   :alt: An OS-Climate Project
   :target: https://os-climate.org/
.. image:: https://img.shields.io/badge/slack-osclimate-brightgreen.svg?logo=slack
   :alt: Join OS-Climate on Slack
   :target: https://os-climate.slack.com
.. image:: https://img.shields.io/badge/GitHub-100000?logo=github&logoColor=white
   :alt: Source code on GitHub
   :target: https://github.com/ModeSevenIndustrialSolutions/osc-transformer-based-extractor
.. image:: https://img.shields.io/pypi/v/osc-transformer-based-extractor.svg
   :alt: PyPI package
   :target: https://pypi.org/project/osc-transformer-based-extractor/
.. image:: https://api.cirrus-ci.com/github/os-climate/osc-transformer-based-extractor.svg?branch=main
   :alt: Build Status
   :target: https://cirrus-ci.com/github/os-climate/osc-transformer-based-extractor
.. image:: https://img.shields.io/badge/PDM-Project-purple
   :alt: Built using PDM
   :target: https://pdm-project.org/latest/
.. image:: https://img.shields.io/badge/-PyScaffold-005CA0?logo=pyscaffold
   :alt: Project generated with PyScaffold
   :target: https://pyscaffold.org/
===============================
osc-transformer-based-extractor
===============================
This repository contains a set of scripts and notebooks for processing data, training a sentence transformer model, and running inference to determine whether a paragraph (context) is relevant to a given question. Below is a detailed description of each file and folder included in this repository.
## How to Use This Repository
1. **Get Training Data**:
- You need data from the curator module, which is used to train the model. The curator output is a CSV file with the following structure:
| question | context | label | company | source_file | source_page | kpi_id | year | answer | data_type | annotator | Index |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----- | ------- | ----------------------------- | ----------- | ------ | ---- | ----------- | --------- | --------------------- | ----- |
| What is the company name? | The Company is exposed to a risk of losses if counterparties fail to meet their contractual financial obligations when due, and in particular depends on the reliability of banks with which the Company deposits its available cash. | 0 | NOVATEK | 04_NOVATEK_AR_2016_ENG_11.pdf | ['0'] | 0 | 2016 | PAO NOVATEK | TEXT | train_anno_large.xlsx | 1022 |
- If you have CSV data from the curator module, run `make_training_data_from_curator.py` to process it and save the result in the `Data` folder.
- Alternatively, you can use `make_sample_training_data.ipynb` to generate sample data from a sample CSV file.
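If you want to script this step yourself rather than use the provided utilities, the sketch below shows one way to reduce a curator CSV to the question/context/label columns used for training. It is a minimal illustration only; the file paths are placeholders, and it assumes the curator export contains at least the columns shown in the table above.
```python
import pandas as pd

# Placeholder paths -- adjust to your curator export and your local Data folder.
curator_csv = "curator_output.csv"
training_csv = "Data/train_data.csv"

# The curator CSV is assumed to contain at least the columns shown above.
df = pd.read_csv(curator_csv)

# Keep only the fields needed for relevance classification.
train_df = df[["question", "context", "label"]].dropna()

# Labels are assumed to be integer relevance flags (0 = not relevant, 1 = relevant).
train_df["label"] = train_df["label"].astype(int)

train_df.to_csv(training_csv, index=False)
print(f"Wrote {len(train_df)} training rows to {training_csv}")
```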
2. **Train the Model**:
- Use `train_sentence_transformer.ipynb` or `train_sentence_transformer.py` to train a sentence transformer model with the processed data from the `Data` folder and save it locally. Follow the steps in the notebook or script to configure and start the training process.
- To train the model using function calling:
```python
from train_sentence_transformer import fine_tune_model

fine_tune_model(
    data_path="data/train_data.csv",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    num_labels=2,
    max_length=512,
    epochs=2,
    batch_size=4,
    output_dir="./saved_models_during_training",
    save_steps=500,
)
```
**Parameters**:
- `data_path (str)`: Path to the training data CSV file.
- `model_name (str)`: Pre-trained model name from HuggingFace.
- `num_labels (int)`: Number of labels for the classification task.
- `max_length (int)`: Maximum sequence length.
- `epochs (int)`: Number of training epochs.
- `batch_size (int)`: Batch size for training.
- `output_dir (str)`: Directory to save the trained models.
- `save_steps (int)`: Number of steps between saving checkpoints.
- To train the model from the command line, run `fine_tune.py` with the required arguments:
```bash
python fine_tune.py \
    --data_path "data/train_data.csv" \
    --model_name "sentence-transformers/all-MiniLM-L6-v2" \
    --num_labels 2 \
    --max_length 512 \
    --epochs 2 \
    --batch_size 4 \
    --output_dir "./saved_models_during_training" \
    --save_steps 500
```
3. **Perform Inference**:
- Use `inference_demo.ipynb` to perform inferences with your trained model. Specify the model and tokenizer paths (either local or from HuggingFace) and run the notebook cells to see the results.
- For programmatic inference, you can use the function provided in `inference.py`:
```python
from inference import get_inference

result = get_inference(
    question="What is the relevance?",
    context="This is a sample paragraph.",
    model_path="path/to/model",
    tokenizer_path="path/to/tokenizer",
)
```
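To score many question/paragraph pairs at once, `get_inference` can be wrapped in a simple loop. The snippet below is only a sketch: the CSV path and model directory are placeholders, and it assumes `get_inference` returns a single prediction per call (check `inference.py` for the exact return type).
```python
import pandas as pd

from inference import get_inference

# Placeholder paths -- point these at your own data and trained model/tokenizer.
model_path = "./saved_models_during_training"
tokenizer_path = "./saved_models_during_training"

rows = pd.read_csv("Data/train_data.csv")

predictions = []
for _, row in rows.iterrows():
    # Score each question/context pair with the trained relevance model.
    predictions.append(
        get_inference(
            question=row["question"],
            context=row["context"],
            model_path=model_path,
            tokenizer_path=tokenizer_path,
        )
    )

rows["predicted_label"] = predictions
rows.to_csv("Data/predictions.csv", index=False)
```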
## Repository Contents
### Python Scripts
1. **`inference.py`**
- This script contains the function to make inferences using the trained model.
- **Usage**: Import this script and use the provided function to predict the relevance of new data.
- **Example**:
```python
from inference import get_inference

result = get_inference(
    question="What is the relevance?",
    context="This is a sample paragraph.",
    model_path="path/to/model",
    tokenizer_path="path/to/tokenizer",
)
```
**Parameters**:
- `question (str)`: The question for inference.
- `context (str)`: The context paragraph to be analyzed.
- `model_path (str)`: Path to the pre-trained model.
- `tokenizer_path (str)`: Path to the tokenizer of the pre-trained model.
2. **`train_sentence_transformer.py`**
- This script defines a function to train a sentence transformer model, which can be called from other scripts or notebooks.
- **Usage**: Import and call the `fine_tune_model` function to train your model.
- **Example**:
```python
from train_sentence_transformer import fine_tune_model

fine_tune_model(
    data_path="data/train_data.csv",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    num_labels=2,
    max_length=512,
    epochs=2,
    batch_size=4,
    output_dir="./saved_models_during_training",
    save_steps=500,
)
```
**Parameters**:
- `data_path (str)`: Path to the training data CSV file.
- `model_name (str)`: Pre-trained model name from HuggingFace.
- `num_labels (int)`: Number of labels for the classification task.
- `max_length (int)`: Maximum sequence length.
- `epochs (int)`: Number of training epochs.
- `batch_size (int)`: Batch size for training.
- `output_dir (str)`: Directory to save the trained models.
- `save_steps (int)`: Number of steps between saving checkpoints.
3. **`fine_tune.py`**
- This script allows you to train a sentence transformer model from the command line.
- **Usage**: Run this script from the command line with the necessary arguments.
- **Example**:
```bash
python fine_tune.py \
    --data_path "data/train_data.csv" \
    --model_name "sentence-transformers/all-MiniLM-L6-v2" \
    --num_labels 2 \
    --max_length 512 \
    --epochs 2 \
    --batch_size 4 \
    --output_dir "./saved_models_during_training" \
    --save_steps 500
```
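Whichever route you use for training, the directory passed as `output_dir` / `--output_dir` is what you later point `get_inference` at. The example below assumes the model and tokenizer are saved there in the standard HuggingFace `save_pretrained` layout; if checkpoints are written to numbered subfolders (e.g. `checkpoint-500`), pass that subfolder instead.
```python
from inference import get_inference

# Assumes the trained model and tokenizer were saved under output_dir in the
# standard HuggingFace save_pretrained layout (adjust the path to a specific
# checkpoint subfolder if needed).
model_dir = "./saved_models_during_training"

result = get_inference(
    question="What is the company name?",
    context="This is a sample paragraph from an annual report.",
    model_path=model_dir,
    tokenizer_path=model_dir,
)
print(result)
```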
### Jupyter Notebooks
1. **`inference_demo.ipynb`**
- A notebook to demonstrate how to perform inferences using a custom model and tokenizer.
- **Features**: Allows specifying model and tokenizer paths, which can be local paths or HuggingFace paths.
- **Usage**: Open this notebook and follow the instructions to test inference with your own models.
2. **`train_sentence_transformer.ipynb`**
- A notebook to train a sentence transformer model and save the trained model locally.
- **Usage**: Open and execute this notebook to train your model using the prepared data and save the trained model for inference.
### Data Folder
- **`Data/`**
- This folder contains the processed training data obtained from the `curator` module. It serves as the input for training the sentence transformer model.
## Setting Up the Environment
To set up the working environment for this repository, follow these steps:
1. **Clone the repository**:
```bash
git clone https://github.com/yourusername/folder-relevance-detector.git
cd folder-relevance-detector
```
2. **Create a new virtual environment and activate it**:
```bash
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```
3. **Install PDM**:
```bash
pip install pdm
```
4. **Sync the environment using PDM**:
```bash
pdm sync
```
5. **Add any new library**:
```bash
pdm add <library-name>
```
## Requirements
- Python 3.x
- Jupyter Notebook
- Required Python packages (install via `pdm` as described above)
## Contributing
Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.
All contributions (including pull requests) must agree to the Developer Certificate of Origin (DCO) version 1.1. This is the same certificate created and used by the Linux kernel developers, posted at <http://developercertificate.org/>. It certifies that the developer has the right to submit the patch for inclusion in the project. Simply submitting a contribution implies this agreement; however, please include a "Signed-off-by" tag in every patch (this tag is a conventional way to confirm that you agree to the DCO).
---
For further details and documentation, please refer to the individual scripts and notebooks provided in this repository.
---