Releases: cverluise/PatCit
🏷️ v0.3.1
0.3.0
🏷 v0.3.0
Data
- Major improvement of `bibliographical_reference` schema (harmonize grobid & crossref) for seamless analysis
- Enrichment of `intext.patent`
- Add domain specific front page tables (`norm_standard`, `database`, `wiki`)
Community
- Revisit BQ project architecture
- Add Colab notebooks integration
- Revisit `README.md`
Code
- Lighter API
- Lighter dependencies
Models
- Add information extraction models
- Add models and training data DVC support
Validation
- Validation of in-text extraction models
Thanks
Special thanks to:
- Gabriele Cristelli (EPFL)
- Kyle Higham (Hitotsubashi University)
- Lucas Violon (HEC Paris)
v0.2-npl
🏷 v0.2-npl
v0.2-npl introduces 2 major improvements:
- `npl_class` field. This field is predicted using a multi-class text classification model based on the spaCy `TextCategorizer`, with the NPL text as input. See the focus section and model binaries below.
- Propagate `ISSN` using `title_j` to bibliographical references with the same `title_j` but no match.
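The `ISSN` propagation described above can be illustrated with a minimal sketch. It assumes references are plain dictionaries carrying `title_j` and `ISSN` keys, and that a title is only used for propagation when it maps to exactly one ISSN. Both the `propagate_issn` helper and the one-ISSN rule are assumptions made for illustration, not the project's actual implementation.

```python
from collections import defaultdict


def propagate_issn(references):
    """Fill missing ISSN values from references sharing the same title_j.

    Hypothetical helper: returns new dicts and leaves the input untouched.
    """
    # Collect the ISSNs observed for each journal title.
    issn_by_title = defaultdict(set)
    for ref in references:
        if ref.get("title_j") and ref.get("ISSN"):
            issn_by_title[ref["title_j"]].add(ref["ISSN"])

    propagated = []
    for ref in references:
        candidates = issn_by_title.get(ref.get("title_j"), set())
        # Assumption: only propagate when the title maps to a single ISSN,
        # so that ambiguous journal titles are left unmatched.
        if not ref.get("ISSN") and len(candidates) == 1:
            ref = {**ref, "ISSN": next(iter(candidates))}
        propagated.append(ref)
    return propagated


# Placeholder records (titles and ISSN values are made up for the example).
refs = [
    {"title_j": "Journal A", "ISSN": "1234-5678"},
    {"title_j": "Journal A", "ISSN": None},
    {"title_j": "Journal B", "ISSN": None},
]
filled = propagate_issn(refs)
```

Here the second record inherits the ISSN of the first, while the third stays unmatched because no reference with the same `title_j` carries an ISSN.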
Focus on npl_class
`en_core_web_sm_npl-class-ensemble-0.8`
- `ensemble` model (bow+cnn with bagging)
- trained on 80% of the "gold" dataset and evaluated on the remaining 20% (hold-out)
See models/npl_class_training/ for more.
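For readers unfamiliar with hold-out evaluation, the 80/20 split can be sketched as follows. The release notes do not say how the split was drawn (for example whether it was stratified by class), so the shuffling, the seed, and the `holdout_split` name are assumptions.

```python
import random


def holdout_split(examples, train_share=0.8, seed=0):
    """Shuffle examples and split them into a training and a hold-out part."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_share))
    return shuffled[:cut], shuffled[cut:]


# 100 dummy example ids stand in for the labelled "gold" dataset.
train, held_out = holdout_split(range(100))
```

The model is fitted on `train` only, and the metrics are computed on `held_out`, which the model never saw during training.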
Average performance
accuracy | precision | recall | f1 |
---|---|---|---|
0.9 | 0.89 | 0.88 | 0.88 |
Class performance
class | precision | recall | f1 |
---|---|---|---|
BIBLIOGRAPHICAL_REFERENCE | 0.92 | 0.95 | 0.93 |
SEARCH_REPORT | 1.0 | 0.92 | 0.96 |
OFFICE_ACTION | 0.99 | 0.93 | 0.96 |
DATABASE | 0.89 | 0.73 | 0.8 |
WEBPAGE | 0.53 | 0.53 | 0.53 |
PATENT | 0.91 | 0.94 | 0.93 |
NA | 1.0 | 1.0 | 1.0 |
PRODUCT_DOCUMENTATION | 0.44 | 0.43 | 0.44 |
NORM_STANDARD | 0.86 | 0.6 | 0.71 |
LITIGATION | 0.25 | 0.11 | 0.15 |
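The f1 column is the harmonic mean of precision and recall. A small sketch makes it possible to sanity-check a row from the table (the `f1_score` function below is written for illustration and is not taken from the project's code). Because the table reports rounded values, recomputing f1 from them can differ from the reported figure in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows taken from the class performance table above.
database_f1 = round(f1_score(0.89, 0.73), 2)    # DATABASE
litigation_f1 = round(f1_score(0.25, 0.11), 2)  # LITIGATION
```

Both values agree with the table (0.8 and 0.15 respectively).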
en_core_web_sm_npl-class-ensemble-1.0
Same as `en_core_web_sm_npl-class-ensemble-0.8` but trained on the full dataset to maximize performance. This is the model used to create the `npl_class` field.
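A spaCy text classifier exposes its predictions as a dictionary of label scores on `doc.cats`. The sketch below shows how a single `npl_class` value could be picked from such scores. Taking the highest-scoring label is an assumption about how the field was derived, `top_label` is a hypothetical helper, and the scores are hand-written so that the example runs without the model binaries.

```python
def top_label(cats):
    """Return the (label, score) pair with the highest score."""
    if not cats:
        raise ValueError("empty category scores")
    return max(cats.items(), key=lambda item: item[1])


# With a loaded spaCy pipeline `nlp`, the scores would come from `nlp(text).cats`.
# The values below are illustrative, not actual model output.
example_cats = {"BIBLIOGRAPHICAL_REFERENCE": 0.91, "PATENT": 0.06, "WEBPAGE": 0.03}
label, score = top_label(example_cats)
```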