This GitHub repository includes code for the Dorothy AI Patent Classifier:
- Data generation and preprocessing
- Data location and summary
- Machine learning model
- Evaluation
- Visualization
- Web app
- Other
Step 1: We generate our dataset from all granted patents up to September 2019; the dataset contains 4,363,544 patents in total. To regenerate this dataset, run:
$ sbatch /pylon5/sez3a3p/yyn1228/json_process_jobs/json_process_sin_*.job
or manually sbatch each job from
/pylon5/sez3a3p/yyn1228/json_process_jobs/json_process_sin_a.job
to
/pylon5/sez3a3p/yyn1228/json_process_jobs/json_process_sin_h.job
The extracted dataset is stored in /pylon5/sez3a3p/yyn1228/data/json_reparse; this path is defined in the file database_reparse.py.
Step 2: We parse the cpc field into the labels we need (section, class, subclass, etc.), convert the text into a list of tokens, and split the data into train, valid, and test sets at a ratio of 8:1:1. This step also removes all punctuation and converts all uppercase letters to lower case. This can be done by running data_preprocess/text_preprocess.py, for example:
$ python3 -u data_preprocess/text_preprocess.py \
/pylon5/sez3a3p/yyn1228/data/json_reparse \
/pylon5/sez3a3p/yyn1228/data/all_data
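The core per-document transformation is simple; the sketch below illustrates it under the assumption that tokenization is plain whitespace splitting after punctuation removal (the actual implementation lives in data_preprocess/text_preprocess.py and may differ in details):

import re
import random

def tokenize(text):
    # Lowercase, strip punctuation, and split on whitespace.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return text.split()

def split_dataset(records, seed=42):
    # Shuffle and split into train/valid/test at an 8:1:1 ratio.
    random.Random(seed).shuffle(records)
    n_train, n_valid = int(0.8 * len(records)), int(0.1 * len(records))
    return (records[:n_train],
            records[n_train:n_train + n_valid],
            records[n_train + n_valid:])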
Step 3: We further preprocess the data into a format that can be used by the machine learning libraries. This can be done by running the file data_preprocess/create_training_data.py. Note that the file takes 6 arguments:
- input directory
- output directory
- text field: 'title', 'abstraction', 'claims', 'brief_summary' ('description' is too large to include)
- level name: 'section', 'class', 'subclass', 'main_group', 'subgroup'
- whether to remove stop words: True means remove stop words
- whether to follow the FastText format: True means FastText format, False means Tencent format
For example:
$ python3 -u data_preprocess/create_training_data.py \
/pylon5/sez3a3p/yyn1228/data/all_data \
/pylon5/sez3a3p/yyn1228/data/all_summary_fasttext_group \
brief_summary main_group false true
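For reference, the two output formats differ as follows; the record below is a made-up example, not real data, and the Tencent JSON schema shown is the one documented by NeuralClassifier:

import json

tokens = ["a", "method", "for", "tilling", "soil"]   # hypothetical tokenized brief summary
labels = ["A01B"]                                    # hypothetical CPC label(s)

# FastText format: one plain-text line per document, labels prefixed with __label__.
fasttext_line = " ".join("__label__" + l for l in labels) + " " + " ".join(tokens)

# Tencent NeuralClassifier format: one JSON object per line.
tencent_line = json.dumps({"doc_label": labels, "doc_token": tokens,
                           "doc_keyword": [], "doc_topic": []})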
Processed data after Step 1, which includes 91 files, most of which contain 50,000 patents each.
/pylon5/sez3a3p/yyn1228/data/json_reparse
Processed data after Step 2, which includes three files: train.json, valid.json, and test.json.
/pylon5/sez3a3p/yyn1228/data/all_data
Smaller datasets for valid and test: created by shuffling valid.json and test.json above and taking the first 60,000 records. These data have the following fields:
- all_labels: all true labels at the lowest subgroup level
- title, abstraction, claims, brief_summary, description: the text of each patent text field, split into a list of tokens
/pylon5/sez3a3p/yyn1228/data/all_data_small
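A single record in these files looks roughly like the following (all values here are invented for illustration):

record = {
    "all_labels": ["A01B1/02", "A01B1/022"],              # lowest-level (subgroup) labels
    "title": ["hand", "tool", "for", "tilling"],
    "abstraction": ["a", "hand", "tool", "comprising"],
    "claims": ["a", "hand", "tool", "wherein"],
    "brief_summary": ["the", "present", "invention", "relates"],
    "description": ["in", "one", "embodiment"],
}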
Processed data after Step 3 for the brief summary at the subclass level, in Tencent's format. Note that stop words have been removed from these data.
/pylon5/sez3a3p/yyn1228/data/all_summary_nonstop
Processed data after Step 3 for the brief summary at all levels, in FastText's format. Note that these data include stop words.
/pylon5/sez3a3p/yyn1228/data/all_summary_fasttext_section
/pylon5/sez3a3p/yyn1228/data/all_summary_fasttext_class
/pylon5/sez3a3p/yyn1228/data/all_summary_fasttext (this is subclass)
/pylon5/sez3a3p/yyn1228/data/all_summary_fasttext_group
/pylon5/sez3a3p/yyn1228/data/all_summary_fasttext_subgroup
Smaller datasets initially used for testing purposes. Note that these data were generated by legacy code and may not be easily reproduced.
/pylon5/sez3a3p/yyn1228/data/summary_only
/pylon5/sez3a3p/yyn1228/data/summary_only_fasttext
/pylon5/sez3a3p/yyn1228/data/summary_only_nonstop
This section describes how we use various libraries to train machine learning models. All models are trained using the brief summary text field.
We use Facebook's FastText library to train the well-known FastText model. This method first converts words into word embeddings and then averages the word embeddings to create the document embedding. Note that this does not consider the order of the words; to keep some information about word order, it adds 2-grams to the vocabulary. Because this model is relatively simple and Facebook uses many tricks to speed up training, training can be done on CPUs instead of GPUs in a couple of hours. To account for the hierarchical information, we borrow the idea from HFT-CNN: we first train the section level and then pass its word embeddings to the next level as pretrained word embeddings.
To train FastText on PSC, first run the training job for the section level:
$ sbatch model/FastText/summary_all_section/train_fasttext.job
Then save the word embeddings by running
$ sbatch model/FastText/summary_all_section/bin_to_vec.job
Then do the same for the class, subclass, group, and subgroup levels. To change the hyperparameters, edit the train.py file in the corresponding folder.
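For orientation, the sketch below shows roughly what each level's train.py does using the fasttext Python bindings; the file paths and hyperparameter values are illustrative assumptions, not the exact settings used in the jobs above:

import fasttext

# Train the class level with the section-level embeddings as pretrained vectors.
model = fasttext.train_supervised(
    input="train_class.txt",          # FastText-format training file for this level
    dim=100,                          # embedding dimension (must match the .vec file)
    wordNgrams=2,                     # 2-grams keep some word-order information
    epoch=25,
    lr=0.5,
    pretrainedVectors="section.vec",  # embeddings exported by bin_to_vec at the previous level
)
model.save_model("class_model.bin")

# Top-5 predictions for a tokenized brief summary.
labels, probs = model.predict("a method for tilling soil", k=5)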
We use Tencent's NeuralClassifier library to train the classic CNN/RNN/RCNN text classification models. These models account for the hierarchical structure through an additional loss calculated on the label tree, which forces labels that are closer in the tree to have closer losses. Note that the library supports many models, but we only tried the classic CNN/RNN/RCNN models. We edited some code to allow the use of an existing vocabulary.
A detailed README on how to train the model using NeuralClassifier is saved here: README.md. All models are saved in the "/pylon5/sez3a3p/yyn1228/Dorothy-Ymir/model/NeuralClassifier/output/xxx/checkpoint_dir_cpc" folders on PSC.
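If you are not using the jobs referenced in that README, note that upstream NeuralClassifier is driven by a JSON config file and is typically launched as follows (the config path is an assumption, not the exact file used here):
$ python train.py conf/train.json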
Because there are many hyperparameters to tune, we include a summary of all the models we trained with their corresponding hyperparameters:
We also use the HFT-CNN library to train another model. The idea is to train a CNN model at each level, which passes its word embeddings and early-layer parameters to the CNN model at the next level. We added some code to support multi-GPU training. Follow the README.md here to train the model. The subclass-level model is saved in the "/pylon5/sez3a3p/yyn1228/Dorothy-Ymir/model/HFT-CNN/CNN/CNN/PARAMS/" folder on PSC.
The detailed evaluation is saved in notebooks/prob_evaluate.ipynb, which also includes methods to ensemble different models. A summary of the model results is shown below. The best recall at n ≈ 5 is 91.6%.
To see how the model works on other text fields, we also evaluate the model using the title, abstract, and claims, even though the model is trained on the brief summary. Note that we have not evaluated the model on the description because the storage required would be too large, but it is worth trying to evaluate on the first 1,000 tokens of the description. Also note that these evaluations only use the FastText model.
Text Field | Precision @ 1 | Recall @ 1 | Precision @ 5 | Recall @ 5 |
---|---|---|---|---|
Title | 0.107 | 0.568 | 0.098 | 0.603 |
Abstract | 0.675 | 0.401 | 0.190 | 0.709 |
Claims | 0.699 | 0.379 | 0.247 | 0.710 |
Title + Abstract + Claims | 0.749 | 0.403 | 0.251 | 0.755 |
Brief Summary | 0.851 | 0.453 | 0.216 | 0.856 |
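Precision and recall at k are computed per document in the usual multi-label way and then averaged; the sketch below shows the per-document definition (a simplified version, not necessarily the exact code in prob_evaluate.ipynb):

def precision_recall_at_k(ranked_preds, true_labels, k):
    # ranked_preds: predicted labels sorted by descending confidence
    # true_labels: set of gold CPC labels for this patent
    hits = sum(1 for label in ranked_preds[:k] if label in true_labels)
    return hits / k, hits / len(true_labels)

# Example: if 2 of the top-5 predictions are among 3 true labels,
# precision@5 = 0.4 and recall@5 = 0.667 for that patent.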
We also train FastText models at all 5 levels. Note that for the group and subgroup levels it is only feasible to use the FastText model because there are too many labels: at the subclass level there are 666 labels and it already takes hours to train a non-FastText model, while at the subgroup level there are about 200,000 labels, so the same models would take weeks to train. For group and subgroup we use the "hierarchical softmax loss" in the FastText model, a trick developed by Facebook that significantly shortens training time but lowers performance slightly.
Level | Precision @ 1 | Recall @ 1 | Precision @ 5 | Recall @ 5 |
---|---|---|---|---|
Section | 0.921 | 0.623 | 0.271 | 0.992 |
Class | 0.886 | 0.535 | 0.257 | 0.929 |
Subclass | 0.851 | 0.453 | 0.216 | 0.856 |
For the group:
Level | Recall @ 1 | Recall @ 10 | Recall @ 100 |
---|---|---|---|
Group | 0.220 | 0.661 | 0.912 |
For subgroup:
Level | Recall @ 1 | Recall @ 10 | Recall @ 100 | Recall @ 1000 |
---|---|---|---|---|
Subgroup | 0.054 | 0.208 | 0.468 | 0.750 |
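Switching to the hierarchical softmax loss mentioned above is a one-argument change when training with the fasttext bindings; in the sketch below the file name and other hyperparameters are assumptions for illustration:

import fasttext

model = fasttext.train_supervised(
    input="train_subgroup.txt",
    loss="hs",        # hierarchical softmax: much faster with ~200,000 labels
    dim=100,
    wordNgrams=2,
)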
The notebook at notebooks/visualization.ipynb includes visualizations of the patent embeddings and word embeddings. The patent embedding figure is the one used in the presentation. The word embedding figure does not show very clear clusters because the word lists and categories we chose are too general. The notebook includes all the code to generate the word embeddings, and the word lists can be changed easily. If more representative word lists and categories are found, update the word lists and rerun the word-embedding part of the notebook to regenerate the visualization.
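As a reference point, a 2-D projection of a handful of word vectors can be reproduced along the following lines; the use of t-SNE, the model path, and the word list here are assumptions for illustration, not the notebook's exact code:

import fasttext
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

model = fasttext.load_model("subclass_model.bin")     # hypothetical model path
words = ["engine", "piston", "antibody", "protein", "circuit", "transistor"]
vectors = np.array([model.get_word_vector(w) for w in words])

# Project the word vectors to 2-D and label each point with its word.
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.show()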
To get an intuitive feel for the results, we built a web app that predicts the corresponding CPC codes for any input text in real time and generates a tree plot. Note that the web app is on the visualization branch, while the model implementation is on the master branch.
The backend of our web app is Django, the frontend is built with React, and the project is deployed on AWS. The user can type in any text describing a technology, and the predicted CPC codes are rendered as a tree within seconds.
In the backend, we load the model in models.py the first time a prediction is requested, so subsequent predictions do not need to reload the large model file. The predictions also need to be structured and parsed into the format used by the frontend; this work is done by treebuilders.py and views.py.
The backend also ranks the predictions by the confidence scores produced by our model so that the frontend can render them. The data returned to the frontend is:
res = {'tree': tree, 'ordered_labels': ordered_labels}
where tree contains the parsed predictions in a tree structure and ordered_labels contains the predicted labels ranked by confidence score.
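A minimal sketch of the backend flow is shown below; predict_labels and build_tree are hypothetical stand-ins for the real logic in models.py and treebuilders.py:

from django.http import JsonResponse

def predict_labels(text):
    # Stand-in: the real code calls the FastText model loaded once in models.py.
    return ["A01B", "A01C"]

def build_tree(labels):
    # Stand-in: the real code in treebuilders.py builds the nested CPC hierarchy.
    return {"name": "root", "children": [{"name": label} for label in labels]}

def predict(request):
    # Django view: return the prediction tree and ranked labels as JSON.
    text = request.POST.get("text", "")
    ordered_labels = predict_labels(text)
    tree = build_tree(ordered_labels)
    res = {'tree': tree, 'ordered_labels': ordered_labels}
    return JsonResponse(res)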
In the frontend, we use these two pieces of data to render a tree chart, and we also provide a slider that changes the number of leaf nodes shown in the tree for analysis.
The implementation is well documented, so further integration should be straightforward.
- notebooks/CPC_Preliminary_Data_Analysis.ipynb: this notebook includes some preliminary data analysis of the CPC MCF data (e.g. average number of labels, duplicate issues, number of labels at each level, etc.)
- notebooks/CPC_Text_Data.ipynb: this notebook has some preliminary data analysis of the CPC text data (e.g. average number of tokens of each text field)
- notebooks/evaluate.ipynb: this notebook has some old evaluation methods (e.g. macro and micro F1, precision and recall at different percentiles, etc.)