What's In A Name? Evaluating Assembly-Part Semantic Knowledge in Language Models through User-Provided Names in CAD Files
The full paper is available on arXiv at: https://arxiv.org/abs/2304.14275
If you use any of the code or techniques from this repository, please cite the following:
Peter Meltzer, Joseph G. Lambourne and Daniele Grandi. ‘What's In A Name? Evaluating Assembly-Part Semantic Knowledge in Language Models through User-Provided Names in CAD Files’. In Journal of Computing and Information Science in Engineering (JCISE), 2023. https://arxiv.org/abs/2304.14275
Data is available from https://whats-in-a-name-dataset.s3.us-west-2.amazonaws.com/whats_in_a_name_dataset.zip.
This can be downloaded into place using:
$ ./download_data.sh
(requires curl and unzip).
Ensure the data and source code are in the same root directory, i.e.:
whats_in_a_name_code_data
├── README.md
├── data
│   └── abc
│       ├── abc_text_data_003.json
│       │   ...
│       └── validation_pairs.csv
├── requirements.txt
└── src
    ├── abcpartnames
    │   ├── __init__.py
    │   ├── data_processing
    │   │   ├── __init__.py
    │   │   │   ...
    │   │   └── generate_pairs.py
    │   │   ...
    │   └── transforms
    │       ├── __init__.py
    │       │   ...
    │       └── vectorizers.py
    └── setup.py
- python 3.9
We recommend using a virtual environment, then:
pip install -r requirements.txt
abc_text_data_003.json is the complete set of extracted data with default part names removed. It contains a dictionary with an entry for each OnShape document in the ABC dataset, deduplicated based on part names and feature names. Where names or descriptions are unavailable, empty lists and empty strings are used.
{
"OnShape_document_id": {
"body_names": List[str],
"feature_names": List[str],
"document_name": str,
"document_description": str
},
...
}
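For illustration, a minimal Python sketch of loading this file and inspecting an entry (the path assumes the directory layout shown above):

# Minimal sketch: load abc_text_data_003.json and inspect one entry.
import json

with open("data/abc/abc_text_data_003.json") as f:
    documents = json.load(f)

doc_id, doc = next(iter(documents.items()))
print(doc_id, doc["document_name"])
print("parts:", doc["body_names"])
print("features:", doc["feature_names"])

# e.g. keep only documents with at least two non-default part names
two_or_more = {k: v for k, v in documents.items() if len(v["body_names"]) >= 2}
print(f"{len(two_or_more)} of {len(documents)} documents have 2+ part names")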
validation_pairs.csv, test_pairs.csv are the generated positive and negative pairs for the "Two Parts" experiment. label_1...label_N are 1 for a positive pair (two parts from the same document) or 0 for a negative pair (parts not from the same document). part_a_1...part_a_N and part_b_1...part_b_N are the part name strings. All positive pairs come first (i.e. rows 1...k) and all negative pairs last (i.e. rows k+1...N). Each row has the form:
label_1,part_a_1,part_b_1
...
label_N,part_a_N,part_b_N
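As a minimal sketch (assuming the files have no header row and each row is label,part_a,part_b as above), the pairs can be read with the standard csv module:

import csv

pairs = []
with open("data/abc/validation_pairs.csv", newline="") as f:
    for label, part_a, part_b in csv.reader(f):
        pairs.append((int(label), part_a, part_b))

n_pos = sum(label for label, _, _ in pairs)
print(f"{len(pairs)} pairs: {n_pos} positive, {len(pairs) - n_pos} negative")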
train_corpus_complete.txt is the training corpus generated (from the train set) for fine-tuning DistilBERT or fully training FastText. Each line has the form:
“An assembly with name {DOCUMENT NAME} contains the following parts: {PART 1}, ..., {PART N}.\n”
...
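For illustration only (generate_corpus.py is the script that actually builds the corpus), a line of this form can be assembled from one document entry as follows; the document name and part names here are hypothetical:

def corpus_line(document_name, part_names):
    # fill the template shown above with one document's name and part names
    return (f"An assembly with name {document_name} contains the following parts: "
            f"{', '.join(part_names)}.\n")

print(corpus_line("Desk Lamp", ["base", "arm", "shade"]), end="")
# An assembly with name Desk Lamp contains the following parts: base, arm, shade.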
train_val_test_split.json is the master list of splits for this dataset. Each train/validation/test entry contains a list of OnShape document ids for the documents in that split.
{
"train": List[str],
"validation": List[str],
"test": List[str]
}
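A minimal sketch of combining the split file with the main JSON (paths assume the directory layout above):

import json

with open("data/abc/train_val_test_split.json") as f:
    splits = json.load(f)
with open("data/abc/abc_text_data_003.json") as f:
    documents = json.load(f)

# select the entries of the main dictionary belonging to the train split
train_docs = {doc_id: documents[doc_id] for doc_id in splits["train"] if doc_id in documents}
print(f"train: {len(splits['train'])} ids, {len(train_docs)} found in the main JSON")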
The following files are derived from the splits defined in train_val_test_split.json
and follow the same format:
- train_val_test_descriptions.json: documents having a non-empty description
- train_val_test_featurenames.json: documents with at least 1 non-default feature name
- train_val_test_partnames.json: documents with at least 1 non-default part name
- train_val_test_partnames_and_featurenames.json: documents with at least 1 non-default part and feature name
- train_val_test_two_or_more_partnames.json: documents with at least 2 non-default part names
Not all of these files were used in our experiments, but we share them for future use.
src/abcpartnames/:

data_processing/
- create_train_val_test_split.py: create train, validation and test splits
- generate_corpus.py: create fine-tuning corpus (also used for training FastText)
- generate_pairs.py: generate positive and negative pairs from document parts

datasets/
- ABCTextDataset.py: all Dataset classes, DataLoaders, etc.

models/
- mlp.py: MLP used in "Two Parts" experiment
- set_model.py: end-to-end module (SetModuleNamesWithParts) that combines the encoder with the SetTransformer
- set_model.py: implementation of the SetTransformer model used in "Missing Part" and "Document Name" experiments

scripts/
- train_vectorizers.py, evaluate_vectorizers.py: train/evaluate the BOW vectorizers on the "Two Parts" experiment
- evaluate_technet.py: evaluate TechNet on the "Two Parts" experiment
- train_word2vec.py, evaluate_fasttext_models.py: train/evaluate FastText on the "Two Parts" experiment
- evaluate_embeddings.py: evaluate DistilBERT(-FT) on the "Two Parts" experiment
- train_and_eval_set_model.py: evaluate any of TechNet, FastText, DistilBERT and DistilBERT-FT on the "Missing Part" or "Document Name" experiments
- run_mlm.py: (script from huggingface) fine-tune DistilBERT

transforms/
- hf.py: transforms for huggingface models (DistilBERT + DistilBERT-FT), includes pooling
- lower_and_replace_transform.py: transform to convert strings to lowercase and replace _ with spaces
- vectorizers.py: BOW Vectorizers, FastText encoder and TechNet encoder
For all scripts, run from the project root directory and use the -h flag to see possible arguments and default options.
$ python src/abcpartnames/scripts/run_mlm.py \
--model_name_or_path distilbert-base-uncased \
--line_by_line \
--train_file data/abc/train_corpus_complete.txt \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--do_train \
--output_dir language_models/distilbert_fine-tuned_003
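As a quick sanity check (assuming the run above saved the model and tokenizer to --output_dir, and that the transformers library is installed), the fine-tuned model can be reloaded before its path is passed as --model to the evaluation scripts:

from transformers import AutoModelForMaskedLM, AutoTokenizer

model_dir = "language_models/distilbert_fine-tuned_003"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMaskedLM.from_pretrained(model_dir)
print(model.config.model_type)  # expect "distilbert"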
For training the vectorizers (TF-IDF BOW and Frequency BOW) use $ python src/abcpartnames/scripts/train_vectorizers.py, and afterwards for evaluation use $ python src/abcpartnames/scripts/evaluate_vectorizers.py.
For TechNet use $ python src/abcpartnames/scripts/evaluate_technet.py
.
For FastText training use $ python src/abcpartnames/scripts/train_word2vec.py
, and for evaluation use $ python src/abcpartnames/scripts/evaluate_fasttext_models.py
.
For DistilBERT and DistilBERT-FT evaluation use $ python src/abcpartnames/scripts/evaluate_embeddings.py (passing the path to the root directory of the fine-tuned model as --model for DistilBERT-FT). For example,
# evaluate DistilBERT on "Two Parts" experiment
# (set --num_workers to 0 for debugging; --exp is used when saving lightning_logs)
$ python src/abcpartnames/scripts/evaluate_embeddings.py \
--batch_size 512 \
--num_workers 8 \
--accelerator gpu \
--gpus 1 \
--trial 0 \
--exp pairs_distilbert \
--model distilbert-base-uncased  # the pre-trained model (it will be downloaded if it is not already)
# evaluate DistilBERT-FT on "Two Parts" experiment
# (set --num_workers to 0 for debugging; --exp is used when saving lightning_logs)
$ python src/abcpartnames/scripts/evaluate_embeddings.py \
--batch_size 512 \
--num_workers 8 \
--accelerator gpu \
--gpus 1 \
--trial 0 \
--exp pairs_distilbert-ft \
--model ./language_models/distilbert_fine-tuned_003  # the fine-tuned model (assumes the model was already saved to this path as above)
All experiments can be run using the script src/abcpartnames/scripts/train_and_eval_set_model.py.
Examples:
# train and evaluate pre-trained DistilBERT on "Missing Part" experiment
$ python src/abcpartnames/scripts/train_and_eval_set_model.py \
--batch_size 512 \
--trial 0 \
--exp missing-part_distilbert \
--model distilbert-base-uncased \
--dim_hidden 768 \
--num_heads 8 \
--num_inds 64
# train and evaluate fine-tuned DistilBERT-FT on "Missing Part" experiment
$ python src/abcpartnames/scripts/train_and_eval_set_model.py \
--batch_size 512 \
--trial 0 \
--exp missing-part_distilbert-ft \
--model ./language_models/distilbert_fine-tuned_003 \
--dim_hidden 768 \
--num_heads 8 \
--num_inds 64
# train and evaluate pre-trained DistilBERT on "Document Name" experiment
$ python src/abcpartnames/scripts/train_and_eval_set_model.py \
--batch_size 512 \
--trial 0 \
--exp document-name_distilbert \
--model distilbert-base-uncased \
--dim_hidden 512 \
--num_heads 8 \
--num_inds 64 \
--pred_names True
# train and evaluate fine-tuned DistilBERT-FT on "Document Name" experiment
$ python src/abcpartnames/scripts/train_and_eval_set_model.py \
--batch_size 512 \
--trial 0 \
--exp document-name_distilbert-ft \
--model ./language_models/distilbert_fine-tuned_003 \
--dim_hidden 512 \
--num_heads 8 \
--num_inds 64 \
--pred_names True
For all other models and options see below:
$ python src/abcpartnames/scripts/train_and_eval_set_model.py -h
usage: train_and_eval_set_model.py [-h] [--trial TRIAL] [--model MODEL] [--exp EXP] [--results RESULTS] [--workers WORKERS] [--stop_on STOP_ON] [--num_inds NUM_INDS] [--dim_input DIM_INPUT]
[--dim_hidden DIM_HIDDEN] [--dim_out DIM_OUT] [--num_heads NUM_HEADS] [--ln] [--batch_size BATCH_SIZE] [--bin_width BIN_WIDTH] [--names NAMES]
[--gensim_path GENSIM_PATH] [--pred_names PRED_NAMES] [--no_lower] [--replace_]
optional arguments:
-h, --help show this help message and exit
--trial TRIAL trial (used for random seed) (default: 0)
--model MODEL pre-trained language model (default: distilbert-base-uncased) # or 'technet', 'fasttext', or path to fine-tuned distilbert
--exp EXP experiment_name (default: test)
--results RESULTS file to save results (default: set_results.csv)
--workers WORKERS num workers for dataloaders (default: 8)
--stop_on STOP_ON 'loss' or 'acc' (default: loss)
SetModule:
--num_inds NUM_INDS number inducing points (default: 32)
--dim_input DIM_INPUT
input dimensions to set transformer (default: 768)
--dim_hidden DIM_HIDDEN
set transformer hidden units size (default: 512)
--dim_out DIM_OUT set transformer output dimensions (default: 768)
--num_heads NUM_HEADS
number of attention heads (default: 4)
--ln use layer norm (default: False)
--batch_size BATCH_SIZE
batch size (default: 512)
--bin_width BIN_WIDTH
width of bins to use in batched masking (default: 128)
--names NAMES include assembly names with input parts (default: False)
ToVectorized:
--gensim_path GENSIM_PATH
path to saved FastText model (default: None)
ABCNamesWithParts:
--pred_names PRED_NAMES
predict assembly names using all parts as input (default: False)
--no_lower do not convert strings to lower case (default: False)
--replace_ Replace _ characters with spaces (default: False)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.