kaggle-USPPPM

U.S. Patent Phrase to Phrase Matching

Help Identify Similar Phrases in U.S. Patents

Describe

In this competition, you will train your models on a novel semantic similarity dataset to extract relevant information by matching key phrases in patent documents. Determining the semantic similarity between phrases is critically important during the patent search and examination process to determine if an invention has been described before. For example, if one invention claims "television set" and a prior publication describes "TV set", a model would ideally recognize these are the same and assist a patent attorney or examiner in retrieving relevant documents. This extends beyond paraphrase identification; if one invention claims a "strong material" and another uses "steel", that may also be a match. What counts as a "strong material" varies per domain (it may be steel in one domain and ripstop fabric in another, but you wouldn't want your parachute made of steel). We have included the Cooperative Patent Classification as the technical domain context as an additional feature to help you disambiguate these situations.

Can you build a model to match phrases in order to extract contextual information, thereby helping the patent community connect the dots between millions of patent documents?

Dataset

In this dataset, you are presented pairs of phrases (an anchor and a target phrase) and asked to rate how similar they are on a scale from 0 (not at all similar) to 1 (identical in meaning).
Download kaggle dataset
Download additional dataset

Evaluation

Submissions are evaluated on the Pearson correlation coefficient between the predicted and actual similarity scores.

Result

Model	Validation strategy	Loss	Tricks	CV-score	Public-score	Private-score
BERT-base	StratifiedKFold	MSELoss	-	0.792	0.743	0.754
BERT-base	StratifiedGroupKFold	MSELoss	-	0.734	0.746	0.757
BERT-base	StratifiedGroupKFold	CorrLoss	-	0.728	0.740	0.753
DeBERTa-v3-large	StratifiedGroupKFold	MSELoss	-	0.822	0.836	0.852
DeBERTa-v3-large	StratifiedGroupKFold	MSELoss	LLRD	0.826	0.837	0.852

MSELoss = Mean Squared Error Loss
CorrLoss = Correlational Coefficience Loss
LLRD = Layer-wise Learning Rate Decay

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
README.md		README.md
uspppm-infernece.ipynb		uspppm-infernece.ipynb
uspppm-train.ipynb		uspppm-train.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

kaggle-USPPPM

U.S. Patent Phrase to Phrase Matching

Describe

Dataset

Evaluation

Result

About

Releases

Packages

Languages

earthlovebpt/kaggle-USPPPM

Folders and files

Latest commit

History

Repository files navigation

kaggle-USPPPM

U.S. Patent Phrase to Phrase Matching

Describe

Dataset

Evaluation

Result

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages