Domain-specific word embeddings for patent classification

Risch, Julian; Krestel, Ralf

doi:10.1108/DTA-01-2019-0002

Purpose Patent offices and other stakeholders in the patent domain need to classify patent applications according to a standardized classification scheme. The purpose of this paper is to examine the novelty of an application it can then be compared to previously granted patents in the same class. Automatic classification would be highly beneficial, because of the large volume of patents and the domain-specific knowledge needed to accomplish this costly manual task. However, a challenge for the automation is patent-specific language use, such as special vocabulary and phrases. Design/methodology/approach To account for this language use, the authors present domain-specific pre-trained word embeddings for the patent domain. The authors train the model on a very large data set of more than 5m patents and evaluate it at the task of patent classification. To this end, the authors propose a deep learning approach based on gated recurrent units for automatic patent classification built on the trained wordPurpose Patent offices and other stakeholders in the patent domain need to classify patent applications according to a standardized classification scheme. The purpose of this paper is to examine the novelty of an application it can then be compared to previously granted patents in the same class. Automatic classification would be highly beneficial, because of the large volume of patents and the domain-specific knowledge needed to accomplish this costly manual task. However, a challenge for the automation is patent-specific language use, such as special vocabulary and phrases. Design/methodology/approach To account for this language use, the authors present domain-specific pre-trained word embeddings for the patent domain. The authors train the model on a very large data set of more than 5m patents and evaluate it at the task of patent classification. To this end, the authors propose a deep learning approach based on gated recurrent units for automatic patent classification built on the trained word embeddings. Findings Experiments on a standardized evaluation data set show that the approach increases average precision for patent classification by 17 percent compared to state-of-the-art approaches. In this paper, the authors further investigate the model’s strengths and weaknesses. An extensive error analysis reveals that the learned embeddings indeed mirror patent-specific language use. The imbalanced training data and underrepresented classes are the most difficult remaining challenge. Originality/value The proposed approach fulfills the need for domain-specific word embeddings for downstream tasks in the patent domain, such as patent classification or patent analysis.… show more

Author details:	Julian Risch GND, Ralf Krestel ORCiD GND
DOI:	https://doi.org/10.1108/DTA-01-2019-0002
ISSN:	2514-9288
ISSN:	2514-9318
Title of parent work (English):	Data Technologies and Applications
Publisher:	Emerald Group Publishing Limited
Place of publishing:	Bingley
Publication type:	Article
Language:	English
Date of first publication:	2019/02/04
Publication year:	2019
Release date:	2021/04/09
Tag:	Deep learning; Document classification; Patents; Word embedding
Volume:	53
Issue:	1
Number of pages:	15
First page:	108
Last Page:	122
Organizational units:	Digital Engineering Fakultät / Hasso-Plattner-Institut für Digital Engineering GmbH
DDC classification:	0 Informatik, Informationswissenschaft, allgemeine Werke / 00 Informatik, Wissen, Systeme / 000 Informatik, Informationswissenschaft, allgemeine Werke
Peer review:	Referiert

Domain-specific word embeddings for patent classification

Export metadata

Additional Services