
Domain-specific word embeddings for patent classification

Purpose: Patent offices and other stakeholders in the patent domain need to classify patent applications according to a standardized classification scheme. To examine the novelty of an application, it can then be compared to previously granted patents in the same class. Automatic classification would be highly beneficial because of the large volume of patents and the domain-specific knowledge needed to accomplish this costly manual task. However, a challenge for the automation is patent-specific language use, such as special vocabulary and phrases.

Design/methodology/approach: To account for this language use, the authors present domain-specific pre-trained word embeddings for the patent domain. They train the model on a very large data set of more than 5 million patents and evaluate it on the task of patent classification. To this end, the authors propose a deep learning approach based on gated recurrent units for automatic patent classification built on the trained word embeddings.

Findings: Experiments on a standardized evaluation data set show that the approach increases average precision for patent classification by 17 percent compared to state-of-the-art approaches. The authors further investigate the model's strengths and weaknesses. An extensive error analysis reveals that the learned embeddings indeed mirror patent-specific language use. The imbalanced training data and underrepresented classes are the most difficult remaining challenge.

Originality/value: The proposed approach fulfills the need for domain-specific word embeddings for downstream tasks in the patent domain, such as patent classification or patent analysis.
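The abstract describes a deep learning classifier built on gated recurrent units over domain-specific pre-trained word embeddings. The following is a minimal sketch of that general architecture in PyTorch; the layer sizes, the bidirectional GRU, and all names and hyperparameters are assumptions for illustration, not the authors' implementation from the paper.

```python
# Minimal sketch: a GRU-based text classifier whose embedding layer is
# initialized from pre-trained, domain-specific word vectors.
# All hyperparameters and shapes below are illustrative assumptions.
import torch
import torch.nn as nn

class PatentGRUClassifier(nn.Module):
    def __init__(self, pretrained_embeddings: torch.Tensor, num_classes: int,
                 hidden_size: int = 256):
        super().__init__()
        # Embedding layer initialized from domain-specific vectors
        # (shape: vocab_size x embedding_dim), fine-tuned during training.
        self.embedding = nn.Embedding.from_pretrained(
            pretrained_embeddings, freeze=False, padding_idx=0)
        # Bidirectional GRU encodes the token sequence of a patent text.
        self.gru = nn.GRU(pretrained_embeddings.size(1), hidden_size,
                          batch_first=True, bidirectional=True)
        # Linear layer maps the final GRU states to class logits
        # (e.g., classes of the patent classification scheme).
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)      # (batch, seq_len, emb_dim)
        _, hidden = self.gru(embedded)            # hidden: (2, batch, hidden_size)
        features = torch.cat([hidden[0], hidden[1]], dim=-1)
        return self.classifier(features)          # (batch, num_classes)

# Usage with random stand-in data (real input would be token ids of patent text):
if __name__ == "__main__":
    vocab_size, emb_dim, num_classes = 10_000, 100, 8
    embeddings = torch.randn(vocab_size, emb_dim)  # placeholder for trained vectors
    model = PatentGRUClassifier(embeddings, num_classes)
    batch = torch.randint(1, vocab_size, (4, 50))  # 4 documents, 50 tokens each
    print(model(batch).shape)                      # torch.Size([4, 8])
```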

Metadata
Authors: Julian Risch, Ralf Krestel
DOI:https://doi.org/10.1108/DTA-01-2019-0002
ISSN:2514-9288
ISSN:2514-9318
Title of parent work (English): Data Technologies and Applications
Publisher: Emerald Group Publishing Limited
Place of publication: Bingley
Publication type: Scientific article
Language: English
Date of first publication: 04.02.2019
Year of publication: 2019
Release date: 09.04.2021
Free keyword / tag: Deep learning; Document classification; Patents; Word embedding
Volume: 53
Issue: 1
Number of pages: 15
First page: 108
Last page: 122
Organizational units: Digital Engineering Fakultät / Hasso-Plattner-Institut für Digital Engineering GmbH
DDC classification: 0 Computer science, information and general works / 00 Computer science, knowledge, systems / 000 Computer science, information and general works
Peer review: Refereed