Methods and Algorithms for Extraction, Linking, Vectorisation, and Disambiguation of Lexical-Semantic Graphs
Методы и алгоритмы для извлечения, связывания, векторизации и разрешения неоднозначности лексико-семантических графов
Abstract
Traditional lexical resources, such as Princeton's WordNet, contain precise, manually encoded information about lexical items (such as words and phrases) and the relations between them (such as synonymy and hypernymy). However, the coverage (i.e. recall) and currency of these resources are often inherently limited, because creating a resource and keeping it up to date is an expensive, lengthy, and usually purely manual process. Moreover, some domain-specific terms may simply be out of scope even for the largest collaboratively created lexicographical resources, such as Wiktionary.
The purpose of the dissertation is to develop methods for computational lexical semantics that bridge the gap between two kinds of lexical representation: manually created lexical resources, which are precise and interpretable but have low lexical coverage, and distributional representations induced automatically from text, which have high lexical coverage but are noisy and hard to interpret. This includes the development of (i) new algorithms for processing large linguistic networks constructed from both manually created lexical resources and graphs induced from text; (ii) methods for inducing lexical-semantic structures of various kinds from text, most notably word senses; (iii) techniques for making the induced structures as interpretable as those in manually constructed resources; (iv) methods for effective disambiguation in context with respect to the induced sense representations; and (v) effective vectorisation of lexical-semantic graphs for use in various applications.
The contributions presented in this dissertation cover a wide range of tasks in computational lexical semantics: together they form a methodological framework for learning, populating, linking, disambiguating, and vectorising word senses and the relations between them. Many of the developed methods are graph-based; more specifically, graph clustering algorithms, including newly proposed ones, are used to process linguistic networks of various kinds. A graph representation is natural here, as each lexical-semantic resource can be represented as a graph whose nodes are word senses or terms and whose edges are the semantic relations between them. Because modern NLP methods rely heavily on neural networks, working with such linguistic graphs requires vectorising them. To this end, methods for node embedding of linguistic graphs, such as WordNets and knowledge graphs (KGs), were developed to solve various tasks, such as completion of linguistic resources and word sense disambiguation.
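The graph view described above can be illustrated with a minimal sketch. It assumes a toy list of synonymy edges (invented here for illustration, not taken from any real resource) and groups terms by connected components. Connected components are only a simple stand-in for graph clustering and are not the algorithms proposed in the dissertation; the names `SYNONYM_EDGES`, `build_graph`, and `connected_components` are hypothetical.

```python
from collections import defaultdict

# Toy synonymy edges (illustrative data only, not from a real lexical resource).
SYNONYM_EDGES = [
    ("car", "automobile"),
    ("automobile", "motorcar"),
    ("bank", "riverbank"),
    ("couch", "sofa"),
]


def build_graph(edges):
    """Build an undirected adjacency map: term -> set of related terms."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def connected_components(graph):
    """Group nodes into connected components using an iterative depth-first search.

    This is a deliberately simple stand-in for a graph clustering algorithm.
    """
    seen = set()
    components = []
    for start in sorted(graph):  # sorted for a deterministic component order
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


graph = build_graph(SYNONYM_EDGES)
clusters = connected_components(graph)
for cluster in clusters:
    print(cluster)
```

Each printed cluster plays the role of a synonym set: terms linked, directly or transitively, by the synonymy relation. Real linguistic networks are noisy and contain ambiguous words that bridge unrelated senses, which is why dedicated clustering algorithms are needed rather than plain connected components.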
The findings span the period from 2016 to 2023 and were published in 42 conference papers and journal articles, including top international venues ranked CORE A* and Q1.
About the author
Alexander Panchenko has been an Associate Professor at Skoltech since April 2019. He has more than 15 years of experience in research and development in Natural Language Processing (NLP), with a special focus on graph-based methods for NLP. He was a postdoc in Germany at the University of Hamburg and TU Darmstadt, and holds a Ph.D. from the Université catholique de Louvain (Belgium). He has co-authored more than 140 peer-reviewed research publications, including papers at top conferences ranked CORE A*, such as ACL and EMNLP, and in Q1 journals, such as MIT's Computational Linguistics. Alexander has been involved in organising international shared tasks, conferences, and workshops, including CLEF and ACL-based events such as TextGraphs.
Links