Text Embeddings Optimized for Distance Computations (MSc Thesis)

The Word Mover’s Distance (WMD) [1] is a recently proposed distance metric for text that works by computing the minimum cumulative distance that the embedded individual words of one document need to “travel” to reach the embedded words of another document. The WMD achieves state-of-the-art performance on Natural Language Processing tasks such as document nearest neighbour search. However, the WMD is very costly to compute (exact solvers scale as O(p³ log p) in the number p of unique words per document), making its use prohibitive for larger datasets. Learned individual embeddings (vectors) of larger textual units (e.g. sentences, paragraphs, documents) are an attractive, more efficient alternative to the WMD: approximate methods already exist that can quickly find nearest neighbours by computing the Euclidean distance between embedding vectors.
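
To make the WMD computation concrete, the following is a minimal, illustrative sketch (not project code) that treats two toy documents as normalized bag-of-words histograms and solves the resulting optimal-transport problem with the POT library; the random vectors stand in for pretrained word embeddings, and all names and data are placeholders.

import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

rng = np.random.default_rng(0)

# Stand-in word embeddings; in practice these would be pretrained
# vectors (e.g. word2vec) looked up per word.
vocab = ["obama", "speaks", "media", "president", "greets", "press"]
emb = {w: rng.normal(size=300) for w in vocab}

doc1 = ["obama", "speaks", "media"]
doc2 = ["president", "greets", "press"]

def nbow(doc, words):
    # Normalized bag-of-words weights over the document's unique words.
    counts = np.array([doc.count(w) for w in words], dtype=float)
    return counts / counts.sum()

w1, w2 = sorted(set(doc1)), sorted(set(doc2))
a, b = nbow(doc1, w1), nbow(doc2, w2)

# Cost matrix: Euclidean distances between the embedded words.
M = np.array([[np.linalg.norm(emb[u] - emb[v]) for v in w2] for u in w1])

# ot.emd2 solves the transport problem and returns the minimal total
# travel cost, i.e. the WMD between the two documents.
wmd = ot.emd2(a, b, M)
print(f"WMD(doc1, doc2) = {wmd:.4f}")

(For real experiments, gensim’s KeyedVectors.wmdistance offers a ready-made WMD implementation on top of pretrained embeddings.)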

In this thesis, you will develop new embedding methods for text that are trained to approximate sophisticated distance measures such as the WMD. This approach was recently demonstrated to be viable for images [2]; we will explore its use in the text domain. Concretely, we will develop models that produce embeddings whose Euclidean distance approximates the WMD, aiming to encapsulate the properties of the WMD within an individual embedding while preserving the high accuracy of the WMD metric. We will evaluate our methods on common benchmarks for text nearest neighbour search, seeking a good trade-off between speed and accuracy.
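
As a rough illustration of the training objective (following the idea of [2], adapted to text), the sketch below fits a small encoder so that Euclidean distances between document embeddings regress onto precomputed WMD values. The bag-of-words input representation, the MLP encoder and the random targets are all placeholder assumptions, not a specification of the thesis work; real targets would come from an exact WMD solver as in the sketch above.

import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM = 1000, 64

# Placeholder encoder: maps a bag-of-words vector to a dense embedding.
encoder = nn.Sequential(
    nn.Linear(VOCAB_SIZE, 256),
    nn.ReLU(),
    nn.Linear(256, EMB_DIM),
)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# Placeholder batch: document pairs plus their (here random) WMD targets.
x1 = torch.rand(32, VOCAB_SIZE)
x2 = torch.rand(32, VOCAB_SIZE)
target_wmd = torch.rand(32)

for step in range(100):
    z1, z2 = encoder(x1), encoder(x2)
    # Train the Euclidean distance in embedding space to match the WMD.
    pred = torch.norm(z1 - z2, dim=1)
    loss = nn.functional.mse_loss(pred, target_wmd)
    opt.zero_grad()
    loss.backward()
    opt.step()

Once such an encoder is trained, nearest neighbour search reduces to plain Euclidean distance between embedding vectors, so off-the-shelf (approximate) nearest neighbour indexes apply directly.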

Start date is flexible.

References:
[1] Kusner, Matt J., Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. "From Word Embeddings to Document Distances." International Conference on Machine Learning. 2015.
[2] Courty, Nicolas, Rémi Flamary, and Mélanie Ducoffe. "Learning Wasserstein Embeddings." arXiv preprint arXiv:1710.07457 (2017).

Requirements

Good Python skills and knowledge of natural language processing.

Contact

Nikola Nikolov (niniko@ini.ethz.ch)
