Word2vec

What is Word2vec?

Word2vec is a neural networkwhich is used for text analysis by means of Word embedding (in English Word Embedding) is used. To do this, Word2vec converts the words of a text into numerical vectors and can use these numbers to mathematically calculate and recognise connections and the context of the words to each other. Through supervised learning (Supervised Learning) increases Word2vec's ability to recognise contexts and out admitted, such as:

The "sun" is for the "day" what the "moon" is.nd" is for the "night".

or

Berlin" is to "Germany" what "Tokyo" is to "Japan".an" is

In addition, affiliations such as "spoon", "fork" and "knife" are recognised and these words are grouped together.

Word2vec was introduced in 2013 by a team of researchers from Google led by Tomas Mikolov. The research paper entitled "Efficient Estimation of Word Representations in Vector Space". describes two possible methods for learning the context:

Continuous Bag-of-Words Model (CBOW)

The target word is predicted based on the adjacent context words. The context consists of some words before and after the searched (middle) word. It is Bag-of-words model because the order of the words is not relevant in the context. The CBOW model is particularly good at this, syntactic relations between two words to capture.

Continuous Skip Gram Model

In this model, several context words are issued based on one input word. It basically works the other way around like the CBOW model. The Continuous Skip Gram model is better at it, semantic relations between two words than the CBOW model.

An example using the word "fish

The CBOW model will output the plural "fish" as the next vector. The Skip-Gram model, on the other hand, will find an independent, but semantically relevant, word like "fishing rod".

This difference makes the Skip-gram model the more popular of the two as it has greater utility for most applications.

What are the applications of Word2vec?

Word2vec, like other word embeddings, can be used for many online applications. It forms the Basis for search engine suggestions and recommendations in online shops. Through context analysis, optimised suggestions can be made for the user in order to output the best possible results. This makes Word2vec essential for areas such as e-commerce and customer relationship management. But also for Creating content or for scientific research it is very helpful.

Word2vec in Python

To use Word2vec in Python the two Modules gensim and nltk required.

Gensim is a Open Source Library in Python and is used for semantic text analysis and document comparison. The abbreviation nltk stands for "Natural Language Toolkit" and includes computational linguistics libraries and programmes for Python.

It is possible to use the Continuous Bag-of-Words model or the Continuous Skip-Gram model to train the context. Many already trained models can also be found online.

Data Navigator Newsletter