Bag-of-words model

What is a Bag-of-Words model?

A bag-of-words model is a simplifying representation used in natural language processing and information retrieval. In this model, a text is represented as a bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity, that is, how many times each word occurs.
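As a minimal sketch of this idea, the snippet below builds a bag with Python's `collections.Counter`. The tokenizer (lowercasing plus a simple regular expression) and the function name `bag_of_words` are assumptions made for illustration, not part of any particular library.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a multiset of lowercase word tokens; word order is discarded."""
    # Hypothetical tokenizer: runs of letters and apostrophes count as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Two texts with the same words in a different order
bag_a = bag_of_words("John likes movies. Mary likes movies too.")
bag_b = bag_of_words("Mary likes movies too. John likes movies.")

print(bag_a)           # word counts (multiplicity is kept)
print(bag_a == bag_b)  # the bags are equal because order is ignored
```

Both texts produce the same bag, which shows what the representation keeps (counts) and what it drops (order and sentence structure).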

One application of the model is email filtering. For each word, the number of times it occurs is stored. The words with the highest counts are not necessarily the most important ones, because function words such as "the", "a" and "one" occur very frequently without having much relevance. For the purpose of classification, supervised variants have been developed that yield a class label for a document.
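The effect of frequent but uninformative words can be illustrated with a short sketch. The stop-word list `STOP_WORDS` and the helper `top_words` below are hypothetical and deliberately tiny; real systems typically use a larger list or a weighting scheme.

```python
from collections import Counter

# Illustrative stop-word list (an assumption, not a standard list)
STOP_WORDS = frozenset({"the", "a", "an", "one", "of", "and", "to", "is", "in"})

def top_words(text, n=3, stop_words=frozenset()):
    """Return the n most frequent tokens, optionally skipping stop words."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return Counter(tokens).most_common(n)

text = "the offer is the best offer of the year and the prize is a free prize"

print(top_words(text, n=1))                         # raw counts: a function word wins
print(top_words(text, n=2, stop_words=STOP_WORDS))  # filtered: content words remain
```

Without filtering, the most frequent token is "the"; once the stop words are removed, the content words "offer" and "prize" come out on top.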

There is also a bigram model, in which the text is parsed into units of two consecutive words rather than single words. Hashing can also be used to save memory. A further example is the Bayesian spam filter, where an email message is modeled as an unordered collection of words drawn from one of two probability distributions. One represents spam and the other represents legitimate emails, so-called "ham". Thus, there are two bags of words. One bag is filled with words found in spam messages and the other with words found in legitimate emails.
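The two-bag idea can be sketched as a small naive Bayes classifier. This is an illustrative sketch under stated assumptions, not a production filter: the training messages are invented, the function names (`train`, `log_likelihood`, `classify`) are hypothetical, and add-one (Laplace) smoothing is one common choice for handling unseen words.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(messages):
    """Fill one bag with the words of all messages of one class."""
    bag = Counter()
    for message in messages:
        bag.update(tokenize(message))
    return bag

def log_likelihood(tokens, bag, vocab_size):
    """Sum of log word probabilities with add-one smoothing (an assumption)."""
    total = sum(bag.values())
    return sum(math.log((bag[t] + 1) / (total + vocab_size)) for t in tokens)

def classify(text, spam_bag, ham_bag, prior_spam=0.5):
    """Return 'spam' or 'ham', whichever bag explains the words better."""
    vocab_size = len(set(spam_bag) | set(ham_bag))
    tokens = tokenize(text)
    spam_score = math.log(prior_spam) + log_likelihood(tokens, spam_bag, vocab_size)
    ham_score = math.log(1 - prior_spam) + log_likelihood(tokens, ham_bag, vocab_size)
    return "spam" if spam_score > ham_score else "ham"

# Invented toy training data, one bag per class
spam_bag = train(["win a free prize now", "free money win now"])
ham_bag = train(["meeting agenda for monday", "please review the project report"])

print(classify("free prize now", spam_bag, ham_bag))
print(classify("project meeting monday", spam_bag, ham_bag))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied. The word order of the message plays no role in the score, which is exactly the bag-of-words assumption.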

What is Bag-of-Words?

Bag-of-words is a way of extracting features from a text so that the text can be modeled with machine learning algorithms. The approach is very simple and flexible, and it can be used in many ways to extract features from a document.

A bag-of-words is a representation of text that describes the frequency of words within a document. It involves two things: a vocabulary of known words and a measure of the presence of those known words. The model is called a bag because any information about the order or structure of the words is discarded. It only considers whether, and how often, a known word occurs, not where it occurs in the document.
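The two ingredients, a vocabulary and a measure of word presence, can be sketched as follows. The helper names `build_vocabulary` and `vectorize` are hypothetical, whitespace tokenization is a simplifying assumption, and raw counts are used as the measure (a binary present/absent flag would be another option).

```python
def build_vocabulary(documents):
    """Map every known word to a fixed position (sorted for reproducibility)."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def vectorize(document, vocabulary):
    """Count how often each known word occurs; unknown words are ignored."""
    vector = [0] * len(vocabulary)
    for word in document.lower().split():
        index = vocabulary.get(word)
        if index is not None:
            vector[index] += 1
    return vector

docs = ["it was the best of times", "it was the worst of times"]
vocab = build_vocabulary(docs)

print(vocab)
print(vectorize(docs[0], vocab))
print(vectorize("the best of the best", vocab))
```

Each position in the vector corresponds to one vocabulary word, so every document maps to a vector of the same length regardless of how long the document is.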

How is text converted to vectors?

Bag-of-words models are commonly used for language modelling and document classification. Machine learning algorithms cannot work directly with plain text, so the text must first be converted into numbers. By counting word occurrences, optionally combined with hashing, sentences can be converted into vectors. Bag-of-words is one of the best-known methods for constructing feature spaces, and the feature vectors are generated in the course of this procedure.
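A minimal sketch of the hashing variant is shown below. Instead of storing a vocabulary, each word is hashed to one of a fixed number of buckets. The function name `hashed_vector`, the bucket count of 16 and the use of MD5 as a stable hash are assumptions made for illustration; Python's built-in `hash()` is avoided because it is randomized between runs for strings.

```python
import hashlib

def hashed_vector(text, n_buckets=16):
    """Convert text to a fixed-length count vector without a vocabulary."""
    vector = [0] * n_buckets
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Use the first 4 bytes of the digest as an integer bucket index
        index = int.from_bytes(digest[:4], "big") % n_buckets
        vector[index] += 1
    return vector

v1 = hashed_vector("the cat sat on the mat")
v2 = hashed_vector("on the mat the cat sat")

print(v1)
print(v1 == v2)  # same words in a different order give the same vector
```

The memory saving comes from the fixed vector length. The trade-off is that different words can collide in the same bucket, so a larger `n_buckets` reduces collisions at the cost of longer vectors.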
