The basic idea is to represent each document as a vector of certain weighted word frequencies. In order to do so, the following parsing and extraction steps are needed; a code sketch of the full pipeline follows the list.
- Ignoring case, extract all unique words from the entire set of documents.
- Eliminate non-content-bearing ``stopwords'' such as ``a'', ``and'', ``the'', etc. For sample lists of stopwords, see \cite[Chapter 7]{frakes:baeza-yates}.
- For each document, count the number of occurrences of each word.
- Using heuristic or information-theoretic criteria, eliminate non-content-bearing ``high-frequency'' and ``low-frequency'' words \cite{salton:book}.
- After the above elimination, suppose $d$ unique words remain. Assign a unique identifier between $1$ and $d$ to each remaining word, and a unique identifier between $1$ and $n$ to each document, where $n$ denotes the total number of documents.
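To make these steps concrete, the following Python sketch runs through the pipeline under simplifying assumptions: regex-based tokenization, a small illustrative stopword set, and hypothetical document-frequency thresholds (\verb|min_df|, \verb|max_df_fraction|) standing in for the heuristic elimination of low- and high-frequency words. It is only an illustration, not the preprocessing used for any particular corpus in this paper.
\begin{verbatim}
import re
from collections import Counter

# Illustrative stopword subset only; real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def preprocess(documents, min_df=2, max_df_fraction=0.5):
    """Tokenize, drop stopwords, count words, and prune the vocabulary.

    min_df and max_df_fraction are hypothetical thresholds standing in
    for the heuristic elimination of low- and high-frequency words.
    """
    n = len(documents)

    # Per-document word counts, ignoring case.
    doc_counts = []
    for text in documents:
        tokens = re.findall(r"[a-z]+", text.lower())
        doc_counts.append(Counter(t for t in tokens if t not in STOPWORDS))

    # Document frequency: number of documents containing each word.
    df = Counter()
    for counts in doc_counts:
        df.update(counts.keys())

    # Keep words that are neither too rare nor too common.
    vocab = sorted(w for w, dj in df.items()
                   if min_df <= dj <= max_df_fraction * n)

    # Assign each surviving word an identifier (0-based index here);
    # documents are identified by their position in the input list.
    word_id = {w: j for j, w in enumerate(vocab)}
    return doc_counts, word_id, df
\end{verbatim}
In this sketch, \verb|doc_counts[i][w]| plays the role of the word count $f_{ji}$ and \verb|df[w]| the role of the document frequency $d_j$ used below.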
The above preprocessing yields the number of occurrences of word $j$ in document $i$, say, $f_{ji}$, and the number of documents which contain the word $j$, say, $d_j$. Using these counts, we can represent the $i$-th document as a $d$-dimensional vector $\mathbf{x}_i$ as follows. For $1 \le j \le d$, set the $j$-th component of $\mathbf{x}_i$, denoted $x_{ji}$, to be the product of three terms
\begin{displaymath}
x_{ji} = t_{ji} \, g_j \, s_i,
\end{displaymath}
where $t_{ji}$ is the term weighting component and depends only on $f_{ji}$, $g_j$ is the global weighting component and depends on $d_j$, and $s_i$ is the normalization component for $\mathbf{x}_i$.
There are many schemes for selecting the term, global, and normalization components; see \cite{kolda:thesis} for various possibilities. In this paper we use the popular scheme known as normalized term frequency-inverse document frequency. This scheme uses $t_{ji} = f_{ji}$, $g_j = \log(n/d_j)$, and $s_i = \bigl( \sum_{j=1}^{d} (t_{ji} g_j)^2 \bigr)^{-1/2}$. Note that this normalization implies that $\|\mathbf{x}_i\| = 1$, i.e., each document vector lies on the surface of the unit sphere in $\mathbb{R}^d$. Intuitively, the effect of normalization is to retain only the proportion of words occurring in a document. This ensures that documents dealing with the same subject matter (that is, using similar words), but differing in length, lead to similar document vectors.
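As an illustration only, the sketch below computes these normalized tf-idf vectors from the counts produced by the hypothetical \verb|preprocess| helper above; \verb|tfidf_vectors| is not part of any standard library, and it assumes exactly the scheme stated: $t_{ji} = f_{ji}$, $g_j = \log(n/d_j)$, and unit $L_2$ normalization.
\begin{verbatim}
import math

def tfidf_vectors(doc_counts, word_id, df, n):
    """Return unit-length tf-idf vectors with x_ji = t_ji * g_j * s_i."""
    d = len(word_id)

    # Global (inverse document frequency) weights g_j = log(n / d_j).
    g = {w: math.log(n / df[w]) for w in word_id}

    vectors = []
    for counts in doc_counts:
        x = [0.0] * d
        for w, f_ji in counts.items():        # t_ji = f_ji (raw term count)
            if w in word_id:
                x[word_id[w]] = f_ji * g[w]
        norm = math.sqrt(sum(v * v for v in x))
        if norm > 0:                          # s_i scales x_i to unit length
            x = [v / norm for v in x]
        vectors.append(x)
    return vectors
\end{verbatim}
Combined with the earlier sketch, a call such as \verb|tfidf_vectors(*preprocess(docs), n=len(docs))| yields one unit-length vector per document, so that two documents using similar words in similar proportions map to nearby points on the unit sphere regardless of their lengths.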