The basic idea is to represent each document as a vector of certain weighted word frequencies. In order to do so, the following parsing and extraction steps are needed.
- Ignoring case, extract all unique words from the entire set of documents.
- Eliminate non-content-bearing ``stopwords'' such as ``a'', ``and'', ``the'', etc. For sample lists of stopwords, see \cite[Chapter 7]{frakes:baeza-yates}.
- For each document, count the number of occurrences of each word.
- Using heuristic or information-theoretic criteria, eliminate non-content-bearing ``high-frequency'' and ``low-frequency'' words \cite{salton:book}.
- After the above elimination, suppose $d$ unique words remain. Assign a unique identifier between $1$ and $d$ to each remaining word, and a unique identifier between $1$ and $n$ to each document. (These preprocessing steps are sketched in code after this list.)
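As a concrete illustration of the parsing and extraction steps, here is a minimal Python sketch, assuming the documents are available as an in-memory list of strings; the function name `preprocess`, the stopword set, and the document-frequency thresholds `min_df` and `max_df_frac` are illustrative choices and are not prescribed by the text above.

```python
import re
from collections import Counter

def preprocess(docs, stopwords, min_df=2, max_df_frac=0.5):
    """Tokenize, filter, and index a list of document strings.

    Returns per-document word counts, document frequencies, and the
    word -> identifier mapping (identifiers run 0..d-1 here, rather
    than 1..d, to match Python indexing).
    """
    # Ignoring case, split each document into words.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]

    # Count occurrences of each word in each document,
    # skipping non-content-bearing stopwords.
    doc_counts = [Counter(w for w in toks if w not in stopwords)
                  for toks in tokenized]

    # Document frequency of each word: number of documents containing it.
    df = Counter()
    for counts in doc_counts:
        df.update(counts.keys())

    # Eliminate "high-frequency" and "low-frequency" words via a simple
    # document-frequency heuristic (the thresholds are illustrative,
    # standing in for the heuristic or information-theoretic criteria).
    n = len(docs)
    vocab = sorted(w for w, c in df.items()
                   if min_df <= c <= max_df_frac * n)

    # Assign a unique identifier to each remaining word.
    word_id = {w: j for j, w in enumerate(vocab)}

    # f[i][j] = number of occurrences of word j in document i.
    f = [{word_id[w]: c for w, c in counts.items() if w in word_id}
         for counts in doc_counts]
    # d_j[j] = number of documents that contain word j.
    d_j = {word_id[w]: df[w] for w in vocab}
    return f, d_j, word_id
```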
The above preprocessing yields the number of occurrences of word $j$ in document $i$, say, $f_{ji}$, and the number of documents which contain word $j$, say, $d_j$. Using these counts, we can represent the $i$-th document as a $d$-dimensional vector $\mathbf{x}_i$ as follows. For $1 \leq j \leq d$, set the $j$-th component of $\mathbf{x}_i$, namely $x_{ji}$, to be the product of three terms
\[
x_{ji} = t_{ji} \, g_j \, s_i ,
\]
where $t_{ji}$ is the term weighting component and depends only on $f_{ji}$, $g_j$ is the global weighting component and depends on $d_j$, and $s_i$ is the normalization component for the $i$-th document.
There are many schemes for selecting the term, global, and normalization components; see \cite{kolda:thesis} for various possibilities. In this paper we use the popular scheme known as normalized term frequency-inverse document frequency. This scheme uses $t_{ji} = f_{ji}$, $g_j = \log(n / d_j)$, and $s_i = \left( \sum_{j=1}^{d} ( t_{ji} g_j )^2 \right)^{-1/2}$. Note that this normalization implies that $\| \mathbf{x}_i \| = 1$, i.e., each document vector lies on the surface of the unit sphere in $\mathbb{R}^d$. Intuitively, the effect of normalization is to retain only the proportion of words occurring in a document. This ensures that documents dealing with the same subject matter (that is, using similar words), but differing in length, lead to similar document vectors.
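To make the weighting scheme concrete, the following Python sketch builds the unit-norm tf-idf vectors $\mathbf{x}_i$ from the counts $f_{ji}$ and $d_j$ produced by a preprocessing step such as the one sketched above; the function name `tfidf_vectors` and the sparse dictionary representation are implementation choices, not part of the scheme itself.

```python
import math

def tfidf_vectors(f, d_j, n):
    """Build unit-norm tf-idf document vectors.

    f   : list of dicts, f[i][j] = occurrences of word j in document i
    d_j : dict, d_j[j]  = number of documents containing word j
    n   : total number of documents
    Each vector is returned as a sparse dict {j: x_ji} with unit 2-norm.
    """
    vectors = []
    for counts in f:
        # Term component t_ji = f_ji, global component g_j = log(n / d_j).
        x = {j: tf * math.log(n / d_j[j]) for j, tf in counts.items()}
        # Normalization component s_i scales the vector to unit length,
        # so every document vector lies on the unit sphere.
        norm = math.sqrt(sum(v * v for v in x.values()))
        if norm > 0:
            x = {j: v / norm for j, v in x.items()}
        vectors.append(x)
    return vectors
```

Because each resulting vector has unit norm, the inner product of two document vectors equals their cosine similarity, so differences in document length do not affect the comparison.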