Spelling Checker using Lucene

    My initial interest in spell checking algorithms started when I had to fix some bugs in the spell checking code at work over a year ago. We use Jazzy, a Java implementation of GNU Aspell, as our spell checking library. Jazzy uses a combination of Metaphone and Levenshtein distance (aka Edit distance) to match a misspelled word to a set of words in its dictionary.
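
    To illustrate the general idea (this is a minimal sketch in the spirit of Jazzy, not its actual internals; the class name and cutoff parameter are mine), a candidate can be matched phonetically with the Metaphone encoder from commons-codec and then filtered by Levenshtein distance from commons-lang:

    // MetaphoneMatcher.java
    package com.mycompany.spellcheck;

    import org.apache.commons.codec.language.Metaphone;
    import org.apache.commons.lang.StringUtils;

    public class MetaphoneMatcher {

      private Metaphone metaphone = new Metaphone();

      // a candidate matches if it sounds like the input and is within maxDist edits
      public boolean matches(String input, String candidate, int maxDist) {
        return metaphone.isMetaphoneEqual(input, candidate)
            && StringUtils.getLevenshteinDistance(input, candidate) <= maxDist;
      }
    }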

    An alternative approach to spell checking is to use n-grams. Basically, you break each dictionary word into sequences of characters of length n, sliding the window forward one character at a time, and store the resulting grams in an index. When the user enters a misspelled word, you do the same thing with the input word, then match the n-grams it generates against the n-grams in your dictionary. The idea is that a misspelled word typically has only a few characters wrong, transposed, or missing, so comparing n-grams surfaces correctly spelled words that are close to it. This is a newer approach to spell checking that has become practical with the availability of open source search libraries such as Lucene, since it requires tokenizing your dictionary words and storing them in an index for quick retrieval.
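
    For example, the word "cancer" yields the 3-grams can, anc, nce, cer. A minimal sketch of the sliding window in plain Java (Lucene's NGramTokenizer, used later, does the same thing as a TokenStream):

    // NGrams.java -- sliding-window n-gram generation, for illustration only
    import java.util.ArrayList;
    import java.util.List;

    public class NGrams {
      public static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<String>();
        // move the window forward one character at a time
        for (int i = 0; i + n <= word.length(); i++) {
          grams.add(word.substring(i, i + n));
        }
        return grams; // ngrams("cancer", 3) => [can, anc, nce, cer]
      }
    }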

    I first read about this approach in Tom White's article "Did you mean Lucene?" on java.net quite some time back, but I never completely understood it then, probably because I was new to Lucene. What I found attractive about it, however, was its compactness and performance.

    You see, the spell checking on most search sites is surfaced as a one-line "Did you mean..." component at the top of your search results. Compared to the actual search results, it's pretty much irrelevant, unless its suggestions are way off the mark, at which point it has entertainment value for the user and embarrassment value for the site serving the results. So you want spell checking code that's quick to maintain and execute, even though its internals are likely to be much more complex than the code serving up the search results.

    And besides, as a search company heavily invested in Lucene, it was probably more appropriate for us to be using a Lucene based approach, if one existed and was on par, performance and results-wise, with other approaches.

    I recently came across this wiki page on the jakarta-lucene site, where the n-gram approach is described more succinctly. You can download the code implementing the algorithm from the original Bugzilla submission. It was written by Nicolas Maisonneuve (and, according to the class Javadocs, inspired by David Spencer's code). However, the code is written against the Lucene 1.x API, and I was using Lucene 2.2, so it would not compile for me.

    The algorithm itself is quite simple and intuitive: each dictionary word is tokenized and stored in an index with the following fields.

    Field    Description                                             Example
    word     The dictionary word                                     cancer
    gram3    Multi-field of all 3-grams for the word                 can, anc, nce, cer
    gram4    Multi-field of all 4-grams for the word                 canc, ance, ncer
    start3   The first 3 characters of the word, for edge boosting   can
    end3     The last 3 characters of the word, for edge boosting    cer
    start4   The first 4 characters of the word, for edge boosting   canc
    end4     The last 4 characters of the word, for edge boosting    ncer

    The search query is a BooleanQuery consisting of all the fields (except word) of the index, OR'd together. If edge boosting is desired (on the assumption that people make fewer mistakes at the start and end of words), then the start* and end* field queries are given a higher boost in the query.
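
    For example, for the misspelled input "canser" (and assuming edge boosting with the 2.0 boost used in my code below), the query printed via Query.toString() would look roughly like this:

    gram3:can gram3:ans gram3:nse gram3:ser gram4:cans gram4:anse gram4:nser
    start3:can^2.0 end3:ser^2.0 start4:cans^2.0 end4:nser^2.0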

    Since I had the algorithm anyway, I decided to rewrite the code, using Nicolas's code as a guide. For one thing, quite some time has passed since he wrote it, and n-gram tokenizers are now available from the Lucene sandbox, so I could simply grab one instead of having to write my own. Second, he had implemented a custom version of edit distance, where I preferred the getLevenshteinDistance() method from the Jakarta commons-lang project. Third, his method of loading a dictionary was from an existing index, whereas I wanted something more generic. For these reasons, I built my own Lucene based spell checker, the code for which is shown below.

    I first built an NGramAnalyzer using the NGramTokenizer provided by the Lucene Analyzers sandbox project.

    // NGramAnalyzer.java
    package com.mycompany.spellcheck;
    
    import java.io.Reader;
    
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    
    public class NGramAnalyzer extends Analyzer {
    
      private int minGram;
      private int maxGram;
      
      public NGramAnalyzer(int minGram, int maxGram) {
        this.minGram = minGram;
        this.maxGram = maxGram;
      }
      
      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        return new NGramTokenizer(reader, minGram, maxGram);
      }
    }
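
    A quick throwaway snippet shows what this analyzer emits (using the same Lucene 2.2 TokenStream API that the analyze() method below uses):

    // Print the 3- and 4-grams the analyzer produces for "cancer".
    Analyzer analyzer = new NGramAnalyzer(3, 4);
    TokenStream stream = analyzer.tokenStream("dummy", new StringReader("cancer"));
    Token token;
    while ((token = stream.next()) != null) {
      System.out.println(token.termText()); // can, anc, nce, cer, canc, ance, ncer (order may vary)
    }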
    


    Next I built a SpellChecker which exposes two methods: addToDictionary() and suggestSimilar(). The first allows you to add words to the spelling dictionary (which is a Lucene index with the fields described in the SpellChecker wiki page), and the second returns a List of spelling suggestions for a given word. In addition (unlike Nicolas's code), it also admits results that are Soundex-equivalent to the input, since sometimes the user's spelling is wildly off the mark but they know what the word sounds like, and we want to catch that too.
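
    For reference, Soundex.difference() from commons-codec returns a value from 0 through 4, where 4 means the two Soundex encodings are identical; so with the default soundexDiffCutoff of 4, only exact phonetic matches pass. A quick example (difference() declares EncoderException):

    Soundex soundex = new Soundex();
    // "cancer" and "canser" both encode as C526, so the difference is 4
    int diff = soundex.difference("cancer", "canser");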

    // LuceneSpellChecker.java
    package com.mycompany.spellcheck;
    
    import java.io.File;
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    
    import org.apache.commons.codec.EncoderException;
    import org.apache.commons.codec.language.Soundex;
    import org.apache.commons.lang.StringUtils;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.Field.Index;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.store.FSDirectory;
    
    public class LuceneSpellChecker {
      
      private String spellIndexDir;
      private boolean boostEdges;
      private int editDistanceCutoff = 3;
      private int soundexDiffCutoff = 4;
      
      private IndexWriter spellIndexWriter;
      private IndexSearcher spellIndexSearcher;
      
      private class AnalysisResult {
        public String input;
        public List<String> gram3s = new ArrayList<String>();
        public List<String> gram4s = new ArrayList<String>();
        public String start3 = "";
        public String start4 = "";
        public String end3 = "";
        public String end4 = "";
      }
      
      public void setBoostEdges(boolean boostEdges) {
        this.boostEdges = boostEdges;
      }
      
      public void setEditDistanceCutoff(int editDistanceCutoff) {
        this.editDistanceCutoff = editDistanceCutoff;
      }
      
      public void setSoundexDiffCutoff(int soundexDiffCutoff) {
        this.soundexDiffCutoff = soundexDiffCutoff;
      }
      
      public void setSpellIndexDir(String spellIndexDir) {
        this.spellIndexDir = spellIndexDir;
      }
      
      public void init() throws Exception {
        File spellIndexDirFile = new File(spellIndexDir);
        this.spellIndexWriter = new IndexWriter(
          FSDirectory.getDirectory(spellIndexDirFile), new StandardAnalyzer());
        this.spellIndexSearcher = new IndexSearcher(spellIndexDir);
      }
      
      public void destroy() throws Exception {
        spellIndexWriter.close();
        spellIndexSearcher.close();
      }
    
      public void flush() throws Exception {
        spellIndexWriter.optimize();
      }
      
      public void addToDictionary(String dictionaryEntry) throws IOException {
        AnalysisResult result = analyze(dictionaryEntry);
        Document doc = new Document();
        doc.add(new Field("word", result.input, Store.YES, Index.TOKENIZED));
        for (String gram3 : result.gram3s) {
          doc.add(new Field("gram3", gram3, Store.YES, Index.TOKENIZED));
        }
        for (String gram4 : result.gram4s) {
          doc.add(new Field("gram4", gram4, Store.YES, Index.TOKENIZED));
        }
        doc.add(new Field("start3", result.start3, Store.YES, Index.TOKENIZED));
        doc.add(new Field("start4", result.start4, Store.YES, Index.TOKENIZED));
        doc.add(new Field("end3", result.end3, Store.YES, Index.TOKENIZED));
        doc.add(new Field("end4", result.end4, Store.YES, Index.TOKENIZED));
        spellIndexWriter.addDocument(doc);
      }
      
      public List<String> suggestSimilar(String input, int maxSuggestions) 
          throws IOException, EncoderException {
        AnalysisResult result = analyze(input);
        BooleanQuery query = new BooleanQuery();
        addGramQuery(query, "gram3", result.gram3s);
        addGramQuery(query, "gram4", result.gram4s);
        addEdgeQuery(query, "start3", result.start3);
        addEdgeQuery(query, "end3", result.end3);
        addEdgeQuery(query, "start4", result.start4);
        addEdgeQuery(query, "end4", result.end4);
        Set<String> words = new HashSet<String>();
        Soundex soundex = new Soundex(); // created once, reused for all hits
        Hits hits = spellIndexSearcher.search(query);
        int numHits = hits.length();
        for (int i = 0; i < numHits; i++) {
          Document doc = hits.doc(i);
          String suggestion = doc.get("word");
          // skip the suggestion if it is identical to the input
          if (suggestion.equalsIgnoreCase(input)) {
            continue;
          }
          // include the suggestion if it is within the Levenshtein distance cutoff
          if (StringUtils.getLevenshteinDistance(input, suggestion) < editDistanceCutoff) {
            words.add(suggestion);
          }
          // also include it if it sounds like the input (Soundex equivalence)
          if (soundex.difference(input, suggestion) >= soundexDiffCutoff) {
            words.add(suggestion);
          }
        }
        List<String> wordlist = new ArrayList<String>();
        wordlist.addAll(words);
        return wordlist.subList(0, Math.min(maxSuggestions, wordlist.size()));
      }
    
      private AnalysisResult analyze(String input) throws IOException {
        AnalysisResult result = new AnalysisResult();
        result.input = input;
        Analyzer analyzer = new NGramAnalyzer(3, 4);
        TokenStream tokenStream = analyzer.tokenStream("dummy", new StringReader(input));
        Token token;
        while ((token = tokenStream.next()) != null) {
          String text = token.termText();
          if (text.length() == 3) {
            result.gram3s.add(text);
          } else if (text.length() == 4) {
            result.gram4s.add(text);
          } // the analyzer only emits grams of length 3 or 4
        }
        result.start3 = input.substring(0, Math.min(input.length(), 3));
        result.start4 = input.substring(0, Math.min(input.length(), 4));
        result.end3 = input.substring(Math.max(0, input.length() - 3), input.length());
        result.end4 = input.substring(Math.max(0, input.length() - 4), input.length());
        return result;
      }
      
      private void addGramQuery(BooleanQuery query, String fieldName, List<String> grams) {
        for (String gram : grams) {
          query.add(new TermQuery(new Term(fieldName, gram)), Occur.SHOULD);
        }
      }
      
      private void addEdgeQuery(BooleanQuery query, String fieldName, String fieldValue) {
        TermQuery edgeQuery = new TermQuery(new Term(fieldName, fieldValue));
        if (boostEdges) {
          edgeQuery.setBoost(2.0F);
        }
        query.add(edgeQuery, Occur.SHOULD);
      }
    }
    


    Finally, here is the JUnit test I used to test the code above; it also shows how a client would call the SpellChecker. The test reads a file of misspelled-word/correct-spelling pairs and runs through it: the first test adds the correct words to the index, and the second searches for suggestions for the misspelled ones. The SpellChecker class is instantiated directly in the test, but it could also be obtained from a Spring BeanFactory, since it is set up for setter injection.
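
    The input file contains one misspelled=correct pair per line, and lines starting with # are treated as comments. A hypothetical my_tests.txt (these entries are illustrative, not my actual test data) might look like this:

    # misspelled=correct
    canser=cancer
    dokter=doctor
    seperate=separate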

    // LuceneSpellCheckerTest.java
    package com.mycompany.spellcheck;
    
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.util.List;
    
    import org.apache.commons.io.FileUtils;
    import org.apache.commons.lang.StringUtils;
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.junit.After;
    import org.junit.Before;
    import org.junit.BeforeClass;
    import org.junit.Test;
    
    public class LuceneSpellCheckerTest {
    
      private static final String INPUT_FILE = "src/main/resources/my_tests.txt";
      private static final String INDEX_DIR = "/tmp/spellindex";
      private static final Log LOG = LogFactory.getLog(LuceneSpellCheckerTest.class);
    
      private LuceneSpellChecker spellChecker;
      
      @BeforeClass
      public static void setUpBeforeClass() throws Exception {
        File indexDirFile = new File(INDEX_DIR);
        if (indexDirFile.exists()) {
          FileUtils.forceDelete(new File(INDEX_DIR));
        }
      }
      
      @Before
      public void setUp() throws Exception {
        spellChecker = new LuceneSpellChecker();
        spellChecker.setSpellIndexDir(INDEX_DIR);
        spellChecker.setBoostEdges(true);
        spellChecker.setEditDistanceCutoff(3);
        spellChecker.setSoundexDiffCutoff(4);
        spellChecker.init();
      }
      
      @After
      public void tearDown() throws Exception {
        spellChecker.destroy();
      }
      
      @Test
      public void testIndexing() throws Exception {
        BufferedReader reader = new BufferedReader(
          new InputStreamReader(new FileInputStream(INPUT_FILE)));
        String line;
        while ((line = reader.readLine()) != null) {
          if (line.startsWith("#")) {
            continue;
          }
          String[] parts = StringUtils.split(line, "=");
          if (parts.length == 2) {
            String correctSpelling = parts[1];
            if (StringUtils.isNotBlank(correctSpelling)) {
              LOG.debug("Adding to index:" + correctSpelling);
              spellChecker.addToDictionary(correctSpelling);
            }
          }
        }
        spellChecker.flush();
        reader.close();
      }
    
      @Test
      public void testSearching() throws Exception {
        BufferedReader reader = new BufferedReader(
          new InputStreamReader(new FileInputStream(INPUT_FILE)));
        String line;
        while ((line = reader.readLine()) != null) {
          if (line.startsWith("#")) {
            continue;
          }
          String wrongSpelling = StringUtils.split(line, "=")[0];
          List<String> suggestions = spellChecker.suggestSimilar(wrongSpelling, 2);
          LOG.debug("suggestion[" + wrongSpelling + "]=" + suggestions);
        }
        reader.close();
      }
    }
    


    The above code is not as complete as Nicolas's, since it does not have the code to do popularity-based boosting in suggestSimilar(). Essentially, this means that if the spell checker has two or more suggestions, the caller can specify that the suggestions be returned in order of decreasing popularity in the index. Internally, the code would call docFreq() for each of the suggestions and reorder accordingly.
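
    A sketch of what that reordering might look like is shown below. This is my own guess at the approach, not Nicolas's code; it assumes single-word, lowercased dictionary entries so that docFreq() on the word field finds them, and it needs java.util.Map, HashMap, Collections and Comparator added to the class's imports:

    // Sketch: reorder suggestions by how many dictionary documents
    // contain each word, most frequent first.
    private List<String> sortByPopularity(List<String> suggestions) throws IOException {
      final Map<String, Integer> freqs = new HashMap<String, Integer>();
      for (String suggestion : suggestions) {
        freqs.put(suggestion, spellIndexSearcher.docFreq(new Term("word", suggestion)));
      }
      List<String> sorted = new ArrayList<String>(suggestions);
      Collections.sort(sorted, new Comparator<String>() {
        public int compare(String s1, String s2) {
          return freqs.get(s2).compareTo(freqs.get(s1)); // descending docFreq
        }
      });
      return sorted;
    }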


    From my minimal testing, it appears that multi-word dictionary entries are handled as well as single-word entries, but I cannot say for sure. If this is true, then it would make sense to have the popularity functionality. With Jazzy this is not the case, so the application has to figure out the "best" multi-word suggestion from a cartesian join of all the single-word suggestions. If the same holds true for this approach, then it may make sense to apply the popularity metric to the combinations of single-word suggestions instead.
