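A while ago I used the performance test code in zoie's perf package to compare the real-time search behavior of lucene and zoie. The results surprised me: judging by the data, lucene is better suited than zoie to typical real-time search scenarios.
zoie's perf package evaluates four metrics: search latency, indexing latency, indexing event rate, and indexing event size. Figure 1 shows the results for zoie, and Figure 2 shows the results for lucene nrt.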
Figure 1: zoie test data
Figure 2: lucene nrt test data
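The data makes it easy to see that lucene wins on search response time, while zoie performs better when indexing data. In a comment reply under his blog post "Lucene's near-real-time search is fast!", Mike McCandless explained the difference between nrt and zoie: “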
The biggest difference is that Zoie aims for immediate consistency
(reopen after every index change & next query), which I think very few
apps really require, given how fast NRT is.
Also, NRTCachingDir (caching small segments in RAM) achieves the
biggest (in my opinion) benefit of Zoie, but with substantially less
added complexity. Reducing complexity is important because it means
less risk of bugs; for example, Zoie had some scary corruption bugs,
which took quite some time to track down; see
https://issues.apache.org/jira/browse/LUCENE-2729
The other part of Zoie I remember is deferring resolving deletions to
Lucene docIDs, and instead using a bloom filter to post-filter
collected documents. While I understand the motivation for this
("immediate consistency") I think it's the wrong tradeoff since it
necessarily slows down all searching (checking a bloom filter is more
costly than Lucene's checking a bit set), not to mention the added RAM
required for the bloom filter.
Ie, it's better to spend more time during reopen to resolve the
deletions, so that searches don't slow down.
”
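To make the trade-off in the quote concrete, here is a minimal Java sketch of the two ways of handling deletions. It is not code from Lucene or Zoie: the class `DeletionSketch` and its method names are hypothetical, the UID-per-docID array is an assumed stand-in for a real index, and a plain `HashSet` stands in for the bloom filter mentioned in the quote. Strategy A pays once at reopen to turn deleted UIDs into a bit set of live docIDs. Strategy B does nothing at reopen and instead looks up every collected hit at search time.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Lucene or Zoie.
class DeletionSketch {

    // Strategy A (NRT-like): at reopen, resolve deleted UIDs to docIDs once
    // and remember the surviving docIDs in a bit set.
    static BitSet resolveDeletions(long[] uidByDocId, Set<Long> deletedUids) {
        BitSet live = new BitSet(uidByDocId.length);
        for (int docId = 0; docId < uidByDocId.length; docId++) {
            if (!deletedUids.contains(uidByDocId[docId])) {
                live.set(docId);
            }
        }
        return live;
    }

    // Search-time cost for strategy A: one bit test per hit.
    static List<Integer> collectWithBitSet(int[] hits, BitSet live) {
        List<Integer> out = new ArrayList<>();
        for (int docId : hits) {
            if (live.get(docId)) {
                out.add(docId);
            }
        }
        return out;
    }

    // Strategy B (Zoie-like): no work at reopen; every hit is mapped to its
    // UID and checked against the deleted set (a stand-in for a bloom filter).
    static List<Integer> collectWithPostFilter(int[] hits, long[] uidByDocId,
                                               Set<Long> deletedUids) {
        List<Integer> out = new ArrayList<>();
        for (int docId : hits) {
            if (!deletedUids.contains(uidByDocId[docId])) {
                out.add(docId);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] uidByDocId = {100L, 101L, 102L, 103L};   // assumed sample data
        Set<Long> deletedUids = new HashSet<>(List.of(101L, 103L));
        int[] hits = {0, 1, 2, 3};

        BitSet live = resolveDeletions(uidByDocId, deletedUids);
        System.out.println(collectWithBitSet(hits, live));
        System.out.println(collectWithPostFilter(hits, uidByDocId, deletedUids));
    }
}
```

Both strategies return the same hits. The difference is where the cost lands: strategy A makes reopen slower but keeps the per-hit check to a single bit test, while strategy B makes deletions visible immediately at the price of a lookup on every collected document.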
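In short, zoie's strong consistency and deferred resolution of deletions make its search response time longer than lucene's. Its special design also adds code complexity, which makes bugs hard to track down, and for users the documentation is scarce and reading the code is time-consuming. I suspect this is one of the reasons it never became popular. Search scenarios with updates as frequent as LinkedIn's are rare. In the more common case lucene nrt is fully up to the job, so I honestly think cntv and NetEase have no real need to use zoie...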