  MongoDB Text Search Tutorial


    In my introduction to text search in MongoDB, we had a look at the basic features. Today we’ll have a closer look at the details.


    You may have noticed that a text search is not executed with a find() command. Instead you call

    db.foo.runCommand( "text", {search: "bar"} )
    Remember it’s an experimental feature still. Adding it to the implementation of the find() command would have mixed critical production code with the new text search feature. When executed via a runCommand() call, text search can be run and tested in isolation.

    I expect to see a new query operator like $text or $textsearch as soon as text search is integrated with the standard find() command.






    Text Query Syntax

    In the previous examples we just searched for a single word. We can do more than that. Let’s have a look at the following example:

    db.foo.ensureIndex( {txt: "text"} )
    db.foo.insert( {txt: "Robots are superior to humans"} )
    db.foo.insert( {txt: "Humans are weak"} )
    db.foo.insert( {txt: "I, Robot - by Isaac Asimov"} )
    A search for “robot” will find two documents; the same is true for “human”:
    > db.foo.runCommand("text", {search: "robot"}).results.length
    2
    > db.foo.runCommand("text", {search: "human"}).results.length
    2
    When searching for multiple terms, an OR search is performed, yielding three documents in our example:
    > db.foo.runCommand("text", {search: "human robot"}).results.length
    3
    I would have expected the given search words to be AND-ed, not OR-ed.
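
    To make the OR behaviour concrete, here is a minimal sketch in plain JavaScript that mimics it on the three example documents. It does not talk to MongoDB; the helper names (naiveStem, tokenize, orSearch) are made up for this illustration, and the "stemmer" merely strips a trailing "s", which is a crude stand-in for the real stemming.

```javascript
// Naive stand-in for a stemmer: lowercase and strip a trailing "s".
// This is an illustrative assumption, not the algorithm MongoDB uses.
function naiveStem(word) {
  const lower = word.toLowerCase();
  return lower.length > 3 && lower.endsWith("s") ? lower.slice(0, -1) : lower;
}

// Split text into stemmed tokens, ignoring punctuation.
function tokenize(text) {
  return (text.match(/[A-Za-z]+/g) || []).map(naiveStem);
}

// OR semantics: a document matches if it contains at least one search stem.
function orSearch(docs, search) {
  const wanted = new Set(tokenize(search));
  return docs.filter((doc) => tokenize(doc.txt).some((t) => wanted.has(t)));
}

const docs = [
  { txt: "Robots are superior to humans" },
  { txt: "Humans are weak" },
  { txt: "I, Robot - by Isaac Asimov" },
];

console.log(orSearch(docs, "robot").length);       // 2
console.log(orSearch(docs, "human").length);       // 2
console.log(orSearch(docs, "human robot").length); // 3
```

    An AND search would replace some() with a check that every search stem occurs in the document, which would leave only the first document for “human robot”.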





    By adding a leading minus sign to a search word, you can exclude documents containing that word. Let’s say we want all documents on “robot” but none on “humans”.

    > db.foo.runCommand("text", {search: "robot -humans"})
    {
            "queryDebugString" : "robot||human||||",
            "language" : "english",
            "results" : [
                    {
                            "score" : 0.6666666666666666,
                            "obj" : {
                                    "_id" : ObjectId("50ebc484214a1e88aaa4ada0"),
                                    "txt" : "I, Robot - by Isaac Asimov"
                            }
                    }
            ],
            "stats" : {
                    "nscanned" : 2,
                    "nscannedObjects" : 0,
                    "n" : 1,
                    "timeMicros" : 212
            },
            "ok" : 1
    }

    Phrase Search

    By enclosing multiple words inside quotes (“foo bar”) you perform a phrase search. Inside a phrase, order is important and stop words are also taken into account:

    > db.foo.runCommand("text", {search: '"robots are"'})
    {
            "queryDebugString" : "robot||||robots are||",
            "language" : "english",
            "results" : [
                    {
                            "score" : 0.6666666666666666,
                            "obj" : {
                                    "_id" : ObjectId("50ebc482214a1e88aaa4ad9e"),
                                    "txt" : "Robots are superior to humans"
                            }
                    }
            ],
            "stats" : {
                    "nscanned" : 2,
                    "nscannedObjects" : 0,
                    "n" : 1,
                    "timeMicros" : 185
            },
            "ok" : 1
    }
    Please have a look at the “queryDebugString” field:
    "queryDebugString" : "robot||||robots are||"
    It tells us that our search string contains one stem “robot” but also the phrase “robots are”. That’s the reason we have only one hit. Compare that to these searches:
    > // order matters inside phrase
    > db.foo.runCommand("text", {search: '"are robots"'}).results.length
    > // no phrase search --> OR query
    > db.foo.runCommand("text", {search: 'are robots'}).results.length
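
    The way a search string breaks down into stems, excluded stems, and phrases can be sketched in plain JavaScript. This is only an illustration under stated assumptions: parseSearch and debugString are hypothetical helpers, the stop word list and the trailing-"s" stemmer are tiny stand-ins, negated phrases are not handled, and the "||"-separated layout is merely inferred from the three debug strings shown above, not taken from MongoDB's source.

```javascript
// Illustrative stop word list and stemmer; both are assumptions, far
// smaller and cruder than what MongoDB actually uses.
const STOP_WORDS = new Set(["a", "an", "are", "is", "the", "to"]);

function naiveStem(word) {
  const lower = word.toLowerCase();
  return lower.length > 3 && lower.endsWith("s") ? lower.slice(0, -1) : lower;
}

// Hypothetical helper: split a search string into stemmed terms,
// negated terms (leading "-"), and quoted phrases.
function parseSearch(search) {
  const parsed = { terms: [], negatedTerms: [], phrases: [] };

  // Pull out quoted phrases first; phrases keep their words unstemmed.
  const rest = search.replace(/"([^"]*)"/g, (whole, phrase) => {
    parsed.phrases.push(phrase.toLowerCase());
    return " " + phrase + " "; // the words of a phrase also count as terms
  });

  for (const token of rest.split(/\s+/).filter(Boolean)) {
    const negated = token.startsWith("-");
    const word = negated ? token.slice(1) : token;
    if (word === "" || STOP_WORDS.has(word.toLowerCase())) continue;
    const stem = naiveStem(word);
    const bucket = negated ? parsed.negatedTerms : parsed.terms;
    if (!bucket.includes(stem)) bucket.push(stem);
  }
  return parsed;
}

// Render the parsed query as four "||"-separated slots; the fourth slot
// always stays empty in this sketch. Joining several entries of one slot
// with "|" is an assumption.
function debugString(parsed) {
  return [
    parsed.terms.join("|"),
    parsed.negatedTerms.join("|"),
    parsed.phrases.join("|"),
    "",
  ].join("||");
}

console.log(debugString(parseSearch("robot -humans")));  // robot||human||||
console.log(debugString(parseSearch('"robots are"')));   // robot||||robots are||
console.log(debugString(parseSearch("are robots")));     // robot||||||
```

    The first two lines reproduce the debug strings from the examples; the third is only what this sketch prints for the unquoted search, where the stop word “are” is dropped and no phrase is recorded.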





    Multi Language Support

    Stemming and stop word filtering are both language dependent. So if you want to use a language other than the default, which is English, you have to tell MongoDB what language to use for indexing and searching. MongoDB uses the open source Snowball stemmer, which supports a number of languages.

    In order to use another language for indexing and searching, specify it when creating the index:

    db.de.ensureIndex( {txt: "text"}, {default_language: "german"} )
    With this setting, MongoDB assumes that all text in the field “txt” and all text searches on that collection are in German. Let’s see if it works:
    > db.de.insert( {txt: "Ich bin Dein Vater, Luke." } )
    > db.de.validate().keysPerIndex["text.de.$txt_text"]
    2
    As you can see, there are only two index keys, so stop word filtering did occur (this time with a German stop word list; “Vater” is the German word for father, not a typo for Vader). Let’s try some searches:
    > db.de.runCommand("text", {search: "ich"}).results.length
    0
    > db.de.runCommand("text", {search: "Vater"}).results.length
    1
    > db.de.runCommand("text", {search: "Luke"}).results.length
    1
    Please note that we don’t have to give the language we are searching for because it is derived from the index. We have hits for the meaningful words “Vater” and “Luke”, but not for the stop word “ich” (which means “I”).

    It is also possible to mix multiple languages in the same index. Each single document can have its own language:

    db.de.insert( {language:"english", txt: "Ich bin ein Berliner" } )
    If a field “language” is present, its content defines the language for stemming and stop word filtering for the indexed field(s) of that document. The word “ich” is not a stop word in English, so it is indexed now.
    // default language: german -> no hits
    > db.de.runCommand("text", {search: "ich"})
    {
            "queryDebugString" : "||||||",
            "language" : "german",
            "results" : [ ],
            "stats" : {
                    "nscanned" : 0,
                    "nscannedObjects" : 0,
                    "n" : 0,
                    "timeMicros" : 96
            },
            "ok" : 1
    }
    // search for English -> one hit
    > db.de.runCommand("text", {search: "ich", language: "english"})
    {
            "queryDebugString" : "ich||||||",
            "language" : "english",
            "results" : [
                    {
                            "score" : 0.625,
                            "obj" : {
                                    "_id" : ObjectId("50ed163b1e27d5e73741fafb"),
                                    "language" : "english",
                                    "txt" : "Ich bin ein Berliner"
                            }
                    }
            ],
            "stats" : {
                    "nscanned" : 1,
                    "nscannedObjects" : 0,
                    "n" : 1,
                    "timeMicros" : 161
            },
            "ok" : 1
    }
    What happened here? The default language for searching is German, so the first search has no result (as before). In the second search we ask for English text (to be more precise: for index keys that were generated with an English stemmer and stop words). That’s why we find the famous sentence from JFK.

    What does that mean? Well, you have a real multi-language text search at hand. You can store text messages from around the world in one collection and still search them depending on the language.
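
    The interplay of default language, per-document language, and search language can be sketched in plain JavaScript. This is a simplified model, not MongoDB's implementation: the stop word lists are tiny made-up samples, stemming is left out entirely, and indexKeys and search are hypothetical helper names.

```javascript
// Tiny illustrative stop word lists; real lists are much longer.
const STOP_WORDS = {
  german: new Set(["ich", "bin", "dein", "ein"]),
  english: new Set(["i", "am", "a", "the", "your"]),
};

// Compute index keys for one document. A "language" field on the document
// overrides the default language of the index.
function indexKeys(doc, defaultLanguage) {
  const language = doc.language || defaultLanguage;
  const stopWords = STOP_WORDS[language] || new Set();
  const words = (doc.txt.match(/\p{L}+/gu) || []).map((w) => w.toLowerCase());
  return { language, keys: words.filter((w) => !stopWords.has(w)) };
}

// Search one word. The query is filtered with the stop words of the search
// language and only matches keys that were generated for that language.
function search(docs, term, defaultLanguage, language = defaultLanguage) {
  const stopWords = STOP_WORDS[language] || new Set();
  const word = term.toLowerCase();
  if (stopWords.has(word)) return [];
  return docs.filter((doc) => {
    const entry = indexKeys(doc, defaultLanguage);
    return entry.language === language && entry.keys.includes(word);
  });
}

const docs = [
  { txt: "Ich bin Dein Vater, Luke." },
  { language: "english", txt: "Ich bin ein Berliner" },
];

console.log(indexKeys(docs[0], "german").keys);              // [ 'vater', 'luke' ]
console.log(search(docs, "ich", "german").length);            // 0
console.log(search(docs, "ich", "german", "english").length); // 1
```

    In this model, “ich” is dropped from the German query before it ever reaches the index, while the English query keeps it and finds the one document whose keys were built with the English settings.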



    Multiple Fields

    A text index can span more than one field. If you are using more than one field, each field can have its own weight. That enables you to index text parts of your document that differ in importance.

    > db.mail.ensureIndex( {subject: "text", body: "text"}, {weights: {subject: 10} } )
    > db.mail.getIndices()
    [
            ...
            {
                    "v" : 0,
                    "key" : {
                            "_fts" : "text",
                            "_ftsx" : 1
                    },
                    "ns" : "de.mail",
                    "name" : "subject_text_body_text",
                    "weights" : {
                            "body" : 1,
                            "subject" : 10
                    },
                    "default_language" : "english",
                    "language_override" : "language"
            }
    ]
    We created a text index spanning the fields “subject” and “body”, where the first got a weight of 10 and the latter the standard weight 1. Let’s see what impact these weights have:
    > db.mail.insert( {subject: "Robot leader to minions", body: "Humans suck", prio: 0 } )
    > db.mail.insert( {subject: "Human leader to minions", body: "Robots suck", prio: 1 } )
    > db.mail.runCommand("text", {search: "robot"})
    {
            "queryDebugString" : "robot||||||",
            "language" : "english",
            "results" : [
                    {
                            "score" : 6.666666666666666,
                            "obj" : {
                                    "_id" : ObjectId("50ed1be71e27d5e73741fafe"),
                                    "subject" : "Robot leader to minions",
                                    "body" : "Humans suck",
                                    "prio" : 0
                            }
                    },
                    {
                            "score" : 0.75,
                            "obj" : {
                                    "_id" : ObjectId("50ed1bfd1e27d5e73741faff"),
                                    "subject" : "Human leader to minions",
                                    "body" : "Robots suck",
                                    "prio" : 1
                            }
                    }
            ],
            "stats" : {
                    "nscanned" : 2,
                    "nscannedObjects" : 0,
                    "n" : 2,
                    "timeMicros" : 148
            },
            "ok" : 1
    }
    The document with “robot” in the “subject” field has a much higher score because the weight of 10 is taken as a multiplier.
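
    The effect of a weight acting as a multiplier can be illustrated with a small plain-JavaScript sketch. The per-field base score used here (matching tokens divided by all tokens in the field) is an assumption made up for the illustration; it does not reproduce MongoDB's scoring formula or the exact scores shown above, only the ranking effect. The helper names are hypothetical.

```javascript
// Crude stand-in for stemming (assumption): lowercase, strip a trailing "s".
function naiveStem(word) {
  const lower = word.toLowerCase();
  return lower.length > 3 && lower.endsWith("s") ? lower.slice(0, -1) : lower;
}

function tokenize(text) {
  return (text.match(/[A-Za-z]+/g) || []).map(naiveStem);
}

// Score = sum over the indexed fields of
//   weight * (matching tokens / all tokens in that field).
function score(doc, term, weights) {
  const stem = naiveStem(term);
  let total = 0;
  for (const [field, weight] of Object.entries(weights)) {
    const tokens = tokenize(String(doc[field] || ""));
    if (tokens.length === 0) continue;
    const hits = tokens.filter((t) => t === stem).length;
    total += weight * (hits / tokens.length);
  }
  return total;
}

const weights = { subject: 10, body: 1 };
const mails = [
  { subject: "Human leader to minions", body: "Robots suck" },
  { subject: "Robot leader to minions", body: "Humans suck" },
];

// Rank the mails by descending score, best hit first.
const ranked = mails
  .map((mail) => ({ score: score(mail, "robot", weights), subject: mail.subject }))
  .sort((a, b) => b.score - a.score);

console.log(ranked);
```

    With equal weights the hit in the two-word body would outrank the hit in the four-word subject under this toy formula; the weight of 10 on “subject” reverses that order.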




    Filtering and Projection

    You can apply additional search criteria via filtering:

    > db.mail.runCommand("text", {search: "robot", filter: {prio:0} } )
    {
            "queryDebugString" : "robot||||||",
            "language" : "english",
            "results" : [
                    {
                            "score" : 6.666666666666666,
                            "obj" : {
                                    "_id" : ObjectId("50ed22621e27d5e73741fb04"),
                                    "subject" : "Robot leader to minions",
                                    "body" : "Humans suck",
                                    "prio" : 0
                            }
                    }
            ],
            "stats" : {
                    "nscanned" : 2,
                    "nscannedObjects" : 2,
                    "n" : 1,
                    "timeMicros" : 185
            },
            "ok" : 1
    }
    Please note that filtering does not use an index.

    If you are interested only in a subset of fields, you can use projection (similar to the aggregation framework):

    > db.mail.runCommand("text", {search: "robot", projection: {_id:0, prio:0} } )
    {
            "queryDebugString" : "robot||||||",
            "language" : "english",
            "results" : [
                    {
                            "score" : 6.666666666666666,
                            "obj" : {
                                    "subject" : "Robot leader to minions",
                                    "body" : "Humans suck"
                            }
                    },
                    {
                            "score" : 0.75,
                            "obj" : {
                                    "subject" : "Human leader to minions",
                                    "body" : "Robots suck"
                            }
                    }
            ],
            "stats" : {
                    "nscanned" : 2,
                    "nscannedObjects" : 0,
                    "n" : 2,
                    "timeMicros" : 127
            },
            "ok" : 1
    }
    Filtering and projection can be combined, of course.
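
    As a sketch of such a combination, the options document for the text command can be assembled in one place. buildTextCommand is a hypothetical helper written for this illustration; it only uses the option names that appear in the examples above (search, filter, projection, language) and runs as plain JavaScript without a database.

```javascript
// Hypothetical helper: assemble the options document for the "text" command.
// Only options that were actually given end up in the result.
function buildTextCommand(search, { filter, projection, language } = {}) {
  const command = { search };
  if (filter) command.filter = filter;
  if (projection) command.projection = projection;
  if (language) command.language = language;
  return command;
}

const command = buildTextCommand("robot", {
  filter: { prio: 0 },
  projection: { _id: 0, prio: 0 },
});

console.log(JSON.stringify(command));
// {"search":"robot","filter":{"prio":0},"projection":{"_id":0,"prio":0}}

// In the mongo shell the document would then be passed on as:
//   db.mail.runCommand("text", command)
```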





    With this second part on MongoDB text search we had a look at the more interesting features of the text search capability. For a start, that’s quite a good toolbox for implementing your own search engine. I’m looking forward to your feedback.




