zoukankan html css js c++ java

查询中使用全文索引

上一片博文说明了全文索引的原理以及一些参数设置及如何创建全文索引。

MySQL数据库支持全文索引的查询，其语法如下：

MATCH （col1, col2,...） AGAINST (expr  [serarch_modifier])

serarch_modifier: 
{
　　IN NATURAL LANGUAGE MODE |
   IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION  |
   IN BOOLEAN MODE |
   WITH QUERY EXPANSION 
}

MySQL数据库通过mathc()...against()语法支持全文检索的查询，match指定了需要查询的列，against指定了使用何种方法进行查询。

NATURAL LANGUAGE

全文检索通过match函数进行查询，默认采用natural language模式，其表示查询带有指定word的文档。

在上一片博客中，创建了一个表，以及在表中插入了数据，并且创建了全文索引，表中内容如下：

mysql> select * from tb3;
+------------+-------------------------------------+
| FTS_DOC_ID | body                                |
+------------+-------------------------------------+
|          1 | pLease porridge in the pot          |
|          2 | please say sorry                    |
|          4 | some like it hot, some like it cold |
|          5 | i like coding                       |
|          6 | fuck the company                    |
+------------+-------------------------------------+
5 rows in set (0.00 sec)

我们查询body中含有“please”单词的记录。

#不使用全文索引时的情况如下
mysql> select * from tb3 where body like "%please%";
+------------+----------------------------+
| FTS_DOC_ID | body                       |
+------------+----------------------------+
|          1 | pLease porridge in the pot |
|          2 | please say sorry           |
+------------+----------------------------+
2 rows in set (0.00 sec)

#查看执行计划
mysql> explain select * from tb3 where body like "%please%";
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | tb3   | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    6 |    16.67 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.03 sec)

mysql>


#使用全文索引
mysql> select * from tb3 where match(body) against("please" in natural language mode);
+------------+----------------------------+
| FTS_DOC_ID | body                       |
+------------+----------------------------+
|          1 | pLease porridge in the pot |
|          2 | please say sorry           |
+------------+----------------------------+
2 rows in set (0.03 sec)

mysql> explain select * from tb3 where match(body) against("please" in natural language mode);
+----+-------------+-------+------------+----------+---------------+----------+---------+-------+------+----------+-------------------------------+
| id | select_type | table | partitions | type     | possible_keys | key      | key_len | ref   | rows | filtered | Extra                         |
+----+-------------+-------+------------+----------+---------------+----------+---------+-------+------+----------+-------------------------------+
|  1 | SIMPLE      | tb3   | NULL       | fulltext | ft_index      | ft_index | 0       | const |    1 |   100.00 | Using where; Ft_hints: sorted |
+----+-------------+-------+------------+----------+---------------+----------+---------+-------+------+----------+-------------------------------+

由两次执行计划可以看到，使用全文索引过滤率为100%，仅扫描了一行；而不使用全文索引的时候过滤率仅为16.67%，并且是全表扫描。

在where条件中使用mathc函数，查询返回的结果是根据相关性进行降序排序的，即相关性最高的结果放在第一位。相关性的值是一个非负的浮点数，0表示没有任何相关性。根据MySQL官方文档可知，其相关性的计算依据是以下4个条件：

word是否在文档中出现。
word在文档中出现的次数。
word在索引列中的数量。
多少个文档包含该word。

mysql> insert into tb3 select NULL,"please please please";        #插入一条数据
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0
mysql> select * from tb3 where match(body) against("please" in natural language mode);  #
+------------+----------------------------+
| FTS_DOC_ID | body                       |
+------------+----------------------------+
|          8 | please please please       |            #文档8的相关性比较高，因此第一个显示。
|          1 | pLease porridge in the pot |
|          2 | please say sorry           |
+------------+----------------------------+
3 rows in set (0.00 sec)


#查看匹配的结果总数。
mysql> select count(*) from tb3 where match(body) against("please" in natural language mode);
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT 
    -> COUNT(IF(MATCH (body) AGAINST ("please" IN NATURAL LANGUAGE MODE),1,NULL))
    -> AS count 
    -> from tb3;
+-------+
| count |
+-------+
|     3 |
+-------+
1 row in set (0.00 sec)

#这两条语句中，第二条语句的执行效率比较高，是因为在第一条语句中还需要进行相关性的排序统计，而在第二局SQL中不需要。

此外还可以使用SQL语句查看对应的相关性：

mysql> select fts_doc_id, body, match(body) against("please" in natural language mode) as relevence from tb3;
+------------+-------------------------------------+---------------------+
| fts_doc_id | body                                | relevence           |
+------------+-------------------------------------+---------------------+
|          1 | pLease porridge in the pot          | 0.13540691137313843 |
|          2 | please say sorry                    | 0.13540691137313843 |
|          4 | some like it hot, some like it cold |                   0 |
|          5 | i like coding                       |                   0 |
|          6 | fuck the company                    |                   0 |
|          8 | please please please                |  0.4062207341194153 |
+------------+-------------------------------------+---------------------+
6 rows in set (0.01 sec)

对于innodb存储引擎的全文索引中，还需要考虑一下因素。

查询的word字段在stopword列中，忽略该字符串的查询。
查询的word字符长度是否在区间[innodb_ft_min_token_size, innodb_ft_max_token_size]之间。

mysql> show variables like "innodb_ft_min%";
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_ft_min_token_size | 3     |
+--------------------------+-------+
1 row in set (0.01 sec)

mysql> show variables like "innodb_ft_max%";
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_ft_max_token_size | 84    |
+--------------------------+-------+
1 row in set (0.00 sec)

#这两个参数用于控制innodb存储引擎查询字符的长度，当长度小于innodb_ft_min_token_size，或者长度大于innodb_ft_max_token_size时，会忽略该次的搜索。

Boolean

MySQL数据库允许使用IN BOOLEAN MODE修饰符来进行全文检索。当使用该修饰符时，查询字符串的前后字符会有特殊的含义。

mysql> select * from tb3 where match(body) against ("+like -hot" in boolean mode);
+------------+---------------+
| FTS_DOC_ID | body          |
+------------+---------------+
|          5 | i like coding |
+------------+---------------+
1 row in set (0.00 sec)

#上面这个要求查询含有like但是没有hot字符的文档。

Boolean全文索引支持以下几种操作：

+：表示该word必须存在
-：表示该word必须被排除
(no operator): 表示该word是可选的，但是如果出现，其相关性会更高。
@distance表示查询的多个单词之间的距离是否在distance之内，distance的单位是字节。这种全文素银的查询也称为Proximity Search.
>表示出现该单词增加相关性
<表示出现该单词降低相关性
*表示以0个或多个字符。
“表示短语

#表示有please或者有hot的文档
mysql> select * from tb3 where match (body) against ("please hot" in boolean mode);
+------------+-------------------------------------+
| FTS_DOC_ID | body                                |
+------------+-------------------------------------+
|          4 | some like it hot, some like it cold |
|          8 | please please please                |
|          1 | pLease porridge in the pot          |
|          2 | please say sorry                    |
+------------+-------------------------------------+
4 rows in set (0.00 sec)

#使用distance，进行proximity search;
#please与sorry之间的距离在5个字符之内
mysql> select * from tb3 where match(body) against ('"please sorry" @5' in boolean mode);
+------------+------------------+
| FTS_DOC_ID | body             |
+------------+------------------+
|          2 | please say sorry |
+------------+------------------+
1 row in set (0.00 sec)

#please与sorry之间的距离在2个字符之内，查询结果为空
mysql> select * from tb3 where match(body) against ('"please sorry" @2' in boolean mode);
Empty set (0.00 sec)

#查询是否有单词please，或者hot进行相关性统计
mysql> select * from tb3 where match(body) against ('please >hot' in boolean mode);
+------------+-------------------------------------+
| FTS_DOC_ID | body                                |
+------------+-------------------------------------+
|          4 | some like it hot, some like it cold |
|          8 | please please please                |
|          1 | pLease porridge in the pot          |
|          2 | please say sorry                    |
+------------+-------------------------------------+
4 rows in set (0.00 sec)

#相关性有负数存在的，
mysql> select fts_doc_id, body, match(body) against("please >say <pot" in boolean mode) as relevence from tb3;
+------------+-------------------------------------+----------------------+
| fts_doc_id | body                                | relevence            |
+------------+-------------------------------------+----------------------+
|          1 | pLease porridge in the pot          | -0.15040236711502075 |
|          2 | please say sorry                    |    1.849597692489624 |
|          4 | some like it hot, some like it cold |                    0 |
|          5 | i like coding                       |                    0 |
|          6 | fuck the company                    |                    0 |
|          8 | please please please                |   0.4062207341194153 |
+------------+-------------------------------------+----------------------+
6 rows in set (0.00 sec)

mysql> 

#进行*号匹配
mysql> select * from tb3 where match(body) against ('so*' in boolean mode);
+------------+-------------------------------------+
| FTS_DOC_ID | body                                |
+------------+-------------------------------------+
|          4 | some like it hot, some like it cold |
|          2 | please say sorry                    |
+------------+-------------------------------------+
2 rows in set (0.01 sec)

mysql>

query expansion

MySQL数据库还支持全文索引的扩展查询。这种查询通常是在查询的关键词太短，用户需要implied knowledge(隐含知识)时进行。例如，对于单词database的查询，用户可能希望查询不仅仅包含database文档，可能还指那些包含MySQL，oracle，DB2，RDBMS的单词，而这时可以使用query expansion模式开启全文检索的implied knowledge.

通常在查询短语中添加WITH QUERY EXPANSION或IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION也可以开启blind query expansion.该查询分为两个阶段。

第一阶段：根据搜索的单词进行全文索引查询。
第二阶段：根据第一阶段产生的分词再进行一次全文检索查询。

#创建如下表
CREATE TABLE articles (
id int UNSIGNED auto_increment not null primary key,
title varchar(60),
body text,
FULLTEXT  ft_index(title, body)
)

#插入数据
INSERT INTO articles(title, body) VALUES
("mysql tutorial", "DBMS stands for DATABASE...."),
("How to use mysql well","After you went through a ..."),
("optimizing mysql","in this tutorial we will SHOW...."),
("1001 mysql tricks","1. Never run mysqld as root..."),
("mysql VS pgsql","In the following database comparison..."),
("mysql security","When configured properly....,mysql..."),
("tunning DB2", "for IBM database"),
("IBM History","DB2 history for IBM...");

mysql> select * from articles;   #查询数据
+----+-----------------------+-----------------------------------------+
| id | title                 | body                                    |
+----+-----------------------+-----------------------------------------+
| 17 | mysql tutorial        | DBMS stands for DATABASE....            |
| 18 | How to use mysql well | After you went through a ...            |
| 19 | optimizing mysql      | in this tutorial we will SHOW....       |
| 20 | 1001 mysql tricks     | 1. Never run mysqld as root...          |
| 21 | mysql VS pgsql        | In the following database comparison... |
| 22 | mysql security        | When configured properly....,mysql...   |
| 23 | tunning DB2           | for IBM database                        |
| 24 | IBM History           | DB2 history for IBM...                  |
+----+-----------------------+-----------------------------------------+
8 rows in set (0.00 sec)

在这个例子中，并没有显示创建FTS_DOC_ID列，因此innodb存储引擎会自动创建该列，并添加唯一索引（都是隐藏的）。此外，又创建了列title和body的联合索引。

mysql> select * from articles where match(title, body) against("database" );   #默认是使用in natural language mode!
+----+----------------+-----------------------------------------+
| id | title          | body                                    |
+----+----------------+-----------------------------------------+
| 17 | mysql tutorial | DBMS stands for DATABASE....            |
| 21 | mysql VS pgsql | In the following database comparison... |
| 23 | tunning DB2    | for IBM database                        |
+----+----------------+-----------------------------------------+
3 rows in set (0.00 sec)

mysql> select * from articles where match(title, body) against("database" with query expansion);
+----+-----------------------+-----------------------------------------+
| id | title                 | body                                    |
+----+-----------------------+-----------------------------------------+
| 21 | mysql VS pgsql        | In the following database comparison... |
| 17 | mysql tutorial        | DBMS stands for DATABASE....            |
| 23 | tunning DB2           | for IBM database                        |
| 24 | IBM History           | DB2 history for IBM...                  |
| 19 | optimizing mysql      | in this tutorial we will SHOW....       |
| 22 | mysql security        | When configured properly....,mysql...   |
| 18 | How to use mysql well | After you went through a ...            |
| 20 | 1001 mysql tricks     | 1. Never run mysqld as root...          |
+----+-----------------------+-----------------------------------------+
8 rows in set (0.01 sec)

因为query expansion会带来许多非相关性的查询，因此使用的时候，需要特别注意。

查看全文

相关阅读:
计算机网络学习目录
 手把手教你玩微信小程序跳一跳
 （三）python函数式编程
 跟托福说分手
 （二）python高级特性
 BitCoin工作原理
 反向传播的工作原理（深度学习第三章）
1.22计划
 梯度下降——神经网络如何学习？（深度学习第二章）
什么是神经网络（深度学习第一章）？

原文地址：https://www.cnblogs.com/wxzhe/p/10233973.html