记得这是曾经面阿里的一个面试题,数据量亿级别的数据,提升查询效率怎么去处理?
这里不讲什么导ES,导Hbase之类的, 只通过mysql来处理,这时候会用到分区表;
分区表怎么个意思呢? 跟hive里面的partition类似,但是功能没有hive强大,有很多限制,mysql 5.01后开始支持,创建分区表后,使用show create table查看,可以看到限制,5.01后才开始支持;
在业务里面很少用到分区表,这里简单记录一下使用:
创建分区表:
插入数据,分布有各个时间段的,每个时间段2条记录;
这时通过explain去查询,查询的时候已经用到分区了,扫描的不是全表的数据:
查看分区的数据:
mysql> SELECT PARTITION_NAME,TABLE_ROWS FROM INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_NAME = 'by_year' and TABLE_SCHEMA='db_name';
+----------------+------------+
| PARTITION_NAME | TABLE_ROWS |
+----------------+------------+
| P1 | 2 |
| P2 | 2 |
| P3 | 2 |
| P4 | 4 |
+----------------+------------+
4 rows in set (0.00 sec)
实践
刚好最近有个任务表需要进行分区,每天的任务量有几十万,进行查询的时候基本上都会带着时间字段,想把表修改成分区表,按天进行分区;
直接修改表为分区表(如果数据较大的话,需要等待很长时间,我这一千多万的数据,花了十几分钟,业务高峰慎重。。。,我的线上操作是历史数据可以删掉一些不用的,然后再进行修改分区表操作)
ALTER TABLE tasks_table PARTITION BY RANGE (TO_DAYS(start_time))
(
PARTITION p20180401 VALUES LESS THAN (TO_DAYS('2018-04-01')),
PARTITION p20180402 VALUES LESS THAN (TO_DAYS('2018-04-02')),
PARTITION p20180403 VALUES LESS THAN (TO_DAYS('2018-04-03')),
PARTITION p20180404 VALUES LESS THAN (TO_DAYS('2018-04-04')),
PARTITION p20180405 VALUES LESS THAN (TO_DAYS('2018-04-05')),
PARTITION p20180406 VALUES LESS THAN (TO_DAYS('2018-04-06')),
PARTITION p20180407 VALUES LESS THAN (TO_DAYS('2018-04-07')),
PARTITION p20180408 VALUES LESS THAN (TO_DAYS('2018-04-08')),
PARTITION p20180409 VALUES LESS THAN (TO_DAYS('2018-04-09')),
PARTITION p20180410 VALUES LESS THAN (TO_DAYS('2018-04-10'))
)
出现错误
ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the table's partitioning function
---因为分区键不是主键的其中一个组成部分而报错
alter table tasks_table drop primary key,add primary key(task_id,start_time);
再次执行分区修改,执行完后查看分区,如下:
MariaDB [test_db]> SELECT PARTITION_NAME,TABLE_ROWS FROM INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_NAME = 'tasks_table' and TABLE_SCHEMA='test_db';
+----------------+------------+
| PARTITION_NAME | TABLE_ROWS |
+----------------+------------+
| p20180401 | 16746594 |
| p20180402 | 2808 |
| p20180403 | 2808 |
| p20180404 | 2808 |
| p20180405 | 5001 |
| p20180406 | 0 |
| p20180407 | 0 |
| p20180408 | 0 |
| p20180409 | 0 |
| p20180410 | 0 |
+----------------+------------+
10 rows in set (0.00 sec)
接下来就是按天分区了,想让他自动的分区,但是mysql不支持自动分区,只能另想它招:
1.通过脚本生成一堆add partition的语句,这种操作比较憨厚,而且如果时间长后,有可能忘记添加分区了;
2.通过mysql的存储过程加events操作;
创建存储过程
DELIMITER $$
USE `root`$$
DROP PROCEDURE IF EXISTS `create_Partition_tasks`$$
CREATE DEFINER=`root`@`%` PROCEDURE `create_Partition_tasks`()
BEGIN
/* 事务回滚,其实放这里没什么作用,ALTER TABLE是隐式提交,回滚不了的。*/
DECLARE EXIT HANDLER FOR SQLEXCEPTION ROLLBACK;
START TRANSACTION;
/* 到系统表查出这个表的最大分区,得到最大分区的日期。在创建分区的时候,名称就以日期格式存放,方便后面维护 */
SELECT REPLACE(partition_name,'p','') INTO @P12_Name FROM INFORMATION_SCHEMA.PARTITIONS
WHERE table_name='tasks_table' and TABLE_SCHEMA='test_db' ORDER BY partition_ordinal_position DESC LIMIT 1;
SET @Max_date= DATE(DATE_ADD(@P12_Name+0, INTERVAL 1 DAY))+0;
/* 修改表,在最大分区的后面增加一个分区,时间范围加1天 */
SET @s1=CONCAT('ALTER TABLE tasks_table ADD PARTITION (PARTITION p',@Max_date,' VALUES LESS THAN (TO_DAYS (''',DATE(@Max_date),''')))');
/* 输出查看增加分区语句*/
SELECT @s1;
PREPARE stmt2 FROM @s1;
EXECUTE stmt2;
DEALLOCATE PREPARE stmt2;
/* 取出最小的分区的名称,并删除掉 。
注意:删除分区会同时删除分区内的数据,慎重 */
/*select partition_name into @P0_Name from INFORMATION_SCHEMA.PARTITIONS
where table_name='tasks_table' order by partition_ordinal_position limit 1;
SET @s=concat('ALTER TABLE tasks_table DROP PARTITION ',@P0_Name);
PREPARE stmt1 FROM @s;
EXECUTE stmt1;
DEALLOCATE PREPARE stmt1; */
/* 提交 */
COMMIT ;
END$$
DELIMITER ;
创建事件event
DELIMITER ||
CREATE EVENT Partition_tasks_event
ON SCHEDULE
EVERY 1 day STARTS '2018-04-10 12:25:59'
DO
BEGIN
CALL create_Partition_tasks;
END ||
DELIMITER ;
这时候有一点需要注意,就是数据库的event_scheduler需要打开
通过命令
show variables like 'event_scheduler%' 查看是否打开;
如果没有,通过下面命令打开
set global event_scheduler = ON;
大功告成,查询的效率会大大大的提升;
检验一下:
MariaDB [test_db]> explain partitions select * from task_table where start_time<'2018-04-05' and start_time>'2018-04-03';
+------+-------------+-------+---------------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+---------------------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | task_table | p20180404,p20180405 | ALL | NULL | NULL | NULL | NULL | 7809 | Using where |
+------+-------------+-------+---------------------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)