zoukankan      html  css  js  c++  java
  • MariaDB删除重复记录

    不管是程序BUG,还是业务变更,重复数据这个老生常谈的问题,总是会出现。以下是我在MariaDB或是MySQL下处理的一些经验。在SQL Server中,使用窗口函数是很容易实现的。不过听说MySQL 8.0和MariaDB 10.2以上均支持窗口函数了。等有机会再来测试使用窗口函数来删除重复记录。

    背景

    表t_record中的数据fromUserId, toUserId两个字段组合作为唯一的标识,删除重复记录,只留下最大id(或最新时间)的记录。id为自增无重复的主键。

    表t_record的id作为自增的主键。

    表t_record大概有6万多的数据。以下测试均在资源很差的主机上,t_record没有在使用的情况下的结果。

    方法1

    查询重复的记录

    SELECT fromUserId, toUserId, count(*)
    FROM t_record as tr
    GROUP BY fromUserId, toUserId
    HAVING count(*) > 1;
    

    把重复记录的两个字段放到临时表_tmp1中

    CREATE TABLE _tmp1
    SELECT fromUserId, toUserId
    FROM t_record as tr
    GROUP BY fromUserId, toUserId
    HAVING count(*) > 1;
    

    把应该删除的id查询出来,放到临时表_tmp2中

    CREATE TABLE _tmp2
    SELECT id
    FROM t_record as a
    WHERE (a.fromUserId, a.toUserId) in (
    
    SELECT fromUserId, toUserId from _tmp1
    )
    and a.id not in (
    SELECT MAX(id)
    FROM t_record as tr
    GROUP BY fromUserId, toUserId
    HAVING count(*) > 1
    );
    

    删除原表的记录

    DELETE from t_record
    where id in (
    	SELECT id from _tmp2
    );
    

    通过以前3个步骤,没有删除数据大概需要23秒左右。

    方法2

    如果表中没有主键,也没有可以标识唯一记录的字段。只能是把原表的数据分表后,插入到另一张临时表,删除原表数据,把临时表的数据导回来。

    这种方法也适用合于有主键或有唯一标识的表,但操作过程中会影响在线的业务,需要中断业务。否则可能会造成数据丢失或数据不一致。
    数据量大的表,导两次数据,过程会很慢,同时也需要注意硬盘空间是否足够。

    方法3

    测试mysql不支持以下这种delete语法来删除数据。改为select id 存到临时表,查询非常慢。

    DELETE a 
    FROM table_nam a
    WHERE EXISTS (SELECT 1 FROM table_nam b
    				 WHERE b.userid = a.userid AND b.CreateDate > a.CreateDate);
    

    方法4

    在mariadb 10.1.19下测试,60多秒。
    这个方法简单,只需要一条语句,速度还行。

    DELETE 
    from t_record 
    where id not in (
    		select maxid from 
    				(select max(id) as  maxid from  t_record
    						group by fromUserId,toUserId				
    				) b
    );
    

    现对方法4的进行改造,再测试下。11.5秒。快好几倍了。

    CREATE OR REPLACE TABLE _tmp3
    SELECT id
    FROM t_record 
    WHERE id NOT IN (
    		SELECT maxid FROM 
    				(SELECT max(id) AS  maxid FROM t_record
    						GROUP BY fromUserId,toUserId				
    				) b
    );
    
    DELETE FROM t_record
    WHERE id IN (
    		SELECT id from _tmp3
    );
    

    对临时表创建主键,再测试下。1.2秒!WOW!!!

    CREATE OR REPLACE TABLE _tmp3 (id INT NOT NULL PRIMARY KEY);
    
    INSERT INTO _tmp3 (id)
    SELECT id
    FROM t_record 
    WHERE id NOT IN (
    		SELECT maxid FROM 
    				(SELECT max(id) AS  maxid FROM t_record
    						GROUP BY fromUserId,toUserId				
    				) b
    );
    
    DELETE FROM t_record 
            WHERE id IN (
                    SELECT id from _tmp3
    );
    

    还以不能再快呢?改IN子句为JOIN,再测试下。1秒!

    CREATE OR REPLACE TABLE _tmp3 (id INT NOT NULL PRIMARY KEY);
    
    INSERT INTO _tmp3 (id)
    SELECT id
    FROM t_record 
    WHERE id NOT IN (
    		SELECT maxid FROM 
    				(SELECT max(id) AS  maxid FROM t_record
    						GROUP BY fromUserId,toUserId				
    				) b
    );
    
    DELETE a FROM t_record as a INNER JOIN _tmp3 as b on b.id = a.id;
    

    难道删除的那个语句的执行计划是不同的吗?由于mysql只支持select的执行查询,所以把删除的语句修改为查询语句。

    root@localhost [db1]EXPLAIN SELECT id FROM t_record WHERE id IN ( SELECT id from _tmp3 );
    +------+-------------+-------------------+--------+---------------+---------+---------+--------------------+------+--------------------------+
    | id   | select_type | table             | type   | possible_keys | key     | key_len | ref                | rows | Extra                    |
    +------+-------------+-------------------+--------+---------------+---------+---------+--------------------+------+--------------------------+
    |    1 | PRIMARY     | _tmp3             | index  | PRIMARY       | PRIMARY | 4       | NULL               |  452 | Using index              |
    |    1 | PRIMARY     | t_record          | eq_ref | PRIMARY       | PRIMARY | 8       | testdb._tmp3.id |    1 | Using where; Using index |
    +------+-------------+-------------------+--------+---------------+---------+---------+--------------------+------+--------------------------+
    2 rows in set (0.00 sec)
    
    root@localhost [db1]EXPLAIN SELECT a.id FROM t_record as a INNER JOIN _tmp3 as b on b.id = a.id;
    +------+-------------+-------+--------+---------------+---------+---------+----------------+------+--------------------------+
    | id   | select_type | table | type   | possible_keys | key     | key_len | ref            | rows | Extra                    |
    +------+-------------+-------+--------+---------------+---------+---------+----------------+------+--------------------------+
    |    1 | SIMPLE      | b     | index  | PRIMARY       | PRIMARY | 4       | NULL           |  452 | Using index              |
    |    1 | SIMPLE      | a     | eq_ref | PRIMARY       | PRIMARY | 8       | testdb.b.id    |    1 | Using where; Using index |
    +------+-------------+-------+--------+---------------+---------+---------+----------------+------+--------------------------+
    2 rows in set (0.01 sec)
    

    对比执行计划,是一样的。但测试多次,用JOIN方式速度还是快些。为什么呢?




  • 相关阅读:
    Struts 2 Learning
    C/C++ Learning
    Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors
    Circular Queue Implementation Principle
    Linux Process Management && Process Scheduling Principle
    duxcms SQL Injection In /admin/module/loginMod.class.php
    oracle:批量插入不同方案对比
    oracle批量插入优化方案
    oracle数据库出现“批处理中出现错误: ORA-00001: 违反唯一约束条件”解决方法
    hive:导出数据记录中null被替换为 的解决方案
  • 原文地址:https://www.cnblogs.com/fishparadise/p/9177161.html
Copyright © 2011-2022 走看看