Deduplicating and optimizing data in MySQL

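After changing the primary key of table user_info from uid to an auto-increment id, I forgot to make the old key column uid unique, and as a result rows with duplicate uid values were inserted. The duplicates inserted later therefore have to be cleaned up.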

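The basic approach follows the material appended at the end. However, MySQL does not allow a statement to modify a table while a subquery in the same statement reads from that table, so the rows involved have to be staged in a temporary table first.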
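To make the restriction concrete, here is a minimal sketch, assuming the user_info table with columns id and uid described above. The first statement is the direct form that MySQL rejects. The second wraps the subquery in a derived table, which is a commonly used alternative to creating a separate temporary table; the alias keep_ids is a name made up for this illustration, and the statement has not been run against the data discussed in this post.

```sql
-- Rejected by MySQL: the subquery reads user_info while the DELETE targets it.
-- ERROR 1093: You can't specify target table 'user_info' for update in FROM clause
-- delete from user_info
--  where id not in (select min(id) from user_info group by uid);

-- Workaround sketch: the grouped inner query is evaluated as a derived table,
-- so the DELETE no longer reads user_info directly in its own subquery.
delete from user_info
 where id not in (
       select id from (
              select min(id) as id from user_info group by uid
       ) as keep_ids
 );
```

The explicit temporary tables used in the rest of this post achieve the same thing and have the advantage that they can be indexed.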

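A word up front: when the data set is large, be sure to create indexes on the key columns involved! Without them everything is so painfully slow that you will want to give up.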
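As a small illustration of that advice, the sketch below indexes the column that the later steps group and match on, then asks MySQL how it plans to run the duplicate-finding query. The index name idx_user_info_uid is an assumption made for this example, not something from the original setup.

```sql
-- Index the column used for grouping and matching (index name is illustrative).
create index idx_user_info_uid on user_info (uid);

-- Inspect the plan; the "key" column of the output shows which index,
-- if any, the optimizer chose for this query.
explain
select uid
  from user_info
 group by uid
having count(*) > 1;
```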

1 Duplicates on a single column

Create the temporary tables; uid is the column to deduplicate on
create table tmp_uid as (select uid from user_info group by uid having count(uid) > 1);
create table tmp_id as (select min(id) as id from user_info group by uid having count(uid) > 1);
When the data set is large, be sure to index the temporary tables
create index index_uid on tmp_uid (uid);
create index index_id on tmp_id (id);
Delete the surplus duplicates, keeping the row with the smallest id in each group
delete from user_info where id not in (select id from tmp_id) and uid in (select uid from tmp_uid);
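Once the delete has run, it seems worth checking the result and closing the gap that caused the problem in the first place, namely the missing unique constraint on uid. The following is a sketch under the assumption that the tables are named as above; the index name uniq_uid is made up for illustration.

```sql
-- Should return no rows once the cleanup has succeeded.
select uid, count(*) as copies
  from user_info
 group by uid
having count(*) > 1;

-- Enforce uniqueness from now on; this statement fails if any duplicate uid remains.
alter table user_info add unique index uniq_uid (uid);

-- The helper tables are no longer needed.
drop table tmp_uid;
drop table tmp_id;
```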
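2 Duplicates on multiple columns

The duplicate uid values indirectly caused duplicate rows in the relationship table, so that table has to be deduplicated as well. I first describe the standard procedure, then a more efficient method that I arrived at in practice, based on the characteristics of my own data.

2.1 The general method

Essentially the same as above: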

Create the temporary tables
create table tmp_relation as (select source,target from relationship group by source,target having count(*)>1);
create table tmp_relationship_id as (select min(id) as id from relationship group by source,target having count(*)>1);
Create the index
create index index_id on tmp_relationship_id (id);
Delete
delete from relationship where id not in (select id from tmp_relationship_id) and (source,target) in (select source,target from tmp_relation);
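MySQL does support composite indexes over several columns, so one thing that may be worth trying for the delete above is an index covering (source, target) on both tables. This is only a sketch: the index names are illustrative, it assumes source and target are of an indexable type (integers or short strings), and nothing here establishes whether it makes the general method fast enough on a large table.

```sql
-- Composite indexes over both columns (index names are illustrative).
create index index_source_target on tmp_relation (source, target);
create index index_rel_source_target on relationship (source, target);

-- Check whether the optimizer actually uses them for the row-value IN.
explain
select id
  from relationship
 where (source, target) in (select source, target from tmp_relation);
```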
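2.2 What worked in practice

In practice, the method above for deleting multi-column duplicates turned out to be extremely slow on large data, unbearably so, because I found no way to get an index to help with the multi-column comparison. After waiting half a day with no response, I decided to find another route.

My estimate was that any given record is duplicated only a few times, typically 2 or 3 copies, and that the duplication counts are tightly clustered. So one can simply delete the row with the largest id in each duplicate group and repeat until no duplicates remain; the surviving row is then automatically the one that had the smallest id in its group.

The rough procedure is: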

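1) Select the row with the largest id in each duplicate group
create table tmp_relation_id2 as (select max(id) as id from relationship group by source,target having count(*)>1);
2) Create the index (it is dropped together with the table in step 4, so it has to be recreated each round)
create index index_id on tmp_relation_id2 (id);
3) Delete the row with the largest id in each duplicate group
delete from relationship where id in (select id from tmp_relation_id2);
4) Drop the temporary table
drop table tmp_relation_id2;
Repeat steps 1), 2), 3) and 4) until the temporary table created in step 1) is empty (this is fairly efficient when the duplication counts are low)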
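The repetition can be automated. The sketch below wraps the four steps in a stored procedure that loops until a round deletes nothing. The procedure name dedupe_relationship is hypothetical, and DELIMITER is a command of the mysql command-line client rather than SQL, so other clients may need a different way to submit the procedure body. Treat it as a starting point, not a tested script.

```sql
DELIMITER //

CREATE PROCEDURE dedupe_relationship()
BEGIN
  -- Rows deleted in the previous round; start at 1 so the loop runs at least once.
  DECLARE removed INT DEFAULT 1;

  WHILE removed > 0 DO
    DROP TABLE IF EXISTS tmp_relation_id2;

    -- Step 1: the largest id in every group that still has duplicates.
    CREATE TABLE tmp_relation_id2 AS
      SELECT MAX(id) AS id
        FROM relationship
       GROUP BY source, target
      HAVING COUNT(*) > 1;

    -- Step 2: index the helper table.
    CREATE INDEX index_id ON tmp_relation_id2 (id);

    -- Step 3: delete those rows and record how many were removed.
    DELETE FROM relationship
     WHERE id IN (SELECT id FROM tmp_relation_id2);
    SET removed = ROW_COUNT();
  END WHILE;

  -- Step 4: clean up the (now empty) helper table.
  DROP TABLE IF EXISTS tmp_relation_id2;
END //

DELIMITER ;

CALL dedupe_relationship();
```

Each round removes one row per duplicate group, so the number of rounds equals the largest duplication count minus one, plus a final round that finds nothing to delete.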

Reference:

Methods for finding and deleting duplicate records http://wenku.baidu.com/view/d4b3f134b90d6c85ec3ac6ad.html

Methods for finding and deleting duplicate records

(I)
1. Find the duplicate records in a table, where duplicates are determined by a single column (peopleId)
select * from people where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1)

2. Delete the surplus duplicate records in a table, where duplicates are determined by a single column (peopleId), keeping only the record with the smallest rowid
delete from people where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1) and rowid not in (select min(rowid) from people group by peopleId having count(peopleId )>1)

3. Find the duplicate records in a table (multiple columns)
select * from vitae a where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1)

4. Delete the surplus duplicate records in a table (multiple columns), keeping only the record with the smallest rowid
delete from vitae a where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1) and rowid not in (select min(rowid) from vitae group by peopleId,seq having count(*)>1)

Original article: https://www.cnblogs.com/rainduck/p/3079868.html
