如何快速更新数据库中的百万条数据
环境:ubuntu 14.04,MySQL 5.6,Python 2.7。
本文使用python的MySQLdb这个库,MySQL Python tutorial这个教程很赞。
MySQLDBdb处理INSERT、UPDATE、DELETE时需要显式得commit()才能使更新有效,这也意味着commit之前的更新语句是放在一个事务中的。
For databases that support transactions, the Python interface silently starts a transaction when the cursor is created. The commit() method commits the updates made using that cursor, and the rollback() method discards them. Each method starts a new transaction.
对比如下:
更新方式 | 总时间(ms) | rows | 平均(ms) |
---|---|---|---|
1 | 57831 | 1000 | 57.8 |
2 | 89695 | 10 | 896 |
3 | 92941 | 40×10000 | 0.23 |
4 | 47956 | 40×10000 | 0.119 |
5 | 28407 | 40×10000 | 0.071 |
6 | 3272 | 48×10000 | 0.008 |
方式1是单线程,where使用了索引,一个更新对应一个commit。
方式2是单线程,where没有使用索引,所有更新对应一个commit。
方式3是单线程,where使用了索引,所有更新对应一个commit。
方式4使用了2个进程来更新,where使用了索引,每个进程里的所有更新对应一个commit。
方式5使用了4个进程来更新,where使用了索引,每个进程里的所有更新对应一个commit。
方式6是在一个update语句里面更新所有的数据。
建立表
mysql> CREATE TABLE `test`.`number` (
`id` INT NOT NULL COMMENT '',
`num` INT NULL COMMENT '',
PRIMARY KEY (`id`) COMMENT '');
mysql> show create table number G
*************************** 1. row ***************************
Table: number
Create Table: CREATE TABLE `number` (
`id` int(11) NOT NULL,
`num` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
使用mysqldb插入数据(100*000条)
# coding: UTF-8
__author__ = 'letian'
import MySQLdb
import time
host='127.0.0.1'
user='root'
passwd='123456'
db='test'
port = 3306
start_time = time.time()
conn = MySQLdb.connect(host, user, passwd, db, port)
cur = conn.cursor()
for x in xrange(100*10000): # 0 到 999999
print x
sql = "INSERT INTO number VALUES(%s,%s)" % (x, x)
cur.execute(sql)
conn.commit()
end_time = time.time()
print '用时:', end_time - start_time
运行结果:
...
999997
999998
999999
用时:219.332565069
挺快。
看下数据:
mysql> select count(*) from number;
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
1 row in set (0.60 sec)
mysql> select * from number limit 3;
+----+------+
| id | num |
+----+------+
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
+----+------+
3 rows in set (0.00 sec)
更新方式1:单线程,使用索引,多个commmit,更新1000条
# coding: UTF-8
__author__ = 'letian'
import MySQLdb
import time
host='127.0.0.1'
user='root'
passwd='123456'
db='test'
port = 3306
start_time = time.time()
conn = MySQLdb.connect(host, user, passwd, db, port)
cur = conn.cursor()
for x in xrange(1000):
print x
sql = "UPDATE number SET num=%s where id=%s" % (x+1, x)
cur.execute(sql)
conn.commit()
cur.close()
conn.close()
end_time = time.time()
print '用时:', end_time - start_time
输出:
...
997
998
999
用时:57.8318688869
1000条用时约60秒。
为什么慢?
先explain一下:
mysql> explain update number set num=3 where id=2 G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: number
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: const
rows: 1
Extra: Using where
1 row in set (0.06 sec)
sql语句没什么问题。
再看一下磁盘IO:
letian $ iostat -d -k 1 100
Linux 3.13.0-37-generic (myhost) 06/22/2015 _x86_64_ (4 CPU)
............
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 57.00 4.00 468.00 4 468
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 71.00 28.00 404.00 28 404
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 71.00 28.00 608.00 28 608
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 61.00 4.00 492.00 4 492
............
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 73.00 0.00 536.00 0 536
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 69.00 4.00 516.00 4 516
............
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 61.00 4.00 472.00 4 472
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 63.00 4.00 452.00 4 452
............
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 16.00 0.00 1804.00 0 1804
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 3.00 0.00 16.00 0 16
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 0.00 0.00 0.00 0 0
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 2.00 0.00 12.00 0 12
tps
代表每秒IO传输的次数,与下面的更新方式4
对比一下,可以看出当前的更新方式IO较多。(为什么呢?移步Why does TRANSACTION / COMMIT improve performance so much with PHP/MySQL (InnoDB)?)
更新方式2,单线程,不使用索引,一个commit,更新100条
# coding: UTF-8
__author__ = 'letian'
import MySQLdb
import time
host='127.0.0.1'
user='root'
passwd='123456'
db='test'
port = 3306
start_time = time.time()
conn = MySQLdb.connect(host, user, passwd, db, port)
cur = conn.cursor()
for x in xrange(100):
print x
sql = "UPDATE number SET num=%s where num=%s" % (x*2, x)
cur.execute(sql)
conn.commit()
cur.close()
conn.close()
end_time = time.time()
print '用时:', end_time - start_time
输出:
...
97
98
99
用时: 89.695207119
为什么这么慢:
mysql> explain update number set num=4 where num=2 G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: number
type: index
possible_keys: NULL
key: PRIMARY
key_len: 4
ref: NULL
rows: 979800
Extra: Using where
1 row in set (0.05 sec)
可以看到,每次update几乎要扫全表(rows: 979800)。
更新方式3:单线程,使用索引,一个commit,更新40×10000条
# coding: UTF-8
__author__ = 'letian'
import MySQLdb
import time
host='127.0.0.1'
user='root'
passwd='123456'
db='test'
port = 3306
start_time = time.time()
conn = MySQLdb.connect(host, user, passwd, db, port)
cur = conn.cursor()
for x in xrange(40*10000): # 0 到 999999
print x
sql = "UPDATE number SET num=%s where id=%s" % (x+1, x)
cur.execute(sql)
conn.commit()
end_time = time.time()
print '用时:', end_time - start_time
输出:
...
399997
399998
399999
用时:92.9413559437
查看磁盘IO:
letian $ iostat -d -k 1 100
Linux 3.13.0-37-generic (myhost) 06/22/2015 _x86_64_ (4 CPU)
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 20.88 288.50 283.18 2458129 2412772
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 18.00 116.00 328.00 116 328
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 17.00 8.00 548.00 8 548
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 11.00 12.00 396.00 12 396
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 10.00 4.00 488.00 4 488
............
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 10.00 4.00 460.00 4 460
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 10.00 4.00 448.00 4 448
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 10.00 4.00 464.00 4 464
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 10.00 4.00 456.00 4 456
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 9.00 4.00 452.00 4 452
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 9.00 0.00 140.00 0 140
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 15.00 4.00 2448.00 4 2448
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 43.00 144.00 2544.00 144 2544
............
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 9.00 0.00 516.00 0 516
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 10.00 4.00 512.00 4 512
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 12.00 4.00 540.00 4 540
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 14.00 4.00 2288.00 4 2288
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 10.00 4.00 528.00 4 528
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 6.00 0.00 184.00 0 184
............
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 14.00 0.00 3868.00 0 3868
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 15.00 0.00 2604.00 0 2604
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 15.00 0.00 1516.00 0 1516
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 3.00 0.00 16.00 0 16
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 12.00 0.00 68.00 0 68
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 4.00 0.00 28.00 0 28
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 0.00 0.00 0.00 0 0
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 0.00 0.00 0.00 0 0
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 0.00 0.00 0.00 0 0
更新方式4:2个进程,一个进程里一个commit,共更新40*10000条
两个进程:
# coding: UTF-8
__author__ = 'letian'
import MySQLdb
import time
import multiprocessing
host='127.0.0.1'
user='root'
passwd='123456'
db='test'
port = 3306
start_time = time.time()
def update(int_iter):
conn = MySQLdb.connect(host, user, passwd, db, port)
cur = conn.cursor()
for x in int_iter:
print x
sql = "UPDATE number SET num=%s where id=%s" % (x+5, x)
cur.execute(sql)
conn.commit()
cur.close()
conn.close()
pool = multiprocessing.Pool(processes = 2)
pool.apply_async(update, (xrange(20*10000), ))
pool.apply_async(update, (xrange(20*10000, 40*10000), ))
pool.close()
pool.join()
end_time = time.time()
print '用时:', end_time - start_time
输出:
...
199998
199999
用时: 47.9561839104
看下数据:
mysql> select * from number where id=1;
+----+------+
| id | num |
+----+------+
| 1 | 6 |
+----+------+
1 row in set (0.00 sec)
mysql> select * from number where id=399999;
+--------+--------+
| id | num |
+--------+--------+
| 399999 | 400004 |
+--------+--------+
1 row in set (0.00 sec)
更新方式5:4个进程,一个进程里一个commit,共更新40*10000条
# coding: UTF-8
__author__ = 'letian'
import MySQLdb
import time
import multiprocessing
host='127.0.0.1'
user='root'
passwd='123456'
db='test'
port = 3306
start_time = time.time()
def update(int_iter):
conn = MySQLdb.connect(host, user, passwd, db, port)
cur = conn.cursor()
for x in int_iter:
print x
sql = "UPDATE number SET num=%s where id=%s" % (x+10, x)
cur.execute(sql)
conn.commit()
cur.close()
conn.close()
pool = multiprocessing.Pool(processes = 4)
pool.apply_async(update, (xrange(10*10000), ))
pool.apply_async(update, (xrange(10*10000, 20*10000), ))
pool.apply_async(update, (xrange(20*10000, 30*10000), ))
pool.apply_async(update, (xrange(30*10000, 40*10000), ))
pool.close()
pool.join()
end_time = time.time()
print '用时:', end_time - start_time
运行结果:
299996
299997
299998
299999
用时: 28.4070010185
查看更新效果:
mysql> select * from number where id=1;
+----+------+
| id | num |
+----+------+
| 1 | 11 |
+----+------+
1 row in set (0.00 sec)
mysql> select * from number where id=399999;
+--------+--------+
| id | num |
+--------+--------+
| 399999 | 400009 |
+--------+--------+
1 row in set (0.00 sec)
更新方式6:一次更新多行,更新40*10000条
# coding: UTF-8
__author__ = 'letian'
import MySQLdb
import time
host='127.0.0.1'
user='root'
passwd='123456'
db='test'
port = 3306
start_time = time.time()
conn = MySQLdb.connect(host, user, passwd, db, port)
cur = conn.cursor()
sql = "UPDATE number SET num=id+100 where id<%s" % (40*10000, )
cur.execute(sql)
conn.commit()
cur.close()
conn.close()
end_time = time.time()
print '用时:', end_time - start_time
输出:
用时: 3.27281808853
查看更新效果:
mysql> select * from number where id=1;
+----+------+
| id | num |
+----+------+
| 1 | 101 |
+----+------+
1 row in set (0.00 sec)
mysql> select * from number where id=399999;
+--------+--------+
| id | num |
+--------+--------+
| 399999 | 400099 |
+--------+--------+
1 row in set (0.00 sec)
分析使用的sql语句:
mysql> explain UPDATE number SET num=id+100 where id<400000 G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: number
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: const
rows: 489900
Extra: Using where
1 row in set (0.00 sec)
这种方式非常快,灵活性不够。
如果遇到这样一个场景:将id为0、1、2的num置为2,其他全部置为4,可以先全部置为4,然后在更新id为0、1、2的记录,这样速度会很快。