zoukankan      html  css  js  c++  java
  • 如何快速更新数据库中的百万条数据

    如何快速更新数据库中的百万条数据 

    环境:ubuntu 14.04,MySQL 5.6,Python 2.7。

    本文使用python的MySQLdb这个库,MySQL Python tutorial这个教程很赞。

    MySQLDBdb处理INSERT、UPDATE、DELETE时需要显式得commit()才能使更新有效,这也意味着commit之前的更新语句是放在一个事务中的。

    For databases that support transactions, the Python interface silently starts a transaction when the cursor is created. The commit() method commits the updates made using that cursor, and the rollback() method discards them. Each method starts a new transaction.

    对比如下:

    更新方式总时间(ms)rows平均(ms)
    1 57831 1000 57.8
    2 89695 10 896
    3 92941 40×10000 0.23
    4 47956 40×10000 0.119
    5 28407 40×10000 0.071
    6 3272 48×10000 0.008

    方式1是单线程,where使用了索引,一个更新对应一个commit。
    方式2是单线程,where没有使用索引,所有更新对应一个commit。
    方式3是单线程,where使用了索引,所有更新对应一个commit。
    方式4使用了2个进程来更新,where使用了索引,每个进程里的所有更新对应一个commit。
    方式5使用了4个进程来更新,where使用了索引,每个进程里的所有更新对应一个commit。
    方式6是在一个update语句里面更新所有的数据。

    建立表

    mysql> CREATE TABLE `test`.`number` (
      `id` INT NOT NULL COMMENT '',
      `num` INT NULL COMMENT '',
      PRIMARY KEY (`id`)  COMMENT '');
    
    mysql> show create table number G
    *************************** 1. row ***************************
           Table: number
    Create Table: CREATE TABLE `number` (
      `id` int(11) NOT NULL,
      `num` int(11) DEFAULT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1
    1 row in set (0.00 sec)
    

    使用mysqldb插入数据(100*000条)

    # coding: UTF-8
    __author__ = 'letian'
    
    import MySQLdb
    import time
    
    host='127.0.0.1'
    user='root'
    passwd='123456'
    db='test'
    port = 3306
    
    
    start_time = time.time()
    
    conn = MySQLdb.connect(host, user, passwd, db, port)
    cur = conn.cursor()
    
    for x in xrange(100*10000): # 0 到 999999
        print x
        sql = "INSERT INTO number VALUES(%s,%s)" % (x, x)
        cur.execute(sql)
    
    conn.commit()
    
    end_time = time.time()
    
    print '用时:', end_time - start_time
    

    运行结果:

    ...
    999997
    999998
    999999
    用时:219.332565069
    

    挺快。

    看下数据:

    mysql> select count(*) from number;
    +----------+
    | count(*) |
    +----------+
    |  1000000 |
    +----------+
    1 row in set (0.60 sec)
    
    mysql> select * from number limit 3;
    +----+------+
    | id | num  |
    +----+------+
    |  0 |    0 |
    |  1 |    1 |
    |  2 |    2 |
    +----+------+
    3 rows in set (0.00 sec)
    

    更新方式1:单线程,使用索引,多个commmit,更新1000条

    # coding: UTF-8
    __author__ = 'letian'
    
    import MySQLdb
    import time
    
    host='127.0.0.1'
    user='root'
    passwd='123456'
    db='test'
    port = 3306
    
    
    start_time = time.time()
    
    conn = MySQLdb.connect(host, user, passwd, db, port)
    cur = conn.cursor()
    
    for x in xrange(1000):
        print x
        sql = "UPDATE number SET num=%s where id=%s" % (x+1, x)
        cur.execute(sql)
        conn.commit()
    
    cur.close()
    conn.close()
    
    end_time = time.time()
    
    print '用时:', end_time - start_time
    

    输出:

    ...
    997
    998
    999
    用时:57.8318688869
    

    1000条用时约60秒。

    为什么慢?

    先explain一下:

    mysql> explain update number set num=3 where id=2 G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: number
             type: range
    possible_keys: PRIMARY
              key: PRIMARY
          key_len: 4
              ref: const
             rows: 1
            Extra: Using where
    1 row in set (0.06 sec)
    

    sql语句没什么问题。

    再看一下磁盘IO:

    letian $ iostat -d -k 1 100
    Linux 3.13.0-37-generic (myhost)  06/22/2015  _x86_64_  (4 CPU)
    
    ............
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              57.00         4.00       468.00          4        468
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              71.00        28.00       404.00         28        404
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              71.00        28.00       608.00         28        608
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              61.00         4.00       492.00          4        492
    
    ............
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              73.00         0.00       536.00          0        536
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              69.00         4.00       516.00          4        516
    
    ............
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              61.00         4.00       472.00          4        472
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              63.00         4.00       452.00          4        452
    
    ............
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              16.00         0.00      1804.00          0       1804
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda               3.00         0.00        16.00          0         16
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda               0.00         0.00         0.00          0          0
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda               2.00         0.00        12.00          0         12
    

    tps代表每秒IO传输的次数,与下面的更新方式4对比一下,可以看出当前的更新方式IO较多。(为什么呢?移步Why does TRANSACTION / COMMIT improve performance so much with PHP/MySQL (InnoDB)?

    更新方式2,单线程,不使用索引,一个commit,更新100条

    # coding: UTF-8
    __author__ = 'letian'
    
    import MySQLdb
    import time
    
    host='127.0.0.1'
    user='root'
    passwd='123456'
    db='test'
    port = 3306
    
    
    start_time = time.time()
    
    conn = MySQLdb.connect(host, user, passwd, db, port)
    cur = conn.cursor()
    
    for x in xrange(100):
        print x
        sql = "UPDATE number SET num=%s where num=%s" % (x*2, x)
        cur.execute(sql)
    
    conn.commit()
    cur.close()
    conn.close()
    
    end_time = time.time()
    
    print '用时:', end_time - start_time
    

    输出:

    ...
    97
    98
    99
    用时: 89.695207119
    

    为什么这么慢:

    mysql> explain update number set num=4 where num=2 G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: number
             type: index
    possible_keys: NULL
              key: PRIMARY
          key_len: 4
              ref: NULL
             rows: 979800
            Extra: Using where
    1 row in set (0.05 sec)
    

    可以看到,每次update几乎要扫全表(rows: 979800)。

    更新方式3:单线程,使用索引,一个commit,更新40×10000条

    # coding: UTF-8
    __author__ = 'letian'
    
    import MySQLdb
    import time
    
    host='127.0.0.1'
    user='root'
    passwd='123456'
    db='test'
    port = 3306
    
    start_time = time.time()
    
    conn = MySQLdb.connect(host, user, passwd, db, port)
    cur = conn.cursor()
    
    for x in xrange(40*10000): # 0 到 999999
        print x
        sql = "UPDATE number SET num=%s where id=%s" % (x+1, x)
        cur.execute(sql)
    
    conn.commit()
    
    end_time = time.time()
    
    print '用时:', end_time - start_time
    

    输出:

    ...
    399997
    399998
    399999
    用时:92.9413559437
    

    查看磁盘IO:

    letian $ iostat -d -k 1 100
    Linux 3.13.0-37-generic (myhost)  06/22/2015  _x86_64_  (4 CPU)
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              20.88       288.50       283.18    2458129    2412772
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              18.00       116.00       328.00        116        328
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              17.00         8.00       548.00          8        548
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              11.00        12.00       396.00         12        396
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              10.00         4.00       488.00          4        488
    
    ............
    
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              10.00         4.00       460.00          4        460
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              10.00         4.00       448.00          4        448
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              10.00         4.00       464.00          4        464
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              10.00         4.00       456.00          4        456
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda               9.00         4.00       452.00          4        452
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda               9.00         0.00       140.00          0        140
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              15.00         4.00      2448.00          4       2448
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              43.00       144.00      2544.00        144       2544
    
    ............
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda               9.00         0.00       516.00          0        516
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              10.00         4.00       512.00          4        512
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              12.00         4.00       540.00          4        540
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              14.00         4.00      2288.00          4       2288
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              10.00         4.00       528.00          4        528
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda               6.00         0.00       184.00          0        184
    
    ............
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              14.00         0.00      3868.00          0       3868
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              15.00         0.00      2604.00          0       2604
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              15.00         0.00      1516.00          0       1516
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda               3.00         0.00        16.00          0         16
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda              12.00         0.00        68.00          0         68
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda               4.00         0.00        28.00          0         28
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda               0.00         0.00         0.00          0          0
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda               0.00         0.00         0.00          0          0
    
    Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
    sda               0.00         0.00         0.00          0          0
    

    更新方式4:2个进程,一个进程里一个commit,共更新40*10000条

    两个进程:

    # coding: UTF-8
    __author__ = 'letian'
    
    import MySQLdb
    import time
    import multiprocessing
    
    host='127.0.0.1'
    user='root'
    passwd='123456'
    db='test'
    port = 3306
    
    start_time = time.time()
    
    def update(int_iter):
        conn = MySQLdb.connect(host, user, passwd, db, port)
        cur = conn.cursor()
        for x in int_iter:
            print x
            sql = "UPDATE number SET num=%s where id=%s" % (x+5, x)
            cur.execute(sql)
        conn.commit()
        cur.close()
        conn.close()
    
    pool = multiprocessing.Pool(processes = 2)
    
    pool.apply_async(update, (xrange(20*10000), ))
    pool.apply_async(update, (xrange(20*10000, 40*10000), ))
    pool.close()
    pool.join()
    
    end_time = time.time()
    
    print '用时:', end_time - start_time
    

    输出:

    ...
    199998
    199999
    用时: 47.9561839104
    

    看下数据:

    mysql> select * from number where id=1;
    +----+------+
    | id | num  |
    +----+------+
    |  1 |    6 |
    +----+------+
    1 row in set (0.00 sec)
    
    mysql> select * from number where id=399999;
    +--------+--------+
    | id     | num    |
    +--------+--------+
    | 399999 | 400004 |
    +--------+--------+
    1 row in set (0.00 sec)
    

    更新方式5:4个进程,一个进程里一个commit,共更新40*10000条

    # coding: UTF-8
    __author__ = 'letian'
    
    import MySQLdb
    import time
    import multiprocessing
    
    host='127.0.0.1'
    user='root'
    passwd='123456'
    db='test'
    port = 3306
    
    start_time = time.time()
    
    def update(int_iter):
        conn = MySQLdb.connect(host, user, passwd, db, port)
        cur = conn.cursor()
        for x in int_iter:
            print x
            sql = "UPDATE number SET num=%s where id=%s" % (x+10, x)
            cur.execute(sql)
        conn.commit()
        cur.close()
        conn.close()
    
    pool = multiprocessing.Pool(processes = 4)
    
    pool.apply_async(update, (xrange(10*10000), ))
    pool.apply_async(update, (xrange(10*10000, 20*10000), ))
    pool.apply_async(update, (xrange(20*10000, 30*10000), ))
    pool.apply_async(update, (xrange(30*10000, 40*10000), ))
    
    pool.close()
    pool.join()
    
    end_time = time.time()
    
    print '用时:', end_time - start_time
    

    运行结果:

    299996
    299997
    299998
    299999
    用时: 28.4070010185
    

    查看更新效果:

    mysql> select * from number where id=1;
    +----+------+
    | id | num  |
    +----+------+
    |  1 |   11 |
    +----+------+
    1 row in set (0.00 sec)
    
    mysql> select * from number where id=399999;
    +--------+--------+
    | id     | num    |
    +--------+--------+
    | 399999 | 400009 |
    +--------+--------+
    1 row in set (0.00 sec)
    

    更新方式6:一次更新多行,更新40*10000条

    # coding: UTF-8
    __author__ = 'letian'
    
    import MySQLdb
    import time
    
    host='127.0.0.1'
    user='root'
    passwd='123456'
    db='test'
    port = 3306
    
    start_time = time.time()
    
    conn = MySQLdb.connect(host, user, passwd, db, port)
    cur = conn.cursor()
    
    sql = "UPDATE number SET num=id+100 where id<%s" % (40*10000, )
    cur.execute(sql)
    conn.commit()
    
    cur.close()
    conn.close()
    
    end_time = time.time()
    
    print '用时:', end_time - start_time
    

    输出:

    用时: 3.27281808853
    

    查看更新效果:

    mysql> select * from number where id=1;
    +----+------+
    | id | num  |
    +----+------+
    |  1 |  101 |
    +----+------+
    1 row in set (0.00 sec)
    
    mysql> select * from number where id=399999;
    +--------+--------+
    | id     | num    |
    +--------+--------+
    | 399999 | 400099 |
    +--------+--------+
    1 row in set (0.00 sec)
    

    分析使用的sql语句:

    mysql> explain UPDATE number SET num=id+100 where id<400000 G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: number
             type: range
    possible_keys: PRIMARY
              key: PRIMARY
          key_len: 4
              ref: const
             rows: 489900
            Extra: Using where
    1 row in set (0.00 sec)
    

    这种方式非常快,灵活性不够。

    如果遇到这样一个场景:将id为0、1、2的num置为2,其他全部置为4,可以先全部置为4,然后在更新id为0、1、2的记录,这样速度会很快。

  • 相关阅读:
    使用SQL Server Management Studio操作replication时,要用机器名登录,不要用IP地址
    SQL Server Replication 总结
    SQL Server 全文索引的硬伤(转载)
    Sqlserver2008及以上使用全文索引排除干扰词 (转载)
    如何恢复VS2013代码实时校验功能
    [VS2013]如何闪开安装VS2013必须要有安装IE10的限制
    C# ASP.NET Webservice调用外部exe无效的解决方法
    nstallShield制作打包程序详解(图)
    VS报错之混合模式程序集是针对“v1.1.4322”版的运行时生成的,在没有配置其他信息的情况下,无法在 4.0 运行时中加载该程序集。
    Android之悬浮窗口实现(WindowManager)
  • 原文地址:https://www.cnblogs.com/timssd/p/6339595.html
Copyright © 2011-2022 走看看