  • Reading large files in chunks in Python: in essence, still just what the file object's own read supports

    Reference: https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python

    The most elegant way:

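    file.readlines() takes an optional size hint. It reads whole lines and stops once their combined size reaches roughly that many bytes (characters in text mode), so each call returns a bounded batch of lines rather than the whole file.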

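    BUF_SIZE = 64 * 1024  # size hint per batch; an example value, tune as needed

    with open('bigfilename', 'r') as bigfile:  # closes the file when done
        tmp_lines = bigfile.readlines(BUF_SIZE)
        while tmp_lines:
            process(tmp_lines)  # tmp_lines is already a list of lines
            tmp_lines = bigfile.readlines(BUF_SIZE)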
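
    The loop above can also be wrapped in a generator so that callers simply iterate over batches. This is a minimal sketch: the helper name `iter_line_batches` and the in-memory sample data are assumptions made for illustration, and the exact number of lines per batch depends on how the interpreter applies the hint.

```python
import io

def iter_line_batches(file_object, hint=64 * 1024):
    """Yield lists of whole lines; each list totals roughly `hint` characters."""
    while True:
        lines = file_object.readlines(hint)
        if not lines:  # an empty list means end of file
            break
        yield lines

# Illustrative data: 10 lines of 10 characters each (newline included).
sample = io.StringIO("".join("line %04d\n" % i for i in range(10)))
batches = list(iter_line_batches(sample, hint=25))

# Every line appears exactly once, in order, spread over several batches.
all_lines = [line for batch in batches for line in batch]
```

    Because readlines only ever returns whole lines, no line is split across two batches.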

    Or:

    To write a lazy function, just use yield:

    def read_in_chunks(file_object, chunk_size=1024):
        """Lazy function (generator) to read a file piece by piece.
        Default chunk size: 1k."""
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data
    
    
    with open('really_big_file.dat') as f:  # the with block closes the file
        for piece in read_in_chunks(f):
            process_data(piece)
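
    The same lazy loop can be written with the two-argument form of the built-in iter(), which keeps calling a function until it returns a sentinel value. A minimal sketch, assuming binary mode (so the end-of-file sentinel is b"") and using an in-memory io.BytesIO as a stand-in for a real file; the name `read_in_chunks_iter` is hypothetical:

```python
import io
from functools import partial

def read_in_chunks_iter(file_object, chunk_size=1024):
    """Return an iterator over chunks; it stops when read() returns b'' (EOF)."""
    return iter(partial(file_object.read, chunk_size), b"")

# Stand-in for a file opened with open(path, 'rb').
data = io.BytesIO(b"x" * 2500)
sizes = [len(piece) for piece in read_in_chunks_iter(data)]
# sizes is [1024, 1024, 452]: two full chunks and the remainder
```

    For a file opened in text mode the sentinel would be the empty string "" instead.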
    

     

    Read a file in chunks in Python

    This article is just to demonstrate how to read a file in chunks rather than all at once.

    This is useful in a number of cases, such as chunked uploading or encryption, or when the file you want to work with is larger than your machine's memory.

    # chunked file reading
    from __future__ import division
    import os
    
    def get_chunks(file_size):
        chunk_start = 0
        chunk_size = 0x20000  # 131072 bytes, default max ssl buffer size
        while chunk_start + chunk_size < file_size:
            yield(chunk_start, chunk_size)
            chunk_start += chunk_size
    
        final_chunk_size = file_size - chunk_start
        yield(chunk_start, final_chunk_size)
    
    def read_file_chunked(file_path):
        # binary mode: chunk sizes are byte counts and the input may not be text
        with open(file_path, 'rb') as file_:
            file_size = os.path.getsize(file_path)
    
            print('File size: {}'.format(file_size))
    
            progress = 0
    
            for chunk_start, chunk_size in get_chunks(file_size):
    
                file_chunk = file_.read(chunk_size)
    
                # do something with the chunk, encrypt it, write to another file...
    
                progress += len(file_chunk)
                print('{0} of {1} bytes read ({2}%)'.format(
                    progress, file_size, int(progress / file_size * 100))
                )
    
    if __name__ == '__main__':
        read_file_chunked('some-file.gif')

    Also available as a Gist (https://gist.github.com/richardasaurus/21d4b970a202d2fffa9c)

    The above will output:

    File size: 698837
    131072 of 698837 bytes read (18%)
    262144 of 698837 bytes read (37%)
    393216 of 698837 bytes read (56%)
    524288 of 698837 bytes read (75%)
    655360 of 698837 bytes read (93%)
    698837 of 698837 bytes read (100%)

    Hopefully this is handy to someone. It is of course not the only way; you could also use `file.seek` to jump directly to a particular chunk.
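
    As a rough sketch of the `file.seek` approach, the function below reads one chunk by its index without reading anything before it. The helper name `read_chunk_at` and the temporary demo file are assumptions made for illustration; only the chunk size is taken from the code above.

```python
import os
import tempfile

CHUNK_SIZE = 0x20000  # 131072 bytes, the same chunk size as in the code above

def read_chunk_at(file_path, chunk_index, chunk_size=CHUNK_SIZE):
    """Return chunk number `chunk_index` (0-based) of the file as bytes."""
    with open(file_path, 'rb') as file_:
        file_.seek(chunk_index * chunk_size)  # absolute offset from the start
        return file_.read(chunk_size)         # shorter, or empty, near EOF

# Illustrative file of 300000 zero bytes: two full chunks plus a shorter tail.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(300000))
    demo_path = tmp.name

first_chunk = read_chunk_at(demo_path, 0)
last_chunk = read_chunk_at(demo_path, 2)   # skips the first 262144 bytes
past_end = read_chunk_at(demo_path, 5)     # beyond EOF: empty bytes object
os.remove(demo_path)
```

    This is handy when chunks are handled independently, for example when retrying a single failed chunk of an upload.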

    Processing large files using python

    In the last year or so, and with my increased focus on ribo-seq data, I have come to fully appreciate what the term big data means. Ribo-seq studies in their raw form can easily reach hundreds of GBs, which means that processing them in a timely and efficient manner requires some thought. In this blog post, and hopefully those following, I want to detail some of the methods I have come up with (read: pieced together from multiple Stack Exchange posts) that help me take on data of this magnitude. Specifically, I will be detailing methods for Python and R, though some of the methods are transferable to other languages.

    My first big data tip for Python is learning how to break your files into smaller units (or chunks) in a way that lets you make use of multiple processors. Let’s start with the simplest way to read a file in Python.

    
    with open("input.txt") as f:
        data = f.readlines()
        for line in data:
            process(line)
    
    

    The mistake made above, with regard to big data, is that it reads all the data into RAM before attempting to process it line by line. This is likely the simplest way to exhaust the available memory and trigger an error. Let’s fix this by reading the data in line by line, so that only a single line is held in RAM at any given time.

    
    with open("input.txt") as f:
        for line in f:
            process(line)
    
    

    This is a big improvement: it doesn’t crash when fed a big file (and it’s shorter, too!). Next we should try to speed this up a bit by making use of all those otherwise idle cores.

    
    import multiprocessing as mp
    
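    #init objects
    cores = mp.cpu_count()  # one worker per available core
    pool = mp.Pool(cores)
    jobs = []
    
    #create jobs
    with open("input.txt") as f:
        for line in f:
            # args must be a tuple, hence (line,) rather than (line)
            jobs.append( pool.apply_async(process,(line,)) )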
    
    #wait for all jobs to finish
    for job in jobs:
        job.get()
    
    #clean up
    pool.close()
    
    

    Provided the order in which you process the lines doesn’t matter, the above creates a set (pool) of workers, ideally one for each core, and then a bunch of tasks (jobs), one for each line, for the workers to do. I tend to use the Pool object provided by the multiprocessing module because it is easy to use; however, you can spawn and control individual workers using mp.Process if you want finer control. For mere number crunching, the Pool object is very good.

    While the above now makes use of all those cores, it sadly runs into memory problems once again. We specifically use the apply_async function so that the pool isn’t blocked while each line is processed. However, in doing so, all the data is read into memory once again, this time stored as individual lines associated with each job, waiting in line to be processed. As such, the memory will again overflow. Ideally, the method would read a line into memory only when it is that line's turn to be processed.

    
    import multiprocessing as mp
    
    def process_wrapper(lineID):
        with open("input.txt") as f:
            for i,line in enumerate(f):
                if i != lineID:
                    continue
                else:
                    process(line)
                    break
    
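    #init objects
    cores = mp.cpu_count()  # one worker per available core
    pool = mp.Pool(cores)
    jobs = []
    
    #create jobs
    with open("input.txt") as f:
        for ID,line in enumerate(f):
            # args must be a tuple, hence (ID,) rather than (ID)
            jobs.append( pool.apply_async(process_wrapper,(ID,)) )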
    
    #wait for all jobs to finish
    for job in jobs:
        job.get()
    
    #clean up
    pool.close()
    
    

    Above we’ve changed the function fed to the pool of workers so that it opens the file, locates the specified line, reads it into memory, and then processes it. The only input now stored for each job is the line number, which prevents the memory overflow. Sadly, the overhead of locating the line by reading iteratively through the file for each job is untenable, and it gets progressively more time-consuming the further you get into the file. To avoid this we can use the seek function of file objects, which skips to a particular location within a file. Combining it with the tell function, which returns the current location within a file, gives:

    
    import multiprocessing as mp
    
    def process_wrapper(lineByte):
        with open("input.txt") as f:
            f.seek(lineByte)
            line = f.readline()
            process(line)
    
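    #init objects
    cores = mp.cpu_count()  # one worker per available core
    pool = mp.Pool(cores)
    jobs = []
    
    #create jobs
    with open("input.txt") as f:
        nextLineByte = f.tell()
        # use readline() here: in Python 3, iterating with "for line in f"
        # disables tell() on text files
        for line in iter(f.readline, ''):
            # args must be a tuple, hence the trailing comma
            jobs.append( pool.apply_async(process_wrapper,(nextLineByte,)) )
            nextLineByte = f.tell()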
    
    #wait for all jobs to finish
    for job in jobs:
        job.get()
    
    #clean up
    pool.close()
    
    

    Using seek we can move directly to the correct part of the file, whereupon we read a line into memory and process it. We have to be careful to correctly handle the first and last lines, but otherwise this does exactly what we set out to do, namely use all the cores to process a given file without overflowing the memory.
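
    How seek and tell fit together can be checked on a tiny file without any multiprocessing. The following is a small sketch, assuming binary mode so that offsets are plain byte positions; the helper names `line_offsets` and `read_line_at` and the demo file are hypothetical.

```python
import os
import tempfile

def line_offsets(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    with open(path, 'rb') as f:
        while True:
            start = f.tell()
            if not f.readline():  # b'' means end of file
                break
            offsets.append(start)
    return offsets

def read_line_at(path, offset):
    """Jump straight to `offset` and read the single line that starts there."""
    with open(path, 'rb') as f:
        f.seek(offset)
        return f.readline()

with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b"alpha\nbeta\ngamma\n")
    demo_path = tmp.name

offsets = line_offsets(demo_path)             # [0, 6, 11]
third = read_line_at(demo_path, offsets[2])   # b"gamma\n"
os.remove(demo_path)
```

    Only the integer offsets need to be held in memory or handed to a worker; the line itself is read on demand.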

    I’ll finish this post with a slight upgrade to the above, as there is a reasonable amount of overhead associated with opening and closing the file for each individual line. If we process multiple lines of the file at a time as a chunk, we can reduce the number of these operations. The biggest technicality is that when you jump to a location in a file, you are likely not at the start of a line. For a simple file, as in this example, this just means you need to call readline, which reads to the next newline character. More complex file types likely require additional code to find a suitable location to start or end a chunk.

    
    import multiprocessing as mp,os
    
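    def process_wrapper(chunkStart, chunkSize):
        # binary mode: chunkStart and chunkSize are byte counts
        with open("input.txt", "rb") as f:
            f.seek(chunkStart)
            lines = f.read(chunkSize).splitlines()
            for line in lines:
                process(line)  # each line is a bytes object; decode it if text is needed
    
    def chunkify(fname,size=1024*1024):
        fileEnd = os.path.getsize(fname)
        # binary mode: text-mode files do not allow relative seeks in Python 3
        with open(fname,'rb') as f:
            chunkEnd = f.tell()
            # the loop must stay inside the with block so the file is still open
            while True:
                chunkStart = chunkEnd
                f.seek(size,1)   # jump ahead by roughly one chunk
                f.readline()     # read on to the end of the current line
                chunkEnd = f.tell()
                yield chunkStart, chunkEnd - chunkStart
                if chunkEnd >= fileEnd:
                    break
    
    #init objects
    cores = mp.cpu_count()  # one worker per available core
    pool = mp.Pool(cores)
    jobs = []
    
    #create jobs
    for chunkStart,chunkSize in chunkify("input.txt"):
        jobs.append( pool.apply_async(process_wrapper,(chunkStart,chunkSize)) )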
    
    #wait for all jobs to finish
    for job in jobs:
        job.get()
    
    #clean up
    pool.close()
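
    A quick way to convince yourself that chunk boundaries fall on line breaks is to run the chunking logic on a small file, without the pool, and check that every line comes back exactly once. The sketch below re-implements the chunking idea under the assumptions of binary mode and "\n" line endings; the names `iter_chunks` and `read_lines` and the demo data are illustrative and not part of the original post.

```python
import os
import tempfile

def iter_chunks(path, size=1024 * 1024):
    """Yield (start, length) pairs whose boundaries fall just after a newline."""
    file_end = os.path.getsize(path)
    with open(path, 'rb') as f:
        chunk_end = 0
        while chunk_end < file_end:
            chunk_start = chunk_end
            f.seek(chunk_start + size)   # jump ahead by roughly one chunk
            f.readline()                 # finish the line we landed in
            chunk_end = min(f.tell(), file_end)
            yield chunk_start, chunk_end - chunk_start

def read_lines(path, start, length):
    """Read one chunk and split it into lines (as bytes objects)."""
    with open(path, 'rb') as f:
        f.seek(start)
        return f.read(length).splitlines()

# Illustrative input: 1000 short records, one per line.
expected = [b"record %d" % i for i in range(1000)]
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b"\n".join(expected) + b"\n")
    demo_path = tmp.name

chunks = list(iter_chunks(demo_path, size=256))
recovered = []
for start, length in chunks:
    recovered.extend(read_lines(demo_path, start, length))
os.remove(demo_path)
```

    If `recovered` equals `expected`, no line was split or duplicated, and the same (start, length) pairs can be handed to the pool.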
    
    

    Anyway, I hope that some of the above was new or perhaps even helpful to you. If you know of a better way to do things (in Python), I’d be very interested to hear about it. In another post coming in the near future, I will expand on this code, turning it into a parent class from which to create multiple children for use with various file types.

  • Original article: https://www.cnblogs.com/twodog/p/12137489.html