  • Reading a fixed-size chunk of a formatted file in Python

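    Data exported from a database is formatted as shown below: the lines between two consecutive <REC> markers hold all the fields of one record, such as <TITLE>, <ABSTRACT> and <SUBJECT_CODE>. Some fields of a record may be empty, however, and an empty field is simply left out of the exported text file. Record 3, for example, has no <ABSTRACT> field, and record 4 has no <SUBJECT_CODE> field.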

    <REC>(record 1)
    <TITLE>=Regulation of the protein disulfide proteome by mitochondria in mammalian cells.
    <ABSTRACT>=The majority of protein disulfides in cells is considered an important inert structural, rather than a dynamic regulatory, determinant of protein function. 
    <SUBJECT_CODE>=A006_8;D050_42;A006_62
    <REC>(record 2)
    <TITLE>=Selective control of cortical axonal spikes by a slowly inactivating K+ current.
    <ABSTRACT>=Neurons are flexible electrophysiological entities in which the distribution and properties of ionic channels control their behaviors.
    <SUBJECT_CODE>=E057_6;E062_318;I135_46
    <REC>(record 3)
    <TITLE>=Coupling of hydrogenic tunneling to active-site motion in the hydrogen radical transfer catalyzed by a coenzyme B12-dependent mutase.
    <SUBJECT_CODE>=B016_11;B014_32;B014_54
    <REC>(record 4)
    <TITLE>=Hyaluronic acid hydrogel for controlled self-renewal and differentiation of human embryonic stem cells.
    <ABSTRACT>=Control of self-renewal and differentiation of human ES cells (hESCs) remains a challenge. 
    <REC>(record 5)
    <TITLE>=Biologically inspired crack trapping for enhanced adhesion.
    <ABSTRACT>=We present a synthetic adaptation of the fibrillar adhesion surfaces found in nature. 
    <SUBJECT_CODE>=A004_57;B022_73;C034_22
    <REC>(record 6)
    <TITLE>=Identification of a retroviral receptor used by an envelope protein derived by peptide library screening.
    <ABSTRACT>=This study demonstrates the power of a genetic selection to identify a variant virus that uses a new retroviral receptor protein. 
    <SUBJECT_CODE>=A006_8;E059_A;E059_5
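To make the record layout concrete, here is a minimal sketch that groups lines into records at each <REC> marker and turns every record into a dict, so a missing field is just an absent key. The helper names `iter_records` and `parse_record` are hypothetical and not part of the original post. The sample lines are records 3 and 4 from the listing above.

```python
def iter_records(lines, marker="<REC>"):
    """Group an iterable of lines into records (lists of lines)."""
    record = []
    for line in lines:
        if line.startswith(marker) and record:
            yield record  # the previous record is complete
            record = []
        record.append(line)
    if record:
        yield record  # last record in the input


def parse_record(record_lines):
    """Map field name -> value for the "<NAME>=value" lines of one record."""
    fields = {}
    for line in record_lines:
        line = line.strip()
        # The "<REC>" marker line has no ">=" and is skipped here.
        if line.startswith("<") and ">=" in line:
            name, value = line[1:].split(">=", 1)
            fields[name] = value
    return fields


sample = [
    "<REC>",
    "<TITLE>=Coupling of hydrogenic tunneling to active-site motion in the "
    "hydrogen radical transfer catalyzed by a coenzyme B12-dependent mutase.",
    "<SUBJECT_CODE>=B016_11;B014_32;B014_54",
    "<REC>",
    "<TITLE>=Hyaluronic acid hydrogel for controlled self-renewal and "
    "differentiation of human embryonic stem cells.",
    "<ABSTRACT>=Control of self-renewal and differentiation of human ES cells "
    "(hESCs) remains a challenge.",
]

parsed = [parse_record(record) for record in iter_records(sample)]
for fields in parsed:
    # Use .get() with a default so that missing fields do not raise KeyError.
    print(fields.get("ABSTRACT", ""), "|", fields.get("SUBJECT_CODE", ""))
```

Any lines that appear before the first <REC> marker would be grouped into a leading pseudo-record. Real input may need an extra check for that case.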

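    1. Some tables exported from the database (as txt files) take up about 3-4 GB, so they cannot be read into memory in one go;

    2. When a text file larger than about 600 MB is read with getlines() from Python's linecache module, the result is sometimes empty, depending on the state of the PC at that moment (for example, when memory runs short);

    Note: linecache.getlines() ultimately calls file.readlines() to read the whole file at once; if the file is too large, getlines() returns an empty list.

    3. Reading the text line by line has two drawbacks: it is inconvenient for the later processing steps, which work on whole records rather than on single lines, and it is slow;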
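The empty result described in point 2 is easy to mistake for an empty file, because linecache returns an empty list instead of raising an error. The sketch below wraps the call in a hypothetical helper, `getlines_checked`, that compares the result against the file size. It only makes the failure visible and does not solve the memory problem.

```python
import linecache
import os


def getlines_checked(path):
    """Like linecache.getlines(), but raise instead of silently returning []."""
    size = os.path.getsize(path)  # raises OSError if the file is missing
    lines = linecache.getlines(path)
    if size > 0 and not lines:
        raise RuntimeError(
            "linecache returned no lines for a %d-byte file" % size
        )
    return lines


# linecache also yields [] for a path that does not exist,
# so an empty list alone says nothing about the cause.
print(linecache.getlines("no_such_file_for_demo.txt"))
```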

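    It is therefore worth designing a reader for this kind of formatted file that reads roughly a given number of bytes at a time while guaranteeing that every record in the returned text is complete, so that the last record never contains only part of its fields.

    The implementation is as follows: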

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    import os
    from time import time

    REC_STR = b'<REC>'  # record marker; bytes, because the file is opened in binary mode


    def read_text_in_buffer_multi_line(fd, length, label):
        """Read about `length` bytes of whole lines, starting at offset `label`.

        Returns (lines, new_label). Unless the end of the file was reached,
        the returned lines stop just before the last <REC> that was read.
        """
        fd.seek(label, 0)  # move to the offset where the previous chunk ended
        buffer = fd.readlines(length)  # whole lines totalling roughly `length` bytes
        line = fd.readline()  # read one more line to detect the end of the file
        label = fd.tell()  # current file position, after the extra line

        if line:  # not at EOF: discard everything from the last <REC> onward
            buffer_post = []
            while buffer:
                temp = buffer.pop()
                buffer_post.append(temp)
                if temp.startswith(REC_STR):  # reached the start of the last record
                    break
            if not buffer:
                # Nothing complete is left, so the caller could not make progress.
                raise ValueError("length is too small to hold one complete record")
            # Rewind by the discarded bytes plus the extra line read above.
            label = label - len(b''.join(buffer_post)) - len(line)
        return buffer, label


    if __name__ == "__main__":
        filename = os.path.join("Data", "SJWD_U.txt")
        readlen = 100000 * 210  # number of bytes to read per chunk (about 21 MB)

        begin = time()
        # Both files are binary, so the byte offsets stay exact.
        with open(filename, 'rb') as fd, open("out.txt", 'wb') as fout:
            label = 0
            while True:
                buffer_list, label = read_text_in_buffer_multi_line(fd, readlen, label)
                if not buffer_list:
                    break
                fout.writelines(buffer_list)
        end = time()
        print("time:", end - begin)
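As a point of comparison, the same guarantee (every chunk ends on a record boundary) can be sketched without seeking backwards: read fixed-size blocks with read() and carry the last, possibly incomplete, record over to the next block. This is an alternative technique, not the method used in the listing above. The names `iter_complete_chunks` and `REC` are assumptions made for this example, and the demo data is made up.

```python
import io

REC = b"<REC>"


def iter_complete_chunks(fd, size):
    """Yield byte chunks of roughly `size` bytes that end on a record boundary."""
    tail = b""
    while True:
        data = fd.read(size)
        if not data:
            if tail:
                yield tail  # whatever is left at EOF holds the final record(s)
            return
        data = tail + data
        # Start of the last record in the block, which may be incomplete.
        cut = data.rfind(b"\n" + REC)
        if cut == -1:
            tail = data  # no boundary found yet: keep accumulating
            continue
        yield data[:cut + 1]  # complete records, including the final newline
        tail = data[cut + 1:]  # carry the last record over to the next block


demo = (
    b"<REC>\n<TITLE>=a\n"
    b"<REC>\n<TITLE>=b\n<ABSTRACT>=c\n"
    b"<REC>\n<TITLE>=d\n"
)
chunks = list(iter_complete_chunks(io.BytesIO(demo), 20))
for chunk in chunks:
    print(chunk)
```

The trade-off is memory against seeks: the carried-over tail is kept in memory and every block is scanned once with rfind(), whereas the listing above re-reads the discarded lines from disk on the next call.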
  • Original article: https://www.cnblogs.com/zhbzz2007/p/5282191.html