  • 【Python】Parsing Docx Files

    1. cd D:\ProgramData\Anaconda3

    2. pip install python-docx

    3. Process the documents with Python code

    # -*- coding: utf-8 -*-
     
    
    
    import os
    import docx  # installed by the python-docx package
    from win32com import client as wc  # provided by pywin32 (Windows only, requires Word)
    
    docs = []
     
    def traverse(f):
        """Recursively collect the paths of .doc/.docx files under f into docs."""
        fs = os.listdir(f)
        for f1 in fs:
            tmp_path = os.path.join(f, f1)
            if not os.path.isdir(tmp_path):
                # print('File: %s' % tmp_path)
                if os.path.splitext(tmp_path)[-1].lower() in (".doc", ".docx"):
                    docs.append(tmp_path)
            else:
                # print('Folder: %s' % tmp_path)
                traverse(tmp_path)
    
    
    def parseDoc(f):
        """Print every paragraph of a .docx file, then the paragraph count."""
        doc = docx.Document(f)
        parag_num = 0
        for para in doc.paragraphs:
            print("----------------------------------------------------")
            print(para.text)
            print("----------------------------------------------------")
            parag_num += 1
        print('This document has', parag_num, 'paragraphs')
    
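    def doc2docx(full_path):
        """Convert a legacy .doc file to .docx with Word, saved next to the original."""
        # Only .doc files need converting; appending "x" to anything else gives a wrong name
        if not full_path.lower().endswith(".doc"):
            return
        # Word may not resolve relative paths against the script's directory, so use an absolute path
        full_path = os.path.abspath(full_path)
        newpath = full_path + "x"

        # Skip files that have already been converted
        if os.path.exists(newpath):
            return

        # Start Word through COM to convert the .doc to .docx
        word = wc.Dispatch("Word.Application")
        try:
            # Open the file by its full path (directory + file name)
            doc = word.Documents.Open(full_path)
            # File format 16 saves as .docx, which python-docx can then read
            doc.SaveAs(newpath, 16)
            doc.Close()
        finally:
            # Always quit Word, even if opening or saving fails
            word.Quit()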
    
                
    path = 'E:/NLP/Docs/'
    
    traverse(path)
     
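    for k, v in enumerate(docs):
        if k < 1:  # only process the first file found
            print(k, v)
            # python-docx reads only .docx; for a legacy .doc, call doc2docx(v) first
            parseDoc(v)
            #doc2docx(v)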
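
    The recursive traverse() above could also be written with os.walk, which returns the result instead of appending to a global list. The following is only a minimal sketch, not the original author's code: the names find_word_files and WORD_EXTENSIONS are hypothetical, and skipping Word's "~$" lock files is an added assumption that traverse() does not make.

```python
import os

# Assumption: the same two extensions that traverse() accepts.
WORD_EXTENSIONS = {".doc", ".docx"}


def find_word_files(root):
    """Return a sorted list of .doc/.docx paths found anywhere under root."""
    found = []
    # os.walk visits root and every subdirectory, so no explicit recursion is needed
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Assumption: skip the "~$..." lock files Word creates while a document is open
            if name.startswith("~$"):
                continue
            # Compare the extension case-insensitively, as traverse() does
            if os.path.splitext(name)[-1].lower() in WORD_EXTENSIONS:
                found.append(os.path.join(dirpath, name))
    found.sort()
    return found
```

    With this helper, the list could be built with docs = find_word_files(path) instead of calling traverse(path).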
  • Original post: https://www.cnblogs.com/defineconst/p/9915851.html