
Extracting Text from a PDF with PDFMiner

1. Download and install PDFMiner

Download PDFMiner from https://pypi.python.org/pypi/pdfminer/:

wget https://pypi.python.org/packages/57/4f/e1df0437858188d2d36466a7bb89aa024d252bd0b7e3ba90cbc567c6c0b8/pdfminer-20140328.tar.gz#md5=dfe3eb1b7b7017ab514aad6751a7c2ea

Unpack the archive and install it:

tar -zxvf pdfminer-20140328.tar.gz
cd pdfminer-20140328/
make cmap  # build the CMap files; without them Chinese text comes out as a pile of (cid:xxx) placeholders
sudo python setup.py install
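
The `make cmap` step matters because PDFMiner writes a literal `(cid:NNN)` placeholder for every glyph it cannot map to Unicode. Below is a minimal sketch of a check for that symptom in already-extracted text. The helper names `count_cid_markers` and `looks_unmapped` and the 5% threshold are assumptions made for illustration, not part of PDFMiner.

```python
import re

# PDFMiner emits "(cid:NNN)" for glyphs it cannot map to Unicode.
CID_MARKER = re.compile(r"\(cid:\d+\)", re.IGNORECASE)


def count_cid_markers(text):
    """Return how many (cid:NNN) placeholders appear in the text."""
    return len(CID_MARKER.findall(text))


def looks_unmapped(text, threshold=0.05):
    """Heuristic: True when placeholders exceed `threshold` of the text.

    The ratio compares the characters inside placeholders with all
    non-whitespace characters. The default threshold is an arbitrary guess.
    """
    stripped = "".join(text.split())
    if not stripped:
        return False
    marker_chars = sum(len(m) for m in CID_MARKER.findall(stripped))
    return float(marker_chars) / len(stripped) > threshold
```

If `looks_unmapped` returns True for the output of a Chinese PDF, the CMap files were probably not built before installation.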

2. Extract the text

# Python 2 code: cStringIO and pdfminer-20140328 are both Python 2 only.
from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import sys

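def convert_pdf_2_text(path):
    """Return the text of every page in the PDF at `path` as a UTF-8 byte string."""
    rsrcmgr = PDFResourceManager()  # caches shared resources such as fonts
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, 'rb') as fp:
        # An empty set of page numbers means "process every page".
        for page in PDFPage.get_pages(fp, set()):
            interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text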

text = convert_pdf_2_text(sys.argv[1])
# Write the UTF-8 bytes out; the with block closes the file even on error.
with open('real?.txt', 'wb') as out:
    out.write(text)
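
As far as I know, TextConverter writes a form feed (`\f`) after each page it renders, so the single string returned by convert_pdf_2_text can be cut back into pages. Below is a minimal sketch that assumes the separator is present. `split_pages` is a hypothetical helper, not a PDFMiner API.

```python
def split_pages(text):
    """Split extracted text into per-page strings on the form-feed separator."""
    pages = text.split("\f")
    # A form feed follows every page, so the final piece is usually empty.
    if pages and not pages[-1].strip():
        pages.pop()
    # Trim the surrounding whitespace of each page.
    return [page.strip() for page in pages]
```

Calling `split_pages(text)` on the extracted text gives one list entry per page, which is handy when only a few pages need checking.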

3. Test results

References:

[1] http://www.unixuser.org/~euske/python/pdfminer/#source

[2] https://www.zhihu.com/question/31586273


Original article: https://www.cnblogs.com/vincent-vg/p/6827031.html