zoukankan      html  css  js  c++  java
  • python爬虫:读取PDF

    下面的代码可以实现用python读取PDF,包括读取本地和网络上的PDF。

    pdfminer下载地址:https://pypi.python.org/packages/source/p/pdfminer/pdfminer-20140328.tar.gz

    #!/usr/bin/python
    # -*- encoding:utf-8 -*-


    from urllib2 import urlopen
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from cStringIO import StringIO

    def convert_pdf_to_txt(fp):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)

    fp.close()
    device.close()
    textstr = retstr.getvalue()
    retstr.close()
    return textstr

    url='http://pythonscraping.com/pages/warandpeace/chapter1.pdf'
    fp = StringIO(urlopen(url).read()) # for url

    # path='chapter1.pdf'
    # fp = file(path, 'rb') # for path

    text=convert_pdf_to_txt(fp)
    print text
  • 相关阅读:
    QTP自动化测试进阶
    疯狂Java实战演义
    软件开发之韵:和谐敏捷
    Android应用开发详解
    操作系统直接决定了计算机系统的整体性能
    iBATIS框架源码剖析
    PHP 5完全攻略
    天气地图系统
    OpenSolaris系统管理
    Asp.net MVC 3实例学习之ExtShop(三)——完成首页
  • 原文地址:https://www.cnblogs.com/miranda-tang/p/5569439.html
Copyright © 2011-2022 走看看