zoukankan      html  css  js  c++  java
  • python爬虫:读取PDF

    下面的代码可以实现用python读取PDF,包括读取本地和网络上的PDF。

    pdfminer下载地址:https://pypi.python.org/packages/source/p/pdfminer/pdfminer-20140328.tar.gz

    #!/usr/bin/python
    # -*- encoding:utf-8 -*-


    from urllib2 import urlopen
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from cStringIO import StringIO

    def convert_pdf_to_txt(fp):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)

    fp.close()
    device.close()
    textstr = retstr.getvalue()
    retstr.close()
    return textstr

    url='http://pythonscraping.com/pages/warandpeace/chapter1.pdf'
    fp = StringIO(urlopen(url).read()) # for url

    # path='chapter1.pdf'
    # fp = file(path, 'rb') # for path

    text=convert_pdf_to_txt(fp)
    print text
  • 相关阅读:
    包的使用,json&pickle模块,hashlib模块
    在阿里云购买云服务器并安装宝塔面板
    python 采集斗图啦(多线程)
    python 采集斗图啦xpath
    python 采集唯美girl
    小程序接入内容内容审查接口(图片.文字)
    PEP8规范
    接口的安全问题
    学习redis知识的过程
    Spring葵花宝典
  • 原文地址:https://www.cnblogs.com/miranda-tang/p/5569439.html
Copyright © 2011-2022 走看看