
Web Scraping Notes: Reading and Storing PDFs

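To work with PDF files in Python 3, install the PDFMiner3K library, either with pip or from a downloaded source archive:

python3 -m pip install pdfminer3k

Installing from source:
1. Download the source archive.
2. Run python3 setup.py install in the unpacked source directory.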
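
After either install route, a quick check that the package can be found may save some debugging. The sketch below uses only the standard library; is_importable is a hypothetical helper name. Note that pdfminer3k is imported under the module name pdfminer, and other PDFMiner distributions use the same name, so this only confirms that some pdfminer package is present.

```python
import importlib.util

def is_importable(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None

if __name__ == "__main__":
    # pdfminer3k installs under the import name "pdfminer"
    print("pdfminer available:", is_importable("pdfminer"))
```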

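The approach to processing a PDF file:

Extract the PDF's text into a string, using a StringIO object as the in-memory file object that the converter writes to.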
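
To make the role of StringIO concrete, here is a minimal sketch that uses only the standard library. The helper name collect_text is hypothetical; it only shows that a StringIO object accepts write() calls like a text file and returns everything written through getvalue(), which is how a text converter can use it as an output target.

```python
from io import StringIO

def collect_text(chunks):
    """Write each chunk to an in-memory buffer and return the combined text."""
    buffer = StringIO()          # behaves like a text file opened for writing
    for chunk in chunks:
        buffer.write(chunk)      # the same call a text converter would make
    text = buffer.getvalue()     # everything written so far, as one string
    buffer.close()
    return text

if __name__ == "__main__":
    print(collect_text(["Chapter 1\n", "First paragraph of text.\n"]))
```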

Example:

from urllib.request import urlopen
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def readPDF(pdfFile):
    # Stores shared resources (such as fonts and images) during processing
    rsrcmgr = PDFResourceManager()
    # In-memory text buffer that receives the extracted text
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)

    # Process the PDF and send its text to the converter
    process_pdf(rsrcmgr, device, pdfFile)
    device.close()

    content = retstr.getvalue()
    retstr.close()
    return content

pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
outputString = readPDF(pdfFile)
print(outputString)
pdfFile.close()
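
The title mentions storing as well as reading, but the example only prints the text. As a sketch of the storage step, assuming the extracted text is already a Python string such as outputString, the hypothetical helper save_text below writes it to a UTF-8 text file. The output file name is an illustrative assumption.

```python
import tempfile
from pathlib import Path

def save_text(content, path):
    """Write extracted text to a UTF-8 file; return the number of characters written."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    return target.write_text(content, encoding="utf-8")

if __name__ == "__main__":
    # Illustrative output location; any writable path would do
    out_path = Path(tempfile.gettempdir()) / "chapter1.txt"
    save_text("text extracted by readPDF would go here\n", out_path)
    print("saved to", out_path)
```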

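The biggest advantage of the readPDF function is that, if the PDF file is on your own computer, you can simply replace pdfFile, the object returned by urlopen, with an ordinary open() file object:
pdfFile = open("../pages/warandpeace/chapter1.pdf", 'rb')
The output may not be perfect, especially when the PDF contains images, varied text formatting, or tables and charts. For most PDFs that contain only plain text, however, the output is essentially the same as a plain-text file.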
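
Because readPDF only needs a binary file-like object, the choice between a URL and a local path can be kept in one place. The sketch below is one possible way to do that; open_pdf_source is a hypothetical helper and not part of PDFMiner3K, and the demo file is a throwaway stand-in rather than a real PDF.

```python
import os
import tempfile
from urllib.request import urlopen

def open_pdf_source(source):
    """Return a binary file-like object for a URL or a local file path."""
    if source.startswith(("http://", "https://")):
        return urlopen(source)      # network response, readable as bytes
    return open(source, "rb")       # local file, opened in binary mode

if __name__ == "__main__":
    # Demonstrate the local branch with a small temporary file
    fd, path = tempfile.mkstemp(suffix=".pdf")
    with os.fdopen(fd, "wb") as f:
        f.write(b"%PDF-1.4\n")
    with open_pdf_source(path) as pdf_file:
        print(pdf_file.read(8))
    os.remove(path)
```

Both return values can be used in a with statement, so the object passed to readPDF is closed automatically when the block ends.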

Original article: https://www.cnblogs.com/xiaoyaowuming/p/6405729.html