
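Crawler 4: PDF pages + the pdfminer module + demo
This post shows how to crawl pages that are served as PDF files. It relies on the pdfminer module.

General flow of the demo:

1) Set the URL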
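url = 'http://www.------' + '.PDF'
2) Fetch the URL with the requests module
import requests
r = requests.get(url)  # the URL set in step 1
3) Write the response to a .pdf file
# `i` is the record currently being processed; it comes from code not shown in this post
with open("PDF/" + i[u'associateAnnouncement'] + '.pdf', "wb") as myFile:
    myFile.write(r.content)  # r.content holds the raw bytes of the response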
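Steps 1 to 3 can be folded into one small helper. The following is only a minimal sketch, not the original author's code: the function names `save_pdf` and `download_pdf`, the default `PDF` output directory, the 30-second timeout, and the `%PDF` header check are all assumptions added for illustration.

```python
import os

import requests


def save_pdf(content, name, out_dir="PDF"):
    """Write raw PDF bytes to out_dir/name.pdf and return the path."""
    # A PDF file normally starts with the bytes "%PDF"; anything else
    # is probably an error page rather than the document itself.
    if not content.startswith(b"%PDF"):
        raise ValueError("response does not look like a PDF")
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, name + ".pdf")
    with open(path, "wb") as pdf_file:
        pdf_file.write(content)
    return path


def download_pdf(url, name, out_dir="PDF", timeout=30):
    """Fetch url with requests and store the response body as a .pdf file."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    return save_pdf(r.content, name, out_dir)
```

Keeping the write step in its own function means it can be checked without network access, and the header check stops an HTML error page from being stored under a .pdf name.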
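4) Use the pdfminer module (for its API, see my other post: http://www.cnblogs.com/rongyux/p/5445723.html). Run pdf2txt.py from the command line to convert the PDF file to HTML, which is easier to parse. The command below uses the sample file that ships with pdfminer: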
pdf2txt.py -o output.html samples/naacl06-shinyama.pdf
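When many PDF files have been downloaded, the same command can be driven from Python instead of being typed by hand. This is a sketch under stated assumptions: `pdf2txt_command` and `convert_all` are hypothetical helper names, the `PDF` directory is taken from step 3, and `pdf2txt.py` is assumed to be on the PATH and directly executable (on Windows it may need to be run through the Python interpreter instead).

```python
import subprocess
from pathlib import Path


def pdf2txt_command(pdf_path, html_path=None):
    """Build the pdf2txt.py command line for one PDF file."""
    pdf_path = Path(pdf_path)
    if html_path is None:
        # Put the HTML next to the PDF: PDF/a.pdf -> PDF/a.html
        html_path = pdf_path.with_suffix(".html")
    # Same shape as the command above: pdf2txt.py -o <output> <input>
    return ["pdf2txt.py", "-o", str(html_path), str(pdf_path)]


def convert_all(pdf_dir="PDF"):
    """Run pdf2txt.py on every .pdf file found in pdf_dir."""
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # check=True raises CalledProcessError if a conversion fails.
        subprocess.run(pdf2txt_command(pdf_path), check=True)
```

Building the argument list separately from running it makes the command easy to inspect or log before anything is executed.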
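5) Parse the HTML with BeautifulSoup
from bs4 import BeautifulSoup
html = open('PDF/1202268749.html').read()
soup = BeautifulSoup(html, 'html.parser')  # build the parse tree with Python's built-in parser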
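Once the HTML is loaded, a first pass can simply pull out the visible text line by line. The sketch below is illustrative only: `extract_lines` is a hypothetical helper name, and the markup pdfminer actually emits may differ between versions, so any selection by tag or style should be checked against a real output file.

```python
from bs4 import BeautifulSoup


def extract_lines(html):
    """Return the non-empty text lines found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text("\n") joins the text nodes with newlines, so pieces of
    # text that sit in separate tags or are split by <br> stay on
    # separate lines.
    text = soup.get_text("\n")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

For the file from step 5, this could be called as extract_lines(open('PDF/1202268749.html', encoding='utf-8').read()), assuming the converter wrote UTF-8 output.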
To be continued (time for bed). In short: pdfminer converts the PDF page into an HTML page, and BeautifulSoup then parses that HTML page.

Original post: https://www.cnblogs.com/rongyux/p/5513811.html
