Two ways to extract page links in a Scrapy crawler, and how to construct an HtmlResponse object


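Notes on the Response object:

  A Response object describes an HTTP response. Response itself is only the base class; depending on the type of response, Scrapy provides the following subclasses:

    TextResponse, HtmlResponse, XmlResponse

  (HtmlResponse and XmlResponse are in turn subclasses of TextResponse.) Taking HtmlResponse as the example: compared with the base Response class, it offers many additional methods and attributes, such as css, xpath and text.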

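1. Using Selector

    Links are just another kind of data in the page, so they can be extracted with the same techniques used for any other data. When analysing a page, you can build a Selector object in a Jupyter notebook and experiment with it (a Selector object has xpath and css methods).

      import requests
      from scrapy.selector import Selector

      # Download the page with requests
      res = requests.get("http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")

      # res is a requests response, not a Scrapy response, so pass its HTML text to Selector
      selector = Selector(text=res.text)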

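2. Using the linkextractors module of the Scrapy framework

    For general usage, see the relevant documentation.

  1. In le.extract_links(response), where le is a LinkExtractor instance, response must be a Scrapy response such as an HtmlResponse, not a requests response.

  2. Constructing an HtmlResponse: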

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
import requests

# First fetch the page with requests, then use the result to build an HtmlResponse,
# so that the linkextractors module can be used on it
url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
ResStack = requests.get(url)

# Pass the raw bytes and let Scrapy decode them with the stated encoding
res = HtmlResponse(url=url, body=ResStack.content, encoding="utf-8")

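Notes: 1. The HtmlResponse constructor accepts many parameters; consult a reference for details on how to use them.

  2. HtmlResponse also provides a number of methods and attributes, such as the css and xpath methods and the text attribute. It can therefore be used for page analysis in a Jupyter notebook just like a Selector, and it additionally works with LinkExtractor for extracting links, which makes it the more convenient option.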


Original article: https://www.cnblogs.com/RosemaryJie/p/12301202.html