
Scraping User-Agent strings with Scrapy

useragentstring.com lists almost every User-Agent string there is. Having just learned Scrapy, I decided to practice on it by scraping the User-Agent strings from that site.

This post only scrapes the common browsers: Firefox, Chrome, Opera, Safari, and Internet Explorer.

I. Create the crawler project

1. Create the project useragent

$ scrapy startproject useragent

2. Enter the project directory

$ cd useragent

3. Generate the spider file ua

This step is optional, but having the generated file makes things easier.

$ scrapy genspider ua useragentstring.com

II. Edit the item file

# useragent/items.py
import scrapy

class UseragentItem(scrapy.Item):
    # define the fields for your item here like:
    ua_name = scrapy.Field()
    ua_string = scrapy.Field()

III. Edit the spider file

# useragent/spiders/ua.py

import scrapy

from useragent.items import UseragentItem

class UaSpider(scrapy.Spider):
    name = "ua"
    allowed_domains = ["useragentstring.com"]
    start_urls = (
        'http://www.useragentstring.com/pages/useragentstring.php?name=Firefox',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Internet+Explorer',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Opera',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Safari',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Chrome',
    )

    def parse(self, response):
        ua_name = response.url.split('=')[-1]
        for ua_string in response.xpath('//li/a/text()').extract():
            item = UseragentItem()
            item['ua_name'] = ua_name
            item['ua_string'] = ua_string.strip()
            yield item
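
One detail of the parse method is worth a closer look: the browser name comes from whatever follows the last '=' in the URL, so for the Internet Explorer page it is still URL-encoded. Below is a minimal sketch of a decoded alternative using only the standard library. The helper name browser_name_from_url is hypothetical and not part of the original spider.

```python
from urllib.parse import urlsplit, parse_qs


def browser_name_from_url(url):
    """Return the decoded value of the `name` query parameter, or None.

    Hypothetical helper for illustration; the original spider simply
    splits the URL on '='.
    """
    query = urlsplit(url).query
    values = parse_qs(query).get("name")
    return values[0] if values else None


url = "http://www.useragentstring.com/pages/useragentstring.php?name=Internet+Explorer"
print(url.split("=")[-1])          # raw value, the '+' is left as-is
print(browser_name_from_url(url))  # decoded value, '+' becomes a space
```

Inside the spider, the same helper could be called as browser_name_from_url(response.url) if decoded names are preferred in the output file.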

IV. Run the spider

The -o option tells the crawler to write its output to a JSON file:

$ scrapy crawl ua -o item.json

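The result (screenshot):

This does not look like the result I wanted. Note the robots.txt line in the log: my guess is that the site forbids crawlers.

Whether or not that guess is right, the first thing to try is imitating a browser by adding headers to every request: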

# useragent/spiders/ua.py

import scrapy

from useragent.items import UseragentItem

class UaSpider(scrapy.Spider):
    name = "ua"
    allowed_domains = ["useragentstring.com"]
    start_urls = (
        'http://www.useragentstring.com/pages/useragentstring.php?name=Firefox',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Internet+Explorer',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Opera',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Safari',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Chrome',
    )
    
    # Runs before any request is sent and builds the initial requests
    def start_requests(self):
        for url in self.start_urls:
            headers = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"}
            yield scrapy.Request(url, callback=self.parse, headers=headers)

    def parse(self, response):
        ua_name = response.url.split('=')[-1]
        for ua_string in response.xpath('//li/a/text()').extract():
            item = UseragentItem()
            item['ua_name'] = ua_name
            item['ua_string'] = ua_string.strip()
            yield item
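
As an alternative to building the headers by hand in start_requests, Scrapy also reads a project-wide USER_AGENT setting and applies it to every request. The fragment below is a sketch of that approach, assuming the default project layout created by scrapy startproject. It reuses the same User-Agent string as the spider above.

```python
# useragent/settings.py (fragment, assumed default project layout)

# Default User-Agent header sent with every request in this project.
# Same string as in the spider above; any browser-like value would do.
USER_AGENT = (
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; "
    ".NET CLR 1.1.4322; .NET CLR 2.0.50727)"
)
```

With this setting in place the spider would not need to override start_requests at all. The per-request headers shown above remain useful when different requests need different values.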

Run it again, and it works!
The result looks like this:

Now I will never run short of User-Agent strings.
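
To put the collected strings to use, here is a minimal sketch that loads the exported file and picks a random User-Agent, optionally restricted to one browser. It assumes item.json holds a single JSON array of objects with the ua_name and ua_string fields defined earlier. The function names and the sample data are hypothetical.

```python
import json
import random


def load_items(path):
    """Load the list of scraped items from a JSON file such as item.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def pick_user_agent(items, ua_name=None, rng=random):
    """Return a random ua_string, optionally limited to one ua_name."""
    candidates = [
        item["ua_string"]
        for item in items
        if ua_name is None or item["ua_name"] == ua_name
    ]
    if not candidates:
        raise ValueError("no User-Agent strings found for %r" % ua_name)
    return rng.choice(candidates)


# Hypothetical sample data in the same shape as the spider's output.
sample_items = [
    {"ua_name": "Firefox",
     "ua_string": "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0"},
    {"ua_name": "Internet+Explorer",
     "ua_string": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"},
]

print(pick_user_agent(sample_items, ua_name="Firefox"))
```

For real data, pass the result of load_items("item.json") instead of sample_items.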


Original post: https://www.cnblogs.com/hhh5460/p/5826097.html