  • Web Scraping Basics

    1. Overview of web crawlers

    What is a crawler:

    The process of writing a program that simulates a browser going online and then crawling/scraping data from the internet.
    Simulation: a browser is itself a natural, primitive crawling tool.
    

    Types of crawlers:

    General-purpose crawler: crawls the data of an entire page; the fetching system (crawler program).
    Focused crawler: crawls a specific portion of a page's data; always built on top of a general-purpose crawler.
    Incremental crawler: monitors a site for data updates so that the newly published data can be crawled.
    

    Risk analysis

    Use crawlers responsibly.
    Where the risk shows up:
    The crawler interferes with the normal operation of the target site;
    The crawler collects specific types of data or information protected by law.
    How to avoid the risk:
    Strictly follow the robots protocol set by the site;
    While working around anti-crawling measures, optimize your own code so that it does not disrupt the site's normal operation;
    When using or redistributing scraped information, review the content; if it contains users' personal information, private data, or others' trade secrets, stop immediately and delete it.
    

    Anti-crawling mechanisms

    Anti-anti-crawling strategies
    robots.txt protocol: a plain-text protocol that states which data may and may not be crawled; a quick way to check it is shown below.
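
    A minimal sketch (not from the original) of checking a site's robots.txt before crawling, using the standard library's robotparser; the URL and user agent are just examples.

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url('https://www.sogou.com/robots.txt')
    rp.read()
    # can_fetch tells us whether the given user agent is allowed to crawl the path
    print(rp.can_fetch('*', 'https://www.sogou.com/web?query=python'))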
    

    Commonly used request headers

    User-Agent: identifies the client sending the request
    Connection: close
    Content-Type
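
    A minimal sketch (not from the original) of attaching these headers to a request with the requests module; the header values are only illustrative.

    import requests

    headers = {
        # identify the client as a regular browser
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        # ask the server to close the connection after responding
        'Connection': 'close',
        # only meaningful when a request body is sent
        'Content-Type': 'application/x-www-form-urlencoded',
    }
    response = requests.get('https://www.sogou.com', headers=headers)
    print(response.status_code)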
    

    How do you tell whether a page contains dynamically loaded data?

    Local search vs. global search (in the browser's packet-capture / Network tool)

    What is the first thing to do before crawling an unfamiliar site?
    Determine whether the data you want to scrape is dynamically loaded!!!
    

    2. Basic usage of the requests module

    The requests module
    Concept: a module based on network requests; it is used to simulate a browser issuing requests.
    Coding workflow:
    Specify the URL
    Send the request
    Get the response data (the scraped data)
    Persist the data
    
    import requests
    url = 'https://www.sogou.com'
    # the return value is a response object
    response = requests.get(url=url)
    # .text returns the response data as a string
    data = response.text
    with open('./sogou.html', "w", encoding='utf-8') as f:
        f.write(data)
    

    Build a simple web page collector on top of Sogou

    Fix the garbled-text (encoding) problem

    Deal with UA detection

    import requests
    
    wd = input('Enter a keyword: ')
    url = 'https://www.sogou.com/web'
    # holds the dynamic request parameters
    params = {
        'query': wd
    }
    # params wraps the query-string parameters appended to the request URL
    # headers defeats the UA check by spoofing a browser User-Agent
    headers = {
        'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
    }
    response = requests.get(url=url, params=params, headers=headers)
    # manually set the response encoding to fix garbled Chinese characters
    response.encoding = 'utf-8'
    
    data = response.text
    filename = wd + '.html'
    with open(filename, "w", encoding='utf-8') as f:
        f.write(data)
    print(wd, "downloaded successfully")
    
    

    1. Scrape detailed movie data from Douban

    Analysis

    When the page is scrolled to the bottom, an Ajax request is issued and a batch of movie data comes back.
    Dynamically loaded data: data obtained through an additional, separate request.
    Ajax can generate dynamically loaded data
    JS can generate dynamically loaded data
    
    import requests
    limit = input("How many top-ranked movies to fetch: ")
    url = 'https://movie.douban.com/j/chart/top_list'
    params = {
        "type": "5",
        "interval_id": "100:90",
        "action": "",
        "start": "0",
        "limit": limit
    }
    
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
    }
    response = requests.get(url=url, params=params, headers=headers)
    # .json() returns the response body deserialized into a Python object
    data_list = response.json()
    
    with open('douban.txt', "w", encoding='utf-8') as f:
        for i in data_list:
            name = i['title']
            score = i['score']
            f.write(name + " " + str(score) + "\n")
    print("Done")
    

    2. Scrape KFC store location data

    import requests
    
    url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
    params = {
        "cname": "",
        "pid": "",
        "keyword": "青岛",
        "pageIndex": "1",
        "pageSize": "10"
    }
    
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
    }
    response = requests.post(url=url, params=params, headers=headers)
    # .json() returns the response body deserialized into a Python object
    data_list = response.json()
    with open('kedeji.txt', "w", encoding='utf-8') as f:
        for i in data_list["Table1"]:
            name = i['storeName']
            address = i['addressDetail']
            f.write(name + "," + address + "\n")
    print("Done")
    

    3. Scrape data from the drug administration site

    import requests
    
    url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList"
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
    }
    with open('化妆品.txt', "w", encoding="utf-8") as f:
        for i in range(1, 5):
            params = {
                "on": "true",
                "page": str(i),
                "pageSize": "12",
                "productName": "",
                "conditionType": "1",
                "applyname": "",
                "applysn": ""
            }
    
            response = requests.post(url=url, params=params, headers=headers)
            data_dic = response.json()
    
            for item in data_dic["list"]:
                id = item['ID']
                post_url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById"
                post_data = {
                    "id": id
                }
                response2 = requests.post(url=post_url, params=post_data, headers=headers)
                data_dic2 = response2.json()
                title = data_dic2["epsName"]
                name = data_dic2['legalPerson']
    
                f.write(title + ":" + name + "\n")
    

    3. Data parsing

    Parsing: extracting data according to specified rules.

    Purpose: to implement a focused crawler.

    Coding workflow for a focused crawler:

    Specify the URL
    Send the request
    Get the response data
    Parse the data
    Persist the data
    

    Ways to parse data:

    Regular expressions
    bs4
    XPath
    pyquery (extension)
    

    What is the general principle behind data parsing?

    Parsing operates on the page source (a collection of HTML tags).
    

    What is the core purpose of HTML?

    To display data.
    
    

    How does HTML display data?

    The data HTML displays is always placed inside HTML tags or in their attributes.
    
    

    General principle (a small sketch follows):

    1. Locate the tag
    2. Take its text or take an attribute
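
    A minimal sketch (not from the original) of the two-step principle on a made-up HTML snippet, using the re module since regex parsing is covered next.

    import re

    html = '<div class="thumb"><a href="https://example.com/1.jpg">a picture</a></div>'
    # 1. locate the <a> tag inside the div, then 2. take an attribute or the text
    href = re.findall(r'<a href="(.*?)">', html)[0]          # take an attribute
    text = re.findall(r'<a href=".*?">(.*?)</a>', html)[0]   # take the text
    print(href, text)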
    
    

    1. Regex parsing

    1. Scrape image posts from Qiushibaike

    Scrape a single image

    import requests
    
    url = "https://pic.qiushibaike.com/system/pictures/12330/123306162/medium/GRF7AMF9GKDTIZL6.jpg"
    
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
    }
    response = requests.get(url=url, headers=headers)
    # .content returns the response data as bytes
    img_data = response.content
    with open('./123.jpg', "wb") as f:
        f.write(img_data)
    print("Done")
    
    
    
    

    Scrape a single page

    <div class="thumb">
    
    <a href="/article/123319109" target="_blank">
    <img src="//pic.qiushibaike.com/system/pictures/12331/123319109/medium/MOX0YDFJX7CM1NWK.jpg" alt="糗事#123319109" class="illustration" width="100%" height="auto">
    </a>
    </div>
    
    
    import re
    import os
    import requests
    
    dir_name = "./img"
    if not os.path.exists(dir_name):
        os.mkdir(dir_name)
    url = "https://www.qiushibaike.com/imgrank/"
    
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
    }
    img_text = requests.get(url, headers=headers).text
    ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
    img_list = re.findall(ex, img_text, re.S)
    for src in img_list:
        src = "https:" + src
        img_name = src.split('/')[-1]
        img_path = dir_name + "/" + img_name
        # request the image URL to get the image bytes
        response = requests.get(src, headers=headers).content
        with open(img_path, "wb") as f:
            f.write(response)
    print("成功")
    
    
    

    Scrape multiple pages

    import re
    import os
    import requests
    
    dir_name = "./img"
    if not os.path.exists(dir_name):
        os.mkdir(dir_name)
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
    }
    for i in range(1,5):
        url = f"https://www.qiushibaike.com/imgrank/page/{i}/"
        print(f"正在爬取第{i}页的图片")
        img_text = requests.get(url, headers=headers).text
        ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
        img_list = re.findall(ex, img_text, re.S)
        for src in img_list:
            src = "https:" + src
            img_name = src.split('/')[-1]
            img_path = dir_name + "/" + img_name
            # request the image URL to get the image bytes
            response = requests.get(src, headers=headers).content
            with open(img_path, "wb") as f:
                f.write(response)
    print("成功")
    
    

    2. bs4 parsing

    Environment setup

    pip install bs4
    pip install lxml   # parser backend for BeautifulSoup(..., 'lxml') used below
    
    

    How bs4 parsing works

    Instantiate a BeautifulSoup object (soup) and load the page source to be parsed into it,
    then call the object's attributes and methods to locate tags and extract data.
    
    

    How do you instantiate a BeautifulSoup object?

    BeautifulSoup(fp, 'lxml'): for parsing data in a locally stored HTML file
    BeautifulSoup(page_text, 'lxml'): for parsing page source requested from the internet
    
    

    Tag location

    soup.tagName: locates the first tagName tag; only the first match is returned
    
    

    Attribute-based location

    soup.find('div', class_='s'): returns the div tag whose class is s
    find_all: same usage as find, but returns a list
    
    

    Selector-based location

    select('selector'): returns a list
    	tag, class, id, hierarchy (> means one level down, a space means any number of levels)
    
    

    Extracting data

    Take text

    tag.string: the tag's direct text content only
    tag.text: all text content inside the tag
    
    

    Take an attribute

    soup.find("a", id='tt')['href']
    
    

    1. Scrape the text of Romance of the Three Kingdoms

    http://www.shicimingju.com/book/sanguoyanyi.html

    Scrape the chapter titles plus the chapter content

    1. On the index page, parse each chapter title and the URL of its detail page

    from bs4 import BeautifulSoup
    import requests
    
    url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
    }
    page_text = requests.get(url, headers=headers).text
    soup = BeautifulSoup(page_text, 'lxml')
    a_list = soup.select(".book-mulu a")
    with open('./sanguo.txt', 'w', encoding='utf-8') as f:
        for a in a_list:
            new_url = "http://www.shicimingju.com" + a["href"]
            mulu = a.text
            print(mulu)
            # request each chapter's detail-page URL and parse the chapter content out of it
            new_page_text = requests.get(new_url, headers=headers).text
            new_soup = BeautifulSoup(new_page_text, 'lxml')
            neirong = new_soup.find('div', class_='chapter_content').text
            f.write(mulu + ":" + neirong + "\n")
    
    

    3. XPath parsing

    Environment setup

    pip install lxml
    
    

    How XPath parsing works

    Instantiate an etree object and load the page source data into it,
    then call the object's xpath method with different forms of XPath expressions to locate tags and extract data.
    
    

    Instantiating an etree object

    tree = etree.parse(fileName)   # for a local HTML file
    tree = etree.HTML(page_text)   # for page source requested from the internet
    the xpath method always returns a list
    
    

    Tag location

    tree.xpath("")
    A leading / in an XPath expression means the tag must be located starting from the root node.
    A leading // means the tag can be located from any position in the document.
    A non-leading // means any number of intermediate levels.
    A non-leading / means exactly one level.
    
    Attribute-based location: //div[@class='ddd']
    
    Index-based location: //div[@class='ddd']/li[3]   # indexing starts at 1
    Index-based location: //div[@class='ddd']//li[2]  # indexing starts at 1
    
    

    Extracting data

    Take text:
    tree.xpath("//p[1]/text()"): takes the direct text content
    tree.xpath("//div[@class='ddd']/li[2]//text()"): takes all text content
    Take an attribute:
    tree.xpath('//a[@id="feng"]/@href')
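
    A minimal, self-contained sketch (not from the original) of the XPath rules above, run against a hand-written HTML string.

    from lxml import etree

    html = '''
    <html><body>
        <ul class="ddd">
            <li><a id="feng" href="https://example.com">first</a></li>
            <li>second</li>
            <li>third</li>
        </ul>
    </body></html>
    '''
    tree = etree.HTML(html)
    print(tree.xpath('//ul[@class="ddd"]/li[3]/text()'))    # index-based location, starts at 1
    print(tree.xpath('//ul[@class="ddd"]/li[1]//text()'))   # all text under the first li
    print(tree.xpath('//a[@id="feng"]/@href'))              # take an attribute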
    
    

    1. Scrape job listings from Boss Zhipin

    from lxml import etree
    import requests
    import time
    
    
    url = 'https://www.zhipin.com/job_detail/?query=python&city=101120200&industry=&position='
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
        'cookie':'__zp__pub__=; lastCity=101120200; __c=1594792470; __g=-; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1594713563,1594713587,1594792470; __l=l=%2Fwww.zhipin.com%2Fqingdao%2F&r=&friend_source=0&friend_source=0; __a=26925852.1594713563.1594713586.1594792470.52.3.39.52; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1594801318; __zp_stoken__=c508aZxdfUB9hb0Q8ORppIXd7JTdDTF96U3EdCDgIHEscYxUsVnoqdH9VBxY5GUtkJi5wfxggRDtsR0dAT2pEDDRRfWsWLg8WUmFyWQECQlYFSV4SCUQqUB8yfRwAUTAyZBc1ABdbRRhyXUY%3D'
    }
    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="main"]/div/div[2]/ul/li')
    for li in li_list:
        # extract the relevant data from the partial page source represented by each li
        # when an XPath expression is used inside the loop, it must start with ./ or .//
        detail_url = 'https://www.zhipin.com' + li.xpath('.//span[@class="job-name"]/a/@href')[0]
        job_title = li.xpath('.//span[@class="job-name"]/a/text()')[0]
        company = li.xpath('.//div[@class="info-company"]/div/h3/a/text()')[0]
        # request the detail-page URL and parse out the job description
        detail_page_text = requests.get(detail_url, headers=headers).text
        tree = etree.HTML(detail_page_text)
        job_desc = tree.xpath('//div[@class="text"]/text()')
        # join the list into a single string
        job_desc = ''.join(job_desc)
        print(job_title,company,job_desc)
        time.sleep(5)
    
    

    2. Scrape Qiushibaike

    Scrape the author and the post text. Note that authors are either anonymous or registered.

    from lxml import etree
    import requests
    
    
    url = "https://www.qiushibaike.com/text/page/4/"
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    }
    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    div_list = tree.xpath('//div[@class="col1 old-style-col1"]/div')
    print(div_list)
    
    for div in div_list:
        # usernames fall into anonymous users and registered users
        author = div.xpath('.//div[@class="author clearfix"]//h2/text() | .//div[@class="author clearfix"]/span[2]/h2/text()')[0]
        content = div.xpath('.//div[@class="content"]/span//text()')
        content = ''.join(content)
        print(author, content)
    
    
    

    3. Scrape images from a website

    from lxml import etree
    import requests
    import os
    dir_name = "./img2"
    if not os.path.exists(dir_name):
        os.mkdir(dir_name)
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    }
    for i in range(1, 6):
        if i == 1:
            url = "http://pic.netbian.com/4kmeinv/"
        else:
            url = f"http://pic.netbian.com/4kmeinv/index_{i}.html"
    
        page_text = requests.get(url, headers=headers).text
        tree = etree.HTML(page_text)
        li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
        for li in li_list:
            img_src = "http://pic.netbian.com/" + li.xpath('./a/img/@src')[0]
            img_name = li.xpath('./a/b/text()')[0]
            # fix garbled Chinese characters in the image name
            img_name = img_name.encode('iso-8859-1').decode('gbk')
            response = requests.get(img_src).content
            img_path = dir_name + "/" + f"{img_name}.jpg"
            with open(img_path, "wb") as f:
                f.write(response)
        print(f"第{i}页成功")
    
    

    4. IP proxies

    Proxy servers

    They forward requests on your behalf, which lets you change the IP address a request appears to come from.
    
    

    Proxy anonymity levels

    Transparent: the server knows you are using a proxy and knows your real IP
    Anonymous: the server knows you are using a proxy but does not know your real IP
    Elite (high anonymity): the server does not know you are using a proxy, let alone your real IP
    
    

    Proxy types

    http: this type of proxy can only forward HTTP requests
    
    https: can only forward HTTPS requests
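
    A minimal sketch (not from the original) of sending a request through a proxy with requests; the proxy address below is a placeholder, not a working proxy.

    import requests

    proxies = {
        'http': 'http://1.2.3.4:8888',   # used for http:// URLs
        'https': 'http://1.2.3.4:8888',  # used for https:// URLs
    }
    headers = {'User-Agent': 'Mozilla/5.0'}
    # httpbin echoes the origin IP, which makes it easy to check that the proxy took effect
    response = requests.get('https://httpbin.org/ip', headers=headers, proxies=proxies, timeout=5)
    print(response.text)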
    
    

    Sites offering free proxy IPs

    Kuaidaili
    Xici proxy
    goubanjia
    Daili Jingling (recommended): http://http.zhiliandaili.cn/
    

    What do you do when your IP gets banned while crawling?

    Use a proxy
    Build a proxy pool
    Use a dial-up server (to rotate IPs)
    
    
    import requests
    import random
    from lxml import etree
    
    # the proxy pool, kept as a list
    all_ips = []
    proxy_url = "http://t.11jsq.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=5&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=15"
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    }
    proxy_page_text = requests.get(url=proxy_url, headers=headers).text
    tree = etree.HTML(proxy_page_text)
    proxy_list = tree.xpath('//body//text()')
    for ip in proxy_list:
        dic = {'https': ip}
        all_ips.append(dic)
    # scrape the free proxy IPs listed on kuaidaili
    free_proxies = []
    for i in range(1, 3):
        url = f"http://www.kuaidaili.com/free/inha/{i}/"
        page_text = requests.get(url, headers=headers,proxies=random.choice(all_ips)).text
        tree = etree.HTML(page_text)
        # tbody must not appear in the XPath expression (browsers inject it; the raw source may not have it)
        tr_list = tree.xpath('//*[@id="list"]/table//tr')
        for tr in tr_list:
            ip = tr.xpath("./td/text()")[0]
            port = tr.xpath("./td[2]/text()")[0]
            dic = {
                "ip":ip,
                "port":port
            }
            print(dic)
            free_proxies.append(dic)
        print(f"第{i}页")
    print(len(free_proxies))
    
    

    5. Handling cookies

    Video parsing APIs

    https://www.wocao.xyz/index.php?url=
    https://2wk.com/vip.php?url=
    https://api.47ks.com/webcloud/?v-
    
    

    Video parsing sites

    Niubaba        http://mv.688ing.com/
    Aipian         https://ap2345.com/vip/
    Quanmin Jiexi  http://www.qmaile.com/
    
    

    Back to the main topic

    Why handle cookies?

    Cookies preserve client-side state.
    
    

    Requests carry cookies; how do you deal with cookie-based anti-crawling?

    # Manual handling
    Capture the cookie in a packet-capture tool and put it into the headers.
    
    
    
    # Automatic handling
    Use the session mechanism.
    Use case: cookies that change dynamically.
    Session object: used almost exactly like the requests module itself. If a cookie is produced during a request and that request was sent through the session, the cookie is automatically stored in the session.
    
    

    Scrape data from Xueqiu

    import requests
    
    s = requests.Session()
    main_url = "https://xueqiu.com"  # 先对url发请求获取cookie
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    }
    params = {
        "size": "8",
        '_type': "10",
        "type": "10"
    }
    s.get(main_url, headers=headers)
    url = 'https://stock.xueqiu.com/v5/stock/hot_stock/list.json?size=8&_type=10&type=10'
    
    page_text = s.get(url, headers=headers).json()
    print(page_text)
    
    

    6. CAPTCHA recognition

    Recognition via an online CAPTCHA-solving platform

    1. Register and log in (complete identity verification in the user center)

    2. After logging in:

    Create a software entry: Software ID -> generate a software ID

    Download the sample code: Developer Docs -> Python -> Download

    Demo of the platform's sample code

    import requests
    from hashlib import md5
    
    
    class Chaojiying_Client(object):
        def __init__(self, username, password, soft_id):
            self.username = username
            password = password.encode('utf8')
            self.password = md5(password).hexdigest()
            self.soft_id = soft_id
            self.base_params = {
                'user': self.username,
                'pass2': self.password,
                'softid': self.soft_id,
            }
            self.headers = {
                'Connection': 'Keep-Alive',
                'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
            }
    
        def PostPic(self, im, codetype):
            params = {
                'codetype': codetype,
            }
            params.update(self.base_params)
            files = {'userfile': ('ccc.jpg', im)}
            r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                              headers=self.headers)
            return r.json()
    
        def ReportError(self, im_id):
            params = {
                'id': im_id,
            }
            params.update(self.base_params)
            r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
            return r.json()
    
    
    chaojiying = Chaojiying_Client('chaojiying username', 'chaojiying password', '96001')
    im = open('a.jpg', 'rb').read()
    print(chaojiying.PostPic(im, 1902)['pic_str'])
    
    

    Recognize the CAPTCHA on the gushiwen (classical poetry) site

    zbb.py

    import requests
    from hashlib import md5
    
    
    class Chaojiying_Client(object):
        def __init__(self, username, password, soft_id):
            self.username = username
            password = password.encode('utf8')
            self.password = md5(password).hexdigest()
            self.soft_id = soft_id
            self.base_params = {
                'user': self.username,
                'pass2': self.password,
                'softid': self.soft_id,
            }
            self.headers = {
                'Connection': 'Keep-Alive',
                'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
            }
    
        def PostPic(self, im, codetype):
            params = {
                'codetype': codetype,
            }
            params.update(self.base_params)
            files = {'userfile': ('ccc.jpg', im)}
            r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                              headers=self.headers)
            return r.json()
    
        def ReportError(self, im_id):
            params = {
                'id': im_id,
            }
            params.update(self.base_params)
            r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
            return r.json()
    
    
    def www(path, type):
        # helper used by the later scripts: send the image at `path` to Chaojiying
        # and return the recognized string; `type` is the Chaojiying code type
        chaojiying = Chaojiying_Client('5423', '521521', '906630')
        im = open(path, 'rb').read()
        return chaojiying.PostPic(im, type)['pic_str']
    
    

    requests.py (note: naming a script requests.py shadows the requests library on import; better to rename it, e.g. main.py)

    import requests
    from lxml import etree
    from zbb import www
    
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    }
    url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    img_url = "https://so.gushiwen.cn/" + tree.xpath('//*[@id="imgCode"]/@src')[0]
    img_data = requests.get(img_url,headers=headers).content
    with open('./111.jpg','wb') as f:
        f.write(img_data)
    img_text = www('./111.jpg',1004)
    print(img_text)
    
    

    7. Simulated login

    Why does a crawler need to simulate logging in?

    Some data is only shown after you are logged in.
    
    

    The gushiwen site

    Anti-crawling mechanisms involved

    1. A CAPTCHA
    2. Dynamic request parameters: the request parameters change with every request
    	Dynamic capture: the dynamic parameters are usually hidden in the front-end page source
    3. The cookie is set by the request for the CAPTCHA image
     (a real pain)
    
    
    import requests
    from lxml import etree
    from zbb import www
    
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    }
    # obtain the cookie
    s = requests.Session()
    # s_url = "https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx"
    # s.get(s_url, headers=headers)
    
    # fetch the CAPTCHA image
    url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    img_url = "https://so.gushiwen.cn/" + tree.xpath('//*[@id="imgCode"]/@src')[0]
    img_data = s.get(img_url, headers=headers).content
    with open('./111.jpg', 'wb') as f:
        f.write(img_data)
    img_text = www('./111.jpg', 1004)
    print(img_text)
    
    # dynamically capture the dynamic request parameters
    __VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
    __VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]
    
    # the URL requested when the login button is clicked, captured with a packet-capture tool
    login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
    data = {
        "__VIEWSTATE": __VIEWSTATE,
        "__VIEWSTATEGENERATOR": __VIEWSTATEGENERATOR,  # 变化的
        "from": "http://so.gushiwen.cn/user/collect.aspx",
        "email": "542154983@qq.com",
        "pwd": "zxy521",
        "code": img_text,
        "denglu": "登录"
    }
    main_page_text = s.post(login_url, headers=headers, data=data).text
    with open('main.html', 'w', encoding='utf-8') as fp:
        fp.write(main_page_text)
    
    

    8. Asynchronous crawling with a thread pool

    Use a thread pool to asynchronously crawl the first ten pages of Qiushibaike

    import requests
    from multiprocessing.dummy import Pool
    
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    }
    # collect the URLs into a list
    urls = []
    for i in range(1, 11):
        urls.append(f'https://www.qiushibaike.com/8hr/page/{i}/')
    
    # the request function that will be submitted to the pool
    def get_request(url):
        # the callable passed to pool.map must take exactly one argument
        return requests.get(url, headers=headers).text
    # create a pool of 10 threads
    pool = Pool(10)
    response_text_list = pool.map(get_request,urls)
    print(response_text_list)
    
    

    9. Single thread + multi-task async coroutines

    1. Introduction

    Coroutine: an object

    # A coroutine can be thought of as a special function. If a function definition is modified with the async keyword, calling it does not run the function body immediately; instead the call returns a coroutine object.
    
    

    Task object (task)

    # A task object is a further wrapper around a coroutine object; it can expose the coroutine's running state.
    # Task objects ultimately have to be registered with the event loop object.
    
    

    Binding a callback

    # A callback function is bound to a task object; it runs only after the task's special function has finished executing.
    
    

    The event loop object

    # An object that loops endlessly; you can also think of it as a container holding multiple task objects (i.e. a group of code blocks waiting to run).
    
    

    Where the asynchrony shows up

    # Once the event loop is started, it executes each task object in order;
        # when a task object hits a blocking operation, the event loop does not wait but moves straight on to the next task object.
    
    

    await: a suspension point; it yields control of the CPU.

    Single task

    from time import sleep
    import asyncio
    
    
    # callback function:
    # its only argument is the task object
    def callback(task):
        print('i am callback!!1')
        print(task.result())  # result() returns the return value of the task's special function
    
    
    async def get_request(url):
        print('requesting:', url)
        sleep(2)
        print('request finished:', url)
        return 'hello bobo'
    
    
    # create a coroutine object
    c = get_request('www.1.com')
    # wrap it in a task object
    task = asyncio.ensure_future(c)
    
    # bind the callback function to the task object
    task.add_done_callback(callback)
    
    # create an event loop object
    loop = asyncio.get_event_loop()
    loop.run_until_complete(task)  # register the task with the event loop and start the loop
    
    
    

    2. Multi-task async coroutines

    import asyncio
    from time import sleep
    import time
    start = time.time()
    urls = [
        'http://localhost:5000/a',
        'http://localhost:5000/b',
        'http://localhost:5000/c'
    ]
    # the code blocks awaiting execution must not contain code from modules that do not support async
    # any blocking operation inside this function must be modified with the await keyword
    async def get_request(url):
        print('requesting:', url)
        # sleep(2)
        await asyncio.sleep(2)
        print('request finished:', url)
        return 'hello bobo'
    
    tasks = []  # holds all the task objects
    for url in urls:
        c = get_request(url)
        task = asyncio.ensure_future(c)
        tasks.append(task)
    
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    
    print(time.time()-start) 
    
    

    Notes:

    1. Store the task objects in a list and register that list with the event loop; during registration the list must be processed by the wait method.
    2. The special function behind a task must not contain code from modules that do not support async, or the whole async effect is broken; and every blocking operation inside that function must be modified with the await keyword.
    3. requests code must not appear inside the special function, because requests is a module that does not support async.
    
    

    3. aiohttp

    A network-request module that supports asynchronous operation

    - Environment setup: pip install aiohttp
    
    import asyncio
    import requests
    import time
    import aiohttp
    from lxml import etree
    
    urls = [
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo',
    ]
    
    
    # requests cannot produce the async effect because it does not support async; aiohttp is used instead
    async def req(url):
        async with aiohttp.ClientSession() as s:
            async with await s.get(url) as response:
                # response.read() returns bytes
                page_text = await response.text()
                return page_text
    
        # detail: put async in front of every with, and await in front of every blocking step
    
    
    def parse(task):
        page_text = task.result()
        tree = etree.HTML(page_text)
        name = tree.xpath('//p/text()')[0]
        print(name)
    
    
    if __name__ == '__main__':
        start = time.time()
        tasks = []
        for url in urls:
            c = req(url)
            task = asyncio.ensure_future(c)
            task.add_done_callback(parse)
            tasks.append(task)
    
        loop = asyncio.get_event_loop()
        loop.run_until_complete(asyncio.wait(tasks))
    
        print(time.time() - start)
    
    
    

    10. selenium

    Concept

    A module based on browser automation.
    
    

    Environment setup:

    Install the selenium module (pip install selenium)
    
    

    How does selenium relate to crawling?

    It is a convenient way to get dynamically loaded page data
         scraping with requests: what you see is not necessarily what you get
         selenium: what you see is what you get
    It can perform simulated logins
    
    

    Basic operations:

    Chrome driver download address:
    http://chromedriver.storage.googleapis.com/index.html
    
    Mapping between chromedriver versions and Chrome versions:
    https://blog.csdn.net/huilan_same/article/details/51896672
    
    

    Action chains

    A sequence of behavioral actions
    
    

    Headless browsers

    Browsers with no visible UI
    PhantomJS
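
    A minimal sketch (not from the original) of running Chrome in headless mode with selenium instead of PhantomJS (which is no longer maintained), in the Selenium 3 style used throughout this post; the driver path is a placeholder.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument('--headless')    # run without a visible window
    options.add_argument('--disable-gpu')
    bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe', options=options)
    bro.get('https://www.sogou.com')
    print(bro.page_source[:200])          # the page source is still available headlessly
    bro.quit()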
    
    

    1. Basic-operations example on JD.com

    from selenium import webdriver
    from time import sleep
    # 1. instantiate a browser object
    bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
    # 2. issue a request the way a user would
    url = 'https://www.jd.com'
    bro.get(url)
    # 3. locate the tag
    search_input = bro.find_element_by_id('key')
    # 4. interact with the located tag
    search_input.send_keys('华为')
    # 5. a sequence of actions
    btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
    btn.click()
    sleep(2)
    # 6. execute JS code
    jsCode = 'window.scrollTo(0,document.body.scrollHeight)'
    bro.execute_script(jsCode)
    sleep(3)
    # 7. close the browser
    bro.quit()
    
    

    2. Scrape the drug administration site

    from selenium import webdriver
    from lxml import etree
    from time import sleep
    
    page_text_list = []
    # instantiate a browser object
    bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
    url = 'http://125.35.6.84:81/xk/'
    bro.get(url)
    # must wait for the page to finish loading
    sleep(2)
    # page_source is the source of the page the browser currently has open
    
    page_text = bro.page_source
    page_text_list.append(page_text)
    # the next-page button must be visible in the window before it can be clicked, so scroll down first
    jsCode = 'window.scrollTo(0,document.body.scrollHeight)'
    bro.execute_script(jsCode)
    # fetch the next two pages
    for i in range(2):
        bro.find_element_by_id('pageIto_next').click()
        sleep(2)
    
        page_text = bro.page_source
        page_text_list.append(page_text)
    
    for p in page_text_list:
        tree = etree.HTML(p)
        li_list = tree.xpath('//*[@id="gzlist"]/li')
        for li in li_list:
            name = li.xpath('./dl/@title')[0]
            print(name)
    sleep(2)
    bro.quit()
    
    

    3. Action chains

    from lxml import etree
    from time import sleep
    from selenium import webdriver
    from selenium.webdriver import ActionChains
    
    # instantiate a browser object
    page_text_list = []
    bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
    url = 'https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
    bro.get(url)
    # if the target tag lives inside the sub-page of an iframe, you must switch_to the frame before locating the tag
    bro.switch_to.frame('iframeResult')
    div_tag = bro.find_element_by_id('draggable')
    
    # 1. instantiate an action chain object
    action = ActionChains(bro)
    action.click_and_hold(div_tag)
    
    for i in range(5):
        # perform() makes the queued actions execute immediately
        action.move_by_offset(17, 0).perform()
        sleep(0.5)
    # release the mouse button
    action.release()
    
    sleep(3)
    
    bro.quit()
    
    

    4. Dealing with selenium detection

    Many sites, such as Taobao, block selenium-driven crawling.

    In a normal browser, evaluating window.navigator.webdriver returns undefined;

    in a browser opened by code it returns true.

    from selenium import webdriver
    from selenium.webdriver import ChromeOptions
    option = ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])
    
    # instantiate a browser object
    bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe', options=option)
    bro.get('https://www.taobao.com/')
    
    
    

    5. Simulate logging in to 12306

    from selenium import webdriver
    from selenium.webdriver import ActionChains
    from PIL import Image  # used for cropping the screenshot (pillow)
    from zbb import www
    from time import sleep
    
    bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
    bro.get('https://kyfw.12306.cn/otn/resources/login.html')
    sleep(5)
    zhdl = bro.find_element_by_xpath('/html/body/div[2]/div[2]/ul/li[2]/a')
    zhdl.click()
    sleep(1)
    
    username = bro.find_element_by_id('J-userName')
    username.send_keys('181873')
    pwd = bro.find_element_by_id('J-password')
    pwd.send_keys('zx1')
    # take a screenshot of the page so the CAPTCHA image can be cropped out of it
    bro.save_screenshot('main.png')
    # locate the tag of the CAPTCHA image
    code_img_ele = bro.find_element_by_xpath('//*[@id="J-loginImg"]')
    location = code_img_ele.location  # the CAPTCHA image's top-left coordinates within the page
    size = code_img_ele.size  # the CAPTCHA image's width and height
    # the rectangle to crop (coordinates of the top-left and bottom-right corners)
    rangle = (
    int(location['x']), int(location['y']), int(location['x'] + size['width']), int(location['y'] + size['height']))
    
    i = Image.open('main.png')
    frame = i.crop(rangle)
    frame.save('code.png')
    
    # use the CAPTCHA-solving platform to recognize the cropped image
    result = www('./code.png', 9004)
      # x1,y1|x2,y2|x3,y3  ==> [[x1,y1],[x2,y2],[x3,y3]]
    all_list = []  # each element is one point's [x, y]; the (0, 0) origin is the top-left corner of the CAPTCHA image
    if '|' in result:
        list_1 = result.split('|')
        count_1 = len(list_1)
        for i in range(count_1):
            xy_list = []
            x = int(list_1[i].split(',')[0])
            y = int(list_1[i].split(',')[1])
            xy_list.append(x)
            xy_list.append(y)
            all_list.append(xy_list)
    else:
        x = int(result.split(',')[0])
        y = int(result.split(',')[1])
        xy_list = []
        xy_list.append(x)
        xy_list.append(y)
        all_list.append(xy_list)
    print(all_list)
    action = ActionChains(bro)
    for l in all_list:
        x = l[0]
        y = l[1]
        action.move_to_element_with_offset(code_img_ele, x, y).click().perform()
        sleep(2)
    
    btn = bro.find_element_by_xpath('//*[@id="J-login"]')
    btn.click()
    
    
    action.release()
    sleep(3)
    bro.quit()
    
    
    