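  • Web Crawler Basics

    1. Overview of web crawlers

    What a crawler is:

    A crawler is a program that imitates a browser visiting the web and then fetches (scrapes) data from the internet.
    On "imitates": a browser is itself the most basic crawling tool.
    

    Types of crawlers:

    General-purpose crawler: fetches the data of an entire page. It is the fetching system (the crawler program).
    Focused crawler: fetches only part of the data on a page. It is always built on top of a general-purpose crawler.
    Incremental crawler: monitors a site for updates so that only the newly published data is fetched.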
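
    The incremental idea can be sketched as a store of fingerprints of records that were already crawled. This is only a minimal illustration: the class name SeenStore and the sample paths are hypothetical, and a real crawler would normally keep the fingerprints in a database or Redis rather than in memory.

```python
import hashlib


class SeenStore:
    """Remembers fingerprints of records that were already crawled."""

    def __init__(self):
        self._seen = set()

    def is_new(self, record: str) -> bool:
        """Return True the first time a record is offered, False afterwards."""
        fingerprint = hashlib.sha256(record.encode("utf-8")).hexdigest()
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True


store = SeenStore()
# First run: everything is new.
first_pass = [u for u in ["/post/1", "/post/2"] if store.is_new(u)]
# Second run: only the record that appeared since the first run is kept.
second_pass = [u for u in ["/post/2", "/post/3"] if store.is_new(u)]
print(first_pass)
print(second_pass)
```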
    

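    Risk analysis

    Use crawlers responsibly.
    Where the risks come from:
    The crawler interferes with the normal operation of the site it visits;
    The crawler collects specific kinds of data or information that are protected by law.
    How to avoid the risks:
    Strictly follow the robots protocol published by the site;
    While working around anti-crawling measures, optimize your own code so that it does not disturb the normal operation of the site;
    Review what you have collected before using or sharing it. If it contains users' personal information, private data, or someone else's trade secrets, stop immediately and delete it.
    

    Anti-crawling mechanisms

    Counter-strategies against anti-crawling
    robots.txt protocol: a plain-text convention in which the site states which data may and may not be crawled.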
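
    Checking a URL against robots.txt can be done with the standard library. The sketch below parses a hypothetical robots.txt offline; the rules, the user-agent name my-crawler, and the example.com URLs are assumptions made for illustration. Against a real site you would call set_url() and read() instead of parse().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for the example.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch() answers whether the given user agent may request the URL.
allowed = parser.can_fetch("my-crawler", "https://example.com/articles/1")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")
print(allowed, blocked)
```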
    

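    Commonly used headers

    User-Agent: identifies the client that sends the request
    Connection: close
    content-type
    

    How can you tell whether a page contains dynamically loaded data?

    In the browser's network panel, use a local search (within the response of the page request itself) and a global search (across all captured requests).

    What is the first step before crawling an unfamiliar site?
    Determine whether the data you want is dynamically loaded!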
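
    The same check can be done in code: take a piece of text that is visible in the browser and see whether it occurs in the raw source returned for the page URL. The helper name and the sample HTML below are hypothetical; in practice page_source would be response.text from a request to the page.

```python
def is_in_initial_html(page_source: str, sample_text: str) -> bool:
    """Return True if text seen in the browser is already in the raw page source."""
    return sample_text in page_source


# Hypothetical raw source: the list container is empty and gets filled by a script later.
raw_html = "<html><body><div id='list'></div><script src='load.js'></script></body></html>"

# A title that is visible in the browser but missing here was loaded dynamically.
print(is_in_initial_html(raw_html, "The Shawshank Redemption"))
print(is_in_initial_html(raw_html, "id='list'"))
```

    If the text is missing from the raw source, look for the extra request that returns it and crawl that URL instead.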
    

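    2. Basic use of the requests module

    The requests module
    What it is: a module for sending network requests. Its job is to imitate a browser issuing a request.
    Coding workflow:
    Specify the URL
    Send the request
    Get the response data (the crawled data)
    Store it persistently
    
    import requests
    url = 'https://www.sogou.com'
    # The return value is a response object
    response = requests.get(url=url)
    # .text returns the response body as a string
    data = response.text
    with open('./sogou.html', "w", encoding='utf-8') as f:
        f.write(data)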
    

    A simple page collector built on Sogou

    Fixing garbled text

    Getting past User-Agent checks

    import requests
    
    wd = input('Enter a keyword: ')
    url = 'https://www.sogou.com/web'
    # Holds the dynamic request parameters
    params = {
        'query': wd
    }
    # params wraps the query-string parameters of the request URL
    # headers works around the anti-crawling check by spoofing the User-Agent
    headers = {
        'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
    }
    response = requests.get(url=url, params=params, headers=headers)
    # Set the response encoding by hand so that Chinese text is not garbled
    response.encoding = 'utf-8'
    
    data = response.text
    filename = wd + '.html'
    with open(filename, "w", encoding='utf-8') as f:
        f.write(data)
    print(wd, "downloaded successfully")
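
    To see what the params dictionary turns into, the query string can be built by hand with the standard library. This is only an offline illustration of the URL encoding that a params dictionary undergoes; the keyword is a made-up example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://www.sogou.com/web"
params = {"query": "python 爬虫"}  # example keyword, not from the original post

# urlencode percent-encodes non-ASCII text as UTF-8 and turns spaces into '+'.
full_url = base_url + "?" + urlencode(params)
print(full_url)

# Decoding the query string gives the original value back.
decoded = parse_qs(urlsplit(full_url).query)
print(decoded)
```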
    
    

    1. Scraping movie details from Douban

    Analysis

    When the page is scrolled to the bottom, an ajax request is sent and returns a batch of movie data.
    Dynamically loaded data: data that arrives through a separate, additional request.
    It can be produced by ajax
    or generated by js.
    
    import requests
    limit = input("How many top-ranked movies to fetch: ")
    url = 'https://movie.douban.com/j/chart/top_list'
    params = {
        "type": "5",
        "interval_id": "100:90",
        "action": "",
        "start": "0",
        "limit": limit
    }
    
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
    }
    response = requests.get(url=url, params=params, headers=headers)
    # .json() returns the deserialized object (here a list of dictionaries)
    data_list = response.json()
    
    with open('douban.txt', "w", encoding='utf-8') as f:
        for movie in data_list:
            name = movie['title']
            score = movie['score']
            # str() guards against a numeric score
            f.write(name + " " + str(score) + "\n")
    print("Done")
    

    2. Scraping KFC store locations

    import requests
    
    url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
    # Form fields of the POST request, sent in the request body via data=
    data = {
        "cname": "",
        "pid": "",
        "keyword": "青岛",  # Qingdao
        "pageIndex": "1",
        "pageSize": "10"
    }
    
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
    }
    response = requests.post(url=url, data=data, headers=headers)
    # .json() returns the deserialized object
    data_list = response.json()
    with open('kedeji.txt', "w", encoding='utf-8') as f:
        for store in data_list["Table1"]:
            name = store['storeName']
            address = store['addressDetail']
            f.write(name + "," + address + "\n")
    print("Done")
    

    3. Scraping data from the drug administration site

    import requests
    
    url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList"
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
    }
    with open('化妆品.txt', "w", encoding="utf-8") as f:
        for page in range(1, 5):
            # Form fields for one page of the listing
            data = {
                "on": "true",
                "page": str(page),
                "pageSize": "12",
                "productName": "",
                "conditionType": "1",
                "applyname": "",
                "applysn": ""
            }
    
            response = requests.post(url=url, data=data, headers=headers)
            data_dic = response.json()
    
            for item in data_dic["list"]:
                # The ID of each entry is used to request its detail record
                post_url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById"
                post_data = {
                    "id": item['ID']
                }
                response2 = requests.post(url=post_url, data=post_data, headers=headers)
                data_dic2 = response2.json()
                title = data_dic2["epsName"]
                name = data_dic2['legalPerson']
    
                f.write(title + ":" + name + "\n")
    

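    3. Data parsing

    Parsing: extracting data according to specified rules

    Purpose: it is what makes a focused crawler possible

    Coding workflow of a focused crawler:

    Specify the URL
    Send the request
    Get the response data
    Parse the data
    Store it persistently
    

    Ways to parse data:

    Regular expressions
    bs4
    xpath
    pyquery (extra)
    

    What is the general principle behind data parsing?

    Parsing operates on the page source (a set of html tags)
    

    What is html mainly for?

    Displaying data
    
    

    How does html display data?

    The data that html displays is always placed either inside html tags or in their attributes
    
    

    General principle:

    1. Locate the tag
    2. Extract its text or an attribute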
    
    

    1. Parsing with regular expressions

    1. Scraping images from Qiushibaike

    Scraping a single image

    import requests
    
    url = "https://pic.qiushibaike.com/system/pictures/12330/123306162/medium/GRF7AMF9GKDTIZL6.jpg"
    
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
    }
    response = requests.get(url=url, headers=headers)
    # .content returns the response body as bytes
    img_data = response.content
    with open('./123.jpg', "wb") as f:
        f.write(img_data)
    print("Done")
    
    
    
    

    Scraping a single page

    <div class="thumb">
    
    <a href="/article/123319109" target="_blank">
    <img src="//pic.qiushibaike.com/system/pictures/12331/123319109/medium/MOX0YDFJX7CM1NWK.jpg" alt="糗事#123319109" class="illustration" width="100%" height="auto">
    </a>
    </div>
    
    
    import re
    import os
    import requests
    
    dir_name = "./img"
    if not os.path.exists(dir_name):
        os.mkdir(dir_name)
    url = "https://www.qiushibaike.com/imgrank/"
    
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
    }
    # headers must be passed by keyword: the second positional argument of requests.get is params
    img_text = requests.get(url, headers=headers).text
    ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
    img_list = re.findall(ex, img_text, re.S)
    for src in img_list:
        src = "https:" + src
        img_name = src.split('/')[-1]
        img_path = dir_name + "/" + img_name
        # Request the image URL to get the image bytes
        img_data = requests.get(src, headers=headers).content
        with open(img_path, "wb") as f:
            f.write(img_data)
    print("Done")
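
    The regular expression can be tried offline against the sample markup shown before the script. The sketch below reuses that snippet and the same pattern; re.S lets . match newlines, so one match can span the whole div.

```python
import re

# The sample markup from the page, stored as a string for an offline test.
sample_html = '''<div class="thumb">

<a href="/article/123319109" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12331/123319109/medium/MOX0YDFJX7CM1NWK.jpg" alt="糗事#123319109" class="illustration" width="100%" height="auto">
</a>
</div>'''

pattern = r'<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'

# The capture group keeps only the src value; the scheme is missing, so prepend it.
img_urls = ["https:" + src for src in re.findall(pattern, sample_html, re.S)]
file_names = [u.split("/")[-1] for u in img_urls]
print(img_urls)
print(file_names)
```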
    
    
    

    Scraping multiple pages

    import re
    import os
    import requests
    
    dir_name = "./img"
    if not os.path.exists(dir_name):
        os.mkdir(dir_name)
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
    }
    for i in range(1,5):
        url = f"https://www.qiushibaike.com/imgrank/page/{i}/"
        print(f"Scraping the images on page {i}")
        img_text = requests.get(url, headers=headers).text
        ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
        img_list = re.findall(ex, img_text, re.S)
        for src in img_list:
            src = "https:" + src
            img_name = src.split('/')[-1]
            img_path = dir_name + "/" + img_name
            # Request the image URL to get the image bytes (headers passed by keyword)
            img_data = requests.get(src, headers=headers).content
            with open(img_path, "wb") as f:
                f.write(img_data)
    print("Done")
    
    

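    2. Parsing with bs4

    Installation

    pip install bs4
    pip install lxml   (needed for the 'lxml' parser used below)
    

    How bs4 parsing works

    Instantiate a BeautifulSoup object (soup) and load the page source to be parsed into it,
    then call the object's attributes and methods to locate tags and extract data
    
    

    How to instantiate a BeautifulSoup object

    BeautifulSoup(fp,'lxml'): for parsing an html document stored locally (fp is an open file object)
    BeautifulSoup(page_text,'lxml'): for parsing page source fetched from the internet
    
    

    Locating by tag

    soup.tagName: returns the first tagName tag in the document
    
    

    Locating by attribute

    soup.find('div',class_='s'): returns the first div tag whose class is s
    find_all: used the same way as find, but returns a list of all matches
    
    

    Locating by selector

    select('selector'): returns a list
    	Supports tag, class, id, and hierarchy selectors (> means one level, a space means any number of levels)
    
    

    Extracting data

    Getting text

    tag.string: the text directly inside the tag (None when the tag has more than one child)
    tag.text: all text inside the tag, including nested tags
    
    

    Getting an attribute

    soup.find("a",id='tt')['href']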
    
    

    1. Scraping the text of Romance of the Three Kingdoms

    http://www.shicimingju.com/book/sanguoyanyi.html

    Scrape each chapter title and the chapter text

    1. Parse the chapter titles and the detail-page URL of every chapter from the index page

    from bs4 import BeautifulSoup
    import requests
    
    url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
    }
    page_text = requests.get(url, headers=headers).text
    soup = BeautifulSoup(page_text, 'lxml')
    a_list = soup.select(".book-mulu a")
    with open('./sanguo.txt', 'w', encoding='utf-8') as f:
        for a in a_list:
            new_url = "http://www.shicimingju.com" + a["href"]
            mulu = a.text  # chapter title
            print(mulu)
            # Request the chapter's detail page and parse the chapter text from it
            # (headers must be passed by keyword, otherwise they are sent as params)
            new_page_text = requests.get(new_url, headers=headers).text
            new_soup = BeautifulSoup(new_page_text, 'lxml')
            neirong = new_soup.find('div', class_='chapter_content').text  # chapter text
            f.write(mulu + ":" + neirong + "\n")
    
    

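    3. Parsing with xpath

    Installation

    pip install lxml
    
    

    How xpath parsing works

    Instantiate an etree object and load the page source into it,
    then call the object's xpath method with different xpath expressions to locate tags and extract data
    
    

    Instantiating an etree object

    tree = etree.parse(fileName)
    tree = etree.HTML(page_text)
    For the node-selecting expressions used here, the xpath method always returns a list
    
    

    Locating tags

    tree.xpath("")
    A leading / in an xpath expression means that the tag must be located starting from the root node
    A leading // means that the tag may be located from any position
    A // that is not at the start means any number of levels
    A / that is not at the start means exactly one level
    
    Locating by attribute: //div[@class='ddd']
    
    Locating by index: //div[@class='ddd']/li[3] # indexes start at 1
    Locating by index: //div[@class='ddd']//li[2] # indexes start at 1
    
    

    Extracting data

    Getting text:
    tree.xpath("//p[1]/text()"): the direct text of the tag
    tree.xpath("//div[@class='ddd']/li[2]//text()"): all text inside the tag
    Getting an attribute:
    tree.xpath('//a[@id="feng"]/@href')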
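
    The locating rules can be tried without any installation by using the limited XPath subset in the standard library's xml.etree.ElementTree. This is a stand-in for lxml, not the same API: ElementTree has no /text() or /@href steps, so text and attributes are read from the element instead, and the markup below is a made-up, well-formed sample.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<html><body>"
    "<div class='ddd'>"
    "<li>one</li><li>two <b>bold</b></li><li><a id='feng' href='/x'>three</a></li>"
    "</div>"
    "<div class='other'><li>four</li></div>"
    "</body></html>"
)

# .// searches at any depth below the current node; [@class='ddd'] filters by attribute.
items = doc.findall(".//div[@class='ddd']/li")
count = len(items)

# Positions start at 1, so li[2] is the second <li>.
second = doc.find(".//div[@class='ddd']/li[2]")
direct_text = second.text                # text directly inside the tag
all_text = "".join(second.itertext())    # text of the tag and everything nested in it

# Attribute values are read from the element rather than with /@href.
href = doc.find(".//a[@id='feng']").get("href")
print(count, repr(direct_text), repr(all_text), href)
```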
    
    

    1. Scraping job postings from BOSS Zhipin

    from lxml import etree
    import requests
    import time
    
    
    url = 'https://www.zhipin.com/job_detail/?query=python&city=101120200&industry=&position='
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
        'cookie':'__zp__pub__=; lastCity=101120200; __c=1594792470; __g=-; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1594713563,1594713587,1594792470; __l=l=%2Fwww.zhipin.com%2Fqingdao%2F&r=&friend_source=0&friend_source=0; __a=26925852.1594713563.1594713586.1594792470.52.3.39.52; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1594801318; __zp_stoken__=c508aZxdfUB9hb0Q8ORppIXd7JTdDTF96U3EdCDgIHEscYxUsVnoqdH9VBxY5GUtkJi5wfxggRDtsR0dAT2pEDDRRfWsWLg8WUmFyWQECQlYFSV4SCUQqUB8yfRwAUTAyZBc1ABdbRRhyXUY%3D'
    }
    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="main"]/div/div[2]/ul/li')
    for li in li_list:
        # Extract the relevant data from the partial page source that li represents
        # When an xpath expression is applied inside a loop, it must start with ./ or .//
        detail_url = 'https://www.zhipin.com' + li.xpath('.//span[@class="job-name"]/a/@href')[0]
        job_title = li.xpath('.//span[@class="job-name"]/a/text()')[0]
        company = li.xpath('.//div[@class="info-company"]/div/h3/a/text()')[0]
        # Request the detail-page URL and parse out the job description
        detail_page_text = requests.get(detail_url, headers=headers).text
        detail_tree = etree.HTML(detail_page_text)
        job_desc = detail_tree.xpath('//div[@class="text"]/text()')
        # Join the list into one string
        job_desc = ''.join(job_desc)
        print(job_title, company, job_desc)
        time.sleep(5)
    
    

    2. Scraping Qiushibaike

    Scrape the author and the text of each post. Note that an author can be anonymous or registered.

    from lxml import etree
    import requests
    
    
    url = "https://www.qiushibaike.com/text/page/4/"
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    }
    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    div_list = tree.xpath('//div[@class="col1 old-style-col1"]/div')
    print(div_list)
    
    for div in div_list:
        # Authors are either anonymous or registered users, so two expressions are combined with |
        author = div.xpath('.//div[@class="author clearfix"]//h2/text() | .//div[@class="author clearfix"]/span[2]/h2/text()')[0]
        content = div.xpath('.//div[@class="content"]/span//text()')
        content = ''.join(content)
        print(author, content)
    
    
    

    3. Scraping images from a website

    from lxml import etree
    import requests
    import os
    dir_name = "./img2"
    if not os.path.exists(dir_name):
        os.mkdir(dir_name)
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    }
    for i in range(1, 6):
        if i == 1:
            url = "http://pic.netbian.com/4kmeinv/"
        else:
            url = f"http://pic.netbian.com/4kmeinv/index_{i}.html"
    
        page_text = requests.get(url, headers=headers).text
        tree = etree.HTML(page_text)
        li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
        for li in li_list:
            img_src = "http://pic.netbian.com/" + li.xpath('./a/img/@src')[0]
            img_name = li.xpath('./a/b/text()')[0]
            # Fix the garbled Chinese: the text was decoded as ISO-8859-1, but the page is GBK-encoded
            img_name = img_name.encode('iso-8859-1').decode('gbk')
            response = requests.get(img_src).content
            img_path = dir_name + "/" + f"{img_name}.jpg"
            with open(img_path, "wb") as f:
                f.write(response)
        print(f"Page {i} done")
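
    Why the encode/decode round trip repairs the name can be shown offline. The sample title below is made up; the sketch first reproduces the damage (GBK bytes read as ISO-8859-1) and then reverses it.

```python
original = "高清美女壁纸"  # hypothetical image title

# What the server sends: the title encoded as GBK bytes.
gbk_bytes = original.encode("gbk")

# The damage: the bytes are decoded with the wrong codec, one character per byte.
garbled = gbk_bytes.decode("iso-8859-1")

# The repair: turn the characters back into the same bytes, then decode them correctly.
repaired = garbled.encode("iso-8859-1").decode("gbk")
print(repr(garbled))
print(repaired)
```

    Setting response.encoding = 'gbk' before reading response.text is an alternative that avoids the damage in the first place, assuming the page really is GBK-encoded.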
    
    

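    4. IP proxies

    Proxy server

    Forwards requests, which makes it possible to change the IP address a request comes from
    
    

    Anonymity levels of proxies

    Transparent: the server knows you are using a proxy and knows your real IP
    Anonymous: the server knows you are using a proxy but does not know your real IP
    Elite (high anonymity): the server does not know you are using a proxy, let alone your real IP
    
    

    Proxy types

    http: this type of proxy can only forward http requests
    
    https: can only forward https requests
    
    

    Sites offering free proxy IPs

    Kuaidaili (快代理)
    Xici Daili (西祠代理)
    goubanjia
    Daili Jingling (代理精灵, recommended): http://http.zhiliandaili.cn/
    

    What to do when your IP gets banned while crawling?

    Use a proxy
    Build a proxy pool
    Use a dial-up server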
    
    
    import requests
    import random
    from lxml import etree
    
    # The proxy pool, kept as a list
    all_ips = []
    proxy_url = "http://t.11jsq.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=5&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=15"
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    }
    proxy_page_text = requests.get(url=proxy_url, headers=headers).text
    tree = etree.HTML(proxy_page_text)
    proxy_list = tree.xpath('//body//text()')
    for ip in proxy_list:
        dic = {'https': ip}
        all_ips.append(dic)
    # Scrape the free proxy IPs listed on Kuaidaili
    free_proxies = []
    for i in range(1, 3):
        url = f"http://www.kuaidaili.com/free/inha/{i}/"
        page_text = requests.get(url, headers=headers, proxies=random.choice(all_ips)).text
        tree = etree.HTML(page_text)
        # tbody must not appear in the xpath expression: browsers insert it, but the raw source often lacks it.
        # tr[td] keeps only the data rows and skips a header row made of th cells.
        tr_list = tree.xpath('//*[@id="list"]/table//tr[td]')
        for tr in tr_list:
            ip = tr.xpath("./td[1]/text()")[0]
            port = tr.xpath("./td[2]/text()")[0]
            dic = {
                "ip":ip,
                "port":port
            }
            print(dic)
            free_proxies.append(dic)
        print(f"Page {i}")
    print(len(free_proxies))
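
    The scraped entries still have to be turned into the mapping that the proxies argument of requests expects. A minimal sketch, assuming plain http proxies; the helper name to_requests_proxies and the addresses are hypothetical.

```python
import random


def to_requests_proxies(entry: dict) -> dict:
    """Build a requests-style proxies mapping from an {'ip': ..., 'port': ...} entry."""
    address = f"http://{entry['ip']}:{entry['port']}"
    # The keys name the scheme of the target URL; the value is the proxy to route it through.
    return {"http": address, "https": address}


# Hypothetical entries in the same shape as free_proxies above.
free_proxies = [
    {"ip": "10.0.0.1", "port": "8080"},
    {"ip": "10.0.0.2", "port": "3128"},
]

proxy_pool = [to_requests_proxies(entry) for entry in free_proxies]
chosen = random.choice(proxy_pool)  # pick a different proxy for each request
print(chosen)
```

    The chosen mapping would then be passed as requests.get(url, headers=headers, proxies=chosen).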
    
    

    5. Handling cookies

    Video parsing interfaces

    https://www.wocao.xyz/index.php?url=
    https://2wk.com/vip.php?url=
    https://api.47ks.com/webcloud/?v-
    
    

    Video parsing sites

    牛巴巴     http://mv.688ing.com/
    爱片网     https://ap2345.com/vip/
    全民解析   http://www.qmaile.com/
    
    

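    Back to the topic

    Why handle cookies?

    Cookies store client-side state
    
    

    How do you send cookies with a request, and how do you deal with cookie-based anti-crawling?

    # Manual handling
    Capture the cookie in a packet-capture tool and put it in headers
    
    
    
    # Automatic handling
    Use a session
    When to use it: cookies that change dynamically
    Session object: used almost exactly like the requests module itself. If a request sent through the session produces cookies, they are stored in the session automatically and sent with its later requests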
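
    For the manual approach, a cookie string copied from a packet-capture tool can also be split into a dictionary with the standard library. The cookie string below is a made-up example, not a real capture.

```python
from http.cookies import SimpleCookie

# Hypothetical value of a captured Cookie request header.
raw_cookie = "lastCity=101120200; theme=dark; token=abc123"

jar = SimpleCookie()
jar.load(raw_cookie)

# Each entry is a Morsel; .value holds the cookie value.
cookies = {name: morsel.value for name, morsel in jar.items()}
print(cookies)
```

    The resulting dictionary can be passed as requests.get(url, headers=headers, cookies=cookies) instead of pasting the raw string into headers.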
    
    

    Scraping data from Xueqiu

    import requests
    
    s = requests.Session()
    main_url = "https://xueqiu.com"  # request this URL first so that the session picks up the cookies
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    }
    params = {
        "size": "8",
        '_type': "10",
        "type": "10"
    }
    s.get(main_url, headers=headers)
    url = 'https://stock.xueqiu.com/v5/stock/hot_stock/list.json'
    
    # The same session sends the stored cookies along with this request
    page_text = s.get(url, headers=headers, params=params).json()
    print(page_text)
    
    

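    6. Captcha recognition

    Recognition through an online captcha-solving platform

    1. Register and log in (identity verification in the user center)

    2. After logging in

    Create a software entry: Software ID -> generate a software id

    Download the sample code: Developer docs -> python -> download

    Demo of the platform's sample code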

    import requests
    from hashlib import md5
    
    
    class Chaojiying_Client(object):
        def __init__(self, username, password, soft_id):
            self.username = username
            password = password.encode('utf8')
            self.password = md5(password).hexdigest()
            self.soft_id = soft_id
            self.base_params = {
                'user': self.username,
                'pass2': self.password,
                'softid': self.soft_id,
            }
            self.headers = {
                'Connection': 'Keep-Alive',
                'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
            }
    
        def PostPic(self, im, codetype):
            params = {
                'codetype': codetype,
            }
            params.update(self.base_params)
            files = {'userfile': ('ccc.jpg', im)}
            r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                              headers=self.headers)
            return r.json()
    
        def ReportError(self, im_id):
            params = {
                'id': im_id,
            }
            params.update(self.base_params)
            r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
            return r.json()
    
    
    chaojiying = Chaojiying_Client('Chaojiying username', 'Chaojiying password', '96001')
    with open('a.jpg', 'rb') as img_file:
        im = img_file.read()
    print(chaojiying.PostPic(im, 1902)['pic_str'])
    
    

    Recognizing the captcha on the Gushiwen site

    zbb.py

    import requests
    from hashlib import md5
    
    
    class Chaojiying_Client(object):
        def __init__(self, username, password, soft_id):
            self.username = username
            password = password.encode('utf8')
            self.password = md5(password).hexdigest()
            self.soft_id = soft_id
            self.base_params = {
                'user': self.username,
                'pass2': self.password,
                'softid': self.soft_id,
            }
            self.headers = {
                'Connection': 'Keep-Alive',
                'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
            }
    
        def PostPic(self, im, codetype):
            params = {
                'codetype': codetype,
            }
            params.update(self.base_params)
            files = {'userfile': ('ccc.jpg', im)}
            r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                              headers=self.headers)
            return r.json()
    
        def ReportError(self, im_id):
            params = {
                'id': im_id,
            }
            params.update(self.base_params)
            r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
            return r.json()
    
    
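    def www(path, code_type):
        # Send the image at path to the platform and return the recognized text
        chaojiying = Chaojiying_Client('5423', '521521', '906630')
        with open(path, 'rb') as img_file:
            im = img_file.read()
        return chaojiying.PostPic(im, code_type)['pic_str']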
    
    

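    requests.py (save the script under a different name in practice: a file called requests.py shadows the requests library and breaks import requests)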

    import requests
    from lxml import etree
    from zbb import www
    
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    }
    url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    img_url = "https://so.gushiwen.cn/" + tree.xpath('//*[@id="imgCode"]/@src')[0]
    img_data = requests.get(img_url,headers=headers).content
    with open('./111.jpg','wb') as f:
        f.write(img_data)
    img_text = www('./111.jpg',1004)
    print(img_text)
    
    

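    7. Simulated login

    Why does a crawler need to simulate logging in?

    Some data is only shown after logging in
    
    

    Gushiwen

    Anti-crawling mechanisms involved

    1. Captcha
    2. Dynamic request parameters: the parameters change with every request
    	Capturing them: dynamic request parameters are usually hidden in the source of the front-end page
    3. The cookie is set by the response to the captcha image request,
     which is an easy trap to fall into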
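
    Capturing dynamic parameters from the page source can be sketched with the standard library's HTMLParser, which collects every hidden input of a form. The class name HiddenInputParser and the sample form are hypothetical; the field names mirror the ASP.NET ones that appear in the login request of this section.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value") or ""


# Made-up login form with two hidden fields and one visible field.
sample_form = (
    '<form>'
    '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />'
    '<input type="hidden" name="__VIEWSTATEGENERATOR" value="C93BE1AE" />'
    '<input type="text" name="email" />'
    '</form>'
)

parser = HiddenInputParser()
parser.feed(sample_form)
print(parser.fields)
```

    The collected fields can be merged into the data dictionary of the login request, so the values never have to be hard-coded.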
    
    
    import requests
    from lxml import etree
    from zbb import www
    
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    }
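    # Session that will store the cookie
    s = requests.Session()
    # s_url = "https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx"
    # s.get(s_url, headers=headers)
    
    # Get the captcha (the image is requested through the session, which is where the cookie is set)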
    url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    img_url = "https://so.gushiwen.cn/" + tree.xpath('//*[@id="imgCode"]/@src')[0]
    img_data = s.get(img_url, headers=headers).content
    with open('./111.jpg', 'wb') as f:
        f.write(img_data)
    img_text = www('./111.jpg', 1004)
    print(img_text)
    
    # Capture the dynamic request parameters from the page source
    __VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
    __VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]
    
    # URL requested when the login button is clicked, found with a packet-capture tool
    login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
    data = {
        "__VIEWSTATE": __VIEWSTATE,
        "__VIEWSTATEGENERATOR": __VIEWSTATEGENERATOR,  # changes between requests
        "from": "http://so.gushiwen.cn/user/collect.aspx",
        "email": "542154983@qq.com",
        "pwd": "zxy521",
        "code": img_text,
        "denglu": "登录"
    }
    main_page_text = s.post(login_url, headers=headers, data=data).text
    with open('main.html', 'w', encoding='utf-8') as fp:
        fp.write(main_page_text)
    
    

    8. Asynchronous crawling with a thread pool

    Crawling the first ten pages of Qiushibaike asynchronously with a thread pool

    import requests
    from multiprocessing.dummy import Pool
    
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    }
    # Build the list of URLs
    urls = []
    for i in range(1, 11):
        urls.append(f'https://www.qiushibaike.com/8hr/page/{i}/')
    
    # Sends one request
    def get_request(url):
        # pool.map passes a single argument, so the function takes exactly one parameter
        return requests.get(url, headers=headers).text
    # Create a pool of 10 threads
    pool = Pool(10)
    response_text_list = pool.map(get_request, urls)
    print(response_text_list)
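
    The effect of the pool can be observed without any network access by replacing the request with a short sleep. In the sketch below, fake_request and the example.com URLs are stand-ins; the point is that the four waits overlap, and that map returns the results in the order of the input list.

```python
import time
from multiprocessing.dummy import Pool  # thread-based pool with the Pool API


def fake_request(url):
    """Stand-in for a blocking network call."""
    time.sleep(0.2)
    return f"response from {url}"


urls = [f"https://example.com/page/{i}" for i in range(1, 5)]

start = time.time()
with Pool(4) as pool:
    # map blocks until every call has finished and keeps the input order.
    results = pool.map(fake_request, urls)
elapsed = time.time() - start

print(results[0])
print(f"{elapsed:.2f}s for {len(urls)} simulated requests")
```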
    
    

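    9. Single thread + multi-task asynchronous coroutines

    1. Introduction

    Coroutine: an object

    # A coroutine can be thought of as a special function. If a function definition is marked with the async keyword, calling it does not run its body immediately; it returns a coroutine object instead.
    
    

    Task object (task)

    # A task object is a further wrapper around a coroutine object. It exposes the running state of the coroutine.
    # A task object eventually has to be registered with an event loop object.
    
    

    Binding a callback

    # A callback is bound to a task object. It runs only after the special function behind the task has finished.
    
    

    Event loop object

    # An object that loops indefinitely. It can also be seen as a container that holds multiple task objects (that is, a set of code blocks waiting to run).
    
    

    How the asynchrony shows up

    # Once the event loop starts, it runs the task objects in order.
        # When a task blocks, the event loop does not wait; it moves straight on to the next task.
    
    

    await: suspends the current coroutine and hands control back to the event loop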

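    Single task

    from time import sleep
    import asyncio
    
    
    # Callback function
    # Its single parameter is the task object
    def callback(task):
        print('i am callback!!1')
        print(task.result())  # result() returns the return value of the special function behind the task
    
    
    async def get_request(url):
        print('Requesting:', url)
        sleep(2)  # blocking sleep; it only works here because there is a single task
        print('Request finished:', url)
        return 'hello bobo'
    
    
    # Create a coroutine object
    c = get_request('www.1.com')
    # Wrap it in a task object
    task = asyncio.ensure_future(c)
    
    # Bind the callback to the task object
    task.add_done_callback(callback)
    
    # Create an event loop object
    loop = asyncio.get_event_loop()
    loop.run_until_complete(task)  # registers the task with the event loop and starts the loop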
    
    
    

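    2. Multi-task asynchronous coroutines

    import asyncio
    import time
    start = time.time()
    urls = [
        'http://localhost:5000/a',
        'http://localhost:5000/b',
        'http://localhost:5000/c'
    ]
    # Code from modules that do not support async must not appear in the block to be run
    # Every blocking operation inside the function must be marked with await
    async def get_request(url):
        print('Requesting:', url)
        # time.sleep(2) would block the whole event loop
        await asyncio.sleep(2)
        print('Request finished:', url)
        return 'hello bobo'
    
    tasks = []  # holds all task objects
    for url in urls:
        c = get_request(url)
        task = asyncio.ensure_future(c)
        tasks.append(task)
    
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    
    print(time.time()-start) 
    
    

    Notes:

    1. Store the task objects in a list and register that list with the event loop. During registration the list has to be wrapped with asyncio.wait.
    2. The special function behind a task must not contain code from modules that do not support async; otherwise the asynchronous effect is lost. Every blocking operation inside the function must also be marked with await.
    3. Code that uses the requests module must not appear inside the special function, because requests does not support async.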
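
    On current Python versions the same pattern is usually written with asyncio.run and asyncio.gather instead of managing the event loop by hand. The sketch below is an alternative formulation, not the code of this post; fake_request and the example.com URLs are stand-ins for a real awaitable request.

```python
import asyncio
import time


async def fake_request(url):
    """Stand-in for an awaitable network call."""
    await asyncio.sleep(0.2)  # yields to the event loop while "waiting"
    return f"done: {url}"


async def main(urls):
    # gather schedules all coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_request(u) for u in urls))


urls = [f"https://example.com/{name}" for name in ("a", "b", "c")]

start = time.time()
results = asyncio.run(main(urls))  # creates the event loop, runs main, closes the loop
elapsed = time.time() - start

print(results)
print(f"{elapsed:.2f}s")
```

    The three 0.2-second waits overlap, so the total stays close to 0.2 seconds rather than 0.6.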
    
    

    3. aiohttp

    A network request module that supports asynchronous operation

    - Installation: pip install aiohttp
    
    import asyncio
    import requests
    import time
    import aiohttp
    from lxml import etree
    
    urls = [
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo',
        'http://localhost:5000/bobo',
    ]
    
    
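    # aiohttp is used here because requests does not support async and would block the event loop
    # Detail: put async in front of every with, and await in front of every blocking step
    async def req(url):
        async with aiohttp.ClientSession() as s:
            async with s.get(url) as response:
                # response.read() returns bytes; response.text() returns a string
                page_text = await response.text()
                return page_text
    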
    
    
    def parse(task):
        page_text = task.result()
        tree = etree.HTML(page_text)
        name = tree.xpath('//p/text()')[0]
        print(name)
    
    
    if __name__ == '__main__':
        start = time.time()
        tasks = []
        for url in urls:
            c = req(url)
            task = asyncio.ensure_future(c)
            task.add_done_callback(parse)
            tasks.append(task)
    
        loop = asyncio.get_event_loop()
        loop.run_until_complete(asyncio.wait(tasks))
    
        print(time.time() - start)
    
    
    

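    10. selenium

    What it is

    A module for browser automation.
    
    

    Installation:

    Install the selenium module (pip install selenium)
    
    

    How does selenium relate to crawling?

    It makes it easy to get dynamically loaded data from a page
         With requests: what you see is not necessarily what you get
         With selenium: what you see is what you get
    It can be used to simulate logging in
    
    

    Basic operations:

    Download location of the Chrome driver:
    http://chromedriver.storage.googleapis.com/index.html
    
    Mapping between driver versions and Chrome versions:
    https://blog.csdn.net/huilan_same/article/details/51896672
    
    

    Action chains

    A sequence of user actions
    
    

    Headless browser

    A browser without a visible interface
    PhantomJS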
    
    

    1. Basic operations, using JD as an example

    from selenium import webdriver
    from time import sleep
    # 1. Instantiate a browser object
    bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
    # 2. Send a request the way a user would
    url = 'https://www.jd.com'
    bro.get(url) 
    # 3. Locate a tag
    search_input =  bro.find_element_by_id('key')
    # 4. Interact with the located tag
    search_input.send_keys('华为')
    # 5. Perform a sequence of actions
    btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
    btn.click()
    sleep(2)
    # 6. Run js code
    jsCode = 'window.scrollTo(0,document.body.scrollHeight)'
    bro.execute_script(jsCode)
    sleep(3)
    # 7. Close the browser
    bro.quit()
    
    

    2. Scraping data from the drug administration site

    from selenium import webdriver
    from lxml import etree
    from time import sleep
    
    page_text_list = []
    # Instantiate a browser object
    bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
    url = 'http://125.35.6.84:81/xk/'
    bro.get(url)
    # Wait until the page has finished loading
    sleep(2)
    # page_source is the source of the page as currently shown in the browser
    
    page_text = bro.page_source
    page_text_list.append(page_text)
    # The next-page button must be visible in the window before it can be clicked, so scroll to the bottom first
    jsCode = 'window.scrollTo(0,document.body.scrollHeight)'
    bro.execute_script(jsCode)
    # Open the next two pages
    for i in range(2):
        bro.find_element_by_id('pageIto_next').click()
        sleep(2)
    
        page_text = bro.page_source
        page_text_list.append(page_text)
    
    for p in page_text_list:
        tree = etree.HTML(p)
        li_list = tree.xpath('//*[@id="gzlist"]/li')
        for li in li_list:
            name = li.xpath('./dl/@title')[0]
            print(name)
    sleep(2)
    bro.quit()
    
    

    3. Action chains

    from lxml import etree
    from time import sleep
    from selenium import webdriver
    from selenium.webdriver import ActionChains
    
    # Instantiate a browser object
    page_text_list = []
    bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
    url = 'https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
    bro.get(url)
    # If the target tag lives in the sub-page of an iframe, switch_to.frame must be called before locating it
    bro.switch_to.frame('iframeResult')
    div_tag = bro.find_element_by_id('draggable')
    
    # 1. Instantiate an action chain object
    action = ActionChains(bro)
    action.click_and_hold(div_tag)
    
    for i in range(5):
        # perform() makes the action chain run immediately
        action.move_by_offset(17, 0).perform()
        sleep(0.5)
    # Release the mouse button; without perform() the release would never be executed
    action.release().perform()
    
    sleep(3)
    
    bro.quit()
    
    

    4. Getting past selenium detection

    Many sites, Taobao for example, block crawling with selenium

    In a normally opened browser, window.navigator.webdriver evaluates to undefined

    In a browser opened by code, it evaluates to true

    from selenium import webdriver
    from selenium.webdriver import ChromeOptions
    option = ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])
    
    # Instantiate a browser object
    bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe',options=option)
    bro.get('https://www.taobao.com/')
    
    
    

    5. Simulating a 12306 login

    from selenium import webdriver
    from selenium.webdriver import ActionChains
    from PIL import Image  # used to crop the screenshot (pip install pillow)
    from zbb import www
    from time import sleep
    
    bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
    bro.get('https://kyfw.12306.cn/otn/resources/login.html')
    sleep(5)
    zhdl = bro.find_element_by_xpath('/html/body/div[2]/div[2]/ul/li[2]/a')
    zhdl.click()
    sleep(1)
    
    username = bro.find_element_by_id('J-userName')
    username.send_keys('181873')
    pwd = bro.find_element_by_id('J-password')
    pwd.send_keys('zx1')
    # Capture the captcha image by cropping a screenshot of the page
    bro.save_screenshot('main.png')
    # Locate the tag that holds the captcha image
    code_img_ele = bro.find_element_by_xpath('//*[@id="J-loginImg"]')
    location = code_img_ele.location  # top-left corner of the captcha image within the page
    size = code_img_ele.size  # width and height of the captcha image
    # Crop box (left, upper, right, lower): the top-left and bottom-right corners
    rangle = (
        int(location['x']), int(location['y']),
        int(location['x'] + size['width']), int(location['y'] + size['height']))
    
    screenshot = Image.open('main.png')
    frame = screenshot.crop(rangle)
    frame.save('code.png')
    
    # Recognize the captcha with the captcha-solving platform
    result = www('./code.png', 9004)
    # 'x1,y1|x2,y2|x3,y3'  ==> [[x1,y1],[x2,y2],[x3,y3]]
    # Each element is one point to click; coordinates are relative to the top-left corner of the captcha image.
    # split('|') also covers a single point, because it then returns a one-element list.
    all_list = [[int(v) for v in point.split(',')] for point in result.split('|')]
    print(all_list)
    for x, y in all_list:
        # A fresh chain per click keeps earlier actions from being queued again
        ActionChains(bro).move_to_element_with_offset(code_img_ele, x, y).click().perform()
        sleep(2)
    
    btn = bro.find_element_by_xpath('//*[@id="J-login"]')
    btn.click()
    
    sleep(3)
    bro.quit()
    
    
    
    
    
    
  • Original article: https://www.cnblogs.com/zdqc/p/13408310.html