  • Web Scraping Basics

    Environment: Python 3, Windows

    Modules: requests, BeautifulSoup

    Install the modules:

    pip3 install BeautifulSoup4
    pip3 install requests
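A quick way to check that both packages installed correctly is to import them and parse a small inline snippet. This is only a sketch: the HTML string and the name `sample_html` are made up for illustration and do not come from any real site.

```python
import requests
from bs4 import BeautifulSoup

# Made-up HTML used only to confirm that parsing works offline.
sample_html = """
<ul id="news">
  <li><h3>First title</h3><p>First summary</p><a href="/a/1.html">read</a></li>
  <li><p>No heading here</p></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")
items = soup.find(id="news").find_all(name="li")

# find() returns None when the tag is missing, so check before using it.
titles = [li.find("h3").text for li in items if li.find("h3")]
first_link = items[0].find("a").get("href")

print(requests.__version__)  # any version string means the import worked
print(titles, first_link)
```

If both imports succeed and the title and link are printed, the environment is ready for the examples that follow.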
    

      

    I. A simple scraper, using Autohome (汽车之家) as the example.

    import requests
    from bs4 import BeautifulSoup
    
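    # Find every news item and extract:
    # title, summary, URL, image
    # Send a GET request to the Autohome news page and fetch the returned page
    response = requests.get('http://www.autohome.com.cn/news/')
    # requests infers the encoding from the response headers; many Chinese sites, Autohome included, serve GBK, so set it explicitly
    response.encoding = 'gbk'
    
    # Parse the HTML document with the parser from the Python standard library
    soup = BeautifulSoup(response.text,'html.parser')
    # Find the tag with id=xx, then find every li tag underneath it
    li_list = soup.find(id='auto-channel-lazyload-article').find_all(name='li')
    
    # The browser dev tools (F12) show that the li tags under this element hold the data we need. Pull each field out with BeautifulSoup methods.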
    for li in li_list:
        title = li.find('h3')
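        # Some li tags have no h3 (find returns None), possibly ads; skip them
        if not title:
            continue
        # Summary
        summary = li.find('p').text
        
        # URL of the detail page: find the a tag; all of its attributes live in the attrs dict, so read attrs directly or use get()
        # url = li.find('a').attrs['href']
        url = li.find('a').get('href')
    
        # Likewise, get the image URL first, then request it from the server and write the bytes to a local file
        img_url = li.find('img').get('src')
        img = requests.get(img_url)
    
        # This part is pseudocode: in practice titles contain many special characters and cannot be used as file names as-is.
        # Use a different name, or replace the special characters with a regular expression.
        file_name = title.text
        with open(file_name+'.jpg','wb') as f:
            f.write(img.content)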
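The comments above point out that article titles often contain characters that are not valid in file names. One possible way to handle this is sketched below using only the standard library. `safe_file_name` and `absolute_url` are hypothetical helper names, the set of replaced characters is an assumption based on what Windows disallows, and the idea that a `src` attribute may be relative or protocol-relative (starting with `//`) is also an assumption worth checking against the page source.

```python
import re
from urllib.parse import urljoin

def safe_file_name(title, replacement="_"):
    """Replace characters that Windows forbids in file names."""
    # \ / : * ? " < > | and control characters are not allowed on Windows.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', replacement, title)
    # Trailing dots and spaces are also rejected by Windows.
    cleaned = cleaned.strip(" .")
    return cleaned or "untitled"

def absolute_url(page_url, src):
    """Resolve a relative or protocol-relative src against the page URL."""
    return urljoin(page_url, src)

print(safe_file_name('What is "new"? A/B test: 2017'))
print(absolute_url("http://www.autohome.com.cn/news/", "//example.com/img/1.jpg"))
```

In the loop, `file_name = safe_file_name(title.text)` and `img_url = absolute_url(response.url, li.find('img').get('src'))` would be the corresponding substitutions.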
    

      

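    II. Logging in through code:

    1. Logging in to GitHub:

    First open the GitHub login page and submit a wrong username and password, then inspect the HTTP requests in the Network tab of the browser dev tools (F12).

    Click the session request; its Headers tab shows which URL receives the login data.

    Next, look for the data the server expects; the Form Data section is at the very bottom.

    Following that format, send a POST request to the GitHub server: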

    import requests
    from bs4 import BeautifulSoup
    
    # Get the token
    r1 = requests.get('https://github.com/login')
    s1 = BeautifulSoup(r1.text,'html.parser')
    # Again via F12: search the page source for "token" to find the input tag holding the CSRF token, then parse out its value
    token = s1.find(name='input',attrs={'name':'authenticity_token'}).get('value')
    # Some sites set cookies on the first GET and only accept a login from a client that sends them back,
    # so grab the pre-login cookies first
    r1_cookie_dict = r1.cookies.get_dict()
    # Send the username, password, and token to the server
    r2 = requests.post('https://github.com/session',
                       data={
                           'utf8':'✓',
                           'authenticity_token':token,
                           'login':'Mitsui1993',
                           'password':'pretend-password',
                           'commit':'Sign in'
                       },
                       cookies = r1_cookie_dict
                       )
    # Get the cookies returned after login and merge everything into one dict
    r2_cookie_dict = r2.cookies.get_dict()
    cookie_dict = {}
    cookie_dict.update(r1_cookie_dict)
    cookie_dict.update(r2_cookie_dict)
    # With the cookies attached, verify the login by requesting a page that is only visible when signed in
    r3 = requests.get(
        url='https://github.com/settings/emails',
        cookies=cookie_dict
    )
    # The response text contains my username, which confirms the login succeeded.
    print(r3.text)
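Merging cookie dictionaries by hand, as above, can also be left to `requests.Session`, which stores the cookies from each response and sends them on later requests. The following is a sketch under assumptions, not GitHub's documented login flow: the form field names are the ones shown in the example above and may have changed since, and `extract_token` and `github_login` are hypothetical helper names.

```python
import requests
from bs4 import BeautifulSoup

def extract_token(html, field_name="authenticity_token"):
    """Return the value of the input with the given name, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(name="input", attrs={"name": field_name})
    return tag.get("value") if tag else None

def github_login(username, password):
    """Log in with a Session so cookies carry over between requests."""
    session = requests.Session()
    # The session keeps the cookies set by this first GET.
    login_page = session.get("https://github.com/login", timeout=10)
    token = extract_token(login_page.text)
    # Field names copied from the form data in the example above.
    session.post(
        "https://github.com/session",
        data={
            "utf8": "✓",
            "authenticity_token": token,
            "login": username,
            "password": password,
            "commit": "Sign in",
        },
        timeout=10,
    )
    return session

# Offline check of the token helper on a made-up form.
sample_form = '<form><input type="hidden" name="authenticity_token" value="abc123"></form>'
print(extract_token(sample_form))
```

A signed-in page could then be fetched with `session.get('https://github.com/settings/emails')` without passing `cookies=` explicitly.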

      

    2. Upvoting on Chouti (抽屉网) with requests

    import requests
    
    # Get the cookies from the first GET request, before logging in
    r1 = requests.get('http://dig.chouti.com')
    r1_cookies = r1.cookies.get_dict()
    
    # Upvoting requires a login first, so as with GitHub, inspect the HTTP requests to learn the target URL and the required data
    r2 = requests.post('http://dig.chouti.com/login',
                       data={
                           'phone':'8615xxxxx',
                           'password':'woshiniba',
                           'oneMonth':1
                       },
                       cookies = r1_cookies)
    
    # Get the cookies returned after login
    r2_cookies = r2.cookies.get_dict()
    
    # Merge the cookies
    r_cookies = {}
    r_cookies.update(r1_cookies)
    r_cookies.update(r2_cookies)
    
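    # The upvote itself only needs the gpsd cookie from the first GET. That is why it is safer to merge the
    # cookies from before and after login and send them together: the request works whichever one the site checks.
    # r_cookies = {'gpsd':r1_cookies['gpsd']}
    
    # Upvote URL format: the value after linksId= is the article id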
    r3 = requests.post('http://dig.chouti.com/link/vote?linksId=13921736',
                  cookies = r_cookies)
    
    # A correct status code and response message show that the upvote succeeded.
    print(r3.text)
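The vote URL above is assembled by hand. As a sketch, the same request can be described with `params=` and inspected offline through `requests.Request(...).prepare()` before anything is sent. `merge_cookies` and `build_vote_request` are hypothetical helper names, and the cookie values are placeholders rather than real session data.

```python
import requests

def merge_cookies(*cookie_dicts):
    """Combine cookie dicts; later dicts override earlier ones on a name clash."""
    merged = {}
    for cookies in cookie_dicts:
        merged.update(cookies)
    return merged

def build_vote_request(link_id, cookies):
    """Prepare (but do not send) the vote request for one article id."""
    request = requests.Request(
        "POST",
        "http://dig.chouti.com/link/vote",
        params={"linksId": link_id},
        cookies=cookies,
    )
    return request.prepare()

# Placeholder values standing in for the cookies from the two responses.
before_login = {"gpsd": "placeholder-gpsd"}
after_login = {"session": "placeholder-session"}

prepared = build_vote_request(13921736, merge_cookies(before_login, after_login))
print(prepared.method, prepared.url)
```

The prepared request can afterwards be sent with `requests.Session().send(prepared)`.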
    

      

    III. Other methods of the requests and BeautifulSoup modules:

    def request(method, url, **kwargs):
        """Constructs and sends a :class:`Request <Request>`.

        :param method: method for the new :class:`Request` object.
        :param url: URL for the new :class:`Request` object.
        :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
        :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
        :param json: (optional) json data to send in the body of the :class:`Request`.
        :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
        :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
        :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
            ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
            or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
            defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
            to add for the file.
        :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
        :param timeout: (optional) How long to wait for the server to send data
            before giving up, as a float, or a :ref:`(connect timeout, read
            timeout) <timeouts>` tuple.
        :type timeout: float or tuple
        :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
        :type allow_redirects: bool
        :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
        :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
        :param stream: (optional) if ``False``, the response content will be immediately downloaded.
        :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
        :return: :class:`Response <Response>` object
        :rtype: requests.Response

        Usage::

          >>> import requests
          >>> req = requests.request('GET', 'http://httpbin.org/get')
          <Response [200]>
        """
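To see how a few of these parameters shape the outgoing request, the sketch below prepares requests without sending them. The URLs and values are placeholders; `requests.Request` and `prepare()` are part of the library, while `describe` is a hypothetical helper introduced only for this illustration.

```python
import json
import requests

def describe(method, url, **kwargs):
    """Prepare a request without sending it and return the parts of interest."""
    prepared = requests.Request(method, url, **kwargs).prepare()
    return prepared.url, prepared.headers.get("Content-Type"), prepared.body

# params -> query string
url, _, _ = describe("GET", "http://httpbin.org/get", params={"k1": "v1"})

# data (dict) -> form-encoded body
_, form_type, form_body = describe("POST", "http://httpbin.org/post", data={"user": "alex"})

# json -> JSON body with a JSON content type
_, json_type, json_body = describe("POST", "http://httpbin.org/post", json={"user": "alex"})

print(url)
print(form_type, form_body)
print(json_type, json.loads(json_body))
```

Options such as `timeout`, `verify`, `proxies`, and `stream` are not stored on the prepared request; they apply when the request is sent, for example `requests.request('GET', url, timeout=5)`.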
  • Original article: https://www.cnblogs.com/mitsui/p/7444220.html