  • NetEase Open Course: Web Scraping

    1 Writing code in Jupyter Notebook

    C:\Users\zuo>jupyter notebook

    2 The Jupyter Notebook page has keyboard shortcuts; they can be configured from the Help menu.

     

    3 Common BeautifulSoup methods

    from  bs4 import BeautifulSoup
    text = '''
    <!DOCTYPE html>
    <html lang="zh-CN">
    <head>
        <meta charset="UTF-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no">
        <title>Title</title>
        <link rel="stylesheet" href="bootstrap-3.3.7-dist/css/bootstrap.min.css">
    </head>
    <body>
    <h1>hello world</h1>
    <span class="s1">xxx</span>
    <a id="a1" href="" name="y">yyy</a>
    <a href="https://baidu.com" name="baidu">百度</a>
    <a href="https://tencent.com" name="tentent">腾讯</a>
    <script src="jquery-3.2.1.min.js"></script>
    <script src="bootstrap-3.3.7-dist/js/bootstrap.min.js"></script>
    </body>
    </html>
    '''
    
    soup = BeautifulSoup(text,'html.parser') # the second argument selects the parser
    print(soup.text)
    # select by tag name
    print(soup.select('a'))
    print(soup.select('h1'))
    print(soup.select('h1')[0])
    print(soup.select('h1')[0].text)
    # select by id
    print(soup.select('#a1'))
    print(soup.select('#a1')[0])
    print(soup.select('#a1')[0].text)
    # select by class
    print(soup.select('.s1'))
    print(soup.select('.s1')[0])
    print(soup.select('.s1')[0].text)
    # read the href and name attributes of every <a> tag
    for link in soup.select('a'):
        print(link,type(link),link['href'],link['name']) # tag attributes are accessed like dictionary keys

      Output:

    D:\Anaconda3\python.exe D:/virtualenv/xxx/xxx/1.py
    
    
    
    
    
    
    
    Title
    
    
    
    hello world
    xxx
    yyy
    百度
    腾讯
    
    
    
    
    
    [<a href="" id="a1" name="y">yyy</a>, <a href="https://baidu.com" name="baidu">百度</a>, <a href="https://tencent.com" name="tentent">腾讯</a>]
    [<h1>hello world</h1>]
    <h1>hello world</h1>
    hello world
    [<a href="" id="a1" name="y">yyy</a>]
    <a href="" id="a1" name="y">yyy</a>
    yyy
    [<span class="s1">xxx</span>]
    <span class="s1">xxx</span>
    xxx
    <a href="" id="a1" name="y">yyy</a> <class 'bs4.element.Tag'>  y
    <a href="https://baidu.com" name="baidu">百度</a> <class 'bs4.element.Tag'> https://baidu.com baidu
    <a href="https://tencent.com" name="tentent">腾讯</a> <class 'bs4.element.Tag'> https://tencent.com tentent
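Indexing a tag like a dictionary raises KeyError when the attribute is missing, which matters on real pages where not every <a> carries an href. The sketch below compares indexing with .get() and .attrs; the two-link markup is a made-up sample, not taken from the page above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second link has no href attribute.
html = '<a id="a1" href="" name="y">yyy</a><a name="no-link">zzz</a>'
soup = BeautifulSoup(html, 'html.parser')

first, second = soup.select('a')

# Indexing behaves like a dict lookup and raises KeyError if the attribute is absent.
try:
    second['href']
    missing_raises = False
except KeyError:
    missing_raises = True

# .get() returns None (or the supplied default) instead of raising.
href_or_none = second.get('href')
href_or_default = second.get('href', '#')

# .attrs exposes all attributes of a tag as a plain dict.
first_attrs = first.attrs

print(missing_raises, href_or_none, href_or_default, first_attrs)
```

Using .get() inside a loop over soup.select('a') avoids stopping on the first link that lacks the attribute.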

      select('#id span p'): selectors separated by spaces match nested (descendant) tags

    text = '''
    <!DOCTYPE html>
    <html lang="zh-CN">
    <body>
    <div class="d1">
        <span class="s1"></span>
        <div class="d2">
            <span class="s2"></span>
            <div class="d3">
                <p class="p1">hello world</p>
            </div>
        </div>
    </div>
    <div class="d1">
    </div>
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(text,'html.parser')
    res = soup.select('.d1 div p') # select can walk down the tree level by level
    print(res,len(res))

      Output:

    [<p class="p1">hello world</p>] 1
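A space in a selector matches descendants at any depth, which is why '.d1 div p' finds the paragraph even though it sits three levels down. The sketch below contrasts that with the '>' child combinator and with select_one; the trimmed markup is a simplified copy of the example above.

```python
from bs4 import BeautifulSoup

html = '''
<div class="d1">
    <div class="d2">
        <div class="d3">
            <p class="p1">hello world</p>
        </div>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# A space matches descendants at any depth.
descendants = soup.select('.d1 div p')
# '>' matches direct children only; <p> is not a direct child of .d2 ...
direct_of_d2 = soup.select('.d2 > p')
# ... but it is a direct child of .d3.
direct_of_d3 = soup.select('.d3 > p')
# select_one returns the first match (or None), so no [0] index is needed.
first_p = soup.select_one('.d1 p')

print(len(descendants), len(direct_of_d2), len(direct_of_d3), first_p.text)
```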

     .contents: the contents attribute returns a tag's direct children as a list

    text = '''
    <!DOCTYPE html>
    <html lang="zh-CN">
    <body>
    <div class="d1">
        <span class="s1"></span>
        <div class="d2">
            <span class="s2"></span>
            <div class="d3">
                <p class="p1">hello world</p>
            </div>
        </div>
    </div>
    
    </body>
    </html>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(text,'html.parser')
    res = soup.select('.d1')[0]
    for child in res.contents:
        print(child)
    print(res.contents,len(res.contents))
    

        Output:

    <span class="s1"></span>
    
    
    <div class="d2">
    <span class="s2"></span>
    <div class="d3">
    <p class="p1">hello world</p>
    </div>
    </div>
    
    
    ['
    ', <span class="s1"></span>, '
    ', <div class="d2">
    <span class="s2"></span>
    <div class="d3">
    <p class="p1">hello world</p>
    </div>
    </div>, '
    '] 5
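The count of 5 includes the newline text nodes between the tags, not just the two child elements. One way to keep only the element children is sketched below, assuming the whitespace nodes are unwanted; the markup is a shortened variant of the example above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '''<div class="d1">
    <span class="s1"></span>
    <div class="d2"><p class="p1">hello world</p></div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
d1 = soup.select_one('.d1')

# .contents mixes Tag objects with NavigableString whitespace nodes.
all_children = d1.contents
# Keep only the element children.
tag_children = [child for child in d1.contents if isinstance(child, Tag)]
# find_all(recursive=False) also returns just the direct element children.
same_children = d1.find_all(recursive=False)

print(len(all_children), [t.name for t in tag_children])
```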

    4 A simple application that combines requests and BeautifulSoup: scraping the headlines and their URLs from the Tencent NBA home page. Fairly easy.

    import requests
    from bs4 import BeautifulSoup
    
    res = requests.get('http://sports.qq.com/nba/')
    res.encoding = 'gbk'
    soup = BeautifulSoup(res.text,'html.parser')
    for item in soup.select('.icon-v'):
        title = item.text
        url = item['href']
        print(title,url)

      Output:

    数据帝:火箭刷新三分纪录 詹皇一成就称霸NBA http://sports.qq.com/a/20180402/023995.htm
    红黑榜:剩4场还需59板!韦少想场均三双得“刷”了 http://sports.qq.com/a/20180402/017828.htm
    2日综述:西蒙斯准三双76人十连胜 詹皇三双骑士胜 http://sports.qq.com/a/20180402/018807.htm
    哈登25+8集锦 http://v.qq.com/x/page/v0026yjpvej.html
    比赛单节回放 http://v.qq.com/x/page/v00267dvuvl.html
    五佳球 http://v.qq.com/x/page/z0026q6tai0.html
    火箭季后赛首轮最想打谁?这两支球队成理想对手 http://sports.qq.com/a/20180402/006580.htm
    直击-圣城今夜中国风 哈登领衔秀中文:爱你! http://sports.qq.com/a/20180402/004477.htm
    三分40中6!哈登手感成迷 是否也该让他轮休 http://sports.qq.com/a/20180402/002498.htm
    詹姆斯集锦 https://v.qq.com/x/cover/qpj1wfj6xs37jcv/t0026a945ua.html
    五佳球:詹姆斯禁区飞身怒扣 https://v.qq.com/x/cover/qpj1wfj6xs37jcv/b0026yudkdb.html
    骑士赢球仍被狂嘘 詹皇:我能做其他事帮助获胜 http://sports.qq.com/a/20180402/012409.htm
    KD集锦 http://v.qq.com/x/page/v0026lzsnes.html
    汤神集锦 http://v.qq.com/x/page/y0026vjpbqj.html
    五佳球 http://v.qq.com/x/page/n0026t1h3nd.html
    单节回放 https://v.qq.com/x/cover/3s23igd42po7lmy/h0026lt9nbp.html
    阿杜被新秀晃倒强硬回击 一夜缔造两项里程碑 http://sports.qq.com/a/20180402/014353.htm
    前方直击-麦考伤情无碍今日出院 勇士全队啥反应? http://sports.qq.com/a/20180402/017943.htm
    韦少26+15+13集锦 http://v.qq.com/x/page/y0026zgw6j0.html
    雷霆vs鹈鹕五佳球 http://v.qq.com/x/page/q0026e8qbh4.html
    《NBA数据酷》:乔丹神奇比赛力压科比81分 http://v.qq.com/x/page/l0026s5o0qe.html
    火箭本季夺冠无悬念?美娜粉嫩出镜为你详解 https://v.qq.com/x/cover/bnt1h8oqszrau20/i002619v18i.html
    马刺完了?你还是太年轻!保持连胜他们仍能拿50胜 http://sports.qq.com/a/20180402/018352.htm
    密歇根防守强在哪儿?翻版绿军欲推翻维拉诺瓦 http://sports.qq.com/a/20180402/018459.htm
    博彩公司看好维拉诺瓦夺冠 3年2夺冠已成定局? http://sports.qq.com/a/20180402/019851.htm
    当《灌篮高手》在日本成现实 中国篮球为何无动于衷? http://sports.qq.com/a/20180402/004467.htm
    直击-最终四强有多疯狂?四个人为看它挤一张床 http://sports.qq.com/a/20180402/013969.htm
    对话密歇根球员:维拉诺瓦像勇士 不愿自比灰姑娘 http://sports.qq.com/a/20180402/015985.htm
    一文读懂希腊篮球:获乔丹盛赞 催生最强美国男篮 http://sports.qq.com/a/20180402/022059.htm
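The loop above assumes every .icon-v element has an href and that the links are absolute. A slightly more defensive sketch follows; extract_links is a hypothetical helper name, and the sample markup stands in for the live page so the snippet runs offline.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_links(html, selector, base_url):
    """Return (title, absolute_url) pairs for elements matching selector."""
    soup = BeautifulSoup(html, 'html.parser')
    pairs = []
    for item in soup.select(selector):
        href = item.get('href')
        if not href:  # skip matches without a usable link
            continue
        # urljoin leaves absolute URLs alone and resolves relative ones.
        pairs.append((item.text.strip(), urljoin(base_url, href)))
    return pairs

# Hypothetical markup standing in for the live page.
sample = '''
<a class="icon-v" href="/a/20180402/023995.htm"> Headline one </a>
<a class="icon-v">No link here</a>
<a class="other" href="/skip.htm">Other</a>
'''
links = extract_links(sample, '.icon-v', 'http://sports.qq.com/nba/')
print(links)
```

With the live page, the html argument would be res.text from requests.get(url, timeout=10), ideally after calling res.raise_for_status() so HTTP errors are not parsed as content.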

     5 Scraping 163 (NetEase) news pages

    import requests
    from bs4 import BeautifulSoup
    from datetime import datetime
    import re
    import json
    
    def get_comment_vote(news):
        key = re.search(r'/(\w+)\.html',news).group(1)  # article id taken from the file name in the URL
        comment_url = 'http://sdk.comment.163.com/api/v1/products/a2869674571f77b5a0867c3d71db5856/threads/{}'.format(key)
        comment_res = requests.get(comment_url)
        jd = json.loads(comment_res.text)
        comment = jd['tcount']
        vote = jd['cmtAgainst'] + jd['cmtVote'] + jd['rcount']
        return (comment,vote)
    
    def crawl(news):
        try:
            result = {}
            res = requests.get(news)
            res.encoding = 'gbk'
            soup = BeautifulSoup(res.text,'html.parser')
            title = soup.select('.post_content_main h1')[0].text
            date1 = soup.select('.post_time_source')[0].contents[0].lstrip().rstrip('\u3000来源: ')
            date = datetime.strptime(date1,'%Y-%m-%d %H:%M:%S')
            source = soup.select('.cDGray span')[0].contents[1].lstrip(' 本文来源:')
            author = soup.select('.cDGray span')[1].text.lstrip('责任编辑:')
            comment,vote = get_comment_vote(news)
            result['title'] = title
            result['date'] = date
            result['source'] = source
            result['author'] = author
            result['comment'] = comment
            result['vote'] = vote
            return result
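        except Exception as exc:
            # Page layouts differ; report the failure and skip this article.
            print('skipped', news, exc)
            return None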
    
    
    NEWS = 'http://news.163.com/'
    res = requests.get(NEWS)
    res.encoding= 'gbk'
    soup = BeautifulSoup(res.text,'html.parser')
    for item in soup.select('a'):
        if item.get('href') and item['href'].startswith('http://news.163.com/18/'):
            print(item['href'])
            result = crawl(item['href'])
            print(result)
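Two steps in crawl() are easy to get wrong: extracting the article key from the URL and cleaning the timestamp text before strptime. The sketch below isolates both, assuming the key pattern is meant to read /(\w+)\.html and the timestamp has the form YYYY-MM-DD HH:MM:SS. The function names and sample inputs are hypothetical.

```python
import re
from datetime import datetime

def article_key(url):
    """Return the file-name stem of an article URL, or None if absent."""
    match = re.search(r'/(\w+)\.html', url)
    return match.group(1) if match else None

def parse_post_time(raw):
    """Parse the first 'YYYY-MM-DD HH:MM:SS' timestamp found in raw text."""
    match = re.search(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', raw)
    if match is None:
        return None
    return datetime.strptime(match.group(0), '%Y-%m-%d %H:%M:%S')

# Hypothetical inputs shaped like the ones handled in crawl().
key = article_key('http://news.163.com/18/0402/10/ABCDEFGH0001875N.html')
when = parse_post_time('\n      2018-04-02 10:15:30\u3000来源: ')
print(key, when)
```

Searching for the timestamp directly avoids depending on exactly which characters surround it, which is what the lstrip/rstrip chain has to guess.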

    6 XHR in the browser developer tools

      In short, the XHR filter lists the requests the page makes through Ajax.

    7 Asynchronous loading while a page loads

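      Take Sina News as an example: when you scroll to the bottom of the page, more items load automatically. NetEase News and Tencent News do not behave this way. This is asynchronous loading, and it is implemented in JavaScript.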

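      These requests can be captured under the JS filter of the Network tab in the developer tools. The JSON data they return is wrapped in a JavaScript function call.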
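A response of this kind cannot be passed to json.loads directly, because the function-call wrapper is not valid JSON. The sketch below strips the wrapper first; unwrap_jsonp is a hypothetical helper, and the callback name and fields in the sample body are made up for illustration.

```python
import json
import re

def unwrap_jsonp(text):
    """Strip a 'callback( ... )' wrapper and parse the JSON inside it."""
    match = re.search(r'^[^(]*\((.*)\)\s*;?\s*$', text.strip(), re.DOTALL)
    if match is None:
        raise ValueError('not a function-wrapped JSON response')
    return json.loads(match.group(1))

# Hypothetical response body; the callback name and fields are invented.
body = 'newsloadercallback({"result": {"total": 2, "data": [{"title": "a"}, {"title": "b"}]}});'
data = unwrap_jsonp(body)
titles = [item['title'] for item in data['result']['data']]
print(titles)
```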

    8 Organizing the data with pandas

      This step displays tables, so Jupyter Notebook is more convenient than PyCharm here.

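    import pandas
    # total is the list of result dicts collected while crawling
    print(total)
    df = pandas.DataFrame(total)
    df  # in Jupyter, a bare expression on the last line renders the table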

       Output:

      Here total is a list whose elements are dictionaries of key-value pairs.

      df = pandas.DataFrame(total)

      df is displayed as a table, with one row per dictionary and one column per key.
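A minimal sketch of building such a table follows. The rows are hypothetical stand-ins for the dictionaries returned by crawl(); since crawl() returns None for pages it fails to parse, those entries are filtered out first.

```python
import pandas

# Hypothetical results: two parsed articles and one failed page (None).
raw = [
    {'title': 'first', 'source': 'A', 'comment': 12, 'vote': 30},
    None,
    {'title': 'second', 'source': 'B', 'comment': 5, 'vote': 8},
]
total = [row for row in raw if row is not None]
df = pandas.DataFrame(total)

columns = list(df.columns)  # dict keys become column names
rows = len(df)              # one row per dict
# Title of the article with the most comments.
top = df.sort_values('comment', ascending=False).iloc[0]['title']
print(df)
```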

      

    9 Storing the data (Excel export)

      df.to_excel('news.xlsx')

      Note that the file name must include the .xlsx extension.

      This produces an Excel file.
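To store the same table in a database rather than a spreadsheet, one option is SQLite through the standard sqlite3 module together with the DataFrame's to_sql method. This is only a sketch under that assumption; the table name news and the sample rows are hypothetical.

```python
import sqlite3
import pandas

# Hypothetical rows standing in for the crawled results.
df = pandas.DataFrame([
    {'title': 'first', 'comment': 12},
    {'title': 'second', 'comment': 5},
])

# An in-memory database keeps the sketch self-contained;
# passing a file path such as 'news.sqlite' would persist it to disk.
conn = sqlite3.connect(':memory:')
df.to_sql('news', conn, index=False, if_exists='replace')

# Read the table back to confirm the round trip.
df_back = pandas.read_sql_query('SELECT * FROM news', conn)
conn.close()
print(df_back)
```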

      

  • Original article: https://www.cnblogs.com/654321cc/p/8695605.html