  • The website is API(2)

    I. The Beautiful Soup library

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(demo,"html.parser")

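    Basic elements: Tag, Name, Attributes, NavigableString, Comment

    .contents: list of child nodes; all direct children of <tag> are stored in a list

    .children: iterator over the direct child nodes

    .descendants: iterator over all descendant nodes

    .parent: the node's parent tag

    .parents: iterator over the node's ancestor tags

    .next_sibling(s): the next sibling node(s) in HTML document order

    .previous_sibling(s): the previous sibling node(s) in HTML document order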
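The traversal attributes listed above can be tried on a tiny document without any network access. This is a minimal sketch; the HTML string and the variable names are made up for illustration and are not the demo page used in this post.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page (an assumption, not the post's demo page).
html = '<div><p id="first">one</p><p id="second">two</p></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.div
first = soup.find(id="first")

children = div.contents                          # list of direct children
child_ids = [p["id"] for p in div.children]      # same nodes, as an iterator
desc_count = len(list(div.descendants))          # both <p> tags plus their strings
parent_name = first.parent.name                  # nearest enclosing tag
ancestor_names = [t.name for t in first.parents] # up to the BeautifulSoup object
next_id = first.next_sibling["id"]               # next node in document order
prev_of_first = first.previous_sibling           # None: nothing precedes it

print(child_ids, desc_count, parent_name, ancestor_names, next_id, prev_of_first)
```

In real pages the siblings of a tag are often whitespace strings rather than tags, so it is worth checking the node type before indexing into it.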

    >>> import requests
    >>> r = requests.get("http://python123.io/ws/demo.html")
    >>> demo = r.text
    >>> demo
    '<html><head><title>This is a python demo page</title></head>
    <body>
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
    </body></html>'
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(demo,"html.parser")
    >>> soup.prettify()
    '<html> <head> <title> This is a python demo page </title> </head> <body> <p class="title"> <b> The demo python introduces several python courses. </b> </p> <p class="course"> Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1"> Basic Python </a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2"> Advanced Python </a> . </p> </body> </html>'
    >>> print(soup.prettify())
    <html>
     <head>
      <title>
       This is a python demo page
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The demo python introduces several python courses.
       </b>
      </p>
      <p class="course">
       Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
       <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
        Basic Python
       </a>
       and
       <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
        Advanced Python
       </a>
       .
      </p>
     </body>
    </html>

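    II. Organizing and extracting information

    1. Three forms of information markup:

    XML: tags in angle brackets

    JSON: typed key-value pairs

    YAML: untyped key-value pairs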
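To make the difference between the three markup forms concrete, here is a small sketch that writes one record in each form. The record itself (a course name and level) is invented for illustration. Only the standard library is used, so the YAML text is shown for comparison but not parsed, since parsing it would need a third-party package.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in the three markup forms.
xml_text = "<course><name>Basic Python</name><level>1</level></course>"
json_text = '{"name": "Basic Python", "level": 1}'
yaml_text = "name: Basic Python\nlevel: 1\n"  # shown only; not parsed here

# XML: structure comes from angle-bracket tags; element text is always a string.
root = ET.fromstring(xml_text)
xml_record = {child.tag: child.text for child in root}

# JSON: key-value pairs whose values carry a type (here, an integer).
json_record = json.loads(json_text)

print(xml_record)
print(json_record)
print(yaml_text)
```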

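    3. General approaches to information extraction

    (1) Parse the markup completely, then extract the key information

    (2) Ignore the markup and search the text directly for the key information

    (3) Combine the two approaches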
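The three approaches can be compared side by side on one small snippet. This is a sketch under assumptions: the HTML and the example.com URLs are invented, and the regular expression is only good enough for this input, not a general HTML link matcher.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet (an assumption, not the post's demo page).
html = ('<p class="course">See <a href="http://example.com/a" id="link1">A</a> '
        'and <a href="http://example.com/b" id="link2">B</a>.</p>')

# (1) Parse the markup completely, then read the attribute from each tag.
soup = BeautifulSoup(html, "html.parser")
parsed = [a.get("href") for a in soup.find_all("a")]

# (2) Ignore the markup and search the raw text directly.
searched = re.findall(r'href="([^"]+)"', html)

# (3) Combined: parse to locate the relevant region, then search inside it.
region = soup.find("p", "course")
combined = re.findall(r'href="([^"]+)"', str(region))

print(parsed, searched, combined)
```

Approach (1) is the most robust to formatting changes, (2) is the quickest to write, and (3) narrows the search area before applying a simple pattern.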

    Example:

    >>> import requests
    >>> r = requests.get("http://python123.io/ws/demo.html")
    >>> demo = r.text
    >>> demo
    '<html><head><title>This is a python demo page</title></head>
    <body>
    <p class="title"><b>The demo python introduces several python courses.</b></p>
    <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
    </body></html>'
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(demo,"html.parser")  # the parser name uses a dot; "html,parser" raises bs4.FeatureNotFound
    >>> for link in soup.find_all('a'):  # the trailing colon is required
        print(link.get('href'))

        
    http://www.icourse163.org/course/BIT-268001
    http://www.icourse163.org/course/BIT-1001870001

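    4. Searching HTML content with the bs4 library

    <>.find_all(name,attrs,recursive,string,**kwargs)

    Returns a list-like object holding the search results

    name: the tag name to search for (a string, a list of strings, True, or a compiled regular expression)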

    >>> soup.find_all('a')
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
    >>> soup.find_all(['a','b'])
    [<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
    >>> for tag in soup.find_all(True):
        print(tag.name)
    
        
    html
    head
    title
    body
    p
    b
    p
    a
    a
    >>> import re
    >>> for tag in soup.find_all(re.compile('b')):
        print(tag.name)
    
        
    body
    b

    attrs: a value to match against tag attributes; a specific attribute can be named with a keyword argument

    >>> soup.find_all('p','course')
    [<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
    >>> soup.find_all(id='link1')
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
    >>> soup.find_all(id=re.compile('link'))
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

    recursive: whether to search all descendants (default True); with False, only direct children are searched

    >>> soup.find_all('a',recursive=False)
    []
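The empty result above follows from the fact that no <a> tag is a direct child of the top-level soup object. A short sketch on an invented snippet shows the contrast when the search starts from the tag that directly contains the link.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet (an assumption) with one link nested inside <p>.
html = '<html><body><p>text <a href="#x">x</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Only direct children of the soup object are checked: that is just <html>.
top_level = soup.find_all("a", recursive=False)

# <a> is a direct child of <p>, so the non-recursive search finds it here.
from_parent = soup.p.find_all("a", recursive=False)

# The default (recursive=True) searches all descendants.
everywhere = soup.find_all("a")

print(top_level, from_parent, everywhere)
```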

    string: a value to match against the text between <>...</>

    >>> soup.find_all(string = 'Basic Python')
    ['Basic Python']
    >>> soup.find_all(string = re.compile("python"))
    ['This is a python demo page', 'The demo python introduces several python courses.']

    >>> soup(string = 'Basic Python')
    ['Basic Python']

    Related methods (same parameters as find_all):

    <>.find(), <>.find_parents(), <>.find_parent(), <>.find_next_sibling(s)(), <>.find_previous_sibling(s)()
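As a rough illustration of the related methods, the sketch below calls a few of them on an invented snippet. The singular forms return one node (or None when nothing matches), while the plural forms return a list.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet (an assumption) modelled loosely on the demo page.
html = ('<p class="course"><a id="link1" href="#1">Basic Python</a> and '
        '<a id="link2" href="#2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                    # first match only, not a list
parent = first.find_parent("p")           # nearest ancestor named <p>
nxt = first.find_next_sibling("a")        # skips the " and " text node
prev = nxt.find_previous_sibling("a")     # back to the first link
missing = soup.find("table")              # None when there is no match

print(first["id"], parent["class"], nxt["id"], prev["id"], missing)
```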

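    III. A targeted crawler for Chinese university rankings

    Technical approach: requests + bs4

    Feasibility: check the site's robots protocol (robots.txt)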
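One way to check the robots protocol programmatically is the standard-library urllib.robotparser module. The sketch below parses an invented robots.txt from memory so that it runs offline; the rules and the example.com URLs are assumptions, not the ranking site's actual policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration).
rules = """User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # against a live site one would call set_url(...) and read() instead

allowed = rp.can_fetch("*", "http://example.com/ranking.html")
blocked = rp.can_fetch("*", "http://example.com/private/data.html")

print(allowed, blocked)
```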

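    Step 1: fetch the page content: getHTMLText()

    Step 2: extract the data into a suitable data structure: fillUnivList()

    Step 3: use the data structure to print the results: printUnivList()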

    import requests
    from bs4 import BeautifulSoup
    import bs4
    
    def getHTMLText(url):
        """Return the page text, or "" if the request fails."""
        try:
            r = requests.get(url,timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            return ""
        
    def fillUnivList(ulist,html):
        """Append [rank, name, score] for each table row to ulist."""
        soup = BeautifulSoup(html,'html.parser')
        tbody = soup.find('tbody')
        if tbody is None:  # fetch failed or the page layout changed
            return
        for tr in tbody.children:
            if isinstance(tr,bs4.element.Tag):  # skip the strings between rows
                tds = tr('td')
                ulist.append([tds[0].string,tds[1].string,tds[3].string])
    
    def printUnivList(ulist,num):
        # Header columns: rank, university name, total score
        print("{:^10}\t{:^6}\t{:^10}".format("排名","学校名称","总分"))
        for u in ulist[:num]:  # slicing avoids an IndexError on short lists
            print("{:^10}\t{:^6}\t{:^10}".format(u[0],u[1],u[2]))
            
    
    def main():
        uinfo = []
        url='http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html'
        html = getHTMLText(url)
        fillUnivList(uinfo,html)
        printUnivList(uinfo,20)
    main()
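The row-extraction step can be tried offline on a small table. This is a sketch under assumptions: the table below is invented, and it only imitates the column layout the crawler expects (rank, name, region, score).

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical table (an assumption) with the same column order as the crawler expects.
html = """<table><tbody>
<tr><td>1</td><td>Univ A</td><td>Beijing</td><td>94.6</td></tr>
<tr><td>2</td><td>Univ B</td><td>Beijing</td><td>76.5</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.find("tbody")

# The children include the newline strings between rows, not only <tr> tags.
child_count = len(list(tbody.children))

rows = []
for tr in tbody.children:
    if isinstance(tr, bs4.element.Tag):   # keep tags, drop the newline strings
        tds = tr("td")                    # shorthand for tr.find_all("td")
        rows.append([tds[0].string, tds[1].string, tds[3].string])

print(child_count, rows)
```

The isinstance check matters because iterating over .children yields every node, including whitespace, and a string node cannot be searched for td cells.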

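    Optimized version (the university-name column is padded with the full-width CJK space, chr(12288), so that Chinese text lines up):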

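    import requests
    from bs4 import BeautifulSoup
    import bs4
    
    def getHTMLText(url):
        """Return the page text, or "" if the request fails."""
        try:
            r = requests.get(url,timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            return ""
        
    def fillUnivList(ulist,html):
        """Append [rank, name, score] for each table row to ulist."""
        soup = BeautifulSoup(html,'html.parser')
        tbody = soup.find('tbody')
        if tbody is None:  # fetch failed or the page layout changed
            return
        for tr in tbody.children:
            if isinstance(tr,bs4.element.Tag):  # skip the strings between rows
                tds = tr('td')
                ulist.append([tds[0].string,tds[1].string,tds[3].string])
    
    def printUnivList(ulist,num):
        # {3} supplies the fill character for column 1: the full-width space
        tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
        # Header columns: rank, university name, total score
        print(tplt.format("排名","学校名称","总分",chr(12288)))
        for u in ulist[:num]:  # slicing avoids an IndexError on short lists
            print(tplt.format(u[0],u[1],u[2],chr(12288)))
            
    
    def main():
        uinfo = []
        url='http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html'
        html = getHTMLText(url)
        fillUnivList(uinfo,html)
        printUnivList(uinfo,20)
    main()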
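The alignment trick in the optimized template relies on a nested replacement field: the fill character for one column is itself passed as a format argument. A minimal sketch, with an invented row and narrower column widths than the crawler uses:

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000, as wide as one CJK character

# Field 1 is centred in 8 columns and padded with argument 3.
template = "{0:^6}|{1:{3}^8}|{2:^6}"

# Hypothetical row (an assumption): rank, university name, score.
row = template.format("1", "清华大学", "94.6", FULLWIDTH_SPACE)
name_cell = row.split("|")[1]

print(row)
```

With the default ASCII space as padding, a cell mixes half-width padding with full-width characters, so columns drift in most terminals; padding with the full-width space keeps every character in the cell the same width.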
  • Original article: https://www.cnblogs.com/kmxojer/p/11436741.html