  • 15 Beautiful Soup: extracting data with find_all() in detail

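    # 1. Get all tr tags
    # 2. Get the second tr tag
    # 3. Get all tr tags whose class is "even"
    # 4_1. Extract all a tags whose id and class are both "test"
    # 4_2. Get the href attribute value of every a tag
    # 5. Get all the job information (plain text)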


    # 1. Get all tr tags
    from bs4 import BeautifulSoup

    text = """
    <table class="tablelist" cellpadding="0" cellspacing="0">
        <tbody>
            <tr class="h">
                <td class="l" width="374">Job title</td>
                <td>Category</td>
                <td>Openings</td>
                <td>Location</td>
                <td>Posted</td>
            </tr>
            <tr class="even">
                <td class="l square"><a target="blank" href="https://www.baidu.com/">Development Engineer (Shanghai 1)</a></td>
                <td>Technical</td>
                <td>1</td>
                <td>Shanghai</td>
                <td>2020-1-1</td>
            </tr>
            <tr class="odd">
                <td class="l square"><a target="blank" href="https://www.baidu.com/">Engineer (Beijing 2)</a></td>
                <td>Technical</td>
                <td>2</td>
                <td>Beijing</td>
                <td>2020-2-2</td>
            </tr>
            <tr class="even">
                <td class="l square"><a target="blank" href="https://www.baidu.com/">Engineer (Shangrao 3)</a></td>
                <td>Management</td>
                <td>3</td>
                <td>Shangrao</td>
                <td>2020-3-3</td>
            </tr>
        </tbody>
    </table>
    """

    # The 'lxml' parser needs the lxml package; 'html.parser' works without it
    soup = BeautifulSoup(text, 'lxml')
    # find_all('tr') returns every tr tag in the document, in order
    trs = soup.find_all('tr')
    for tr in trs:
        print(tr)
        print('='*30)
    # 2. Get the second tr tag
    # limit caps how many matching tags find_all() returns
    tr2 = soup.find_all('tr', limit=2)[1]
    print(tr2)
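
The effect of limit can be checked on a smaller table. The following is a minimal sketch, assuming a made-up three-row table and the built-in 'html.parser' (used here only so the snippet runs without lxml). It also shows that find() returns the same tag as find_all() with limit=1.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the table from the article
html = "<table><tr><td>a</td></tr><tr><td>b</td></tr><tr><td>c</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# limit=2 stops the search after two matches
first_two = soup.find_all("tr", limit=2)
second_tr = first_two[1]
second_text = second_tr.get_text()

# find() returns the first match itself (or None) instead of a list
first_tr = soup.find("tr")
same_first = first_tr is soup.find_all("tr", limit=1)[0]

print(len(first_two), second_text, same_first)
```

Indexing with [1] still raises IndexError when fewer than two tags match, so a length check is worth adding on real pages.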
    # 3. Get all tr tags whose class is "even"
    # Method 1: class is a Python keyword, so the keyword argument is spelled class_
    trs = soup.find_all('tr', class_='even')
    for tr in trs:
        print(tr)
        print('='*30)
    # Method 2: pass the attribute in the attrs dictionary
    trs = soup.find_all('tr', attrs={'class': 'even'})
    for tr in trs:
        print(tr)
        print('='*30)
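
The td cells in the sample carry two classes ("l square"), so it helps to see how class_ treats multi-valued class attributes. The sketch below uses made-up markup and 'html.parser' as assumptions. A single class name matches any tag that has that class, the full attribute string matches only in its original order, and a reordered string matches nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical rows; only the class attributes matter here
html = """
<table>
  <tr class="even"><td class="l square">first</td></tr>
  <tr class="odd"><td class="l">second</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches every td that has "l" among its classes
by_one_class = soup.find_all("td", class_="l")
# Matches only the td whose class attribute is exactly "l square"
by_full_value = soup.find_all("td", class_="l square")
# The same classes in a different order do not match
by_reordered = soup.find_all("td", class_="square l")

print(len(by_one_class), len(by_full_value), len(by_reordered))  # 2 1 0
```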
    # 4_1. Extract all a tags whose id and class are both "test"
    # Note: the sample table above contains no such a tags, so both calls print []
    # Method 1: attrs dictionary
    alists1 = soup.find_all('a', attrs={'id':'test', 'class':'test'})
    print(alists1)
    # Method 2: keyword arguments
    alists2 = soup.find_all('a', class_='test', id='test')
    print(alists2)
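
To see the two forms return something, here is a small sketch with made-up links (the hrefs and ids are hypothetical, and 'html.parser' is an assumption). Only the first link satisfies both conditions, and both call styles select it.

```python
from bs4 import BeautifulSoup

# Hypothetical links: only the first has id="test" AND class="test"
html = """
<a id="test" class="test" href="/one">one</a>
<a id="test" class="other" href="/two">two</a>
<a id="other" class="test" href="/three">three</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Conditions passed together are combined with AND
via_attrs = soup.find_all("a", attrs={"id": "test", "class": "test"})
via_kwargs = soup.find_all("a", id="test", class_="test")

hrefs_attrs = [a["href"] for a in via_attrs]
hrefs_kwargs = [a["href"] for a in via_kwargs]
print(hrefs_attrs, hrefs_kwargs)  # ['/one'] ['/one']
```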
    # 4_2. Get the href attribute value of every a tag
    ahs = soup.find_all('a')
    for ah in ahs:
        # Method 1: subscript access on the tag
        href1 = ah['href']
        print('href1={}'.format(href1))
        # Method 2: through the attrs dictionary
        href2 = ah.attrs['href']
        print('href2={}'.format(href2))
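
Both access styles raise KeyError when an a tag has no href. As a hedged alternative, the sketch below (made-up markup, 'html.parser' assumed) uses Tag.get(), which returns None for a missing attribute, and href=True, which restricts the search to tags that have the attribute at all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second a tag has no href attribute
html = '<a href="https://www.baidu.com/">with href</a><a name="anchor">no href</a>'
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")
# get() returns None instead of raising KeyError
hrefs = [a.get("href") for a in links]

# href=True keeps only tags that define the attribute
with_href = soup.find_all("a", href=True)
safe_hrefs = [a["href"] for a in with_href]

print(hrefs)
print(safe_hrefs)
```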
    # 5. Get all the job information (plain text)
    trs = soup.find_all('tr')[1:]   # skip the header row; start from the second tr
    jobs = []
    for tr in trs:
        job = {}
        tds = tr.find_all('td')
        # .string works here because each td holds a single string
        # (or a single a tag that holds one)
        title = tds[0].string
        category = tds[1].string
        num = tds[2].string
        city = tds[3].string
        publish_time = tds[4].string
        job['title'] = title
        job['category'] = category
        job['num'] = num
        job['city'] = city
        job['time'] = publish_time
        jobs.append(job)

    print(jobs)
    # 5. Get all the job information (plain text), method 2 (recommended)
    trs = soup.find_all('tr')[1:]   # skip the header row; start from the second tr
    jobs = []
    for tr in trs:
        job = {}
        # All strings under tr, whitespace-only ones included:
        # infos = list(tr.strings)
        # All strings under tr, stripped, with whitespace-only ones skipped:
        infos = list(tr.stripped_strings)
        # print(infos)
        job['title'] = infos[0]
        job['category'] = infos[1]
        job['num'] = infos[2]
        job['city'] = infos[3]
        job['time'] = infos[4]
        jobs.append(job)

    print(jobs)
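
Because stripped_strings yields the cell texts in column order, the index-by-index assignment can be shortened by pairing the values with a list of keys. This is only a sketch of that variation: the two-row table is made up, 'html.parser' is an assumption, and the keys list is a name introduced for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical table with a header row and one data row
html = """
<table>
  <tr><td>Title</td><td>Category</td><td>Openings</td><td>Location</td><td>Posted</td></tr>
  <tr><td><a href="#">Engineer</a></td><td>Technical</td><td>2</td><td>Beijing</td><td>2020-2-2</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

keys = ["title", "category", "num", "city", "time"]
# zip() pairs each key with the text at the same position in the row
jobs = [dict(zip(keys, tr.stripped_strings)) for tr in soup.find_all("tr")[1:]]

print(jobs)
```

zip() stops at the shorter input, so a row with a missing cell produces a dictionary with fewer keys instead of an IndexError.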

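    Appendix: the string, strings and stripped_strings attributes and the get_text() method:

    string:            the single string inside a tag (following a lone child tag if necessary), returned as one string; the value is None when the tag has several children.

    strings:           all descendant strings of a tag, returned as a generator.

    stripped_strings:  all descendant strings of a tag with surrounding whitespace stripped and whitespace-only strings skipped, returned as a generator.

    get_text():        all descendant strings of a tag joined together and returned as one ordinary string rather than a list.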
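
The four accessors can be compared side by side on one cell. The sketch below assumes a made-up td with two child tags and surrounding whitespace, parsed with 'html.parser'.

```python
from bs4 import BeautifulSoup

# Hypothetical cell with two child tags separated by whitespace
html = "<td>  <a href='#'>Engineer</a>  <span>Beijing</span>  </td>"
soup = BeautifulSoup(html, "html.parser")
td = soup.td

# td has several children, so .string is None; the a tag has exactly one string
td_string = td.string
a_string = td.a.string

# .strings keeps whitespace-only pieces; .stripped_strings drops them
all_strings = list(td.strings)
clean_strings = list(td.stripped_strings)

# get_text() joins every piece into one str; separator and strip are optional
joined = td.get_text()
joined_clean = td.get_text(" ", strip=True)

print(td_string, a_string)
print(clean_strings)   # ['Engineer', 'Beijing']
print(joined_clean)    # Engineer Beijing
```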

  • Original article: https://www.cnblogs.com/sruzzg/p/13092085.html