  • Python crawler: extracting div tags with BeautifulSoup

    # -*- coding:utf-8 -*-
    # Python 3
    # XiaoDeng
    # http://tieba.baidu.com/p/2460150866
    # Tag operations


    from bs4 import BeautifulSoup
    import urllib.request
    import re


    # If the source is a URL, the page can be read this way
    #html_doc = "http://tieba.baidu.com/p/2460150866"
    #req = urllib.request.Request(html_doc)
    #webpage = urllib.request.urlopen(req)
    #html = webpage.read()



    html="""
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="xiaodeng"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    <a href="http://example.com/lacie" class="sister" id="xiaodeng">Lacie</a>
    and they lived at the bottom of a well.</p>
    <div class="ntopbar_loading"><img src="http://simg.sinajs.cn/blog7style/images/common/loading.gif">加载中…</div>

    <div class="SG_connHead">
                <span class="title" comp_title="个人资料">个人资料</span>
                <span class="edit">
                            </span>
    <div class="info_list">
                                       <ul class="info_list1">
                        <li><span class="SG_txtc">博客等级:</span><span id="comp_901_grade"><img src="http://simg.sinajs.cn/blog7style/images/common/sg_trans.gif" real_src="http://simg.sinajs.cn/blog7style/images/common/number/9.gif"  /></span></li>
                        <li><span class="SG_txtc">博客积分:</span><span id="comp_901_score"><strong>0</strong></span></li>
                        </ul>
                        <ul class="info_list2">
                        <li><span class="SG_txtc">博客访问:</span><span id="comp_901_pv"><strong>3,971</strong></span></li>
                        <li><span class="SG_txtc">关注人气:</span><span id="comp_901_attention"><strong>0</strong></span></li>
                        <li><span class="SG_txtc">获赠金笔:</span><strong id="comp_901_d_goldpen">0支</strong></li>
                        <li><span class="SG_txtc">赠出金笔:</span><strong id="comp_901_r_goldpen">0支</strong></li>
                        <li class="lisp" id="comp_901_badge"><span class="SG_txtc">荣誉徽章:</span></li>
                        </ul>
                      </div>
    <div class="atcTit_more"><span class="SG_more"><a href="http://blog.sina.com.cn/" target="_blank">更多&gt;&gt;</a></span></div>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html, 'html.parser')   # document object



    # Find the div whose class name is xxx and whose text content is hahaha
    for k in soup.find_all('div',class_='atcTit_more'):#,string='更多'
        print(k)
        #<div class="atcTit_more"><span class="SG_more"><a href="http://blog.sina.com.cn/" target="_blank">更多&gt;&gt;</a></span></div>
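The listing filters only by class name; the text condition mentioned in its comment is left commented out. Below is a minimal sketch of one way to apply both conditions, assuming a partial text match is what is wanted. The helper name `find_divs`, the shortened HTML sample, and the second `atcTit_more` div are made up for illustration and are not part of the original post.

```python
import re
from bs4 import BeautifulSoup

# Shortened sample: one matching div, one with the same class but other text,
# and one with a different class.
html = """
<div class="atcTit_more"><span class="SG_more"><a href="http://blog.sina.com.cn/" target="_blank">更多&gt;&gt;</a></span></div>
<div class="atcTit_more"><span>other</span></div>
<div class="ntopbar_loading">加载中…</div>
"""
soup = BeautifulSoup(html, 'html.parser')


def find_divs(soup, class_name, text_pattern):
    """Return divs with the given class whose visible text matches the pattern."""
    pattern = re.compile(text_pattern)
    return [
        div for div in soup.find_all('div', class_=class_name)
        if pattern.search(div.get_text())
    ]


matches = find_divs(soup, 'atcTit_more', '更多')
for div in matches:
    # Print the link target and the visible text of each matching div
    print(div.a['href'], div.get_text(strip=True))
```

Checking `get_text()` with a regular expression keeps the text test independent of how deeply the text is nested inside the div, and it puts the `re` import from the listing to use.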
  • Original article: https://www.cnblogs.com/yizhenfeng168/p/6987620.html