  • Python web scraping: removing hyperlinks during data cleaning

    Sometimes the data we need to clean contains hyperlinks. How do we get rid of them? Take the following fragment as an example:

    <div class="lot-page-details"><ul class="info-list">
    <li class="lot-info-item"><p><strong class="section-header">Provenance</strong></p><p>Brand New Gallery, Milan<br/>Acquired from the above by the present owner</p></li>
    <li class="lot-info-item"><p><strong class="section-header">Exhibited</strong></p><p>Milan, Brand New Gallery, <em>This is the story of America. Everybody's doing what they think they're supposed to do</em>, November 21, 2013 - January 11, 2014</p></li>
    <li class="artist-biography"><p><strong class="section-header">Artist Bio</strong></p>
    <a href="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a>
    <p class="artist-info">American • 1983</p>
    <div class="follow-artist" data-artist-id="12106" role="button" tabindex="0"><span class="icon"></span><span class="tooltip">Follow</span></div>
    <div class="artist-bio"><p>
    <p>New York-based artist Ethan Cook is known for his abstract paintings on self-produced canvases. More recently, he has used handwoven strips of cotton and linen to create painterly compositions. Cook's woven canvases are contemporary in their minimalist focus on shape and color while referencing one of the most traditional art forms, weaving. Cook weaves his own canvases on a loom and juxtaposes these with store-bought canvas sheets in abstract arrangements. For the artist, the surface of the canvas itself becomes the focus of his practice. Using simple geometric shapes and a limited color palate, Cook's works nurture structural simplicity.</p></p>
    <a href="/artist/12106/ethan-cook"><div class="lot-essay-button artist"><em>View More Works</em></div></a></div></li></ul></div>

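    Method 1:

      Use a regular-expression substitution that replaces href with hre1f. The <a> tags stay in the markup, but without a valid href attribute they no longer act as links.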
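
      A minimal sketch of this substitution, assuming the HTML is held in a plain string. The function name neutralize_links and the exact pattern are illustrative choices, not taken from the original post:

```python
import re

def neutralize_links(html: str) -> str:
    """Rename every href attribute to hre1f so anchors stop acting as links."""
    # \b avoids matching inside a longer word such as "xhref"; the lookahead
    # requires a following "=" so the bare word "href" in text is untouched.
    return re.sub(r'\bhref(?=\s*=)', 'hre1f', html)

sample = '<a href="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a>'
print(neutralize_links(sample))
# prints: <a hre1f="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a>
```

      This keeps the link text and the surrounding structure intact, which is handy when only the clickable behaviour has to go.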

    Method 2:

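        import re

        # result_div is the element (or HTML string) scraped earlier
        item_desc = str(result_div)
        # collect the text between < and > for every tag
        result_div_list = re.findall('<(.*?)>', item_desc)
        for ii in result_div_list:
            if 'href' in ii:
                # remove the whole opening tag, angle brackets included
                item_desc = item_desc.replace('<' + ii + '>', '')
        # remove the closing tags left behind by the deleted <a ...> tags
        item_desc = item_desc.replace('</a>', '')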
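
    The second method can also be wrapped in a self-contained function and tried on a small input. This is only a sketch: the name strip_links is hypothetical, and it assumes every href sits on an <a> tag, so removing all </a> closers is safe.

```python
import re

def strip_links(html: str) -> str:
    """Remove opening tags that carry an href, plus the </a> closers."""
    cleaned = html
    # Same pattern as in the snippet above: the text between < and > per tag.
    for tag in re.findall(r'<(.*?)>', html):
        if 'href' in tag:
            # Delete the full opening tag, angle brackets included.
            cleaned = cleaned.replace('<' + tag + '>', '')
    # Assumption: hrefs only appear on <a> tags, so every </a> can go.
    return cleaned.replace('</a>', '')

sample = '<p>Bio</p><a href="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a>'
print(strip_links(sample))
# prints: <p>Bio</p><h4>Ethan Cook</h4>
```

    Unlike the first method, this drops the anchor tags entirely and keeps only what was inside them.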

    Noted here for future study and reference.

  • Original article: https://www.cnblogs.com/xuchunlin/p/8135519.html