  • Python Web Scraping: Removing Hyperlinks During Data Cleaning

    Sometimes the data we need to clean contains hyperlinks. How do we remove them? Take the following HTML as an example:

    <div class="lot-page-details"><ul class="info-list">
    <li class="lot-info-item"><p><strong class="section-header">Provenance</strong></p><p>Brand New Gallery, Milan<br/>Acquired from the above by the present owner</p></li>
    <li class="lot-info-item"><p><strong class="section-header">Exhibited</strong></p><p>Milan, Brand New Gallery, <em>This is the story of America. Everybody's doing what they think they're supposed to do</em>, November 21, 2013 - January 11, 2014</p></li>
    <li class="artist-biography"><p><strong class="section-header">Artist Bio</strong></p><a href="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a><p class="artist-info">American • 1983</p>
    <div class="follow-artist" data-artist-id="12106" role="button" tabindex="0"><span class="icon"></span><span class="tooltip">Follow</span></div>
    <div class="artist-bio"><p>
    <p>New York-based artist Ethan Cook is known for his abstract paintings on self-produced canvases. More recently, he has used handwoven strips of cotton and linen to create painterly compositions. Cook's woven canvases are contemporary in their minimalist focus on shape and color while referencing one of the most traditional art forms, weaving. Cook weaves his own canvases on a loom and juxtaposes these with store-bought canvas sheets in abstract arrangements. For the artist, the surface of the canvas itself becomes the focus of his practice. Using simple geometric shapes and a limited color palate, Cook's works nurture structural simplicity.</p></p>
    <a href="/artist/12106/ethan-cook"><div class="lot-essay-button artist"><em>View More Works</em></div></a></div></li></ul></div>

    Method 1:

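      Use a regular-expression substitution to replace href with hre1f. The tags stay in place, but with the attribute renamed they no longer act as links.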
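    A minimal sketch of this substitution, assuming the HTML is already available as a string. The function name `disable_links` and the sample string are illustrative and do not come from the original post:

```python
import re

def disable_links(html: str) -> str:
    """Rename every href attribute so <a> tags stop working as links."""
    # \b keeps the match anchored to the start of the attribute name.
    return re.sub(r'\bhref=', 'hre1f=', html)

sample = '<a href="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a>'
print(disable_links(sample))
# prints: <a hre1f="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a>
```

    The markup is otherwise unchanged, so this approach is easy to undo, but the link targets remain visible in the cleaned data.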

    Method 2:

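    import re

    # result_div is assumed to be the scraped element that holds the HTML above.
    # Collect the contents of every tag, i.e. the text between < and >.
    result_div_list = re.findall('<(.*?)>', str(result_div))
    item_desc = str(result_div)
    for ii in result_div_list:
        if 'href' in ii:
            # Remove the whole opening tag, e.g. <a href="...">, and keep the
            # running result so that every link is removed, not only the last one.
            item_desc = item_desc.replace('<' + ii + '>', '')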
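    Method 2 targets only the opening tags that contain href, so the closing </a> tags are left untouched. As a variation (not the original author's code), the sketch below strips both the opening and closing anchor tags in one pass and keeps the content between them. The names `A_TAG` and `strip_links` are illustrative assumptions:

```python
import re

# Matches an opening <a ...> tag or a closing </a> tag, case-insensitively.
# The \b prevents matches on other tags that start with "a", such as <abbr>.
A_TAG = re.compile(r'</?a\b[^>]*>', re.IGNORECASE)

def strip_links(html: str) -> str:
    """Remove anchor tags but keep whatever they wrapped."""
    return A_TAG.sub('', html)

sample = '<p>Artist Bio</p><a href="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a>'
print(strip_links(sample))
# prints: <p>Artist Bio</p><h4>Ethan Cook</h4>
```

    A regular expression is adequate for simple markup like this example. For messier pages, an HTML parser is usually the safer choice.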

    Recorded here for future reference.

  • Original article: https://www.cnblogs.com/xuchunlin/p/8135519.html