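  • Python web scraping, data cleaning: removing hyperlinks

    Sometimes the data we need to clean contains hyperlinks. How do we get rid of them? Take the following HTML as an example: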

    <div class="lot-page-details"><ul class="info-list">
    <li class="lot-info-item"><p><strong class="section-header">Provenance</strong></p><p>Brand New Gallery, Milan<br/>Acquired from the above by the present owner</p></li>
    <li class="lot-info-item"><p><strong class="section-header">Exhibited</strong></p><p>Milan, Brand New Gallery, <em>This is the story of America. Everybody's doing what they think they're supposed to do</em>, November 21, 2013 - January 11, 2014</p></li>
    <li class="artist-biography"><p><strong class="section-header">Artist Bio</strong></p><a href="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a><p class="artist-info">American • 1983</p>
    <div class="follow-artist" data-artist-id="12106" role="button" tabindex="0"><span class="icon"></span><span class="tooltip">Follow</span></div>
    <div class="artist-bio"><p>

    <p>New York-based artist Ethan Cook is known for his abstract paintings on self-produced canvases. More recently, he has used handwoven strips of cotton and linen to create painterly compositions. Cook's woven canvases are contemporary in their minimalist focus on shape and color while referencing one of the most traditional art forms, weaving. Cook weaves his own canvases on a loom and juxtaposes these with store-bought canvas sheets in abstract arrangements. For the artist, the surface of the canvas itself becomes the focus of his practice. Using simple geometric shapes and a limited color palate, Cook's works nurture structural simplicity.</p></p>
    <a href="/artist/12106/ethan-cook"><div class="lot-essay-button artist"><em>View More Works</em></div></a></div></li></ul></div>

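    Method 1:

      Use a regular-expression substitution: replace href with hre1f. The attribute is then no longer recognized as a link target, and that is all it takes.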
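
    A minimal sketch of this substitution, assuming the HTML is already available as a string. The function name disable_links and the sample string are illustrative and do not come from the original post.

```python
import re

def disable_links(html: str) -> str:
    """Rename every href attribute to hre1f so the anchors stop acting as links."""
    # \b keeps the pattern from matching inside a longer word such as "xhref".
    # Note: this is a plain text substitution, so "href=" appearing in visible
    # text (not only in tags) would be renamed as well.
    return re.sub(r'\bhref=', 'hre1f=', html)

sample = '<a href="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a>'
cleaned = disable_links(sample)
print(cleaned)
```

    The tags themselves stay in the markup; only the attribute name changes, so the text inside the anchor is preserved.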

    Method 2:

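    import re

    # result_div is assumed to be the scraped element (or string) holding the HTML above
    result_div_list = re.findall('<(.*?)>', str(result_div))
    if 'href' in str(result_div_list):
        item_desc = str(result_div)
        for ii in result_div_list:
            if 'href' in ii:
                # drop the whole opening tag, angle brackets included;
                # the matching </a> closing tags are left in place
                item_desc = item_desc.replace('<' + ii + '>', '')
    else:
        item_desc = result_div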
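
    As a variation on Method 2, the opening and closing anchor tags can be removed together in a single pass while the text between them is kept. This is only a sketch under the assumption that the HTML is a string and that no attribute value contains a ">" character; the names ANCHOR_TAG and strip_links are hypothetical.

```python
import re

# Matches an opening <a ...> tag or a closing </a> tag.
# \b after "a" prevents matches on other tags such as <abbr> or <article>.
ANCHOR_TAG = re.compile(r'</?a\b[^>]*>', re.IGNORECASE)

def strip_links(html: str) -> str:
    """Remove anchor tags but keep whatever they wrapped."""
    return ANCHOR_TAG.sub('', html)

sample = ('<p>Artist Bio</p>'
          '<a href="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a>'
          '<p>American</p>')
item_desc = strip_links(sample)
print(item_desc)
```

    For heavily nested or malformed markup, a real HTML parser is usually a safer choice than a regular expression.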

    Recorded here for future study and reference.

  • Original article: https://www.cnblogs.com/xuchunlin/p/8135519.html