  • Python web scraping and data cleaning: removing hyperlinks

    Sometimes the data we need to clean contains hyperlinks. How do we remove them? Take the following HTML as an example:

    <div class="lot-page-details"><ul class="info-list"><li class="lot-info-item"><p><strong class="section-header">Provenance</strong></p><p>Brand New
    Gallery, Milan<br/>Acquired from the above by the present owner</p></li><li class="lot-info-item"><p><strong class="section-header">Exhibited</strong>
    </p><p>Milan, Brand New Gallery, <em>This is the story of America. Everybody's doing what they think they're supposed to do</em>, November 21, 2013
    - January 11, 2014</p></li><li class="artist-biography"><p><strong class="section-header">Artist Bio
    </strong></p><a href="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a><p class="artist-info">American • 1983
    </p><div class="follow-artist" data-artist-id="12106" role="button" tabindex="0">
    <span class="icon"></span><span class="tooltip">Follow</span></div><div class="artist-bio"><p>

    <p>New York-based artist Ethan Cook is known for his abstract paintings on self-produced canvases. More recently, he has used handwoven strips of
    cotton and linen to create painterly compositions. Cook's woven canvases are contemporary in their minimalist focus on shape and color while referencing
    one of the most traditional art forms, weaving. Cook weaves his own canvases on a loom and juxtaposes these with store-bought canvas sheets
    in abstract arrangements. For the artist, the surface of the canvas itself becomes the focus of his practice. Using simple geometric shapes and a
    limited color palate, Cook's works nurture structural simplicity.</p></p><a href="/artist/12106/ethan-cook"><div class="lot-essay-button artist"><em>View More Works</em></div></a></div></li></ul></div>

    Method 1:

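      Use a regular-expression replacement to change href to hre1f; that is all it takes. With the attribute renamed, the <a> tags no longer act as links.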
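      A minimal sketch of this first method, assuming the page source is already held in a string. The function name disable_links and the sample string are illustrative and do not come from the original post.

```python
import re

def disable_links(html: str) -> str:
    # Rename "href" wherever it is directly followed by "=" (i.e. used as an
    # attribute name), so the <a> tags stay but no longer carry a link target.
    return re.sub(r'\bhref(?=\s*=)', 'hre1f', html)

sample = '<a href="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a>'
print(disable_links(sample))
```

      The tags themselves stay in the markup, so this suits cases where only the link behaviour needs to go.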

    Method 2:

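        import re

        # result_div is the element scraped earlier (not shown in this post)
        item_desc = str(result_div)
        # Collect the text between every '<' and '>' pair, i.e. each tag's contents
        result_div_list = re.findall('<(.*?)>', item_desc)
        for ii in result_div_list:
            if 'href' in ii:
                # Drop the whole opening tag that carries href; the closing </a> stays
                item_desc = item_desc.replace('<' + ii + '>', '')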
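    The second method can also be sketched as a single regular expression that removes both the opening <a ...> tags and the closing </a> tags while keeping the content between them. The names A_TAG and strip_links are hypothetical. Regular expressions are only a rough tool for HTML, so this assumes reasonably well-formed markup like the sample above.

```python
import re

# Matches an opening <a ...> tag or a closing </a> tag, case-insensitively.
# The tag name must be followed by whitespace or '>', so <abbr> etc. are untouched.
A_TAG = re.compile(r'</?a(?:\s[^>]*)?>', re.IGNORECASE)

def strip_links(html: str) -> str:
    # Remove only the anchor tags themselves; the content they wrap is kept.
    return A_TAG.sub('', html)

sample = ('<p>Bio</p><a href="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a>'
          '<p class="artist-info">American</p>')
print(strip_links(sample))
```

    Unlike renaming the attribute, this leaves no trace of the link in the cleaned text.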

    Recorded here for future study and reference.

  • Original article: https://www.cnblogs.com/xuchunlin/p/8135519.html