  • Python web scraping: removing hyperlinks during data cleaning

    Sometimes the data we need to clean contains hyperlinks. How do we get rid of them? Take the following snippet as an example:

    <div class="lot-page-details"><ul class="info-list"><li class="lot-info-item"><p><strong class="section-header">Provenance</strong></p><p>Brand New
    Gallery, Milan<br/>Acquired from the above by the present owner</p></li><li class="lot-info-item"><p><strong class="section-header">Exhibited</strong>
    </p><p>Milan, Brand New Gallery, <em>This is the story of America. Everybody's doing what they think they're supposed to do</em>, November 21, 2013
    - January 11, 2014</p></li><li class="artist-biography"><p><strong class="section-header">Artist Bio
    </strong></p><a href="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a><p class="artist-info">American • 1983
    </p><div class="follow-artist" data-artist-id="12106"
    role="button"
    tabindex="0">
    <span class="icon"></span><span class="tooltip">Follow</span></div><div class="artist-bio"><p>

    <p>New York-based artist Ethan Cook is known for his abstract paintings on self-produced canvases. More recently, he has used handwoven strips of
    cotton and linen to create painterly compositions. Cook's woven canvases are contemporary in their minimalist focus on shape and color while referencing
    one of the most traditional art forms, weaving. Cook weaves his own canvases on a loom and juxtaposes these with store-bought canvas sheets
    in abstract arrangements. For the artist, the surface of the canvas itself becomes the focus of his practice. Using simple geometric shapes and a
    limited color palate, Cook's works nurture structural simplicity.</p></p><a href="/artist/12106/ethan-cook"><div class="lot-essay-button artist"><em>View More Works</em></div></a></div></li></ul></div>

    Method 1:

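      Use a regular-expression replacement: rename href to hre1f. The a tags stay in the markup, but without a valid href attribute they no longer act as links.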
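A minimal sketch of Method 1, assuming the markup is available as a plain string. The function name disable_links and the sample string are illustrative, not from the original post; the pattern renames only attribute names, that is, href followed by an equals sign.

```python
import re

def disable_links(html: str) -> str:
    """Rename every href attribute to hre1f so anchors stop acting as links."""
    # The lookbehind skips names such as data-href or xhref;
    # the lookahead requires "=" so only attribute names are renamed.
    return re.sub(r'(?<![\w-])href(?=\s*=)', 'hre1f', html)

# Hypothetical sample taken from the snippet above
sample = '<a href="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a>'
cleaned = disable_links(sample)
print(cleaned)  # <a hre1f="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a>
```

This keeps the document structure untouched, which is handy when later parsing steps still rely on the surrounding tags.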

    Method 2:

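    import re

    # result_div: the scraped element (or HTML string) shown above, assumed to be defined earlier
    item_desc = str(result_div)
    # Collect the text between < and > of every tag
    result_div_list = re.findall('<(.*?)>', item_desc)
    for ii in result_div_list:
        if 'href' in ii:
            # Remove the whole opening tag that carries the href;
            # accumulate into item_desc so every such tag is removed, not just the last one
            item_desc = item_desc.replace('<' + ii + '>', '')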
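The loop above targets only tags that contain href, so the matching closing a tags stay in the output. As an alternative sketch (not part of the original post), a single substitution can drop both the opening and closing anchor tags while keeping whatever they enclose. The names ANCHOR_TAG and strip_anchor_tags and the sample string are hypothetical, and the pattern assumes attribute values do not contain a ">" character.

```python
import re

# Matches an opening <a ...> tag or a closing </a> tag, case-insensitively.
# Requiring whitespace or ">" right after "a" avoids tags such as <abbr>.
ANCHOR_TAG = re.compile(r'</?a(?:\s[^>]*)?>', re.IGNORECASE)

def strip_anchor_tags(html: str) -> str:
    """Remove <a ...> and </a> tags, leaving the enclosed markup in place."""
    return ANCHOR_TAG.sub('', html)

# Hypothetical sample modeled on the snippet above
sample = ('<p>Bio</p><a href="/artist/12106/ethan-cook"><h4>Ethan Cook</h4></a>'
          '<p class="artist-info">American</p>')
stripped = strip_anchor_tags(sample)
print(stripped)  # <p>Bio</p><h4>Ethan Cook</h4><p class="artist-info">American</p>
```

For heavily nested or malformed markup, an HTML parser is usually a safer choice than a regular expression.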

    Noted here for future study and reference.

  • Original article: https://www.cnblogs.com/xuchunlin/p/8135519.html