  • Data cleaning: cleaning Weibo post content

    Extract the post text together with its emoticons (taken from the alt attribute of each img tag).

    #!/usr/bin/env python
    # encoding: utf-8
    from bs4 import BeautifulSoup
    from lxml import html

    # Sample Weibo fragment. The string is named html_str so it does not shadow
    # the lxml "html" module; \u200b is the zero-width space after the text.
    html_str = """
    <div><span class="url-icon"><img alt="[馋嘴]" src="//h5.sinaimg.cn/m/emoticon/icon/default/d_chanzui-ad3f4f182c.png" style="1em; height:1em;"/></span>听着就很好吃\u200b</div>
    """


    def parse_div(div_tag):
        # BeautifulSoup version: walk the children recursively, keeping text
        # nodes and replacing any tag that has an alt attribute with its value.
        result = []
        for content in div_tag.contents:
            if isinstance(content, str):
                result.append(content.replace('\n', '').replace(' ', ''))
            elif content.has_attr('alt'):
                result.append(content.get('alt', ''))
            else:
                result.append(parse_div(content))
        return ''.join(result)


    # Best solution
    def parse_xpath(htmlstr):
        # lxml version: a single XPath selects every text node and every alt
        # attribute, in document order.
        root = html.fromstring(htmlstr)
        nodes = root.xpath(".//text()|.//@alt")
        return ''.join(i.replace('\n', '').replace(' ', '').replace('\u200b', '') for i in nodes)


    def main():
        bs = BeautifulSoup(html_str, 'html.parser')
        print(parse_div(bs.find('div')))
        print(parse_xpath(html_str))


    if __name__ == '__main__':
        main()
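
If neither bs4 nor lxml is installed, the same idea (collect text nodes and alt attribute values in document order, then strip whitespace) can be sketched with the standard library's html.parser. This is only an illustrative variant, not part of the original post; the names AltTextExtractor and extract_text are hypothetical.

```python
from html.parser import HTMLParser


class AltTextExtractor(HTMLParser):
    """Collects text nodes and alt attribute values in document order."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        # Self-closing tags such as <img .../> also end up here, because the
        # default handle_startendtag forwards to handle_starttag.
        for name, value in attrs:
            if name == 'alt' and value:
                self.parts.append(value)

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(htmlstr):
    parser = AltTextExtractor()
    parser.feed(htmlstr)
    parser.close()  # flush any text still buffered by the parser
    text = ''.join(parser.parts)
    # Remove newlines, spaces and zero-width spaces.
    for ch in ('\n', ' ', '\u200b'):
        text = text.replace(ch, '')
    return text
```

Called on a Weibo fragment like the sample in this post, extract_text should return the emoticon label followed by the post text, with the zero-width space removed.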
    
    
    
    
    
    
  • Original article: https://www.cnblogs.com/c-x-a/p/9340620.html