  • Parsing an HTML file with lxml and outputting a DataFrame

    The local HTML file consists of header cells <th> and table content cells <td>, each under a parent <tr> row node.

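    import pandas as pd
    from pandas.io.parsers import TextParser
    from lxml import etree

    # Read the local HTML file; the with-block closes the file handle afterwards.
    path = "C:/Users/Administrator/Desktop/11/ho_relation_tdd-enm2.html"
    with open(path, 'r', encoding="utf-8") as f:
        htmlf = f.read()
    doc = etree.HTML(htmlf)
    rows = doc.xpath('.//tr')
    # The first <tr> holds the <th> header cells; the remaining rows hold <td> data.
    header = rows[0].xpath(".//th/text()")
    data = [i.xpath(".//td/text()") for i in rows[1:]]
    # TextParser infers column types much as pd.read_csv does.
    df = TextParser(data, names=header).get_chunk()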
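
The snippet above collects cell text with `.//td/text()`, which returns nothing for an empty cell, so a row containing a blank `<td>` comes out shorter than the header. A minimal sketch of one way around this follows. It assumes a small inline table (`SAMPLE_HTML`, with invented column names) in place of the local file, and a hypothetical helper `table_to_dataframe`; neither is part of the original post. It also swaps `TextParser` for a plain `pd.DataFrame`, so no type inference happens.

```python
import pandas as pd
from lxml import etree

# Hypothetical inline table standing in for the local file; note the empty <td>.
SAMPLE_HTML = """
<table>
  <tr><th>cell</th><th>neighbor</th><th>count</th></tr>
  <tr><td>A1</td><td>B7</td><td>12</td></tr>
  <tr><td>A2</td><td></td><td>5</td></tr>
</table>
"""


def table_to_dataframe(html_text):
    """Build a DataFrame from the first <tr> (headers) and the remaining rows."""
    doc = etree.HTML(html_text)
    rows = doc.xpath(".//tr")
    # string(.) returns the full text of a cell, or "" when the cell is empty,
    # so every row keeps one value per <td> and the columns stay aligned.
    header = [th.xpath("string(.)").strip() for th in rows[0].xpath("./th")]
    data = [
        [td.xpath("string(.)").strip() for td in row.xpath("./td")]
        for row in rows[1:]
    ]
    return pd.DataFrame(data, columns=header)


df = table_to_dataframe(SAMPLE_HTML)
print(df)
```

All values in this sketch are strings. A numeric column can be converted afterwards with, for example, `pd.to_numeric(df["count"])`. pandas also provides `pd.read_html`, which returns a list of DataFrames for the tables it finds and may be enough when no custom XPath is needed.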
    
    
    
  • Original article: https://www.cnblogs.com/huangyz-xy/p/13622123.html