  • Simple Python crawler: parsing tables with pandas, including irregular tables

    url = http://www.hnu.edu.cn/xyxk/xkzy/zylb.htm

    Part of the table, as shown in the figure:

    Part of the HTML code:

    <table class="MsoNormalTable" style="353.0pt;margin-left:4.65pt;border-collapse:collapse;border:none;    mso-border-alt:solid windowtext .5pt;mso-padding-alt:0cm 5.4pt 0cm 5.4pt;    mso-border-insideh:.5pt solid windowtext;mso-border-insidev:.5pt solid windowtext" width="471" cellspacing="0" cellpadding="0" border="1">
     <tbody>
      <tr class="firstRow" style="mso-yfti-irow:0;mso-yfti-firstrow:yes;height:36.75pt">
       <td style="170.0pt;border:solid windowtext 1.0pt;mso-border-alt:            solid windowtext .5pt;padding:0cm 5.4pt 0cm 5.4pt;height:36.75pt" width="227"><p class="MsoNormal" style="text-align:center;margin-top:6.0pt;margin-right:0cm;            margin-bottom:6.0pt;margin-left:0cm;mso-para-margin-top:.5gd;mso-para-margin-right:            0cm;mso-para-margin-bottom:.5gd;mso-para-margin-left:0cm;            mso-pagination:widow-orphan"><strong><span style="font-size:9.0pt;font-family:            宋体;mso-bidi-font-family:宋体;mso-font-kerning:0pt">学院<span lang="EN-US">
            <o:p></o:p></span></span></strong></p></td>
       <td style="183.0pt;border:solid windowtext 1.0pt;            border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:            solid windowtext .5pt;padding:0cm 5.4pt 0cm 5.4pt;height:36.75pt" width="244" nowrap=""><p class="MsoNormal" style="text-align:center;margin-top:6.0pt;margin-right:0cm;            margin-bottom:6.0pt;margin-left:0cm;mso-para-margin-top:.5gd;mso-para-margin-right:            0cm;mso-para-margin-bottom:.5gd;mso-para-margin-left:0cm;            mso-pagination:widow-orphan"><strong><span style="font-size:9.0pt;font-family:            宋体;mso-bidi-font-family:宋体;mso-font-kerning:0pt">专业名称<span lang="EN-US">
            <o:p></o:p></span></span></strong></p></td>
      </tr>
      <tr style="mso-yfti-irow:1;height:16.5pt">
       <td rowspan="4" style="170.0pt;border:solid windowtext 1.0pt;            border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;            padding:0cm 5.4pt 0cm 5.4pt;height:16.5pt" width="227"><p class="MsoNormal" style="text-align:center;margin-top:6.0pt;margin-right:0cm;            margin-bottom:6.0pt;margin-left:0cm;mso-para-margin-top:.5gd;mso-para-margin-right:            0cm;mso-para-margin-bottom:.5gd;mso-para-margin-left:0cm;            mso-pagination:widow-orphan"><span style="font-size:9.0pt;font-family:宋体;            mso-bidi-font-family:宋体;mso-font-kerning:0pt">土木工程学院<span lang="EN-US">450
           <o:p></o:p></span></span></p></td>
       <td style="183.0pt;border-top:none;border-left:none;            border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;            mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;            mso-border-alt:solid windowtext .5pt;padding:0cm 5.4pt 0cm 5.4pt;height:16.5pt" width="244" nowrap=""><p class="MsoNormal" style="text-align:center;margin-top:6.0pt;margin-right:0cm;            margin-bottom:6.0pt;margin-left:0cm;mso-para-margin-top:.5gd;mso-para-margin-right:            0cm;mso-para-margin-bottom:.5gd;mso-para-margin-left:0cm;            mso-pagination:widow-orphan"><span style="font-size:9.0pt;font-family:宋体;            mso-bidi-font-family:宋体;mso-font-kerning:0pt">土木工程<span lang="EN-US">
           <o:p></o:p></span></span></p></td>
      </tr>
        ......
     </tbody>
    </table>

    Parse the tables with pandas; the code is as follows:

    import pandas as pd

    url = 'http://www.hnu.edu.cn/xyxk/xkzy/zylb.htm'

    # read_html returns a list of DataFrames, one per <table> on the page
    table = pd.read_html(url)
    pd.set_option('display.max_rows', None)  # show all rows
    with open("湖南大学学院与专业.txt", "wt", encoding='utf8') as out_file:  # save as a txt file
        for i in table:
            out_file.write(str(i) + '\n')
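
    To try the same call without fetching the page, here is a minimal sketch that feeds `pd.read_html` an inline table. The sample HTML and its names (College A, Major 1, ...) are made-up placeholders that only mimic the shape of the real table, where one college cell spans several rows. It assumes a reasonably recent pandas with a parser backend such as lxml installed; `header=0` is my choice here so that the first row becomes the column names.

```python
from io import StringIO

import pandas as pd

# Made-up sample with the same shape as the page's table: one college cell
# spans several major rows. The names are placeholders, not real data.
SAMPLE_HTML = """
<table>
  <tr><td>College</td><td>Major</td></tr>
  <tr><td rowspan="2">College A</td><td>Major 1</td></tr>
  <tr><td>Major 2</td></tr>
  <tr><td>College B</td><td>Major 3</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it finds.
tables = pd.read_html(StringIO(SAMPLE_HTML), header=0)
df = tables[0]

# The rowspan cell is repeated on every row it covers, so the frame is regular.
print(df)
```

    Each row of the resulting DataFrame then carries its own college name, which is what makes the irregular table easy to filter or group afterwards.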

    The (partial) output of the run is as follows:

    Very concise and efficient: one `pd.read_html` call parses every table on the page.
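
    For comparison, this is roughly the bookkeeping a parser has to do for a `rowspan` cell. It is only a standard-library sketch, not how pandas implements it: `RowspanTableParser` is a hypothetical helper, it ignores `colspan` and nested tables, and the sample table uses made-up placeholder names.

```python
from html.parser import HTMLParser


class RowspanTableParser(HTMLParser):
    """Hypothetical helper: flattens one simple <table> into a regular grid."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, each a list of cell strings
        self._row = None     # row currently being built
        self._cell = None    # text fragments of the cell being read
        self._span = 1       # rowspan of the cell being read
        self._pending = {}   # column index -> (text, rows still to fill)

    def _fill_pending(self):
        # Copy cells carried down from earlier rows into the current row.
        col = len(self._row)
        while col in self._pending:
            text, left = self._pending.pop(col)
            self._row.append(text)
            if left > 1:
                self._pending[col] = (text, left - 1)
            col = len(self._row)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._fill_pending()
            self._cell = []
            self._span = int(dict(attrs).get("rowspan") or 1)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())  # collapse whitespace
            col = len(self._row)
            self._row.append(text)
            if self._span > 1:
                # Remember the cell so the next rows repeat it in this column.
                self._pending[col] = (text, self._span - 1)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._fill_pending()  # carried-down cells at the end of the row
            self.rows.append(self._row)
            self._row = None


# Made-up placeholder table; the names are not taken from the real page.
SAMPLE_TABLE = """
<table>
  <tr><td>College</td><td>Major</td></tr>
  <tr><td rowspan="3">College A</td><td>Major 1</td></tr>
  <tr><td>Major 2</td></tr>
  <tr><td>Major 3</td></tr>
</table>
"""

parser = RowspanTableParser()
parser.feed(SAMPLE_TABLE)
for row in parser.rows:
    print(row)
```

    Around fifty lines of hand-written parsing for a simplified case, versus a single library call, is the point of the remark above.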

  • Original post: https://www.cnblogs.com/cttcarrotsgarden/p/10769097.html