  • Scraping university ranking data

    URL: http://www.qianmu.org/ranking/1528.htm

    import requests
    from lxml import etree

    # Fetch the ranking page and collect the link to each university's detail page.
    resp = requests.get('http://www.qianmu.org/2018QS%E4%B8%96%E7%95%8C%E5%A4%A7%E5%AD%A6%E6%8E%92%E5%90%8D')
    selector = etree.HTML(resp.text)
    links = selector.xpath('//div[@id="content"]//td[2]/a/@href')
    for link in links:
        r = requests.get(link)
        page = etree.HTML(r.text)
        data = {}
        # xpath() returns a list of text nodes; join them into one name string.
        data['name'] = ''.join(page.xpath('//div[@id="wikiContent"]/h1/text()')).strip()
        # Column 1 of the infobox table holds the field names, column 2 the values.
        keys = page.xpath('//div[@id="wikiContent"]/div[@class="infobox"]//table//td[1]/p/text()')
        cols = page.xpath('//div[@id="wikiContent"]/div[@class="infobox"]//table//td[2]')
        # A value cell may contain nested tags, so gather all of its text nodes.
        values = [''.join(col.xpath('.//text()')) for col in cols]
        # zip() stops at the shorter list, so a count mismatch cannot raise IndexError.
        data.update(zip(keys, values))
        print(data)
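
    The scraped text usually still carries stray whitespace, and the key and value lists can differ in length. As a minimal sketch, the pairing step could be pulled into a small helper that works on plain lists of strings like those xpath() returns. The name build_record and the sample values are hypothetical and not part of the original post.

```python
def build_record(name_nodes, keys, values):
    """Combine raw XPath text results into one cleaned dict (hypothetical helper)."""
    record = {'name': ''.join(name_nodes).strip()}
    # zip() pairs keys with values positionally and stops at the shorter list.
    for key, value in zip(keys, values):
        key = key.strip()
        if not key:
            continue  # skip blank field names
        # Collapse runs of whitespace left over from nested tags and line breaks.
        record[key] = ' '.join(value.split())
    return record


# Made-up inputs shaped like the lists that xpath() returns.
sample = build_record(
    ['  Example University '],
    ['Country', ' ', 'Founded'],
    ['Exampleland', 'ignored', '\n 1900 \n', 'extra'],
)
print(sample)
```

    Inside the loop, this would stand in for the manual indexing: data = build_record(name_nodes, keys, values). Pairing is still positional, so if the infobox has rows without a label the keys and values may be misaligned; this sketch does not address that.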
    

      

  • Original article: https://www.cnblogs.com/NCLONG/p/12292907.html