  • XPath in practice: code for scraping jokes from haoduanzi.com

    haoduanzi.py

    #!/usr/local/bin/python3.7
    import time
    import urllib.request

    from lxml import etree


    def handler_request(url, page):
        # Build the request headers (a browser-like User-Agent)
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15'
        }
        # Build the request
        req = urllib.request.Request(url=url, headers=headers)
        # Send the request
        rep = urllib.request.urlopen(req)
        # Read the response body and decode it (UTF-8 by default)
        content = rep.read().decode()
        return content


    def location_element(tree):
        # Joke titles
        titles = tree.xpath("//div[@class='head']/h2/text()")
        # Like and dislike counts
        good = tree.xpath("//div[@class='ping x1']/a[1]/span/text()")
        bad = tree.xpath("//div[@class='ping x1']/a[2]/span/text()")
        # Append the results to the output file; the Reptile/ directory must already exist
        with open('Reptile/duanzi.html', 'a', encoding='utf-8') as stream:
            for i in range(len(titles)):
                str1 = ('<h3>%s</h3>' % titles[i] + ' '
                        + '<b>Likes:</b>' + '<span>%s</span>' % good[i] + ' '
                        + '<b>Dislikes:</b>' + '<span>%s</span>' % bad[i])
                stream.write(str1)


    if __name__ == "__main__":
        start_page = input('Enter the start page: ')
        end_page = input('Enter the end page: ')
        url = 'http://www.haoduanzi.com/category/?1-{}.html'
        for page in range(int(start_page), int(end_page) + 1):
            # Format into a new variable so the template keeps its {} placeholder;
            # reassigning url here would make every later iteration fetch the first page
            page_url = url.format(page)
            # Fetch the page
            content = handler_request(page_url, page)
            # Pause between requests
            time.sleep(1)
            # Parse the HTML into an element tree
            tree = etree.HTML(content)
            # Locate the content and write it out
            location_element(tree)
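    To see what the three XPath expressions select without touching the network, they can be run against a small inline document. The sketch below is only an illustration: the `SAMPLE_HTML` markup and the `extract_rows` helper are made up for this example and merely mimic the structure the expressions expect, so the real page layout may differ.

```python
from lxml import etree

# Hypothetical markup that mimics the structure the XPath expressions expect;
# it is not copied from the real site.
SAMPLE_HTML = """
<html><body>
  <div class="head"><h2>First joke</h2></div>
  <div class="ping x1">
    <a href="#"><span>12</span></a>
    <a href="#"><span>3</span></a>
  </div>
  <div class="head"><h2>Second joke</h2></div>
  <div class="ping x1">
    <a href="#"><span>7</span></a>
    <a href="#"><span>1</span></a>
  </div>
</body></html>
"""


def extract_rows(html_text):
    """Return (title, likes, dislikes) tuples using the script's XPath expressions."""
    tree = etree.HTML(html_text)
    titles = tree.xpath("//div[@class='head']/h2/text()")
    # a[1] and a[2] are the first and second <a> children of each matching div
    good = tree.xpath("//div[@class='ping x1']/a[1]/span/text()")
    bad = tree.xpath("//div[@class='ping x1']/a[2]/span/text()")
    # zip stops at the shortest list, so a missing count cannot raise IndexError
    return list(zip(titles, good, bad))


rows = extract_rows(SAMPLE_HTML)
for title, likes, dislikes in rows:
    print(title, likes, dislikes)
```

    Note that `@class='ping x1'` compares the whole attribute string, so a div whose class list is ordered differently or carries extra classes would not match.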

    Open the generated file in a browser to view the result (first page shown here; the page range is up to you).
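    The script appends bare fragments to duanzi.html, so the file has no document wrapper or declared character set, and a browser has to guess the encoding. A minimal sketch of one alternative is to build a complete page with a `<meta charset>` tag and escape the scraped text before writing it. The `render_page` helper, the sample row, and the temporary output path are assumptions made for this illustration, not part of the original script.

```python
import html
import os
import tempfile


def render_page(rows):
    """Build a complete HTML document from (title, likes, dislikes) tuples."""
    items = []
    for title, likes, dislikes in rows:
        # html.escape keeps scraped text from being interpreted as markup
        items.append(
            "<h3>{}</h3> <b>Likes:</b><span>{}</span> "
            "<b>Dislikes:</b><span>{}</span>".format(
                html.escape(title), html.escape(likes), html.escape(dislikes)
            )
        )
    return (
        "<!DOCTYPE html>\n<html>\n"
        "<head><meta charset=\"utf-8\"></head>\n"
        "<body>\n" + "\n".join(items) + "\n</body>\n</html>\n"
    )


# Made-up sample row standing in for scraped data
sample_rows = [("A <b>bold</b> title", "12", "3")]
page = render_page(sample_rows)

# Written to a temporary directory here; the script itself uses Reptile/duanzi.html
out_path = os.path.join(tempfile.mkdtemp(), "duanzi.html")
with open(out_path, "w", encoding="utf-8") as stream:
    stream.write(page)
```

    Writing with mode "w" also avoids stacking results from repeated runs, which the append mode in the script would do.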

  • Original article: https://www.cnblogs.com/lxmtx/p/12912184.html