  • Crawler example: scraping a blog article list

    Blog example:

    Scrape the article list from a cnblogs (博客园) blog; assume the page URL is https://www.cnblogs.com/loaderman

    Requirements:

    1. Use requests to fetch the page, and extract the data with XPath / re (a quick way to check the XPath expressions interactively is sketched after this list)

    2. For each post, get the title, description, link URL, date, etc.

    3. Save the results to a JSON file
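
    Before writing the full script, it helps to check the XPath expressions interactively against a saved copy of the page. A minimal sketch (the local file name loaderman.html is just an example, not something from the original post):

    # Quick interactive check of the XPath selectors against a saved copy of the page.
    from lxml import etree

    with open("loaderman.html") as f:
        text = etree.HTML(f.read())

    # Each post sits in a div whose class attribute contains "post".
    posts = text.xpath('//div[contains(@class, "post")]')
    print(len(posts))                                          # number of posts found
    print(posts[0].xpath(".//a[@class='postTitle2']/text()"))  # title of the first post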

    Code

    # -*- coding:utf-8 -*-
    # Python 2 script: fetch the blog list page and extract each post with XPath.

    import urllib2
    import json
    from lxml import etree

    url = "https://www.cnblogs.com/loaderman/"
    headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

    request = urllib2.Request(url, headers=headers)
    html = urllib2.urlopen(request).read()

    # The response is an HTML string; parse it into an element tree.
    text = etree.HTML(html)

    # Select every post node; contains() does a fuzzy match on the class attribute
    # (first argument is the attribute to match, second is the partial value).
    node_list = text.xpath('//div[contains(@class, "post")]')
    print(node_list)

    for each in node_list:
        print(each)
        # Title, link, summary and date of each post.
        title = each.xpath(".//h2/a[@class='postTitle2']/text()")[0]
        detailUrl = each.xpath(".//a[@class='postTitle2']/@href")[0]
        content = each.xpath(".//div[@class='c_b_p_desc']/text()")[0]
        date = each.xpath(".//p[@class='postfoot']/text()")[0]

        items = {
            "title": title,
            "detailUrl": detailUrl,
            "content": content,
            "date": date,
        }

        # Append one JSON object per post (JSON Lines format).
        with open("loaderman.json", "a") as f:
            f.write(json.dumps(items, ensure_ascii=False).encode("utf-8") + "\n")

    Result:
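
    Each run appends one JSON object per post to loaderman.json, one object per line. With the field names used above, a record has roughly this shape (the values are placeholders, not real output):

    {"title": "...", "detailUrl": "https://www.cnblogs.com/loaderman/p/...", "content": "...", "date": "..."}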

  • Original article: https://www.cnblogs.com/loaderman/p/11759854.html