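  • Web Scraping Final Project

    1. Goal:

              Scrape the post titles from 160 pages of the cnblogs Q&A section (博问), 25 titles per page.

    2. Scraping the data with Python

            Q&A home page: https://q.cnblogs.com/list/unsolved?page=1

            Second page: https://q.cnblogs.com/list/unsolved?page=2, and so on.

            Following this pattern gives the bkyUrl address for each of the 160 pages: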

    for i in range(1,161):
        bkyUrl = "https://q.cnblogs.com/list/unsolved?page={}".format(i)
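    As a small sketch of the pattern above, the page addresses can also be collected into a list before any request is made. The helper name build_page_urls and its parameters are illustrative assumptions, not part of the original script.

```python
BASE_URL = "https://q.cnblogs.com/list/unsolved?page={}"


def build_page_urls(first_page=1, last_page=160):
    """Return the list URLs for pages first_page..last_page (inclusive)."""
    # range() excludes its stop value, so add 1 to include last_page
    return [BASE_URL.format(page) for page in range(first_page, last_page + 1)]


urls = build_page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

    Building the list first makes it easy to check the page range before starting a run that takes several minutes.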

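         Inspect the elements of the Q&A home page in the browser:

      Under the main div with class left_sidebar there are 25 h2 tags, and the text of the a tag inside each h2 is the title of one Q&A post.

      This gives the getpagetitle function, which fetches the 25 post titles on each page: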

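    def getpagetitle(bkyUrl):
        time.sleep(1)  # pause one second between requests
        print(bkyUrl)
        res1 = requests.get(bkyUrl)
        res1.encoding = 'utf-8'
        soup1 = BeautifulSoup(res1.text, 'html.parser')
        # the first .left_sidebar element holds the list of posts
        item_list = soup1.select(".left_sidebar")[0]
        for i in item_list.select("h2"):
            # the text of the first a tag inside each h2 is the title
            title = i.select("a")[0].text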
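    The selector logic can be tried without touching the network. The sketch below runs the same .left_sidebar / h2 / a lookup against a small hand-written HTML fragment. The fragment, the sample titles, and the helper name extract_titles are assumptions for illustration, and the real page markup may differ. It assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page structure described above (an assumption,
# not markup copied from q.cnblogs.com).
SAMPLE_HTML = """
<div class="left_sidebar">
  <h2><a href="/q/1">First question</a></h2>
  <h2><a href="/q/2">Second question</a></h2>
</div>
<div class="right_sidebar"><h2><a href="/x">Not a post</a></h2></div>
"""


def extract_titles(html):
    """Return the text of the first <a> inside each <h2> under .left_sidebar."""
    soup = BeautifulSoup(html, "html.parser")
    containers = soup.select(".left_sidebar")
    if not containers:
        # No post list on this page: return nothing instead of raising IndexError
        return []
    titles = []
    for h2 in containers[0].select("h2"):
        links = h2.select("a")
        if links:
            titles.append(links[0].text.strip())
    return titles


print(extract_titles(SAMPLE_HTML))
print(extract_titles("<p>empty page</p>"))
```

    Returning an empty list when the container is missing is one possible design choice: a page that fails to load as expected then skips quietly instead of stopping the whole run.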

    Putting the steps above together fetches the 160 * 25 Q&A post titles:

    import requests
    import time
    from bs4 import BeautifulSoup

    def addtitle(title):
      # append the title to test.txt, followed by a space
      with open("test.txt", "a", encoding='utf-8') as f:
        f.write(title + " ")

    def getpagetitle(bkyUrl):
      time.sleep(1)  # pause one second between requests
      print(bkyUrl)
      res1 = requests.get(bkyUrl)
      res1.encoding = 'utf-8'
      soup1 = BeautifulSoup(res1.text, 'html.parser')
      item_list = soup1.select(".left_sidebar")[0]
      for i in item_list.select("h2"):
        title = i.select("a")[0].text
        addtitle(title)

    # loop over pages 1 through 160
    for i in range(1, 161):
      bkyUrl = "https://q.cnblogs.com/list/unsolved?page={}".format(i)
      getpagetitle(bkyUrl)

    The titles are saved to the text file test.txt.
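    Since addtitle separates titles with a single space, titles that themselves contain spaces cannot be told apart when the file is read back. One possible alternative, sketched below, writes one title per line so the result can be counted and checked. The function names save_titles and load_titles and the file name titles.txt are hypothetical.

```python
import os
import tempfile


def save_titles(titles, path):
    """Append each title to path on its own line (UTF-8)."""
    with open(path, "a", encoding="utf-8") as f:
        for title in titles:
            # collapse internal whitespace/newlines so one title stays on one line
            f.write(" ".join(title.split()) + "\n")


def load_titles(path):
    """Read the titles back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


# Usage with a temporary directory so no existing file is touched
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "titles.txt")
    save_titles(["How to parse HTML?", "requests  timeout\nquestion"], path)
    loaded = load_titles(path)

print(len(loaded))
print(loaded)
```

    With 160 pages of 25 titles each, a complete run stored this way would be expected to produce 4000 lines, which gives a quick check that no page was skipped.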

  • Original article: https://www.cnblogs.com/a565972733/p/12817747.html