zoukankan      html  css  js  c++  java
  • 爬虫大作业

    一、目的 :

              爬取博客园博问上160页每页25条帖子标题

    二、python爬取数据

            博问主页:https://q.cnblogs.com/list/unsolved?page=1 

            第二页:https://q.cnblogs.com/list/unsolved?page=2     以此类推……

            可得160页bkyUrl地址

    for i in range(1,161):
        bkyUrl = "https://q.cnblogs.com/list/unsolved?page={}".format(i)

         通过浏览器查看博问主页元素:

      观察可得在主体div类为.left_sidebar标签下有25个标签h2、h2标签内a标签文本即为各博问贴子标题

      因此可得getpagetitle函数获取每页25条博问贴子标题:

    def getpagetitle(bkyUrl):
        time.sleep(1)
        print(bkyUrl)
        res1 = requests.get(bkyUrl)
        res1.encoding = 'utf-8'
        soup1 = BeautifulSoup(res1.text, 'html.parser')
        item_list = soup1.select(".left_sidebar")[0]
        for i in item_list.select("h2"):
           title = i.select("a")[0].text

    将上述操作整合一起,获取160 * 25 条博文标题

    import requests
    import time
    from bs4 import BeautifulSoup
    def addtitle(title):
      f = open("test.txt","a",encoding='utf-8')
      f.write(title+" ")
      f.close()
    def getpagetitle(bkyUrl):
      time.sleep(1)
      print(bkyUrl)
      res1 = requests.get(bkyUrl)
      res1.encoding = 'utf-8'
      soup1 = BeautifulSoup(res1.text, 'html.parser')
      item_list = soup1.select(".left_sidebar")[0]
      for i in item_list.select("h2"):
        title = i.select("a")[0].text
        addtitle(title)
    for i in range(160,161):
      bkyUrl = "https://q.cnblogs.com/list/unsolved?page={}".format(i)
      getpagetitle(bkyUrl)

    保存标题test.txt文本:

  • 相关阅读:
    sharepoint custom web service
    网站模板创建报错
    sharepoint 2013 持续爬网
    使用 SQL的 for xml path来进行字符串拼接 (group by)
    sharepoint 修改AD密码
    SharePoint 2013 BCS
    系统补丁对sharepoint很重要
    SharePoint 2010: Export User Profile Properties to a Text File or Excel using PowerShell
    SharePoint2013 错误
    sharepoint 开发相关工具总结
  • 原文地址:https://www.cnblogs.com/a565972733/p/12817747.html
Copyright © 2011-2022 走看看