  • Scraping the cnblogs (博客园) home page article list with Python (with pagination)

    1. Tools:
    Python 3.5
    BeautifulSoup
    2. Target site:
    The article list on the cnblogs (博客园) home page, http://www.cnblogs.com
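    3. Analyze the structure of the article list:
    Judging from the selectors used in the code below, each article is a div with class post_item; the title link sits in its h3, and the author link and post time sit in a div with class post_item_foot.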
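
    The structure can be tried out offline before touching the live site. The following is a minimal sketch: SAMPLE_HTML is a hypothetical fragment reconstructed from the selectors the script uses (div.post_item, h3 > a, div.post_item_foot), not a copy of the real page, and extract_items is an illustrative helper name that does not appear in the original post. The live markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, reconstructed from the selectors used in the script.
# It is an assumption about the page layout, not the real cnblogs markup.
SAMPLE_HTML = """
<div class="post_item">
  <div class="post_item_body">
    <h3><a href="http://www.cnblogs.com/example/p/1.html">Example title</a></h3>
    <div class="post_item_foot">
      <a href="http://www.cnblogs.com/example/">example_author</a>
      posted 2016-10-17 10:00
      <span class="article_comment">comments(0)</span>
    </div>
  </div>
</div>
"""


def extract_items(html):
    """Return one dict per <div class="post_item"> found in html."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for block in soup.find_all("div", {"class": "post_item"}):
        link = block.find("h3").find("a")
        foot = block.find("div", "post_item_foot")
        items.append({
            "title": link.string,
            "url": link.get("href"),
            "author": foot.find("a").string,
            # In this fragment contents[0] is whitespace, contents[1] is the
            # author link, and contents[2] is the text node with the post time.
            "time": foot.contents[2].strip(),
        })
    return items


if __name__ == "__main__":
    for item in extract_items(SAMPLE_HTML):
        print(item)
```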
    4. Implementation:

    __author__ = 'Administrator'
    import urllib.request
    import re
    from bs4 import BeautifulSoup
    import time
    
    
    ########################################################
    #
    #   Scrape the recommended article list on the cnblogs home page: http://www.cnblogs.com
    #
    #   鹿伟伟
    #
    ########################################################
    class CnblogsUtils(object):
        def __init__(self):
            user_agent='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
            self.headers ={'Cache-Control':'max-age=0',
                            'Connection':'keep-alive',
                            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                            'User-Agent':user_agent,
                            }
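        def getPage(self,url=None):
            # Send the request with browser-like headers; the timeout keeps a stalled connection from hanging the script
            request=urllib.request.Request(url,headers=self.headers)
            response=urllib.request.urlopen(request,timeout=10)
            # Parse the raw response body with Python's built-in HTML parser
            soup=BeautifulSoup(response.read(),"html.parser")
            return soup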
        def parsePage(self,url=None,pageNo=None):
            # pageNo is a string that is appended to the base url
            soup=self.getPage(url+pageNo)
            # Each article on the list page is a <div class="post_item">
            itemBlog=soup.find_all("div",{"class":"post_item"})
            print("+++++++++++++++++++++++++++ Page",pageNo,"++++++++++++++++++++++++++++")
            for i,blogInfo in enumerate(itemBlog):
                link=blogInfo.find("h3").find("a")
                foot=blogInfo.find("div","post_item_foot")
                blogUrl=link.get("href")
                title=link.string
                # contents[2] is expected to be the text node holding the post time; strip() removes surrounding whitespace
                postTime=foot.contents[2].strip()
                author=foot.find("a").string
                print(i+1,"Title:",title,"Author:",author,"Link:",blogUrl,postTime)
    
    #######     Run    ########
    if __name__ =="__main__":
        # Base URL of the paginated list; the page number is appended to it
        url = "http://www.cnblogs.com/sitehome/p/"
        cnblog=CnblogsUtils()
        # Fetch pages 1 to 10, pausing 3 seconds between requests
        for i in range(0,10):
            cnblog.parsePage(url,str(i+1))
            time.sleep(3)
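
    The pagination loop above stops at the first page that fails to load or parse. As a hedged sketch of one way to make it more tolerant, the helpers below build the page URLs from the same base URL and skip a failing page instead of aborting. page_urls, crawl, and the handle_page and sleep parameters are hypothetical names introduced here for illustration; they are not part of the original script. With the class above, a call might look like crawl(BASE_URL, lambda n, u: CnblogsUtils().parsePage(BASE_URL, str(n))).

```python
import time

# Base URL taken from the script above; the page number is appended to it.
BASE_URL = "http://www.cnblogs.com/sitehome/p/"


def page_urls(base_url, first=1, last=10):
    """Yield (page_number, url) pairs for pages first..last inclusive."""
    for page_no in range(first, last + 1):
        yield page_no, base_url + str(page_no)


def crawl(base_url, handle_page, first=1, last=10, delay=3.0, sleep=time.sleep):
    """Call handle_page(page_no, url) for each page, pausing between pages.

    handle_page and sleep are injected (hypothetical parameters) so the loop
    can be exercised without network access. A page that raises one of the
    expected errors is recorded and skipped rather than ending the run.
    """
    failed = []
    for page_no, url in page_urls(base_url, first, last):
        try:
            handle_page(page_no, url)
        except (OSError, AttributeError, IndexError) as exc:
            # OSError covers urllib.error.URLError; AttributeError and
            # IndexError cover a page whose markup is not what was expected.
            failed.append((page_no, repr(exc)))
        if page_no != last:
            sleep(delay)  # be polite: wait before requesting the next page
    return failed


if __name__ == "__main__":
    # Dry run that only prints the URLs, without any delay.
    crawl(BASE_URL, lambda n, u: print(n, u), last=3, delay=0)
```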
    
    

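    5. Result:
    For each page, the script prints a separator line with the page number, followed by one line per article with its index, title, author, URL and post time.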

  • Original post: https://www.cnblogs.com/luweiwei/p/5968458.html