  • Web Crawler Comprehensive Assignment

    Assignment source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/3002
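    import re

    import requests
    from bs4 import BeautifulSoup

    url = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')


    def getclick(newurl):
        # Normally a regular expression is first turned into a pattern object
        # with re.compile; for a single lookup re.search is enough.
        # Split the captured text on '/' and take the second element
        # (split's default maxsplit is -1, i.e. split at every '/').
        news_id = re.search(r'_(.*)\.html', newurl).group(1).split('/')[1]
        # str.format fills the argument into {} (instead of %d-style formatting).
        clickurl = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(news_id)
        # Request the counter API, split the returned text on '.', take the last
        # piece, then strip the leading "html('" and the trailing "');".
        click = int(requests.get(clickurl).text.split('.')[-1].lstrip("html('").rstrip("');"))
        return click


    def getonpages(listurl):
        res = requests.get(listurl)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'html.parser')

        for news in soup.select('li'):
            # Only <li> elements that contain a title are news entries.
            if len(news.select('.news-list-title')) > 0:
                title = news.select('.news-list-title')[0].text
                time = news.select('.news-list-info')[0].contents[0].text
                url1 = news.select('a')[0]['href']
                bumen = news.select('.news-list-info')[0].contents[1].text  # department
                description = news.select('.news-list-description')[0].text

                # Fetch the detail page for the full text and the click count.
                read = requests.get(url1)
                read.encoding = 'utf-8'
                soupd = BeautifulSoup(read.text, 'html.parser')
                detail = soupd.select('.show-content')[0].text
                click = getclick(url1)
                print(title, click)


    # Select the first element with class a1 and strip the trailing "条"
    # ("items") with rstrip to get the total number of news items.
    count = int(soup.select('.a1')[0].text.rstrip('条'))
    pages = count // 10 + 1  # total list pages, assuming 10 items per page

    getonpages(url)  # the first list page is the index URL itself
    # Only pages 2 and 3 are crawled here; use range(2, pages + 1) for all pages.
    for i in range(2, 4):
        pagesurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
        getonpages(pagesurl)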
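
    The string handling in getclick and in the page count can be checked without any network access. The sketch below pulls those steps into small helper functions; the helper names, the sample URL and the sample API response are assumptions made up for illustration, not values taken from the real site.

```python
import re


def extract_news_id(news_url):
    """Return the part after '/' inside the text captured between '_' and '.html'."""
    match = re.search(r'_(.*)\.html', news_url)
    return match.group(1).split('/')[1]


def parse_click_count(response_text):
    """Parse the number out of the last '.'-separated piece of the response.

    lstrip/rstrip remove any of the listed characters, not a fixed prefix or
    suffix; this works here because digits are not among those characters.
    """
    last_piece = response_text.split('.')[-1]
    return int(last_piece.lstrip("html('").rstrip("');"))


def list_page_count(count_text, per_page=10):
    """Turn text such as '253条' into a page count with the same formula as above."""
    total = int(count_text.rstrip('条'))
    return total // per_page + 1


# Hypothetical inputs, shaped like what the crawler is assumed to see.
sample_url = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0404/11155.html'
sample_response = "$('#todaydowns').html('5');$('#hits').html('123');"

print(extract_news_id(sample_url))
print(parse_click_count(sample_response))
print(list_page_count('253条'))
```

    Note that count // 10 + 1 yields one page too many when the total is an exact multiple of 10 (250 items would give 26), so a ceiling division may be a safer choice if every page is crawled.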

    Reference sites: https://www.jb51.net/article/141830.htm
    https://blog.csdn.net/k_koris/article/details/82950654
    https://www.cnblogs.com/tina-python/p/5508402.html
    https://www.cnblogs.com/keye/p/7868059.html
    https://www.cnblogs.com/benric/p/4965224.html
    http://www.runoob.com/python/att-string-split.html

  • Original article: https://www.cnblogs.com/ChiuMingKit/p/10840766.html