  • Web Crawler Comprehensive Assignment

    Assignment source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/3002
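    import re
    import requests
    from bs4 import BeautifulSoup
    # Fetch the first page of the campus news list; this soup is reused later to read the total news count.
    url='http://news.gzcc.cn/html/xiaoyuanxinwen/'
    res=requests.get(url)
    res.encoding='utf-8'
    soup=BeautifulSoup(res.text,'html.parser')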
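    def getclick(newurl):
        # re.search compiles the string pattern on each call; a regex that is reused is usually precompiled with re.compile.
        # group(1) is the text between '_' and '.html'; split it on '/' (maxsplit defaults to -1, i.e. split everywhere) and take the second element, the news id.
        news_id=re.search(r'_(.*)\.html',newurl).group(1).split('/')[1]
        # str.format fills {} with the argument, in place of %d-style formatting.
        clickurl='http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(news_id)
        # Request the counter API, keep the text after the last '.', then strip the leading characters in "html('" and the trailing characters in "');" so only the number remains.
        click=int(requests.get(clickurl).text.split(".")[-1].lstrip("html('").rstrip("');"))
        return click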
    def getonpages(listurl):
        # Download one list page and print the title and click count of each news item on it.
        res=requests.get(listurl)
        res.encoding='utf-8'
        soup=BeautifulSoup(res.text,'html.parser')

        for news in soup.select('li'):
            # Only <li> elements that contain a .news-list-title are news entries.
            if len(news.select('.news-list-title'))>0:
                title=news.select('.news-list-title')[0].text
                time=news.select('.news-list-info')[0].contents[0].text  # publication time
                url1=news.select('a')[0]['href']  # link to the article page
                bumen=news.select('.news-list-info')[0].contents[1].text  # publishing department
                description=news.select('.news-list-description')[0].text

                # Fetch the article page and extract its body text.
                read=requests.get(url1)
                read.encoding='utf-8'
                soupd=BeautifulSoup(read.text,'html.parser')
                detail=soupd.select('.show-content')[0].text
                click=getclick(url1)
                print(title,click)

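    # Select the first element with class a1 and strip the trailing "条" (items) with rstrip, leaving the total news count.
    count=int(soup.select('.a1')[0].text.rstrip("条"))
    pages=count//10+1  # assumes 10 news items per list page
    # Only list pages 2 and 3 are crawled here; range(2,pages+1) would cover every page.
    for i in range(2,4):
        pagesurl="http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html".format(i)
        getonpages(pagesurl)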
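
The ID extraction and count parsing done in getclick can be tried offline, without hitting the server. The following is a minimal sketch, assuming article URLs end in /<digits>.html and the counter API body ends with a call such as .html('1234'); the sample URL, the sample body, and the helper names extract_news_id and parse_click_count are hypothetical and not part of the original script.

```python
import re

# Compiled once and reused. Both patterns are assumptions about the site's formats.
NEWS_ID_PATTERN = re.compile(r"/(\d+)\.html$")
HITS_PATTERN = re.compile(r"\.html\('(\d+)'\);\s*$")


def extract_news_id(news_url):
    """Return the numeric id at the end of an article URL."""
    match = NEWS_ID_PATTERN.search(news_url)
    if match is None:
        raise ValueError("no news id found in {!r}".format(news_url))
    return match.group(1)


def parse_click_count(body):
    """Return the number inside the final .html('N'); call of the response body."""
    match = HITS_PATTERN.search(body)
    if match is None:
        raise ValueError("no click count found in {!r}".format(body))
    return int(match.group(1))


# Hypothetical sample inputs, for illustration only.
SAMPLE_URL = "http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0404/11155.html"
SAMPLE_BODY = "$('#todaydowns').html('5');$('#hits').html('1234');"

if __name__ == "__main__":
    print(extract_news_id(SAMPLE_URL), parse_click_count(SAMPLE_BODY))
```

Matching the digits explicitly avoids a pitfall of lstrip and rstrip: they remove any characters from the given set rather than a fixed prefix or suffix, so they only work when the number itself contains none of those characters.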

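The expression pages=count//10+1 yields one page too many whenever the total is an exact multiple of 10. Below is a sketch of the same calculation with ceiling division, assuming 10 items per list page (taken from the count//10 in the script) and that list page N (N > 1) lives at {N}.html as in the loop; the helper names page_count and list_page_urls are hypothetical.

```python
BASE_URL = "http://news.gzcc.cn/html/xiaoyuanxinwen/"
PAGE_SIZE = 10  # assumption: items per list page, inferred from count//10


def page_count(total_items, page_size=PAGE_SIZE):
    """Number of list pages needed for total_items, using ceiling division."""
    return max(1, -(-total_items // page_size))


def list_page_urls(total_items):
    """URLs of every list page: the index page first, then 2.html, 3.html, ..."""
    urls = [BASE_URL]
    for page in range(2, page_count(total_items) + 1):
        urls.append("{}{}.html".format(BASE_URL, page))
    return urls


if __name__ == "__main__":
    for page_url in list_page_urls(25):
        print(page_url)
```

Each URL returned could then be passed to getonpages, which would also cover the first page of the list.
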
    References: https://www.jb51.net/article/141830.htm
    https://blog.csdn.net/k_koris/article/details/82950654
    https://www.cnblogs.com/tina-python/p/5508402.html
    https://www.cnblogs.com/keye/p/7868059.html
    https://www.cnblogs.com/benric/p/4965224.html
    http://www.runoob.com/python/att-string-split.html

  • Original article: https://www.cnblogs.com/ChiuMingKit/p/10840766.html