  • Web Crawler Comprehensive Assignment

    The assignment comes from: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/3159

    Crawling the player-count ranking of online games

    Online game rankings:


    Code:

    import requests
    from lxml import etree

    session = requests.Session()
    # Douban Top 250 is paginated 25 entries per page via the start parameter
    for id in range(0, 251, 25):
        URL = 'http://movie.douban.com/top250/?start=' + str(id)
        req = session.get(URL)
        req.encoding = 'utf8'
        root = etree.HTML(req.content)
        items = root.xpath('//ol/li/div[@class="item"]')
        for item in items:
            rank, name, alias, rating_num, quote, url = "", "", "", "", "", ""
            try:
                url = item.xpath('./div[@class="pic"]/a/@href')[0]
                rank = item.xpath('./div[@class="pic"]/em/text()')[0]
                title = item.xpath('./div[@class="info"]//a/span[@class="title"]/text()')
                # round-tripping through gb2312 with 'ignore' drops characters the
                # Windows console cannot display
                name = title[0].encode('gb2312', 'ignore').decode('gb2312')
                alias = title[1].encode('gb2312', 'ignore').decode('gb2312') if len(title) == 2 else ""
                rating_num = item.xpath('.//div[@class="bd"]//span[@class="rating_num"]/text()')[0]
                quote_tag = item.xpath('.//div[@class="bd"]//span[@class="inq"]')
                if len(quote_tag) != 0:
                    quote = quote_tag[0].text.encode('gb2312', 'ignore').decode('gb2312').replace('\xa0', '')
                print(rank, rating_num, quote)
                print(name, alias.replace('/', ','))
            except Exception:
                print('failed!')
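
    The script above only prints each record. A small sketch, not part of the original assignment, of collecting the parsed fields into a list and writing them to a CSV file so a later step (database import or spreadsheet) can reuse them; the rows name and file name are assumptions:

    import csv

    # rows would be filled inside the loop above, e.g.
    # rows.append((rank, name, alias, rating_num, quote, url))
    rows = []

    def save_rows(rows, path='douban_top250.csv'):
        # write the collected records to a CSV file readable by Excel
        with open(path, 'w', newline='', encoding='utf-8-sig') as f:
            writer = csv.writer(f)
            writer.writerow(['rank', 'name', 'alias', 'rating_num', 'quote', 'url'])
            writer.writerows(rows)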

     

    Fetching each online game

    import requests
    from bs4 import BeautifulSoup

    def get_html(web_url):
        # request the ranking page with a browser User-Agent and return the <li> entries
        header = {
            "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16"}
        html = requests.get(url=web_url, headers=header).text
        soup = BeautifulSoup(html, "lxml")
        data = soup.find("ol").find_all("li")
        return data
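
    A short usage sketch, not part of the original post: the ranking URL is a placeholder, and how the name and player count are pulled out of each <li> depends on the real page markup:

    if __name__ == '__main__':
        lis = get_html("http://example.com/game-ranking")   # placeholder URL
        for li in lis:
            # get_text() flattens one <li>; real field extraction depends on the page
            print(li.get_text(" ", strip=True))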

    Importing the data into the database

    sql = "INSERT INTO test(rank, name, tpye,developers, company, " 
              "state, yeas, number) values(%s,%s,%s,%s,%s,%s,%s,%s)"
        try:
            cur.executemany(sql, movies_info)
            db.commit()
        except Exception as e:
            print("Error:", e)

     

    Core code:

    import scrapy
    from scrapy import Spider
    from doubanTop250.items import Doubantop250Item


    class DoubanSpider(scrapy.Spider):
        name = 'douban'
        allowed_domains = ['douban.com']
        start_urls = ['https://movie.douban.com/top250/']

        def parse(self, response):
            # each .info block holds one entry of the Top 250 list
            lis = response.css('.info')
            for li in lis:
                item = Doubantop250Item()

                name = li.css('.hd span::text').extract()
                title = ''.join(name)
                info = li.css('p::text').extract()[1].replace('\n', '').strip()
                score = li.css('.rating_num::text').extract_first()
                people = li.css('.star span::text').extract()[1]
                words = li.css('.inq::text').extract_first()

                item['title'] = title
                item['info'] = info
                item['score'] = score
                item['people'] = people
                item['words'] = words
                yield item

            # follow the "next page" link until the last page is reached
            next = response.css('.next a::attr(href)').extract_first()
            if next:
                url = response.urljoin(next)
                yield scrapy.Request(url=url, callback=self.parse)

    The generated items.py file is the container that stores the scraped data; it is modified as follows.

    import scrapy


    class Doubantop250Item(scrapy.Item):
        title = scrapy.Field()
        info = scrapy.Field()
        score = scrapy.Field()
        people = scrapy.Field()
        words = scrapy.Field()
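
    The post does not show how the yielded items are saved. A minimal pipeline sketch, not the author's actual setup: it assumes pymysql, a local MySQL database named douban and a top250 table whose columns match the item fields. It would go in pipelines.py, be enabled via ITEM_PIPELINES in settings.py, and the spider run with `scrapy crawl douban`.

    import pymysql


    class Doubantop250Pipeline(object):
        def open_spider(self, spider):
            # placeholder connection parameters -- adjust to the real database
            self.db = pymysql.connect(host='localhost', user='root', password='root',
                                      db='douban', charset='utf8mb4')
            self.cur = self.db.cursor()

        def process_item(self, item, spider):
            sql = ("INSERT INTO top250(title, info, score, people, words) "
                   "VALUES (%s, %s, %s, %s, %s)")
            try:
                self.cur.execute(sql, (item['title'], item['info'], item['score'],
                                       item['people'], item['words']))
                self.db.commit()
            except Exception as e:
                print("Error:", e)
                self.db.rollback()
            return item

        def close_spider(self, spider):
            self.cur.close()
            self.db.close()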

     Excel spreadsheet of the results:
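
    Only the resulting Excel spreadsheet is shown in the original post. A minimal sketch of producing such a file with openpyxl; the rows list (one tuple per game) and the file name are assumptions:

    from openpyxl import Workbook

    rows = []   # filled by the crawler: (rank, name, type, developers, company, state, years, number)

    wb = Workbook()
    ws = wb.active
    ws.append(['rank', 'name', 'type', 'developers', 'company', 'state', 'years', 'number'])
    for row in rows:
        ws.append(list(row))
    wb.save('game_ranking.xlsx')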

     

     
