zoukankan      html  css  js  c++  java
  • 爬虫大作业

    本次作业来源:https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/3159

    可以用pandas读出之前保存的数据:

    newsdf = pd.read_csv(r'F:duymgzccnews.csv')

    一.把爬取的内容保存到数据库sqlite3

    import sqlite3
    with sqlite3.connect('gzccnewsdb.sqlite') as db:
    newsdf.to_sql('gzccnews',con = db)

    with sqlite3.connect('gzccnewsdb.sqlite') as db:
    df2 = pd.read_sql_query('SELECT * FROM gzccnews',con=db)

    保存到MySQL数据库

    • import pandas as pd
    • import pymysql
    • from sqlalchemy import create_engine
    • conInfo = "mysql+pymysql://user:passwd@host:port/gzccnews?charset=utf8"
    • engine = create_engine(conInfo,encoding='utf-8')
    • df = pd.DataFrame(allnews)
    • df.to_sql(name = ‘news', con = engine, if_exists = 'append', index = False)

    二.爬虫综合大作业

    1.主题        看了一部电影《绿皮书》,觉得挺不错,不知道网上的评价怎样,借此分析一下

                                                                                                                                                            图1-网页截图

    2.爬取的对象   爬取排在前面的300条评论

                                                                                                                                                  图2-爬取的内容

    3.爬取内容的限制与约束

              这次爬取的是公网上的内容。不会涉及到很多的隐私性,所以应该没什么约束。不敢爬太多,怕被发现。内容里很多图片,但都没爬下来。主要是想拿数据来对手机做一个分析。

    4。核心代码

      爬取评论的代码

    
    
    import pandas
    import requests
    from bs4 import BeautifulSoup
    import time
    import random
    import re


    def getHtml(url):
    cookies = {
    'PHPSESSID': 'Cookie: bid=7iHsqC-UoSo; ap_v=0,6.0; __utma=30149280.1922890802.1557152542.1557152542.1557152542.1;<br> __utmc=30149280; __utmz=30149280.1557152542.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utma=223695111.1923787146.1557152542.1557152542.1557152542.1;<br> __utmb=223695111.0.10.1557152542; __utmc=223695111; __utmz=223695111.1557152542.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none);<br> _pk_ses.100001.4cf6=*; push_noty_num=0; push_doumail_num=0; __utmt=1; __utmv=30149280.19600; ct=y; ll="118281";<br> __utmb=30149280.12.9.1557154640902; __yadk_uid=6FEHGUf1WakFoINiOARNsLcmmbwf3fRJ; <br>_vwo_uuid_v2=DE694EB251BD96736CA7C8B8D85C2E9A7|9505affee4012ecfc57719004e3e5789;<br> _pk_id.100001.4cf6=1f5148bca7bc0b13.1557152543.1.1557155093.1557152543.; dbcl2="196009385:lRmza0u0iAA"'}
    headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
    req = requests.get(url, headers=headers, cookies=cookies, verify=False);
    req.encoding = 'utf8'
    soup = BeautifulSoup(req.text, "html.parser")
    return soup;


    def alist(url):
    comment = []
    for ping in soup.select('.comment-item'):
    pinglundict = {}
    user = ping.select('.comment-info')[0]('a')[0].text
    userUrl = ping.select('.comment-info')[0]('a')[0]['href']
    look = ping.select('.comment-info')[0]('span')[0].text
    score = ping.select('.comment-info')[0]('span')[1]['title']
    time = ping.select('.comment-time')[0]['title']
    pNum = ping.select('.votes')[0].text
    pingjia = ping.select('.short')[0].text
    pinglundict['user'] = user
    pinglundict['userUrl'] = userUrl
    pinglundict['look'] = look
    pinglundict['score'] = score
    pinglundict['time'] = time
    pinglundict['pNum'] = pNum
    pinglundict['pingjia'] = pingjia
    comment.append(pinglundict)
    return comment


    url = 'https://movie.douban.com/subject/27060077/comments?start={}&limit=20&sort=new_score&status=P'

    comment = []

    for i in range(15):
    soup = getHtml(url.format(i * 20))
    comment.extend(alist(soup))
    time.sleep(random.random() * 5)
    print(len(comment))
    print('--------------------------总共爬取 ', len(comment), ' 条-------------------------')
    print(comment)

    pingtheking = pandas.DataFrame(comment)
    pingtheking.to_csv('jia.csv', encoding='utf_8_sig')

    统计词频的代码

    # coding=utf-8
    
    
    # 导入jieba模块,用于中文分词
    import jieba
    
    # 获取所有评论
    import pandas as pd
    # 读取小说
    f = open(r'jia.csv', 'r', encoding='utf8')
    text = f.read()
    f.close()
    print(text)
    ch="《》
    :,,。、-!?0123456789"
    for c in ch:
        text = text.replace(c,'')
    print(text)
    newtext = jieba.lcut(text)
    te = {}
    for w in newtext:
        if len(w) == 1:
            continue
        else:
            te[w] = te.get(w, 0) + 1
    tesort = list(te.items())
    tesort.sort(key=lambda x: x[1], reverse=True)
    
    # 输出次数前TOP20的词语
    for i in range(0, 20):
        print(tesort[i])
    pd.DataFrame(tesort).to_csv('tongji.csv', encoding='utf-8')

    生成词云的代码

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud
    import jieba
    
    text_from_file_with_apath = open('jia.csv',encoding='utf-8').read()
    
    wordlist_after_jieba = jieba.cut(text_from_file_with_apath, cut_all=True)
    wl_space_split = " ".join(wordlist_after_jieba)
    
    my_wordcloud = WordCloud(background_color="white",width=1000,height=860, font_path="C:\Windows\Fonts\STFANGSO.ttf").generate(wl_space_split)
    
    plt.imshow(my_wordcloud)
    plt.axis("off")
    plt.show()

    4.数据统计与分析

            

              我们从中发现很多观众都提到了种族歧视问题,有奥斯卡才来看的,有觉得演得很棒很有细节,有人表示司机的身份很巧妙,白人却是高高的钢琴师的矛盾。总的来说我觉得这片很成功,观众感受到影片要表达的。普遍好评。

  • 相关阅读:
    标志寄存器和跳转指令
    js中top、clientTop、scrollTop、offsetTop的区别 文字详细说明版【转】
    关于mysql的级联删除(之前好多人咨询过我)
    用DIV画个漂亮的Table,根本看不出是div画的
    最简单的Ajax局部提交整体form,无刷新页面
    themeleaf中使用javascript时,字符“&&”的转义问题。
    Mysql 进行sequence的新建,同时建立计划每日重置。
    动态给H5页面绑定数据,基本万能无错误!
    手风琴效果简单实现,修改bootstrap内部事件接口并且自由定义。
    JQuery实现追加表格,不使用拼接html方式
  • 原文地址:https://www.cnblogs.com/hekairui/p/10775703.html
Copyright © 2011-2022 走看看