
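Hadoop Comprehensive Assignment

1. Use Hive to run a word-frequency count on the text file produced by the crawler assignment (or on the long English novel downloaded for the English word-frequency exercise).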

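# Read the raw text produced by the crawler (assumed to be UTF-8)
with open('note.txt', 'r', encoding='utf-8') as f:
    song = f.read()


def writeFilenote(content):
    # Write the cleaned text; 'w' avoids appending duplicates on a re-run
    with open('newnote.txt', 'w', encoding='utf-8') as f:
        f.write(content)


symbol = ''',.？！’;?!:"“”-%$'''

# Stop words; defined here but not applied by this script
exclude = '''
a an the in on to at and of is was are were i he she you your they us their our it or for be too do no 
that s so as but it's
'''

# Replace every punctuation character with a space
for i in symbol:
    song = song.replace(i, ' ')
writeFilenote(song)
print(song)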

First, use Python to strip the unwanted characters from the text, then save the result as newnote.txt.
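The script defines a stop-word list but only strips punctuation. As a minimal sketch of how the stop words could also be filtered out, assuming lowercase matching is acceptable (the function name `clean_text` and the constant names are hypothetical, not from the original script):

```python
# Punctuation and stop words copied from the script above.
SYMBOLS = ''',.？！’;?!:"“”-%$'''
STOP_WORDS = set("""
a an the in on to at and of is was are were i he she you your they us their
our it or for be too do no that s so as but it's
""".split())


def clean_text(text, symbols=SYMBOLS, stop_words=STOP_WORDS):
    """Replace punctuation with spaces, lowercase, and drop stop words."""
    for ch in symbols:
        text = text.replace(ch, " ")
    kept = [word for word in text.lower().split() if word not in stop_words]
    return " ".join(kept)


if __name__ == "__main__":
    print(clean_text("The cat, and the dog!"))
```

Lowercasing before the comparison means "The" and "the" are treated as the same stop word; skip that step if case matters for the later count.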

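Then came a flurry of Hive operations, and the result is shown in the figure below. (I won't paste the steps, since they are much the same as last time.)

2. Use Hive to analyze the csv file produced by the crawler assignment, and write a blog post describing your analysis process and results.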
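The Hive steps are not shown in this post. As a rough local cross-check of whatever Hive reports, the same kind of count (split on whitespace, group by word, sort by frequency) can be sketched in Python. The function name `word_count` and the `top_n` parameter are hypothetical, and this is not the Hive query itself:

```python
from collections import Counter


def word_count(text, top_n=10):
    """Return the top_n most frequent whitespace-separated words with counts."""
    words = text.lower().split()
    return Counter(words).most_common(top_n)


if __name__ == "__main__":
    # In practice the input would be the contents of newnote.txt.
    for word, count in word_count("to be or not to be", top_n=3):
        print(word, count)
```

If the top words and counts roughly match the Hive output, the table was probably loaded and split as intended.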

import pandas
import requests
from bs4 import BeautifulSoup


def getNewsDetail(newsUrl):
    # Fetch one news page and extract its source, title and publish time
    resd = requests.get(newsUrl)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    NewsDict = {}
    NewsDict['source'] = soupd.select('.comeFrom')[0].select('a')[0].text
    NewsDict['title'] = soupd.select('.headline')[0].text
    NewsDict['time'] = soupd.select('#pubtime_baidu')[0].text
    # NewsDict['content'] = soupd.select('.artical-main-content')[0].text
    return NewsDict


def Get_page(url):
    # Collect the details of every news item linked from one list page
    res = requests.get(url)
    res.encoding = 'utf-8'
    pagelist = []
    soup = BeautifulSoup(res.text, 'html.parser')
    for new in soup.select('.tag-list-box')[0].select('.list'):
        newsUrl = new.select('.list-content')[0].select('.name')[0].select('.n1')[0].select('a')[0]['href']
        pagelist.append(getNewsDetail(newsUrl))
    return pagelist


# Crawl list pages 1 to 24 of the tag
total = []
for i in range(1, 25):
    total.extend(Get_page('https://voice.hupu.com/nba/tag/3023-{}.html'.format(i)))

# Write the CSV once, after all pages have been collected
pan = pandas.DataFrame(total)
pan.to_csv('result3.csv')

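The titles were too long and messed up the layout later on, so I dropped the title column. Also, in the exported table the header of the first (id) column comes out empty, which conflicts with my later steps, so I deleted that column as well.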

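I looked it up on Baidu: it seems the fix is to pass index=False to to_csv, as in to_csv(csv, index=False). The table that comes out is then exactly what I wanted.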
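To make the effect of `index=False` concrete, here is a small sketch that writes the same rows both ways into an in-memory buffer and prints the header line of each. The sample rows and the helper name `to_csv_text` are made up for illustration; only `DataFrame.to_csv` and its `index` parameter come from pandas:

```python
import io

import pandas as pd

# Hypothetical rows shaped like the crawler output (title already dropped).
rows = [
    {"source": "ESPN", "time": "2018-05-10 08:00:00"},
    {"source": "NBA", "time": "2018-05-11 09:30:00"},
]


def to_csv_text(rows, with_index):
    """Return the CSV text pandas would write for the given rows."""
    buffer = io.StringIO()
    pd.DataFrame(rows).to_csv(buffer, index=with_index)
    return buffer.getvalue()


if __name__ == "__main__":
    # With the index, the header line starts with an empty field.
    print(to_csv_text(rows, with_index=True).splitlines()[0])
    # Without it, only the real columns remain.
    print(to_csv_text(rows, with_index=False).splitlines()[0])
```

`to_csv` also accepts `header=False`, which may save the manual step of deleting the first row before loading the file into Hive.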

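After that come the routine steps: open the file in Excel, delete the first (header) row, and so on. Here is my script; my skills are limited, so it looks a lot like the teacher's. The sh file is as follows: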

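At first I didn't notice that the time field had to be changed, so the format came out wrong. The data was only correct after I fixed it.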

That one small step cost me a lot of time.

Then I noticed the time was still wrong, so I adjusted the range in pre_deal.sh. The result is shown in the figure.

Then it was back to the usual flurry of Hive operations:

Then I saw this and froze: my time column came out as NULL.

Undeterred, I changed the type of the time column to string, and the output was normal.
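One plausible cause of the NULLs, not confirmed by this post, is that the scraped time text is not in the plain `YYYY-MM-DD` form a Hive date column expects. Below is a minimal sketch of normalizing the value before loading. It assumes the scraped text looks like `2018-05-10 08:30:00`, possibly with surrounding whitespace; both that input format and the helper name `to_hive_date` are assumptions:

```python
from datetime import datetime


def to_hive_date(raw, input_format="%Y-%m-%d %H:%M:%S"):
    """Return raw as 'YYYY-MM-DD', or None if it does not match input_format."""
    try:
        parsed = datetime.strptime(raw.strip(), input_format)
    except ValueError:
        return None
    return parsed.strftime("%Y-%m-%d")


if __name__ == "__main__":
    print(to_hive_date(" 2018-05-10 08:30:00\n"))
    print(to_hive_date("yesterday"))
```

Rows for which the helper returns None are worth inspecting by hand, since they show what the site actually emitted.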

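Then came the data analysis. I originally wanted to extract the number of news items published from May 10 to May 14, but the result was 0. So it seems this kind of filter works only on a date column, not on a string. Fixing it properly would have meant starting over and crawling a different site, so I switched to simply counting how many rows the table contains.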
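For comparison, the count that was originally planned can be sketched outside Hive by parsing each time string into a real date before comparing. The sample values, the year 2018, the input format, and the function name `count_in_range` are all assumptions for illustration, not data from the crawl:

```python
from datetime import date, datetime


def count_in_range(time_strings, start, end, input_format="%Y-%m-%d %H:%M:%S"):
    """Count the strings whose parsed date lies in [start, end], inclusive."""
    count = 0
    for raw in time_strings:
        try:
            day = datetime.strptime(raw.strip(), input_format).date()
        except ValueError:
            continue  # skip values that do not match the assumed format
        if start <= day <= end:
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical publish times; the real ones come from result3.csv.
    sample = [
        "2018-05-09 23:59:00",
        "2018-05-10 00:00:00",
        "2018-05-14 18:20:00",
        "2018-05-15 07:00:00",
        "not a date",
    ]
    print(count_in_range(sample, date(2018, 5, 10), date(2018, 5, 14)))
```

A count of 0 on real data would point to a format mismatch in the time strings rather than to a lack of news in that period.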

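Summary: the result is simple, but getting there was anything but. To avoid pitfalls later, take the early steps carefully, one at a time; otherwise you end up like me, redoing the steps above several times and getting thoroughly fed up. Next time, before picking a site to crawl, check what data types you will actually get out of it, because they determine what the final analysis can do. That is exactly where I stumbled, and I will choose my crawl target more carefully next time.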

Original post: https://www.cnblogs.com/wxf2/p/9048053.html