Classification and Analysis of Information-Field Hot Words 03

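(1) Project name: Classification, Analysis, and Explanation of Hot Words in the Information Technology Field

(2) Functional design:

Data collection: periodically and automatically crawl hot words related to the information field from the web;

Data cleaning: clean the hot-word data and use automatic classification to generate a catalog of information-field hot words;

Hot-word explanation: automatically add a Chinese explanation for each hot-word term (with reference to Baidu Baike or Wikipedia);

Hot-word citations: mark articles or news items that have recently cited the hot words and generate a hyperlinked catalog that users can click to visit;

Data visualization:

① display the hot words as a word cloud or hot-word chart;

② use a relationship graph to show how closely the hot words are related.

Data report: export the full hot-word catalog and the term explanations as a Word report.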

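Today I crawled the explanations for the hot words. The source is Baidu Baike entries, and some of the crawled explanations may be inaccurate.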
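Because some entries may return a missing or oddly formatted description, it can help to extract the text defensively. The following is a minimal sketch, assuming the explanation is taken from the page's `<meta name="description">` tag (as the script in this post does with a regular expression). `DescriptionParser` and `extract_description` are hypothetical names introduced for illustration only; they use the standard library's `html.parser`, which tolerates a different attribute order and decodes HTML entities.

```python
from html.parser import HTMLParser


class DescriptionParser(HTMLParser):
    """Collects the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # Only the first description tag is kept; later ones are ignored.
        if tag != "meta" or self.description is not None:
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "description":
            self.description = attr_map.get("content")


def extract_description(html_text):
    """Return the description text, or None if the page has none."""
    parser = DescriptionParser()
    parser.feed(html_text or "")
    return parser.description


# Hypothetical sample page, with the attributes in the "wrong" order.
sample_html = '<html><head><meta content="A &amp; B" name="description"/></head></html>'
sample_description = extract_description(sample_html)
```

Returning `None` for a failed download or a page without a description lets the caller skip that hot word instead of storing an empty explanation.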

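The data is saved both to an Excel sheet and to a database. Previously I crawled only into the Excel file, but importing that file into the database could produce errors, so here the data is also written to the database directly.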

import re

import pymysql
import requests
import xlwt

BASE_URL = 'https://baike.baidu.com/item/{}'
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0"
}


def get_page(url):
    """Return the page HTML, or None if the request fails."""
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.encoding = 'utf-8'
        if response.status_code == 200:
            print('Page fetched successfully')
            return response.text
        print('Failed to fetch page')
    except requests.RequestException as e:
        print(e)
    return None


f = xlwt.Workbook(encoding='utf-8')
sheet01 = f.add_sheet('sheet1', cell_overwrite_ok=True)
sheet01.write(0, 0, 'Hot word')  # row 1, column 1
sheet01.write(0, 1, 'Explanation')  # row 1, column 2
sheet01.write(0, 2, 'URL')  # row 1, column 3

with open('final_hotword2.txt', 'r', encoding='utf-8') as fopen:
    # Strip newlines here so they never end up in the URLs; skip blank lines.
    words = [line.strip() for line in fopen if line.strip()]

i = 0  # number of hot words saved so far (also the last Excel row written)
alllist = []
for hot in words:
    # Use the word itself instead of re-reading line i + 1 of the file,
    # which drifts out of sync as soon as one page has no explanation.
    url = BASE_URL.format(hot)
    print(url)
    page = get_page(url)
    if page is None:
        continue
    items = re.findall('<meta name="description" content="(.*?)">', page, re.S)
    print(items)
    if len(items) > 0:
        hotexplent = str(items[0])
        sheet01.write(i + 1, 0, hot)
        sheet01.write(i + 1, 1, hotexplent)
        sheet01.write(i + 1, 2, url)
        alllist.append((hot, hotexplent, url))
        i += 1
    print("Total crawled so far: " + str(i))
print("Crawling finished!")
print(alllist)

# Save the workbook first so a database failure does not lose the Excel file.
f.save('hotword_explain.xls')

# Store in MySQL
db = pymysql.connect(host="localhost", user="root", password="1229", database="lianxi", charset='utf8')
cursor = db.cursor()
sql_cvpr = "INSERT INTO website values(%s,%s,%s)"
try:
    cursor.executemany(sql_cvpr, alllist)
    db.commit()
except pymysql.MySQLError as e:
    print('Insert failed, rolling back:', e)
    db.rollback()
finally:
    db.close()
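The database step can be tried out without a MySQL server. The sketch below is not the setup used in this post: it swaps in the standard library's `sqlite3` with an in-memory database, so the placeholders are `?` rather than pymysql's `%s`. The table name `website` comes from the script; the column names `hot_word`, `explanation`, and `url`, the helper `save_rows`, and the sample rows are all assumptions made for illustration.

```python
import sqlite3


def save_rows(conn, rows):
    """Insert (hot_word, explanation, url) tuples in a single transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS website ("
        "hot_word TEXT PRIMARY KEY, explanation TEXT, url TEXT)"
    )
    try:
        # Naming the columns keeps the insert valid if the table gains columns.
        # INSERT OR REPLACE lets the crawler be re-run without duplicate rows.
        conn.executemany(
            "INSERT OR REPLACE INTO website (hot_word, explanation, url) "
            "VALUES (?, ?, ?)",
            rows,
        )
        conn.commit()
    except sqlite3.Error:
        conn.rollback()
        raise
    return conn.execute("SELECT COUNT(*) FROM website").fetchone()[0]


conn = sqlite3.connect(":memory:")
# Placeholder rows, not real crawl results.
sample_rows = [
    ("云计算", "sample explanation 1", "https://baike.baidu.com/item/云计算"),
    ("大数据", "sample explanation 2", "https://baike.baidu.com/item/大数据"),
]
row_count = save_rows(conn, sample_rows)
```

Using the hot word as the primary key is a design choice of this sketch; it makes a second run overwrite existing explanations rather than add a second copy of each row.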

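Author: 哦心有

Source: https://www.cnblogs.com/haobox/

Copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but you must link to the original article and retain this notice; otherwise the author reserves the right to pursue legal liability.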

Original article: https://www.cnblogs.com/haobox/p/15129841.html