zoukankan html css js c++ java

从新浪财经获取金融新闻类数据并进行打分计算

随着人们获取信息的方式转变，越来越多的人选择通过互联网来进行信息的获取。

新浪财经作为国内较为权威的专业财经新闻网站，通过其发布的新闻资讯可以判断某公司在近期舆论中的情况。

第一步：网络爬虫

这里不再讲，和以往的相比增加了获取新闻内容。

第二步：设计关键字（词）

一些对于金融机构不好的词（初试，不精确）

keywords = ['违规','诉讼','违约','兑付','罚款','失信','欺骗','损失','暴雷','违法','遭','受']

第三步：设计计算规则

每出现一次关键字（词）进行减2

第四步：进行程序设计

# 新浪财经新闻银行类消息评分系统demo
# 导入必要的库函数
import requests
import pymysql
import re
import time
# 获取header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
# 进行新闻的获取
## company为具体的公司，page为选取的信息页数
def find_news(company, page):
    # 获取新浪财经新闻信息
    # trigger用来标记每个公司新闻的个数
    trigger = 0
    # 创造一个数组来存放对于每条新闻的打分结果
    scores = []
    # 负面关键词
    keywords = ['违规','诉讼','违约','兑付','罚款','失信','欺骗','损失','暴雷','违法','遭','受']
    for num in range(page):
        # 新浪财经的url
        url = 'https://search.sina.com.cn/?q=' + company + '&range=all&c=news&sort=time' + '&page=' + str(num + 1) + '&ie=utf-8'
        res = requests.get(url, headers=headers, timeout=10).text
        # 标题
        p_title = '<h2><a href=".*?" target="_blank">(.*?)</a>'
        # 网址
        p_href = '<h2><a href="(.*?)" target="_blank">'
        # 时间
        p_date = '<span class="fgray_time">(.*?)</span>'
        # 来源
        p_source = '<span class="fgray_time">(.*?)</span>'
        
        title = re.findall(p_title, res, re.S)
        href = re.findall(p_href, res, re.S)
        date = re.findall(p_date, res, re.S)
        source = re.findall(p_source, res, re.S)
        
        # 数据清洗
        for index1 in range(len(title)):
            trigger += 1
            # 对于标题进行对齐和去除图像等操作
            title[index1] = title[index1].strip()
            title[index1] = re.sub('<.*?>','',title[index1])
            # 对于时间和来源来说，其位置是一致的，但是分开表示了
            date[index1] = date[index1].split(' ')[1]
            source[index1] = source[index1].split(' ')[0]
            # 对齐
            date[index1] = date[index1].strip()
            source[index1] = source[index1].strip()
            
            # 统一时间格式
            date[index1] = re.sub('年', '-', date[index1])
            date[index1] = re.sub('月', '-', date[index1])
            date[index1] = re.sub('日', '-', date[index1])
            if ('小时' in date[index1]) or ('分钟' in date[index1]):
                date[index1] = time.strftime("%Y-%m-%d")
            else:
                date[index1] = date[index1]
            result = 0
            try:
                # 获取每个新闻的内容
                article = requests.get(href[index1], headers=headers, timeout=10).text
            except:
                article = 'Error!'
            try:
                article = article.encode('ISO-8859-1').decode('utf-8')
            except:
                try:
                    article = article.encode('ISO-8859-1').decode('gbk')
                except:
                    article = article
            p_article = '<p>(.*?)</p>'
            article_main = re.findall(p_article, article)
            # article_main里是某一页的新闻内容
            ## 接下来对于每一页的内容进行计算，应该是按照行来分的
            for index1_1 in range(len(article_main)):
                article_main[index1_1] = re.sub('<.*?>','',article_main[index1_1])
                article_main[index1_1] = article_main[index1_1].strip()
            article = ''.join(article_main)
            for k in keywords:
                if (k in article) or (k in title[index1]):
                    result -= 2
            scores.append(result)
            print(str(trigger) + '.' + title[index1] + '(' + date[index1] + ' ' + source[index1] + ')')
            print(href[index1])
            print(company + '该新闻的不利指数为' + str(scores[index1]))
companys = ['中国银行','工商银行', '建设银行', '农业银行', '交通银行', '邮储银行']
for i in companys:
    try:
        find_news(i,2)
        print(i + '新浪财经新闻获取成功')
    except:
        print(i + '新浪财经新闻获取失败')

结果：

可以追踪到

哈哈哈，但是不是很精准，需要使用正则来进行更精确的匹配还有更好的评分设计。

之后会将数据保存到数据库中，为之后的数据挖掘来做准备。

查看全文

相关阅读:
访问的站点提示输入用户名密码
 ASP.NET Session 过期问题
 c#和VB代码转化网址
 整理 css 小技巧
 asp.net当修改header时提示：The Controls collection cannot be modified because the control contains code blocks (i.e. <% ... %>)
sql server 2005 replication 设置
 动态加载CSS和Javascript文件 javascript 和asp.net.
一台机子上安装两个版本的mysql
sql删除所有表外键和表
 Griview中的删除按钮添加“确认提示”

原文地址：https://www.cnblogs.com/zhuozige/p/14525990.html