  • TF-IDF in Practice

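    The plan has four parts:

    1. Write a crawler in Python to fetch NetEase news

    2. Run a word segmentation tool over the crawled text to build a corpus

    3. Use TF-IDF to automatically extract the keywords of each news item

    4. Use TF-IDF to recommend similar news

    step 1a

    I spent the whole day on the Python crawler, and it took a lot of effort to get a barely usable version working.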

        # -*- coding: utf-8 -*-
        # Python 2 code (print statements, urllib.urlopen)

        import re, urllib, sys
        import pyodbc

        newsLink = set()     # every news link collected so far
        processLink = set()  # links being processed
        newLink = set()      # links discovered in the current round
        viewedLink = set()   # pages already fetched

        # Open the given link, find the news links on that page with a regular
        # expression, and add them to the global sets
        def getNewsLink(link):
            if link in viewedLink:
                return
            viewedLink.add(link)
            content = ""
            try:  # fetching or decoding may raise
                content = urllib.urlopen(link).read().decode('gbk').encode('utf-8')
            except:
                info = sys.exc_info()
                print info[0], ":", info[1]
                print "caused by link : ", link
            # NetEase news links look like http://news.163.com/14/0621/12/9V8V9AL60001124J.html
            m = re.findall(r"news\.163\.com/\d{2}/\d{4}/\d{2}/\w+\.html", content, re.M)
            for i in m:
                url = "http://" + i
                newLink.add(url)
                newsLink.add(url)
            print "crawled %d page, get %d link" % (len(viewedLink), len(newsLink))

        # Store the collected news IDs in the database
        def saveNewsIDtoDB():
            newsID = dict()
            for link in newsLink:
                ID = link[31:47]  # slice the 16-character news ID out of the URL
                newsID[ID] = link
            conn = pyodbc.connect('DRIVER={SQL Server};SERVER=STEVEN-PC\MSSQLSERVER_R2;DATABASE=TF-IDF;UID=sa;PWD=123456')
            cursor = conn.cursor()
            for (ID, url) in newsID.items():
                sql = "INSERT INTO News(NewsID, Url) VALUES ('%s','%s')" % (ID, url)
                try:
                    cursor.execute(sql)
                except:
                    info = sys.exc_info()
                    print info[0], ":", info[1]
                    print "caused by sql : ", sql
            conn.commit()
            conn.close()
            print "total get %d news ID" % (len(newsID))

        # Crawl until the requested number of news links has been collected
        def readNews(count):
            processLink = set()
            processLink.add("http://news.163.com/")
            # also stop when a round finds no new links, otherwise the loop never ends
            while len(newsLink) < count and processLink:
                for link in processLink:
                    getNewsLink(link)
                processLink = newLink.copy()
                newLink.clear()

        readNews(10000)
        saveNewsIDtoDB()

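    It automatically crawls the requested number of news links and stores their IDs in the database.

    NetEase News has no public API, but its news links follow a fixed format.

    Take http://news.163.com/14/0621/12/9V8V9AL60001124J.html: 14 is the year, 0621 is the date, 12 is always two digits although I do not know what it means, and the 16-character string that follows is the news ID.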
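    As a small illustration of that URL format, the sketch below splits a link into its parts with named groups. It is written for Python 3, and the group names (`year`, `date`, `slot`, `news_id`) and the helper `parse_news_url` are labels invented for this example, not NetEase terminology; `slot` stands for the two-digit field whose meaning is unknown.

```python
import re

# Pattern for article URLs of the form described above.
# The group names are labels chosen for this sketch only.
NEWS_URL = re.compile(
    r"news\.163\.com/(?P<year>\d{2})/(?P<date>\d{4})/(?P<slot>\d{2})/(?P<news_id>\w{16})\.html"
)

def parse_news_url(url):
    """Return the URL components as a dict, or None if the URL does not match."""
    m = NEWS_URL.search(url)
    return m.groupdict() if m else None

parts = parse_news_url("http://news.163.com/14/0621/12/9V8V9AL60001124J.html")
print(parts)
```

    Matching on the pattern rather than slicing at fixed offsets keeps working if a link carries a different prefix, at the cost of rejecting IDs that are not exactly 16 characters long.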

    It ran for a few dozen minutes and collected 10360 news links.

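    step 1b

    Use BeautifulSoup to parse each link and extract the news title, body, and publication date.

    It ran for close to an hour and produced 9714 news records. Nearly a thousand were lost along the way: some articles had already been deleted, and others had a malformed body that pulled in a pile of JS code, which then raised an error when written to the database.

    That is still plenty, though.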
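    One plausible cause of those database errors (an assumption, the post does not give the exact message) is quote characters in the scraped text breaking SQL statements that are assembled with string formatting. A minimal Python 3 sketch of the usual remedy, parameterized statements, follows. It uses the standard-library `sqlite3` module as a stand-in for SQL Server so that it runs anywhere; pyodbc uses the same `?` placeholder style. The table layout and the helper `update_news` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the SQL Server database
conn.execute("CREATE TABLE News (NewsID TEXT PRIMARY KEY, Title TEXT, Body TEXT)")
conn.execute("INSERT INTO News (NewsID) VALUES (?)", ("9V8V9AL60001124J",))

def update_news(conn, news_id, title, body):
    """Store title and body using ? placeholders instead of string formatting."""
    conn.execute(
        "UPDATE News SET Title = ?, Body = ? WHERE NewsID = ?",
        (title, body, news_id),
    )

# Text containing quotes (for example stray JavaScript) is stored verbatim.
tricky_body = "var s = 'it''s'; alert(\"x\");"
update_news(conn, "9V8V9AL60001124J", "sample title", tricky_body)
stored = conn.execute(
    "SELECT Body FROM News WHERE NewsID = ?", ("9V8V9AL60001124J",)
).fetchone()[0]
print(stored == tricky_body)
```

    With placeholders the driver handles quoting, so a body full of quotes no longer needs special treatment before it is saved.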

    The parsing code is as follows:

        # encoding: utf-8
        # Python 2 code
        import re, urllib, sys
        import pyodbc, json
        import socket
        from bs4 import BeautifulSoup
        socket.setdefaulttimeout(10.0)

        def readNews():
            conn = pyodbc.connect('DRIVER={SQL Server};SERVER=STEVEN-PC\MSSQLSERVER_R2;DATABASE=TF-IDF;UID=sa;PWD=123456')
            cursor = conn.cursor()
            sql = "SELECT * FROM News"
            cursor.execute(sql)
            rows = cursor.fetchall()

            updateCount = 0

            for row in rows:  # read the links back from the database
                print row.NewsID, row.Url
                content = ""
                ptime = ""
                title = ""
                body = ""
                newsID = row.NewsID.strip()
                try:  # fetching or parsing may raise
                    content = urllib.urlopen(row.Url).read()  # download the page
                    # publication date, rebuilt from the year and date fields of the URL
                    ptime = "20" + row.Url[20:22] + "-" + row.Url[23:25] + "-" + row.Url[25:27]
                    title, body = analyzeNews(content)  # extract title and body
                except:
                    info = sys.exc_info()
                    print info[0], ":", info[1]
                    print "caused by link : ", row.Url
                    continue

                # values are interpolated directly, so text containing a single
                # quote breaks the statement and the row is skipped below
                sql = "UPDATE News SET Title = '%s', Body = '%s',ptime = '%s' WHERE NewsID = '%s'" % (title, body, ptime, newsID)
                try:  # executing the statement may raise
                    cursor.execute(sql)
                except:
                    info = sys.exc_info()
                    print info[0], ":", info[1]
                    print "caused by sql : ", sql
                    continue
                updateCount += 1
                if updateCount % 100 == 0:
                    conn.commit()
                    print "updated %s rows so far!" % (updateCount)
            conn.commit()
            conn.close()
            print "done, %s rows updated in total!" % (updateCount)

        def analyzeNews(content):
            soup = BeautifulSoup(content, from_encoding="gb18030")
            title = soup.title.get_text()[:-7]  # drop the last 7 characters of the page title
            # the body container uses one of several ids depending on the page
            bodyHtml = soup.find(id="endtext")
            if bodyHtml is None:
                bodyHtml = soup.find(id="text")
            if bodyHtml is None:
                bodyHtml = soup.find(id="endText")
            body = bodyHtml.get_text()
            body = re.sub("\n+", "\n", body)  # collapse runs of newlines
            print title
            return title, body

        readNews()

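    step 2

    Segment each news item with jieba and store the result in the database, giving title words five times the weight of body words.

    I did not expect the database to be this fast: it handled nearly ten thousand INSERT statements per second.

    The code is as follows: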

        # -*- coding: utf-8 -*-
        # Python 2 code

        import re, urllib, sys
        import pyodbc
        import jieba

        # stop words, one per line; a set makes the membership test fast
        stop = set(line.strip().decode('utf-8') for line in open('chinese_stopword.txt').readlines())

        def readNewsContent():
            conn = pyodbc.connect('DRIVER={SQL Server};SERVER=STEVEN-PC\MSSQLSERVER_R2;DATABASE=TF-IDF;UID=sa;PWD=123456')
            cursor = conn.cursor()
            sql = "SELECT * FROM News"
            cursor.execute(sql)
            rows = cursor.fetchall()

            word_dict = dict()  # for each word, the number of news items that contain it

            insert_count = 0
            for row in rows:  # read the news from the database
                content = row.Body
                title = row.Title
                newsID = row.NewsID.strip()
                seg_dict = sliceNews(title, content)  # segment into words

                newsWordCount = 0
                for (word, count) in seg_dict.items():
                    newsWordCount += count
                    # store the word frequencies of this news item
                    sql = "INSERT INTO ContentWord(Word, Count, NewsID) VALUES ('%s',%d, '%s')" % (word, count, newsID)
                    cursor.execute(sql)
                    insert_count += 1
                    if insert_count % 10000 == 0:
                        print "inserted %d per-news word-frequency records!" % (insert_count)
                    if word in word_dict:  # maintain word_dict
                        word_dict[word] += 1
                    else:
                        word_dict[word] = 1
                sql = "UPDATE News SET WordCount = '%d' WHERE NewsID = '%s'" % (newsWordCount, newsID)
                cursor.execute(sql)
                conn.commit()
            print "inserted %d per-news word-frequency records in total!" % (insert_count)

            # store word_dict in the database
            for (word, count) in word_dict.items():
                sql = "INSERT INTO TotalWord(Word, Count) VALUES ('%s',%d)" % (word, count)
                cursor.execute(sql)
            print "inserted %d total word-frequency records!" % (len(word_dict.items()))
            conn.commit()
            conn.close()

        # Segment the input text and return word counts with stop words removed
        def sliceNews(title, content):
            title_segs = list(jieba.cut(title))
            segs = list(jieba.cut(content))
            for i in range(5):  # the title weighs five times as much as the body
                segs += title_segs

            seg_set = set(segs)
            seg_dict = dict()
            for seg in seg_set:  # drop stop words and count each remaining word
                if seg not in stop and re.match(ur"[\u4e00-\u9fa5]+", seg):  # Chinese words only
                    seg_dict[seg] = segs.count(seg)

            return seg_dict

        readNewsContent()

    It finished in a few minutes, inserting 1475330 per-news word-frequency records and 135961 total word-frequency records.
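    The title weighting can also be written compactly with `collections.Counter`. The Python 3 sketch below assumes the text has already been segmented into token lists (for instance by `jieba.cut`); the function name `weighted_counts` and its parameters are hypothetical, and unlike the step 2 code it does not filter out non-Chinese tokens.

```python
from collections import Counter

def weighted_counts(title_tokens, body_tokens, stopwords=(), title_weight=5):
    """Count tokens, letting each title token count title_weight times."""
    counts = Counter(body_tokens)
    for token in title_tokens:
        counts[token] += title_weight
    for word in stopwords:
        counts.pop(word, None)  # drop stop words if present
    return counts

counts = weighted_counts(["煤矿", "事故"], ["煤矿", "的", "救援", "煤矿"], stopwords={"的"})
print(counts)
```

    Counting in one pass avoids calling `list.count` once per distinct word, which matters for long articles.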

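    step 3

    Next, compute TF-IDF values from the segmentation results, take the 20 words with the highest TF-IDF value in each news item as its keywords, and save them to the database.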
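    For reference, the quantities computed in this step can be written out as below. The symbols are notation chosen here rather than the post's own: $c(w,d)$ is the title-weighted count of word $w$ in news item $d$, $N$ is the number of news items, and $\mathrm{df}(w)$ is the number of news items containing $w$. The logarithm is the natural one, matching `math.log`.

$$
\mathrm{tf}(w,d) = \frac{c(w,d)}{\sum_{w'} c(w',d)}, \qquad
\mathrm{idf}(w) = \ln\frac{N}{\mathrm{df}(w) + 1}, \qquad
\mathrm{tfidf}(w,d) = \mathrm{tf}(w,d)\cdot\mathrm{idf}(w)
$$

    The $+1$ in the denominator makes $\mathrm{idf}$ zero or negative for a word that appears in all, or all but one, of the news items, so such words never rank as keywords.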

    The code is as follows:

        # -*- coding: utf-8 -*-
        # Python 2 code

        import re, urllib, sys
        import pyodbc
        import math


        conn = pyodbc.connect('DRIVER={SQL Server};SERVER=STEVEN-PC\MSSQLSERVER_R2;DATABASE=TF-IDF;UID=sa;PWD=123456')
        cursor = conn.cursor()
        newsCount = 0
        totalWordDict = dict()

        def init():
            # read the total number of news items
            sql = "SELECT COUNT(*) FROM News"
            cursor.execute(sql)
            row = cursor.fetchone()
            global newsCount
            newsCount = int(row[0])
            # read the total word frequencies into a dictionary
            sql = "SELECT * FROM TotalWord"
            cursor.execute(sql)
            rows = cursor.fetchall()
            for row in rows:
                totalWordDict[row.Word.strip()] = int(row.Count)

        def clean():
            conn.commit()
            conn.close()

        # Compute the TF-IDF values of the keywords of every news item
        def cacluTFIDF():
            sql = "SELECT * FROM NEWS"  # iterate over all news
            cursor.execute(sql)
            rows = cursor.fetchall()
            insertCount = 0
            for row in rows:  # compute the keyword TF-IDF values for each news item
                newsID = row.NewsID.strip()
                keyWordList = calcuKeyWords(newsID)
                for keyWord in keyWordList:  # store the computed TF-IDF values
                    word = keyWord[0]
                    value = keyWord[1]
                    sql = "INSERT INTO TFIDF(Word, Value, NewsID) VALUES ('%s',%f, '%s')" % (word, value, newsID)
                    cursor.execute(sql)
                    insertCount += 1
                    if insertCount % 10000 == 0:
                        print "inserted %d TFIDF records!" % (insertCount)
                conn.commit()
            print "inserted %d TFIDF records in total!" % (insertCount)

        # Compute the keywords of the given news item
        def calcuKeyWords(newsID):
            newsID = newsID.strip()
            sql = "SELECT * FROM NEWS WHERE NewsID = '%s'" % (newsID)
            cursor.execute(sql)
            newsWordCount = cursor.fetchone().WordCount  # total word count of this news item

            sql = "SELECT * FROM ContentWord WHERE NewsID = '%s'" % (newsID)
            cursor.execute(sql)
            rows = cursor.fetchall()
            tfidf_dict = dict()
            global newsCount
            # build the tf-idf dictionary of this news item
            for row in rows:
                word = row.Word.strip()
                count = row.Count
                tf = float(count) / newsWordCount
                idf = math.log(float(newsCount) / (totalWordDict[word] + 1))
                tfidf = tf * idf
                tfidf_dict[word] = tfidf
            # keep the 20 highest-scoring words, in ascending order of score
            keyWordList = sorted(tfidf_dict.items(), key=lambda d: d[1])[-20:]
            return keyWordList


        init()
        cacluTFIDF()
        clean()

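    For example, take the news item 重庆东胜煤矿5名遇难者遗体全部找到 ("Bodies of all 5 victims of the Chongqing Dongsheng coal mine have been found").

    The keywords computed by the program, in ascending order of weight, are: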

    窜年产采空区工人冒落东翼煤约矸重庆市南川名顶板工作面采煤找到遇难者重庆遗体煤矿东胜
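    To make the ranking concrete, here is a minimal in-memory Python 3 sketch of the same scoring, tf multiplied by log(N / (df + 1)). The function `top_keywords` and all of the numbers are invented for illustration. Unlike the step 3 code, it returns the keywords highest score first.

```python
import math

def top_keywords(doc_counts, doc_freq, n_docs, k=20):
    """Rank one document's words by tf * idf and return the k best, highest first."""
    total = sum(doc_counts.values())
    scores = {}
    for word, count in doc_counts.items():
        tf = count / total
        idf = math.log(n_docs / (doc_freq[word] + 1))
        scores[word] = tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Toy numbers, invented for illustration only.
doc_counts = {"煤矿": 7, "事故": 5, "今天": 3}     # title-weighted counts in one news item
doc_freq = {"煤矿": 9, "事故": 49, "今天": 999}    # number of news items containing each word
keywords = top_keywords(doc_counts, doc_freq, n_docs=1000, k=2)
print([word for word, _ in keywords])
```

    A word that occurs in almost every news item, like the third one here, gets an idf close to zero and drops out of the keyword list no matter how often it appears in the article.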

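    step 4

    With the keywords in hand, automatic recommendation becomes possible.

    The procedure is as follows (quoted from Ruan Yifeng's blog):

      (1) Use the TF-IDF algorithm to find the keywords of the two articles;
      (2) Take a number of keywords (say 20) from each article, merge them into one set, and compute each article's frequency for every word in that set (relative frequency can be used to avoid differences in article length);
      (3) Build a word-frequency vector for each of the two articles;
      (4) Compute the cosine similarity of the two vectors; the larger the value, the more similar the articles.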
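    The four steps above can be sketched on in-memory data as follows. This is a Python 3 illustration under stated assumptions, not the database-backed code of this post: the keyword lists and count dictionaries are toy values, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, or None if undefined."""
    if len(a) != len(b):
        return None
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else None

def similarity(keywords1, counts1, keywords2, counts2):
    """Steps (2)-(4): union the keywords, build count vectors, compare them."""
    vocab = sorted(set(keywords1) | set(keywords2))
    v1 = [counts1.get(w, 0) for w in vocab]
    v2 = [counts2.get(w, 0) for w in vocab]
    return cosine_similarity(v1, v2)

# Toy word counts for two articles, invented for illustration.
counts_a = {"煤矿": 7, "事故": 5, "救援": 1}
counts_b = {"煤矿": 4, "爆炸": 3, "救援": 2}
result = similarity(["煤矿", "事故"], counts_a, ["煤矿", "爆炸"], counts_b)
print(result)
```

    Because cosine similarity depends only on the direction of the vectors, raw counts and relative frequencies give the same value, so no length normalization is needed before the comparison.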

    The code is as follows:

        # -*- coding: utf-8 -*-
        # Python 2 code

        import re, urllib, sys
        import pyodbc
        import math

        conn = pyodbc.connect('DRIVER={SQL Server};SERVER=STEVEN-PC\MSSQLSERVER_R2;DATABASE=TF-IDF;UID=sa;PWD=123456')
        cursor = conn.cursor()

        def clean():
            conn.commit()
            conn.close()

        # Similarity of two news items: the cosine similarity of their keyword count vectors
        def similar(newsID1, newsID2):
            newsID1 = newsID1.strip()
            newsID2 = newsID2.strip()
            # collect the keywords of both news items into one set
            sql = "SELECT * FROM TFIDF WHERE NewsID = '%s' OR NewsID = '%s'" % (newsID1, newsID2)
            cursor.execute(sql)
            rows = cursor.fetchall()
            wordSet = set()
            for row in rows:
                wordSet.add(row.Word)
            # count how often each keyword occurs in each news item, as two vectors
            vector1 = []
            vector2 = []
            for word in wordSet:
                sql = "SELECT * FROM ContentWord WHERE NewsID = '%s' AND Word = '%s'" % (newsID1, word)
                cursor.execute(sql)
                rows = cursor.fetchall()
                if len(rows) == 0:
                    vector1.append(0)
                else:
                    vector1.append(int(rows[0].Count))
                sql = "SELECT * FROM ContentWord WHERE NewsID = '%s' AND Word = '%s'" % (newsID2, word)
                cursor.execute(sql)
                rows = cursor.fetchall()
                if len(rows) == 0:
                    vector2.append(0)
                else:
                    vector2.append(int(rows[0].Count))
            return calcuCosDistance(vector1, vector2)

        # Cosine similarity of two vectors (None if the lengths differ or a vector is all zeros)
        def calcuCosDistance(a, b):
            if len(a) != len(b):
                return None
            part_up = 0.0
            a_sq = 0.0
            b_sq = 0.0
            for a1, b1 in zip(a, b):
                part_up += a1 * b1
                a_sq += a1 ** 2
                b_sq += b1 ** 2
            part_down = math.sqrt(a_sq * b_sq)
            if part_down == 0.0:
                return None
            else:
                return part_up / part_down

        # Given a news ID, print the few news items most similar to it
        def recommand(newsID):
            limit = 5
            result = dict()  # similarity -> news ID, holds the best `limit` entries
            sql = "SELECT * FROM NEWS"  # iterate over all news
            cursor.execute(sql)
            rows = cursor.fetchall()

            newsID = newsID.strip()
            calcuCount = 0
            for row in rows:
                calcuCount += 1
                if calcuCount % 200 == 0:
                    print "computed the similarity of %d news pairs so far" % (calcuCount)
                if row.NewsID.strip() != newsID:  # skip the item itself
                    distance = similar(newsID, row.NewsID)  # similarity of the two news items
                    if distance is None:  # undefined similarity, nothing to rank
                        continue
                    if len(result) < limit:
                        result[distance] = row.NewsID
                    else:
                        minDis = min(result.keys())
                        if minDis < distance:  # replace the weakest entry kept so far
                            del result[minDis]
                            result[distance] = row.NewsID

            print "Input news ID:          %s" % (newsID)
            sql = "SELECT * FROM NEWS WHERE NewsID = '%s'" % (newsID)
            cursor.execute(sql)
            row = cursor.fetchone()
            print "Input news URL:         %s" % (row.Url.encode('utf-8'))
            print "Input news title:       %s" % (row.Title.decode('gb2312').encode('utf-8'))
            print "--------------------------------------"
            for sim, newsID in result.items():
                sql = "SELECT * FROM NEWS WHERE NewsID = '%s'" % (newsID)
                cursor.execute(sql)
                row = cursor.fetchone()
                print "Similarity:             %f" % (sim)
                print "Recommended news ID:    %s" % (row.NewsID.encode('utf-8'))
                print "Recommended news URL:   %s" % (row.Url.encode('utf-8'))
                print "Recommended news title: %s" % (row.Title.decode('gb2312').encode('utf-8'))
                print ""

        #print similar("2IK789GB0001121M", "2IKJ8KRJ0001121M")
        recommand("A4AVPKLA00014JB5")
        clean()
    Feeding in the ID of the news item above gives the following result:

        Input news ID:          A4AVPKLA00014JB5
        Input news URL:         http://news.163.com/14/0823/10/A4AVPKLA00014JB5.html
        Input news title:       重庆东胜煤矿5名遇难者遗体全部找到
        --------------------------------------
        Similarity:             0.346214
        Recommended news ID:    A4BHA5OO0001124J
        Recommended news URL:   http://news.163.com/14/0823/15/A4BHA5OO0001124J.html
        Recommended news title: 安徽淮南煤矿爆炸事故救援再次发现遇难者遗体

        Similarity:             0.356118
        Recommended news ID:    8H0Q439K00011229
        Recommended news URL:   http://news.163.com/12/1123/16/8H0Q439K00011229.html
        Recommended news title: 安徽淮北首富被曝用500万元买通矿难遇难者家属

        Similarity:             0.320387
        Recommended news ID:    A3MBB7CF00014JB6
        Recommended news URL:   http://news.163.com/14/0815/10/A3MBB7CF00014JB6.html
        Recommended news title: 黑龙江鸡西煤矿透水事故9人升井 仍有16名矿工被困

        Similarity:             0.324280
        Recommended news ID:    5Q92I93D000120GU
        Recommended news URL:   http://news.163.com/09/1211/16/5Q92I93D000120GU.html
        Recommended news title: 土耳其煤矿发生瓦斯爆炸 19名矿工全部遇难

        Similarity:             0.361950
        Recommended news ID:    6D7J4VLR00014AED
        Recommended news URL:   http://news.163.com/10/0804/05/6D7J4VLR00014AED.html
        Recommended news title: 贵州一煤矿发生煤与瓦斯突出事故

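    The recommendations are highly relevant: all five are about coal mine accidents.

    However, every recommendation has to traverse the database (the code issues two queries per keyword for each candidate news item), so the cost is very high: recommending related news for a single item takes about 10 minutes, which is clearly unacceptable in real use.

    Still, this was only a rough test, and I am personally very satisfied with it.

    My impression is that the algorithm is rather magical: the code above needs no knowledge of what the news actually says, yet the program makes fairly accurate judgments on its own, which is both convenient and fun.

    There is still plenty of room for optimization:

    When crawling, some useless information (source, reporter name, and the like) could be stripped out.

    TF-IDF values could be adjusted by word position; for example, words in the first paragraph and in the first sentence of each paragraph should get a somewhat higher TF-IDF value.

    News could be roughly classified so that recommending related news for one item does not require traversing the whole news database.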
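    One way to avoid scanning every news item, sketched here in Python 3, is an inverted index from keyword to news IDs, so that only news sharing at least one keyword with the input are scored. Note that this swaps in a different technique from the coarse classification suggested above, and it is a simplification: similarity is computed over each item's own keyword counts held in memory rather than over the database tables. All names and data are hypothetical.

```python
import heapq
import math
from collections import defaultdict

def cosine(c1, c2):
    """Cosine similarity of two sparse count dicts (0.0 if either is empty)."""
    dot = sum(v * c2.get(w, 0) for w, v in c1.items())
    norm = math.sqrt(sum(v * v for v in c1.values()) * sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def build_index(keyword_counts):
    """Map each keyword to the set of news IDs that list it as a keyword."""
    index = defaultdict(set)
    for news_id, counts in keyword_counts.items():
        for word in counts:
            index[word].add(news_id)
    return index

def recommend(news_id, keyword_counts, index, limit=5):
    """Score only the news that share a keyword with news_id; return the top ones."""
    target = keyword_counts[news_id]
    candidates = set()
    for word in target:
        candidates |= index[word]
    candidates.discard(news_id)  # never recommend the item itself
    scored = ((cosine(target, keyword_counts[c]), c) for c in candidates)
    return heapq.nlargest(limit, scored)

# Toy keyword counts, invented for illustration.
keyword_counts = {
    "A": {"煤矿": 7, "遇难者": 3},
    "B": {"煤矿": 4, "爆炸": 3},
    "C": {"足球": 6, "比赛": 2},
}
index = build_index(keyword_counts)
top = recommend("A", keyword_counts, index)
print(top)
```

    `heapq.nlargest` keeps the best `limit` pairs without the dictionary keyed by similarity, so two candidates with the same score no longer overwrite each other.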


    Reference: Ruan Yifeng's blog  http://www.ruanyifeng.com/blog/2013/03/tf-idf.html

  • Original article: https://www.cnblogs.com/stevenczp/p/3931354.html