  • 【346】TF-IDF

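    Ref: Text Mining Preprocessing: Vectorization and the Hash Trick

    Ref: Text Mining Preprocessing: TF-IDF

    Ref: sklearn.feature_extraction.text.CountVectorizer

    Ref: Applications of TF-IDF and Cosine Similarity (1): Automatic Keyword Extraction

    Ref: Applications of TF-IDF and Cosine Similarity (2): Finding Similar Articles

    Ref: Applications of TF-IDF and Cosine Similarity (3): Automatic Summarization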

    >>> from sklearn.feature_extraction.text import TfidfTransformer
    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> corpus=["I come to China to travel", 
        "This is a car polupar in China",          
        "I love tea and Apple ",   
        "The work is to write some papers in science"]
    >>> vectorizer=CountVectorizer()
    >>> transformer = TfidfTransformer()
    >>> tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
    >>> print(tfidf)
      (0, 16)	0.4424621378947393
      (0, 15)	0.697684463383976
      (0, 4)	0.4424621378947393
      (0, 3)	0.348842231691988
      (1, 14)	0.45338639737285463
      (1, 9)	0.45338639737285463
      (1, 6)	0.3574550433419527
      (1, 5)	0.3574550433419527
      (1, 3)	0.3574550433419527
      (1, 2)	0.45338639737285463
      (2, 12)	0.5
      (2, 7)	0.5
      (2, 1)	0.5
      (2, 0)	0.5
      (3, 18)	0.3565798233381452
      (3, 17)	0.3565798233381452
      (3, 15)	0.2811316284405006
      (3, 13)	0.3565798233381452
      (3, 11)	0.3565798233381452
      (3, 10)	0.3565798233381452
      (3, 8)	0.3565798233381452
      (3, 6)	0.2811316284405006
      (3, 5)	0.2811316284405006
    >>> print(vectorizer.get_feature_names_out().tolist())    # get_feature_names() was removed in scikit-learn 1.2
    ['and', 'apple', 'car', 'china', 'come', 'in', 'is', 'love', 'papers', 'polupar', 'science', 'some', 'tea', 'the', 'this', 'to', 'travel', 'work', 'write']
    

    Note: (0, 16) refers to the first text (row 0) and the term at index 16, which is "travel"; the other entries are read the same way.
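To make the (row, column) notation concrete, here is a minimal sketch that decodes the entries printed for the first text. The vocabulary and the weights are copied from the output above; the helper name `decode` is hypothetical and only used for illustration.

```python
# Vocabulary exactly as printed by the vectorizer above.
feature_names = ['and', 'apple', 'car', 'china', 'come', 'in', 'is', 'love',
                 'papers', 'polupar', 'science', 'some', 'tea', 'the', 'this',
                 'to', 'travel', 'work', 'write']

# (row, column, weight) entries of the first text, copied from the sparse printout.
entries = [
    (0, 16, 0.4424621378947393),
    (0, 15, 0.697684463383976),
    (0, 4, 0.4424621378947393),
    (0, 3, 0.348842231691988),
]


def decode(entries, feature_names):
    """Replace each column index with the term it stands for (hypothetical helper)."""
    return [(row, feature_names[col], value) for row, col, value in entries]


decoded = decode(entries, feature_names)
for row, term, value in decoded:
    print(f"text {row}: {term!r} -> {value:.4f}")
```

The row index selects the text in `corpus`, and the column index selects the term in the vocabulary list.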

    Continuing from the session above, we now look up the tfidf value of each term. The tfidf variable holds a (4, 19) matrix: one row per sentence and one column per term.

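    >>> tfidf_array = tfidf.toarray()    # dense (4, 19) array; each row is converted to a list below
    >>> names_list = vectorizer.get_feature_names_out().tolist()    # feature names as a list (get_feature_names() was removed in scikit-learn 1.2)
    >>> for i in range(len(corpus)):
    	print(corpus[i], '\n')
    	tmp_list = tfidf_array[i].tolist()
    	for j in range(len(names_list)):
    		if tmp_list[j] != 0:
    			# one tab after long names, two after short ones, to keep the columns aligned
    			if len(names_list[j]) >= 7:
    				print(names_list[j], '\t', tmp_list[j])
    			else:
    				print(names_list[j], '\t\t', tmp_list[j])
    	print('')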
    
    	
    I come to China to travel 
    
    china 		 0.348842231691988
    come 		 0.4424621378947393
    to 		 0.697684463383976
    travel 		 0.4424621378947393
    
    This is a car polupar in China 
    
    car 		 0.45338639737285463
    china 		 0.3574550433419527
    in 		 0.3574550433419527
    is 		 0.3574550433419527
    polupar 	 0.45338639737285463
    this 		 0.45338639737285463
    
    I love tea and Apple  
    
    and 		 0.5
    apple 		 0.5
    love 		 0.5
    tea 		 0.5
    
    The work is to write some papers in science 
    
    in 		 0.2811316284405006
    is 		 0.2811316284405006
    papers 		 0.3565798233381452
    science 	 0.3565798233381452
    some 		 0.3565798233381452
    the 		 0.3565798233381452
    to 		 0.2811316284405006
    work 		 0.3565798233381452
    write 		 0.3565798233381452
    
    >>> 
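Once the weights are listed per term, a natural next step is to rank the terms of a text by weight, which is the idea behind the keyword-extraction reference at the top. The following is only a sketch under stated assumptions: `rank_terms` and `top_k` are hypothetical names, and the row is rebuilt by hand from the values printed above rather than taken from `tfidf_array`.

```python
def rank_terms(weights, names, top_k=3):
    """Return up to top_k (term, weight) pairs with non-zero weight, heaviest first."""
    pairs = [(name, w) for name, w in zip(names, weights) if w != 0]
    # Sort by descending weight; ties are broken alphabetically for a stable order.
    pairs.sort(key=lambda p: (-p[1], p[0]))
    return pairs[:top_k]


names = ['and', 'apple', 'car', 'china', 'come', 'in', 'is', 'love',
         'papers', 'polupar', 'science', 'some', 'tea', 'the', 'this',
         'to', 'travel', 'work', 'write']

# Dense TF-IDF row for "I come to China to travel", copied from the output above.
row0 = [0.0] * len(names)
row0[3] = 0.348842231691988     # china
row0[4] = 0.4424621378947393    # come
row0[15] = 0.697684463383976    # to
row0[16] = 0.4424621378947393   # travel

top = rank_terms(row0, names)
for term, weight in top:
    print(f"{term:<10}{weight:.4f}")
```

In the session above the same function could be called as `rank_terms(tfidf_array[i], names_list)`. Note that "to" ranks first here because the default CountVectorizer does not remove stop words.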
    

    Getting the TF (Term Frequency), i.e. the raw term counts

    >>> X = vectorizer.fit_transform(corpus)
    >>> X.toarray()
    array([[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0],
           [0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
           [1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1]],
          dtype=int64)
    >>> vector_array = X.toarray()
    >>> for i in range(len(corpus)):
    	print(corpus[i], '\n')
    	tmp_list = vector_array[i].tolist()
    	for j in range(len(names_list)):
    		if tmp_list[j] != 0:
    			# one tab after long names, two after short ones, to keep the columns aligned
    			if len(names_list[j]) >= 7:
    				print(names_list[j], '\t', tmp_list[j])
    			else:
    				print(names_list[j], '\t\t', tmp_list[j])
    	print('')
    
    I come to China to travel 
    
    china 		 1
    come 		 1
    to 		 2
    travel 		 1
    
    This is a car polupar in China 
    
    car 		 1
    china 		 1
    in 		 1
    is 		 1
    polupar 	 1
    this 		 1
    
    I love tea and Apple  
    
    and 		 1
    apple 		 1
    love 		 1
    tea 		 1
    
    The work is to write some papers in science 
    
    in 		 1
    is 		 1
    papers 		 1
    science 	 1
    some 		 1
    the 		 1
    to 		 1
    work 		 1
    write 		 1
    
    >>> 
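The counts above and the TF-IDF weights printed earlier are linked by the defaults of `TfidfTransformer()` (smooth_idf=True, norm='l2'): each count is multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of texts and df(t) is the number of texts containing term t, and each row is then divided by its Euclidean length. The sketch below recomputes the weights in plain Python from the count matrix shown above; the function name `tfidf_from_counts` is hypothetical, and the code assumes those default settings.

```python
import math

# Count matrix copied from X.toarray() above (4 texts x 19 terms).
counts = [
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1],
]


def tfidf_from_counts(counts):
    """Smoothed idf plus L2 row normalization (assumed TfidfTransformer defaults)."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many texts each term occurs at least once.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


manual = tfidf_from_counts(counts)
print(manual[0][15])  # weight of "to" in the first text; compare with 0.6976... above
print(manual[0][3])   # weight of "china" in the first text; compare with 0.3488... above
```

This also explains why every term in "I love tea and Apple " gets 0.5: all four terms occur once and in no other text, so the row has four equal entries and unit length.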
    

  • Original article: https://www.cnblogs.com/alex-bn-lee/p/10212235.html