
Python Data Mining: Word Clouds

Drawing a Word Cloud

1. Building the corpus, recording where each segment came from, removing stop words, and counting word frequencies

Usage: os.path.join(path, name)  # joins a directory with a file or directory name; the result is path/name, using the platform's path separator
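
As a small illustration of the call above, the following sketch joins path components and collects file paths with os.walk. The directory and file names (Sample, C000007, 10.txt) and the helper list_files are made up for the example and are not taken from the course material.

```python
import os


def list_files(root_dir):
    """Return the full path of every file under root_dir, including subdirectories."""
    paths = []
    for root, dirs, files in os.walk(root_dir):
        for name in files:
            # root is the directory being visited, name is a file inside it
            paths.append(os.path.join(root, name))
    return sorted(paths)


# Hypothetical path components; join() inserts the platform's separator between them
joined = os.path.join("Sample", "C000007", "10.txt")
print(joined)
```

Because os.path.join picks the separator for the current platform, the same code produces backslash-separated paths on Windows and slash-separated paths elsewhere.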

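import os
import os.path
import codecs

# Walk the sample directory and read every file into memory.
# The path is a raw string so that backslashes such as "\2" are not read as escape sequences.
filePaths=[]
fileContents=[]
for root,dirs,files in os.walk(r"D:\Python\Python数据挖掘\Python数据挖掘实战课程课件\2.4\SogouC.mini\Sample"):
    for name in files:
        filePath=os.path.join(root,name)
        filePaths.append(filePath)
        # The with block closes the file even if read() raises
        with codecs.open(filePath,"r","utf-8") as f:
            fileContent=f.read()
        fileContents.append(fileContent)

# One row per document: its path and its full text
import pandas
corpos=pandas.DataFrame({
                         "filePath":filePaths,
                         "fileContent":fileContents})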

# Segment each document and record which file every segment came from
import jieba

segments=[]
filePaths=[]
for index,row in corpos.iterrows():
    filePath=row["filePath"]
    fileContent=row["fileContent"]
    segs=jieba.cut(fileContent)
    for seg in segs:
        segments.append(seg)
        filePaths.append(filePath)

segmentDataFrame=pandas.DataFrame({
                                   "segment":segments,
                                   "filePath":filePaths})


import numpy
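# Count word frequencies.
# by is the column to group on; size() counts the rows in each group.
# The count column is named "计数" ("count").
# DataFrame.sort() and dict-based renaming in agg() have been removed from pandas,
# so the count is built with size() and ordered with sort_values().
segStat=segmentDataFrame.groupby(
            by="segment"
            ).size().reset_index(
            name="计数"
            ).sort_values(by="计数",
            ascending=False)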

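# Remove stop words.
# The file is expected to start with a header line "stopword", which becomes the column name.
stopwords=pandas.read_csv(
    r"D:\Python\Python数据挖掘\Python数据挖掘实战课程课件\2.4\StopwordsCN.txt",
    encoding="utf-8",
    index_col=False)
fSegStat=segStat[
        ~segStat.segment.isin(stopwords.stopword)]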


# A second way to remove stop words: filter them out during segmentation
import jieba
# A set makes each membership test constant-time
stopwordSet=set(stopwords.stopword.values)
segments=[]
filePaths=[]
for index,row in corpos.iterrows():
    filePath=row["filePath"]
    fileContent=row["fileContent"]
    segs=jieba.cut(fileContent)
    for seg in segs:
        # Skip stop words and whitespace-only segments
        if seg not in stopwordSet and len(seg.strip())>0:
            segments.append(seg)
            filePaths.append(filePath)

segmentDataFrame=pandas.DataFrame({
        "segment":segments,
        "filePath":filePaths})
# Same frequency count as before, using size() and sort_values()
segStat=segmentDataFrame.groupby(
                    by="segment"
                    ).size().reset_index(
                    name="计数"
                    ).sort_values(
                        by="计数",
                        ascending=False)
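
The counting and filtering steps above can also be sketched without pandas. This is a minimal illustration that uses collections.Counter from the standard library rather than the groupby approach in the article; the token list and stop-word set are invented sample data, and count_words is a hypothetical helper name.

```python
from collections import Counter


def count_words(tokens, stop_words):
    """Count tokens, skipping stop words and whitespace-only tokens."""
    kept = [t for t in tokens if t.strip() and t not in stop_words]
    # most_common() returns (word, count) pairs sorted by descending count
    return Counter(kept).most_common()


# Invented sample data standing in for the output of jieba.cut()
tokens = ["数据", "的", "挖掘", "数据", " ", "的", "词云", "数据"]
stop_words = {"的"}

word_counts = count_words(tokens, stop_words)
print(word_counts)
```

For a corpus that fits in memory either approach works; the pandas version keeps the file path next to each segment, which this sketch leaves out.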


2. Drawing the word cloud

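First import WordCloud, then import the pyplot module from the plotting library matplotlib.

The background color and the font are usually set first, through the background_color and font_path parameters. A font that contains Chinese glyphs, such as simhei.ttf, is needed for Chinese words to render.

The word frequencies are passed as a dictionary. Use the segments as the index and the counts as the column, convert the result to a dictionary that maps each word to its count, and pass that dictionary to fit_words.

The image is displayed with the pyplot function imshow().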
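
The dictionary shape described above can be sketched with a tiny hand-made table. The three words and their counts are invented sample data; the column names segment and 计数 ("count") follow the code in section 1.

```python
import pandas

# Invented sample frequencies; in the article this table comes from the groupby step
freq = pandas.DataFrame({
    "segment": ["数据", "挖掘", "词云"],
    "计数": [3, 1, 1],
})

# set_index() makes each word a row label, so to_dict() yields
# {"计数": {word: count, ...}}; the inner dict is what fit_words() takes
words = freq.set_index("segment").to_dict()
frequencies = words["计数"]
print(frequencies)
```

The outer key is the column name, which is why the article indexes the result with words["计数"] before handing it to fit_words.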

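from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Raw string keeps the backslashes in the Windows path literal
wordcloud =WordCloud(
    font_path=r"D:\Python\Python数据挖掘\Python数据挖掘实战课程课件\2.4\simhei.ttf",
    background_color="black"
    )

# to_dict() returns {"计数": {word: count, ...}}
words=fSegStat.set_index("segment").to_dict()

wordcloud.fit_words(words["计数"])
plt.imshow(wordcloud)
plt.axis("off")  # hide the axes around the image
plt.show()       # display the figure
plt.close()      # release the figure afterwards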


Original article: https://www.cnblogs.com/U940634/p/9736009.html