  • Generating a word cloud of job requirements with Python

    Following up on the previous post, I scraped big-data-related job postings from http://www.17bigdata.com/jobs/.

    # -*- coding: utf-8 -*-
    """
    Created on Thu Aug 10 07:57:56 2017
    
    @author: lenovo
    """
    
    from wordcloud import WordCloud
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import jieba
    import os
    
    def cloud(root, name, stopwords):
        # Read one job-posting text file
        filepath = os.path.join(root, name)
        with open(filepath, 'r', encoding='utf-8') as f:
            txt = f.read()
        # Segment the Chinese text with jieba
        words = list(jieba.cut(txt))
        df = pd.DataFrame({'words': words})
        # Count how often each word appears, most frequent first
        s = df.groupby('words')['words'].agg(np.size).sort_values(ascending=False)
        # Drop stopwords and turn the counts into a {word: frequency} dict
        freqs = s[~s.index.isin(stopwords['stopword'])].to_dict()
        # A Chinese font is required, otherwise the words render as empty boxes
        wordcloud = WordCloud(font_path=r'E:\Python\machine learning\simhei.ttf', background_color='black')
        wordcloud.fit_words(freqs)
        plt.imshow(wordcloud)
        # Save the cloud next to the source text file
        pngfile = os.path.join(root, name.split('.')[0] + '.png')
        wordcloud.to_file(pngfile)
    
    jieba.load_userdict(r'E:\Python\machine learning\NLPstopwords.txt')
    stopwords = pd.read_csv(r'E:\Python\machine learning\StopwordsCN.txt', encoding='utf-8', index_col=False)
    for root, dirs, files in os.walk(r'E:\职位信息'):
        for name in files:
            if name.split('.')[-1] == 'txt':
                print(name)
                cloud(root, name, stopwords)
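
    For reference, the frequency counting and stopword filtering in the script above can also be done without pandas, using collections.Counter together with WordCloud.generate_from_frequencies (which fit_words aliases in recent wordcloud releases). The function name, font path and output path below are illustrative, not part of the original script:

    # Minimal alternative sketch using collections.Counter instead of pandas.
    from collections import Counter
    
    import jieba
    from wordcloud import WordCloud
    
    def cloud_counter(txt, stopwords, font_path, out_png):
        # Count segmented words, skipping whitespace tokens and stopwords
        freqs = Counter(w for w in jieba.cut(txt)
                        if w.strip() and w not in stopwords)
        wc = WordCloud(font_path=font_path, background_color='black')
        # generate_from_frequencies takes a {word: count} mapping;
        # fit_words used above is simply an alias for it
        wc.generate_from_frequencies(freqs)
        wc.to_file(out_png)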

    The word cloud generated by the script above looks like this:

    You can see that some noise words were not filtered out, such as meaningless terms like 相关 ("related") and 以上学历 ("degree or above"). I had meant to identify stopwords by document frequency (DF), but I didn't account for this while scraping, and since the number of records is small anyway, I didn't go hunting for a stopword list tailored to job postings. It also shows that algorithms and experience both matter a great deal. Keep at it!
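
    The DF idea mentioned above could be sketched roughly like this: a word that appears in nearly every posting (such as 相关 or 以上学历) carries little information and can be treated as a stopword. A minimal sketch, assuming each posting is stored as its own .txt file; the function name and the 0.8 threshold are illustrative:

    # Rough sketch of document-frequency (DF) based stopword detection.
    # df_stopwords and max_df are illustrative names, not from the original post.
    import os
    from collections import Counter
    
    import jieba
    
    def df_stopwords(folder, max_df=0.8):
        texts = []
        for root, dirs, files in os.walk(folder):
            for name in files:
                if name.endswith('.txt'):
                    with open(os.path.join(root, name), 'r', encoding='utf-8') as f:
                        texts.append(f.read())
        doc_freq = Counter()
        for txt in texts:
            # Count each word at most once per posting
            doc_freq.update(set(jieba.cut(txt)))
        n_docs = len(texts)
        # Words present in more than max_df of all postings become stopwords
        return {w for w, c in doc_freq.items() if c / n_docs > max_df}

    Words flagged this way could simply be appended to the stopword list that the main script already filters against.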

  • Original post: https://www.cnblogs.com/chenyaling/p/7338416.html