  • Generating a word cloud of job requirements with Python

    Picking up where the previous post left off: I scraped the big-data job postings listed at http://www.17bigdata.com/jobs/.

    # -*- coding: utf-8 -*-
    """
    Created on Thu Aug 10 07:57:56 2017

    @author: lenovo
    """

    import os

    from wordcloud import WordCloud
    import pandas as pd
    import matplotlib.pyplot as plt
    import jieba

    def cloud(root, name, stopwords):
        """Build a word cloud from one text file and save it as a PNG beside it."""
        filepath = os.path.join(root, name)
        with open(filepath, 'r', encoding='utf-8') as f:
            txt = f.read()
        # Segment the text into words with jieba
        words = list(jieba.cut(txt))
        # Count occurrences of each word, most frequent first
        counts = pd.Series(words).value_counts()
        # Drop every word listed in the 'stopword' column of the stop-word table
        counts = counts[~counts.index.isin(stopwords['stopword'])]
        # A font with Chinese glyphs (SimHei here) is needed to render the words
        wordcloud = WordCloud(font_path=r'E:\Python\machine learning\simhei.ttf', background_color='black')
        wordcloud.fit_words(counts.to_dict())
        plt.imshow(wordcloud)
        pngfile = os.path.join(root, os.path.splitext(name)[0] + '.png')
        wordcloud.to_file(pngfile)

    # Load a custom dictionary for jieba's segmenter
    jieba.load_userdict(r'E:\Python\machine learning\NLP\stopwords.txt')
    # The stop-word file needs a header row that names its column 'stopword'
    stopwords = pd.read_csv(r'E:\Python\machine learning\StopwordsCN.txt', encoding='utf-8', index_col=False)
    # Walk the folder of scraped postings and process every .txt file
    for root, dirs, files in os.walk(r'E:\职位信息'):
        for name in files:
            if name.endswith('.txt'):
                print(name)
                cloud(root, name, stopwords)
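
    The counting and stop-word filtering inside cloud() can be tried on its own, without any files or fonts. The following is only a minimal sketch using the standard library: word_frequencies and the sample tokens are hypothetical and stand in for the output of jieba.cut. The result is a plain word-to-count dict, which is the shape that WordCloud.fit_words accepts.

```python
from collections import Counter


def word_frequencies(tokens, stopwords):
    """Count tokens, leaving out anything that is in the stop-word list."""
    stop = set(stopwords)
    return dict(Counter(t for t in tokens if t not in stop))


# Hypothetical tokens standing in for list(jieba.cut(txt)).
tokens = ["熟悉", "算法", "和", "Hadoop", ",", "有", "算法", "经验"]
freqs = word_frequencies(tokens, stopwords=["和", "有", ","])
print(freqs)
```

    Filtering before counting gives the same frequencies as counting first and dropping the stop words afterwards, which is the order the script uses.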

    The word clouds produced by the script are shown in the figure:

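    As the figure shows, some noise words were not removed, for example 相关 ("related") and 以上学历 ("degree or above"), which carry no useful information. I had meant to identify stop words by document frequency (DF), but I did not plan for that when scraping, and the number of records is small anyway, so I did not go on to build a stop-word list specific to job postings. Even so, the cloud makes it clear that algorithms and experience matter a lot. Keep at it!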
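
    The DF idea mentioned above was not implemented in this post, so the following is only a sketch of one way it could work, not the author's method. It assumes each posting is kept as a separate list of tokens (for example the output of jieba.cut); document_frequency, stopword_candidates, min_ratio and the sample postings are all hypothetical names and data made up for illustration.

```python
from collections import Counter


def document_frequency(docs):
    """Return {token: fraction of documents that contain the token}."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each token at most once per document
    n_docs = len(docs)
    return {token: count / n_docs for token, count in df.items()}


def stopword_candidates(docs, min_ratio=0.8):
    """Tokens that appear in at least `min_ratio` of the documents."""
    df = document_frequency(docs)
    return {token for token, ratio in df.items() if ratio >= min_ratio}


# Hypothetical, already-segmented postings (invented sample data).
postings = [
    ["相关", "专业", "本科", "以上学历", "算法", "经验"],
    ["相关", "工作", "经验", "以上学历", "Hadoop"],
    ["相关", "专业", "以上学历", "Spark", "算法"],
    ["熟悉", "Python", "相关", "经验", "以上学历"],
]

candidates = stopword_candidates(postings, min_ratio=1.0)
print(sorted(candidates))
```

    A DF threshold only yields candidates. A genuinely informative word that appears in almost every posting would be flagged as well, so the list still needs a manual check before it is used for filtering.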

  • Original article: https://www.cnblogs.com/chenyaling/p/7338416.html