  • Parallel word segmentation with jieba

     Enabling the parallel mode built into jieba does not give real parallelism, so joblib is used here to run the segmentation in parallel.

    The source file is tab-separated and has 4 columns: id, title, content, and label.

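    import os

    import jieba
    import pandas as pd
    import yaml
    from joblib import Parallel, delayed

    # config.yaml must define 'train_cut', the path of the segmented output file.
    with open('config.yaml', 'r', encoding='utf-8') as f:
        config = yaml.safe_load(f)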
    
    
    def read_df(trainfile):
        data = pd.read_csv(trainfile, sep='\t', header=None, nrows=60000,
                           encoding='utf-8', names=['id', 'title', 'content', 'label'])
        return data
    
    
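    def word_cut(row):
        # row is one record: [id, title, content, label].
        # str() guards against non-string values such as numeric ids or labels.
        line = '\t'.join([str(row[0]),
                          ' '.join(jieba.cut(str(row[1]))),
                          ' '.join(jieba.cut(str(row[2]))),
                          str(row[3])])
        # Every worker appends to the same file, so emit the whole line in one write.
        with open(config['train_cut'], 'a+', encoding='utf-8') as f:
            f.write(line + '\n')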
    
    
    def applyParallel(content, func, n_thread):
        with Parallel(n_jobs=n_thread) as parallel:
            parallel(delayed(func)(c) for c in content)
    
    
    def main():
        overwrite = True
        if overwrite:
            if os.path.exists(config['train_cut']):
                os.remove(config['train_cut'])
    
        trainfile = 'data/train_fusai.tsv'
        df = read_df(trainfile)
        content = df.values
        # 22 worker processes; adjust to the number of CPU cores available.
        applyParallel(content, word_cut, 22)


    if __name__ == '__main__':
        main()
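
Every worker in the script above appends to the same output file, so the order of the output lines is not guaranteed. One possible variation, sketched here under assumptions, has each worker return its segmented line and lets the parent process write the file once. The helper names `jieba_tokens`, `cut_row`, `segment_rows`, and `write_lines`, the `tokenize` parameter, and the default `n_jobs=4` are hypothetical and do not come from the original post.

```python
from joblib import Parallel, delayed


def jieba_tokens(text):
    """Default tokenizer: segment text with jieba (imported lazily)."""
    import jieba
    return jieba.cut(text)


def cut_row(row, tokenize=jieba_tokens):
    """Build one tab-separated output line from a 4-column row."""
    doc_id, title, content, label = row
    return '\t'.join([
        str(doc_id),
        ' '.join(tokenize(str(title))),
        ' '.join(tokenize(str(content))),
        str(label),
    ])


def segment_rows(rows, n_jobs=4, tokenize=jieba_tokens):
    """Segment rows in parallel and return the lines in input order."""
    return Parallel(n_jobs=n_jobs)(
        delayed(cut_row)(row, tokenize) for row in rows
    )


def write_lines(lines, path):
    """Write all lines from the parent process only."""
    with open(path, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(line + '\n')
```

With the `df` and `config` from the script above, this could be called as `write_lines(segment_rows(df.values, n_jobs=22), config['train_cut'])`. joblib returns results in the order of the input, so the output keeps the row order of the source file, and opening the file with `'w'` makes the separate overwrite check unnecessary.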
  • Original article: https://www.cnblogs.com/zle1992/p/8967644.html