Distributed NLTK with execnet



    Want to speed up your natural language processing with NLTK? Have a lot of files to process, but don't know how to distribute NLTK across many cores?

    Well, here's how you can use execnet to do distributed part of speech tagging with NLTK.

    execnet

execnet is a simple library for creating a network of gateways and channels that you can use for distributed computation in Python. With it, you can start Python shells over ssh, send code and/or data, then receive results. Below are two scripts that will test the accuracy of NLTK's recommended part of speech tagger against every file in the brown corpus. The first script (the runner) does all the setup and receives the results, while the second script (the remote module) runs on every gateway, calculating and sending the accuracy of each file it receives for processing.
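If you've never used execnet, here's a minimal sketch of the gateway/channel pattern before diving into the full scripts. It uses a local popen gateway instead of ssh, just to keep it self-contained:

    import execnet

    # open a gateway to a fresh python interpreter on the local machine
    gw = execnet.makegateway()
    # run a snippet remotely; inside it, execnet provides the `channel` global
    channel = gw.remote_exec("channel.send(channel.receive() + 1)")
    channel.send(41)
    print channel.receive()  # prints 42
    gw.exit()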

    Runner

    The runner does the following:

    1. Defines the hosts and number of gateways. I recommend 1 gateway per core per host.
    2. Loads and pickles the default NLTK part of speech tagger.
    3. Opens each gateway and creates a remote execution channel with the tag_files module (the remote module covered below).
4. Sends the pickled tagger and the name of a corpus (brown) through the channel.
    5. Once all the channels have been created and initialized, it then sends all of the fileids in the corpus to alternating channels to distribute the work.
    6. Finally, it creates a receive queue and prints the accuracy response from each channel.

    run_tag_files.py

import execnet
import nltk.tag, nltk.data
import cPickle as pickle
import tag_files

# hosts to distribute to, and how many gateways (1 per core) to open on each
HOSTS = {
    'localhost': 2
}

# nice level for the remote python processes
NICE = 20

channels = []

# load NLTK's default part of speech tagger and pickle it for sending
tagger = pickle.dumps(nltk.data.load(nltk.tag._POS_TAGGER))

for host, count in HOSTS.items():
    print 'opening %d gateways at %s' % (count, host)

    for i in range(count):
        gw = execnet.makegateway('ssh=%s//nice=%d' % (host, NICE))
        channel = gw.remote_exec(tag_files)
        channels.append(channel)
        # initialize the remote module with the tagger and corpus name
        channel.send(tagger)
        channel.send('brown')

count = 0
chan = 0

# distribute the fileids round-robin across the channels
for fileid in nltk.corpus.brown.fileids():
    print 'sending %s to channel %d' % (fileid, chan)
    channels[chan].send(fileid)
    count += 1
    # alternate channels
    chan += 1
    if chan >= len(channels): chan = 0

# collect and print one accuracy response per fileid sent
multi = execnet.MultiChannel(channels)
queue = multi.make_receive_queue()

for i in range(count):
    channel, response = queue.get()
    print response

    Remote Module

    The remote module is much simpler.

    1. Receives and unpickles the tagger.
    2. Receives the corpus name and loads it.
    3. For each fileid received, evaluates the accuracy of the tagger on the tagged sentences and sends an accuracy response.

    tag_files.py

import nltk.corpus
import cPickle as pickle

# execnet runs this module with __name__ == '__channelexec__' and
# provides the `channel` global
if __name__ == '__channelexec__':
    # the first two messages are the pickled tagger and the corpus name
    tagger = pickle.loads(channel.receive())
    corpus_name = channel.receive()
    corpus = getattr(nltk.corpus, corpus_name)

    # each subsequent message is a fileid to evaluate
    for fileid in channel:
        accuracy = tagger.evaluate(corpus.tagged_sents(fileids=[fileid]))
        channel.send('%s: %f' % (fileid, accuracy))

    Putting it all together

    Make sure you have NLTK and the corpus data installed on every host. You must also have passwordless ssh access to each host from the master host (the machine you run run_tag_files.py on).
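As a quick sanity check (this isn't part of the original scripts), you can verify passwordless ssh and a working NLTK install on each host with something like:

    import subprocess

    # the same hosts as in run_tag_files.py
    HOSTS = ['localhost']

    for host in HOSTS:
        # BatchMode=yes makes ssh fail instead of prompting for a password
        cmd = ['ssh', '-o', 'BatchMode=yes', host,
               'python -c "import nltk; print nltk.__version__"']
        print host, subprocess.check_output(cmd).strip()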

run_tag_files.py and tag_files.py only need to be on the master host; execnet will take care of distributing the code. Assuming run_tag_files.py and tag_files.py are in the same directory, all you need to do is run python run_tag_files.py. You should get a message about opening gateways, followed by a bunch of send messages. Then just wait and watch the accuracy responses to see how accurate the built-in part of speech tagger is on the brown corpus.

If you'd like to test the accuracy on a different corpus, make sure every host has the corpus data, then send that corpus name instead of brown, and send the fileids from the new corpus, as in the sketch below.
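For example, here's a hypothetical tweak to the runner using the treebank corpus (assuming the channels list from run_tag_files.py; any tagged corpus under nltk.corpus would work):

    import nltk.corpus

    CORPUS_NAME = 'treebank'

    for channel in channels:
        # instead of channel.send('brown')
        channel.send(CORPUS_NAME)

    # and iterate these fileids instead of nltk.corpus.brown.fileids()
    fileids = getattr(nltk.corpus, CORPUS_NAME).fileids()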

    If you want to test your own tagger, pickle it to a file, then load and send it instead of NLTK's tagger. Or you can train it on the master first, then send it once training is complete.
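For instance, here's a minimal sketch of the train-on-the-master approach, with a simple unigram tagger standing in for whatever tagger you'd actually build:

    import cPickle as pickle
    import nltk.corpus
    from nltk.tag import UnigramTagger

    # train on part of the brown corpus, then pickle for channel.send()
    train_sents = nltk.corpus.brown.tagged_sents()[:5000]
    tagger = pickle.dumps(UnigramTagger(train_sents))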

    Distributed File Processing

In practice, it's often a PITA to make sure every host has every file you want to process, and you'll want to process files outside of NLTK's built-in corpora. My recommendation is to set up a GlusterFS storage cluster so that every host has a common mount point with access to every file that you want to process. If every host has the same mount point, you can send any file path to any channel for processing.
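For example, here's a hypothetical variant of tag_files.py that tags raw text files from the shared mount point instead of evaluating corpus fileids (the sentence and word tokenizers are stand-ins for whatever processing you actually need):

    import cPickle as pickle
    import nltk

    if __name__ == '__channelexec__':
        tagger = pickle.loads(channel.receive())

        # each message is a path on the common mount point,
        # readable by every host
        for path in channel:
            text = open(path).read()
            count = 0
            for sent in nltk.sent_tokenize(text):
                count += len(tagger.tag(nltk.word_tokenize(sent)))
            channel.send('%s: %d tokens tagged' % (path, count))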
