zoukankan html css js c++ java

学习随笔 --SparkStreaming WordCount Python实现

# -*- coding:utf-8 -*-
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# StreamingContext 流功能的主要入口点
# 创建一个具有两个执行线程的本地StreamingContext，批处理间隔为1秒
#SparkStreaming 中local后必须为大于等于2的数字【即至少2条线程】。因为receiver 占了一个不断循环接收数据
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)
# 创建一个DStream来表示来自TCP源的流数据，指定为主机名（例如localhost）和端口（例如9999）
lines = ssc.socketTextStream("localhost", 9999)
# lines（DStream）表示将从数据服务器接收的数据流。此流中的每条记录都是一行文本。然后用空格分割为单词
#flatMap是一个DStream操作，通过从源DStream中的每个记录生成多个新记录来创建新的DStream
#DStream是RDD产生的模板，在Spark Streaming发生计算前，其实质是把每个Batch的DStream的操作翻译成为了RDD操作
words = lines.flatMap(lambda line: line.split(" "))
# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()
ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate

使用Netcat（在大多数类Unix系统中找到的小实用程序）作为数据服务器运行

$ nc -lk 9999

启动示例

$ ./bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999

查看全文

相关阅读:
Find and Grep
Device trees, Overlays and Parameters of Raspberry Pi
Ubuntu终端常用的快捷键
 ARM Linux 3.x的设备树（Device Tree）
Linux GPIO子系统
 Linux Pin Control 子系统
 JavaScript运算符优先级
 JavaScript中的严格模式
 小程序云开发调用HTTP请求中got第三方库使用失败解决方法
 小程序生命周期函数

原文地址：https://www.cnblogs.com/ToDoNow/p/9555733.html