Apache Spark Development for Large-Scale Data Processing

Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools, including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.
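As a quick, hedged illustration of that high-level API, the following minimal Scala sketch starts a local session and runs a simple DataFrame query; the app name and the tiny dataset are assumptions made up for this example:

    import org.apache.spark.sql.SparkSession

    // A minimal sketch: start a local session and run a simple DataFrame query.
    // The app name and demo data are illustrative assumptions, not from the README.
    val spark = SparkSession.builder()
      .appName("QuickSketch")
      .master("local[*]")
      .getOrCreate()

    val df = spark.range(0, 10).toDF("id")  // small demo dataset: ids 0..9
    df.filter("id % 2 = 0").show()          // keep and print the even ids

    spark.stop()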

    https://github.com/apache/spark

    https://spark.apache.org/

    Online Documentation

You can find the latest Spark documentation, including a programming guide, on the project web page. This README file contains only basic setup instructions.

    Building Spark

Spark is built using Apache Maven. To build Spark and its example programs, run:

    ./build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)

More detailed documentation is available from the project site, at "Building Spark".

For general development tips, including information on developing Spark using an IDE, see "Useful Developer Tools".

    Interactive Scala Shell

    The easiest way to start using Spark is through the Scala shell:

    ./bin/spark-shell

    Try the following command, which should return 1,000,000,000:

    scala> spark.range(1000 * 1000 * 1000).count()
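The shell exposes the same API as a compiled application, so you can experiment interactively. Here is a small hedged sketch of a follow-up query; the column name and values are invented for illustration:

    scala> val nums = spark.range(1, 6).toDF("n")          // tiny demo DataFrame: 1..5
    scala> nums.selectExpr("n", "n * n AS square").show()  // compute squares and print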

    Interactive Python Shell

    Alternatively, if you prefer Python, you can use the Python shell:

    ./bin/pyspark

    And run the following command, which should also return 1,000,000,000:

    >>> spark.range(1000 * 1000 * 1000).count()

    Spark also comes with several sample programs in the examples directory. To run one of them, use ./bin/run-example <class> [params]. For example:

    ./bin/run-example SparkPi

    will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit them to a cluster. This can be a mesos:// or spark:// URL, "yarn" to run on YARN, "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the examples package. For instance:

    MASTER=spark://host:7077 ./bin/run-example SparkPi

    Many of the example programs print usage help if no params are given.
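As a hedged illustration, SparkPi accepts an optional argument for the number of partitions to sample; the value 100 below is an arbitrary choice for the example:

    MASTER=local[4] ./bin/run-example SparkPi 100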

    Running Tests

    Testing first requires building Spark. Once Spark is built, tests can be run using:

    ./dev/run-tests

    Please see the guidance on how to run tests for a module, or individual tests.
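As one hedged sketch of what scoped test runs can look like (the module and suite names below are illustrative; check the developer docs for the exact commands on your version):

    ./build/mvn -pl core test                          # run only the core module's tests with Maven
    ./build/sbt "core/testOnly *SparkContextSuite"     # run a single suite with SBT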

There is also a Kubernetes integration test; see resource-managers/kubernetes/integration-tests/README.md.


    A Note About Hadoop Versions

    Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.

    Please refer to the build documentation at "Specifying the Hadoop Version and Enabling YARN" for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions.
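A hedged sketch of such a build follows; -Pyarn and -Dhadoop.version are documented build knobs, but the exact version string and profiles you need depend on your cluster, so treat these values as placeholders:

    ./build/mvn -Pyarn -Dhadoop.version=3.2.0 -DskipTests clean package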

    Configuration

    Please refer to the Configuration Guide in the online documentation for an overview on how to configure Spark.
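For example, individual properties can be passed on the command line or persisted in conf/spark-defaults.conf; the property values below are arbitrary illustrations, not recommendations:

    ./bin/spark-shell --conf spark.executor.memory=2g --conf spark.sql.shuffle.partitions=64

or, equivalently, in conf/spark-defaults.conf:

    spark.executor.memory        2g
    spark.sql.shuffle.partitions 64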

    Contributing

    Please review the Contribution to Spark guide for information on how to get started contributing to the project.
