zoukankan      html  css  js  c++  java
  • Conquer Big Data through Spark

    Course Background:

    Apache Spark™ is a fast and general engine for large-scale data processing.Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. You can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.:

     

    Spark powers a stack of high-level tools including Spark SQL,MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application:

     

    You can run Spark readily using its standalone cluster mode, on EC2, or run it on Hadoop YARN or Apache Mesos. It can read from HDFSHBaseCassandra, and any Hadoop data source:

     

    Write applications quickly in Java, Scala or Python.Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala and Python shells.

    Apache Spark has seen phenomenal adoption, being widely slated as the successor to Hadoop MapReduce, and being deployed in clusters from a handful to thousands of nodes.

    In the past few years ,Databricks, with the help of the Spark community, has contributed many improvements to Apache Spark to improve its performance, stability, and scalability. This enabled Databricks to use Apache Spark to sort 100 TB of data on 206 machines in 23 minutes, which is 3X faster than the previous Hadoop 100TB result on 2100 machines. Similarly, Databricks sorted 1 PB of data on 190 machines in less than 4 hours, which is over 4X faster than the previous Hadoop 1PB result on 3800 machines.

     

     

     Spark is fulfilling its promise to serve as a faster and more scalable engine for data processing of all sizes. Spark enables equally dramatic improvements in time and cost for all Big Data users.

     

    Course Introduction:

    This course almost covers everything for Application Developer to build diverse Spark applications to fulfill all kinds of business requirements: Architecture of Spark、the programming model in Spark、internals of Spark、Spark SQLMLlibGraphXSpark Streaming、Testing、Tuning、Spark on Yarn、JobServer and SparkR.

    Additional,this course also covers the very necessary skills you need to write Scala code in Spark, to help whom is not familiar with Scala.

     

    Who Needs to Attend

    Anyone who is interested in Big Data Development;

    Hadoop Developer;

    Other Big Data Developer;

     

    Prerequisites

       Be familiar with the basics of object-oriented programming;

    Course Outline

    Day 1

    Class 1:The architecture of Spark

    1 Ecosystem of Spark

    2 Design of Spark

    3 RDD

    4  Fault-tolerance in Spark

    Class 2Programming with Scala

    1 Classes and Objects in Scala

    2 Funtional Object

    3 Traits

    4 Case class and Pattern Matching

    5 Collections

    6 Implicit Conversions and Parameters

    7 Actors and Concurrency

    Class 3:Spark Programming Model

    1 RDD

    2 transformation

    3 action

    4 lineage

    5 Dependency

    Class 4:Spark Internals

    1 Spark Cluster

    2 Job Scheduling

    3 DAGScheduler

    4 TaskScheduler

    5 Task Internal

     

    TIME

    CONTENT

    Note

     

     

     

     

    Day 2

    Class 5:Broadcasts and Accumulators

    1 Broadcast Internal

    2  Best practice in Broadcast

    3 Accumulators Internal

    4 Best practice in Accumulators

    Class 6:Action in programming Spark

    1 Data Source:File、HDFS、HBase、S3;

    2 IDEA

    3 Maven

    4 sbt.

    5 Code

    6 Deployment

    Class 7:Deep in Spark Driver

    1 The Secret of SparkContext

    2 The Secret of  SparkConf

    4 The Secret of  SparkEnv

    Class 8:Deep in RDD

    1 DAG

    2 Scala RDD Function

    3 Spark Java RDD Function

    4 RDD Tuning

     

    TIME

    CONTENT

    NOTE

    Day 3

    Class 9:Machine Learning on Spark

    1 LinearRegression

    2 K-Means

    3 Collaborative Filtering

    Class 10: Graph Computation on Spark

    1 Table Operators

    2 Graph Operators

    3 GraphX Algorithms

    Class 11: Spark SQL

    1 Parquet、JSON、JDBC

    2 DSL

    3 SQL on RDD

    Class 12:Spark Streaming

    1 DStream

    2transformation

    3 checkpoint

    4 Tuning

    TIME

    CONTENT

    NOTE

    Day 4

    Class 13:Spark on Yarn

    1 Internals of Spark on Yarn

    2 Best practice of Spark on Yarn

    Class 14:JobServer

    1 Restful Architecture of JobServer

    2 JobServer APIs

    3 Best Practice of JobServer

    Class 15:SparkR

    1 Programming in R

    2 R on Spark

    3 Internals of SparkR

    4 SparkR API

    Class 16:Spark Tuing

    1 Logs

    2 Concurency

    3 Memory

    4 GC

    5 Serializers

    6 Safety

    7 14s cases of Tuning

  • 相关阅读:
    ionic serve 报【ionic-app-scripts' 不是内部或外部命令 】问题解
    Angular4.x+Ionic3 踩坑之路之打包时出现JAVASCRIPT HEAP OUT OF MEMORY的几种解决办法
    webpack打包---报错内存溢出javaScript heap out of memory
    IDEA版本控制工具VCS中使用Git,以及快捷键总结(不使用命令)
    详解Intellij IDEA中.properties文件中文显示乱码问题的解决
    PostgreSQL判断一个表是否存在
    关于Hadoop结合RDBMS应用的一些思考
    hadoop的hdfs文件操作实现上传文件到hdfs
    用hdfs存储海量的视频数据的设计思路
    hbase+hive应用场景
  • 原文地址:https://www.cnblogs.com/spark-hadoop/p/4183414.html
Copyright © 2011-2022 走看看