【慕课网实战】六、以慕课网日志分析为例进入大数据 Spark SQL 的世界 - 走看看

zoukankan html css js c++ java

【慕课网实战】六、以慕课网日志分析为例进入大数据 Spark SQL 的世界

DataFrame它不是Spark SQL提出的，而是早起在R、Pandas语言就已经有了的。

A Dataset is a distributed collection of data：分布式的数据集

A DataFrame is a Dataset organized into named columns.

以列（列名、列的类型、列值）的形式构成的分布式数据集，按照列赋予不同的名称

student

id:int

name:string

city:string

It is conceptually equivalent to a table in a relational database

or a data frame in R/Python

RDD：

    java/scala  ==> jvm

    python ==> python runtime

DataFrame:

    java/scala/python ==> Logic Plan

DataFrame和RDD互操作的两种方式：

1）反射：case class   前提：事先需要知道你的字段、字段类型

2）编程：Row          如果第一种情况不能满足你的要求（事先不知道列）

3) 选型：优先考虑第一种

val rdd = spark.sparkContext.textFile("file:///home/hadoop/data/student.data")

DataFrame = Dataset[Row]

Dataset：强类型  typed  case class

DataFrame：弱类型   Row

SQL:

    seletc name from person;  compile  ok, result no

DF:

    df.select("name")  compile no

    df.select("nname")  compile ok

DS:

    ds.map(line => line.itemid)  compile no

查看全文

相关阅读:
《安富莱嵌入式周报》第222期：2021.07.19--2021.07.25
嵌入式新闻早班车-第14期
 状态压缩动态规划【DP】
Spring事务
 设计模式--组合模式
 设计模式--状态模式
 设计模式--中介者模式
 设计模式--责任链模式
 设计模式--享元模式
 设计模式--委派模式

原文地址：https://www.cnblogs.com/kkxwz/p/8493732.html

Copyright © 2011-2022 走看看