
Spark SQL Overview


val dataframe = sqlContext.read.json("people.json")
dataframe.filter("salary > 1000").show()
// Limitation: the bad column name is caught only at runtime, with
// org.apache.spark.sql.AnalysisException: cannot resolve 'salary' given input columns age, name;

// Create RDD[Person]
val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
// Create DataFrame from an RDD[Person]
val personDF = sqlContext.createDataFrame(personRDD)
// We get back RDD[Row] and not RDD[Person]
personDF.rdd
// Limitation: after converting RDD -> DataFrame -> RDD, the static type information is lost

Note: for a fuller comparison of Spark RDD, DataFrame, and Dataset, search online; a minimal sketch is given below.
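To make the distinction concrete, here is a minimal sketch (assuming the Spark 1.6-era SQLContext API and a hypothetical Person case class) contrasting the three abstractions:

case class Person(name: String, age: Int)
import sqlContext.implicits._ // brings implicit Encoders into scope

// RDD[Person]: typed objects, but no schema and no Catalyst optimization
val rdd = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
// DataFrame: schema + Catalyst optimizer, but rows are untyped (Row)
val df = sqlContext.createDataFrame(rdd)
// Dataset[Person]: schema + Catalyst optimizer + compile-time types
val ds = sqlContext.createDataset(rdd)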

5. Dataset

  Dataset extends the DataFrame API, providing compile-time type safety and an object-oriented style of API.

case class Person(name: String, age: Int)
val dataframe = sqlContext.read.json("people.json")
val ds: Dataset[Person] = dataframe.as[Person]
// Compute histogram of age by name
val hist = ds.groupBy(_.name).mapGroups {
  case (name, people) =>
    val buckets = new Array[Int](10)
    people.map(_.age).foreach { a =>
      buckets(a / 10) += 1
    }
    (name, buckets)
}

  Dataset API

    Type safety: operations act directly on domain objects

// Create RDD[Person]
val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
// Create Dataset from an RDD (needs an implicit Encoder[Person])
import sqlContext.implicits._
val personDS = sqlContext.createDataset(personRDD)
// We get back RDD[Person], not RDD[Row] as with a DataFrame
personDS.rdd

    Efficiency: code-generated encoders make serialization more efficient
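As a rough illustration (again assuming the Spark 1.6-era API), the Encoder for a case class is resolved implicitly and compiles serialization code specific to that class:

import sqlContext.implicits._ // materializes Encoder[Person] from the case class

// The generated encoder converts Person objects to and from Spark's compact
// internal (Tungsten) binary format, avoiding generic Java/Kryo serialization
val personDS = sqlContext.createDataset(Seq(Person("A", 10), Person("B", 20)))
personDS.filter(_.age > 15).show()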

    Interoperability: a Dataset and a DataFrame can be converted into each other
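A minimal sketch of the round trip (reusing the Person case class and people.json from above):

import sqlContext.implicits._

val df = sqlContext.read.json("people.json") // DataFrame
val ds: Dataset[Person] = df.as[Person]      // DataFrame -> Dataset
val df2: DataFrame = ds.toDF()               // Dataset -> DataFrame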

  Compile-time type checking

case class Person(name: String, age: Long)
val dataframe = sqlContext.read.json("people.json")
val ds: Dataset[Person] = dataframe.as[Person]
ds.filter(p => p.age > 25)
ds.filter(p => p.salary > 12500) // error: value salary is not a member of Person

 
