Note: for the differences between Spark RDD, DataFrame, and Dataset, please search online on your own.
Dataset extends the DataFrame API, providing compile-time type safety and an object-oriented style of API.
import org.apache.spark.sql.Dataset
import sqlContext.implicits._   // brings in the encoders needed by .as[Person]

case class Person(name: String, age: Int)

val dataframe = sqlContext.read.json("people.json")
val ds: Dataset[Person] = dataframe.as[Person]

// Compute histogram of age by name
val hist = ds.groupBy(_.name).mapGroups {
  case (name, people) =>
    val buckets = new Array[Int](10)
    people.map(_.age).foreach { a => buckets(a / 10) += 1 }
    (name, buckets)
}
Dataset API
Type safety: operations work directly on domain objects
// Create RDD[Person]
val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))

// Create Dataset from an RDD (uses the implicit encoder for Person)
val personDS = sqlContext.createDataset(personRDD)

personDS.rdd   // we get back RDD[Person], not the RDD[Row] a DataFrame would give
Efficiency: code-generated encoders make serialization more efficient
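A minimal sketch of what the encoder machinery looks like from user code (assuming the Person case class and sqlContext from the example above): with sqlContext.implicits._ in scope, an Encoder[Person] is resolved implicitly and maps objects to Spark's compact binary row format, while Encoders.kryo falls back to generic Kryo serialization without the generated fast path.

import org.apache.spark.sql.{Encoder, Encoders}
import sqlContext.implicits._

// Implicitly resolved, code-generated encoder for the case class
val personEncoder: Encoder[Person] = implicitly[Encoder[Person]]
println(personEncoder.schema)   // the flat schema the encoder maps Person to

// Generic Kryo-based encoder: works for arbitrary classes,
// but without the generated serialization code
val kryoEncoder: Encoder[Person] = Encoders.kryo[Person]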
Interoperability: Datasets and DataFrames can be converted to each other
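A short sketch of the conversions in both directions (again assuming the Person case class, sqlContext, and people.json from the examples above): as[Person] turns an untyped DataFrame into a typed Dataset, and toDF() converts it back.

import org.apache.spark.sql.{DataFrame, Dataset}
import sqlContext.implicits._

val df = sqlContext.read.json("people.json")   // untyped DataFrame
val typed: Dataset[Person] = df.as[Person]     // DataFrame -> Dataset[Person]
val untyped: DataFrame = typed.toDF()          // Dataset[Person] -> DataFrame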
Compile-time type checking
case class Person(name: String, age: Long)

val dataframe = sqlContext.read.json("people.json")
val ds: Dataset[Person] = dataframe.as[Person]
ds.filter(p => p.age > 25)
ds.filter(p => p.salary > 12500)   // error: value salary is not a member of Person
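For contrast, the equivalent untyped DataFrame query compiles even when the column does not exist; assuming people.json has no salary column, the mistake only shows up at runtime when Spark analyzes the query.

dataframe.filter("age > 25")
dataframe.filter("salary > 12500")   // compiles, but fails only at runtime
                                     // with an analysis error (unresolved column)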