当前位置:Gxlcms > 数据库问题 > Spark SQL内置函数

Spark SQL内置函数

时间:2021-07-01 10:21:17 帮助过:14人阅读

平常在使用mysql的时候,我们在写SQL的时候会使用到MySQL为我们提供的一些内置函数,如数值函数:求绝对值abs()、平方根sqrt()等,还有其它的字符函数、日期函数、聚合函数等等。使我们利用这些内置函数能够快速实现我们的业务逻辑。在SparkSQL里其实也为我们提供了近两百多种内置函数,我们通过

  1. <span style="font-size: 15px">import org.apache.spark.sql.functions._</span>

导入内置函数包,来使用。也可以在SQL语句中直接使用。SparkSQL内置函数分类:聚合函数、集合函数、日期函数、数学函数、混杂函数、非聚合函数、排序函数、字符串函数、UDF函数和窗口函数这10类函数。

1 内置函数的使用
使用内置函数的方式有两种,一种是通过编程的方式的使用,另一种是通过SQL的方式使用。

例如:我们有如下数据,想要使用SparkSQL内置函数lower()来将名字全部转为小写

  1. <span style="font-size: 15px">+----+---+-----------+
  2. |name|age| phone|
  3. +----+---+-----------+
  4. |Ming| <span style="color: #800080">20</span>|<span style="color: #800080">15552211521</span>|
  5. |hong| <span style="color: #800080">19</span>|<span style="color: #800080">13287994007</span>|
  6. | zhi| <span style="color: #800080">21</span>|<span style="color: #800080">15552211523</span>|
  7. +----+---+-----------+</span>

以编程的方式使用内置函数

  1. <span style="font-size: 15px"><span style="color: #000000">import org.apache.spark.sql.functions._
  2. df.</span><span style="color: #0000ff">select</span>(lower(col(<span style="color: #800000">"</span><span style="color: #800000">name</span><span style="color: #800000">"</span>)).<span style="color: #0000ff">as</span>(<span style="color: #800000">"</span><span style="color: #800000">name</span><span style="color: #800000">"</span>), col(<span style="color: #800000">"</span><span style="color: #800000">age</span><span style="color: #800000">"</span>), col(<span style="color: #800000">"</span><span style="color: #800000">phone</span><span style="color: #800000">"</span>)).show()</span>

以SQL的方式使用

  1. <span style="font-size: 15px">df.createOrReplaceTempView(<span style="color: #800000">"</span><span style="color: #800000">people</span><span style="color: #800000">"</span><span style="color: #000000">)
  2. spark.sql(</span><span style="color: #800000">"</span><span style="color: #800000">select lower(name) as name,age,phone from people</span><span style="color: #800000">"</span>).show()</span>

2 UDF函数的使用
有的时候,SparkSQL提供的内置函数无法满足我们的业务的时候,我们可以使用过UDF函数来自定义我们的实现逻辑。例如:需要对上面的数据添加一列id,要求id的生成是name+随机生成的uuid+phone。这时候我们可以使用UDF自定义函数实现。如下所示:

  1. <span style="font-size: 15px"><span style="color: #008000">//</span><span style="color: #008000">根据name和phone生成组合,并加上一段uud生成唯一表示id</span>
  2. def idGenerator(name: String, phone: Long): String =<span style="color: #000000"> {
  3. name </span>+ <span style="color: #800000">"</span><span style="color: #800000">-</span><span style="color: #800000">"</span> + UUID.randomUUID().toString + <span style="color: #800000">"</span><span style="color: #800000">-</span><span style="color: #800000">"</span> +<span style="color: #000000"> phone.toString
  4. }
  5. </span><span style="color: #008000">//</span><span style="color: #008000">生成udf函数</span>
  6. val idGeneratorUDF =<span style="color: #000000"> udf(idGenerator _)
  7. </span><span style="color: #008000">//</span><span style="color: #008000">加入隐式转换</span>
  8. <span style="color: #000000">import spark.implicits._
  9. df.withColumn(</span><span style="color: #800000">"</span><span style="color: #800000">id</span><span style="color: #800000">"</span>, idGeneratorUDF($<span style="color: #800000">"</span><span style="color: #800000">name</span><span style="color: #800000">"</span>, $<span style="color: #800000">"</span><span style="color: #800000">phone</span><span style="color: #800000">"</span>)).show()</span>

也可以这样写:

  1. <span style="font-size: 15px"><span style="color: #008000">//</span><span style="color: #008000">加入隐式转换</span>
  2. <span style="color: #000000">import spark.implicits._
  3. </span><span style="color: #008000">//</span><span style="color: #008000">根据name和phone生成组合,并加上一段uud生成唯一表示id</span>
  4. def idGenerator(name: String, phone: Long): String =<span style="color: #000000"> {
  5. name </span>+ <span style="color: #800000">"</span><span style="color: #800000">-</span><span style="color: #800000">"</span> + UUID.randomUUID().toString + <span style="color: #800000">"</span><span style="color: #800000">-</span><span style="color: #800000">"</span> +<span style="color: #000000"> phone.toString
  6. }
  7. </span><span style="color: #008000">//</span><span style="color: #008000">注册udf函数</span>
  8. spark.udf.register(<span style="color: #800000">"</span><span style="color: #800000">idGenerator</span><span style="color: #800000">"</span><span style="color: #000000">,idGenerator _)
  9. </span><span style="color: #008000">//</span><span style="color: #008000">使用idGenerator</span>
  10. df.withColumn(<span style="color: #800000">"</span><span style="color: #800000">id</span><span style="color: #800000">"</span>,callUDF(<span style="color: #800000">"</span><span style="color: #800000">idGenerator</span><span style="color: #800000">"</span>,$<span style="color: #800000">"</span><span style="color: #800000">name</span><span style="color: #800000">"</span>,$<span style="color: #800000">"</span><span style="color: #800000">phone</span><span style="color: #800000">"</span>)).show()</span>

结果都是一样的:

  1. <span style="font-size: 15px">+----+---+-----------+--------------------+
  2. |name|age| phone| id|
  3. +----+---+-----------+--------------------+
  4. |Ming| <span style="color: #800080">20</span>|<span style="color: #800080">15552211521</span>|Ming-9b87d4d5-91d...|
  5. |hong| <span style="color: #800080">19</span>|<span style="color: #800080">13287994007</span>|hong-7a91f7d8-66a...|
  6. | zhi| <span style="color: #800080">21</span>|<span style="color: #800080">15552211523</span>|zhi-f005859c-<span style="color: #800080">4516</span>...|
  7. +----+---+-----------+--------------------+</span>

同样,我们可以将我们自定义的UDF函数注册到SparkSQL里,然后用SQL实现

  1. <span style="font-size: 15px"><span style="color: #008000">//</span><span style="color: #008000">将自定义函数注册到SparkSQL里</span>
  2. spark.udf.register(<span style="color: #800000">"</span><span style="color: #800000">idGeneratorUDF</span><span style="color: #800000">"</span><span style="color: #000000">,idGeneratorUDF)
  3. </span><span style="color: #008000">//</span><span style="color: #008000">创建临时表</span>
  4. df.createOrReplaceTempView(<span style="color: #800000">"</span><span style="color: #800000">people</span><span style="color: #800000">"</span><span style="color: #000000">)
  5. </span><span style="color: #008000">//</span><span style="color: #008000">使用sql查询</span>
  6. spark.sql(<span style="color: #800000">"</span><span style="color: #800000">select idGeneratorUDF(name,phone) as id,name,age,phone from people</span><span style="color: #800000">"</span>).show()</span>

注意:上面加入import spark.implicits._隐式转换是为了方便使用$”列名”来代替col(“列名”)

完整代码:

  1. <span style="font-size: 15px"><span style="color: #000000">import java.util.UUID
  2. import org.apache.spark.sql.SparkSession
  3. </span><span style="color: #008000">/*</span><span style="color: #008000">*
  4. * spark sql 内置函数
  5. </span><span style="color: #008000">*/</span>
  6. <span style="color: #0000ff">object</span><span style="color: #000000"> SparkSQLFunctionApp {
  7. def main(args: Array[String]): Unit </span>=<span style="color: #000000"> {
  8. val spark </span>= SparkSession.builder().appName(<span style="color: #0000ff">this</span>.getClass.getSimpleName).master(<span style="color: #800000">"</span><span style="color: #800000">local</span><span style="color: #800000">"</span><span style="color: #000000">).getOrCreate()
  9. import org.apache.spark.sql.functions._
  10. </span><span style="color: #008000">//</span><span style="color: #008000">加入隐式转换: 本例子里可以使用toDF方法和$"列名"代替col("列名")</span>
  11. <span style="color: #000000"> import spark.implicits._
  12. val df </span>= Seq((<span style="color: #800000">"</span><span style="color: #800000">Ming</span><span style="color: #800000">"</span>, <span style="color: #800080">20</span>, <span style="color: #800080">15552211521L</span>), (<span style="color: #800000">"</span><span style="color: #800000">hong</span><span style="color: #800000">"</span>, <span style="color: #800080">19</span>, <span style="color: #800080">13287994007L</span>), (<span style="color: #800000">"</span><span style="color: #800000">zhi</span><span style="color: #800000">"</span>, <span style="color: #800080">21</span>, <span style="color: #800080">15552211523L</span>)).toDF(<span style="color: #800000">"</span><span style="color: #800000">name</span><span style="color: #800000">"</span>, <span style="color: #800000">"</span><span style="color: #800000">age</span><span style="color: #800000">"</span>, <span style="color: #800000">"</span><span style="color: #800000">phone</span><span style="color: #800000">"</span><span style="color: #000000">)
  13. df.show()
  14. </span><span style="color: #008000">/*</span><span style="color: #008000">*
  15. * +----+---+-----------+
  16. * |name|age| phone|
  17. * +----+---+-----------+
  18. * |Ming| 20|15552211521|
  19. * |hong| 19|13287994007|
  20. * | zhi| 21|15552211523|
  21. * +----+---+-----------+
  22. </span><span style="color: #008000">*/</span>
  23. <span style="color: #008000">//</span><span style="color: #008000">1 使用内置函数将所有名字都转为小写
  24. </span><span style="color: #008000">//</span><span style="color: #008000">1.1 编程的方式:</span>
  25. df.<span style="color: #0000ff">select</span>(lower($<span style="color: #800000">"</span><span style="color: #800000">name</span><span style="color: #800000">"</span>).<span style="color: #0000ff">as</span>(<span style="color: #800000">"</span><span style="color: #800000">name</span><span style="color: #800000">"</span>), $<span style="color: #800000">"</span><span style="color: #800000">age</span><span style="color: #800000">"</span>, $<span style="color: #800000">"</span><span style="color: #800000">phone</span><span style="color: #800000">"</span><span style="color: #000000">).show()
  26. </span><span style="color: #008000">/*</span><span style="color: #008000">*
  27. * +----+---+-----------+
  28. * |name|age| phone|
  29. * +----+---+-----------+
  30. * |ming| 20|15552211521|
  31. * |hong| 19|13287994007|
  32. * | zhi| 21|15552211523|
  33. * +----+---+-----------+
  34. </span><span style="color: #008000">*/</span>
  35. <span style="color: #008000">//</span><span style="color: #008000">1.2 SQL的方式
  36. </span><span style="color: #008000">//</span><span style="color: #008000">注册表</span>
  37. df.createOrReplaceTempView(<span style="color: #800000">"</span><span style="color: #800000">people</span><span style="color: #800000">"</span><span style="color: #000000">)
  38. spark.sql(</span><span style="color: #800000">"</span><span style="color: #800000">select lower(name) as name,age,phone from people</span><span style="color: #800000">"</span><span style="color: #000000">).show()
  39. </span><span style="color: #008000">/*</span><span style="color: #008000">*
  40. * +----+---+-----------+
  41. * |name|age| phone|
  42. * +----+---+-----------+
  43. * |ming| 20|15552211521|
  44. * |hong| 19|13287994007|
  45. * | zhi| 21|15552211523|
  46. * +----+---+-----------+
  47. </span><span style="color: #008000">*/</span>
  48. <span style="color: #008000">//</span><span style="color: #008000">2 UDF函数的使用
  49. </span><span style="color: #008000">//</span><span style="color: #008000">2.1 直接使用
  50. </span><span style="color: #008000">//</span><span style="color: #008000">根据name和phone生成组合,并加上一段uud生成唯一表示id</span>
  51. def idGenerator(name: String, phone: Long): String =<span style="color: #000000"> {
  52. name </span>+ <span style="color: #800000">"</span><span style="color: #800000">-</span><span style="color: #800000">"</span> + UUID.randomUUID().toString + <span style="color: #800000">"</span><span style="color: #800000">-</span><span style="color: #800000">"</span> +<span style="color: #000000"> phone.toString
  53. }
  54. </span><span style="color: #008000">//</span><span style="color: #008000">生成udf函数</span>
  55. val idGeneratorUDF =<span style="color: #000000"> udf(idGenerator _)
  56. df.withColumn(</span><span style="color: #800000">"</span><span style="color: #800000">id</span><span style="color: #800000">"</span>, idGeneratorUDF($<span style="color: #800000">"</span><span style="color: #800000">name</span><span style="color: #800000">"</span>, $<span style="color: #800000">"</span><span style="color: #800000">phone</span><span style="color: #800000">"</span><span style="color: #000000">)).show()
  57. </span><span style="color: #008000">/*</span><span style="color: #008000">*
  58. * +----+---+-----------+--------------------+
  59. * |name|age| phone| id|
  60. * +----+---+-----------+--------------------+
  61. * |Ming| 20|15552211521|Ming-74338e40-548...|
  62. * |hong| 19|13287994007|hong-4f058f2b-9d3...|
  63. * | zhi| 21|15552211523|zhi-f42bea86-a9cf...|
  64. * +----+---+-----------+--------------------+
  65. </span><span style="color: #008000">*/</span>
  66. <span style="color: #008000">//</span><span style="color: #008000">将自定义函数注册到SparkSQL里</span>
  67. spark.udf.register(<span style="color: #800000">"</span><span style="color: #800000">idGeneratorUDF</span><span style="color: #800000">"</span><span style="color: #000000">, idGeneratorUDF)
  68. </span><span style="color: #008000">//</span><span style="color: #008000">创建临时表</span>
  69. df.createOrReplaceTempView(<span style="color: #800000">"</span><span style="color: #800000">people</span><span style="color: #800000">"</span><span style="color: #000000">)
  70. </span><span style="color: #008000">//</span><span style="color: #008000">使用sql查询</span>
  71. spark.sql(<span style="color: #800000">"</span><span style="color: #800000">select idGeneratorUDF(name,phone) as id,name,age,phone from people</span><span style="color: #800000">"</span><span style="color: #000000">).show()
  72. </span><span style="color: #008000">/*</span><span style="color: #008000">*
  73. * +----+---+-----------+--------------------+
  74. * |name|age| phone| id|
  75. * +----+---+-----------+--------------------+
  76. * |Ming| 20|15552211521|Ming-74338e40-548...|
  77. * |hong| 19|13287994007|hong-4f058f2b-9d3...|
  78. * | zhi| 21|15552211523|zhi-f42bea86-a9cf...|
  79. * +----+---+-----------+--------------------+
  80. </span><span style="color: #008000">*/</span>
  81. <span style="color: #008000">//</span><span style="color: #008000">2.2 通过callUDF使用
  82. </span><span style="color: #008000">//</span><span style="color: #008000">注册udf函数</span>
  83. spark.udf.register(<span style="color: #800000">"</span><span style="color: #800000">idGenerator</span><span style="color: #800000">"</span><span style="color: #000000">, idGenerator _)
  84. </span><span style="color: #008000">//</span><span style="color: #008000">使用idGenerator</span>
  85. df.withColumn(<span style="color: #800000">"</span><span style="color: #800000">id</span><span style="color: #800000">"</span>, callUDF(<span style="color: #800000">"</span><span style="color: #800000">idGenerator</span><span style="color: #800000">"</span>, $<span style="color: #800000">"</span><span style="color: #800000">name</span><span style="color: #800000">"</span>, $<span style="color: #800000">"</span><span style="color: #800000">phone</span><span style="color: #800000">"</span><span style="color: #000000">)).show()
  86. </span><span style="color: #008000">/*</span><span style="color: #008000">*
  87. * +----+---+-----------+--------------------+
  88. * |name|age| phone| id|
  89. * +----+---+-----------+--------------------+
  90. * |Ming| 20|15552211521|Ming-74338e40-548...|
  91. * |hong| 19|13287994007|hong-4f058f2b-9d3...|
  92. * | zhi| 21|15552211523|zhi-f42bea86-a9cf...|
  93. * +----+---+-----------+--------------------+
  94. </span><span style="color: #008000">*/</span>
  95. <span style="color: #008000">//</span><span style="color: #008000">创建临时表</span>
  96. df.createOrReplaceTempView(<span style="color: #800000">"</span><span style="color: #800000">people</span><span style="color: #800000">"</span><span style="color: #000000">)
  97. </span><span style="color: #008000">//</span><span style="color: #008000">使用sql查询</span>
  98. spark.sql(<span style="color: #800000">"</span><span style="color: #800000">select idGenerator(name,phone) as id,name,age,phone from people</span><span style="color: #800000">"</span><span style="color: #000000">).show()
  99. </span><span style="color: #008000">/*</span><span style="color: #008000">*
  100. * +--------------------+----+---+-----------+
  101. * | id|name|age| phone|
  102. * +--------------------+----+---+-----------+
  103. * |Ming-d4236bac-e21...|Ming| 20|15552211521|
  104. * |hong-bff84c0d-67d...|hong| 19|13287994007|
  105. * |zhi-aa0174b0-c8b3...| zhi| 21|15552211523|
  106. * +--------------------+----+---+-----------+
  107. </span><span style="color: #008000">*/</span><span style="color: #000000">
  108. }
  109. }</span></span>

 

Spark SQL内置函数

标签:代码   组合   排序   targe   show   快速   bsp   getc   要求   

人气教程排行