当前位置:Gxlcms > 数据库问题 > 使用Hive SQL窗口函数进行商务数据分析

使用Hive SQL窗口函数进行商务数据分析

时间:2021-07-01 10:21:17 帮助过:7人阅读

本文会从一个商务分析案例入手,说明SQL窗口函数的使用方式。通过本文的5个需求分析,可以看出SQL窗口函数的功能十分强大,不仅能够使我们编写的SQL逻辑更加清晰,而且在某种程度上可以简化需求开发。

数据准备

本文主要分析只涉及一张订单表orders,操作过程在Hive中完成,具体数据如下:

  1. <code class="language-sql">-- 建表
  2. CREATE TABLE orders(
  3. order_id int,
  4. customer_id string,
  5. city string,
  6. add_time string,
  7. amount decimal(10,2));
  8. -- 准备数据
  9. INSERT INTO orders VALUES
  10. (1,"A","上海","2020-01-01 00:00:00.000000",200),
  11. (2,"B","上海","2020-01-05 00:00:00.000000",250),
  12. (3,"C","北京","2020-01-12 00:00:00.000000",200),
  13. (4,"A","上海","2020-02-04 00:00:00.000000",400),
  14. (5,"D","上海","2020-02-05 00:00:00.000000",250),
  15. (5,"D","上海","2020-02-05 12:00:00.000000",300),
  16. (6,"C","北京","2020-02-19 00:00:00.000000",300),
  17. (7,"A","上海","2020-03-01 00:00:00.000000",150),
  18. (8,"E","北京","2020-03-05 00:00:00.000000",500),
  19. (9,"F","上海","2020-03-09 00:00:00.000000",250),
  20. (10,"B","上海","2020-03-21 00:00:00.000000",600);</code>

需求1:收入增长

在业务方面,第m1个月的收入增长计算如下:100 *(m1-m0)/ m0

其中,m1是给定月份的收入,m0是上个月的收入。因此,从技术上讲,我们需要找到每个月的收入,然后以某种方式将每个月的收入与上一个收入相关联,以便进行上述计算。计算当时如下:

  1. <code class="language-sql">WITH
  2. monthly_revenue as (
  3. SELECT
  4. trunc(add_time,‘MM‘) as month,
  5. sum(amount) as revenue
  6. FROM orders
  7. GROUP BY 1
  8. )
  9. ,prev_month_revenue as (
  10. SELECT
  11. month,
  12. revenue,
  13. lag(revenue) over (order by month) as prev_month_revenue -- 上一月收入
  14. FROM monthly_revenue
  15. )
  16. SELECT
  17. month,
  18. revenue,
  19. prev_month_revenue,
  20. round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
  21. FROM prev_month_revenue
  22. ORDER BY 1</code>

结果输出

month revenue prev_month_revenue revenue_growth
2020-01-01 650 NULL NULL
2020-02-01 1250 650 92.3
2020-03-01 1500 1250 20

我们还可以按照按城市分组进行统计,查看某个城市某个月份的收入增长情况

  1. <code class="language-sql">WITH
  2. monthly_revenue as (
  3. SELECT
  4. trunc(add_time,‘MM‘) as month,
  5. city,
  6. sum(amount) as revenue
  7. FROM orders
  8. GROUP BY 1,2
  9. )
  10. ,prev_month_revenue as (
  11. SELECT
  12. month,
  13. city,
  14. revenue,
  15. lag(revenue) over (partition by city order by month) as prev_month_revenue
  16. FROM monthly_revenue
  17. )
  18. SELECT
  19. month,
  20. city,
  21. revenue,
  22. round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
  23. FROM prev_month_revenue
  24. ORDER BY 2,1</code>

结果输出

month city revenue revenue_growth
2020-01-01 上海 450 NULL
2020-02-01 上海 950 111.1
2020-03-01 上海 1000 5.3
2020-01-01 北京 200 NULL
2020-02-01 北京 300 50
2020-03-01 北京 500 66.7

需求2:累计求和

累计汇总,即当前元素和所有先前元素的总和,如下面的SQL:

  1. <code class="language-sql">WITH
  2. monthly_revenue as (
  3. SELECT
  4. trunc(add_time,‘MM‘) as month,
  5. sum(amount) as revenue
  6. FROM orders
  7. GROUP BY 1
  8. )
  9. SELECT
  10. month,
  11. revenue,
  12. sum(revenue) over (order by month rows between unbounded preceding and current row) as running_total
  13. FROM monthly_revenue
  14. ORDER BY 1</code>

结果输出

month revenue running_total
2020-01-01 650 650
2020-02-01 1250 1900
2020-03-01 1500 3400

我们还可以使用下面的组合方式进行分析,SQL如下:

  1. <code class="language-sql">SELECT
  2. order_id,
  3. customer_id,
  4. city,
  5. add_time,
  6. amount,
  7. sum(amount) over () as amount_total, -- 所有数据求和
  8. sum(amount) over (order by order_id rows between unbounded preceding and current row) as running_sum, -- 累计求和
  9. sum(amount) over (partition by customer_id order by add_time rows between unbounded preceding and current row) as running_sum_by_customer,
  10. avg(amount) over (order by add_time rows between 5 preceding and current row) as trailing_avg -- 滚动求平均
  11. FROM orders
  12. ORDER BY 1</code>

结果输出

order_id customer_id city add_time amount amount_total running_sum running_sum_by_customer trailing_avg
1 A 上海 2020-01-01 00:00:00.000000 200 3400 200 200 200
2 B 上海 2020-01-05 00:00:00.000000 250 3400 450 250 225
3 C 北京 2020-01-12 00:00:00.000000 200 3400 650 200 216.666667
4 A 上海 2020-02-04 00:00:00.000000 400 3400 1050 600 262.5
5 D 上海 2020-02-05 00:00:00.000000 250 3400 1300 250 260
5 D 上海 2020-02-05 12:00:00.000000 300 3400 1600 550 266.666667
6 C 北京 2020-02-19 00:00:00.000000 300 3400 1900 500 283.333333
7 A 上海 2020-03-01 00:00:00.000000 150 3400 2050 750 266.666667
8 E 北京 2020-03-05 00:00:00.000000 500 3400 2550 500 316.666667
9 F 上海 2020-03-09 00:00:00.000000 250 3400 2800 250 291.666667
10 B 上海 2020-03-21 00:00:00.000000 600 3400 3400 850

需求3:处理重复数据

从上面的数据可以看出,存在两条重复的数据(5,"D","上海","2020-02-05 00:00:00.000000",250),
(5,"D","上海","2020-02-05 12:00:00.000000",300),
显然需要对其进行清洗去重,保留最新的一条数据,SQL如下:

我们先进行分组排名,然后保留最新的那条数据即可:

  1. <code class="language-sql">SELECT *
  2. FROM (
  3. SELECT *,
  4. row_number() over (partition by order_id order by add_time desc) as rank
  5. FROM orders
  6. ) t
  7. WHERE rank=1</code>

结果输出

t.order_id t.customer_id t.city t.add_time t.amount t.rank
1 A 上海 2020-01-01 00:00:00.000000 200 1
2 B 上海 2020-01-05 00:00:00.000000 250 1
3 C 北京 2020-01-12 00:00:00.000000 200 1
4 A 上海 2020-02-04 00:00:00.000000 400 1
5 D 上海 2020-02-05 12:00:00.000000 300 1
6 C 北京 2020-02-19 00:00:00.000000 300 1
7 A 上海 2020-03-01 00:00:00.000000 150 1
8 E 北京 2020-03-05 00:00:00.000000 500 1
9 F 上海 2020-03-09 00:00:00.000000 250 1
10 B 上海 2020-03-21 00:00:00.000000 600 1

经过上面的清洗过程,对数据进行了去重。重新计算上面的需求1,正确SQL脚本为:

  1. <code class="language-sql">WITH
  2. orders_cleaned as (
  3. SELECT *
  4. FROM (
  5. SELECT *,
  6. row_number() over (partition by order_id order by add_time desc) as rank
  7. FROM orders
  8. )t
  9. WHERE rank=1
  10. )
  11. ,monthly_revenue as (
  12. SELECT
  13. trunc(add_time,‘MM‘) as month,
  14. sum(amount) as revenue
  15. FROM orders_cleaned
  16. GROUP BY 1
  17. )
  18. ,prev_month_revenue as (
  19. SELECT
  20. month,
  21. revenue,
  22. lag(revenue) over (order by month) as prev_month_revenue
  23. FROM monthly_revenue
  24. )
  25. SELECT
  26. month,
  27. revenue,
  28. round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
  29. FROM prev_month_revenue
  30. ORDER BY 1</code>

结果输出

month revenue revenue_growth
2020-01-01 650 NULL
2020-02-01 1000 53.8
2020-03-01 1500 50

将清洗后的数据创建成视图,方便以后使用

  1. <code class="language-sql">CREATE VIEW orders_cleaned AS
  2. SELECT
  3. order_id,
  4. customer_id,
  5. city,
  6. add_time,
  7. amount
  8. FROM (
  9. SELECT *,
  10. row_number() over (partition by order_id order by add_time desc) as rank
  11. FROM orders
  12. )t
  13. WHERE rank=1</code>

需求4:分组取TopN

分组取topN是最长见的SQL窗口函数使用场景,下面的SQL是计算每个月份的top2订单金额,如下:

  1. <code class="language-sql">WITH orders_ranked as (
  2. SELECT
  3. trunc(add_time,‘MM‘) as month,
  4. *,
  5. row_number() over (partition by trunc(add_time,‘MM‘) order by amount desc, add_time) as rank
  6. FROM orders_cleaned
  7. )
  8. SELECT
  9. month,
  10. order_id,
  11. customer_id,
  12. city,
  13. add_time,
  14. amount
  15. FROM orders_ranked
  16. WHERE rank <=2
  17. ORDER BY 1</code>

需求5:重复购买行为

下面的SQL计算重复购买率:重复购买的人数/总人数*100%以及第一笔订单金额与第二笔订单金额之间的典型差额:avg(第二笔订单金额/第一笔订单金额)

  1. <code class="language-sql">WITH customer_orders as (
  2. SELECT *,
  3. row_number() over (partition by customer_id order by add_time) as customer_order_n,
  4. lag(amount) over (partition by customer_id order by add_time) as prev_order_amount
  5. FROM orders_cleaned
  6. )
  7. SELECT
  8. round(100.0*sum(case when customer_order_n=2 then 1 end)/count(distinct customer_id),1) as repeat_purchases,-- 重复购买率
  9. avg(case when customer_order_n=2 then 1.0*amount/prev_order_amount end) as revenue_expansion -- 重复购买较上次购买差异,第一笔订单金额与第二笔订单金额之间的典型差额
  10. FROM customer_orders</code>

结果输出

WITH结果输出:

orders_cleaned.order_id orders_cleaned.customer_id orders_cleaned.city orders_cleaned.add_time orders_cleaned.amount customer_order_n prev_order_amount
1 A 上海 2020-01-01 00:00:00.000000 200 1 NULL
4 A 上海 2020-02-04 00:00:00.000000 400 2 200
7 A 上海 2020-03-01 00:00:00.000000 150 3 400
2 B 上海 2020-01-05 00:00:00.000000 250 1 NULL
10 B 上海 2020-03-21 00:00:00.000000 600 2 250
3 C 北京 2020-01-12 00:00:00.000000 200 1 NULL
6 C 北京 2020-02-19 00:00:00.000000 300 2 200
5 D 上海 2020-02-05 12:00:00.000000 300 1 NULL
8 E 北京 2020-03-05 00:00:00.000000 500 1 NULL
9 F 上海 2020-03-09 00:00:00.000000 250

最终结果输出:

repeat_purchases revenue_expansion
50 1.9666666666666668

总结

本文主要分享了SQL窗口函数的基本使用方式以及使用场景,并结合了具体的分析案例。通过本文的分析案例,可以加深对SQL窗口函数的理解。

公众号『大数据技术与数仓』,回复『资料』领取大数据资料包

使用Hive SQL窗口函数进行商务数据分析

标签:结果   技术   操作   lin   输出   简化   大数据   需求开发   相关   

人气教程排行