Big Data SQL Performance Optimization Tips

Terry
2023-11-21 / 0 comments / 0 likes / 108 views / 5,184 words

Resource Optimization Tips

Compute

I. Engine: choose the optimal engine. Spark is recommended; on average it saves about 60% compared with Hive.
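For example, on platforms that support Hive on Spark, the engine can be switched per session; this is only a sketch, and the exact mechanism (Hive on Spark, spark-sql, or a Spark Thrift Server) depends on your cluster:

```sql
-- Hive on Spark, where the distribution supports it:
set hive.execution.engine=spark;  -- commonly defaults to mr or tez

-- Alternatively, submit the same HiveQL through spark-sql or a Spark
-- Thrift Server session, which usually requires no query changes.
```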

II. Caching: make the most of broadcast/caching (e.g. mapjoin); it can improve performance severalfold.
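As an illustration, both engines accept optimizer hints to broadcast the small side of a join (the table names below are hypothetical); Spark also broadcasts automatically when the small side is under spark.sql.autoBroadcastJoinThreshold:

```sql
-- Spark SQL: ask the optimizer to broadcast the small dimension table
select /*+ BROADCAST(d) */ f.key, d.name
from fact_table f join dim_table d on f.key = d.key;

-- Hive: the equivalent mapjoin hint
select /*+ MAPJOIN(d) */ f.key, d.name
from fact_table f join dim_table d on f.key = d.key;
```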

III. Ways to optimize upstream tables to cut downstream consumption:

1. Convert text tables to ORC, which suits downstream jobs that read only a few columns

2. Remove redundant dimensions to raise the aggregation level and cut the volume of stored data

3. Add appropriate partition columns, moving toward precise scans

4. Build cumulative tables and monthly tables for downstream jobs that scan the full history
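Points 1 to 3 above can be sketched together as follows (table and column names are invented for illustration):

```sql
-- store as ORC, partitioned by date, with unneeded dimensions dropped
create table play_log_orc (
  userid bigint,
  scid   bigint,
  pv     bigint
)
partitioned by (dt string)  -- lets downstream jobs scan only the days they need
stored as orc;

insert overwrite table play_log_orc partition (dt)
select userid, scid, sum(pv) as pv, dt  -- pre-aggregate to raise the aggregation level
from play_log_text
group by userid, scid, dt;
```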

IV. Reduce the amount of data involved in computation:

1. Never leave out partition conditions, which causes useless or full-table scans

2. Predicate pushdown: cut the data volume at the earliest opportunity before computing (e.g. push where conditions as far into the inner layers as possible)

3. Aggregate first, then compute (e.g. group by before join, group by before grouping sets)

4. Pick upstream tables with a higher aggregation level so less data is read

5. Prune rollup rules: prefer grouping sets > rollup > cube, and trim the group by dimensions and the number of grouping sets; every extra dimension multiplies the cost exponentially, and every extra rollup doubles it
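To illustrate point 5 (column names invented): a cube over three columns produces all 8 dimension combinations, whereas listing only the combinations actually consumed can cut the work to a fraction:

```sql
-- all 8 combinations of (dt, city, channel):
select dt, city, channel, count(1) pv
from play_log
group by dt, city, channel with cube;

-- only the 2 combinations that are actually needed:
select dt, city, channel, count(1) pv
from play_log
group by dt, city, channel
grouping sets ((dt, city), (dt, channel));
```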

V. Optimize the SQL logic:

1. A join whose join key is unique can often be rewritten as union all; especially good for a chain of joins or when you need the rows that fail to match (see Case 1)

2. When the same table is joined several times only because the join conditions differ, consider merging them into one join; especially good when the repeated joins are expensive and there is no data skew (see Case 2)

3. Using row_number to take the first or last row can often be replaced by a plain max or min over a concatenated string, avoiding row_number's order by (see Case 3)

4. Multiple distincts over the same column perform poorly; rewrite as group by first, then sum (see Case 4)

5. Reuse repeated computation; across multiple jobs, consider materializing an intermediate table

6. Reduce repeated scans; when several sub-datasets are needed, cache the superset once and apply the different filters to it (see Case 5)

7. Write special expressions that skip computation for useless data; especially good when a lot of useless data takes part in the computation (see Case 6)

8. Three ways to resolve data skew: drop the skewed value, split hot values into groups, or pivot rows to columns, join, then unpivot (see Cases 7, 8, 9)

9. The string-matrix technique: apply efficient regular-expression logic (see Case 10)

10. Dynamic grouping instead of a Cartesian product (see Case 11)

11. For a big table joining a big table, consider a pre-join that shrinks one side, turning it into a big table joining a small table; especially good when one side mostly fails to match (see Case 12)

12. For a big table A joining a big table B, consider joining A against C (B's changes versus the previous day) instead, turning it into a big table joining a small table, then merging the result into the previous day's result table; especially good when one big table must scan the full history while the other is fairly static (see Case 13)

13. Excessive nesting can make the engine plan extra steps and raise the cost; consider flattening the nesting (see Case 14)

14. Solution for very large sorts: use Spark, build a suitable auxiliary sort column, and compute percentiles; use the percentiles to split the data into n groups, sort within each group, then add each group's cumulative count. The key step is arriving at suitable percentiles: feed percentile data that is evenly distributed, adequately precise, and moderately duplicated. The higher the duplicate rate, the better the percentile performance but the worse the group balance; the lower the duplicate rate, the worse the percentile performance but the better the group balance. Tune the duplicate rate to the sweet spot between grouping performance and balance; if the data is not naturally like that, transform it mathematically. If the volume is too large, sample first (see Case 15)

Warehouse Storage

1. A column with many repeated values stores much more compactly as ORC; with no repeats, text and ORC differ little in size, but ORC always beats text on access speed

2. Reorder an ORC table's rows so repeated values sit next to each other; this strengthens compression, trading sort compute for storage space

3. Design partitions sensibly, not too fine-grained; otherwise you create masses of small files and metadata, and scan costs rise as well
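Point 2 can be sketched as a sorted rewrite (names invented); clustering identical values lets ORC's run-length and dictionary encodings bite harder:

```sql
insert overwrite table play_log_orc partition (dt='2023-11-20')
select userid, scid, pv
from play_log_orc
where dt='2023-11-20'
distribute by scid sort by scid;  -- identical scid values end up adjacent
```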

Cases
Case 1: unique join key, a chain of joins, taking rows that fail to match c

Before: select a.key,a.a1,b.b1,c.c1 from a left join b on a.key=b.key left join c on a.key=c.key where c.key is null
After: select key,max(a1) a1,max(b1) b1,max(c1) c1
from (
 select key,a1,null b1,null c1,1 flag_a,0 flag_b,0 flag_c from a --tag the source
union all
 select key,null a1,b1,null c1,0 flag_a,1 flag_b,0 flag_c from b --tag the source
union all
 select key,null a1,null b1,c1,0 flag_a,0 flag_b,1 flag_c from c --tag the source
) x
group by key having max(flag_a)>0 and max(flag_c)=0 --use max over the flags to judge the join outcome

Case 2: tables a and b are each read twice

Before: select a.key,a.a1,'1' type from a left join b on a.key=b.key and b.b1>1
union all 
select a.key,a.a1,'2' type from a left join b on a.key=b.key and b.b1>2
After: select key,a1,type
from (
 select a.key,a.a1,if(b.b1>2,'1,2','1') types from a left join b on a.key=b.key and b.b1>1 --tag the rows
) x lateral view explode(split(types,',')) xs as type --lateral view explode turns 1 row into many

Case 3: taking row_number=1

Before: select userid,scid
from (
 select userid,scid,row_number() over(partition by userid order by pv desc) rn from a
) x where rn=1
After: select userid,split(max(concat(lpad(pv,10,'0'),'@',scid)),'@')[1] scid from a group by userid --left-padding the number with zeros aligns the digits so string ordering matches numeric ordering

Case 4: multiple distincts on the same column

Before: select count(distinct userid) uv,count(distinct if(条件1,userid,null)) uv1,count(distinct if(条件2,userid,null)) uv2 from a
After: select count(1) uv,sum(flag1) uv1,sum(flag2) uv2
from (
 select userid,max(if(条件1,1,0)) flag1,max(if(条件2,1,0)) flag2 from a group by userid --tag each userid
) x

Case 5: table c is read twice

Before: select a.key,a.col,'1' type from a left join c on a.key=c.key and c.c1=1
union all 
select b.key,b.col,'2' type from b left join c on b.key=c.key and c.c2=1
After: cache table v1 as (select key,c1,c2 from c where (c.c1=1 or c.c2=1)); --cache the superset of useful rows
select a.key,a.col,'1' type from a left join v1 on a.key=v1.key and v1.c1=1
union all 
select b.key,b.col,'2' type from b left join v1 on b.key=v1.key and v1.c2=1

Case 6: userid=0 can be excluded from the computation

Before: select userid,count(distinct scid) song_cnt from 播放流水 group by userid
After: select userid,count(distinct scid) song_cnt from 播放流水 where userid<>0 group by userid --deliberately keep userid=0 out of the distinct computation

Case 7: userid=0 is skewed and can be dropped

Before: select a.userid,a.a1,b.b1 from a left join b on a.userid=b.userid
After: select a.userid,a.a1,b.b1 from a left join b on a.userid=b.userid where a.userid<>0 --drop the hot value userid=0

Case 8: userid=0 is skewed and cannot be dropped

Before: select a.userid,a.a1,b.b1 from a left join b on a.userid=b.userid
After: select if(a.userid like '解决倾斜_%',split(a.userid,'_')[1],a.userid) userid,a.a1,b.b1 --merge the random groups back together
from (
 select if(userid<>0,userid,concat('解决倾斜_',userid,'_',int(rand()*10))) userid,a1 from a --split the hot value into 10 random groups
) a left join (
 select userid,b1 from b where userid<>0
union all
 select concat('解决倾斜_',userid,'_',grp) userid,b1 from (select userid,b1,'0,1,2,3,4,5,6,7,8,9' grps from b where userid=0) b lateral view explode(split(grps,',')) xs as grp --expand the right table into the same 10 groups so the join loses no rows
) b on a.userid=b.userid

Case 9: collaboration songs make deriving singerid from scid fan out badly, causing severe data skew

Before: select a.userid,b.singerid from 播放流水 a left join 歌曲歌手关系表 b on a.scid=b.scid
After:

select userid,singerid
from (
 select a.userid,b.singerid_list
 from 播放流水 a left join (
  select scid,concat_ws(',',collect_list(cast(singerid as string))) singerid_list from 歌曲歌手关系表 group by scid --pack all singers of a collaboration song into one singerid_list
 ) b on a.scid=b.scid
) x lateral view explode(split(singerid_list,',')) xs as singerid --explode singerid_list only after the join

Case 10: n-day retention (users active n days ago who are also active within the last 7 days; n=14,30,60,90,180)

Before: --run once for every n
select count(1)
from (select userid from 活跃流水 where dt='ndays_ago') a
inner join (select userid from 活跃流水 where dt between '7days_ago' and 'today' group by userid) b on a.userid=b.userid
After:

--Step 1 runs only once, and all values of n share a single run of step 2. As long as step 1 is rich enough, it can support many future step-2 rules, e.g. storing a comma-separated activity level instead of a 0/1 flag.
Step 1: build a cumulative history table whose active_his column records each userid's daily activity, 0 for inactive, 1 for active, with the rightmost digit being the current day:
dt userid active_his
2023-01-10 1 1111111111111111111111111111111111111111
2023-01-10 2 1100000110000011000001100000110000011000
2023-01-10 3 101
insert overwrite table active_his_table partition (dt='today')
select userid,concat(max(active_his),max(today_flag)) active_his --string concatenation (truncate at some maximum length to avoid unbounded growth)
from (
 select userid,active_his,'0' today_flag from active_his_table where dt='yesterday' --yesterday's cumulative result
union all
 select userid,'' active_his,'1' today_flag from 活跃流水 where dt='today' --today's increment
) x group by userid

Step 2: match with a regular expression. This is extremely fast, and even fairly demanding logic can be expressed as a powerful regex. When a new requirement arrives, adding one regex produces the numbers immediately:
select count(1) from active_his_table where dt='today' and active_his rlike '1.{23}(?=0*1).{7}$' --(users active 30 days ago who are also active within the last 7 days)
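The same pattern generalizes to other values of n; a sketch, assuming (as above) that the rightmost character is today, so the leading '1' must sit n+1 characters from the end:

```sql
-- n = 14: 1 + 7 + 7 = 15 characters, so the '1' is 14 days before today
select count(1) from active_his_table
where dt='today' and active_his rlike '1.{7}(?=0*1).{7}$';

-- n = 60: 1 + 53 + 7 = 61 characters
select count(1) from active_his_table
where dt='today' and active_his rlike '1.{53}(?=0*1).{7}$';
```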

Case 11: credential stuffing. In the login log, a failed attempt whose device differs from the device of the most recent success counts as credential stuffing; count how many times each account is attacked

Before: --produces something close to a Cartesian product; the cost is huge and it may never finish
select userid,count(1)
from (
 select userid,mid,lst,lead(lst) over(partition by userid order by lst) next_lst from 登录流水 where status='success'
) a left outer join (
 select userid,mid,lst from 登录流水 where status='fail'
) b on a.userid=b.userid
where (b.lst between a.lst and a.next_lst or (a.next_lst is null and b.lst>a.lst)) and a.mid<>b.mid
group by userid
After: select userid,sum(if(status='fail' and mid<>success_mid,1,0)) fail_cnt
from (
 select userid,group_id,mid,lst,status,max(if(status='success',mid,null)) over(partition by userid,group_id) success_mid --within each dynamic group, use a window function to pick out the success device
 from (
  select userid,mid,lst,status,sum(if(status='success',1,0)) over(partition by userid order by lst ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) group_id from 登录流水 --tag rows: success=1, fail=0; the running sum forms the dynamic group_id
 ) x
 where group_id>0
) x
group by userid
--if userid+group_id itself skews, combine this with the skew techniques and subdivide group_id further by time while still keeping the same success device per group; the implementation is fairly involved, see the production job for reference

Case 12: big table join big table: play log join the song-library dimension table

Before: select a.*,b.* from 播放明细表 a left join 曲库维表 b on a.scid=b.scid
After: cache table c1 as (select scid from 播放明细表 group by scid); --pin down the set of scid actually needed
cache table c2 as (select a.* from 曲库维表 a inner join c1 b on a.scid=b.scid); --shrink the dimension table down to c2; conveniently, c1 can be broadcast
select a.*,b.* from 播放明细表 a left join c2 b on a.scid=b.scid --conveniently, c2 can be broadcast

Case 13: big table join big table: every day, join the full history log against the latest user profile

Before: select b.年龄段,count(1) pv from 历史流水 a inner join 用户画像 b on a.userid=b.userid and b.dt='today' group by b.年龄段
After:

Step 1: pin down the dimension table's changes between the two days
cache table c1 as (select userid,年龄段,sum(if(dt='yesterday',-1,1)) flag_adddel from 用户画像 where dt in ('yesterday','today') group by userid,年龄段 having flag_adddel<>0);
Step 2: compute the delta of the historical stats table; c1 can be broadcast
create table c2 as select b.年龄段,sum(flag_adddel) pv_adddel from 历史流水 a inner join c1 b on a.userid=b.userid group by b.年龄段;
Step 3: combine the old stats table with the delta to produce the new historical stats table
insert overwrite table 统计表 partition (dt='today') select 年龄段,sum(pv) from (select 年龄段,pv from 统计表 where dt='yesterday' union all select 年龄段,pv_adddel pv from c2) x group by 年龄段
Step 4: the new stats table excludes today's log; much of the time today's log still has to be added on top
insert overwrite table 统计表 partition (dt='today') select 年龄段,sum(pv) from (select b.年龄段,count(1) pv from 今天流水 a inner join 用户画像 b on a.userid=b.userid and b.dt='today' group by b.年龄段 union all select 年龄段,pv from 统计表 where dt='today') x group by 年龄段

Case 14: excessive nesting

Before: --the engine plans 3 steps
select a.userid,a.pv
from (select userid,sum(pv) pv from a group by userid) a
inner join (select userid from b group by userid) b on a.userid=b.userid
After: --the engine plans 2 steps
select a.userid,sum(pv)
from a inner join (select userid from b group by userid) b on a.userid=b.userid
group by a.userid

Case 15: global sort over a very large dataset

Before: --1.7 billion rows, 80% of play_count equal to 1; runs over 2 hours
select userid,scid,play_count,row_number() over(order by play_count desc) rn from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='2023-09-29';
After:

1. 80% of play_count is 1, so the duplicate rate is too high. Sample 1%, build an auxiliary sort column (play_count plus a 3-digit random decimal), and compute the percentiles 0.01 through 0.99; takes 4 minutes
create table tmp_dac.t_zkp_order_percentile_rand_spark as
select
percentile(play_count, 0.01) p1,percentile(play_count, 0.02) p2,…,percentile(play_count, 0.99) p99
from (select play_count+cast(rand() as decimal(4,3)) play_count from (select play_count from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='2023-09-29' and rand()<=0.01) x) x;

2. Split into groups by percentile and sort within each group; takes 8 minutes
create table tmp_dac.t_zkp_testorder2_spark as
select userid,scid,play_count,grp,row_number() over(partition by grp order by play_count desc) rn,0 final_rn from (
select userid,scid,play_count,case 
when play_count2<1.012 then 0 when play_count2<1.024 then 1 … when play_count2<7.439 then 98 
else 99
end grp from (
select userid,scid,play_count,play_count+rand() play_count2 from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='2023-09-29'
) x
) x;

3. Accumulate ranks across groups; takes 9 minutes
cache table c1 as (select grp,sum(c) over(order by grp)-c acc_rn from (select grp,count(1) c from tmp_dac.t_zkp_testorder2_spark group by grp) b);
insert overwrite table tmp_dac.t_zkp_testorder2_spark
select userid,scid,play_count,a.grp,rn,acc_rn+rn final_rn from tmp_dac.t_zkp_testorder2_spark a
inner join c1 b on a.grp=b.grp;

Before: --1.7 billion rows, play_count is a random integer from 1 to 100 million; runs over 3 hours
select userid,scid,play_count,row_number() over(order by play_count desc) rn from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='rand';
After:

1. play_count is a random integer from 1 to 100 million, so the duplicate rate is too low. Sample 1%, build an auxiliary sort column by zeroing the last four digits of play_count, and compute the percentiles 0.01 through 0.99; takes 2 minutes
create table tmp_dac.t_zkp_order_percentile_spark_rand as
select
percentile(play_count, 0.01) p1,percentile(play_count, 0.02) p2,…,percentile(play_count, 0.99) p99
from (select int(play_count/10000)*10000 play_count from (select play_count from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='rand' and rand()<=0.01) x) x;

2. Split into groups by percentile and sort within each group; takes 11 minutes
select userid,scid,play_count,grp,row_number() over(partition by grp order by play_count desc) rn,0 final_rn from (
select userid,scid,play_count,case 
when play_count2<990000.0 then 0 when play_count2<2000000.0 then 1 … when play_count2<9.899E7 then 98 
else 99
end grp from (
select userid,scid,play_count,play_count+rand() play_count2 from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='rand'
) x
) x;

3. Accumulate ranks across groups; takes 10 minutes
cache table c1 as (select grp,sum(c) over(order by grp)-c acc_rn from (select grp,count(1) c from tmp_dac.t_zkp_testorder2_spark_rand group by grp) b);
insert overwrite table tmp_dac.t_zkp_testorder2_spark_rand
select userid,scid,play_count,a.grp,rn,acc_rn+rn final_rn from tmp_dac.t_zkp_testorder2_spark_rand a
inner join c1 b on a.grp=b.grp;

Before: --45 billion rows, 80% of play_count equal to 1; could not finish within 24 hours
select userid,scid,play_count,row_number() over(order by play_count desc) rn from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='2month';
After:

1. 60% of play_count is 1, so the duplicate rate is too high. Sample 0.1%, build an auxiliary sort column (play_count plus a 3-digit random decimal), and compute the percentiles 0.01 through 0.99; takes 4 hours
create table tmp_dac.t_zkp_order_percentile_rand_spark_2month as
select
percentile(play_count, 0.01) p1,percentile(play_count, 0.02) p2,…,percentile(play_count, 0.99) p99
from (select play_count+cast(rand() as decimal(4,3)) play_count from (select play_count from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='2month' and rand()<=0.001) x) x;

2. Split into groups by percentile and sort within each group; takes 3 hours
create table tmp_dac.t_zkp_testorder2_spark_2month as
select userid,scid,play_count,grp,row_number() over(partition by grp order by play_count desc) rn,0 final_rn from (
select userid,scid,play_count,case 
when play_count2<1.016 then 0 when play_count2<1.032 then 1 … when play_count2<27.303 then 98 
else 99
end grp from (
select userid,scid,play_count,play_count+rand() play_count2 from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='2month'
) x
) x;

3. Accumulate ranks across groups; takes 5 hours
cache table c1 as (select grp,sum(c) over(order by grp)-c acc_rn from (select grp,count(1) c from tmp_dac.t_zkp_testorder2_spark_2month group by grp) b);
insert overwrite table tmp_dac.t_zkp_testorder2_spark_2month
select userid,scid,play_count,a.grp,rn,acc_rn+rn final_rn from tmp_dac.t_zkp_testorder2_spark_2month a
inner join c1 b on a.grp=b.grp;

Before: --45 billion rows, play_count is a random integer from 1 to 100 million; could not finish within 24 hours
select userid,scid,play_count,row_number() over(order by play_count desc) rn from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='rand2';
After:

1. play_count is a random integer from 1 to 100 million, so the duplicate rate is too low. Sample 0.1%, build an auxiliary sort column by zeroing the last four digits of play_count, and compute the percentiles 0.01 through 0.99; takes 2 hours
create table tmp_dac.t_zkp_order_percentile_spark_2monthrand as
select
percentile(play_count, 0.01) p1,percentile(play_count, 0.02) p2,…,percentile(play_count, 0.99) p99
from (select int(play_count/10000)*10000 play_count from (select play_count from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='rand2' and rand()<=0.001) x) x;

2. Split into groups by percentile and sort within each group; takes 4 hours
create table tmp_dac.t_zkp_testorder2_spark_2monthrand as
select userid,scid,play_count,grp,row_number() over(partition by grp order by play_count desc) rn,0 final_rn from (
select userid,scid,play_count,case 
when play_count2<1000000.0 then 0 when play_count2<1990000.0 then 1 … when play_count2<9.899E7 then 98 
else 99
end grp from (
select userid,scid,play_count,play_count+rand() play_count2 from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='rand2'
) x
) x;

3. Accumulate ranks across groups; takes 3 hours
cache table c1 as (select grp,sum(c) over(order by grp)-c acc_rn from (select grp,count(1) c from tmp_dac.t_zkp_testorder2_spark_2monthrand group by grp) b);
insert overwrite table tmp_dac.t_zkp_testorder2_spark_2monthrand
select userid,scid,play_count,a.grp,rn,acc_rn+rn final_rn from tmp_dac.t_zkp_testorder2_spark_2monthrand a
inner join c1 b on a.grp=b.grp;
