Big Data SQL Performance Optimization Tips

Terry
2023-11-21 / 0 comments / 0 likes / 108 views / 5,184 words

Resource Optimization Tips

Compute

I. Engine: choose the optimal engine. Spark is recommended; on average it saves about 60% compared with Hive.
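For example, on platforms that support Hive on Spark, the engine can be switched per session; this is only a sketch, and the exact mechanism (Hive on Spark, spark-sql, or a Spark Thrift Server) depends on your cluster:

```sql
-- Hive on Spark, where the distribution supports it:
set hive.execution.engine=spark;  -- commonly defaults to mr or tez

-- Alternatively, submit the same HiveQL through spark-sql or a Spark
-- Thrift Server session, which usually requires no query changes.
```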

II. Caching: make the most of broadcast/caching (e.g. mapjoin); it can improve performance severalfold.
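As an illustration, both engines accept optimizer hints to broadcast the small side of a join (the table names below are hypothetical); Spark also broadcasts automatically when the small side is under spark.sql.autoBroadcastJoinThreshold:

```sql
-- Spark SQL: ask the optimizer to broadcast the small dimension table
select /*+ BROADCAST(d) */ f.key, d.name
from fact_table f join dim_table d on f.key = d.key;

-- Hive: the equivalent mapjoin hint
select /*+ MAPJOIN(d) */ f.key, d.name
from fact_table f join dim_table d on f.key = d.key;
```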

III. Ways to optimize upstream tables to cut downstream consumption:

1. Convert text tables to ORC, which suits downstream jobs that read only a few columns

2. Remove redundant dimensions to raise the aggregation level and cut the volume of stored data

3. Add appropriate partition columns, moving toward precise scans

4. Build cumulative tables and monthly tables for downstream jobs that scan the full history
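Points 1 to 3 above can be sketched together as follows (table and column names are invented for illustration):

```sql
-- store as ORC, partitioned by date, with unneeded dimensions dropped
create table play_log_orc (
  userid bigint,
  scid   bigint,
  pv     bigint
)
partitioned by (dt string)  -- lets downstream jobs scan only the days they need
stored as orc;

insert overwrite table play_log_orc partition (dt)
select userid, scid, sum(pv) as pv, dt  -- pre-aggregate to raise the aggregation level
from play_log_text
group by userid, scid, dt;
```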

IV. Reduce the amount of data involved in computation:

1. Never leave out partition conditions, which causes useless or full-table scans

2. Predicate pushdown: cut the data volume at the earliest opportunity before computing (e.g. push where conditions as far into the inner layers as possible)

3. Aggregate first, then compute (e.g. group by before join, group by before grouping sets)

4. Pick upstream tables with a higher aggregation level so less data is read

5. Prune rollup rules: prefer grouping sets > rollup > cube, and trim the group by dimensions and the number of grouping sets; every extra dimension multiplies the cost exponentially, and every extra rollup doubles it
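To illustrate point 5 (column names invented): a cube over three columns produces all 8 dimension combinations, whereas listing only the combinations actually consumed can cut the work to a fraction:

```sql
-- all 8 combinations of (dt, city, channel):
select dt, city, channel, count(1) pv
from play_log
group by dt, city, channel with cube;

-- only the 2 combinations that are actually needed:
select dt, city, channel, count(1) pv
from play_log
group by dt, city, channel
grouping sets ((dt, city), (dt, channel));
```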

V. Optimize the SQL logic:

1. A join whose join key is unique can often be rewritten as union all; especially good for a chain of joins or when you need the rows that fail to match (see Case 1)

2. When the same table is joined several times only because the join conditions differ, consider merging them into one join; especially good when the repeated joins are expensive and there is no data skew (see Case 2)

3. Using row_number to take the first or last row can often be replaced by a plain max or min over a concatenated string, avoiding row_number's order by (see Case 3)

4. Multiple distincts over the same column perform poorly; rewrite as group by first, then sum (see Case 4)

5. Reuse repeated computation; across multiple jobs, consider materializing an intermediate table

6. Reduce repeated scans; when several sub-datasets are needed, cache the superset once and apply the different filters to it (see Case 5)

7. Write special expressions that skip computation for useless data; especially good when a lot of useless data takes part in the computation (see Case 6)

8. Three ways to resolve data skew: drop the skewed value, split hot values into groups, or pivot rows to columns, join, then unpivot (see Cases 7, 8, 9)

9. The string-matrix technique: apply efficient regular-expression logic (see Case 10)

10. Dynamic grouping instead of a Cartesian product (see Case 11)

11. For a big table joining a big table, consider a pre-join that shrinks one side, turning it into a big table joining a small table; especially good when one side mostly fails to match (see Case 12)

12. For a big table A joining a big table B, consider joining A against C (B's changes versus the previous day) instead, turning it into a big table joining a small table, then merging the result into the previous day's result table; especially good when one big table must scan the full history while the other is fairly static (see Case 13)

13. Excessive nesting can make the engine plan extra steps and raise the cost; consider flattening the nesting (see Case 14)

14. Solution for very large sorts: use Spark, build a suitable auxiliary sort column, and compute percentiles; use the percentiles to split the data into n groups, sort within each group, then add each group's cumulative count. The key step is arriving at suitable percentiles: feed percentile data that is evenly distributed, adequately precise, and moderately duplicated. The higher the duplicate rate, the better the percentile performance but the worse the group balance; the lower the duplicate rate, the worse the percentile performance but the better the group balance. Tune the duplicate rate to the sweet spot between grouping performance and balance; if the data is not naturally like that, transform it mathematically. If the volume is too large, sample first (see Case 15)

Warehouse Storage

1. A column with many repeated values stores much more compactly as ORC; with no repeats, text and ORC differ little in size, but ORC always beats text on access speed

2. Reorder an ORC table's rows so repeated values sit next to each other; this strengthens compression, trading sort compute for storage space

3. Design partitions sensibly, not too fine-grained; otherwise you create masses of small files and metadata, and scan costs rise as well
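Point 2 can be sketched as a sorted rewrite (names invented); clustering identical values lets ORC's run-length and dictionary encodings bite harder:

```sql
insert overwrite table play_log_orc partition (dt='2023-11-20')
select userid, scid, pv
from play_log_orc
where dt='2023-11-20'
distribute by scid sort by scid;  -- identical scid values end up adjacent
```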

Cases
Case 1: unique join key, a chain of joins, taking rows that fail to match c

Before: select a.key,a.a1,b.b1,c.c1 from a left join b on a.key=b.key left join c on a.key=c.key where c.key is null
After: select key,max(a1) a1,max(b1) b1,max(c1) c1
from (
 select key,a1,null b1,null c1,1 flag_a,0 flag_b,0 flag_c from a --tag the source
union all
 select key,null a1,b1,null c1,0 flag_a,1 flag_b,0 flag_c from b --tag the source
union all
 select key,null a1,null b1,c1,0 flag_a,0 flag_b,1 flag_c from c --tag the source
) x
group by key having max(flag_a)>0 and max(flag_c)=0 --use max over the flags to judge the join outcome

Case 2: tables a and b are each read twice

Before: select a.key,a.a1,'1' type from a left join b on a.key=b.key and b.b1>1
union all 
select a.key,a.a1,'2' type from a left join b on a.key=b.key and b.b1>2
After: select key,a1,type
from (
 select a.key,a.a1,if(b.b1>2,'1,2','1') types from a left join b on a.key=b.key and b.b1>1 --tag the rows
) x lateral view explode(split(types,',')) xs as type --lateral view explode turns 1 row into many

Case 3: taking row_number=1

Before: select userid,scid
from (
 select userid,scid,row_number() over(partition by userid order by pv desc) rn from a
) x where rn=1
After: select userid,split(max(concat(lpad(pv,10,'0'),'@',scid)),'@')[1] scid from a group by userid --left-padding the number with zeros aligns the digits so string ordering matches numeric ordering

Case 4: multiple distincts on the same column

Before: select count(distinct userid) uv,count(distinct if(条件1,userid,null)) uv1,count(distinct if(条件2,userid,null)) uv2 from a
After: select count(1) uv,sum(flag1) uv1,sum(flag2) uv2
from (
 select userid,max(if(条件1,1,0)) flag1,max(if(条件2,1,0)) flag2 from a group by userid --tag each userid
) x

Case 5: table c is read twice

Before: select a.key,a.col,'1' type from a left join c on a.key=c.key and c.c1=1
union all 
select b.key,b.col,'2' type from b left join c on b.key=c.key and c.c2=1
After: cache table v1 as (select key,c1,c2 from c where (c.c1=1 or c.c2=1)); --cache the superset of useful rows
select a.key,a.col,'1' type from a left join v1 on a.key=v1.key and v1.c1=1
union all 
select b.key,b.col,'2' type from b left join v1 on b.key=v1.key and v1.c2=1

Case 6: userid=0 can be excluded from the computation

Before: select userid,count(distinct scid) song_cnt from 播放流水 group by userid
After: select userid,count(distinct scid) song_cnt from 播放流水 where userid<>0 group by userid --deliberately keep userid=0 out of the distinct computation

Case 7: userid=0 is skewed and can be dropped

Before: select a.userid,a.a1,b.b1 from a left join b on a.userid=b.userid
After: select a.userid,a.a1,b.b1 from a left join b on a.userid=b.userid where a.userid<>0 --drop the hot value userid=0

Case 8: userid=0 is skewed and cannot be dropped

Before: select a.userid,a.a1,b.b1 from a left join b on a.userid=b.userid
After: select if(a.userid like '解决倾斜_%',split(a.userid,'_')[1],a.userid) userid,a.a1,b.b1 --merge the random groups back together
from (
 select if(userid<>0,userid,concat('解决倾斜_',userid,'_',int(rand()*10))) userid,a1 from a --split the hot value into 10 random groups
) a left join (
 select userid,b1 from b where userid<>0
union all
 select concat('解决倾斜_',userid,'_',grp) userid,b1 from (select userid,b1,'0,1,2,3,4,5,6,7,8,9' grps from b where userid=0) b lateral view explode(split(grps,',')) xs as grp --expand the right table into the same 10 groups so the join loses no rows
) b on a.userid=b.userid

Case 9: collaboration songs make deriving singerid from scid fan out badly, causing severe data skew

Before: select a.userid,b.singerid from 播放流水 a left join 歌曲歌手关系表 b on a.scid=b.scid
After:

select userid,singerid
from (
 select a.userid,b.singerid_list
 from 播放流水 a left join (
  select scid,concat_ws(',',collect_list(cast(singerid as string))) singerid_list from 歌曲歌手关系表 group by scid --pack all singers of a collaboration song into one singerid_list
 ) b on a.scid=b.scid
) x lateral view explode(split(singerid_list,',')) xs as singerid --explode singerid_list only after the join

Case 10: n-day retention (users active n days ago who are also active within the last 7 days; n=14,30,60,90,180)

Before: --run once for every n
select count(1)
from (select userid from 活跃流水 where dt='ndays_ago') a
inner join (select userid from 活跃流水 where dt between '7days_ago' and 'today' group by userid) b on a.userid=b.userid
After:

--Step 1 runs only once, and all values of n share a single run of step 2. As long as step 1 is rich enough, it can support many future step-2 rules, e.g. storing a comma-separated activity level instead of a 0/1 flag.
Step 1: build a cumulative history table whose active_his column records each userid's daily activity, 0 for inactive, 1 for active, with the rightmost digit being the current day:
dt userid active_his
2023-01-10 1 1111111111111111111111111111111111111111
2023-01-10 2 1100000110000011000001100000110000011000
2023-01-10 3 101
insert overwrite table active_his_table partition (dt='today')
select userid,concat(max(active_his),max(today_flag)) active_his --string concatenation (truncate at some maximum length to avoid unbounded growth)
from (
 select userid,active_his,'0' today_flag from active_his_table where dt='yesterday' --yesterday's cumulative result
union all
 select userid,'' active_his,'1' today_flag from 活跃流水 where dt='today' --today's increment
) x group by userid

Step 2: match with a regular expression. This is extremely fast, and even fairly demanding logic can be expressed as a powerful regex. When a new requirement arrives, adding one regex produces the numbers immediately:
select count(1) from active_his_table where dt='today' and active_his rlike '1.{23}(?=0*1).{7}$' --(users active 30 days ago who are also active within the last 7 days)
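The same pattern generalizes to other values of n; a sketch, assuming (as above) that the rightmost character is today, so the leading '1' must sit n+1 characters from the end:

```sql
-- n = 14: 1 + 7 + 7 = 15 characters, so the '1' is 14 days before today
select count(1) from active_his_table
where dt='today' and active_his rlike '1.{7}(?=0*1).{7}$';

-- n = 60: 1 + 53 + 7 = 61 characters
select count(1) from active_his_table
where dt='today' and active_his rlike '1.{53}(?=0*1).{7}$';
```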

Case 11: credential stuffing. In the login log, a failed attempt whose device differs from the device of the most recent success counts as credential stuffing; count how many times each account is attacked

Before: --produces something close to a Cartesian product; the cost is huge and it may never finish
select userid,count(1)
from (
 select userid,mid,lst,lead(lst) over(partition by userid order by lst) next_lst from 登录流水 where status='success'
) a left outer join (
 select userid,mid,lst from 登录流水 where status='fail'
) b on a.userid=b.userid
where (b.lst between a.lst and a.next_lst or (a.next_lst is null and b.lst>a.lst)) and a.mid<>b.mid
group by userid
After: select userid,sum(if(status='fail' and mid<>success_mid,1,0)) fail_cnt
from (
 select userid,group_id,mid,lst,status,max(if(status='success',mid,null)) over(partition by userid,group_id) success_mid --within each dynamic group, use a window function to pick out the success device
 from (
  select userid,mid,lst,status,sum(if(status='success',1,0)) over(partition by userid order by lst ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) group_id from 登录流水 --tag rows: success=1, fail=0; the running sum forms the dynamic group_id
 ) x
 where group_id>0
) x
group by userid
--if userid+group_id itself skews, combine this with the skew techniques and subdivide group_id further by time while still keeping the same success device per group; the implementation is fairly involved, see the production job for reference

Case 12: big table join big table: play log join the song-library dimension table

Before: select a.*,b.* from 播放明细表 a left join 曲库维表 b on a.scid=b.scid
After: cache table c1 as (select scid from 播放明细表 group by scid); --pin down the set of scid actually needed
cache table c2 as (select a.* from 曲库维表 a inner join c1 b on a.scid=b.scid); --shrink the dimension table down to c2; conveniently, c1 can be broadcast
select a.*,b.* from 播放明细表 a left join c2 b on a.scid=b.scid --conveniently, c2 can be broadcast

Case 13: big table join big table: every day, join the full history log against the latest user profile

Before: select b.年龄段,count(1) pv from 历史流水 a inner join 用户画像 b on a.userid=b.userid and b.dt='today' group by b.年龄段
After:

Step 1: pin down the dimension table's changes between the two days
cache table c1 as (select userid,年龄段,sum(if(dt='yesterday',-1,1)) flag_adddel from 用户画像 where dt in ('yesterday','today') group by userid,年龄段 having flag_adddel<>0);
Step 2: compute the delta of the historical stats table; c1 can be broadcast
create table c2 as select b.年龄段,sum(flag_adddel) pv_adddel from 历史流水 a inner join c1 b on a.userid=b.userid group by b.年龄段;
Step 3: combine the old stats table with the delta to produce the new historical stats table
insert overwrite table 统计表 partition (dt='today') select 年龄段,sum(pv) from (select 年龄段,pv from 统计表 where dt='yesterday' union all select 年龄段,pv_adddel pv from c2) x group by 年龄段
Step 4: the new stats table excludes today's log; much of the time today's log still has to be added on top
insert overwrite table 统计表 partition (dt='today') select 年龄段,sum(pv) from (select b.年龄段,count(1) pv from 今天流水 a inner join 用户画像 b on a.userid=b.userid and b.dt='today' group by b.年龄段 union all select 年龄段,pv from 统计表 where dt='today') x group by 年龄段

Case 14: excessive nesting

Before: --the engine plans 3 steps
select a.userid,a.pv
from (select userid,sum(pv) pv from a group by userid) a
inner join (select userid from b group by userid) b on a.userid=b.userid
After: --the engine plans 2 steps
select a.userid,sum(pv)
from a inner join (select userid from b group by userid) b on a.userid=b.userid
group by a.userid

Case 15: global sort over a very large dataset

Before: --1.7 billion rows, 80% of play_count equal to 1; runs over 2 hours
select userid,scid,play_count,row_number() over(order by play_count desc) rn from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='2023-09-29';
After:

1. 80% of play_count is 1, so the duplicate rate is too high. Sample 1%, build an auxiliary sort column (play_count plus a 3-digit random decimal), and compute the percentiles 0.01 through 0.99; takes 4 minutes
create table tmp_dac.t_zkp_order_percentile_rand_spark as
select
percentile(play_count, 0.01) p1,percentile(play_count, 0.02) p2,…,percentile(play_count, 0.99) p99
from (select play_count+cast(rand() as decimal(4,3)) play_count from (select play_count from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='2023-09-29' and rand()<=0.01) x) x;

2. Split into groups by percentile and sort within each group; takes 8 minutes
create table tmp_dac.t_zkp_testorder2_spark as
select userid,scid,play_count,grp,row_number() over(partition by grp order by play_count desc) rn,0 final_rn from (
select userid,scid,play_count,case 
when play_count2<1.012 then 0 when play_count2<1.024 then 1 … when play_count2<7.439 then 98 
else 99
end grp from (
select userid,scid,play_count,play_count+rand() play_count2 from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='2023-09-29'
) x
) x;

3. Accumulate ranks across groups; takes 9 minutes
cache table c1 as (select grp,sum(c) over(order by grp)-c acc_rn from (select grp,count(1) c from tmp_dac.t_zkp_testorder2_spark group by grp) b);
insert overwrite table tmp_dac.t_zkp_testorder2_spark
select userid,scid,play_count,a.grp,rn,acc_rn+rn final_rn from tmp_dac.t_zkp_testorder2_spark a
inner join c1 b on a.grp=b.grp;

Before: --1.7 billion rows, play_count is a random integer from 1 to 100 million; runs over 3 hours
select userid,scid,play_count,row_number() over(order by play_count desc) rn from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='rand';
After:

1. play_count is a random integer from 1 to 100 million, so the duplicate rate is too low. Sample 1%, build an auxiliary sort column by zeroing the last four digits of play_count, and compute the percentiles 0.01 through 0.99; takes 2 minutes
create table tmp_dac.t_zkp_order_percentile_spark_rand as
select
percentile(play_count, 0.01) p1,percentile(play_count, 0.02) p2,…,percentile(play_count, 0.99) p99
from (select int(play_count/10000)*10000 play_count from (select play_count from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='rand' and rand()<=0.01) x) x;

2. Split into groups by percentile and sort within each group; takes 11 minutes
select userid,scid,play_count,grp,row_number() over(partition by grp order by play_count desc) rn,0 final_rn from (
select userid,scid,play_count,case 
when play_count2<990000.0 then 0 when play_count2<2000000.0 then 1 … when play_count2<9.899E7 then 98 
else 99
end grp from (
select userid,scid,play_count,play_count+rand() play_count2 from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='rand'
) x
) x;

3. Accumulate ranks across groups; takes 10 minutes
cache table c1 as (select grp,sum(c) over(order by grp)-c acc_rn from (select grp,count(1) c from tmp_dac.t_zkp_testorder2_spark_rand group by grp) b);
insert overwrite table tmp_dac.t_zkp_testorder2_spark_rand
select userid,scid,play_count,a.grp,rn,acc_rn+rn final_rn from tmp_dac.t_zkp_testorder2_spark_rand a
inner join c1 b on a.grp=b.grp;

Before: --45 billion rows, 80% of play_count equal to 1; could not finish within 24 hours
select userid,scid,play_count,row_number() over(order by play_count desc) rn from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='2month';
After:

1. 60% of play_count is 1, so the duplicate rate is too high. Sample 0.1%, build an auxiliary sort column (play_count plus a 3-digit random decimal), and compute the percentiles 0.01 through 0.99; takes 4 hours
create table tmp_dac.t_zkp_order_percentile_rand_spark_2month as
select
percentile(play_count, 0.01) p1,percentile(play_count, 0.02) p2,…,percentile(play_count, 0.99) p99
from (select play_count+cast(rand() as decimal(4,3)) play_count from (select play_count from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='2month' and rand()<=0.001) x) x;

2. Split into groups by percentile and sort within each group; takes 3 hours
create table tmp_dac.t_zkp_testorder2_spark_2month as
select userid,scid,play_count,grp,row_number() over(partition by grp order by play_count desc) rn,0 final_rn from (
select userid,scid,play_count,case 
when play_count2<1.016 then 0 when play_count2<1.032 then 1 … when play_count2<27.303 then 98 
else 99
end grp from (
select userid,scid,play_count,play_count+rand() play_count2 from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='2month'
) x
) x;

3. Accumulate ranks across groups; takes 5 hours
cache table c1 as (select grp,sum(c) over(order by grp)-c acc_rn from (select grp,count(1) c from tmp_dac.t_zkp_testorder2_spark_2month group by grp) b);
insert overwrite table tmp_dac.t_zkp_testorder2_spark_2month
select userid,scid,play_count,a.grp,rn,acc_rn+rn final_rn from tmp_dac.t_zkp_testorder2_spark_2month a
inner join c1 b on a.grp=b.grp;

Before: --45 billion rows, play_count is a random integer from 1 to 100 million; could not finish within 24 hours
select userid,scid,play_count,row_number() over(order by play_count desc) rn from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='rand2';
After:

1. play_count is a random integer from 1 to 100 million, so the duplicate rate is too low. Sample 0.1%, build an auxiliary sort column by zeroing the last four digits of play_count, and compute the percentiles 0.01 through 0.99; takes 2 hours
create table tmp_dac.t_zkp_order_percentile_spark_2monthrand as
select
percentile(play_count, 0.01) p1,percentile(play_count, 0.02) p2,…,percentile(play_count, 0.99) p99
from (select int(play_count/10000)*10000 play_count from (select play_count from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='rand2' and rand()<=0.001) x) x;

2. Split into groups by percentile and sort within each group; takes 4 hours
create table tmp_dac.t_zkp_testorder2_spark_2monthrand as
select userid,scid,play_count,grp,row_number() over(partition by grp order by play_count desc) rn,0 final_rn from (
select userid,scid,play_count,case 
when play_count2<1000000.0 then 0 when play_count2<1990000.0 then 1 … when play_count2<9.899E7 then 98 
else 99
end grp from (
select userid,scid,play_count,play_count+rand() play_count2 from tmp_dac.oa_bi_profile_play_scid_userid_d where dt='rand2'
) x
) x;

3. Accumulate ranks across groups; takes 3 hours
cache table c1 as (select grp,sum(c) over(order by grp)-c acc_rn from (select grp,count(1) c from tmp_dac.t_zkp_testorder2_spark_2monthrand group by grp) b);
insert overwrite table tmp_dac.t_zkp_testorder2_spark_2monthrand
select userid,scid,play_count,a.grp,rn,acc_rn+rn final_rn from tmp_dac.t_zkp_testorder2_spark_2monthrand a
inner join c1 b on a.grp=b.grp;
