  • #Interview
  • #DataScience

A Summary of Common Mistakes in A/B Testing

youyu0625
Mistake category 1: misunderstanding statistical concepts

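1. Data peeking: stopping the experiment early, as soon as the result looks significant, and collecting no more data. This is wrong because the experiment's duration is calculated from factors such as statistical power, significance level, day-of-week effects, and seasonality; with too little data the result can look very different (estimated treatment effect ≠ true treatment effect). More importantly, such a result may not be reproducible after the product launches.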
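To see why peeking is a problem, here is a minimal simulation sketch of an A/A test (both groups share the same true conversion rate, so every significant result is a false positive). The daily traffic, base rate, and 14-day duration are made-up assumptions for illustration, and the test used is a plain pooled two-proportion z-test.

```python
import random
from statistics import NormalDist

def z_test_p_value(successes_a, successes_b, n):
    """Two-sided p-value of a pooled two-proportion z-test, n users per group."""
    pooled = (successes_a + successes_b) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    if se == 0:
        return 1.0
    z = (successes_b - successes_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(peek, n_sims=1000, n_days=14, users_per_day=100,
                        base_rate=0.1, alpha=0.05, seed=0):
    """Share of A/A experiments declared significant.

    peek=True  : test after every day and stop at the first significant result.
    peek=False : test only once, at the planned end of the experiment.
    All default values are hypothetical.
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        a = b = n = 0
        significant = False
        for day in range(n_days):
            # Both groups convert at the same true rate (no treatment effect).
            a += sum(rng.random() < base_rate for _ in range(users_per_day))
            b += sum(rng.random() < base_rate for _ in range(users_per_day))
            n += users_per_day
            last_day = day == n_days - 1
            if (peek or last_day) and z_test_p_value(a, b, n) < alpha:
                significant = True
                break
        false_positives += significant
    return false_positives / n_sims

if __name__ == "__main__":
    print("test once at the end:", false_positive_rate(peek=False))
    print("peek every day:      ", false_positive_rate(peek=True))
```

Without peeking, the false positive rate should stay close to the nominal 0.05; with daily peeking it is typically several times higher, even though there is no real effect.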

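2. Multiple testing problem: when there are multiple target metrics, deciding to launch a new feature because one or a few of them are significant. This is wrong because drawing conclusions from simultaneous tests at an unadjusted significance level inflates the chance of a type I error (this is really a form of data peeking too).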

Common scenarios include:



  • multiple metrics in an A/B test

  • one metric in an A/B test with multiple treatment groups

  • a metric analyzed on a segment of the population

  • multiple iterations of an A/B test

  • multiple A/B tests in parallel



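Solution:



  • Before the experiment, divide all metrics into three groups: those you expect to be impacted (e.g., metric A), those potentially impacted (e.g., B and C), and those unlikely to be impacted (e.g., D).

  • Use tiered significance levels for the groups (A: 0.05; B and C: 0.01; D: 0.001).

  • If A is significant and B, C, and D are not, the result matches expectations; if A is not significant but B, C, and D are, some debugging is needed.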
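A small sketch of both ideas: how quickly the chance of at least one false positive grows with the number of tests, and how the tiered thresholds above could be applied. The independence assumption in the first function, the tier names, and the example p-values are all hypothetical.

```python
def family_wise_error_rate(alpha, n_tests):
    """P(at least one false positive) over n_tests independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests

# Tiered significance levels from the list above; the tier names are made up.
TIER_ALPHA = {"expected": 0.05, "potential": 0.01, "unlikely": 0.001}

def significant_metrics(p_values, tiers):
    """Compare each metric's p-value with the threshold of its own tier."""
    return {m: p_values[m] < TIER_ALPHA[tiers[m]] for m in p_values}

if __name__ == "__main__":
    # Four metrics all tested at an unadjusted 0.05: roughly an 18.5% chance
    # of at least one false positive.
    print(round(family_wise_error_rate(0.05, 4), 3))

    tiers = {"A": "expected", "B": "potential", "C": "potential", "D": "unlikely"}
    p_values = {"A": 0.03, "B": 0.03, "C": 0.2, "D": 0.004}  # hypothetical
    # A passes its 0.05 threshold; B, C, and D miss their stricter thresholds,
    # which is the "consistent with expectations" case.
    print(significant_metrics(p_values, tiers))
```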



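3. Lack of statistical power: concluding that there is no treatment effect when there are not enough randomization units to detect the effect size. For example, suppose the calculation says each group needs 1,000 users to reach 80% statistical power. If only 900 are available when the experiment ends and the result is not significant, that does not show there is no treatment effect, because the test is underpowered. In that case, keep collecting until there is enough data.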
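As a rough sketch of where a required sample size comes from, the following uses the standard normal-approximation formula for comparing two proportions. The conversion rates (10% vs. 12%) are hypothetical, and the power function ignores the negligible opposite tail.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Users needed per group for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_treatment - p_control) ** 2)

def achieved_power(p_control, p_treatment, n_per_group, alpha=0.05):
    """Approximate power actually reached with n_per_group users per group."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    se = (variance / n_per_group) ** 0.5
    return NormalDist().cdf(abs(p_treatment - p_control) / se - z_alpha)

if __name__ == "__main__":
    n = sample_size_per_group(0.10, 0.12)  # hypothetical baseline and target
    print("required per group:", n)
    print("power at full n:   ", round(achieved_power(0.10, 0.12, n), 3))
    # Stopping at 90% of the planned sample leaves the test underpowered.
    print("power at 0.9 * n:  ", round(achieved_power(0.10, 0.12, int(0.9 * n)), 3))
```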

Mistake category 2: overlooking factors that invalidate the results

1. Sample ratio mismatch: the sample ratio between control and treatment is not as designed, e.g., 1 (design ratio) vs. 1.1 (observed ratio), which affects the test result.

Common causes:



  • bugs or problems in assigning users to different groups (ramp-up plans, multiple experiments in parallel, segmentation based on attributes that can change over time, etc.)

  • a bug in the pipeline (e.g., filtering out fraudulent users before the test) that causes the false positive rate to differ between groups



How to debug:



  • Check for a gap (difference) upstream of the randomization point

  • Check if the variant assignment is done correctly

  • Look into the data processing pipeline

  • Check different segments of the population
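Before debugging, it helps to confirm that the observed ratio really deviates from the design. One common approach, sketched here, is a chi-square goodness-of-fit test on the group counts. The user counts are hypothetical, and the strict 0.001 threshold in the comment is a convention I am assuming, not something from the notes above.

```python
from math import erfc, sqrt

def srm_p_value(n_control, n_treatment, design_ratio=1.0):
    """Chi-square goodness-of-fit p-value (1 degree of freedom).

    design_ratio is the intended treatment / control ratio.
    """
    total = n_control + n_treatment
    expected_control = total / (1 + design_ratio)
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))

if __name__ == "__main__":
    # Designed 1:1, observed 1:1.1 (hypothetical counts): a tiny p-value,
    # far below an assumed threshold of 0.001, signals a mismatch.
    print(srm_p_value(10_000, 11_000))
    # A small imbalance is easily explained by chance.
    print(srm_p_value(10_000, 10_050))
```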



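2. Violation of SUTVA (stable unit treatment value assumption): A/B testing assumes that randomization units are independent of one another, with no interaction between them. If this assumption does not hold, the results are unreliable. Common scenarios include social networks (Facebook), where users influence each other's behavior, and two-sided markets (eBay, Uber, and Lyft), where the control and treatment groups compete for the same resources. Solutions include selecting the control and treatment groups from different geographic locations, and monitoring for interference as much as possible.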
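One way to picture the geographic approach is to randomize whole regions instead of individual users, so that users who share the same marketplace all see the same variant. This is only a sketch: the experiment name, region names, and hash-based 50/50 split are assumptions for illustration.

```python
import hashlib

def assign_region(region, experiment="exp_001"):
    """Deterministically assign an entire region to one variant."""
    digest = hashlib.sha256(f"{experiment}:{region}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

def assign_user(user_region, experiment="exp_001"):
    """A user inherits the variant of their region, so riders and drivers
    in the same city never end up in competing groups."""
    return assign_region(user_region, experiment)

if __name__ == "__main__":
    for city in ["Seattle", "Austin", "Boston", "Denver"]:  # hypothetical
        print(city, assign_region(city))
```

Note that the unit of analysis then becomes the region rather than the user, so far fewer independent units are available and the power calculation has to account for that.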

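3. Changes in user behavior: these include the novelty effect (users like to try new things) and the primacy effect (users prefer what already exists), and they typically occur in the initial period after users see a new product or feature. Although these effects cannot be eliminated, you can monitor whether they exist, quantify them, and remove their influence when making a decision.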
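A very simple way to monitor such effects is to track the treatment lift day by day and compare the initial period with the rest of the experiment. The sketch below assumes daily conversion rates per group are already available; the 7-day cutoff and the numbers are hypothetical.

```python
from statistics import mean

def daily_lift(control_rates, treatment_rates):
    """Relative lift of treatment over control for each day."""
    return [t / c - 1 for c, t in zip(control_rates, treatment_rates)]

def early_vs_late_lift(control_rates, treatment_rates, early_days=7):
    """Average lift in the first early_days days vs. the remaining days."""
    lifts = daily_lift(control_rates, treatment_rates)
    return mean(lifts[:early_days]), mean(lifts[early_days:])

if __name__ == "__main__":
    control = [0.10] * 14                 # hypothetical daily conversion rates
    treatment = [0.13] * 7 + [0.11] * 7   # lift fades after the first week
    early, late = early_vs_late_lift(control, treatment)
    # A much larger early lift hints at a novelty effect; a much smaller one
    # would hint at a primacy effect.
    print(round(early, 3), round(late, 3))
```

If the early and late lifts differ a lot, the later period is usually the safer basis for the launch decision.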

I have continued to summarize some notes. Many thanks to the creator for her videos:


