
Kaggle Competition Experience - IEEE Fraud Detection

luohongchen1993
I'm badly in need of points lately, so I'm here to share some of my experience from the Kaggle IEEE Fraud Detection competition. I hope it's useful to everyone.

Overall, there are a few aspects:

1 Feature engineering

2 Feature selection

3 Modeling and cross-validation (also sampling tricks)

4 Ensemble / Stacking

Each of these is worth thinking about. Anyone interested is welcome to discuss.

The main write-up follows:

First I want to thank VESTA / Kaggle for hosting this competition. This is my first serious Kaggle competition ever. I've learnt a lot from the discussion boards, previous competition winning solutions and other people's kernels. Also my teammates did great and helped me a lot. I truly appreciate all of their efforts. @nullrecurrent @zhenxuanli @harrylpc

I'll try to share all the stuff that worked for me or I enjoyed most.

Feature Engineering:

We did not find the magic. I did try D1-DT at one point, but it did not occur to me that we could use it as part of the order id. We heard about using cardhash and devicehash somehow, but were not able to figure out the whole picture.

We did a lot of count encoding and numeric aggregation borrowed from @kyakovlev, and read the discussion posts by @cdeotte many times. Congratulations to both of them for winning first place! I truly learnt a lot from you guys, especially from Konstantin's kernels. On the last day I also copied a lot of features from @duykhanh99's great kernel.

Some unique things that we did in the FE part:

Group the TransactionAmt into several bins (say 0-50, 50-100, 100-400 etc.) to reduce noise. Then combine ProductCD with AmtBin to create a new semi-categorical feature.
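A minimal sketch of that binning idea, assuming a pandas DataFrame with the competition's TransactionAmt and ProductCD columns (the exact bin edges here are just illustrative):

```python
import pandas as pd

def add_amt_bin_features(df):
    # Bucket TransactionAmt into coarse ranges to reduce noise.
    bins = [0, 50, 100, 400, 1000, float("inf")]
    labels = ["0-50", "50-100", "100-400", "400-1000", "1000+"]
    df["AmtBin"] = pd.cut(df["TransactionAmt"], bins=bins, labels=labels)

    # Combine ProductCD with the amount bin into one semi-categorical feature.
    df["ProductCD_AmtBin"] = df["ProductCD"].astype(str) + "_" + df["AmtBin"].astype(str)
    return df
```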

Instead of directly doing numeric interaction, do it for log(Amt) or Rank(Amt) to make it more robust.
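A sketch of what I mean by the more robust interactions; the second column name is a placeholder, not a specific feature we used:

```python
import numpy as np

def robust_amt_interactions(df, other_col="C1"):
    # Interact on the log scale instead of the raw amount to dampen outliers.
    df[f"logAmt_x_{other_col}"] = np.log1p(df["TransactionAmt"]) * df[other_col]

    # Or rank-transform the amount first; ranks are insensitive to scale and outliers.
    amt_rank = df["TransactionAmt"].rank(pct=True)
    df[f"rankAmt_x_{other_col}"] = amt_rank * df[other_col]
    return df
```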

Plot the features' mean and std by time, so we get a pretty good sense of whether the feature changes a lot in time / between train and test.
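Something along these lines produces the stability plots, assuming TransactionDT is in seconds and bucketing it into weeks (the weekly granularity is just one reasonable choice):

```python
import matplotlib.pyplot as plt

def plot_feature_stability(train, test, feature, dt_col="TransactionDT"):
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    for name, df in [("train", train), ("test", test)]:
        # Bucket the timeline into roughly weekly periods and track mean / std.
        week = (df[dt_col] // (7 * 24 * 3600)).astype(int)
        stats = df.groupby(week)[feature].agg(["mean", "std"])
        axes[0].plot(stats.index, stats["mean"], label=name)
        axes[1].plot(stats.index, stats["std"], label=name)
    axes[0].set_title(f"{feature}: mean by week")
    axes[1].set_title(f"{feature}: std by week")
    axes[0].legend()
    plt.show()
```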

We used a pseudo id and tried to find the prev / next fraud probability (this was a strong feature in our model).
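A rough sketch of the idea, assuming a pseudo_uid column has already been built; the essence is just a shift of the label within each pseudo id, ordered by time:

```python
def add_neighbor_fraud_features(df, uid_col="pseudo_uid", target_col="isFraud"):
    # Within each pseudo id, look at the label of the previous / next transaction in time.
    df = df.sort_values([uid_col, "TransactionDT"])
    grouped_target = df.groupby(uid_col)[target_col]
    df["prev_fraud"] = grouped_target.shift(1)   # NaN where the neighbor is unknown (e.g. test rows)
    df["next_fraud"] = grouped_target.shift(-1)
    return df
```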

Adversarial analysis was another thing I gave a try (mostly using the function provided by @nroman). It basically tries to use the features to predict whether a data point belongs to the train or test set, and identifies the features that display strong time-series patterns.
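A minimal adversarial-validation sketch (not @nroman's actual function, just the general idea), using LightGBM to separate train from test:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

def adversarial_report(train, test, features):
    # Label train rows 0 and test rows 1, then see how well a model can tell them apart.
    X = pd.concat([train[features], test[features]], axis=0, ignore_index=True)
    y = np.r_[np.zeros(len(train)), np.ones(len(test))]

    clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
    auc = cross_val_score(clf, X, y, scoring="roc_auc", cv=3).mean()

    # AUC far above 0.5 means the feature set drifts between train and test;
    # the most important features in this classifier are the ones that drift.
    clf.fit(X, y)
    drift = pd.Series(clf.feature_importances_, index=features).sort_values(ascending=False)
    return auc, drift
```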

Feature Selection

Building up the model by adding features one by one:

I read the negative downsampling post by @johnpateha, and somehow I got into the mindset that I should be aiming for models with, say, 50 features that can achieve at least 0.9500. His trick really helped reduce the training time, and the approach of adding each feature only when it increases CV (and probably on at least 2 or 3 folds) was brilliant. Too bad I could not come up with more strong features. (Still feeling unhappy about not getting even a bronze LOL.) Also, @cpmpml gave me some great advice on feature selection, and his approach is similar to Evgeny's.
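A stripped-down sketch of that add-one-feature-at-a-time loop, keeping a feature only if it improves the AUC on enough folds; the plain KFold and the LGBM parameters here are placeholders for our real time-based folds:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

def fold_aucs(X, y, folds):
    scores = []
    for tr_idx, va_idx in folds.split(X):
        model = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05)
        model.fit(X.iloc[tr_idx], y.iloc[tr_idx])
        preds = model.predict_proba(X.iloc[va_idx])[:, 1]
        scores.append(roc_auc_score(y.iloc[va_idx], preds))
    return np.array(scores)

def greedy_add(df, y, base_features, candidates, min_folds_improved=2):
    folds = KFold(n_splits=5, shuffle=False)
    selected = list(base_features)
    best = fold_aucs(df[selected], y, folds)
    for feat in candidates:
        scores = fold_aucs(df[selected + [feat]], y, folds)
        # Keep the feature only if it helps on enough folds (and on average).
        if (scores > best).sum() >= min_folds_improved and scores.mean() > best.mean():
            selected.append(feat)
            best = scores
    return selected
```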

RFECV with feature importance: this one is tricky, and I don't think it worked well, probably because I never got to the point where I had enough strong features to select from.
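A generic RFECV setup in this spirit (the parameters are illustrative, not the ones we used):

```python
import lightgbm as lgb
from sklearn.feature_selection import RFECV

def rfecv_select(X, y):
    # Recursively drop the least important features (by LGBM importance),
    # keeping the subset with the best cross-validated AUC.
    selector = RFECV(
        estimator=lgb.LGBMClassifier(n_estimators=200),
        step=10, scoring="roc_auc", cv=3, min_features_to_select=30,
    )
    selector.fit(X, y)
    return X.columns[selector.support_].tolist()
```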

Null importance: I borrowed the code from @ogrellier's great kernel. The basic idea is to randomly shuffle the target variable and compare feature importance on the real vs. the shuffled target. I would say I like this one the most, and I was able to use it in our later stacking model to help boost the score.
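A sketch in the spirit of that kernel (not a copy of it); the ratio at the end is just one simple way to compare real vs. null importance:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

def gain_importance(X, y):
    model = lgb.LGBMClassifier(n_estimators=300, importance_type="gain")
    model.fit(X, y)
    return pd.Series(model.feature_importances_, index=X.columns)

def null_importance_scores(X, y, n_runs=20, seed=0):
    rng = np.random.RandomState(seed)
    actual = gain_importance(X, y)
    # Importance distribution when the target is shuffled and therefore meaningless.
    null = pd.concat(
        [gain_importance(X, pd.Series(rng.permutation(y), index=X.index))
         for _ in range(n_runs)],
        axis=1,
    )
    # Features whose real importance does not clearly beat the null distribution are suspect.
    return actual / (null.quantile(0.75, axis=1) + 1e-6)
```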

Permutation importance: this one was recommended by a lot of people. It randomly permutes the feature itself instead of the target and evaluates the performance difference between the real and permuted X. Intuitively it still makes sense, but when there are interactions between features the results can be distorted, so I would still prefer the null importance approach.
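With scikit-learn this is nearly a one-liner on a held-out fold; note the permutation is applied to X, not to y:

```python
import pandas as pd
from sklearn.inspection import permutation_importance

def permutation_scores(fitted_model, X_valid, y_valid):
    # Shuffle each feature in turn and measure how much the validation AUC drops.
    result = permutation_importance(
        fitted_model, X_valid, y_valid, scoring="roc_auc", n_repeats=5, random_state=0
    )
    return pd.Series(result.importances_mean, index=X_valid.columns).sort_values()
```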

My special version of a feature selection metric: as you run CV, collect gain importance for all folds and compute the max / min across folds; if some feature sits in the bottom 5% for both max and min importance, it's probably just not important to the model.
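In code, the metric boils down to something like this:

```python
import pandas as pd

def weak_features(importance_per_fold, bottom_frac=0.05):
    # importance_per_fold: DataFrame with one row per feature and one column of
    # gain importance per CV fold.
    imp_max = importance_per_fold.max(axis=1)
    imp_min = importance_per_fold.min(axis=1)
    low_max = imp_max <= imp_max.quantile(bottom_frac)
    low_min = imp_min <= imp_min.quantile(bottom_frac)
    # A feature in the bottom 5% of both its best and its worst fold is probably useless.
    return importance_per_fold.index[low_max & low_min].tolist()
```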

Sequential forward selection: this one was good but slow. It's kind of like stepwise regression, but with LGBM it would be slow. I read someone's post saying he used lasso to do SFS and it worked well, so I tried it as well; it didn't work out, probably again because I did not have enough good features.
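The lasso flavour of SFS can be sketched with scikit-learn's SequentialFeatureSelector, using a cheap linear model as a proxy; this assumes numeric, imputed features, and the alpha and target size are illustrative:

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def lasso_sfs(X, y, n_features=50):
    # A linear proxy keeps forward selection fast; plugging in LGBMClassifier
    # here gives the accurate-but-slow version.
    proxy = make_pipeline(StandardScaler(), Lasso(alpha=0.001))
    selector = SequentialFeatureSelector(
        proxy, n_features_to_select=n_features, direction="forward", cv=3
    )
    selector.fit(X, y)
    return X.columns[selector.get_support()].tolist()
```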

CV and Sampling

We used GroupKFold, but we split the dataset into 5 equal-length groups instead of splitting by month.
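Concretely, something like this gives 5 contiguous, equal-length time slices as CV groups (assuming the data is ordered by TransactionDT):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def time_group_folds(df, n_splits=5, dt_col="TransactionDT"):
    # Rank rows by time, cut the ranking into n equal-length chunks and use the
    # chunk id as the group, so each fold holds out one contiguous slice of time.
    order = df[dt_col].rank(method="first").values
    groups = np.floor((order - 1) / (len(df) / n_splits)).astype(int)
    return list(GroupKFold(n_splits=n_splits).split(df, groups=groups))
```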

We did negative downsampling to speed up training.
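A minimal version of that downsampling step (the kept fraction of negatives is a tunable choice):

```python
import pandas as pd

def negative_downsample(train, target_col="isFraud", neg_frac=0.2, seed=42):
    # Keep every fraud row, but only a fraction of the much more numerous non-fraud rows.
    pos = train[train[target_col] == 1]
    neg = train[train[target_col] == 0].sample(frac=neg_frac, random_state=seed)
    return pd.concat([pos, neg]).sort_index()
```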

In the later stage, since we were worried about the spike in December, we oversampled from December.

Even later, we split the first part of the CV (the December part) into 2 parts and oversampled each of them once, which made it possible to have a pseudo 6-fold data set that heavily weights December but keeps a somewhat equal fold size (I thought this was brilliant LOL), and this boosted performance a little bit. The thinking here is that December lies far into the private set (also, the OOF performance on this fold was the worst).

After the competition ended I saw the low ranking and thought I had probably overfitted a lot. However, I then did a linear regression of our CV scores vs. private board scores, and it became clear that our CV was really robust (the t-stat was significant and the trend was clear). I guess this was the only part we are super proud of doing well. So the sad story is we underfitted. I think this has to do with the finance background of my friends and me, where overfitting is the big disaster. (By the way, in finance you never do things like count encoding; that's data leakage!!! That's forward-looking!!! These are WRONG things to do!!!)
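The consistency check itself is tiny; one (CV score, private score) pair per submitted model goes into a simple regression:

```python
from scipy.stats import linregress

def cv_lb_consistency(cv_scores, private_scores):
    # One CV score and one private LB score per submitted model.
    fit = linregress(cv_scores, private_scores)
    # A clearly positive slope with a small p-value means local CV tracked the private LB.
    return fit.slope, fit.rvalue, fit.pvalue
```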

We did some pseudo labeling, which was a great idea and helped boost performance and probably reduce overfitting. We tried to focus the PL sampling more on December.
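A generic pseudo-labeling step looks like this; the confidence thresholds are arbitrary and the December-weighted sampling is not shown:

```python
import numpy as np
import pandas as pd

def pseudo_label(train, test, test_preds, target_col="isFraud", low=0.01, high=0.99):
    # test_preds: model probabilities for the test rows, aligned with `test`.
    test_preds = np.asarray(test_preds)
    mask = (test_preds < low) | (test_preds > high)
    confident = test.loc[mask].copy()
    confident[target_col] = (test_preds[mask] > 0.5).astype(int)
    # Confident test rows join the training set with their predicted label.
    return pd.concat([train, confident], axis=0, ignore_index=True)
```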

Stacking

I built the pipeline in a way that makes it easy to do stacking. Basically, each of my experiments / model runs generates predictions and OOF; later I just import them and append the OOF / predictions to the training / test set. That framework saved me a lot of time (BTW, I did everything in kernels). We ended up with 50 or so single models with different features and parameters. Negative downsampling helped us train a lot of models quickly and reliably. The idea that you should save your OOF to do stacking / blending and check your CV score was a new concept to me, and I felt great about it. It also proved to be very useful, and in the end the reliable CV / private correlation supports this. Maybe that boosted my confidence too much, and I should have been more aggressive in pushing new / more features.
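The skeleton of that pipeline is roughly the following sketch, with LightGBM standing in for whatever single model a given experiment used:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

def run_experiment(name, X, y, X_test, folds, params=None):
    # Every experiment saves its out-of-fold predictions and test predictions, so
    # later runs can simply append these files as extra columns for stacking.
    oof = np.zeros(len(X))
    test_pred = np.zeros(len(X_test))
    for tr_idx, va_idx in folds:
        model = lgb.LGBMClassifier(**(params or {}))
        model.fit(X.iloc[tr_idx], y.iloc[tr_idx])
        oof[va_idx] = model.predict_proba(X.iloc[va_idx])[:, 1]
        test_pred += model.predict_proba(X_test)[:, 1] / len(folds)

    print(name, "OOF AUC:", roc_auc_score(y, oof))
    pd.DataFrame({f"oof_{name}": oof}).to_csv(f"oof_{name}.csv", index=False)
    pd.DataFrame({f"pred_{name}": test_pred}).to_csv(f"pred_{name}.csv", index=False)
    return oof, test_pred
```

The stacker then reads the saved oof_* / pred_* files, appends them as columns to the train / test frames, and is trained with the same CV scheme.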

After the competition finished, I kind of regret not using Konstantin's code to generate OOF for his Internal-Blend. At least I should have thought about it and tried; then I might have had a chance to get a bronze. But … this is definitely overfitting / look-ahead bias talking. Back when I was still in the competition, I was so confident that I had done it the correct way and that the shake-up would be big. I might have been too blind in trusting our own CV and ignored some of the simple things that could have boosted our score.

Parameter Tuning

We just did some random (human) jumps in the parameter space and also used hyperopt to run a little bit of parameter tuning, which helped boost the performance a little. I should have spent more time on this.
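A typical hyperopt setup for this kind of tuning looks like the following; the search space is illustrative, and X / y stand for the training features and target:

```python
import lightgbm as lgb
from hyperopt import Trials, fmin, hp, tpe
from sklearn.model_selection import cross_val_score

space = {
    "num_leaves":        hp.quniform("num_leaves", 31, 255, 1),
    "learning_rate":     hp.loguniform("learning_rate", -4.0, -1.5),
    "colsample_bytree":  hp.uniform("colsample_bytree", 0.3, 0.9),
    "min_child_samples": hp.quniform("min_child_samples", 20, 200, 1),
}

def objective(params):
    model = lgb.LGBMClassifier(
        n_estimators=300,
        num_leaves=int(params["num_leaves"]),
        learning_rate=params["learning_rate"],
        colsample_bytree=params["colsample_bytree"],
        min_child_samples=int(params["min_child_samples"]),
    )
    auc = cross_val_score(model, X, y, scoring="roc_auc", cv=3).mean()
    return -auc  # hyperopt minimizes, so negate the AUC

best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=Trials())
```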

Submission Strategy

Like I said, I should have at least tried to run Konstantin's (relatively reliable) solutions and generate OOF to see if I could blend / stack a better result. I think that was a big, big mistake. (Probably also some other great single-model kernels; I should have done the same: generate OOF and try to blend a better version out of them.) But there is no way I can travel back in time and pull off that kind of magic. I can only say that next time I'll definitely try that kind of thing and be less overconfident, and probably blend in some relatively reliable submissions as a diversifying choice. Next time I should spend more time thinking and carefully planning this.