Yuanchen He on finishing third in the Melbourne University competition | by Kaggle Team | Kaggle Blog
I’m Yuanchen He, a senior engineer at McAfee Labs. I have been working on large-scale data analysis and classification modeling for network security problems.
Many thanks to Kaggle for organizing this competition, and congratulations to the winners! I enjoyed it and learned a lot from working on this challenging data and from reading the winners’ posts. I’m sorry I didn’t find free time last week to write this report.
The data came with a lot of categorical features, many with a high number of distinct values. At the very beginning, I removed useless features (with weka.filters.unsupervised.attribute.RemoveUseless -M 99.0) and removed the features with almost 100% missing values. After that, I transformed each categorical feature into a group of binary features, each one a yes or no on a specific value. I also generated 4 quarter features and 12 month features from startdate, and generated binary indicator features for missing values. The binary features, date-based features, and indicator features, together with the other numerical features (after simply filling missing values with the mean), were fed into the R randomForest classifier for recursive feature elimination (RFE). With that I got 94.9x on the leaderboard. I kept tuning along these lines, but the accuracy could not be improved further. I then started to suspect that some information was being lost during feature transformation and feature selection.
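The transformations in this first approach can be sketched roughly as follows. This is only an illustration in plain Python, not the code used in the competition (that work was done with Weka and R); the function names and the use of `None` to mark a missing value are assumptions made for the example.

```python
from datetime import date

def one_hot(values):
    """Turn one categorical column into binary columns, one per distinct value."""
    levels = sorted({v for v in values if v is not None})
    return {f"is_{lv}": [1 if v == lv else 0 for v in values] for lv in levels}

def date_indicators(dates):
    """Build 4 quarter indicators and 12 month indicators from a date column."""
    cols = {}
    for q in range(1, 5):
        cols[f"quarter_{q}"] = [
            1 if d is not None and (d.month - 1) // 3 + 1 == q else 0 for d in dates
        ]
    for m in range(1, 13):
        cols[f"month_{m}"] = [1 if d is not None and d.month == m else 0 for d in dates]
    return cols

def missing_indicator(values):
    """Binary indicator: 1 where the value is missing (None), else 0."""
    return [1 if v is None else 0 for v in values]

def fill_with_mean(values):
    """Replace missing numeric values with the mean of the observed ones."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]
```

For a numeric column, `missing_indicator` would be computed before `fill_with_mean`, so the model still sees which values were originally absent.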
So I tried building classifiers directly on the categorical features, without transforming them into binary features. A simple frequency-based pre-filtering was applied: for a raw categorical feature, all values that appeared in fewer than 10 instances in the data were merged into a special common value “-1”. However, R randomForest cannot accept a categorical feature with more than 32 values, so I had to split each categorical feature again into “sub features”, each with no more than 32 values. To split the values into sub features, I first sorted them by information gain; the top 31 values were then assigned to sub feature 1, the next 31 values to sub feature 2, and so on. With this feature transformation method I got 94.6x on the leaderboard.
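Here is a rough sketch of the rare-value merging and the split into sub features, again in plain Python rather than the R used in the competition. Two details are assumptions of this example, not something stated above: the information gain of a value is taken to be the gain of the binary split "feature equals this value", and every value outside a sub feature's 31 slots is mapped to a catch-all level named `"other"`, which gives at most 32 levels per sub feature.

```python
import math
from collections import Counter

def merge_rare(values, min_count=10, rare="-1"):
    """Merge values seen in fewer than min_count instances into one common value."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else rare for v in values]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def value_gain(values, labels, value):
    """Assumed scoring: information gain of the binary split 'feature == value'."""
    inside = [y for v, y in zip(values, labels) if v == value]
    outside = [y for v, y in zip(values, labels) if v != value]
    n = len(labels)
    return (entropy(labels)
            - len(inside) / n * entropy(inside)
            - len(outside) / n * entropy(outside))

def split_into_sub_features(values, labels, chunk=31, other="other"):
    """Rank values by gain, then give each block of `chunk` values its own column."""
    # Sort alphabetically first so that ties in gain are broken deterministically.
    ranked = sorted(sorted(set(values)),
                    key=lambda v: value_gain(values, labels, v),
                    reverse=True)
    subs = []
    for start in range(0, len(ranked), chunk):
        keep = set(ranked[start:start + chunk])
        subs.append([v if v in keep else other for v in values])
    return subs
```

A feature with 40 distinct values would produce two sub features under these assumptions: the first with 31 named values plus `"other"`, the second with the remaining 9 plus `"other"`.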
Next, I tried simply combining the top features from the two methods above. The randomForest classifiers on the combined feature sets improved the leaderboard AUC to 95.1x-95.3x, depending on the instances used for training. The best classifiers came from training only on instances after 0606, only on instances after 0612, and only on instances after 0706. Finally, I saw that the predictions from these classifiers were different enough that it was worth taking a majority vote over them, and I got my best leaderboard AUC of 95.555, which generalized to the other 75% of the test instances with a final AUC of 96.1051.
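The exact voting scheme is not spelled out above, so the following Python sketch shows two common ways to combine several classifiers' outputs and should be read as an illustration only. It assumes each classifier returns one score between 0 and 1 per test instance; the function names and the 0.5 threshold are choices made for this example.

```python
def majority_vote(scores_per_model, threshold=0.5):
    """Hard vote: label an instance 1 if more than half of the models score it >= threshold."""
    n_models = len(scores_per_model)
    n_rows = len(scores_per_model[0])
    result = []
    for i in range(n_rows):
        votes = sum(1 for scores in scores_per_model if scores[i] >= threshold)
        result.append(1 if 2 * votes > n_models else 0)
    return result

def average_scores(scores_per_model):
    """Soft vote: average the models' scores, which keeps a ranking usable for AUC."""
    n_models = len(scores_per_model)
    return [sum(col) / n_models for col in zip(*scores_per_model)]
```

With three models, one per training window, `average_scores` keeps a graded score for every instance, whereas `majority_vote` collapses it to a 0/1 label. Since AUC depends on how instances are ranked, a graded combination is usually the more natural fit for that metric.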