Quan Sun on finishing in second place in Predict Grant Applications | by Kaggle Team | Kaggle Blog
I’m a PhD student in the Machine Learning Group at the University of Waikato, Hamilton, New Zealand. I’m also a part-time software developer for 11Ants Analytics. My PhD research focuses on meta-learning and the full model selection problem. In 2009 and 2010, I participated in the UCSD/FICO data mining contests.
What I tried and what ended up working
I tried many different algorithms (mainly Weka and Matlab implementations) and feature sets across almost 80 submissions. This report briefly introduces the two approaches that worked for this competition. Each of them is discussed in order of submission.
After the first 10 test submissions, I realised that concept drift was occurring between 2007 and 2008. Success rates declined gradually from 2007. Also, the information page of the competition states that “In Australia, success rates have fallen to 20–25 per cent…”. To me, this probably means the decision rules for grant applications were somehow changed during 2007 and 2008. Here are some consequences I could think of, including but not limited to:
- Overall success rates will continue to drop
- Applications that succeeded in 2005/2006 could be declined in 2007/2008, and likewise for 2009/2010
- Success patterns becoming “more” random
- Decision rules for 2009/2010 will be close to those for 2007/2008, compared with the rules for 2006 and earlier
Based on the information and assumptions above, I decided to mainly use data points from 2007 and 2008 for training my classifiers, which turned out to be a reasonable choice.
Approach A: Ensemble Selection with a transformed feature set (used in the first 20 submissions)
Data engineering/transformation part:
Dates to numeric: year, month, day as numbers
RFCD.Code.X (X=1 to 5)
Person.ID.X (X=1 to 15)
Number.of.Grant.X (X=1 to 15)
Total number of successful/unsuccessful grants per application
Publications A*, A, B, C
Total number of A*, A, B, C publications per application
Total number of CHIEF_INVESTIGATORs, PRINCIPAL_SUPERVISORs, DELEGATED_RESEARCHERs, EXT_CHIEF_INVESTIGATORs per application
Total number of Asia_Pacific, Australia, Great_Britain, Western_Europe, Eastern_Europe, North_America, New_Zealand, Middle_East_and_Africa born per application
Total number of PhDs per application
Total number of people who have been at the University for more than 5 years
After all these transformations were done, I also had a Java program to transform every nominal attribute to its corresponding frequency. The frequency counting is based on all the available data points. So, the final feature set consists of the original features, the transformed features and the frequencies.
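The nominal-to-frequency transformation can be sketched as follows. This is a minimal Python sketch, not the author's actual Java program, and the `Sponsor.Code` toy data is hypothetical:

```python
from collections import Counter

def frequency_encode(rows, column):
    """Replace each nominal value in `column` with its count over all rows.

    `rows` is a list of dicts (one per application); counting uses every
    available data point, as the post describes.
    """
    counts = Counter(row[column] for row in rows)
    return [counts[row[column]] for row in rows]

# Hypothetical toy data: Sponsor.Code as a nominal attribute.
rows = [{"Sponsor.Code": "2B"}, {"Sponsor.Code": "4D"}, {"Sponsor.Code": "2B"}]
print(frequency_encode(rows, "Sponsor.Code"))  # [2, 1, 2]
```

The frequency columns are appended alongside the originals, so models can use either representation.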
My main method is called Ensemble Selection, originally proposed by Rich Caruana and co-authors at Cornell University (http://portal.acm.org/citation.cfm?id=1015432). The following pseudocode demonstrates the basic idea of Ensemble Selection:
0. Split the data into two parts: the build set and the hillclimb set.
1. Start with the empty ensemble.
2. Add to the ensemble the model (trained on the “build” set) in the library that maximizes the ensemble’s performance on the error metric (AUC for this contest) on a “hillclimb” (validation) set.
3. Repeat Step 2 for a fixed number of iterations or until all the models have been used.
4. Return the ensemble from the nested set of ensembles that has maximum performance on the hillclimb (validation) set.
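A minimal Python sketch of this hillclimbing loop, assuming each library model has already been trained on the build set and scored on the hillclimb set, and assuming selection with replacement (models may be picked more than once, as in Caruana et al.'s variant):

```python
def auc(labels, scores):
    """AUC via the Wilcoxon-Mann-Whitney statistic: the probability that a
    random positive is scored above a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ensemble_selection(library, y_hill, iterations):
    """Greedy forward selection over a model library (steps 1-4 above).

    `library` maps model name -> its scores on the hillclimb set (each model
    was trained on the build set). Returns the nested ensemble with the best
    hillclimb AUC and that AUC.
    """
    n = len(y_hill)
    total = [0.0] * n          # running sum of chosen models' scores
    chosen, best_names, best_auc = [], [], -1.0
    for _ in range(iterations):
        # Step 2: pick the model whose addition maximizes ensemble AUC.
        def auc_if_added(name):
            k = len(chosen) + 1
            return auc(y_hill, [(t + s) / k for t, s in zip(total, library[name])])
        pick = max(library, key=auc_if_added)
        chosen.append(pick)
        total = [t + s for t, s in zip(total, library[pick])]
        # Step 4: remember the best nested ensemble seen so far.
        current = auc(y_hill, [t / len(chosen) for t in total])
        if current > best_auc:
            best_auc, best_names = current, list(chosen)
    return best_names, best_auc

# Toy library of two "models" scored on a 4-point hillclimb set
library = {"good": [0.9, 0.8, 0.2, 0.1], "bad": [0.1, 0.2, 0.9, 0.8]}
print(ensemble_selection(library, [1, 1, 0, 0], iterations=3))
```

In practice the ensemble's prediction is the average of the chosen models' scores, so repeatedly selected models effectively get a higher weight.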
Model library used for my Ensemble Selection system:
AdaBoost, LogitBoost, RealAdaBoost, DecisionTable, RotationForest, BayesNet, NaiveBayes; 7 algorithms with different parameters, 28 base classifiers in total.
Build set and hillclimb set for Ensemble Selection:
Setup 1: data points from year 2007 as the “build set”, data points from year 2008 as the “hillclimb set”
Setup 2: data points from 2007/01/01 to 2008/04/30 as the “build set”, data points after 2008/04/30 as the “hillclimb set”
Both setups worked well for the Ensemble Selection approach.
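The second, date-based setup amounts to a simple temporal split. A Python sketch, where the `start` field standing in for each application's start date is an assumption:

```python
from datetime import date

def split_build_hillclimb(rows, cutoff=date(2008, 4, 30)):
    """Split data points into a build set (on or before the cutoff) and a
    hillclimb set (after the cutoff), as in the second setup above."""
    build = [r for r in rows if r["start"] <= cutoff]
    hill = [r for r in rows if r["start"] > cutoff]
    return build, hill

# Hypothetical rows: one before the cutoff, one after
rows = [{"start": date(2007, 6, 1)}, {"start": date(2008, 9, 1)}]
build, hill = split_build_hillclimb(rows)
print(len(build), len(hill))  # 1 1
```

Splitting by time rather than at random keeps the hillclimb set close in distribution to the 2009/2010 test data, which matters under the concept drift described earlier.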
In summary, the final system for Approach A consists of three main components:
Data points from 2007 for training and 2008 for hillclimbing.
Ensemble Selection, number of bags: 10, hillclimb iterations = size of the model library.
352 features in total.
Leaderboard AUC: 0.956X; best final test set AUC: 0.961X
From submission 20 to the end of the competition, the following features were added to the Approach A feature set:
Number of missing values
Number of non-missing values
Missing value rate
“Contract.Value.Band” transformed to numeric values
Average contract value
RFCD.CODE mean, sum, max, min, standard deviation per application
RFCD.PCT mean, sum, max, min, std per application
SEO.CODE mean, sum, max, min, std per application
SEO.PCT mean, sum, max, min, std per application
Successful.grant mean, sum, max, min, std per application
Unsuccessful.grant mean, sum, max, min, std per application
Successful.grant mean average per application
Successful.grant sum average per application
All of the above features for the first three applicants
All of the above features for Unsuccessful.grant
Success rate of applicant 1, applicant 2, and applicant 3 per application
Success rate of all applicants per application
Mean, max, std success rates of all applicants per application
Number of publications mean, sum, max, min, std per application
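These per-application aggregates can be sketched as follows. This Python sketch makes two assumptions the post does not specify: population standard deviation, and zero-filling when every applicant field is missing:

```python
import statistics

def row_stats(values):
    """Per-application (row-based) aggregates over one applicant-level field.

    `values` holds that field for up to 15 applicants on one application;
    missing entries are passed in as None and skipped.
    """
    xs = [v for v in values if v is not None]
    if not xs:
        return {"mean": 0.0, "sum": 0.0, "max": 0.0, "min": 0.0, "std": 0.0}
    return {
        "mean": statistics.mean(xs),
        "sum": sum(xs),
        "max": max(xs),
        "min": min(xs),
        "std": statistics.pstdev(xs),  # population std (an assumption)
    }

# e.g. successful-grant counts for three applicant slots on one application
print(row_stats([2, 4, None]))  # mean 3, sum 6, max 4, min 2, std 1.0
```

Each field (RFCD.CODE, SEO.PCT, publication counts, ...) is run through the same aggregator, producing five new columns per field.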
Apart from the frequency counting described in Approach A, only “row-based (per-application)” statistical features were gradually introduced into my system during the competition, because I believed that, compared with “time-based/column-based” features, “row-based” statistical features would reduce the chance of overfitting.
Also, the following algorithms (with different/varying parameter settings) were gradually added to the model library during the competition:
Bagging with trees
RandomCommittee with Random Trees
Approach B: Rotation Forest with the feature set from Approach A
I tried using only Rotation Forest (http://www.computer.org/portal/web/csdl/doi/10.1109/TPAMI.2006.211) with the following setup:
Base classifier: M5P model tree (Weka default is J48)
Rotation method: Random Projection with Gaussian distribution (Weka default is PCA)
The Rotation Forest classifier was trained on data points from 2007 and 2008 with the feature set from Approach A. Here are the results:
Leaderboard AUC: 0.947X; final test set AUC: 0.962X
Averaging the two approaches improved the final test set AUC to 0.963X.
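A minimal sketch of that blend, assuming a plain unweighted mean of the two approaches' per-application scores (the post does not say whether scores or ranks were averaged):

```python
def average_scores(a, b):
    """Blend two models' scores for the same applications by a simple mean."""
    return [(x + y) / 2 for x, y in zip(a, b)]

# Hypothetical scores from Approach A and Approach B for two applications
print(average_scores([0.5, 0.25], [0.75, 0.25]))  # [0.625, 0.25]
```

Because AUC depends only on the ordering of scores, even this naive average can help when the two models make different ranking mistakes.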
What tools I used
Software/tools used for modelling and data analysis:
Weka 3.7.1 for modelling (with my own improved version of the Ensemble Selection algorithm)
Matlab and SAS for data visualization and statistical analysis
Java as the main programming language for this project
Most experiments were run on my home PC: AMD 6-core, 16 GB RAM, running Windows.