Jeremy Howard on winning the Predict Grant Applications Competition | by Kaggle Team | Kaggle Blog
Because I have recently started employment with Kaggle, I am not eligible to win any prizes. That means the prize-winner for this comp is Quan Sun (team ‘student1’)! Congratulations!
My approach to this competition was to first analyze the data in Excel pivottables. I looked for groups which had high or low application success rates. In this way, I found a large number of strong predictors, including by date (New Year's Day is a strong predictor, as are applications processed on a Sunday), and for many fields a null value was highly predictive.
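The pivot-table step can be approximated in a few lines of Python. This is only a sketch, not the analysis actually done in Excel: the record layout (`start_date`, `sponsor`, `success`) and the toy rows are hypothetical, and `success_rate_by_group` is a helper invented here for illustration.

```python
from collections import defaultdict
from datetime import date


def success_rate_by_group(rows, key):
    """Return {group: (n_applications, success_rate)} for a grouping function."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n_applications, n_successes]
    for row in rows:
        group = key(row)
        counts[group][0] += 1
        counts[group][1] += row["success"]
    return {g: (n, s / n) for g, (n, s) in counts.items()}


# Hypothetical toy records; the field names are assumptions, not the competition schema.
applications = [
    {"start_date": date(2006, 1, 1), "sponsor": None, "success": 1},
    {"start_date": date(2006, 1, 1), "sponsor": "A", "success": 1},
    {"start_date": date(2006, 3, 14), "sponsor": "A", "success": 0},
    {"start_date": date(2006, 3, 19), "sponsor": None, "success": 1},
    {"start_date": date(2006, 5, 2), "sponsor": "B", "success": 0},
]

# One "pivot" per candidate predictor: New Year's Day, Sunday, and a null field.
by_new_year = success_rate_by_group(
    applications, lambda r: (r["start_date"].month, r["start_date"].day) == (1, 1)
)
by_sunday = success_rate_by_group(
    applications, lambda r: r["start_date"].weekday() == 6  # Monday is 0, Sunday is 6
)
by_null_sponsor = success_rate_by_group(applications, lambda r: r["sponsor"] is None)
```

Groups whose success rate sits far from the overall rate, and which contain enough applications to be trusted, are the candidates worth turning into features.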
I then used C# to normalize the data into Grants and Persons objects, and constructed a dataset for modeling including these features: CatCode, NumPerPerson, PersonId, NumOnDate, AnyHasPhd, Country, Dept, DayOfWeek, HasPhd, IsNY, Month, NoClass, NoSpons, RFCD, Role, SEO, Sponsor, ValueBand, HasID, AnyHasID, AnyHasSucc, HasSucc, People.Count, AStarPapers, APapers, BPapers, CPapers, Papers, MaxAStarPapers, MaxCPapers, MaxPapers, NumSucc, NumUnsucc, MinNumSucc, MinNumUnsucc, PctRFCD, PctSEO, MaxYearBirth, MinYearUni, YearBirth, YearUni.
Most of these are fairly obvious as to what they mean. Field names starting with ‘Any’ are true if any person attached to the grant has that feature (e.g. ‘AnyHasPhd’). For most fields I had one predictor that just looks at person 1 (e.g. ‘APapers’ is the number of A papers from person 1), and one for the maximum over all people in the application (e.g. ‘MaxAPapers’).
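To make the naming convention concrete, here is a minimal Python sketch of how the person-1, ‘Any’ and ‘Max’ variants of a feature could be derived for one grant. The input layout (a list of dicts with `has_phd` and `a_papers` keys) and the function name `grant_features` are assumptions made for illustration; the original features were built in C#.

```python
def grant_features(people):
    """Build person-1, 'Any' and 'Max' style features for a single grant.

    `people` is a non-empty list of dicts in application order,
    so people[0] is person 1. The dict keys are hypothetical.
    """
    first = people[0]
    return {
        "HasPhd": first["has_phd"],                         # person 1 only
        "AnyHasPhd": any(p["has_phd"] for p in people),     # true if anyone qualifies
        "APapers": first["a_papers"],                       # person 1 only
        "MaxAPapers": max(p["a_papers"] for p in people),   # best value across the team
    }


features = grant_features([
    {"has_phd": False, "a_papers": 2},
    {"has_phd": True, "a_papers": 5},
])
```

In this toy grant, person 1 has no PhD and two A papers, while the team-level features pick up the second applicant's PhD and five A papers.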
Once I had created these features, I used a generalization of the random forest algorithm to build a model. I’ll try to write some detail about how this algorithm works when I have more time, but really, the difference between it and a regular random forest is not that great.
I pre-processed the data before running it through the model by grouping up small groups in categorical variables, and replacing each continuous column that had null values with 2 columns (one containing a binary predictor that is true only where the continuous column is null, the other containing the original column, with nulls replaced by the median). Other than the Excel pivottables at the start, all the pre-processing and modelling was done in C#, using libraries I developed during this competition. I hope to document and release these libraries at some point, perhaps after tuning them in future comps.
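Both pre-processing steps are simple enough to sketch in Python. This is an illustration under assumptions, not the C# library code: the function names, the `min_count` threshold and the `"Other"` label are all invented here, since the post does not say how small a group had to be before it was merged.

```python
from collections import Counter
from statistics import median


def group_rare_categories(values, min_count, other="Other"):
    """Replace categories seen fewer than `min_count` times with a shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def split_null_column(values):
    """Split a continuous column with nulls into (is_null flags, median-filled values).

    Raises statistics.StatisticsError if every value is None.
    """
    fill = median(v for v in values if v is not None)
    is_null = [v is None for v in values]
    filled = [fill if v is None else v for v in values]
    return is_null, filled


depts = group_rare_categories(["A", "A", "B", "C", "A"], min_count=2)
year_birth_is_null, year_birth = split_null_column([1960, None, 1975, 1970])
```

Keeping the null flag as its own column lets the model use "this value was missing" as a predictor in its own right, which matters here because null values were found to be highly predictive for many fields.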