Max Lin on finishing second in the R Challenge | by Kaggle Team | Kaggle Blog


Kaggle Team

I participated within the R package recommendation engine competition on Kaggle for 2 causes. First, I exploit R lots. I can not be taught statistics with out R. This competitors is my probability to offer again to the group a R bundle advice engine. Second, throughout my day job as an engineer behind a machine studying service within the cloud, product advice is without doubt one of the hottest purposes our early adopters need to use the net service for. This competitors is my alternative to face in customers’ footwear and determine their ache factors related to constructing a advice system.

I deal with R bundle advice as a binary classification activity: a classifier $latex f(u, p)$ takes as enter a person $latex u$ and a bundle $latex p$, and predicts whether or not the person will set up the bundle or not. My remaining submission (staff title: Record Me Men) combines 4 classifiers, labeled as massive dots in Determine 1. The 4 classifiers share the identical type of goal perform that minimizes loss $latex L$ plus regularizers $latex R$:

$latex J(theta) = sum_i L(y_i, f(u_i, p_i; theta)) + lambda R(theta)$

, the place $latex f(u, v)$ is a classification mannequin, $latex theta$ is the parameters of the classification mannequin, $latex y_i$, $latex u_i$, $latex p_i$ are the category label (bundle put in or not), person and bundle of the i-th coaching instance, respectively. $latex lambda$ controls the penalty from the regularizing perform, and is chosen utilizing cross validation.

Mannequin 1: Baseline. Mannequin 1 is example_model_2.R that the competitors organizer supplies as a baseline. On this mannequin, a person is encoded as dummy variables $latex mathbf{u}$, and a bundle is represented by seven options $latex mathbf{p} = {p_1, dots, p_7}$ (e.g, the logarithmic rely of dependency. See the R script for the entire record). The classification mannequin is a linear mixture of person dummy variables and bundle options with weight parameters, $latex f(u, p) = mathbf{theta_u}^T mathbf{u} + mathbf{theta_p}^T mathbf{p}$. The parameters are $latex theta_u$ and $latex theta_p$ (and intercept). The loss perform $latex L$ is unfavourable logistic log chance. There isn’t any regularizer within the mannequin.

The primary mannequin establishes a robust baseline, reaching AUC of ~0.94, labeled as M1b2 in Determine 1. A variant of this mannequin that omits the person function (see example_model_1.R, additionally supplied by the competition organizers) achieves noticeable decrease AUC of 0.81 (not proven in Determine 1). This implies that some customers usually tend to set up R packages than others. Within the subsequent mannequin, I’ll discover a classification mannequin that includes not solely person variations but in addition bundle variations.

Mannequin 2: Latent issue Fashions. In distinction to Mannequin 1 with options derived from metadata, I concentrate on the user-package ranking matrix. Mannequin 2 consists of two elements: baseline estimates and latent elements. Baseline estimates are linear combos of three elements: world (one parameter $latex mu$), person (one parameter per person $latex mu_u$), and bundle (one parameter per bundle $latex mu_p$). For latent elements, I assumes that there are Ok latent elements for every person, $latex mathbf{beta}_u$ and every bundle $latex mathbf{beta}_p$, and the interior product of those two elements captures the interplay between a person and a bundle. The classifier mannequin in Mannequin 2 is $latex f(u, p) = mu + mu_u + mu_p + mathbf{beta}_u^T mathbf{beta}_p$. This sort of latent issue mannequin, also called “singular worth decomposition”, has been reported with nice success in earlier collaborative filtering research and on the Netlifx Prize. I select an exponential loss perform for Mannequin 2, $latex L(y, f(u, p)) = exp(- y f(u, p))$,

the place $latex y_i in {1, -1}$. I select exponential loss over squared loss as a result of 1) exponential loss matches 0–1 loss higher than squared loss 2) exponential loss is differentiable. I apply L2 regularizers, $latex R(cdot) = ||mu_u||² + ||mu_p||² + ||beta_u||² + ||beta_p||²$. I decrease the target perform utilizing stochastic gradient descent. The variety of latent elements, Ok, is chosen by cross validation.

The latent issue mannequin works very nicely on the R bundle advice information, reaching AUC of ~0.97 (See the M2 household in Determine 1). I plot the efficiency of 5 latent issue fashions with completely different Ks, starting from 10 to 50, labeled as M2k10, M2k20, …, M2K50 in Determine 1. As Ok will increase, the latent issue mannequin turns into extra expressive and match the info higher, leading to greater AUC.

Mannequin 3: Bundle LDA matter. In Mannequin 3, I discover new options not utilized in Mannequin 1 and a pair of. The brand new function is a bundle’s matter based mostly LDA (Latent Dirichlet Allocation). The LDA matter of a bundle is inferred from the phrase counts of its man pages. I exploit the topics.csv, kindly ready by the competition organizers, to map a R bundle to one of many 25 LDA matters (See topic_models.R for particulars on operating LDA, supplied by the competition organizer).

The classification mannequin of Mannequin Three is much like Mannequin 2. Mannequin Three replaces person elements in Mannequin 2 with T LDA elements weights $latex mathbf{t}_u$ (T=25 right here), and replaces bundle elements in Mannequin Three with T dummy variables $latex mathbf{p}$. The classification mannequin is $latex f(u, v) = mu + mu_u + mu_p + mathbf{t}_u^T mathbf{p}$. The loss perform is identical as Mannequin 2, and the regularizer is L2, $latex R(cdot) = ||mu_u||² + ||mu_p||² + ||mathbf{t}_u||² $.

Mannequin Three achieves higher AUC than Mannequin 1, leading to AUC of ~0.97 (labeled as M3u in Determine 1). Previous to M3u, I discover an easier mannequin that customers share the identical weights $latex mathbf{t}$. The parameter house turns into smaller, however Mannequin Three with shared parameters (labeled as M3b in Determine 1) performs barely worse than Mannequin Three with user-specific parameters. This is sensible as a result of it’s unlikely each R person shares the identical curiosity in the identical set of LDA matters.

Mannequin 4: Bundle activity view. R packages are organized into activity views (e.g., high-performance computing, survival evaluation, and time sequence. See the page for the entire record of views). It’s doable {that a} person serious about a specific activity would set up not only one however many, even all, packages in the identical activity view (e.g., utilizing set up.views() R perform). We may enhance advice if we all know how a lot a person is serious about a specific activity view. I exploit the views.csv, supplied by the competition organizers, to map a bundle to its activity view.

The classification mannequin of Mannequin Four is much like Mannequin 3, besides that LDA matters in Mannequin Three are changed with activity views (T=29, 28 activity views plus 1 unknown view). The efficiency of Mannequin Four is labeled as M4u in Determine 1. Mannequin Four performs nicely and higher than Mannequin 1. Equally, I experiment with a variant of Mannequin Four that every one customers share the identical activity view parameters $latex mathbf{t}$, labeled as M4b i n Determine 1, and it performs worse than Mannequin Four with per-user parameters. The discovering is in line with Mannequin 3, and R customers, no less than on the R advice information set, appear to have completely different preferences in activity views.

Ensemble studying. I mix 4 classifiers utilizing logistic regression. To gather out-of-sample coaching information for the ensemble learner, I divide the coaching information into three units: 80%, 10%, and 10%. I practice particular person classifier on the 80%. I then apply the educated classifiers on the 20%, and mix the scores from particular person classifiers as coaching information for the ensemble learner. Each particular person classifiers and the combiner are evaluated on the final 10% information set (as these proven in Determine 1 and Determine 2). After evaluating one fold, I shift information and conduct one other folds till all information are used, leading to a complete of 10 ensemble learners. The output of those 10 ensemble learners are averaged to supply remaining predictions.

The ensemble studying works very nicely, as proven in Determine 2. Combining M1 and M2 achieves AUC of 0.9721, which is greater than both M1 0.94 or M2 0.9702 alone. Once I mix extra fashions and embrace M3 and M4, the efficiency will get even higher. The submission of mixing 4 fashions achieves greatest efficiency on the coaching set amongst all fashions. On the take a look at set the ultimate mannequin achieves AUC of 0.983289 (after post-processing, see beneath), the very best amongst my submissions. The success of ensemble coaching is probably as a result of the person fashions are sturdy performers, and fashions are numerous with completely different classification fashions and options such that they complement one another.

Put up-processing. I apply two post-processing steps earlier than submitting every entry. First, the label and (person, bundle) affiliation on the coaching set is memorized. For any (person, bundle) pairs within the take a look at set which might be already seen within the coaching set, their labels on the coaching set are recalled and used as a substitute. Second, I assume when a person set up a bundle P, the person additionally installs the packages that P is dependent upon. I file all packages a person installs on the coaching set in addition to their dependent packages. Given a pair (U, P) on the take a look at set, if the bundle P is discovered within the set of the packages on which the person U’ put in packages rely, I ignore the prediction and output 1.Zero as a substitute. Though the dependency assumption doesn’t utterly maintain true (there are recognized contradictory examples on the coaching set), the 2 filters mixed will increase absolute AUC ~0.004, which issues in a contest with razor-thin margins between main submissions.

I develop the classification coaching applications for Mannequin 2, 3, and Four in Python. I exploit R to discover information, run logistic regression (glm() within the stats library), calculate AUC (efficiency() within the ROCR library), and plot outcomes (ggplot() within the ggplot2 library). All applications are developed and run on a commodity PC with dual-core 2GHz CPU and 4G RAM. I make the supply codes out there at github.

Though my submissions are based mostly on solely the info the competition organizers present and easy linear fashions, not at all you must cease right here. There are numerous instructions price exploring, for examples, crawling CRAN to research the wealthy graphs of relies upon, imports, suggests, and enhances between R packages. In an open-end contest like this, the sky is the restrict.

Max Lin is a software program engineer with Google Analysis in New York Metropolis workplace, and the tech lead of the Google Prediction API. Previous to Google, he printed analysis work in video content material evaluation, sentiment evaluation, machine studying, and cross-lingual info retrieval. He has a PhD in Laptop Science from Carnegie Mellon College. He want to thank the organizers of the R bundle advice engine competitors for his or her exhausting work and efforts in placing this competitors collectively. He participates the competitors as people, and writes all codes in his spare time. Nothing expressed right here might or must be hooked up to his employer.


Source link

Write a comment