Marcin Pionnier on finishing 5th in the RTA competition | by Kaggle Team | Kaggle Blog
I graduated from Warsaw University of Technology with a master's thesis on a text mining topic (intelligent web crawling techniques). I work for a Polish IT consulting company (Sollers Consulting), where I develop and design various insurance industry related solutions (one of them is an insurance fraud detection platform). From time to time I try to compete in data mining contests (Netflix, competitions on Kaggle and tunedit.org); from my perspective it's a very good way to get real data mining experience.
What I tried
As far as I remember, I outlined the premise of the solution at the very beginning: to create separate predictors for each individual loop and time interval. So my solution required me to build 61×10=610 regression models. I experimented with various regression algorithms, but quickly settled on linear regression, because the results were good and the computation time was short. I think the key to getting a fairly good result (especially on public RMSE 🙂 ) was the set of attributes used. I used the following attributes for the linear regression for each individual loop & time interval:
— number of minutes from 0:00 up to the current moment ("now")
— average drive time for the given loop & interval
— loop times for the current moment and some number of historic moments before (the number of time points and the loops varied between the models)
— differences between "neighboring" time moments for the above data: either plain differences or differences transformed with the logistic function (1/(1+e^-difference)). Use of the logistic function gave a jump from a public RMSE of about 198 to 189. The idea to use the sigmoid function here was just my intuition, inspired by the distribution of the differences.
— "saturations" for each loop (except the first two loops). I introduced a simple (and very naive) model of traffic congestion:
If the speed at a given loop is up to 40 km/h, the saturation is 1;
If the difference between the previous loop and the given loop is more than 5 km/h, it is assumed that this road part is partially saturated: there is a segment that is moving at 30 km/h and a second segment with the same speed as in the loop before the given loop. The saturation is derived as the proportion of the first segment to the whole road part. Each loop detector has its minimum value in the RTAData file; after the regression this minimum value was used if the predicted value was less than the minimum.
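The two transformed attribute families above can be sketched as plain Java helpers. This is a minimal sketch under my own assumptions: the thresholds (40 km/h, 5 km/h, 30 km/h) come from the text, but deriving the saturation proportion from a length-weighted mean of the two segment speeds is my reading of the description, not the actual competition code.

```java
public class Features {

    // Logistic transform of a difference between neighboring time moments:
    // 1 / (1 + e^-difference), squashing raw differences into (0, 1).
    static double sigmoidDifference(double previous, double current) {
        double difference = current - previous;
        return 1.0 / (1.0 + Math.exp(-difference));
    }

    // Naive traffic-congestion "saturation" for one road part.
    // speed     : speed at the given loop (km/h)
    // prevSpeed : speed at the loop before it (km/h)
    static double saturation(double speed, double prevSpeed) {
        if (speed <= 40.0) {
            return 1.0;                      // fully saturated
        }
        if (prevSpeed - speed > 5.0) {
            // Partially saturated: one segment crawls at 30 km/h, the rest
            // moves at the previous loop's speed. Assuming the observed loop
            // speed is the length-weighted mean, the congested fraction s
            // solves: speed = s*30 + (1-s)*prevSpeed.
            return (prevSpeed - speed) / (prevSpeed - 30.0);
        }
        return 0.0;                          // free flow
    }
}
```

With this reading, a loop reporting 50 km/h after a 60 km/h loop gets saturation (60-50)/(60-30) = 1/3 of the road part congested.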
I didn't use the historic data at all; I found it useless during the initial tests (maybe too hastily). The only source of data for learning and testing was the RTAData and lengths files (so no weekends, holidays, or weather conditions).
What ended up working
For each of the 610 regression models, the following three models were competing. The models were trained with all data available in RTAData:
Model 1: For all (61) loops: current + 5 time moments before and 5 plain differences (675 attributes),
Model 2: For the 10 loops before, the current loop, and the next 9 loops (where available, or fewer): current + 9 time moments before and 9 plain differences, plus saturations (for the current time moment only) (204 to 404 attributes),
Model 3: For the 10 loops before, the current loop, and the next 9 loops (where available, or fewer): current + 9 time moments before and 9 sigmoided differences, plus saturations (for the current time moment only) (204 to 404 attributes).
The model with the least RMSE computed on the train file was chosen for each particular loop. It's not a great strategy; however, I thought that in general linear regression was resistant to overfitting (it's not true: as the number of variables grows, more variance can be explained, which is what I've learnt).
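The per-loop selection between the competing models can be sketched as follows. The model predictions and RMSE computation here are hypothetical stand-ins, not the actual Weka-based code:

```java
import java.util.List;

public class ModelSelection {

    // Root mean squared error of predictions against targets.
    static double rmse(double[] predictions, double[] targets) {
        double sumSq = 0.0;
        for (int i = 0; i < targets.length; i++) {
            double err = predictions[i] - targets[i];
            sumSq += err * err;
        }
        return Math.sqrt(sumSq / targets.length);
    }

    // Pick the index of the candidate with the least RMSE on the train file.
    // (As noted above, train RMSE is a weak criterion: a model with more
    // attributes can win simply by overfitting the training data.)
    static int bestModel(List<double[]> candidatePredictions, double[] targets) {
        int best = 0;
        double bestRmse = Double.POSITIVE_INFINITY;
        for (int i = 0; i < candidatePredictions.size(); i++) {
            double r = rmse(candidatePredictions.get(i), targets);
            if (r < bestRmse) {
                bestRmse = r;
                best = i;
            }
        }
        return best;
    }
}
```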
This strategy gave me a public RMSE of 189.3.
I also added a 4th model, which I arbitrarily used just for the 15 and 30 minute predictions:
Model 4: For all (61) loops: current + 5 time moments before and 5 sigmoided differences, plus saturations (for the current time moment only) (614 attributes). This change gave me a public result of 188.6.
What's interesting is that the best private solution (however not chosen by me, since I relied too much on the public results) was 190.819 (public 197.979). It was just model 3 described above combined with model 5 (model 5 was arbitrarily used for the 15, 30, 45, 60, and 90 minute predictions, with model 3 for the rest):
Model 5: like model 3, but the loop times are also "sigmoided", not only the differences.
What tools I used
My solution is written as a Java application with Weka linked as a library (as always when I try to compete in data mining contests). Since linear regression requires solving a matrix equation (in this case quite a huge one), the memory allocated by the program was becoming a more and more important issue (3.5 GB for one thread). At the end of the competition I was using a computer with 4 processors and 12 GB of RAM, with 3 separate threads building and testing the models. The whole computation for my last attempts took about 48 hours.
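To illustrate why memory scales with the attribute count, here is a minimal self-contained sketch of fitting least squares by solving the normal equations (X^T X) w = X^T y with Gaussian elimination. This is not Weka's internal implementation, just an illustration: for p attributes the method materializes a p×p matrix, so models with hundreds of attributes per loop get expensive quickly.

```java
public class NormalEquations {

    // Fit ordinary least squares by solving (X^T X) w = X^T y.
    // x is n rows of p attributes; for p attributes this builds a
    // p x (p+1) augmented matrix, which is what drives memory use.
    static double[] fit(double[][] x, double[] y) {
        int n = x.length, p = x[0].length;
        double[][] a = new double[p][p + 1];   // augmented [X^T X | X^T y]
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < p; j++)
                for (int k = 0; k < n; k++)
                    a[i][j] += x[k][i] * x[k][j];
            for (int k = 0; k < n; k++)
                a[i][p] += x[k][i] * y[k];
        }
        // Gaussian elimination with partial pivoting.
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            double[] tmp = a[col]; a[col] = a[pivot]; a[pivot] = tmp;
            for (int r = col + 1; r < p; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= p; c++) a[r][c] -= f * a[col][c];
            }
        }
        // Back-substitution for the weight vector.
        double[] w = new double[p];
        for (int i = p - 1; i >= 0; i--) {
            double s = a[i][p];
            for (int j = i + 1; j < p; j++) s -= a[i][j] * w[j];
            w[i] = s / a[i][i];
        }
        return w;
    }
}
```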