Predicting Student Outcomes by How They Interact With Their Learning Environment | by Josh Johnson | Dec, 2020


Can we predict whether a student will pass an online course without knowing anything about who they are?


Learning online has been a growing trend for decades now. In 2018, 35% of college students took at least one course online and 17% took all of their classes remotely (NCES study). With COVID-19 a reality, online learning has exploded and become a health and safety necessity for more people than ever. While students will eventually return to school, the industry has had the opportunity, funding, and impetus to improve and expand. This will undoubtedly lead to a sharper rise in the importance of internet-based learning in the post-COVID future.

In my work as a teacher, I leveraged online learning opportunities and digital learning environments to allow students to learn at the pace that was right for them. They could take more time here and zip through there, rather than being forced to march at the pace of the rest of their peers. Virtual learning also streamlined data collection and analytics to help me plan interventions and adjust curriculum. Services such as Dreambox, Khan Academy, Coursera, and Udemy became vital tools for my self-driven learners.

But online learning comes with pitfalls as well. Online college courses have a 10–20% higher dropout rate than their in-person counterparts, and dropout rates for other online courses can range from 40% to 80% (Bawa). In large courses, students who are struggling can fall through the cracks. It’s often true that the stronger students are the ones who reach out for help, rather than the ones vulnerable to failure.

I wanted to know if there were algorithmic ways we could detect the warning signs of failure or withdrawal from courses in time to turn things around. I created a predictive model using data from the Open University to identify students who needed intervention. This dataset contains information about 22k students in 7 different courses over the 2012/2013 and 2013/2014 school years, including their interactions with the virtual learning environment.

I simulated a mid-course intervention recommender by using only virtual learning environment interaction data from the first half of the learning period. I collected statistics on how they interacted with the activities, when they studied, and how they fared on the content assessments along the way, but did not include any data about who they were or where they came from. These kinds of activity and assessment data can be collected anonymously from any online learning portal.
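To make the idea concrete, here is a minimal sketch of that kind of feature collection. It is an illustration, not the code from my project. The field names (`id_student`, `date`, `sum_click`, `date_submitted`, `score`) are assumed to follow the Open University dataset's interaction and assessment tables, the function name and the tiny sample rows are hypothetical, and `date` is assumed to count days from the start of the course.

```python
from collections import defaultdict

def summarize_interactions(vle_rows, assessment_rows, cutoff_day):
    """Aggregate per-student activity features using only data up to cutoff_day."""
    clicks = defaultdict(int)   # total clicks per student
    days = defaultdict(set)     # distinct days with any activity
    scores = defaultdict(list)  # assessment scores submitted so far

    for row in vle_rows:
        if row["date"] <= cutoff_day:
            clicks[row["id_student"]] += row["sum_click"]
            days[row["id_student"]].add(row["date"])

    for row in assessment_rows:
        if row["date_submitted"] <= cutoff_day:
            scores[row["id_student"]].append(row["score"])

    features = {}
    for student in set(clicks) | set(scores):
        student_scores = scores.get(student, [])
        features[student] = {
            "total_clicks": clicks.get(student, 0),
            "active_days": len(days.get(student, ())),
            # None marks a student with no assessment submitted before the cutoff.
            "mean_score": (sum(student_scores) / len(student_scores)
                           if student_scores else None),
        }
    return features

# Made-up sample rows, only to show the shape of the data.
vle_rows = [
    {"id_student": 1, "date": 0, "sum_click": 5},
    {"id_student": 1, "date": 3, "sum_click": 2},
    {"id_student": 1, "date": 200, "sum_click": 9},  # after the cutoff, ignored
    {"id_student": 2, "date": 10, "sum_click": 1},
]
assessment_rows = [
    {"id_student": 1, "date_submitted": 20, "score": 80},
    {"id_student": 1, "date_submitted": 40, "score": 60},
    {"id_student": 2, "date_submitted": 150, "score": 90},  # after the cutoff
]

features = summarize_interactions(vle_rows, assessment_rows, cutoff_day=100)
```

Nothing here depends on who a student is. The only identifier is an anonymous ID used to group rows.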

Using these statistics, I could identify 75% of the students who would not pass the course, using only data from its first half. They scored lower on the assessments, clicked fewer times, and spent fewer total days engaging with the activities. In fact, the warning signs showed themselves in the first month for 68% of students who would go on to fail the course.
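A figure like "75% of the students who would not pass" is the recall for the not-passing class: of all students who truly did not pass, the share the model flagged. Here is a small sketch of that calculation. The function name, the label strings, and the toy labels are hypothetical and are not my actual results.

```python
def recall_for_class(y_true, y_pred, positive="not_passed"):
    """Share of truly positive cases that the predictions also marked positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predictions_for_positives:
        raise ValueError("no examples of the positive class in y_true")
    caught = sum(p == positive for p in predictions_for_positives)
    return caught / len(predictions_for_positives)

# Toy labels: 4 students who did not pass, 4 who passed.
y_true = ["not_passed"] * 4 + ["passed"] * 4
y_pred = ["not_passed", "not_passed", "not_passed", "passed",
          "passed", "passed", "not_passed", "passed"]

recall = recall_for_class(y_true, y_pred)  # 3 of the 4 at-risk students are caught
```

Recall says nothing about false alarms, so in practice it is worth reading alongside precision.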

The point at which the model makes its predictions has an effect on its accuracy, but not as much as one might think. Accuracy differs by less than 20% between predictions made in the first 10% of the course and those made in the last 10%. Warning signs start early, and students who don’t start strong often don’t do well down the road.

Successful students tended to put in work consistently, day by day, rather than in long cramming sessions. They also did not put off the assessments, but completed them on time. While scoring highly on assessments was certainly correlated with eventual success, the consistency of study was a better predictor.

I used a fascinating algorithm called gradient boosting to make this prediction. My implementation creates a series of decision trees that individually perform quite poorly. These are stacked on top of each other, and only the bottom one actually attempts to predict the course outcomes. Each successive tree predicts where the one before it will make mistakes, and by the time the top tree has its say, the ones below have caught most of the errors and the model returns an accurate prediction. You can read and explore more about the XGBoost (eXtreme Gradient Boosting) implementation I used, and apply it to your own data, if you like. It takes time to train, but the results are worth it.
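The stacking idea can be sketched in a few lines of plain Python. This is a deliberately simplified illustration and not XGBoost or my project code: it assumes squared-error loss, one-split trees ("stumps"), and a made-up toy dataset where each row is [total clicks, active days] and the label 1 means "did not pass". All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the one-feature threshold split that best fits the residuals."""
    best = None
    for j in range(len(xs[0])):
        values = sorted({x[j] for x in xs})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [r for x, r in zip(xs, residuals) if x[j] <= threshold]
            right = [r for x, r in zip(xs, residuals) if x[j] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    if best is None:  # every feature is constant, so no split is possible
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]

def stump_predict(stump, x):
    j, threshold, left_value, right_value = stump
    return left_value if x[j] <= threshold else right_value

def fit_boosted_stumps(xs, ys, n_trees=20, learning_rate=0.5):
    """Each new stump is fit to the errors left by everything before it."""
    base = sum(ys) / len(ys)  # the "bottom" prediction: the average label
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate

def predict_score(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

# Toy data: [total_clicks, active_days]; 1 = did not pass, 0 = passed.
xs = [[120, 3], [90, 4], [400, 20], [350, 25], [60, 2], [500, 30]]
ys = [1, 1, 0, 0, 1, 0]

model = fit_boosted_stumps(xs, ys)
at_risk = predict_score(model, [80, 3]) > 0.5   # low activity
on_track = predict_score(model, [450, 22]) > 0.5  # high activity
```

A score above 0.5 is read here as "likely not to pass". A real library such as XGBoost uses deeper trees, a loss suited to classification, and regularization, so treat this only as a picture of the residual-fitting loop.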

However, the relationship between activity and success is actually fairly linear. For the most part, the more work a student puts in and the better they do on their assessments, the more likely they are to succeed. Simple logistic regression was enough to identify 72% of students who would fail, and even my model had an especially hard time predicting students who worked hard and scored well but would withdraw before the end of the course. One has to wonder what circumstances in their lives may have led to that outcome.
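For comparison, here is a minimal logistic regression sketch trained with plain gradient descent. It is an illustration under assumptions, not my project code. The two features (share of days active and mean assessment score, both scaled to the 0 to 1 range), the toy rows, and the function names are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(xs[0])
    bias = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            # Gradient of the log loss w.r.t. the linear score is (prediction - label).
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def predict_fail_probability(model, x):
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

# Toy data: [share_of_days_active, mean_score / 100]; 1 = did not pass.
xs = [[0.1, 0.4], [0.2, 0.5], [0.15, 0.3], [0.7, 0.8], [0.8, 0.9], [0.6, 0.7]]
ys = [1, 1, 1, 0, 0, 0]

model = fit_logistic(xs, ys)
low_effort = predict_fail_probability(model, [0.1, 0.35])
high_effort = predict_fail_probability(model, [0.75, 0.85])
```

Because the model is a weighted sum passed through a sigmoid, it can only express "more activity and higher scores mean lower risk" style relationships. That is why it does well when the relationship is close to linear, and why it cannot single out a hard-working, high-scoring student who later withdraws.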

There is great value in examining the patterns in how students interact with virtual learning environments, and I hope you are inspired to take my work further. The link to my GitHub repository for this project is below. Please feel free to clone it and try adding features or using different modeling techniques to improve on my work. Also, if you know of any similar datasets I could use to test and further train my model, please link them in the comments!

GitHub Link: Predicting Online Student Success


Bawa, P. Retention in Online Courses: Exploring Issues and Solutions — A Literature Review. SAGE Open, January-March 2016: 1–11. doi: 10.1177/2158244015621777

National Center for Education Statistics

Kuzilek, J., Hlosta, M., Zdrahal, Z. Open University Learning Analytics dataset. Sci. Data 4:170171 (2017). doi: 10.1038/sdata.2017.171
