Okay, so let me walk you through my experience with the 3M Open Leaderboard 2024. It was a wild ride, full of ups and downs, mostly downs if I’m being honest, but hey, that’s how you learn, right?

It all started when I saw a post about it on some forum. I thought, “Why not? I’ve got some time, and it looks kinda fun.” Famous last words, I know. I jumped in headfirst, thinking I could just wing it. Boy, was I wrong.
First, I spent like a week just trying to figure out the data. It was a mess! Dates all over the place, weird formatting, and a ton of information I didn’t even know what to do with. I started by cleaning it up in Python with Pandas, going column by column, fixing the dates, dropping the irrelevant stuff, and trying to make sense of the whole thing (the steps below, with a sketch after the list). It was tedious, but necessary.
Data Cleaning Steps:
- Loaded the CSV into a Pandas DataFrame.
- Converted date columns to datetime objects.
- Handled missing values (filled with 0 or the mean, depending on the column).
- Removed unnecessary columns (like player ID, which wasn’t useful for my analysis).
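In case it helps, the cleaning pass looked roughly like this. The column names (`date`, `score`, `birdies`, `player_id`) and the file name are placeholders standing in for the actual leaderboard headers, which were messier:

```python
import pandas as pd

# Rough sketch of the cleaning pass (column names are placeholders,
# not the real leaderboard headers).
df = pd.read_csv("leaderboard_2024.csv")

# Parse the date column; anything unparseable becomes NaT instead of erroring.
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Fill missing values: 0 for count-like columns, the column mean for scores.
df["birdies"] = df["birdies"].fillna(0)
df["score"] = df["score"].fillna(df["score"].mean())

# Drop columns that weren't useful for the analysis.
df = df.drop(columns=["player_id"])
```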
Then came the “fun” part: trying to build a model to predict something. I wasn’t even sure what I wanted to predict at first. I messed around with a few ideas, like predicting the winner based on previous scores, but that turned out to be a dead end. The data was just too noisy. After that I tried to focus on single-round performance.
I decided to keep it simple and use a basic linear regression model with scikit-learn in Python. I figured it was a good starting point. I spent ages fiddling with the features, trying different combinations to see what would give me the best results. Spoiler alert: nothing really worked that well. I tried adding features like average score, best score, and even things like number of birdies, but nothing seemed to significantly improve the predictions. There’s a rough sketch of it after the list below.
Model Building:
- Split the data into training and testing sets.
- Used Linear Regression as the base model.
- Tried different feature combinations (average score, best score, birdies, etc.).
- Evaluated the model using metrics like Mean Absolute Error (MAE) and R-squared.
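The scikit-learn side of it looked roughly like this sketch. It assumes the cleaned DataFrame from earlier, and the feature and target column names are placeholders for whatever combination I was testing at the time:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Placeholder feature/target names -- the real ones were variations on these.
features = ["avg_score", "best_score", "birdies"]
X = df[features]
y = df["round_score"]  # target: a player's single-round score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

preds = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, preds))
print("R^2:", r2_score(y_test, preds))
```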
After that I tried some other, more complex models. Boosted trees and SVMs were a no-go after some initial attempts because I really struggled with the hyperparameter tuning, and I just ran out of time to make them work.
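For what it’s worth, the kind of tuning that ate my time looked something like this grid search over a boosted-tree regressor. The parameter grid is purely illustrative, and it reuses the training split from the sketch above:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative grid only -- this sort of search is what I ran out of time on.
param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.01, 0.1],
    "max_depth": [2, 3, 4],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```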
Next, I tried to validate the model’s reliability. To do this, I wrote some simple backtesting code to simulate betting on players using the model’s predictions. This revealed some clear problems: certain players were rated much higher than others simply because they had a higher frequency of good performances. It was quite frustrating, actually.
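The backtest itself was nothing clever, just an idea along these lines. It assumes one row per player per round with the actual `score` plus a `predicted_score` column from the model, and the payout numbers are completely made up for the sketch:

```python
# Crude backtest sketch: for each round, "bet" a flat unit on the player the
# model rates best (lowest predicted score) and check whether they actually
# posted the round's lowest score.
bankroll = 0.0
for round_date, group in df.groupby("date"):
    pick = group.loc[group["predicted_score"].idxmin()]
    actual_best = group["score"].min()
    # Pay out 4 units on a "win", lose the 1-unit stake otherwise
    # (the odds here are invented purely for the sketch).
    bankroll += 4.0 if pick["score"] == actual_best else -1.0

print("Simulated profit/loss:", bankroll)
```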

I even tried to incorporate some weather data, thinking that might give me an edge. I found some historical weather data for the tournament location, but merging it with the golf data was a pain. And, in the end, it didn’t really help much. The weather data was too general, and I couldn’t really tell how it affected individual players.
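For the record, the merge itself boiled down to a simple join on the round date, something like this. The weather file name and its columns (`temp_f`, `wind_mph`) are hypothetical:

```python
import pandas as pd

# Hypothetical weather file with one row per day at the course.
weather = pd.read_csv("weather_2024.csv")
weather["date"] = pd.to_datetime(weather["date"], errors="coerce")

# Left-join on the round date so every golf row keeps its weather, if any.
df = df.merge(weather[["date", "temp_f", "wind_mph"]], on="date", how="left")
```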
I spent the last few days just tweaking things here and there, trying to squeeze out a little bit more accuracy. I even tried some fancy stuff like neural networks, but I didn’t have enough data to train them properly. It was a complete mess.
In the end, my model was okay-ish. It wasn’t terrible, but it wasn’t great either. I definitely didn’t win any prizes or anything. But I learned a ton about data analysis, machine learning, and the importance of cleaning your data. Plus, I got to mess around with some cool tools and libraries. So, all in all, it was a worthwhile experience.
The biggest takeaway? Don’t underestimate the amount of time it takes to clean and preprocess data. And don’t be afraid to start simple. Sometimes, the simplest models are the best.
Learnings:
- Data cleaning is crucial and takes a lot of time.
- Start with simple models and iterate.
- Feature engineering can make or break your model.
- Don’t be afraid to experiment and try new things.