A summary of mid-distance race analyses

Hi V,

I’ve alluded to this project a couple of times over the last couple of years, but I hadn’t gotten around to putting this ‘report’ together until now. The origins of the project come from my love for mid-distance running, especially the 400m and 800m. There’s something about racing on the double-edged blade of the anaerobic-aerobic threshold that’s inspiring, captivating, and a lot of fun. At some point, every person who coaches or runs these events asks themselves, “What’s the best way to run these?” The answer is obviously one word: fast. But in my head, that question evolved into, “What’s the most efficient way to run them?” And after discovering statistics (having never taken your class in high school, which I still regret a little), I started thinking that there must be a certain way to run these events such that, on average, if you hold to that practice, you’ll run faster.

This diving board took me down several avenues. I learned several new statistical methods in an attempt to see which could really explain efficiency, and by extension success, in mid-distance. Clustering, dimension reduction, random forests, multiple regression, dozens of race metrics, and other approaches didn’t pull a coherent story out of the data. Then, while serving as a statistical consultant on a master’s project, I learned two new methods that illuminated something important. I don’t take the things I’ve learned as the full truth yet: I still have relatively little data, and it’s hard to get truly independent race data. The most readily available data I’ve found so far come from championship races and prelims. Those data were very cloudy, however, tainted by intentional race tactics that weren’t necessarily motivated by trying to run as fast as possible. But with the rest of the data that I do have, I think a very interesting door has opened.

Methods

800m Data

Some of the data I used were collected from your track notes between 2010ish and 2016ish. Another portion consists of the athletes I coach right now. The last portion, and the most precise, comes from the NBNI (New Balance Nationals Indoor) 2023 high school championships. Although those are championship data, high school championships are usually more about running fast than NCAA or international ones. They had about the same proportion of negative splits as the rest of the data I collected, so I think they’re safe for analysis (negative splits made up a much larger proportion of NCAA prelims and finals). I tried to webscrape the results, but because of the way MileSplit hosts their data, it’s nearly impossible to do programmatically. So it took hours to copy and paste shit into a CSV. I’m thinking about trying to get more NBNI competitions, but if I do, I want to automate it somehow. You could say a little about how indoor races differ from outdoor ones, and I know some mixed-effects modeling that can account for those differences (see the sketch below), but I think they’re relatively inconsequential. For all the data sources, I collected 400m splits, and some had 200m splits, but I didn’t end up using the 200m splits for the 800m analysis.
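
Since I mentioned it, here’s what that mixed-effects idea might look like if the indoor-versus-outdoor (or meet-to-meet) differences ever do start to matter: a random intercept per athlete plus a fixed effect for indoor. This is just a minimal sketch using statsmodels; the file name and the columns (final_time, ratio, indoor, athlete) are hypothetical stand-ins, not my actual data layout.

```python
# Mixed-effects sketch with hypothetical column names: final_time (s),
# ratio (lap2/lap1), indoor (0/1), athlete (ID for repeated measures).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("races_800m.csv")  # hypothetical file name

# Random intercept per athlete absorbs repeated measurements; 'indoor'
# enters as a fixed effect, so its coefficient estimates the
# indoor-vs-outdoor shift in final time.
model = smf.mixedlm("final_time ~ ratio + indoor", data=df, groups=df["athlete"])
result = model.fit()
print(result.summary())
```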

400m Data

The two sources for the 400m data are my coaching records and NBNI 2023 again. NBNI was really the only place that published 200m splits for the 400s (which makes sense on a 200m track).

Race Metrics

I tried a bunch of race metrics to measure efficiency, and what I ended up going with was the ratio of the second lap time to the first lap time:

\[x = \frac{\text{lap}_2}{\text{lap}_1}\]

This means that \(x = 1.0\) is an even split, \(x > 1.0\) is a slower second lap (a positive split, which is typical), and \(x < 1.0\) is a negative split.
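
As a quick worked example (with made-up splits): a 56.0 first lap and a 59.0 second lap give

\[x = \frac{59.0}{56.0} \approx 1.054,\]

a positive split of about 5.4%.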

Changepoint Analysis and Piecewise Regression

One of the methods I learned is called changepoint (or breakpoint) analysis. Essentially, it’s a family of algorithms that detect a change in mean, variance, or slope in a sequence of data points. It’s used in tandem with piecewise regression, where post hoc models are fitted to the segments of the sequence defined by the detected changepoints. So, in this case study, I sorted all the races by their lap ratio, then fit linear models on each side of the detected changepoint. I don’t think there’s a ton of merit to the linear models by themselves, but they do serve as a solid conceptualization of the pattern that’s going on. I think a natural spline, or maybe a quadratic model, would be a better regression method, but the point is that a changepoint occurs in the lap ratios, suggesting an optimal speed differential between your first and second lap (or first and second 200m split).
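
To make the procedure concrete, here’s a minimal sketch of one way to do the breakpoint search (the general idea, not necessarily the exact algorithm I ran): sort the races by lap ratio, scan the candidate split points, and keep the one that minimizes the combined residual sum of squares of two straight-line fits. The file name and columns are hypothetical.

```python
# Breakpoint search for piecewise linear regression: a minimal sketch.
# Assumed (hypothetical) CSV columns: lap1, lap2, final_time, in seconds.
import numpy as np
import pandas as pd

df = pd.read_csv("races_800m.csv")        # hypothetical file name
df["ratio"] = df["lap2"] / df["lap1"]     # the lap ratio x
df = df.sort_values("ratio").reset_index(drop=True)

x = df["ratio"].to_numpy()
y = df["final_time"].to_numpy()

def segment_sse(xs, ys):
    """Residual sum of squares of a straight-line fit to one segment."""
    coef = np.polyfit(xs, ys, deg=1)
    resid = ys - np.polyval(coef, xs)
    return float(resid @ resid)

# Try every interior split point (keeping at least 5 points per side;
# assumes a reasonably sized sample) and keep the one minimizing the
# combined two-segment SSE.
best_i, best_cost = None, np.inf
for i in range(5, len(x) - 5):
    cost = segment_sse(x[:i], y[:i]) + segment_sse(x[i:], y[i:])
    if cost < best_cost:
        best_i, best_cost = i, cost

print(f"Estimated changepoint at lap ratio of about {x[best_i]:.3f}")
```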

Results

800m

Smoothing Curves

This is the first look I took at the data, with a smoothing curve.

The changepoint for this was \(1.049\), meaning the second lap is run 4.9% slower than the first. This is what the linear models look like at that changepoint.

Clearly, the goal is to run as close to an even split as possible, but even David Rudisha’s world record wasn’t run in even splits. So, there seems to be an advantage to running the first lap faster.
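
For reference, the splits widely reported for Rudisha’s 1:40.91 world record, roughly 49.28 and 51.63, work out to

\[\frac{51.63}{49.28} \approx 1.048,\]

strikingly close to the detected changepoint.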

400m

Smoothing Curves

The 400m data had an even stronger trend than the 800m. Here’s what the data looked like.

The changepoint detected in the 400m data occurred at \(1.123\), meaning the second 200m is run about 12.3% slower than the first.


Discussion and Conclusion

Like I said earlier, I don’t think this is a final draft of whatever these data might be telling us. For example, the detected changepoints are the ratios run among the fastest times. That doesn’t mean a 2:10 800m runner needs to run that exact ratio; they’re likely better off with a slightly larger difference between laps. This is why I think a natural spline or quadratic regression would make all of this a little more interpretable (see the sketch below). But it’s interesting to think about. I think a definite takeaway is that in order to run faster in the 800m, you need to make your laps more consistent, but the absolute ratio that produces the optimal speed will change from runner to runner. Replicating that lap ratio could serve as a starting point and/or a goal in your races, but at the end of the day it comes down to how you balance your speed against your endurance/strength.
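
Here’s the quadratic idea as a minimal sketch (same hypothetical file and columns as before): fit final time as a quadratic in the lap ratio, and read the vertex off as a rough ‘optimal’ ratio for the sample.

```python
# Quadratic fit of final time vs. lap ratio: a minimal sketch with the
# same hypothetical columns as before (lap1, lap2, final_time).
import numpy as np
import pandas as pd

df = pd.read_csv("races_800m.csv")        # hypothetical file name
x = (df["lap2"] / df["lap1"]).to_numpy()
y = df["final_time"].to_numpy()

a, b, c = np.polyfit(x, y, deg=2)         # y is modeled as a*x**2 + b*x + c
if a > 0:
    # Convex parabola: the vertex -b/(2a) is the ratio minimizing
    # predicted final time, a rough "optimal" ratio for this sample.
    print(f"Ratio minimizing predicted time: about {-b / (2 * a):.3f}")
else:
    print("Fit is concave here; no interior minimum to report.")
```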

Let me know if you have any questions or feedback on all of this; I’m gonna keep trying to acquire data as time goes on. One day, when I have my own coaching program or whatever, I’m gonna get all the data I could ever need hahaha. I love you, V, thank you for helping me discover a lifelong passion all those years ago.