# Fun with Statistics - Is Usain Bolt really the fastest man on earth?

If you search for the phrase “fastest man on earth” in Google, chances are that it will return the answer “Usain Bolt”. It certainly does so for me, even though the results might be different if Google decides to personalize the results for you. This is because currently he holds the world record for being the quickest (9.58s) to run a 100 metres race. Justin Gatlin once did it in 9.45s, but that does not count as the record because he was assisted by the wind.

However, as statisticians, we know that a single data point does not necessarily prove anything and we should look at the distribution of timings for the top runners to check if Usain Bolt is truly the fastest man on earth. In this article, we will try to find an answer to this question using data.

### Data

Our first step is to find data on the 100m race timings for all top runners. Unfortunately, I could not find any public source of this data. The best option we have is to use data from the excellent website Alltime Athletics maintained by Peter Larsson. This page provides the details of the all-time men’s best 100m race timings. We can answer a slightly different question using this data. Out of all runners who appear in this list, is there any evidence to suggest that there is a statistical difference among the top runners?

Our first step will be to extract and store the data in a R data frame. We use the *rvest* and *readr* R packages to do this.

```
library(rvest)
library(readr)
```

`male_100_html <- read_html("http://www.alltime-athletics.com/m_100ok.htm")`

An examination of the HTML source reveals that the data has been stored within HTML pre-tags. There are also seven tables available as part of the page, out of which we are interested in the table “All-time men’s best 100m”. We use the function *html_nodes* from the *rvest* package using an XPath selector. The next step is to extract the HTML text using the function *html_text*, and restrict to the first element.

```
male_100_pres <- male_100_html %>%
html_nodes(xpath = "//pre")
male_100_htext <- male_100_pres %>%
html_text()
male_100_htext <- male_100_htext[[1]]
```

The data is stored in fixed-width format in the variable *male_100_htext*. The *read_fwf* function from the *readr* package can be used to store the data in a data frame. The exact arguments to the function were decided using a bit of trial and error.

```
male_100 <- readr::read_fwf(male_100_htext, skip = 1, n_max = 3178,
col_types = cols(.default = col_character()),
col_positions = fwf_positions(
c(1, 16, 27, 35, 66, 74, 86, 93, 123),
c(15, 26, 34, 65, 73, 85, 92, 122, 132)
))
str(male_100)
```

```
## spec_tbl_df [3,178 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ X1: chr [1:3178] "1" "2" "3" "3" ...
## $ X2: chr [1:3178] "9.58" "9.63" "9.69" "9.69" ...
## $ X3: chr [1:3178] "+0.9" "+1.5" "±0.0" "+2.0" ...
## $ X4: chr [1:3178] "Usain Bolt" "Usain Bolt" "Usain Bolt" "Tyson Gay" ...
## $ X5: chr [1:3178] "JAM" "JAM" "JAM" "USA" ...
## $ X6: chr [1:3178] "21.08.86" "21.08.86" "21.08.86" "09.08.82" ...
## $ X7: chr [1:3178] "1" "1" "1" "1" ...
## $ X8: chr [1:3178] "Berlin" "London" "Beijing" "Shanghai" ...
## $ X9: chr [1:3178] "16.08.2009" "05.08.2012" "16.08.200" "20.09.2009" ...
## - attr(*, "spec")=
## .. cols(
## .. .default = col_character(),
## .. X1 = col_character(),
## .. X2 = col_character(),
## .. X3 = col_character(),
## .. X4 = col_character(),
## .. X5 = col_character(),
## .. X6 = col_character(),
## .. X7 = col_character(),
## .. X8 = col_character(),
## .. X9 = col_character()
## .. )
```

For this analysis, we are only interested in the name of the runner and the race timing. So we restrict the data to these two columns, and give each column an appropriate name.

```
library(dplyr)
male_100 <- male_100 %>%
select(X2, X4)
colnames(male_100) <- c("timing", "runner")
```

We observe that some of the timings have an “A” against the time. The Wikipedia page Athletics abbreviations informs us that this indicates a mark set at altitude. We simply remove the “A” and also convert the variable *timing* into a numeric variable.

```
male_100 <- male_100 %>%
mutate(timing = gsub("A", "", timing),
timing = as.numeric(timing))
male_100
```

```
## # A tibble: 3,178 x 2
## timing runner
## <dbl> <chr>
## 1 9.58 Usain Bolt
## 2 9.63 Usain Bolt
## 3 9.69 Usain Bolt
## 4 9.69 Tyson Gay
## 5 9.69 Yohan Blake
## 6 9.71 Tyson Gay
## 7 9.72 Usain Bolt
## 8 9.72 Asafa Powell
## 9 9.74 Asafa Powell
## 10 9.74 Justin Gatlin
## # ... with 3,168 more rows
```

### Exploration

As with any statistical analysis, the first step is to explore and understand the data we are working with. We see that there are 318 unique runners in the data.

`n_distinct(male_100$runner)`

`## [1] 341`

```
runner_cnt <- male_100 %>%
group_by(runner) %>%
summarise(n_rec = n()) %>%
arrange(desc(n_rec))
runner_cnt
```

```
## # A tibble: 341 x 2
## runner n_rec
## <chr> <int>
## 1 Asafa Powell 144
## 2 Mike Rodgers 100
## 3 Justin Gatlin 99
## 4 Maurice Greene 83
## 5 Nesta Carter 72
## 6 Usain Bolt 71
## 7 Tyson Gay 67
## 8 Kim Collins 66
## 9 Ato Boldon 65
## 10 Frank Fredericks 62
## # ... with 331 more rows
```

As expected, some of the well-known names dominate the list with the maximum number of records. Our next step is to look at the distribution of the number of records.

```
library(ggplot2)
ggplot(runner_cnt, aes(n_rec)) +
geom_histogram(binwidth = 5, fill = "lightblue", colour = "black") +
xlab(NULL) +
ylab(NULL) +
ggtitle("Distribution of number of records by runner") +
theme_bw() +
theme(
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, vjust = 0.5)
)
```

As expected, this distribution is heavily skewed as the top runners will tend to dominate the list. We first restrict our data to those who have at least 50 records each.

```
male_100_2 <- male_100 %>%
inner_join(
select(filter(runner_cnt, n_rec >= 50), runner),
by = c("runner" = "runner")
)
```

There are 16 runners left in this data. Let us have a look at the distribution of their race timings as well as the mean and median race timings for these runners.

```
ggplot(male_100_2, aes(timing)) +
geom_density() +
facet_wrap(~runner, nrow = 4, ncol = 4) +
xlab(NULL) +
ylab(NULL) +
ggtitle("Distribution of race timings by runner") +
theme_bw() +
theme(
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, vjust = 0.5)
)
```

```
male_100_means <- male_100_2 %>%
group_by(runner) %>%
summarise(mean_timing = mean(timing)) %>%
arrange(mean_timing)
male_100_means
```

```
## # A tibble: 15 x 2
## runner mean_timing
## <chr> <dbl>
## 1 Usain Bolt 9.90
## 2 Asafa Powell 9.94
## 3 Tyson Gay 9.95
## 4 Justin Gatlin 9.96
## 5 Yohan Blake 9.96
## 6 Maurice Greene 9.97
## 7 Ato Boldon 10.0
## 8 Mike Rodgers 10.0
## 9 Frank Fredericks 10.0
## 10 Nesta Carter 10.0
## 11 Carl Lewis 10.0
## 12 Linford Christie 10.0
## 13 Dennis Mitchell 10.0
## 14 Kim Collins 10.0
## 15 Michael Frater 10.0
```

```
male_100_medians <- male_100_2 %>%
group_by(runner) %>%
summarise(median_timing = median(timing)) %>%
arrange(median_timing)
male_100_medians
```

```
## # A tibble: 15 x 2
## runner median_timing
## <chr> <dbl>
## 1 Usain Bolt 9.9
## 2 Asafa Powell 9.95
## 3 Justin Gatlin 9.97
## 4 Maurice Greene 9.97
## 5 Tyson Gay 9.97
## 6 Yohan Blake 9.97
## 7 Ato Boldon 10
## 8 Mike Rodgers 10
## 9 Nesta Carter 10.0
## 10 Frank Fredericks 10.0
## 11 Linford Christie 10.0
## 12 Dennis Mitchell 10.0
## 13 Carl Lewis 10.0
## 14 Kim Collins 10.0
## 15 Michael Frater 10.0
```

There are 6 runners who are at the top using both the mean or the median. Also, for all of them, the mean and the median timings are less than 10 seconds. So we will further restrict the data to these 6 runners, and see if their race timings are significantly different.

```
male_100_3 <- male_100_2 %>%
filter(runner %in% c("Usain Bolt", "Asafa Powell", "Yohan Blake",
"Justin Gatlin", "Maurice Greene", "Tyson Gay"))
```

### Analysis

Now that we have the data ready, our next step is to try and answer the original question. We will follow two different approaches. Our first approach will be the non-parametric Kruskall-Wallis rank sum test. We will also do a multiple comparison test for the Kruskall-Wallis. The next approach will be a pairwise comparison of the runners using simulations. This approach was popularized in an article by Allen Downey, which said that there is only one statistical test. Another article which implements the same idea, and provides code using the *infer* package, can be found in this link.

```
kt <- kruskal.test(timing ~ runner, data = male_100_3)
kt
```

```
##
## Kruskal-Wallis rank sum test
##
## data: timing by runner
## Kruskal-Wallis chi-squared = 19.376, df = 5, p-value = 0.001635
```

Since the p-value is significant at the 99% level, it indicates that at least one of the runners is different from the others.

```
library(pgirmess)
ktph <- kruskalmc(male_100_3$timing, male_100_3$runner)
ktph$dif.com
```

```
## obs.dif critical.dif difference
## Asafa Powell-Justin Gatlin 31.4232955 58.13314 FALSE
## Asafa Powell-Maurice Greene 49.4011463 61.36379 FALSE
## Asafa Powell-Tyson Gay 24.6743885 65.84793 FALSE
## Asafa Powell-Usain Bolt 45.9600450 64.56963 FALSE
## Asafa Powell-Yohan Blake 32.2246699 68.02214 FALSE
## Justin Gatlin-Maurice Greene 17.9778508 66.26719 FALSE
## Justin Gatlin-Tyson Gay 6.7489070 70.43987 FALSE
## Justin Gatlin-Usain Bolt 77.3833404 69.24640 TRUE
## Justin Gatlin-Yohan Blake 0.8013744 72.47646 FALSE
## Maurice Greene-Tyson Gay 24.7267578 73.12884 FALSE
## Maurice Greene-Usain Bolt 95.3611912 71.97997 TRUE
## Maurice Greene-Yohan Blake 17.1764764 75.09254 FALSE
## Tyson Gay-Usain Bolt 70.6344335 75.83898 FALSE
## Tyson Gay-Yohan Blake 7.5502814 78.79927 FALSE
## Usain Bolt-Yohan Blake 78.1847148 77.73425 TRUE
```

The only two cases where there seems to be a difference are - “Justin Gatlin-Usain Bolt” and “Maurice Green-Usain Bolt”.

We now do the pairwise comparison of athletes using simulations. To do that, we will write a function to compare two individual runners. The first step will be to calculate the difference in medians in the observed data. This is accomplished using the *specify* and *calculate* functions in the *infer* package. *specify* is used to specify the response and explanatory variables. *calculate* is used to calculate the test statistic, which is the difference in medians in this case. The second step is similar, except that we include a null hypothesis that there is no difference in the median timings between the runners. We also perform 10,000 simulations - where each simulation permutes the runners in the existing data, and then calculate the difference in medians between the two groups. The *infer* package also provides a *visualize* function to visualize the distribution of the simulated test statistic.

```
library(infer)
set.seed(2154)
compare_runners <- function(runner1, runner2, visualize = FALSE) {
diff_med <- male_100_3 %>%
filter(runner %in% c(runner1, runner2)) %>%
arrange(runner) %>%
specify(timing ~ runner) %>%
calculate("diff in medians",
order = c(runner1, runner2))
diff_med_null <- male_100_3 %>%
filter(runner %in% c(runner1, runner2)) %>%
arrange(runner) %>%
specify(timing ~ runner) %>%
hypothesize(null = "independence") %>%
generate(reps = 10000, type = "permute") %>%
calculate("diff in medians",
order = c(runner1, runner2))
if (visualize) {
diff_med_plot <- diff_med_null %>%
visualize(method = "simulation") +
shade_p_value(obs_stat = diff_med, direction = "less") +
xlab("Difference in median run timings") +
ylab(NULL) +
ggtitle(paste0("Comparing ", runner1, " and ", runner2)) +
scale_x_continuous(labels = scales::number_format(accuracy = 0.01)) +
theme_bw() +
theme(
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, vjust = 0.5)
)
print(diff_med_plot)
}
diff_med_null %>%
get_pvalue(obs_stat = diff_med, direction = "less")
}
```

The function is applied to each pair of runners. Rather than looking at the results for all the 15 pairs of runners, we do it based on their rankings by median times. There is only one exception, where we also look at the pair Usain Bolt-Yohan Blake, who have the 1st and 3rd least median times respectively. The results for each pair, as well as the p-value generated by the simulation, are shown in the graphs and tables below.

`compare_runners("Usain Bolt", "Asafa Powell", TRUE)`

```
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.0185
```

`compare_runners("Asafa Powell", "Yohan Blake", TRUE)`

```
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.193
```

`compare_runners("Yohan Blake", "Justin Gatlin", TRUE)`

```
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.642
```

`compare_runners("Justin Gatlin", "Maurice Greene", TRUE)`

```
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.738
```

`compare_runners("Maurice Greene", "Tyson Gay", TRUE)`

```
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.702
```

`compare_runners("Usain Bolt", "Yohan Blake", TRUE)`

```
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.0027
```

There is only a 1.85% chance of seeing a difference as large as the observed difference if there is actually no difference between the timings of Usain Bolt and Asafa Powell. The chance for the pair of Usain Bolt and Yohan Blake is even lower at 0.27%. For all the remaining pairs, we fail to reject the null hypothesis and do not find sufficient evidence to conclude that the timings for these pairs are statistically different. It appears that Usain Bolt is better than the next best in the list which is Asafa Powell and Yohan Blake. Given the data we have, we do have reason to believe that Usain Bolt is the fastest man on earth (as measured by 100m race timings).

The reproducible code for this analysis can be found in Github.