Alt text

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

Collecting information about both the data sets

We are going to:

Review the first lines of the data;
Use the describe and info functions to collect statistical information, datatypes, column names and other information

data = pd.read_csv('../input/athlete-events/athlete_events.csv')
regions = pd.read_csv('../input/athlete-events/noc_regions.csv')

data.describe()

data.info()

regions = pd.read_csv('../input/athlete-events/noc_regions.csv')
regions.head()

Joining the data frames

merged = pd.merge(data, regions, on='NOC', how='left')
merged.head()

Distribution of the age of gold medalists

Let’s start creating a new data frame including only gold medalists.

goldMedals = merged[(merged.Medal == 'Gold')]
goldMedals.head()

I would like to have a plot of the Age to see the distribution but I need to check first if the Age column contains NaN values.

goldMedals.isnull().any()

Let’s take only the values that are different from NaN.

goldMedals = goldMedals[np.isfinite(goldMedals['Age'])]

We can now create a countplot to see the result of our work:

plt.figure(figsize=(20, 10))
plt.tight_layout()
sns.countplot(goldMedals['Age'])
plt.title('Distribution of Gold Medals')

It seems that we have people with Age greater that 50 with a gold medal: Let’s know more about those people.

goldMedals['ID'][goldMedals['Age'] > 50].count()

65 people. Wonderul But which disciplines allows you to land a gold medal after your fifties?

We will now create a new dataframe called masterDisciplines in which we will insert this new set of people and then create a visualization with it.

masterDisciplines = goldMedals['Sport'][goldMedals['Age'] > 50]
plt.figure(figsize=(20, 10))
plt.tight_layout()
sns.countplot(masterDisciplines)
plt.title('Gold Medals for Athletes Over 50')

It seems that our senior gold medalists are shooters, archers, sailors and, above all, horse riders!

It makes sense: I cannot imagine a sprinter making 100 meters in 10 seconds at 55, but who knows!

Women in Athletics

Studying the data we can try to understand how much medals we have only for women in the recent history of the Summer Games. Let’s create a filtered dataset :

womenInOlympics = merged[(merged.Sex == 'F') & (merged.Season == 'Summer')]
womenInOlympics.head(10)

To plot the curve over time, let’s create a plot in which we put the year (on the x-axis) and count of the number of medals per edition of the games (consider that we will have more medals for the same athlete).

sns.set(style="darkgrid")
plt.figure(figsize=(20, 10))
sns.countplot(x='Year', data=womenInOlympics)
plt.title('Women medals per edition of the Games')

Usually I cross-check the data: below I tried to review only the medalists for the 1900 Summer edition to see if the visualization is correct.

womenInOlympics.loc[womenInOlympics['Year'] == 1900].head(10)

Okay, let’s count the rows (same code as above adding the count() function and filtering only for ID)

womenInOlympics['ID'].loc[womenInOlympics['Year'] == 1900].count()

So we have 33 records (with repetitions, for example ‘Marion Jones (-Farquhar)’ won a medal both for Tennis Women’s Singles and Tennis Mixed Doubles – To be sure I cross-checked also with Wikipedia and the outcome seems correct).

Medals per country

Let’s now review the top 5 gold medal countries:

goldMedals.region.value_counts().reset_index(name='Medal').head()

Let’s plot this:

totalGoldMedals = goldMedals.region.value_counts().reset_index(name='Medal').head(5)
g = sns.catplot(x="index", y="Medal", data=totalGoldMedals,
                height=6, kind="bar", palette="muted")
g.despine(left=True)
g.set_xlabels("Top 5 countries")
g.set_ylabels("Number of Medals")
plt.title('Medals per Country')

The USA seems to be the most winning country.

But which are the most awarded disciplines of American Athletes?

Disciplines with the greatest number of Gold Medals

Let’s create a dataframe to filter the gold medals only for the USA.

goldMedalsUSA = goldMedals.loc[goldMedals['NOC'] == 'USA']

Done! Now, we can count the medals per discipline

goldMedalsUSA.Event.value_counts().reset_index(name='Medal').head(20)

Let’s slice the dataframe using only the data of male athletes to better review it:

basketballGoldUSA = goldMedalsUSA.loc[(goldMedalsUSA['Sport'] == 'Basketball') & (goldMedalsUSA['Sex'] == 'M')].sort_values(['Year'])
basketballGoldUSA.head(15)

What we supposed is true: the medals are not grouped by Edition/Team but we were counting the gold medals of each member of the team!

Let’s proceed grouping by year the athletes – the idea is to create a new dataframe to make a pre-filter using only the first record for each member of the team.

groupedBasketUSA = basketballGoldUSA.groupby(['Year']).first()
groupedBasketUSA
groupedBasketUSA['ID'].count()

What is the median height/weight of an Olympic medalist?

Let’s try to plot a scatterplot of height vs weight to see the distribution of values (without grouping by discipline).

First of all, we have to take again the goldMedals dataframe

goldMedals.head()

We can see that we have NaN values both in height and weight columns

At this point, we can act as follows:

Using only the rows that has a value in the Height and Weight columns;
Replace the value with the mean of the column.

Solution 2 in my opinion it is not the best way to go: we are talking about data of athletes of different ages and different disciplines (that have done different training).

Let’s go with solution 1.

The first thing to do is to collect general information about the dataframe that we have to use: goldMedals.

goldMedals.info()

Okay, we have more than 13.000 rows.

We will now create a dataframe filtering only the rows that has the column Height and Weight populated.

notNullMedals = goldMedals[(goldMedals['Height'].notnull()) & (goldMedals['Weight'].notnull())]

plt.figure(figsize=(12, 10))
ax = sns.scatterplot(x="Height", y="Weight", data=notNullMedals)
plt.title('Height vs Weight of Olympic Medalists')

The vast majority of the samples show a linear relation between height and weight (the more the weight, the more the height).

We have exceptions and I am willing to know more!

For example, let’s see which is the athlete that weighs more than 160 kilograms.

notNullMedals.loc[notNullMedals['Weight'] > 160]

MenOverTime = merged[(merged.Sex == 'M') & (merged.Season == 'Summer')]
WomenOverTime = merged[(merged.Sex == 'F') & (merged.Season == 'Summer')]

MenOverTime.head()

part = MenOverTime.groupby('Year')['Sex'].value_counts()
plt.figure(figsize=(20, 10))
part.loc[:,'M'].plot()
plt.title('Variation of Male Athletes over time')

part = WomenOverTime.groupby('Year')['Sex'].value_counts()
plt.figure(figsize=(20, 10))
part.loc[:,'F'].plot()
plt.title('Variation of Female Athletes over time')

What I immediately saw is that for women:

We have a steep increase in the population;
The grow is constant.

On the other hand, the grow for men seems less strong:

After the 1990 we can see a relevant decrease in the number of male athletes at the summer games;
The growth has slowly restarted recently.

plt.figure(figsize=(20, 10))
sns.boxplot('Year', 'Age', data=MenOverTime)
plt.title('Variation of Age for Male Athletes over time')

What is strange for me is the age of some athletes in the games between the 1924 and the 1948: let’s check all the people with age greater than 80.

MenOverTime.loc[MenOverTime['Age'] > 80].head(10)

plt.figure(figsize=(20, 10))
sns.boxplot('Year', 'Age', data=WomenOverTime)
plt.title('Variation of Age for Female Athletes over time')

Interesting points for me:

Generally, the age distribution starts has a lower minimum and a lower maximum;
In 1904 the age distribution is strongly different from the other Olympics: let’s know more about this point:

WomenOverTime.loc[WomenOverTime['Year'] == 1904]

We will now try using a pointplot to visualize the variation in weight over athletes.

The first graph will show data for men, the second for women:

plt.figure(figsize=(20, 10))
sns.pointplot('Year', 'Weight', data=MenOverTime)
plt.title('Variation of Weight for Male Athletes over time')

plt.figure(figsize=(20, 10))
sns.pointplot('Year', 'Weight', data=WomenOverTime)
plt.title('Variation of Weight for Female Athletes over time')

What we can see is that it seems that we do not have data for women before 1924. Let’s try filtering all the women athletes for that period to review this point:

womenInOlympics.loc[womenInOlympics['Year'] < 1924].head(20)

Using the same pointplot (with a different palette) we can plot the weight change along time.

The first graph will show the information for men, the second for women:

womenInOlympics.loc[womenInOlympics['Year'] < 1924].head(20)

plt.figure(figsize=(20, 10))
sns.pointplot('Year', 'Height', data=WomenOverTime, palette='Set2')
plt.title('Variation of Height for Female Athletes over time')

What we may see:

For both men and women, the height is incrementing over time but it is decreasing between the 2012 and the 2016.
For women we have a peak between 1928 and 1948, let’s deepen this point:

WomenOverTime.loc[(WomenOverTime['Year'] > 1924) & (WomenOverTime['Year'] < 1952)].head(10)

MenOverTime.head()

itMenOverTime = MenOverTime.loc[MenOverTime['region'] == 'Italy']

sns.set(style="darkgrid")
plt.figure(figsize=(20, 10))
sns.countplot(x='Year', data=itMenOverTime, palette='Set2')
plt.title('Variation of Age for Italian Male Athletes over time')

now we can plot the change over time:

itWomenOverTime = WomenOverTime.loc[WomenOverTime['region'] == 'Italy']
sns.set(style="darkgrid")
plt.figure(figsize=(20, 10))
sns.countplot(x='Year', data=itWomenOverTime, palette='Set2')
plt.title('Variation of Age for Italian Female Athletes over time')

Let’s first of all isolate all the discipline of the Olympics dataframe.

My idea is to see if Gymnastics is called differently or if there is any type.

MenOverTime['Sport'].unique().tolist()

Okay, the string to use to filter is ‘Gymnastics’: let’s create two new dataframes for men and women.

gymMenOverTime = MenOverTime.loc[MenOverTime['Sport'] == 'Gymnastics']
gymWomenOverTime = WomenOverTime.loc[WomenOverTime['Sport'] == 'Gymnastics']

Okay: let’s now create our plot for male and female athletes and then we can make our observations

plt.figure(figsize=(20, 10))
sns.barplot('Year', 'Weight', data=gymMenOverTime)
plt.title('Weight over year for Male Gymnasts')

A few things I noticed:

The weight for female Gymnasts has go down for 60 to 50 kilograms on average;
The weight for men has been more or less stable since 1964;
The height is more stable for both men and women. ### Also, men weight data from 1924 seems missing: let’s check.

plt.figure(figsize=(20, 10))
sns.barplot('Year', 'Height', data=gymMenOverTime)
plt.title('Height over year for Male Gymnasts')

plt.figure(figsize=(20, 10))
sns.barplot('Year', 'Weight', data=gymWomenOverTime)
plt.title('Weight over year for Female Gymnasts')

plt.figure(figsize=(20, 10))
sns.barplot('Year', 'Height', data=gymWomenOverTime)
plt.title('Height over year for Female Gymnasts')

gymMenOverTime['Weight'].loc[gymMenOverTime['Year'] == 1924].isnull().all()

Weightlifting

Let’s work on an analysis similar to what we have done for Gymnastics also for the Lifters.

We can start creating a new, dedicated dataframe

wlMenOverTime = MenOverTime.loc[MenOverTime['Sport'] == 'Weightlifting']
wlWomenOverTime = WomenOverTime.loc[WomenOverTime['Sport'] == 'Weightlifting']

Okay: let’s now create our plot for male and female athletes and then we can make our observations

plt.figure(figsize=(20, 10))
sns.pointplot('Year', 'Weight', data=wlMenOverTime, palette='Set2')
plt.title('Weight over year for Male Lifters')

plt.figure(figsize=(20, 10))
sns.pointplot('Year', 'Height', data=wlMenOverTime, palette='Set2')
plt.title('Height over year for Male Lifters')

plt.figure(figsize=(20, 10))
sns.pointplot('Year', 'Weight', data=wlWomenOverTime)
plt.title('Weight over year for Female Lifters')

plt.figure(figsize=(20, 10))
sns.pointplot('Year', 'Height', data=wlWomenOverTime)
plt.title('Height over year for Female Lifters')