Data source: The dataset includes job outcomes of students who graduated recently (age < 28) from college (US only) between 2010 and 2012. The original data on job outcomes was released by the American Community Survey. FiveThirtyEight cleaned the dataset and released it on their GitHub repo.
Aim: The purpose of the analysis is to explore earnings based on college majors. Some related questions are:
- Do students from more popular majors make more money?
- Which majors are predominantly male? Predominantly female?
- Which category of majors has the most students?
Analysis Tool: The tools used for the analysis are Python Pandas and Matplotlib.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
recent_grads = pd.read_csv('recent-grads.csv')
print("First Row Sample Record")
print(recent_grads.iloc[0])
print("")
#print("Head", recent_grads.head(10))
print("")
#print("Tail", recent_grads.tail(3))
print("")
print("Shape of data", recent_grads.shape)
print("")
print("List Columns", list(recent_grads))
# Drop rows with missing values
raw_data_count=recent_grads.shape[0]
print("Raw Count", raw_data_count)
recent_grads = recent_grads.dropna()
cleaned_data_count=recent_grads.shape[0]
print("Final Cleaned Count", cleaned_data_count )
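Before dropping rows, it can help to see which rows and columns actually contain missing values. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and its values are invented and are not taken from `recent-grads.csv`); the same calls could be applied to `recent_grads` before `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for recent_grads; all values are made up.
toy = pd.DataFrame({
    "Major": ["Major A", "Major B", "Major C"],
    "Total": [1200.0, np.nan, 560.0],
    "Median": [50000, 42000, 38000],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# List the rows that dropna() would remove.
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)

# Drop them and report how many rows were lost.
cleaned = toy.dropna()
print("Rows dropped:", len(toy) - len(cleaned))
```

Inspecting the dropped rows first makes it easier to judge whether removing them could bias the later plots.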
print("Describe: Summary Statistics")
print(recent_grads.describe())
# Scatter plot of "Sample_size" vs "Employed"
# Employed refers to Number employed.
# Sample_size refers to Sample size (unweighted) of full-time.
recent_grads.plot(x='Sample_size', y='Employed', kind='scatter', title="Sample_size vs Employed", figsize=(8,8))
Note: "Sample_size" is the unweighted number of respondents used to collect earnings information, so it is probably not a good indicator of how popular a major is. "Total" (the total number of people with the major) is likely the more accurate variable to use.
# Instead of using "Sample_size", maybe should use "Total" instead.
# Scatter plot of "Total" vs "Employed"
# Total number of people with major.
# Employed refers to Number employed.
recent_grads.plot(x='Total', y='Employed', kind='scatter', title="Total vs Employed", figsize=(8,8))
Note: Most records cluster around the lower values of "Total", which makes the plot hard to read. The distribution of "Total" needs a closer look.
recent_grads['Total'].hist(bins=20, range=(100,300000))
Note:
From the histogram, most values of "Total" are around 50000 or below, and the series has a long tail.
One option is to exclude the tail from the analysis, for example by keeping only the rows below the 90th percentile.
Example: reduced_df = df[df['longtail_column'] < df['longtail_column'].quantile(0.9)]
recent_grads_reduced=recent_grads[recent_grads["Total"]<recent_grads["Total"].quantile(0.9)]
recent_grads_reduced.plot(x='Total', y='Employed', kind='scatter', title="Total vs Employed (reduced dataset)", figsize=(8,8))
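To make the effect of the quantile filter concrete, here is a small sketch on invented numbers (the `toy` frame is hypothetical and only mimics a long-tailed "Total" column). It shows how a single extreme value pulls the 90th-percentile cutoff up and is then excluded by the filter.

```python
import pandas as pd

# Hypothetical long-tailed column: nine small values and one extreme value.
toy = pd.DataFrame(
    {"Total": [100, 200, 300, 400, 500, 600, 700, 800, 900, 100000]}
)

# 90th-percentile cutoff (pandas interpolates linearly by default).
cutoff = toy["Total"].quantile(0.9)
print("Cutoff:", cutoff)

# Keep only the rows strictly below the cutoff, as in the notebook.
reduced = toy[toy["Total"] < cutoff]
print("Rows kept:", len(reduced), "of", len(toy))
print("Largest remaining value:", reduced["Total"].max())
```

The 0.9 threshold is a judgment call; a different quantile would trade off how much of the tail is removed against how many majors are dropped from the plots.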
# Scatter plot of "Total" vs "Median"
# Median refers to Median salary of full-time, year-round workers.
# Total number of people with major.
recent_grads.plot(x='Total', y='Median', kind='scatter', title="Total vs Median", figsize=(8,8))
recent_grads_reduced.plot(x='Total', y='Median', kind='scatter', title="Total vs Median (reduced dataset)", figsize=(8,8))
# Scatter plot of "Total" vs "Unemployment_rate"
# Total number of people with major.
# Unemployment_rate refers to Unemployed / (Unemployed + Employed)
recent_grads.plot(x='Total', y='Unemployment_rate', kind='scatter', title="Total vs Unemployment_rate", figsize=(8,8))
recent_grads_reduced.plot(x='Total', y='Unemployment_rate', kind='scatter', title="Total vs Unemployment_rate (reduced dataset)", figsize=(8,8))
# Scatter plot of "Full_time" vs "Median"
# Median refers to Median salary of full-time, year-round workers.
# Full_time refers to employed 35 hours or more per week
recent_grads.plot(x='Full_time', y='Median', kind='scatter', title="Full_time vs Median", figsize=(8,8))
# Scatter plot of "ShareWomen" vs "Unemployment_rate"
# Unemployment_rate refers to Unemployed / (Unemployed + Employed)
# ShareWomen refers to women as a share of total graduates
recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter', title="ShareWomen vs Unemployment_rate", figsize=(8,8))
# Scatter plot of "Men" vs "Median"
# Median refers to Median salary of full-time, year-round workers.
# Men refers to Male graduates
recent_grads.plot(x='Men', y='Median', kind='scatter', title="Men vs Median", figsize=(8,8))
# Scatter plot of "Women" vs "Median"
# Median refers to Median salary of full-time, year-round workers.
# Women refers to female graduates
recent_grads.plot(x='Women', y='Median', kind='scatter', title="Women vs Median", figsize=(8,8))
Note:
Of the plots above, only the scatter plot for "Total vs Employed" shows a strong positive correlation. The other scatter plots are inconclusive.
So the only conclusion here is that the total number of graduates and the number employed for a major are highly positively correlated.
Also, note that correlation does not imply causation.
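The visual impression from the scatter plots can be backed up with a number. As a rough sketch, `DataFrame.corr()` returns pairwise Pearson correlation coefficients; the `toy` frame below is hypothetical and its values are invented, so only the method carries over to `recent_grads`, not the results.

```python
import pandas as pd

# Hypothetical values: Employed tracks Total closely, Median does not.
toy = pd.DataFrame({
    "Total":    [1000, 2000, 3000, 4000, 5000],
    "Employed": [800, 1650, 2400, 3300, 4000],
    "Median":   [40000, 62000, 35000, 58000, 45000],
})

# Pairwise Pearson correlation between the selected columns.
corr = toy[["Total", "Employed", "Median"]].corr()
print(corr.round(2))

# A coefficient near 1 indicates a strong positive linear relationship;
# a coefficient near 0 indicates little or no linear relationship.
print("Total vs Employed:", round(corr.loc["Total", "Employed"], 3))
print("Total vs Median:  ", round(corr.loc["Total", "Median"], 3))
```

On the real data, something like `recent_grads[['Total', 'Employed', 'Median', 'Unemployment_rate']].corr()` would give the equivalent table.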
recent_grads['Sample_size'].hist(bins=20, range=(0,5000))
plt.title("Distribution of Sample Size")
plt.ylabel("Frequency")
plt.xlabel("Sample_size")
recent_grads['Total'].hist(bins=20, range=(100,400000))
plt.title("Distribution of Total")
plt.ylabel("Frequency")
plt.xlabel("Total")
recent_grads['Median'].hist(bins=20, range=(20000,120000))
plt.title("Distribution of Median")
plt.ylabel("Frequency")
plt.xlabel("Median")
recent_grads['Employed'].hist(bins=20, range=(0,300000))
plt.title("Distribution of Employed")
plt.ylabel("Frequency")
plt.xlabel("Employed")
recent_grads['Full_time'].hist(bins=20, range=(100,200000))
plt.title("Distribution of Full_time")
plt.ylabel("Frequency")
plt.xlabel("Full_time")
recent_grads['ShareWomen'].hist(bins=20, range=(0,1))
plt.title("Distribution of ShareWomen")
plt.ylabel("Frequency")
plt.xlabel("ShareWomen")
recent_grads['Unemployment_rate'].hist(bins=10, range=(0,0.2))
plt.title("Distribution of Unemployment_rate")
plt.ylabel("Frequency")
plt.xlabel("Unemployment_rate")
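The histogram cells above repeat the same four lines for each column. One possible way to reduce the repetition is a loop over (column, bins, range) settings that draws each histogram on its own subplot. This is only a sketch: the `toy` frame is filled with random, invented values, and `hist_specs` is a hypothetical helper list whose settings are copied from the calls above.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical random data standing in for three recent_grads columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Sample_size": rng.integers(0, 5000, size=50),
    "Median": rng.integers(20000, 120000, size=50),
    "ShareWomen": rng.random(50),
})

# (column, bins, value range) for each histogram.
hist_specs = [
    ("Sample_size", 20, (0, 5000)),
    ("Median", 20, (20000, 120000)),
    ("ShareWomen", 20, (0, 1)),
]

# One subplot per column, each with its own title and axis labels.
fig, axes = plt.subplots(len(hist_specs), 1, figsize=(6, 10))
for ax, (column, bins, value_range) in zip(axes, hist_specs):
    toy[column].hist(bins=bins, range=value_range, ax=ax)
    ax.set_title(f"Distribution of {column}")
    ax.set_xlabel(column)
    ax.set_ylabel("Frequency")
fig.tight_layout()
```

Passing `ax=ax` keeps every histogram on its own axes, which avoids several distributions being drawn on top of each other in one figure.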
A scatter matrix plot combines both scatter plots and histograms into one grid of plots and allows us to explore potential relationships and distributions simultaneously.
from pandas.plotting import scatter_matrix
scatter_matrix(recent_grads[['Total', 'Median']], figsize=(10,10))
scatter_matrix(recent_grads[['Total', 'Median', 'Unemployment_rate']], figsize=(10,10))
scatter_matrix(recent_grads[['Women', 'Median']], figsize=(10,10))
scatter_matrix(recent_grads[['Women', 'Median', 'Part_time']], figsize=(10,10))
Note: From the plots, there seems to be no correlation between "Total", "Median" and "Unemployment_rate". However, among "Women", "Median" and "Part_time", there seems to be some positive correlation between "Women" and "Part_time". Again, note that correlation does not imply causation.
Reminder: The dataset is ranked by median earnings, so the 10 highest-paying majors are the first 10 rows of the data.
recent_grads[:10].plot.bar(x='Major', y='ShareWomen', legend=False)
plt.title("ShareWomen by Major")
plt.ylabel("ShareWomen")
# Take the last 10 rows; a fixed start position can miss a row after dropna()
recent_grads[-10:].plot.bar(x='Major', y='ShareWomen', legend=False)
plt.title("ShareWomen by Major")
plt.ylabel("ShareWomen")
Note: Based on the bar plots above, the top 10 majors (ranked by median salary) appear to be dominated by men. In fact, 7 of the top 10 majors have less than 20% female graduates. On the other hand, the bottom 10 majors appear to be dominated by women; all 10 of them have 60% or more female graduates.
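The bar plots cover only the top and bottom 10 majors. To address the question of which majors are predominantly male or female across the whole dataset, one option is to label each major by its "ShareWomen" value. The sketch below uses a hypothetical `toy` frame with invented majors and values, and the 0.5 threshold and the `Majority` column name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; majors and values are invented for illustration.
toy = pd.DataFrame({
    "Major": ["Major A", "Major B", "Major C", "Major D", "Major E"],
    "ShareWomen": [0.12, 0.35, 0.48, 0.77, 0.91],
})

# Label each major by which group makes up more than half of its graduates.
toy["Majority"] = np.where(toy["ShareWomen"] > 0.5, "female", "male")

# Count majors per label.
majority_counts = toy["Majority"].value_counts()
print(majority_counts)

# List the most skewed majors at each end of the scale.
most_male = toy.nsmallest(2, "ShareWomen")["Major"].tolist()
most_female = toy.nlargest(2, "ShareWomen")["Major"].tolist()
print("Most male-dominated:", most_male)
print("Most female-dominated:", most_female)
```

Applied to `recent_grads`, the same pattern would list the majors at either extreme without relying on the salary ranking of the rows.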
recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate', legend=False)
plt.title("Unemployment_rate by Major")
plt.ylabel("Unemployment_rate")
# Take the last 10 rows; a fixed start position can miss a row after dropna()
recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate', legend=False)
plt.title("Unemployment_rate by Major")
plt.ylabel("Unemployment_rate")
Note: Based on the bar plots above, the unemployment rates for the top 10 majors (ranked by median salary) are relatively low. However, "Nuclear Engineering" has a comparatively higher unemployment rate, followed by "Mining and Mineral Engineering" and then "Actuarial Science". For the bottom 10 majors, "Clinical Psychology" has a comparatively higher unemployment rate, followed by "Library Science" and "Other Foreign Languages".
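The question from the aim about which category of majors has the most students can be approached with a group-by aggregation. The sketch below assumes the dataset has a "Major_category" column alongside "Total"; that column name is an assumption not shown in this notebook, and the `toy` values are invented.

```python
import pandas as pd

# Hypothetical toy data; categories and totals are invented for illustration.
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Business", "Arts", "Business"],
    "Total": [2000, 1000, 50000, 4000, 30000],
})

# Sum the number of graduates per category and sort from largest to smallest.
totals_by_category = (
    toy.groupby("Major_category")["Total"]
    .sum()
    .sort_values(ascending=False)
)
print(totals_by_category)

# The first index entry is the category with the most students.
print("Largest category:", totals_by_category.index[0])
```

On the real data, `totals_by_category.plot.bar()` would give a bar plot consistent with the other plots in this analysis.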
-- the end