Visualizing Earnings Based On College Majors

A DataQuest Guided Project (www.dataquest.io)

(completed 14 Dec 2016)

Data source: The dataset includes job outcomes of students who graduated recently (age < 28) from US colleges between 2010 and 2012. The original data on job outcomes was released by the American Community Survey. FiveThirtyEight cleaned the dataset and released it on their GitHub repo.

Aim: The purpose of the analysis is to explore earnings by college major. Some related questions are:

- Do students from more popular majors make more money?
- Which majors are predominantly male? Predominantly female?
- Which category of majors has the most students?

Analysis Tool: The analysis uses Python with pandas and Matplotlib.

1) Importing Data and Initial Data Inspection

In [200]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

recent_grads = pd.read_csv('recent-grads.csv')

print("First Row Sample Record")
print(recent_grads.iloc[0])
print("")
#print("Head", recent_grads.head(10))
print("")
#print("Tail", recent_grads.tail(3))
print("")
print("Shape of data", recent_grads.shape)
print("")
print("List Columns", list(recent_grads))
First Row Sample Record
Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object



Shape of data (173, 21)

List Columns ['Rank', 'Major_code', 'Major', 'Total', 'Men', 'Women', 'Major_category', 'ShareWomen', 'Sample_size', 'Employed', 'Full_time', 'Part_time', 'Full_time_year_round', 'Unemployed', 'Unemployment_rate', 'Median', 'P25th', 'P75th', 'College_jobs', 'Non_college_jobs', 'Low_wage_jobs']
In [201]:
# Drop rows with missing values
raw_data_count=recent_grads.shape[0]
print("Raw Count", raw_data_count)
recent_grads = recent_grads.dropna()
cleaned_data_count=recent_grads.shape[0]
print("Final Cleaned Count", cleaned_data_count )
Raw Count 173
Final Cleaned Count 172
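
Before dropping rows, it can help to see which rows and columns actually contain missing values. The following is a minimal sketch on a made-up miniature frame (the `demo` name and all its values are hypothetical, not taken from recent-grads.csv); the same two expressions could be applied to `recent_grads` before calling `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; values are invented for illustration only
demo = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [1200.0, np.nan, 560.0],
    "Median": [40000, 53000, 36000],
})

# Rows that contain at least one missing value
rows_with_na = demo[demo.isnull().any(axis=1)]

# Number of missing values per column
na_counts = demo.isnull().sum()

print(rows_with_na)
print(na_counts)
```

Inspecting the affected rows first makes it clear what is lost when `dropna()` removes them.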
In [202]:
print("Describe: Summary Statistics")
print(recent_grads.describe())
Describe: Summary Statistics
             Rank   Major_code          Total            Men          Women  \
count  172.000000   172.000000     172.000000     172.000000     172.000000
mean    87.377907  3895.953488   39370.081395   16723.406977   22646.674419
std     49.983181  1679.240095   63483.491009   28122.433474   41057.330740
min      1.000000  1100.000000     124.000000     119.000000       0.000000
25%     44.750000  2403.750000    4549.750000    2177.500000    1778.250000
50%     87.500000  3608.500000   15104.000000    5434.000000    8386.500000
75%    130.250000  5503.250000   38909.750000   14631.000000   22553.750000
max    173.000000  6403.000000  393735.000000  173809.000000  307087.000000

       ShareWomen  Sample_size      Employed      Full_time      Part_time  \
count  172.000000   172.000000     172.00000     172.000000     172.000000
mean     0.522223   357.941860   31355.80814   26165.767442    8877.232558
std      0.231205   619.680419   50777.42865   42957.122320   14679.038729
min      0.000000     2.000000       0.00000     111.000000       0.000000
25%      0.336026    42.000000    3734.75000    3181.000000    1013.750000
50%      0.534024   131.000000   12031.50000   10073.500000    3332.500000
75%      0.703299   339.000000   31701.25000   25447.250000    9981.000000
max      0.968954  4212.000000  307933.00000  251540.000000  115172.000000

       Full_time_year_round    Unemployed  Unemployment_rate         Median  \
count            172.000000    172.000000         172.000000     172.000000
mean           19798.843023   2428.412791           0.068024   40076.744186
std            33229.227514   4121.730452           0.030340   11461.388773
min              111.000000      0.000000           0.000000   22000.000000
25%             2474.750000    299.500000           0.050261   33000.000000
50%             7436.500000    905.000000           0.067544   36000.000000
75%            17674.750000   2397.000000           0.087247   45000.000000
max           199897.000000  28169.000000           0.177226  110000.000000

              P25th          P75th   College_jobs  Non_college_jobs  \
count    172.000000     172.000000     172.000000        172.000000
mean   29486.918605   51386.627907   12387.401163      13354.325581
std     9190.769927   14882.278650   21344.967522      23841.326605
min    18500.000000   22000.000000       0.000000          0.000000
25%    24000.000000   41750.000000    1744.750000       1594.000000
50%    27000.000000   47000.000000    4467.500000       4603.500000
75%    33250.000000   58500.000000   14595.750000      11791.750000
max    95000.000000  125000.000000  151643.000000     148395.000000

       Low_wage_jobs
count     172.000000
mean     3878.633721
std      6960.467621
min         0.000000
25%       336.750000
50%      1238.500000
75%      3496.000000
max     48207.000000

2) Scatter Plots

In [203]:
# Scatter plot of "Sample_size" vs "Employed"
# Employed refers to Number employed.
# Sample_size refers to Sample size (unweighted) of full-time.
recent_grads.plot(x='Sample_size', y='Employed', kind='scatter', title="Sample_size vs Employed", figsize=(8,8))
Out[203]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f47497be550>

Note: "Total" is probably a better variable than "Sample_size" here. "Sample_size" is the unweighted number of respondents used to collect earnings information, so it is probably not a good indicator of how popular a major is. "Total" may be a more accurate variable to use.

In [204]:
# Instead of using "Sample_size", maybe should use "Total" instead.
# Scatter plot of "Total" vs "Employed"
# Total number of people with major.
# Employed refers to Number employed.
recent_grads.plot(x='Total', y='Employed', kind='scatter', title="Total vs Employed", figsize=(8,8))
Out[204]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f474978a0f0>

Note: Most records cluster at the lower values of "Total". This needs further checking.

In [205]:
recent_grads['Total'].hist(bins=20, range=(100,300000))
Out[205]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f47496fa080>

Note: The histogram suggests that most values of "Total" are around 50000 or below, and that the series has a long tail.
One option is to exclude the tail from the analysis, for example: reduced_df = df[df['longtail_column'] < df['longtail_column'].quantile(0.9)]

In [206]:
recent_grads_reduced=recent_grads[recent_grads["Total"]<recent_grads["Total"].quantile(0.9)]

recent_grads_reduced.plot(x='Total', y='Employed', kind='scatter', title="Total vs Employed (reduced dataset)", figsize=(8,8))
Out[206]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4749686be0>
In [207]:
# Scatter plot of "Total" vs "Median"
# Median refers to Median salary of full-time, year-round workers.
# Total number of people with major.
recent_grads.plot(x='Total', y='Median', kind='scatter', title="Total vs Median", figsize=(8,8))
recent_grads_reduced.plot(x='Total', y='Median', kind='scatter', title="Total vs Median (reduced dataset)", figsize=(8,8))
Out[207]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f474952c828>
In [208]:
# Scatter plot of "Total" vs "Unemployment_rate"
# Total number of people with major.
# Unemployment_rate refers to Unemployed / (Unemployed + Employed)
recent_grads.plot(x='Total', y='Unemployment_rate', kind='scatter', title="Total vs Unemployment_rate", figsize=(8,8))
recent_grads_reduced.plot(x='Total', y='Unemployment_rate', kind='scatter', title="Total vs Unemployment_rate (reduced dataset)", figsize=(8,8))
Out[208]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f474947b400>
In [209]:
# Scatter plot of "Full_time" vs "Median"
# Median refers to Median salary of full-time, year-round workers.
# Full_time refers to employed 35 hours or more per week
recent_grads.plot(x='Full_time', y='Median', kind='scatter', title="Full_time vs Median", figsize=(8,8))
Out[209]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f474944ab00>
In [210]:
# Scatter plot of "ShareWomen" vs "Unemployment_rate"
# Unemployment_rate refers to Unemployed / (Unemployed + Employed)
# ShareWomen refer to Women as share of total graduates
recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter', title="ShareWomen vs Unemployment_rate", figsize=(8,8))
Out[210]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4749412128>
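
The two definitions given in the code comments above (Unemployment_rate as Unemployed / (Unemployed + Employed), and ShareWomen as women's share of total graduates) can be checked by hand against the first-row sample record printed in section 1. This is only a sanity check on one row; the variable names are chosen here for illustration.

```python
# Values copied from the first-row sample record (PETROLEUM ENGINEERING)
women, total = 282, 2339
employed, unemployed = 1976, 37

# ShareWomen: women as a share of total graduates
share_women = women / total

# Unemployment_rate: Unemployed / (Unemployed + Employed)
unemployment_rate = unemployed / (unemployed + employed)

print(round(share_women, 6))
print(round(unemployment_rate, 7))
```

Both results agree with the ShareWomen (0.120564) and Unemployment_rate (0.0183805) values shown for that record.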
In [211]:
# Scatter plot of "Men" vs "Median"
# Median refers to Median salary of full-time, year-round workers.
# Men refers to Male graduates
recent_grads.plot(x='Men', y='Median', kind='scatter', title="Men vs Median", figsize=(8,8))
Out[211]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4749368828>
In [212]:
# Scatter plot of "Women" vs "Median"
# Median refers to Median salary of full-time, year-round workers.
# Women refers to female graduates
recent_grads.plot(x='Women', y='Median', kind='scatter', title="Women vs Median", figsize=(8,8))
Out[212]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f474934e630>

Note: Of the plots above, only the "Total vs Employed" scatter plot shows a strong positive correlation. The other scatter plots are not conclusive.
So the only conclusion here is that the total number of graduates and the number employed are highly positively correlated across majors. Note also that correlation does not imply causation.
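
A visual impression of correlation can be backed up with a number. The sketch below uses `DataFrame.corr()`, which computes pairwise Pearson correlation coefficients by default. The `demo` frame and its values are made up for illustration; on the real data one would call `recent_grads[['Total', 'Employed', 'Median']].corr()` instead.

```python
import pandas as pd

# Made-up values for illustration only; not taken from recent-grads.csv
demo = pd.DataFrame({
    "Total": [1000, 5000, 20000, 80000, 300000],
    "Employed": [800, 4100, 15500, 64000, 236000],
    "Median": [60000, 35000, 52000, 31000, 40000],
})

# Pairwise Pearson correlation coefficients (the default method)
corr = demo.corr()

print(corr.round(3))
```

A coefficient near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little linear relationship. The coefficient still says nothing about causation.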

3) Histogram using Series Histogram

In [213]:
recent_grads['Sample_size'].hist(bins=20, range=(0,5000))
plt.title("Distribution of Sample Size")
plt.ylabel("Frequency")
plt.xlabel("Sample_size")
Out[213]:
<matplotlib.text.Text at 0x7f47493f77b8>
In [214]:
recent_grads['Total'].hist(bins=20, range=(100,400000))
plt.title("Distribution of Total")
plt.ylabel("Frequency")
plt.xlabel("Total")
Out[214]:
<matplotlib.text.Text at 0x7f4749265940>
In [215]:
recent_grads['Median'].hist(bins=20, range=(20000,120000))
plt.title("Distribution of Median")
plt.ylabel("Frequency")
plt.xlabel("Median")
Out[215]:
<matplotlib.text.Text at 0x7f4749142e10>
In [216]:
recent_grads['Employed'].hist(bins=20, range=(0,300000))
plt.title("Distribution of Employed")
plt.ylabel("Frequency")
plt.xlabel("Employed")
Out[216]:
<matplotlib.text.Text at 0x7f47490db320>
In [217]:
recent_grads['Full_time'].hist(bins=20, range=(100,200000))
plt.title("Distribution of Full_time")
plt.ylabel("Frequency")
plt.xlabel("Full_time")
Out[217]:
<matplotlib.text.Text at 0x7f474914fcf8>
In [218]:
recent_grads['ShareWomen'].hist(bins=20, range=(0,1))
plt.title("Distribution of ShareWomen")
plt.ylabel("Frequency")
plt.xlabel("ShareWomen")
Out[218]:
<matplotlib.text.Text at 0x7f4749878518>
In [219]:
recent_grads['Unemployment_rate'].hist(bins=10, range=(0,0.2))
plt.title("Distribution of Unemployment_rate")
plt.ylabel("Frequency")
plt.xlabel("Unemployment_rate")
Out[219]:
<matplotlib.text.Text at 0x7f4749d28748>

4) Scatter Matrix Plot

A scatter matrix plot combines both scatter plots and histograms into one grid of plots and allows us to explore potential relationships and distributions simultaneously.

In [220]:
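# scatter_matrix lives in pandas.plotting; the older pandas.tools.plotting path was deprecated and later removed
from pandas.plotting import scatter_matrix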

scatter_matrix(recent_grads[['Total', 'Median']], figsize=(10,10))
scatter_matrix(recent_grads[['Total', 'Median', 'Unemployment_rate']], figsize=(10,10))
Out[220]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f474e0d27b8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f474e2fd1d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f474a8d2128>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f474a217668>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f474d3d9d30>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f4749ed2cc0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f474901cef0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f4748febb00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f4748fa7a58>]], dtype=object)
In [221]:
scatter_matrix(recent_grads[['Women', 'Median']], figsize=(10,10))
scatter_matrix(recent_grads[['Women', 'Median', 'Part_time']], figsize=(10,10))
Out[221]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f4748ca0cf8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f4748e27f28>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f4748c260b8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f4748c2f630>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f4748bab6d8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f4748b781d0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f4748b339e8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f4748b03400>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f4748abf550>]], dtype=object)

Note: The plots show no apparent correlation between "Total", "Median" and "Unemployment_rate". However, among "Women", "Median" and "Part_time", there seems to be some positive correlation between "Women" and "Part_time". Again, note that correlation does not imply causation.
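
Count columns such as "Women" and "Part_time" are heavily skewed (their means in the summary statistics sit well above their medians), so a rank-based measure is one possible way to check the relationship without letting a few very large majors dominate. The sketch below computes a Spearman rank correlation as the Pearson correlation of the ranks. The `demo` frame and its values are made up for illustration only.

```python
import pandas as pd

# Made-up values for illustration only; not taken from recent-grads.csv
demo = pd.DataFrame({
    "Women": [10, 100, 1000, 10000, 100000],
    "Part_time": [5, 30, 400, 2500, 40000],
    "Median": [50000, 30000, 45000, 28000, 36000],
})

# Ordinary (Pearson) correlation on the raw values
pearson = demo.corr()

# Spearman rank correlation: Pearson correlation computed on the ranks
spearman = demo.rank().corr()

print(spearman.round(2))
```

Ranking first means that only the ordering of the majors matters, which is often a reasonable choice for long-tailed counts.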

5) Bar Plots

Reminder: The dataset is ordered by rank of median earnings, so the 10 highest-paying majors are the first 10 rows of the data.

In [222]:
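# First 10 rows: the 10 highest-ranked majors by median earnings
recent_grads[:10].plot.bar(x='Major', y='ShareWomen', legend=False)
plt.title("ShareWomen by Major (top 10)")
plt.ylabel("ShareWomen")
# Last 10 rows; after dropna() only 172 rows remain, so [163:] would return just 9
recent_grads.tail(10).plot.bar(x='Major', y='ShareWomen', legend=False)
plt.title("ShareWomen by Major (bottom 10)")
plt.ylabel("ShareWomen")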
Out[222]:
<matplotlib.text.Text at 0x7f47488d9198>

Note: Based on the bar plots above, the top 10 majors (ranked by median salary) appear to be male-dominated. In fact, 7 of the top 10 majors have fewer than 20% female graduates. The bottom 10 majors, on the other hand, appear to be female-dominated; all 10 have at least 60% female graduates.
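
Counts like the ones quoted in the note above can be read off the data directly rather than from the bar heights. The sketch below filters on the 20% and 60% thresholds mentioned in the note. The `demo` frame, its major names and its values are hypothetical; on the real data the same expressions could be applied to `recent_grads[:10]` or `recent_grads.tail(10)`.

```python
import pandas as pd

# Hypothetical majors and shares, invented for illustration only
demo = pd.DataFrame({
    "Major": ["M1", "M2", "M3", "M4", "M5"],
    "ShareWomen": [0.12, 0.85, 0.48, 0.19, 0.67],
})

# Majors where fewer than 20% of graduates are women, lowest share first
mostly_male = demo[demo["ShareWomen"] < 0.2].sort_values("ShareWomen")

# Majors where at least 60% of graduates are women, highest share first
mostly_female = demo[demo["ShareWomen"] >= 0.6].sort_values("ShareWomen", ascending=False)

# Number of majors below the 20% threshold
n_mostly_male = int((demo["ShareWomen"] < 0.2).sum())

print(mostly_male)
print(mostly_female)
print(n_mostly_male)
```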

In [223]:
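# First 10 rows: the 10 highest-ranked majors by median earnings
recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate', legend=False)
plt.title("Unemployment_rate by Major (top 10)")
plt.ylabel("Unemployment_rate")
# Last 10 rows; after dropna() only 172 rows remain, so [163:] would return just 9
recent_grads.tail(10).plot.bar(x='Major', y='Unemployment_rate', legend=False)
plt.title("Unemployment_rate by Major (bottom 10)")
plt.ylabel("Unemployment_rate")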
Out[223]:
<matplotlib.text.Text at 0x7f47487698d0>

Note: Based on the bar plots above, the unemployment rate for the top 10 majors (ranked by median salary) is relatively low. However, "Nuclear Engineering" has a comparatively high unemployment rate, followed by "Mining and Mineral Engineering" and then "Actuarial Science". Among the bottom 10 majors, "Clinical Psychology" has a comparatively high unemployment rate, followed by "Library Science" and "Other Foreign Languages".
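
The third question from the aim (which category of majors has the most students) is not covered by the plots above. One possible approach, sketched here on made-up numbers, is to sum "Total" within each "Major_category" and sort the result. The `demo` frame and its values are hypothetical; the column names are the ones listed in section 1.

```python
import pandas as pd

# Hypothetical rows; category labels and totals are for illustration only
demo = pd.DataFrame({
    "Major_category": ["Engineering", "Business", "Engineering", "Education", "Business"],
    "Total": [2000, 30000, 5000, 12000, 8000],
})

# Sum the number of graduates per category, largest first
totals = (
    demo.groupby("Major_category")["Total"]
    .sum()
    .sort_values(ascending=False)
)

print(totals)
```

On the real data, `totals.plot.bar()` would give a bar plot in the same style as the ones in this section.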

-- the end