*Mayfield High School junior Laura Cruz, 18, holds up a sign March 2, 2015, during a student-organized walkout to protest the PARCC exams in Las Cruces, N.M.*
There's a remarkable lack of substantive analysis of public schooling data in the United States, especially in lower-income areas like Baltimore. This is unfortunate, given how crucial this information is, both for the well-being of students and for the direction of our society.
As part of a two-year initiative, Maryland's state education department has finally released star ratings to rank the quality of its public elementary, middle, and high schools. The star ratings were required by a federal law, the Every Student Succeeds Act (ESSA), which replaced the more punitive No Child Left Behind Act. They are meant to help identify and diagnose struggling schools within Maryland, with every school receiving a one-star rating having to come up with a plan of action. According to the Baltimore Sun, "The bottom 5 percent of schools in the state — including some two-star schools — will be identified as needing an intervention plan for improvement. Most of those schools are in Baltimore City, where 23 schools earned one star."
These star ratings take standardized test scores into account; in fact, almost 65% of the evaluation is based on them. The star rating is a step in the right direction, having only been released on December 4, 2018, but the majority of what we know about the schools in each area still comes from their standardized test scores. In Maryland, that test is the PARCC (Partnership for Assessment of Readiness for College and Careers).
We wanted to see whether income and test scores by county in Maryland have a linear relationship, in order to judge whether using blanket PARCC scores to rate schools is an accurate way of describing student potential. Additionally, we wanted to map the disparities in both test scores and income to try to predict problem areas, see whether they spread out to nearby counties, and check whether we get location results similar to the star ratings.
Our initial hypothesis: family income, even in the public school system, is so closely tied to the quality of education received that PARCC scores (assuming it is a well-designed exam) will have a positive linear correlation with family income.
Obviously, the PARCC is not a perfect indicator for how well a student is doing, but it is a standard metric that we can at least attempt to use as justification.
The model doesn't take into account other factors, such as race, gender, or English-speaking vs. non-English-speaking households, because we didn't think it necessary to fit those against school performance (for both ethical and time-related reasons).
Finally, we'll be using data from 2010 through 2017, inclusive. This brings up a few issues that we'll address later, especially the fact that Maryland switched from the MSA (Maryland State Assessment) system to PARCC (Partnership for Assessment of Readiness for College and Careers) after 2014.
This tutorial will use data from:
The Maryland State Department of Education
The United States Census Bureau
Here is a list of tools we'll be using:
import pandas as pd
# We'll be using pandas as our primary means of data manipulation and management.
import numpy as np
import matplotlib.pyplot as plt
# We'll be using pyplot for graphing data across time.
import folium
# folium is an expansive data visualization library, but we'll mostly be using it for its choropleth feature.
import requests
import os
import seaborn as sns
from statsmodels.formula.api import ols
# ordinary least squares models
from statsmodels.stats.anova import anova_lm
First things first, we've got to gather our data. We did this manually, going to the sites linked above and downloading the CSV files into our environment. It's not that many files, so all things considered, it's faster than trying to script the process.
Once we've gotten that out of the way, we can use one of pandas' (many) helpful features and turn these CSVs straight into dataframes. From 2010 to 2014, we reference the MSA data files. From 2015 to 2017, we use PARCC. We'll keep lists of each individual year's dataframe.
grade_data_msa = []
grade_data_parcc = []
for year in range(2010, 2015):
    grade_data_msa.append(pd.read_csv('MSA_' + str(year) + '.csv'))
for year in range(2015, 2018):
    grade_data_parcc.append(pd.read_csv('PARCC_' + str(year) + '.csv'))
Now that we've got the data, we want to put all of the years together. Luckily, pandas has another great feature that lets us do this cleanly.
msaAllYears = pd.concat(grade_data_msa, sort=False)
parccAllYears = pd.concat(grade_data_parcc, sort=False)
Time to investigate these dataframes to see what we're working with. First up, MSA:
msaAllYears.head()
Not horrible, but it needs a fair amount of cleaning. First of all, it has a bunch of columns we don't care about. Let's drop them.
msaAllYears = msaAllYears.drop(columns = ['LEA Number', 'School Number', 'School Name', 'Test Type Title',
'Grade Title', 'Grade', 'Create Date', 'Test Type'])
msaAllYears = msaAllYears[msaAllYears.columns.drop(list(msaAllYears.filter(regex='Percent')))] # little shortcut for dropping every column whose name contains 'Percent'
msaAllYears.head()
Already a lot easier to parse, visually, but there's still plenty to do.
I don't like those asterisks, and I'm planning on summing over those columns in a bit. Let's set them to 0s, so we can forget about them.
msaAllYears = msaAllYears.replace('*', '0')
Also, "LEA Name" isn't super intuitive, when really we're looking at counties. Renaming that column seems smart.
msaAllYears = msaAllYears.rename(index=str, columns={"LEA Name": "County"})
Lastly, we're considering Maryland schools across county, so I don't really want to look at SEED (a DC charter school) or the state as a whole. Goodbye to those rows.
msaAllYears = msaAllYears[msaAllYears['County'] != 'SEED']
msaAllYears = msaAllYears[msaAllYears['County'] != 'Seed School LEA']
msaAllYears = msaAllYears[msaAllYears['County'] != 'All Public Schools']
Let's see where we're at now. (I'm printing the entire thing for a reason)
msaAllYears
We're getting there. Time to deal with all those NaNs.
If you'll notice, every row seems to have values in either "Subject Title" $\textbf{or}$ "Subject", and in either "Test Member Count" $\textbf{or}$ "Tested Count". Let's make sure this is the case.
msaAllYears = msaAllYears.replace(np.nan, '')
count = 0
for index, row in msaAllYears.iterrows():
    if (row['Subject Title'] == '' and row['Subject'] == '') \
            or (row['Test Member Count'] == '' and row['Tested Count'] == '') \
            or (row['Subject Title'] != '' and row['Subject'] != '') \
            or (row['Test Member Count'] != '' and row['Tested Count'] != ''):
        print('Error at index: ' + str(index))
...
Okay we're good! Let's go ahead and fill in the respective columns, then drop the extra ones.
msaAllYears = msaAllYears.replace('', np.nan)
msaAllYears['Subject'].fillna(msaAllYears['Subject Title'], inplace=True)
msaAllYears['Tested Count'].fillna(msaAllYears['Test Member Count'], inplace=True)
msaAllYears = msaAllYears.drop(columns=['Subject Title', 'Test Member Count'])
Unfortunately, the PARCC testing system doesn't include science exams, so we'll be discarding those in the MSA data for consistency.
msaAllYears = msaAllYears[msaAllYears['Subject'] != 'Science']
Let's convert everything to integers so we can do some math.
msaAllYears['Tested Count'] = msaAllYears['Tested Count'].astype(int)
msaAllYears['Advanced Count'] = msaAllYears['Advanced Count'].astype(int)
msaAllYears['Proficient Count'] = msaAllYears['Proficient Count'].astype(int)
msaAllYears['Basic Count'] = msaAllYears['Basic Count'].astype(int)
Time to put everything together so we can graph it more easily later.
msaAllYears = msaAllYears.groupby(['Academic Year', 'County', 'Subject']).sum()
msaAllYears = msaAllYears.reset_index()
msaAllYears.head()
Okay, we're starting to see real information. Now that we have a cumulative count, let's get back those percentages we dumped earlier.
msaAllYears['Adv Ratio'] = msaAllYears['Advanced Count'] / msaAllYears['Tested Count']
msaAllYears['Prof Ratio'] = msaAllYears['Proficient Count'] / msaAllYears['Tested Count']
msaAllYears['Basic Ratio'] = msaAllYears['Basic Count'] / msaAllYears['Tested Count']
msaAllYears = msaAllYears.drop(columns=['Advanced Count', 'Proficient Count', 'Basic Count'])
msaAllYears.head()
Alright, so let's take a second to gather ourselves. We want to graph this data over time - how?
Well, we need to have some sort of objective metric of performance for each county, in each year, in each subject. As it stands, we just have a bunch of percentages.
Here's the idea. We take the difference between the advanced percentage and the basic percentage. We can ignore the number in the middle and consider those students "neutral". So if a county has a lot of high-performing students relative to low-performing ones, the number will be positive, and negative for low-performing counties. Then, we'll add 1 to make all these numbers positive (with the already-positive numbers jumping over 1), and multiply by 100 to make the numbers more readable.
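As a quick sanity check with made-up numbers: a county-year with 30% advanced and 20% basic works out to $(0.30 - 0.20 + 1) \times 100 = 110$, while one with 20% advanced and 40% basic works out to $(0.20 - 0.40 + 1) \times 100 = 80$. Here's what we get: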
msaAllYears['Performance'] = (msaAllYears['Adv Ratio'] - msaAllYears['Basic Ratio'] + 1) * 100
msaAllYears = msaAllYears[msaAllYears.columns.drop(list(msaAllYears.filter(regex='Ratio')))]
msaAllYears = msaAllYears.drop(columns=['Tested Count'])
msaAllYears.head()
Looks good, and graphable! Now to split into two dataframes, one for math and one for language. If there's an easy way to do this in pandas, I haven't found it. Fortunately, it's not that hard to do manually, and it's good coding practice anyway!
First let's get a list of all the counties, and initialize a dictionary for each subject.
counties = msaAllYears['County'].unique()
readingDict = {}
mathDict = {}
for c in counties:
    readingDict[c] = []
    mathDict[c] = []
Okay, now here's the idea. We want to take every row that's assigned to each subject and append its performance value to the associated dictionary's county entry.
Example Row: 2011, Montgomery, Math, 103
This will go to the math dictionary, look for the Montgomery key, and then add the performance 103. Because this table is stored chronologically, we don't have to worry about the year - it'll be in the right place.
for index, row in msaAllYears.iterrows():
    if row['Subject'] == 'Reading':
        readingDict[row['County']].append(row['Performance'])
    else:
        mathDict[row['County']].append(row['Performance'])
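(As an aside, pandas' pivot_table can get close to the same result in a couple of lines. Here's a sketch, assuming the column names above; it isn't used anywhere below, and we stick with the explicit dictionaries for the rest of the tutorial.)
# Hypothetical alternative, not used below: pivot each subject straight into a year-by-county table
readingPivot = msaAllYears[msaAllYears['Subject'] == 'Reading'].pivot_table(index='Academic Year', columns='County', values='Performance')
mathPivot = msaAllYears[msaAllYears['Subject'] != 'Reading'].pivot_table(index='Academic Year', columns='County', values='Performance')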
Let's convert back to see the fruits of our labor.
msaAllYearsReading = pd.DataFrame(readingDict, index = range(2010, 2015))
msaAllYearsMath = pd.DataFrame(mathDict, index = range(2010, 2015))
msaAllYearsReading.head()
Looks pretty nice. Finally time to graph it!
We'll split the counties in half so each graph isn't too cluttered.
msaAllYearsReading.iloc[:, :12].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Reading Exam Performance")
msaAllYearsReading.iloc[:, 12:].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Reading Exam Performance")
msaAllYearsMath.iloc[:, :12].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Math Exam Performance")
msaAllYearsMath.iloc[:, 12:].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Math Exam Performance")
Some brief analysis before we continue - Baltimore City is very clearly the lowest performing county in both subjects. If you're familiar with the area this probably isn't surprising. We'll further explore this later.
Apart from Baltimore City, most counties are fairly in line, with more minor exceptions like Howard and Dorchester.
Moving on to the PARCC year records. Luckily, these start off a bit cleaner, so we have less work to do.
In general, if I did something very similar with the MSA data, I won't repeat myself here. Let's take a look.
parccAllYears.head()
This one is significantly easier to manage. All we need to do is drop the columns we won't be using, then group by year and county, along with the other minor changes we made last time.
parccAllYears = parccAllYears.drop(columns=['LEA Number', 'School Number', 'School Name', 'Create Date'])
parccAllYears = parccAllYears.rename(index=str, columns={"LEA Name": "County"})
parccAllYears = parccAllYears[parccAllYears.columns.drop(list(parccAllYears.filter(regex='Percent')))]
parccAllYears = parccAllYears[parccAllYears['County'] != 'SEED']
parccAllYears = parccAllYears[parccAllYears['County'] != 'Seed School LEA']
parccAllYears = parccAllYears[parccAllYears['County'] != 'All Public Schools']
parccAllYears = parccAllYears.replace('*', '0')
To make things consistent with the MSA data, we're going to categorize exams as either ELA (English Language Arts) or math. As you may be able to tell, once students get to high school the exam names change from "Mathematics Grade X" to Algebra, Geometry, etc. Therefore, we'll treat anything that has the word "English" in it as an ELA exam, and anything without it as a math exam.
parccAllYears.loc[parccAllYears['Assessment'].str.contains('English'), 'Assessment'] = 'ELA'
parccAllYears.loc[~parccAllYears['Assessment'].str.contains('ELA'), 'Assessment'] = 'Math'
Once again, we'll convert the relevant columns to integers so we can do math stuff, and group them appropriately.
parccAllYears['Tested Count'] = parccAllYears['Tested Count'].astype(int)
parccAllYears[list(parccAllYears.filter(regex='Level'))] = parccAllYears[list(parccAllYears.filter(regex='Level'))].astype(int)
parccAllYears = parccAllYears.groupby(['Academic Year', 'County', 'Assessment']).sum()
parccAllYears = parccAllYears.reset_index()
Same as before: convert test counts into test ratios, then use a little shortcut to drop anything with the word "Level" in order to get rid of the counts.
parccAllYears['1 Ratio'] = parccAllYears['Level 1: Did not yet meet expectations - Count'] / parccAllYears['Tested Count']
parccAllYears['2 Ratio'] = parccAllYears['Level 2: Partially met expectations - Count'] / parccAllYears['Tested Count']
parccAllYears['3 Ratio'] = parccAllYears['Level 3: Approached expectations - Count'] / parccAllYears['Tested Count']
parccAllYears['4 Ratio'] = parccAllYears['Level 4: Met expectations - Count'] / parccAllYears['Tested Count']
parccAllYears['5 Ratio'] = parccAllYears['Level 5: Exceeded expectations - Count'] / parccAllYears['Tested Count']
parccAllYears = parccAllYears[parccAllYears.columns.drop(list(parccAllYears.filter(regex='Level')))]
parccAllYears.head()
Here is the metric of performance we'll use for the PARCC data. Similar to the MSA metric, we consider anything below the median score (in this case, level 3) to be negative and anything above it to be positive. We double the weight of levels 1 and 5 to account for the extremes, then shift and scale to make the result more readable.
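In symbols, writing $r_i$ for the share of tested students who scored at level $i$, the metric below is $\text{Performance} = (-2r_1 - r_2 + r_4 + 2r_5 + 1) \times 100$.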
parccAllYears['Performance'] = (-2*parccAllYears['1 Ratio'] - parccAllYears['2 Ratio'] + \
parccAllYears['4 Ratio'] + 2*parccAllYears['5 Ratio'] + 1) * 100
parccAllYears = parccAllYears[parccAllYears.columns.drop(list(parccAllYears.filter(regex='Ratio')))]
parccAllYears.head()
Back to the same convert-then-plot paradigm.
counties = parccAllYears['County'].unique()
elaDict = {}
mathDict = {}
for c in counties:
    elaDict[c] = []
    mathDict[c] = []
for index, row in parccAllYears.iterrows():
    if row['Assessment'] == 'ELA':
        elaDict[row['County']].append(row['Performance'])
    else:
        mathDict[row['County']].append(row['Performance'])
parccAllYearsELA = pd.DataFrame(elaDict, index = range(2015, 2018))
parccAllYearsMath = pd.DataFrame(mathDict, index = range(2015, 2018))
parccAllYearsELA.iloc[:, :12].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="ELA Exam Performance")
parccAllYearsELA.iloc[:, 12:].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="ELA Exam Performance")
parccAllYearsMath.iloc[:, :12].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Math Exam Performance")
parccAllYearsMath.iloc[:, 12:].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Math Exam Performance")
Now let's put it all together, accounting for the change in exam structure across years. Again, pandas saves the day.
allYearsLanguage = pd.concat([msaAllYearsReading, parccAllYearsELA], sort=True)
allYearsMath = pd.concat([msaAllYearsMath, parccAllYearsMath], sort=True)
allYearsMath
That's some mighty attractive data. Let's take a better look. We'll draw a line down the year marker where the exams changed from MSA to PARCC.
plt1 = allYearsLanguage.iloc[:, :12].plot.line(figsize=(15,8))
plt2 = allYearsLanguage.iloc[:, 12:].plot.line(figsize=(15,8))
plt1.set(xlabel="Year", ylabel="Language Exam Performance")
plt2.set(xlabel="Year", ylabel="Language Exam Performance")
plt1.axvline(2015, color='black', linewidth=0.2, label='change')
plt2.axvline(2015, color='black', linewidth=0.2)
plt1 = allYearsMath.iloc[:, :12].plot.line(figsize=(15,8))
plt2 = allYearsMath.iloc[:, 12:].plot.line(figsize=(15,8))
plt1.set(xlabel="Year", ylabel="Math Exam Performance")
plt2.set(xlabel="Year", ylabel="Math Exam Performance")
plt1.axvline(2015, color='black', linewidth=0.2, label='change')
plt2.axvline(2015, color='black', linewidth=0.2)
What do we see here?
Well, the first thing you'll notice is that the change in exams noticeably altered the trajectory of nearly every county. Even Baltimore City fared better, and stopped declining as harshly in performance. The best, like Howard, stay at the top. This is not surprising: testing systems may favor slightly different things, but certain resource advantages will always show their hand in results like these. We'll explore what those resources may be in a bit.
Our goal here is to further explore the distribution of exam success across Maryland, this time focusing on one year.
We'll repeat our collection process, and then analyze not only the mathematical distribution of test scores (using the performance variable we created in the previous section), but also their geographic distribution, in order to see whether counties closer to each other have similar scores, and to relate them to income later on.
PARCC2017 = pd.read_csv("PARCC_2017.csv")
PARCC2017 = PARCC2017.drop(columns=['LEA Number', 'School Number', 'School Name', 'Create Date', 'Assessment', 'Academic Year'])
PARCC2017 = PARCC2017[PARCC2017.columns.drop(list(PARCC2017.filter(regex='Percent')))]
PARCC2017 = PARCC2017.replace('*', '0')
PARCC2017['Tested Count'] = PARCC2017['Tested Count'].astype(int)
PARCC2017[list(PARCC2017.filter(regex='Level'))] = PARCC2017[list(PARCC2017.filter(regex='Level'))].astype(int)
PARCC2017['1 Ratio'] = PARCC2017['Level 1: Did not yet meet expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['2 Ratio'] = PARCC2017['Level 2: Partially met expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['3 Ratio'] = PARCC2017['Level 3: Approached expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['4 Ratio'] = PARCC2017['Level 4: Met expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['5 Ratio'] = PARCC2017['Level 5: Exceeded expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['Performance'] = (-2*PARCC2017['1 Ratio'] - PARCC2017['2 Ratio'] + \
PARCC2017['4 Ratio'] + 2*PARCC2017['5 Ratio'] + 1) * 100
PARCC2017Performance = PARCC2017.copy(deep = True)
distplot = sns.violinplot(x = 'LEA Name', y = 'Performance', data = PARCC2017.iloc[:4085, :]).set_title("Distribution of Student Performance by County")
distplot.figure.set_size_inches(20, 5)
slowplot = sns.violinplot(x = 'LEA Name', y = 'Performance', data = PARCC2017.iloc[4085:, :]).set_title("Distribution of Student Performance by County (Cont.)")
slowplot.figure.set_size_inches(20, 5)
If we look at the distribution of the performance, we'll find that they vary greatly in shape, range and glut (the thick areas).
A skinnier shape indicates a more even distribution of scores, while a shorter, fatter shape shows that more students fell within a certain performance range.
A taller range means that excellent students and terrible students have a larger disparity, and vice versa for a shorter range.
A glut's location indicates where more students are falling in terms of performance.
Some interesting happenings:
Our county (Prince George's) has a pretty normal-looking distribution with an extremely wide range, indicating that it's otherwise normally distributed, but its excellent students are very excellent compared to its worst students.
Montgomery, Howard, Anne Arundel and Calvert are all relatively skinny, as well as the highest performing counties.
SEED and the all-public-schools aggregate have similar distributions, but flipped. The glut placement might dictate the shape of the rest, because both are relatively short and fat.
PARCC2017 = PARCC2017[PARCC2017.columns.drop(list(PARCC2017.filter(regex='Ratio|Performance')))]
PARCC2017 = PARCC2017.groupby('LEA Name').agg(['sum'])
PARCC2017['Lower Ratio'] = (PARCC2017['Level 1: Did not yet meet expectations - Count'] + PARCC2017['Level 2: Partially met expectations - Count']) / PARCC2017['Tested Count']
PARCC2017['Middle Ratio'] = PARCC2017['Level 3: Approached expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['Upper Ratio'] = (PARCC2017['Level 4: Met expectations - Count'] + PARCC2017['Level 5: Exceeded expectations - Count']) / PARCC2017['Tested Count']
PARCC2017 = PARCC2017.reset_index()
PARCC2017.head()
We added three columns to make sure that we could accurately compare counties: Upper, Middle, and Lower Ratio. The Upper and Lower Ratios are calculated by taking all the reported 4 and 5 scores (1 and 2 scores for the Lower) and dividing them by the total number of students who took the exam in that county.
On a slight tangent, we'd like to note that every county (without exception) under-reports its data. This shouldn't be a large issue (more than 97% of results are reported in every county), but it's still worth mentioning, as it could affect some of our numbers:
PARCC2017['Percent Reported'] = PARCC2017['Lower Ratio'] + PARCC2017['Middle Ratio'] + PARCC2017['Upper Ratio']
PARCC2017['Percent Reported']
These are far from bad numbers, it's just slightly funny to see that the overall reporting is still listed as 100%. Moving on.
Let's get rid of those not-counties again.
PARCC2017 = PARCC2017.drop(PARCC2017.index[0])
PARCC2017 = PARCC2017.drop(PARCC2017.index[18])
Time to do some real visualization, strap in.
In order to better visualize the disparity by county and see if there is any spillover, we decided to plot some choropleths to help us see which areas have outstanding PARCC scores (based on 2017 numbers) and which areas are hurting the most.
import warnings;
warnings.simplefilter('ignore')
map_parcc1 = folium.Map(location=[38.9072, -77.0369],tiles = 'cartodbpositron', zoom_start=8)
map_parcc2 = folium.Map(location=[38.9072, -77.0369],tiles = 'cartodbpositron', zoom_start=8)
geo_dist = os.path.join('MD-24-maryland-counties.json')
map_parcc1.choropleth(
geo_data=geo_dist,
name='choropleth',
data= PARCC2017,
columns=['LEA Name', 'Upper Ratio'],
key_on='feature.properties.NAME',
fill_color='YlGn',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Student Rate of Above Average PARCC Scores in 2017'
)
map_parcc2.choropleth(
geo_data=geo_dist,
name='choropleth',
data= PARCC2017,
columns=['LEA Name', 'Lower Ratio'],
key_on='feature.properties.NAME',
fill_color='OrRd',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Student Rate of Below Average PARCC Scores in 2017'
)
As you can see for the above-average scores, there seems to be some geographic influence: four of the darkest regions are all next to each other, and the schools that seem to be having the most trouble surround two of Maryland's biggest and most populous urban areas: DC (the state doesn't actually contain DC, but the counties around it do border it) and Baltimore.
map_parcc1
For the below-average scores, the map is consistent with the hypothesis above. Urban areas seem to be the most affected, and the areas that were severely lacking in high-performing students mostly seem to make up for it with a glut of lower-performing students. Counties with low shares of both high and low performers are explained by average students (like Caroline County), which is consistent with our model.
map_parcc2
Let's gather some data about Maryland family incomes and see if we can find some correlations with the test data. We want to plot it in another choropleth map so we can see whether the differences in median family income (households with children) in 2017 line up with the 2017 PARCC test scores by county.
We are going to use the Maryland census data from 2017 to obtain this info; see the link above.
Census2017 = pd.read_csv("ACS_17_5YR_DP03_with_ann.csv")
Census2017.head()
This is gross, but luckily we need very little of it. Let's slide that first row up into the header slot so it's more easily accessible, and dump everything we don't care about (anything that isn't about the counties or income).
new_header = Census2017.iloc[0]
Census2017 = Census2017[1:]
Census2017.columns = new_header
Census2017 = Census2017.filter(regex='(Median family income|Geography)')
Census2017.head()
Better, but still a little too much. We also need to fix up the data so that it corresponds with the GeoJSON county names.
We want to use a robust statistic to measure the difference in 'average' wealth, so we settled on Median Family Annual Income (MFAI): we wouldn't have to worry about outliers, and we'd get a better picture of the counties' respective wealth without worrying about population size and distribution (data we don't have and can't analyze).
Census2017 = Census2017[Census2017.columns.drop(list(Census2017.filter(regex='(Percent|Margin)')))]
Census2017 = Census2017.rename(index=str, columns ={"Estimate; INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - Families - Median family income (dollars)" :"Median Family Income",
"Geography" : "County"})
Census2017 ['County'] = Census2017['County'].str.replace(r'(County)?, Maryland', '').str.strip()
Census2017 ['County'] = Census2017['County'].str.replace('city', 'City')
Census2017 ['County'] = Census2017['County'].str.replace('St.', 'Saint')
Census2017.at['3', 'County'] = 'Baltimore County'
Census2017 ['Median Family Income'] = Census2017 ['Median Family Income'].astype(int)
Census2017
We will take the data and put it into another choropleth in the hope of comparing the intensities of key counties.
import warnings;
warnings.simplefilter('ignore')
map_medincome = folium.Map(location=[38.9072, -77.0369], tiles = 'cartodbpositron', zoom_start=8)
geo_dist = os.path.join('MD-24-maryland-counties.json')
map_medincome.choropleth(
geo_data=geo_dist,
name='choropleth',
data= Census2017,
columns=['County', 'Median Family Income'],
key_on='feature.properties.NAME',
fill_color='BuPu',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Family Household Median Annual Income in 2017 by County '
)
map_medincome
If you compare this choropleth to the earlier choropleth of higher scores, there seems to be some correlation. The regions with higher median family income tend to be the highest-performing regions: Montgomery, Howard, Anne Arundel, and Calvert all rank in the upper percentiles according to our previous choropleth.
Conversely, the lowest-performing county is by far the one with the lowest median income. The same goes for Prince George's.
Takeaways:
An interesting county is Worcester, which unexpectedly has a low median income but some of the highest performance of all the counties. We believe the key might be in the distribution: Worcester has a high average but a low range, and its violin plot is short and fat.
That is perhaps an indicator that Worcester's income distribution might also be short and fat, and that the school system may be run in a way that pushes all students to exceed, rather than relying on families having the money for classes outside of school.
Another interesting takeaway: the high-income counties above are also the regions with the widest student performance ranges (see the violin plots above). So it seems as if the areas with the highest median income also have higher peaks and lower valleys. The distributions for all of those areas are skinny as well, indicating that there isn't much of a glut around the average (not a normal distribution).
At this point we have already looked at the income and test score (PARCC) data for 2017, so we want to see if we can find a relationship between the two using OLS, and then see if we can fit a model that speaks to our hypothesis that test scores are too closely related to income to be weighted so heavily in the star ratings.
Using the Performance statistic we created, we will run a linear regression and analyze the R-squared and the p-values to better fit the model.
fulldf = Census2017.merge(PARCC2017Performance.groupby('LEA Name').agg({'Performance': 'sum'}), left_on='County', right_on='LEA Name', how='inner')
fulldf = fulldf[fulldf.columns.drop(list(fulldf.filter(regex='Level|Tested|Ratio')))]
fulldf = fulldf.rename(index=str, columns={'Median Family Income': 'Median_Family_Income'})
fulldf
We stitch the two together by county and remove the unnecessary columns, so that we can use this to plot our linear regression.
When we fit our regression, we want to make note of the F-statistic and make sure it is larger than the critical F value (equivalently, that its p-value is small). If so, we can reject the null hypothesis and declare that this model shows some semblance of positive linearity.
We want a small p-value for our coefficient and a high R-squared (we don't have to worry about the adjusted version in simple linear regression).
relmodel = ols(formula='Performance ~ Median_Family_Income', data=fulldf).fit()
relanova = anova_lm(relmodel)
print(relmodel.summary())
print(relanova)
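If you just want the headline numbers rather than the whole summary table, the fitted results object exposes them directly (these are standard statsmodels result attributes):
print(relmodel.fvalue, relmodel.f_pvalue)  # F-statistic and its p-value
print(relmodel.rsquared)                   # R-squared
print(relmodel.pvalues)                    # per-coefficient p-values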
inc_perf = sns.regplot(x = "Median_Family_Income", y = "Performance", data = fulldf)
inc_perf
Model issues: the p-values for the intercept and coefficient look okay, and since we aren't doing MLR we don't have to worry about multiple p-values, VIFs, or multicollinearity (even though we get that warning).
The data, when plotted, has a relatively low R-squared, which suggests it needs a transformation to make the relationship linear.
We would rather not remove outliers because the dataset is already so small (only 26 points), so we think it is in our best interest to transform the response. A log transform of the response looks most appropriate, based on the exponential shape of the points and the performance of the previous model.
nemodel = ols(formula='np.log(Performance) ~ Median_Family_Income', data=fulldf).fit()
neanova = anova_lm(nemodel)
print(nemodel.summary())
print(neanova)
fulldf['logperf']= np.log(fulldf['Performance'])
inc_perf_log = sns.regplot(x = "Median_Family_Income", y = "logperf", data = fulldf)
Final model results: the R-squared and F value increased and the coefficient p-value decreased, so this is definitely a better model! We can reject the null on both our F test and our p-value test.
So there seems to be a strong correlation between income and PARCC performance for 2017.
The final equation we have for it is log(Performance) = 3.303e-05 $\times$ Median_Family_Income + 6.9427
There is definitely a positive trend (exponential in the raw performance, linear after the log transform).
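To make the model concrete, here's a quick sketch of using it for a point prediction. The $90,000 income is just an illustrative value, and predict() returns log(Performance) because that's what the formula models:
example = pd.DataFrame({'Median_Family_Income': [90000]})  # hypothetical income value
predicted_log_perf = nemodel.predict(example)              # predicted log(Performance)
predicted_perf = np.exp(predicted_log_perf)                # back-transform to the Performance scale
print(predicted_perf)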
Why It's Important
Through our analysis we have shown that there is a definite positive relationship between Maryland students' household incomes and their PARCC scores. That means that if you come from a lower-income area, you tend to do worse on standardized testing, and that's a problem.
According to The Baltimore Sun, the reason it is a problem is that these scores are taken into consideration by other programs and companies, are used to decide which students get extra attention, and take away instructional time from the students. These scores are also used in the calculation of the star ratings, which affect not only the schools themselves but also the surrounding house prices, driving property values down in areas with low star scores.
According to The Herald, the PARCC test is much better than its predecessor, the ISAT (in a study done in Illinois), but regardless there is still a trend, even if the slope of that trend is smaller.
We believe that the best thing to do would be to weight the scores of these exams less in star ratings or property valuations, and instead use them as indicators of which schools need more help and funding, rather than as indicators of excellent schools. We should be giving simple standardized exams that ask only the common core material, accessible to people of all financial backgrounds. The data shows otherwise, however: in 2018, [PARCC scores continued to drop](https://wtop.com/maryland/2018/08/maryland-parcc-scores-students-still-struggle-to-pass-english-math-tests-but-some-scores-tick-up/), and with them, star scores, seriously affecting the lives of students from low-income households.
How We Could Have Improved the Model
We could have improved the model by processing more data points from more years, but that would require cleaning a lot more census data and feeding it into the model, which we didn't have the space or time for. Also, if we plot it over a period of time, more things affect the model, like general inflation and other time-dependent influences that we wanted to avoid.
We could have added the 2018 PARCC data, but we couldn't find a clean version for some reason.
Another thing we could have done would be to add other variables to analyze, but the goal of our project was simply to demonstrate the positive linear relationship between the two quantities and analyze the distributions and geographic placement, not to predict a student's test score based on their individual attributes.
Some research that backs our claim up, but for the entire U.S.:
ACT and The Problem of Underserved Learners (Source: WaPo) <-- This is a really good one.