Public School Standardized Testing vs Income by County in Maryland

Aaron Eisenberg, Zarek Peris, Fernando Rodriguez

Mayfield High School junior Laura Cruz, 18, holds up a sign March 2, 2015, during a student-organized walkout to protest the PARCC exams in Las Cruces, N.M.

Introduction

There's a remarkable lack of substantive analysis of public schooling data in the United States, especially in lower-income areas like Baltimore. This is unfortunate, given how crucial this information is - both for the wellbeing of students and for the direction of our society.

As part of a two-year initiative, Maryland's State Department of Education has finally come out with star ratings to rank the quality of its public elementary, middle, and high schools. The star ratings were required by a federal law called the Every Student Succeeds Act (ESSA), which replaced the more punitive No Child Left Behind Act, to help identify and diagnose struggling schools within Maryland; every school receiving a one-star rating has to come up with a plan of action. According to the Baltimore Sun, "The bottom 5 percent of schools in the state — including some two-star schools — will be identified as needing an intervention plan for improvement. Most of those schools are in Baltimore City, where 23 schools earned one star."

These star rankings take standardized test scores into account; in fact, almost 65% of the evaluation is based on them! The star rating is a good step in the right direction, having only been released on December 4, 2018, but the majority of the metrics on schools are still driven by the standardized test scores of the schools in each area. In Maryland, this is the PARCC (Partnership for Assessment of Readiness for College and Careers).

We wanted to see whether income and test scores by county in Maryland have a linear relationship, in order to see whether taking blanket PARCC scores to rate schools is an accurate way of describing student potential. Additionally, we wanted to map the disparities in both test scores and income, to try to predict problem areas, to see if they spread out to nearby counties, and to see if we get location results similar to the star ratings.

Our initial hypothesis: family income, even in the public school system, is so closely tied to the quality of education received that the PARCC scores (assuming it is a well-designed exam) will have a positive linear correlation with family income.

Limitations and Assumptions:

Obviously, the PARCC is not a perfect indicator for how well a student is doing, but it is a standard metric that we can at least attempt to use as justification.

The model doesn't take into account extraneous factors such as race, gender, or English-speaking vs. non-English-speaking households, because we didn't think it necessary to fit those things against school performance (for both ethical and time-related reasons).

Finally, we'll be using data from 2010 through 2017, inclusive. This brings up a few issues that we'll address later, especially the fact that Maryland switched from the MSA (Maryland State Assessment) system to PARCC (Partnership for Assessment of Readiness for College and Careers) in 2014.

This tutorial will use data from:

The Maryland State Department of Education

The United States Census Bureau

Here is a list of tools we'll be using:

In [210]:
import pandas as pd
# We'll be using pandas as our primary means of data manipulation and management.
import numpy as np
import matplotlib.pyplot as plt
# We'll be using pyplot for graphing data across time. 
import folium
# folium is an expansive data visualization library, but we'll mostly be using it for its choropleth feature. 
import requests
import os 
import seaborn as sns
from statsmodels.formula.api import ols
# least squares models
from statsmodels.stats.anova import anova_lm

Test Scores: Data Cleanup

First things first, we've got to gather our data. We had to do this manually: going to the sites linked above, then downloading and importing the CSV files into our environments. It's not that many files, so all things considered, it's faster than trying to script the process.
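As an aside, if you did want to script the downloads, a minimal sketch using requests might look like the following. The BASE_URL here is a hypothetical placeholder, not the real MSDE endpoint - you'd substitute the actual report URLs from the sites above.

import requests

BASE_URL = 'https://example.org/msde-reports/'  # hypothetical placeholder, not the real endpoint

# Fetch each year's file and save it locally, mirroring the manual download.
for year in range(2010, 2015):
    fname = 'MSA_' + str(year) + '.csv'
    resp = requests.get(BASE_URL + fname)
    resp.raise_for_status()  # fail loudly if a file is missing
    with open(fname, 'wb') as f:
        f.write(resp.content)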

Once we've gotten that out of the way, we can use one of pandas' (many) helpful features and turn these CSVs straight into dataframes. From 2010 to 2014, we reference the MSA data files. From 2015 to 2017, we use PARCC. We'll keep lists of each individual year's dataframe.

In [15]:
grade_data_msa = []
grade_data_parcc = []

for year in range(2010, 2015):
    grade_data_msa.append(pd.read_csv('MSA_' + str(year) + '.csv'))
    
for year in range(2015, 2018):
    grade_data_parcc.append(pd.read_csv('PARCC_' + str(year) + '.csv'))

Now that we've got the data, we want to put all of the years together. Luckily, pandas has another great feature that lets us do this cleanly.

In [16]:
msaAllYears = pd.concat(grade_data_msa, sort=False)
parccAllYears = pd.concat(grade_data_parcc, sort=False)

Time to investigate these dataframes to see what we're working with. First up, MSA:

In [17]:
msaAllYears.head()
Out[17]:
Academic Year LEA Number LEA Name School Number School Name Test Type Title Grade Title Subject Title Advanced Count Advanced Percent Proficient Count Proficient Percent Basic Count Basic Percent Test Member Count Create Date Test Type Grade Subject Tested Count
0 2010 01 Allegany 0301 Flintstone Elementary All Grade 5 Science * <= 5.0 29 82.9 5 14.3 35 20111104 NaN NaN NaN NaN
1 2010 01 Allegany 0301 Flintstone Elementary All Grade 3 Reading 9 25.0 22 61.1 5 13.9 36 20111104 NaN NaN NaN NaN
2 2010 01 Allegany 0301 Flintstone Elementary All Grade 4 Reading 15 37.5 25 62.5 * <= 5.0 40 20111104 NaN NaN NaN NaN
3 2010 01 Allegany 0301 Flintstone Elementary All Grade 5 Reading 25 65.8 11 28.9 2 5.3 38 20111104 NaN NaN NaN NaN
4 2010 01 Allegany 0301 Flintstone Elementary All Grade 3 Math 9 25.0 25 69.4 2 5.6 36 20111104 NaN NaN NaN NaN

Not horrible, but needs a bunch of cleaning. First of all, it has a bunch of stuff we don't care about. Let's drop those columns.

In [18]:
msaAllYears = msaAllYears.drop(columns = ['LEA Number', 'School Number', 'School Name', 'Test Type Title', 
                                          'Grade Title', 'Grade', 'Create Date', 'Test Type'])
msaAllYears = msaAllYears[msaAllYears.columns.drop(list(msaAllYears.filter(regex='Percent')))] # little shortcut for dropping
msaAllYears.head()
Out[18]:
Academic Year LEA Name Subject Title Advanced Count Proficient Count Basic Count Test Member Count Subject Tested Count
0 2010 Allegany Science * 29 5 35 NaN NaN
1 2010 Allegany Reading 9 22 5 36 NaN NaN
2 2010 Allegany Reading 15 25 * 40 NaN NaN
3 2010 Allegany Reading 25 11 2 38 NaN NaN
4 2010 Allegany Math 9 25 2 36 NaN NaN

Already a lot easier to parse, visually, but there's still plenty to do.

I don't like those asterisks, and I'm planning on summing over those columns in a bit. Let's set them to 0's, so we can forget about them.

In [19]:
msaAllYears = msaAllYears.replace('*', '0')

Also, "LEA Name" isn't super intuitive, when really we're looking at counties. Renaming that column seems smart.

In [20]:
msaAllYears = msaAllYears.rename(index=str, columns={"LEA Name": "County"})

Lastly, we're considering Maryland schools by county, so I don't really want to look at SEED (a DC charter school) or the state as a whole. Goodbye to those rows.

In [21]:
msaAllYears = msaAllYears[msaAllYears['County'] != 'SEED']
msaAllYears = msaAllYears[msaAllYears['County'] != 'Seed School LEA']
msaAllYears = msaAllYears[msaAllYears['County'] != 'All Public Schools']

Let's see where we're at now. (I'm printing the entire thing for a reason)

In [22]:
msaAllYears
Out[22]:
Academic Year County Subject Title Advanced Count Proficient Count Basic Count Test Member Count Subject Tested Count
0 2010 Allegany Science 0 29 5 35 NaN NaN
1 2010 Allegany Reading 9 22 5 36 NaN NaN
2 2010 Allegany Reading 15 25 0 40 NaN NaN
3 2010 Allegany Reading 25 11 2 38 NaN NaN
4 2010 Allegany Math 9 25 2 36 NaN NaN
5 2010 Allegany Math 21 19 0 40 NaN NaN
6 2010 Allegany Math 5 29 4 38 NaN NaN
7 2010 Allegany Science 0 42 27 72 NaN NaN
8 2010 Allegany Reading 8 42 19 69 NaN NaN
9 2010 Allegany Reading 10 49 13 72 NaN NaN
10 2010 Allegany Reading 33 20 19 72 NaN NaN
11 2010 Allegany Math 23 40 6 69 NaN NaN
12 2010 Allegany Math 27 37 7 71 NaN NaN
13 2010 Allegany Math 7 49 16 72 NaN NaN
14 2010 Allegany Reading 0 0 0 0 NaN NaN
15 2010 Allegany Reading 0 0 0 0 NaN NaN
16 2010 Allegany Reading 0 0 0 0 NaN NaN
17 2010 Allegany Math 0 0 0 0 NaN NaN
18 2010 Allegany Math 0 0 0 0 NaN NaN
19 2010 Allegany Math 0 0 0 0 NaN NaN
20 2010 Allegany Science 8 26 13 47 NaN NaN
21 2010 Allegany Reading 3 36 14 53 NaN NaN
22 2010 Allegany Reading 4 30 10 44 NaN NaN
23 2010 Allegany Reading 23 20 4 47 NaN NaN
24 2010 Allegany Math 10 30 13 53 NaN NaN
25 2010 Allegany Math 19 20 5 44 NaN NaN
26 2010 Allegany Math 8 31 8 47 NaN NaN
27 2010 Allegany Reading 0 0 0 0 NaN NaN
28 2010 Allegany Reading 0 0 0 0 NaN NaN
29 2010 Allegany Reading 0 0 0 0 NaN NaN
... ... ... ... ... ... ... ... ... ...
9004 2014 Baltimore City NaN 6 25 57 NaN Reading 88
9005 2014 Baltimore City NaN 0 19 69 NaN Science 88
9006 2014 Baltimore City NaN 0 0 0 NaN Math 0
9007 2014 Baltimore City NaN 0 4 37 NaN Math 41
9008 2014 Baltimore City NaN 0 0 0 NaN Math 0
9009 2014 Baltimore City NaN 0 6 14 NaN Reading 20
9010 2014 Baltimore City NaN 4 17 21 NaN Reading 42
9011 2014 Baltimore City NaN 0 16 31 NaN Reading 48
9012 2014 Baltimore City NaN 0 7 36 NaN Science 43
9013 2014 Baltimore City NaN 0 18 28 NaN Math 46
9014 2014 Baltimore City NaN 0 18 37 NaN Math 55
9015 2014 Baltimore City NaN 0 3 48 NaN Math 51
9016 2014 Baltimore City NaN 5 26 18 NaN Reading 49
9017 2014 Baltimore City NaN 8 23 26 NaN Reading 57
9018 2014 Baltimore City NaN 3 18 30 NaN Reading 51
9019 2014 Baltimore City NaN 0 10 40 NaN Science 50
9020 2014 Baltimore City NaN 0 2595 3192 NaN Math 6018
9021 2014 Baltimore City NaN 485 2465 2603 NaN Math 5553
9022 2014 Baltimore City NaN 0 2206 3293 NaN Math 5722
9023 2014 Baltimore City NaN 357 1865 2953 NaN Math 5175
9024 2014 Baltimore City NaN 249 1077 2476 NaN Math 3802
9025 2014 Baltimore City NaN 315 1169 3729 NaN Math 5213
9026 2014 Baltimore City NaN 0 2909 2502 NaN Reading 5668
9027 2014 Baltimore City NaN 532 3216 1800 NaN Reading 5548
9028 2014 Baltimore City NaN 1497 2725 1492 NaN Reading 5714
9029 2014 Baltimore City NaN 893 2275 1892 NaN Reading 5060
9030 2014 Baltimore City NaN 954 1917 2014 NaN Reading 4885
9031 2014 Baltimore City NaN 882 1976 2365 NaN Reading 5223
9032 2014 Baltimore City NaN 0 1639 3941 NaN Science 5693
9033 2014 Baltimore City NaN 0 1786 3286 NaN Science 5147

59066 rows × 9 columns

We're getting there. Time to deal with all those NaNs.

If you'll notice, every row seems to either have values in "Subject Title" $\textbf{or}$ "Subject", and "Test Member Count" $\textbf{or}$ "Tested Count". Let's make sure this is the case.

In [23]:
msaAllYears = msaAllYears.replace(np.nan, '')
count = 0
for index, row in msaAllYears.iterrows():
    if (row['Subject Title'] == '' and row['Subject'] == '') \
    or (row['Test Member Count'] == '' and row['Tested Count'] == '') \
    or (row['Subject Title'] != '' and row['Subject'] != '') \
    or (row['Test Member Count'] != '' and row['Tested Count'] != ''):
        print('Error at index: ' + str(index))


Okay we're good! Let's go ahead and fill in the respective columns, then drop the extra ones.

In [24]:
msaAllYears = msaAllYears.replace('', np.nan)
msaAllYears['Subject'].fillna(msaAllYears['Subject Title'], inplace=True)
msaAllYears['Tested Count'].fillna(msaAllYears['Test Member Count'], inplace=True)
msaAllYears = msaAllYears.drop(columns=['Subject Title', 'Test Member Count'])

Unfortunately, the PARCC testing system doesn't include science exams, so we'll be discarding those in the MSA data for consistency.

In [25]:
msaAllYears = msaAllYears[msaAllYears['Subject'] != 'Science']

Let's convert everything to integers so we can do some math.

In [26]:
msaAllYears['Tested Count'] = msaAllYears['Tested Count'].astype(int)
msaAllYears['Advanced Count'] = msaAllYears['Advanced Count'].astype(int)
msaAllYears['Proficient Count'] = msaAllYears['Proficient Count'].astype(int)
msaAllYears['Basic Count'] = msaAllYears['Basic Count'].astype(int)

Time to put everything together so we can graph it more easily later.

In [27]:
msaAllYears = msaAllYears.groupby(['Academic Year', 'County', 'Subject']).sum()
msaAllYears = msaAllYears.reset_index()
msaAllYears.head()
Out[27]:
Academic Year County Subject Advanced Count Proficient Count Basic Count Tested Count
0 2010 Allegany Math 2294 3981 1645 7936
1 2010 Allegany Reading 2940 3779 1224 7950
2 2010 Anne Arundel Math 26185 29427 10419 66230
3 2010 Anne Arundel Reading 29541 28832 7672 66276
4 2010 Baltimore City Math 11720 33214 22981 68163

Okay, we're starting to see real information. Now that we have a cumulative count, let's get back those percentages we dumped earlier.

In [28]:
msaAllYears['Adv Ratio'] = msaAllYears['Advanced Count'] / msaAllYears['Tested Count']
msaAllYears['Prof Ratio'] = msaAllYears['Proficient Count'] / msaAllYears['Tested Count']
msaAllYears['Basic Ratio'] = msaAllYears['Basic Count'] / msaAllYears['Tested Count']

msaAllYears = msaAllYears.drop(columns=['Advanced Count', 'Proficient Count', 'Basic Count'])
msaAllYears.head()
Out[28]:
Academic Year County Subject Tested Count Adv Ratio Prof Ratio Basic Ratio
0 2010 Allegany Math 7936 0.289062 0.501638 0.207283
1 2010 Allegany Reading 7950 0.369811 0.475346 0.153962
2 2010 Anne Arundel Math 66230 0.395365 0.444315 0.157315
3 2010 Anne Arundel Reading 66276 0.445727 0.435029 0.115758
4 2010 Baltimore City Math 68163 0.171941 0.487273 0.337148

Alright, so let's take a second to gather ourselves. We want to graph this data over time - how?

Well, we need to have some sort of objective metric of performance for each county, in each year, in each subject. As it stands, we just have a bunch of percentages.

Here's the idea. We take the difference between the advanced percentage and the basic percentage. We can ignore the number in the middle - let's consider those students "neutral". So if a county has a lot of high-performing students relative to low-performing ones, the number will be positive, and correspondingly negative for low-performing counties. Then we'll add 1 to make all these numbers positive (with the already-positive counties jumping over 1), and multiply by 100 to make the numbers more readable.
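In symbols, for each (year, county, subject) group:

$$\text{Performance} = \left(\frac{\text{Advanced Count}}{\text{Tested Count}} - \frac{\text{Basic Count}}{\text{Tested Count}} + 1\right) \times 100$$

This ranges from 0 (everyone basic) to 200 (everyone advanced), with 100 meaning the two extremes balance out. Here's what we get: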

In [29]:
msaAllYears['Performance'] = (msaAllYears['Adv Ratio'] - msaAllYears['Basic Ratio'] + 1) * 100
msaAllYears = msaAllYears[msaAllYears.columns.drop(list(msaAllYears.filter(regex='Ratio')))] 
msaAllYears = msaAllYears.drop(columns=['Tested Count'])
msaAllYears.head()
Out[29]:
Academic Year County Subject Performance
0 2010 Allegany Math 108.177923
1 2010 Allegany Reading 121.584906
2 2010 Anne Arundel Math 123.804922
3 2010 Anne Arundel Reading 132.996862
4 2010 Baltimore City Math 83.479307

Looks good, and graphable! Now to split into two dataframes, between math and language. If there's an easy way to do this in pandas, I haven't found it (though we'll sketch a pivot-based alternative after the output below). Fortunately, it's not that hard to do manually, and it's good coding practice anyway!

First let's get a list of all the counties, and initialize a dictionary for each subject.

In [30]:
counties = msaAllYears['County'].unique()
readingDict = {}
mathDict = {}
for c in counties:
    readingDict[c] = []
    mathDict[c] = []

Okay, now here's the idea. We want to take every row that's assigned to each subject, and append its performance value to the associated dictionary's county entry.

Example Row: 2011, Montgomery, Math, 103

This will go to the math dictionary, look for the Montgomery key, and then add the performance 103. Because this table is stored chronologically, we don't have to worry about the year - it'll be in the right place.

In [31]:
for index, row in msaAllYears.iterrows():
    if row['Subject'] == 'Reading':
        readingDict[row['County']].append(row['Performance'])
    else:
        mathDict[row['County']].append(row['Performance'])

Let's convert back to see the fruits of our labor.

In [32]:
msaAllYearsReading = pd.DataFrame(readingDict, index = range(2010, 2015))
msaAllYearsMath = pd.DataFrame(mathDict, index = range(2010, 2015))
msaAllYearsReading.head()
Out[32]:
Allegany Anne Arundel Baltimore City Baltimore County Calvert Caroline Carroll Cecil Charles Dorchester ... Kent Montgomery Prince George's Queen Anne's Saint Mary's Somerset Talbot Washington Wicomico Worcester
2010 121.584906 132.996862 92.564980 123.195494 141.909939 124.199134 143.603026 118.674742 121.302566 104.877419 ... 118.306762 138.914423 98.788001 140.566038 129.748860 110.330229 120.687743 124.103907 107.186339 142.476293
2011 118.386226 136.028962 85.109381 125.508953 144.982139 126.666667 137.494322 120.408163 119.808252 106.297164 ... 118.623025 139.865525 102.551224 143.678161 126.912449 116.054077 121.970310 126.904738 116.609108 150.106383
2012 123.141361 134.903043 83.445835 125.904127 142.942576 128.819370 145.055491 121.969209 120.990496 102.637076 ... 111.917098 139.463753 102.789353 142.172949 130.722286 116.611433 125.868931 125.743952 113.832254 149.135135
2013 123.540340 138.672386 85.880004 127.845474 146.056782 125.210084 142.407782 119.039237 120.424157 101.933216 ... 118.991597 140.162087 104.709728 140.801037 127.182658 112.255489 118.373256 123.455497 118.411435 151.959022
2014 119.075898 130.142295 77.534726 121.295967 140.264266 114.042278 139.865323 110.768873 114.972553 94.156884 ... 108.621736 133.073475 101.464655 133.274074 124.081993 105.368852 115.360046 117.613667 102.957613 144.264482

5 rows × 24 columns
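As promised, here's the pivot-based alternative: pandas can actually do this reshape in one call with pivot, since each (year, county, subject) combination is unique after the earlier groupby. A sketch that should produce the same tables as the manual loop above:

# Pivot sketch: rows become years, columns become counties, values are Performance.
readingPivot = (msaAllYears[msaAllYears['Subject'] == 'Reading']
                .pivot(index='Academic Year', columns='County', values='Performance'))
mathPivot = (msaAllYears[msaAllYears['Subject'] == 'Math']
             .pivot(index='Academic Year', columns='County', values='Performance'))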

MSA Test Scores Over Time

Looks pretty nice. Finally time to graph it!

We'll split the counties in half so each graph isn't too cluttered.

In [33]:
msaAllYearsReading.iloc[:, :12].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Reading Exam Performance")
msaAllYearsReading.iloc[:, 12:].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Reading Exam Performance")
Out[33]:
[Text(0,0.5,'Reading Exam Performance'), Text(0.5,0,'Year')]
In [89]:
msaAllYearsMath.iloc[:, :12].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Math Exam Performance")
msaAllYearsMath.iloc[:, 12:].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Math Exam Performance")
Out[89]:
[Text(0,0.5,'Math Exam Performance'), Text(0.5,0,'Year')]

Some brief analysis before we continue - Baltimore City is very clearly the lowest performing county in both subjects. If you're familiar with the area this probably isn't surprising. We'll further explore this later.

Apart from Baltimore City, most counties are fairly in line, with more minor exceptions like Howard and Dorchester.


PARCC Test Scores Over Time

Moving on to the PARCC year records. Luckily, these start off a bit cleaner, so we have less work to do.

In general, if I did something very similar with the MSA data, I won't repeat myself here. Let's take a look.

In [34]:
parccAllYears.head()
Out[34]:
Academic Year LEA Number LEA Name School Number School Name Assessment Tested Count Level 1: Did not yet meet expectations - Count Level 1: Did not yet meet expectations - Percent Level 2: Partially met expectations - Count Level 2: Partially met expectations - Percent Level 3: Approached expectations - Count Level 3: Approached expectations - Percent Level 4: Met expectations - Count Level 4: Met expectations - Percent Level 5: Exceeded expectations - Count Level 5: Exceeded expectations - Percent Create Date
0 2015 01 Allegany 0301 Flintstone Elementary English/Language Arts Grade 3 32 6 18.8 6 18.8 10 31.3 8 25 2 6.3 20160108
1 2015 01 Allegany 0301 Flintstone Elementary English/Language Arts Grade 4 33 * <= 5.0 4 12.1 11 33.3 14 42.4 3 9.1 20160108
2 2015 01 Allegany 0301 Flintstone Elementary English/Language Arts Grade 5 33 6 18.2 6 18.2 11 33.3 10 30.3 * <= 5.0 20160108
3 2015 01 Allegany 0301 Flintstone Elementary Mathematics Grade 3 32 2 6.3 11 34.4 10 31.3 7 21.9 2 6.3 20160108
4 2015 01 Allegany 0301 Flintstone Elementary Mathematics Grade 4 33 * <= 5.0 7 21.2 11 33.3 14 42.4 * <= 5.0 20160108

This one is significantly easier to manage. All we need to do is drop the columns we won't be using, then group by year and county, along with the other minor changes we made last time.

In [35]:
parccAllYears = parccAllYears.drop(columns=['LEA Number', 'School Number', 'School Name', 'Create Date'])
parccAllYears = parccAllYears.rename(index=str, columns={"LEA Name": "County"})
parccAllYears = parccAllYears[parccAllYears.columns.drop(list(parccAllYears.filter(regex='Percent')))]
parccAllYears = parccAllYears[parccAllYears['County'] != 'SEED']
parccAllYears = parccAllYears[parccAllYears['County'] != 'Seed School LEA']
parccAllYears = parccAllYears[parccAllYears['County'] != 'All Public Schools']
parccAllYears = parccAllYears.replace('*', '0')

To make things consistent with the MSA data, we're going to categorize exams as either ELA (English Language Arts) or math. As you may be able to tell, once the students get to high school the exam names change from "Mathematics Grade 'x'" to Algebra, Geometry, etc. Therefore, we'll just treat anything that has the word "English" in it as an ELA exam, and anything without the word "English" as a math exam.

In [36]:
parccAllYears.loc[parccAllYears['Assessment'].str.contains('English'), 'Assessment'] = 'ELA'
parccAllYears.loc[~parccAllYears['Assessment'].str.contains('ELA'), 'Assessment'] = 'Math'

Once again, we'll convert the relevant columns to integers so we can do math stuff, and group them appropriately.

In [37]:
parccAllYears['Tested Count'] = parccAllYears['Tested Count'].astype(int)
parccAllYears[list(parccAllYears.filter(regex='Level'))] = parccAllYears[list(parccAllYears.filter(regex='Level'))].astype(int)
parccAllYears = parccAllYears.groupby(['Academic Year', 'County', 'Assessment']).sum()
parccAllYears = parccAllYears.reset_index()

Same as before - convert test counts into test ratios, then use a little shortcut to drop any column containing the word "Level" to get rid of the counts.

In [38]:
parccAllYears['1 Ratio'] = parccAllYears['Level 1: Did not yet meet expectations - Count'] / parccAllYears['Tested Count']
parccAllYears['2 Ratio'] = parccAllYears['Level 2: Partially met expectations - Count'] / parccAllYears['Tested Count']
parccAllYears['3 Ratio'] = parccAllYears['Level 3: Approached expectations - Count'] / parccAllYears['Tested Count']
parccAllYears['4 Ratio'] = parccAllYears['Level 4: Met expectations - Count'] / parccAllYears['Tested Count']
parccAllYears['5 Ratio'] = parccAllYears['Level 5: Exceeded expectations - Count'] / parccAllYears['Tested Count']
parccAllYears = parccAllYears[parccAllYears.columns.drop(list(parccAllYears.filter(regex='Level')))] 
parccAllYears.head()
Out[38]:
Academic Year County Assessment Tested Count 1 Ratio 2 Ratio 3 Ratio 4 Ratio 5 Ratio
0 2015 Allegany ELA 8511 0.182940 0.231583 0.289743 0.267654 0.011045
1 2015 Allegany Math 8914 0.164685 0.293247 0.291564 0.225712 0.005609
2 2015 Anne Arundel ELA 79982 0.107737 0.167263 0.251357 0.386975 0.071116
3 2015 Anne Arundel Math 84538 0.121614 0.254844 0.284121 0.300551 0.023291
4 2015 Baltimore City ELA 73284 0.317218 0.290527 0.222218 0.146499 0.010916

Here is the metric of performance we'll be using for the PARCC data. Similar to the MSA metric, we consider anything below the median score (in this case 3) negative and anything above it positive. We double the weight of scores 1 and 5 to account for the extremes, then shift and scale to make it more readable.
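Writing $r_i$ for the fraction of students scoring level $i$ (the ratio columns above), the metric is:

$$\text{Performance} = \left(-2r_1 - r_2 + r_4 + 2r_5 + 1\right) \times 100$$

Level 3 sits out as the neutral middle, much like the proficient band did for the MSA metric.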

In [39]:
parccAllYears['Performance'] = (-2*parccAllYears['1 Ratio'] - parccAllYears['2 Ratio'] + \
                                parccAllYears['4 Ratio'] + 2*parccAllYears['5 Ratio'] + 1) * 100
parccAllYears = parccAllYears[parccAllYears.columns.drop(list(parccAllYears.filter(regex='Ratio')))] 
parccAllYears.head()
Out[39]:
Academic Year County Assessment Tested Count Performance
0 2015 Allegany ELA 8511 69.228058
1 2015 Allegany Math 8914 61.431456
2 2015 Anne Arundel ELA 79982 114.647046
3 2015 Anne Arundel Math 84538 84.906196
4 2015 Baltimore City ELA 73284 24.336827

Back to the same conversion-then-plotting paradigm.

In [40]:
counties = parccAllYears['County'].unique()
elaDict = {}
mathDict = {}
for c in counties:
    elaDict[c] = []
    mathDict[c] = []
for index, row in parccAllYears.iterrows():
    if row['Assessment'] == 'ELA':
        elaDict[row['County']].append(row['Performance'])
    else:
        mathDict[row['County']].append(row['Performance'])
parccAllYearsELA = pd.DataFrame(elaDict, index = range(2015, 2018))
parccAllYearsMath = pd.DataFrame(mathDict, index = range(2015, 2018))
In [41]:
parccAllYearsELA.iloc[:, :12].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="ELA Exam Performance")
parccAllYearsELA.iloc[:, 12:].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="ELA Exam Performance")
Out[41]:
[Text(0,0.5,'ELA Exam Performance'), Text(0.5,0,'Year')]
In [42]:
parccAllYearsMath.iloc[:, :12].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Math Exam Performance")
parccAllYearsMath.iloc[:, 12:].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Math Exam Performance")
Out[42]:
[Text(0,0.5,'Math Exam Performance'), Text(0.5,0,'Year')]

Now let's put it all together, accounting for the change in exam structure across years. Again, pandas saves the day.

In [43]:
allYearsLanguage = pd.concat([msaAllYearsReading, parccAllYearsELA], sort=True)
allYearsMath = pd.concat([msaAllYearsMath, parccAllYearsMath], sort=True)
allYearsMath
Out[43]:
Allegany Anne Arundel Baltimore City Baltimore County Calvert Caroline Carroll Cecil Charles Dorchester ... Kent Montgomery Prince George's Queen Anne's Saint Mary's Somerset Talbot Washington Wicomico Worcester
2010 108.177923 123.804922 83.479307 108.661548 129.290556 121.595837 128.591410 102.038787 107.557938 94.844032 ... 105.418719 121.693018 81.799845 126.512323 126.649347 96.994073 109.905660 121.346280 104.881131 138.093532
2011 115.248363 127.192068 75.115120 112.432698 132.682861 121.889152 128.984433 105.520381 112.829849 100.181960 ... 104.524887 123.599523 85.581274 130.379747 129.211518 109.741635 110.813733 121.954126 112.557206 150.416593
2012 123.981603 131.616156 79.044917 117.098708 137.851610 134.604671 138.413750 114.627185 116.734432 101.702462 ... 101.725129 127.152736 90.501731 136.418481 134.017576 114.469320 117.767171 127.315276 117.111447 152.338971
2013 113.194260 119.777775 72.499268 111.824171 133.694311 123.666527 131.460013 89.826285 109.204540 95.099271 ... 97.040759 118.524184 85.554520 97.566595 125.018278 111.199681 87.090003 115.654474 111.634888 148.977395
2014 103.856674 104.792453 47.284196 96.725914 119.975455 79.604093 108.501630 79.134252 92.327710 59.716758 ... 76.610381 104.184512 69.273594 90.850869 104.499933 70.627803 73.891626 95.655583 81.379019 128.965517
2015 61.431456 84.906196 6.009766 52.508331 77.780952 82.694485 121.128621 67.814251 61.677146 47.709288 ... 49.453162 90.202048 26.435176 88.433868 95.294769 50.850340 64.213978 69.097178 51.471322 111.116821
2016 76.089897 97.728827 3.740038 72.381960 113.213900 89.708548 138.676482 71.406784 70.989822 49.635967 ... 70.207510 110.382469 27.406613 115.068338 98.842543 55.074425 85.653726 74.348833 69.757701 118.198700
2017 68.536045 97.510863 3.192313 69.192294 122.271767 94.833309 145.115130 91.076083 80.294295 51.174059 ... 66.351889 111.196707 23.450029 119.917059 105.733306 51.630252 81.049990 67.015163 74.287544 114.122924

8 rows × 24 columns

That's some mighty attractive data. Let's take a better look. We'll draw a line down the year marker where the exams changed from MSA to PARCC.

In [44]:
plt1 = allYearsLanguage.iloc[:, :12].plot.line(figsize=(15,8))
plt2 = allYearsLanguage.iloc[:, 12:].plot.line(figsize=(15,8))
plt1.set(xlabel="Year", ylabel="Language Exam Performance")
plt2.set(xlabel="Year", ylabel="Language Exam Performance")
plt1.axvline(2015, color='black', linewidth=0.2, label='change')
plt2.axvline(2015, color='black',  linewidth=0.2)
Out[44]:
<matplotlib.lines.Line2D at 0x7fcba6409a20>
In [45]:
plt1 = allYearsMath.iloc[:, :12].plot.line(figsize=(15,8))
plt2 = allYearsMath.iloc[:, 12:].plot.line(figsize=(15,8))
plt1.set(xlabel="Year", ylabel="Math Exam Performance")
plt2.set(xlabel="Year", ylabel="Math Exam Performance")
plt1.axvline(2015, color='black', linewidth=0.2, label='change')
plt2.axvline(2015, color='black',  linewidth=0.2)
Out[45]:
<matplotlib.lines.Line2D at 0x7fcba0984f98>

What do we see here?

Well, the first thing you'll notice is that the change in exams noticeably altered the trajectory of nearly every county. Even Baltimore fared better, and stopped declining as harshly in performance. The best, like Howard, stay at the top. This is not surprising - testing systems may favor slightly different things, but certain resource advantages will always show their hand in results like these. We'll explore what those resources may be in a bit.


Visualizing Distribution over a map

Our goal here is to further explore the distribution of exam success across Maryland, this time focusing on one year.

We'll repeat our collection process, and then analyze not only the mathematical distribution of test scores (using the Performance variable we created in the previous section), but also the geographic distribution, in order to see whether counties closer to each other have similar scores, and to relate it to income later on.

In [196]:
PARCC2017 = pd.read_csv("PARCC_2017.csv")
PARCC2017 = PARCC2017.drop(columns=['LEA Number', 'School Number', 'School Name', 'Create Date', 'Assessment', 'Academic Year'])
PARCC2017 = PARCC2017[PARCC2017.columns.drop(list(PARCC2017.filter(regex='Percent')))]
PARCC2017 = PARCC2017.replace('*', '0')
PARCC2017['Tested Count'] = PARCC2017['Tested Count'].astype(int)
PARCC2017[list(PARCC2017.filter(regex='Level'))] = PARCC2017[list(PARCC2017.filter(regex='Level'))].astype(int)
PARCC2017['1 Ratio'] = PARCC2017['Level 1: Did not yet meet expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['2 Ratio'] = PARCC2017['Level 2: Partially met expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['3 Ratio'] = PARCC2017['Level 3: Approached expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['4 Ratio'] = PARCC2017['Level 4: Met expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['5 Ratio'] = PARCC2017['Level 5: Exceeded expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['Performance'] = (-2*PARCC2017['1 Ratio'] - PARCC2017['2 Ratio'] + \
                                PARCC2017['4 Ratio'] + 2*PARCC2017['5 Ratio'] + 1) * 100
PARCC2017Performance = PARCC2017.copy(deep = True)
distplot = sns.violinplot(x = 'LEA Name', y = 'Performance', data = PARCC2017.iloc[:4085, :]).set_title("Distribution of Student Performance by County")
distplot.figure.set_size_inches(20, 5)
In [197]:
slowplot = sns.violinplot(x = 'LEA Name', y = 'Performance', data = PARCC2017.iloc[4086:, :]).set_title("Distribution of Student Performance by County (Cont.)")
slowplot.figure.set_size_inches(20, 5)

If we look at the distribution of the performance, we'll find that they vary greatly in shape, range and glut (the thick areas).

A skinnier shape indicates a more even distribution of scores, while a shorter/fatter shape shows that more students fell within a certain performance range.

A taller range means that excellent students and terrible students have a larger disparity, and vice versa for a shorter range.

A glut's location indicates where more students are falling in terms of performance.

Some interesting happenings:

Our county (Prince George's) has a pretty normal-looking distribution with an extremely high range, indicating that it's otherwise normally distributed, but the excellent students are very excellent compared to the worst students.

Montgomery, Howard, Anne Arundel and Calvert are all relatively skinny, as well as the highest performing counties.

SEED and the all-public-schools aggregate have similar distributions, but flipped. The glut's placement might dictate the shape of the rest, because both are relatively short and fat.

In [192]:
PARCC2017 = PARCC2017[PARCC2017.columns.drop(list(PARCC2017.filter(regex='Ratio|Performance')))]
PARCC2017 = PARCC2017.groupby('LEA Name').agg(['sum'])
PARCC2017['Lower Ratio'] = (PARCC2017['Level 1: Did not yet meet expectations - Count'] + PARCC2017['Level 2: Partially met expectations - Count']) / PARCC2017['Tested Count']
PARCC2017['Middle Ratio'] = PARCC2017['Level 3: Approached expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['Upper Ratio'] = (PARCC2017['Level 4: Met expectations - Count'] + PARCC2017['Level 5: Exceeded expectations - Count']) / PARCC2017['Tested Count']
PARCC2017 = PARCC2017.reset_index()
PARCC2017.head()
Out[192]:
LEA Name Tested Count Level 1: Did not yet meet expectations - Count Level 2: Partially met expectations - Count Level 3: Approached expectations - Count Level 4: Met expectations - Count Level 5: Exceeded expectations - Count Lower Ratio Middle Ratio Upper Ratio
sum sum sum sum sum sum
0 All Public Schools 957179 167629 197548 234596 292967 64439 0.381514 0.245091 0.373395
1 Allegany 19882 3707 4167 5075 5981 616 0.396037 0.255256 0.331808
2 Anne Arundel 173429 19756 31663 44961 62976 11483 0.296484 0.259247 0.429334
3 Baltimore City 175032 67829 48497 33934 20885 1231 0.664598 0.193873 0.126354
4 Baltimore County 222949 38732 51607 56785 62222 10256 0.405200 0.254700 0.325088

We added three columns to make sure that we could accurately compare counties: Upper, Middle, and Lower Ratio. The Upper and Lower Ratios are calculated by taking all the reported 4 & 5 scores (1 & 2 scores for the Lower) and dividing them by the total number of students who took the exam in that county.

On a slight tangent, we'd like to note that every county (without exception) underreports its data. This shouldn't be a large issue (more than 97% is reported in every county), but it's still worth mentioning as it could affect some of our numbers:

In [194]:
PARCC2017['Percent Reported'] = PARCC2017['Lower Ratio'] + PARCC2017['Middle Ratio'] + PARCC2017['Upper Ratio'] 
PARCC2017['Percent Reported']
Out[194]:
0     1.000000
1     0.983100
2     0.985066
3     0.984826
4     0.984988
5     0.985924
6     0.982124
7     0.977083
8     0.982400
9     0.984727
10    0.990333
11    0.987599
12    0.984187
13    0.984284
14    0.985091
15    0.982391
16    0.987477
17    0.984045
18    0.973669
19    0.996205
20    0.986917
21    0.981531
22    0.982816
23    0.985524
24    0.984957
25    0.985175
Name: Percent Reported, dtype: float64

These are far from bad numbers; it's just slightly funny to see that the overall reporting is still listed as 100%. Moving on.

Let's get rid of those not-counties again.

In [51]:
PARCC2017 = PARCC2017.drop(PARCC2017.index[0])
PARCC2017 = PARCC2017.drop(PARCC2017.index[18])

Time to do some real visualization, strap in.

In order to better visualize the disparity by county, and see if there is any spillover, we decided to plot some choropleths to help us see which areas have outstanding PARCC scores (based on 2017 numbers) and which areas are hurting the most.

In [134]:
import warnings; 
warnings.simplefilter('ignore')
map_parcc1 = folium.Map(location=[38.9072, -77.0369],tiles = 'cartodbpositron', zoom_start=8)
map_parcc2 = folium.Map(location=[38.9072, -77.0369],tiles = 'cartodbpositron', zoom_start=8)

geo_dist = os.path.join('MD-24-maryland-counties.json')

map_parcc1.choropleth(
    geo_data=geo_dist,
    name='choropleth',
    data= PARCC2017,
    columns=['LEA Name', 'Upper Ratio'],
    key_on='feature.properties.NAME',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Student Rate of Above Average PARCC Scores in 2017'
)


map_parcc2.choropleth(
    geo_data=geo_dist,
    name='choropleth',
    data= PARCC2017,
    columns=['LEA Name', 'Lower Ratio'],
    key_on='feature.properties.NAME',
    fill_color='OrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Student Rate of Below Average PARCC Scores in 2017'
)

As you can see for the above-average scores, there seems to be some geographic influence: four of the darkest regions are all next to each other, and the schools that seem to be having the most trouble sit around two of Maryland's biggest and most populous urban areas: DC (Maryland doesn't actually contain DC, but the counties around it do) and Baltimore.

In [190]:
map_parcc1
Out[190]:

For the below-average scores, the map is consistent with the hypothesis from above. Urban areas seem to be the most affected, and the areas that were severely lacking in high-performing students mostly seem to make up for it with a glut of lower-performing students. Counties with low counts of both high and low performers are explained by a large middle band of average students (like Caroline County), which is consistent with our metric.

In [131]:
map_parcc2
Out[131]:

Charting Family Income by County

Let's gather some data about Maryland family incomes, and see if we can find some correlations with the test data. We want to plot it in another choropleth map so we can see whether the county-by-county pattern in median family income (households with children) in 2017 matches the pattern in the 2017 PARCC test scores.

We are going to use the Maryland Census data from 2017 to obtain this info; see the link above.

In [114]:
Census2017 = pd.read_csv("ACS_17_5YR_DP03_with_ann.csv")
Census2017.head()
Out[114]:
GEO.id GEO.id2 GEO.display-label HC01_VC03 HC02_VC03 HC03_VC03 HC04_VC03 HC01_VC04 HC02_VC04 HC03_VC04 ... HC03_VC178 HC04_VC178 HC01_VC179 HC02_VC179 HC03_VC179 HC04_VC179 HC01_VC180 HC02_VC180 HC03_VC180 HC04_VC180
0 Id Id2 Geography Estimate; EMPLOYMENT STATUS - Population 16 ye... Margin of Error; EMPLOYMENT STATUS - Populatio... Percent; EMPLOYMENT STATUS - Population 16 yea... Percent Margin of Error; EMPLOYMENT STATUS - P... Estimate; EMPLOYMENT STATUS - Population 16 ye... Margin of Error; EMPLOYMENT STATUS - Populatio... Percent; EMPLOYMENT STATUS - Population 16 yea... ... Percent; PERCENTAGE OF FAMILIES AND PEOPLE WHO... Percent Margin of Error; PERCENTAGE OF FAMILIE... Estimate; PERCENTAGE OF FAMILIES AND PEOPLE WH... Margin of Error; PERCENTAGE OF FAMILIES AND PE... Percent; PERCENTAGE OF FAMILIES AND PEOPLE WHO... Percent Margin of Error; PERCENTAGE OF FAMILIE... Estimate; PERCENTAGE OF FAMILIES AND PEOPLE WH... Margin of Error; PERCENTAGE OF FAMILIES AND PE... Percent; PERCENTAGE OF FAMILIES AND PEOPLE WHO... Percent Margin of Error; PERCENTAGE OF FAMILIE...
1 0500000US24001 24001 Allegany County, Maryland 61337 141 61337 (X) 31633 720 51.6 ... 9.3 1.4 (X) (X) 11.5 1.5 (X) (X) 35.3 2.9
2 0500000US24003 24003 Anne Arundel County, Maryland 451557 539 451557 (X) 318762 1889 70.6 ... 5.5 0.6 (X) (X) 4.2 0.4 (X) (X) 15.6 1.1
3 0500000US24005 24005 Baltimore County, Maryland 670033 630 670033 (X) 446069 2532 66.6 ... 7.6 0.6 (X) (X) 6.5 0.4 (X) (X) 19.8 0.7
4 0500000US24009 24009 Calvert County, Maryland 71843 256 71843 (X) 49994 664 69.6 ... 6.2 1.6 (X) (X) 3.5 0.8 (X) (X) 18.9 2.5

5 rows × 551 columns

This is gross, but luckily we need very little of it. Let's slide up that first row into the header slot, so it's more easily accessible, and dump everything that we don't care about (so anything that isn't about the counties or income).

In [115]:
new_header = Census2017.iloc[0] 
Census2017 = Census2017[1:]
Census2017.columns = new_header
Census2017 = Census2017.filter(regex='(Median family income|Geography)')
Census2017.head()
Out[115]:
Geography Estimate; INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - Families - Median family income (dollars) Margin of Error; INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - Families - Median family income (dollars) Percent; INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - Families - Median family income (dollars) Percent Margin of Error; INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - Families - Median family income (dollars)
1 Allegany County, Maryland 59450 2366 (X) (X)
2 Anne Arundel County, Maryland 107889 2039 (X) (X)
3 Baltimore County, Maryland 88717 1119 (X) (X)
4 Calvert County, Maryland 113574 4475 (X) (X)
5 Caroline County, Maryland 63584 4377 (X) (X)

Better, but still a little too much. We also need to fix up the data so that it corresponds with the GeoJSON county names.

We want to use a robust statistic to measure the difference in 'average' wealth, so we settled on Median Family Annual Income (MFAI): we wouldn't have to worry about outliers, and we'd get a better picture of the counties' respective wealth without worrying about population size and distribution (data we don't have, and so can't analyze).
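To illustrate why the median is the robust choice here, consider a toy example (purely made-up numbers): a single very high earner drags the mean far more than the median.

import numpy as np  # already imported above

incomes = np.array([40000, 45000, 50000, 55000, 60000, 1000000])
print(np.mean(incomes))    # ~208,333 - pulled way up by the one outlier
print(np.median(incomes))  # 52,500 - barely notices it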

In [138]:
Census2017 = Census2017[Census2017.columns.drop(list(Census2017.filter(regex='(Percent|Margin)')))]
Census2017 = Census2017.rename(index=str, columns ={"Estimate; INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - Families - Median family income (dollars)" :"Median Family Income", 
                                                   "Geography" : "County"})
Census2017['County'] = Census2017['County'].str.replace(r'(County)?, Maryland', '').str.strip()
Census2017['County'] = Census2017['County'].str.replace('city', 'City')
Census2017['County'] = Census2017['County'].str.replace('St.', 'Saint')
Census2017.at['3', 'County'] = 'Baltimore County'
Census2017['Median Family Income'] = Census2017['Median Family Income'].astype(int)
Census2017
Out[138]:
County Median Family Income
1 Allegany 59450
2 Anne Arundel 107889
3 Baltimore County 88717
4 Calvert 113574
5 Caroline 63584
6 Carroll 105274
7 Cecil 85539
8 Charles 105606
9 Dorchester 64390
10 Frederick 103616
11 Garrett 57893
12 Harford 99596
13 Howard 135226
14 Kent 74097
15 Montgomery 122104
16 Prince George's 90382
17 Queen Anne's 106416
18 Saint Mary's 101941
19 Somerset 51591
20 Talbot 80865
21 Washington 71989
22 Wicomico 66657
23 Worcester 71534
24 Baltimore City 56236

We'll put the data into another choropleth, in the hope of comparing intensities across key counties.

In [189]:
import warnings; 
warnings.simplefilter('ignore')
map_medincome = folium.Map(location=[38.9072, -77.0369], tiles = 'cartodbpositron', zoom_start=8)

geo_dist = os.path.join('MD-24-maryland-counties.json')

map_medincome.choropleth(
    geo_data=geo_dist,
    name='choropleth',
    data= Census2017,
    columns=['County', 'Median Family Income'],
    key_on='feature.properties.NAME',
    fill_color='BuPu',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Family Household Median Annual Income in 2017 by County '
)

map_medincome
Out[189]:

If you compare this choropleth to the earlier choropleth of higher scores, there seems to be some correlation. The regions with higher median family income tend to be the highest-performing regions: Montgomery County, Howard County, Anne Arundel, and Calvert all rank in the upper percentile according to our previous choropleth.

Conversely, the poorest-performing county (Baltimore City) has nearly the lowest median income. The same goes for Prince George's.

Takeaways:

An interesting county is Worcester, which unexpectedly has a low median income but some of the highest performance of all the counties. I believe the key might be in the distribution: Worcester has a high average but a low range - its violin plot is short and fat.

That is perhaps an indicator that Worcester's income distribution is also short and fat, and that the school system may be run in a way that pushes all students to succeed, rather than relying on families having money for classes outside of school.

Another interesting takeaway: these are also the regions with the highest student performance ranges in the distribution (see the violin plots above). So it seems as if the areas with the highest median income also have higher peaks and valleys. The distributions for all of those areas are skinny as well, indicating that there isn't much of a glut in the average area (not a normal distribution).


Relating Income and Test Scores

At this point we have already looked at the income and test score data (PARCC) for 2017, so we want to see if we can find a relationship between the two using OLS, and then see if we can fit a model that supports our hypothesis that test scores are too closely tied to income to be weighted so heavily in the star rankings.

Using the Performance statistic we created, we will attempt to run a linear regression and analyze the $R^2$ and the p-values to better fit the model.

In [221]:
# Sum the school-level Performance values to get a cumulative per-county score,
# then merge with the census income data by county.
fulldf = Census2017.merge(PARCC2017Performance.groupby('LEA Name').agg({'Performance': 'sum'}).reset_index(),
                          left_on='County', right_on='LEA Name', how='inner')
fulldf = fulldf.drop(columns=['LEA Name'])  # duplicate of 'County' after the merge
fulldf = fulldf.rename(index=str, columns={'Median Family Income': 'Median_Family_Income'})
fulldf
Out[221]:
County Median_Family_Income Performance
0 Allegany 59450 13507.297605
1 Anne Arundel 107889 76391.412839
2 Baltimore County 88717 73137.097171
3 Calvert 113574 18629.782009
4 Caroline 63584 6745.185830
5 Carroll 105274 32135.696368
6 Cecil 85539 17712.520672
7 Charles 105606 20469.704127
8 Dorchester 64390 6226.068964
9 Frederick 103616 49613.587418
10 Garrett 57893 6723.154796
11 Harford 99596 38031.043715
12 Howard 135226 61965.412164
13 Kent 74097 4232.840692
14 Montgomery 122104 135690.044449
15 Prince George's 90382 66741.016115
16 Queen Anne's 106416 10870.796987
17 Saint Mary's 101941 19835.901362
18 Somerset 51591 4031.446699
19 Talbot 80865 6151.325434
20 Washington 71989 22582.723566
21 Wicomico 66657 10857.305837
22 Worcester 71534 10728.753793
23 Baltimore City 56236 6213.900599

We stitch the two together by county and remove the unnecessary columns, so that we can use the result to fit our linear regression.

When we fit our regression we want to make note of our F-statistic and make sure it is larger than the critical F value. If so, we can reject the null hypothesis and declare that this model has some semblance of positive linearity.

We want a small p-value for our coefficient and a high R-squared (we don't have to worry about the adjusted value in simple linear regression).
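For reference, the critical F value we're comparing against can be computed with scipy (an extra import, used only in this sketch):

from scipy import stats

# 5% critical value for F(1, 22): one predictor, 24 counties minus 2 fitted parameters.
f_crit = stats.f.ppf(0.95, dfn=1, dfd=22)
print(f_crit)  # ~4.30; the F-statistic reported below (15.28) comfortably exceeds it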

In [242]:
relmodel = ols(formula='Performance ~ Median_Family_Income', data=fulldf).fit() 
relanova = anova_lm(relmodel)
print(relmodel.summary())
print(relanova)
inc_perf = sns.regplot(x = "Median_Family_Income", y = "Performance", data = fulldf)
inc_perf
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            Performance   R-squared:                       0.410
Model:                            OLS   Adj. R-squared:                  0.383
Method:                 Least Squares   F-statistic:                     15.28
Date:                Sat, 15 Dec 2018   Prob (F-statistic):           0.000753
Time:                        17:27:51   Log-Likelihood:                -276.42
No. Observations:                  24   AIC:                             556.8
Df Residuals:                      22   BIC:                             559.2
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept            -4.749e+04   2.05e+04     -2.319      0.030      -9e+04   -5013.318
Median_Family_Income     0.8920      0.228      3.909      0.001       0.419       1.365
==============================================================================
Omnibus:                       10.296   Durbin-Watson:                   1.788
Prob(Omnibus):                  0.006   Jarque-Bera (JB):                8.414
Skew:                           1.167   Prob(JB):                       0.0149
Kurtosis:                       4.723   Cond. No.                     3.55e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.55e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
                        df        sum_sq       mean_sq          F    PR(>F)
Median_Family_Income   1.0  9.850552e+09  9.850552e+09  15.278587  0.000753
Residual              22.0  1.418404e+10  6.447293e+08        NaN       NaN
Out[242]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fcb9225da90>

Model issues: the p-values for the intercept and coefficient look okay, and since we aren't doing MLR we don't have to worry about multiple p-values, VIFs, or multicollinearity (even though we get that warning).

The data, when plotted, has a relatively low R-squared, which suggests it needs a transformation to make the relationship linear.

We would rather not remove outliers because the dataset is already so small - only 24 points - so we think it is in our best interest to transform the response. The log transform of the response looks most appropriate, based on the exponential shape of the points and the performance of the previous model.

In [241]:
nemodel = ols(formula='np.log(Performance) ~ Median_Family_Income', data=fulldf).fit() 
neanova = anova_lm(nemodel)
print(nemodel.summary())
print(neanova)
fulldf['logperf']= np.log(fulldf['Performance'])
inc_perf_log = sns.regplot(x = "Median_Family_Income", y = "logperf", data = fulldf)
                             OLS Regression Results                            
===============================================================================
Dep. Variable:     np.log(Performance)   R-squared:                       0.565
Model:                             OLS   Adj. R-squared:                  0.545
Method:                  Least Squares   F-statistic:                     28.53
Date:                 Sat, 15 Dec 2018   Prob (F-statistic):           2.32e-05
Time:                         17:27:46   Log-Likelihood:                -24.036
No. Observations:                   24   AIC:                             52.07
Df Residuals:                       22   BIC:                             54.43
Df Model:                            1                                         
Covariance Type:             nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept                6.9427      0.555     12.508      0.000       5.792       8.094
Median_Family_Income  3.303e-05   6.18e-06      5.341      0.000    2.02e-05    4.59e-05
==============================================================================
Omnibus:                        0.439   Durbin-Watson:                   1.986
Prob(Omnibus):                  0.803   Jarque-Bera (JB):                0.573
Skew:                           0.190   Prob(JB):                        0.751
Kurtosis:                       2.345   Cond. No.                     3.55e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.55e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
                        df     sum_sq    mean_sq         F    PR(>F)
Median_Family_Income   1.0  13.503584  13.503584  28.52628  0.000023
Residual              22.0  10.414216   0.473373       NaN       NaN

Final model results: the R-squared and F value increased and the coefficient p-value decreased, so this is definitely a better model! We can reject the null on both our F test and our p-value test.

So there seems to be a strong correlation between income and PARCC performance for 2017.

The final equation we have for it is log(Performance) = 3.303e-05 $\times$ Median_Family_Income + 6.9427

There is definitely a positive trend (exponential in Performance, linear in log-Performance).
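As a quick sanity check on the fitted equation, we can plug an income into it and recover a predicted cumulative Performance value. The coefficients below are copied from the summary above; this is just an illustration, not a new fit.

def predicted_performance(median_family_income):
    # Log-linear model from above: log(Performance) = 6.9427 + 3.303e-05 * income
    log_perf = 6.9427 + 3.303e-05 * median_family_income
    return np.exp(log_perf)

# e.g. Howard County's median family income of $135,226:
print(predicted_performance(135226))  # ~90,000 (the observed value was ~62,000)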

Conclusion

Why it's important

Through our analysis we have shown that there is a definite positive relationship between Maryland students' household incomes and their PARCC scores. That means that if you come from a lower-income area, you tend to do worse on standardized testing - and that's a problem.

According to The Baltimore Sun, the reason it is a problem is that these scores are taken into consideration by other programs and companies, to give students extra attention and to take away instructional time from students. These scores are also used in the calculation of star ratings, which affect not only the schools themselves but also the surrounding house prices, driving down property values in areas with low star scores.

According to The Herald, the PARCC test is much better than its predecessor, the ISAT (a study done in Illinois), but regardless there is still a trend, even if its slope is smaller.

We believe that the best thing to do would be to weigh the scores of these exams less in star ratings and property valuations, and instead use them as indicators of which schools need more help and funding, rather than as indicators of excellent schools. We should be giving straightforward standardized exams that cover only the common core material, accessible to people of all financial backgrounds. The data shows otherwise, however: as of 2018, [PARCC scores continue to drop](https://wtop.com/maryland/2018/08/maryland-parcc-scores-students-still-struggle-to-pass-english-math-tests-but-some-scores-tick-up/), and with them star scores, seriously affecting the lives of students from low-income households.

How we could have improved the Model

We could have improved the model by processing more data points from more years, but that would require cleaning a lot more census data and feeding it into the model, which we didn't have the space or time for. Also, if we plot over a period of time there are more things affecting the model, like general inflation and other time-dependent influences that we wanted to avoid.

We could have added the 2018 PARCC data, but we couldn't find a clean version for some reason.

Another thing we could have done would be to add other variables to the analysis, but the goal of our project was simply to demonstrate the positive linear relationship between the two quantities and analyze the distributions and geographic placements, not to predict a student's test score based on their individual attributes.

Some research that backs our claim up, but for the entire U.S.