*Mayfield High School junior Laura Cruz, 18, holds up a sign March 2, 2015, during a student-organized walkout to protest the PARCC exams in Las Cruces, N.M.*
There's a remarkable lack of substantive analysis of public schooling data in the United States, especially in lower-income areas like Baltimore. This is unfortunate, given how crucial this information is, both for the well-being of students and for the direction of our society.
As part of a two-year initiative, Maryland's state education department has finally released star ratings to rank the quality of its public elementary, middle, and high schools. The star ratings were required by a federal law, the Every Student Succeeds Act (ESSA), which replaced the more punitive No Child Left Behind Act. They are meant to help identify and diagnose struggling schools within Maryland, with every school receiving a one-star rating having to come up with a plan of action. According to the Baltimore Sun, "The bottom 5 percent of schools in the state — including some two-star schools — will be identified as needing an intervention plan for improvement. Most of those schools are in Baltimore City, where 23 schools earned one star."
These star ratings take standardized test scores into account; in fact, almost 65% of the evaluation is based on them. The star rating is a step in the right direction, having only been released on December 4, 2018, but the majority of what we know about the schools in each area still comes from their standardized test scores. In Maryland, that test is the PARCC (Partnership for Assessment of Readiness for College and Careers).
We wanted to see whether income and test scores by county in Maryland have a linear relationship, in order to judge whether using blanket PARCC scores to rate schools is an accurate way of describing student potential. Additionally, we wanted to map the disparities in both test scores and income to try to predict problem areas, see whether they spread out to nearby counties, and check whether we get location results similar to the star ratings.
Our initial hypothesis: family income, even in the public school system, is so closely tied to the quality of education received that PARCC scores (assuming it is a well-designed exam) will have a positive linear correlation with family income.
Obviously, the PARCC is not a perfect indicator for how well a student is doing, but it is a standard metric that we can at least attempt to use as justification.
The model doesn't take into account other factors, such as race, gender, or English-speaking vs. non-English-speaking households, because we didn't think it necessary to fit those against school performance (for both ethical and time-related reasons).
Finally, we'll be using data from 2010 through 2017, inclusive. This brings up a few issues that we'll address later, especially the fact that Maryland switched from the MSA (Maryland State Assessment) system to PARCC (Partnership for Assessment of Readiness for College and Careers) after 2014.
This tutorial will use data from:
The Maryland State Department of Education
The United States Census Bureau
Here is a list of tools we'll be using:
import pandas as pd
# We'll be using pandas as our primary means of data manipulation and management.
import numpy as np
import matplotlib.pyplot as plt
# We'll be using pyplot for graphing data across time.
import folium
# folium is an expansive data visualization library, but we'll mostly be using it for its choropleth feature.
import requests
import os
import seaborn as sns
from statsmodels.formula.api import ols
# ordinary least squares models
from statsmodels.stats.anova import anova_lm
First things first, we've got to gather our data. We did this manually, going to the sites linked above and downloading the CSV files into our environment. It's not that many files, so all things considered, it's faster than trying to script the process.
Once we've gotten that out of the way, we can use one of pandas' (many) helpful features and turn these CSVs straight into dataframes. From 2010 to 2014, we reference the MSA data files. From 2015 to 2017, we use PARCC. We'll keep lists of each individual year's dataframe.
grade_data_msa = []
grade_data_parcc = []
for year in range(2010, 2015):
    grade_data_msa.append(pd.read_csv('MSA_' + str(year) + '.csv'))
for year in range(2015, 2018):
    grade_data_parcc.append(pd.read_csv('PARCC_' + str(year) + '.csv'))
Now that we've got the data, we want to put all of the years together. Luckily, pandas has another great feature that lets us do this cleanly.
msaAllYears = pd.concat(grade_data_msa, sort=False)
parccAllYears = pd.concat(grade_data_parcc, sort=False)
Time to investigate these dataframes to see what we're working with. First up, MSA:
msaAllYears.head()
Not horrible, but it needs a fair amount of cleaning. First of all, it has a bunch of columns we don't care about. Let's drop them.
msaAllYears = msaAllYears.drop(columns = ['LEA Number', 'School Number', 'School Name', 'Test Type Title',
'Grade Title', 'Grade', 'Create Date', 'Test Type'])
msaAllYears = msaAllYears[msaAllYears.columns.drop(list(msaAllYears.filter(regex='Percent')))] # little shortcut for dropping every column whose name contains 'Percent'
msaAllYears.head()
Already a lot easier to parse, visually, but there's still plenty to do.
I don't like those asterisks, and I'm planning on summing over those columns in a bit. Let's set them to 0s, so we can forget about them.
msaAllYears = msaAllYears.replace('*', '0')
Also, "LEA Name" isn't super intuitive, when really we're looking at counties. Renaming that column seems smart.
msaAllYears = msaAllYears.rename(index=str, columns={"LEA Name": "County"})
Lastly, we're considering Maryland schools across county, so I don't really want to look at SEED (a DC charter school) or the state as a whole. Goodbye to those rows.
msaAllYears = msaAllYears[msaAllYears['County'] != 'SEED']
msaAllYears = msaAllYears[msaAllYears['County'] != 'Seed School LEA']
msaAllYears = msaAllYears[msaAllYears['County'] != 'All Public Schools']
Let's see where we're at now. (I'm printing the entire thing for a reason)
msaAllYears
We're getting there. Time to deal with all those NaNs.
If you'll notice, every row seems to have values in either "Subject Title" $\textbf{or}$ "Subject", and in either "Test Member Count" $\textbf{or}$ "Tested Count". Let's make sure this is the case.
msaAllYears = msaAllYears.replace(np.nan, '')
count = 0
for index, row in msaAllYears.iterrows():
    if (row['Subject Title'] == '' and row['Subject'] == '') \
            or (row['Test Member Count'] == '' and row['Tested Count'] == '') \
            or (row['Subject Title'] != '' and row['Subject'] != '') \
            or (row['Test Member Count'] != '' and row['Tested Count'] != ''):
        print('Error at index: ' + str(index))
...
Okay we're good! Let's go ahead and fill in the respective columns, then drop the extra ones.
msaAllYears = msaAllYears.replace('', np.nan)
msaAllYears['Subject'].fillna(msaAllYears['Subject Title'], inplace=True)
msaAllYears['Tested Count'].fillna(msaAllYears['Test Member Count'], inplace=True)
msaAllYears = msaAllYears.drop(columns=['Subject Title', 'Test Member Count'])
Unfortunately, the PARCC testing system doesn't include science exams, so we'll be discarding those in the MSA data for consistency.
msaAllYears = msaAllYears[msaAllYears['Subject'] != 'Science']
Let's convert everything to integers so we can do some math.
msaAllYears['Tested Count'] = msaAllYears['Tested Count'].astype(int)
msaAllYears['Advanced Count'] = msaAllYears['Advanced Count'].astype(int)
msaAllYears['Proficient Count'] = msaAllYears['Proficient Count'].astype(int)
msaAllYears['Basic Count'] = msaAllYears['Basic Count'].astype(int)
Time to put everything together so we can graph it more easily later.
msaAllYears = msaAllYears.groupby(['Academic Year', 'County', 'Subject']).sum()
msaAllYears = msaAllYears.reset_index()
msaAllYears.head()
Okay, we're starting to see real information. Now that we have a cumulative count, let's get back those percentages we dumped earlier.
msaAllYears['Adv Ratio'] = msaAllYears['Advanced Count'] / msaAllYears['Tested Count']
msaAllYears['Prof Ratio'] = msaAllYears['Proficient Count'] / msaAllYears['Tested Count']
msaAllYears['Basic Ratio'] = msaAllYears['Basic Count'] / msaAllYears['Tested Count']
msaAllYears = msaAllYears.drop(columns=['Advanced Count', 'Proficient Count', 'Basic Count'])
msaAllYears.head()
Alright, so let's take a second to gather ourselves. We want to graph this data over time - how?
Well, we need to have some sort of objective metric of performance for each county, in each year, in each subject. As it stands, we just have a bunch of percentages.
Here's the idea. We take the difference between the advanced percentage and the basic percentage. We can ignore the number in the middle and consider those students "neutral". So if a county has a lot of high-performing students relative to low-performing ones, the number will be positive, and negative for low-performing counties. Then, we'll add 1 to make all these numbers positive (with the already-positive numbers jumping over 1), and multiply by 100 to make the numbers more readable.
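As a quick sanity check with made-up numbers: a county-year with 30% advanced and 20% basic works out to $(0.30 - 0.20 + 1) \times 100 = 110$, while one with 20% advanced and 40% basic works out to $(0.20 - 0.40 + 1) \times 100 = 80$. Here's what we get: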
msaAllYears['Performance'] = (msaAllYears['Adv Ratio'] - msaAllYears['Basic Ratio'] + 1) * 100
msaAllYears = msaAllYears[msaAllYears.columns.drop(list(msaAllYears.filter(regex='Ratio')))]
msaAllYears = msaAllYears.drop(columns=['Tested Count'])
msaAllYears.head()
Looks good, and graphable! Now to split into two dataframes, one for math and one for language. If there's an easy way to do this in pandas, I haven't found it. Fortunately, it's not that hard to do manually, and it's good coding practice anyway!
First let's get a list of all the counties, and initialize a dictionary for each subject.
counties = msaAllYears['County'].unique()
readingDict = {}
mathDict = {}
for c in counties:
    readingDict[c] = []
    mathDict[c] = []
Okay, now here's the idea. We want to take every row that's assigned to each subject and append its performance value to the associated dictionary's county entry.
Example Row: 2011, Montgomery, Math, 103
This will go to the math dictionary, look for the Montgomery key, and then add the performance 103. Because this table is stored chronologically, we don't have to worry about the year - it'll be in the right place.
for index, row in msaAllYears.iterrows():
    if row['Subject'] == 'Reading':
        readingDict[row['County']].append(row['Performance'])
    else:
        mathDict[row['County']].append(row['Performance'])
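(As an aside, pandas' pivot_table can get close to the same result in a couple of lines. Here's a sketch, assuming the column names above; it isn't used anywhere below, and we stick with the explicit dictionaries for the rest of the tutorial.)
# Hypothetical alternative, not used below: pivot each subject straight into a year-by-county table
readingPivot = msaAllYears[msaAllYears['Subject'] == 'Reading'].pivot_table(index='Academic Year', columns='County', values='Performance')
mathPivot = msaAllYears[msaAllYears['Subject'] != 'Reading'].pivot_table(index='Academic Year', columns='County', values='Performance')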
Let's convert back to see the fruits of our labor.
msaAllYearsReading = pd.DataFrame(readingDict, index = range(2010, 2015))
msaAllYearsMath = pd.DataFrame(mathDict, index = range(2010, 2015))
msaAllYearsReading.head()
Looks pretty nice. Finally time to graph it!
We'll split the counties in half so each graph isn't too cluttered.
msaAllYearsReading.iloc[:, :12].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Reading Exam Performance")
msaAllYearsReading.iloc[:, 12:].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Reading Exam Performance")
msaAllYearsMath.iloc[:, :12].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Math Exam Performance")
msaAllYearsMath.iloc[:, 12:].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Math Exam Performance")
Some brief analysis before we continue - Baltimore City is very clearly the lowest performing county in both subjects. If you're familiar with the area this probably isn't surprising. We'll further explore this later.
Apart from Baltimore City, most counties are fairly in line, with more minor exceptions like Howard and Dorchester.
Moving on to the PARCC year records. Luckily, these start off a bit cleaner, so we have less work to do.
In general, if I did something very similar with the MSA data, I won't repeat myself here. Let's take a look.
parccAllYears.head()
This one is significantly easier to manage. All we need to do is drop the columns we won't be using, then group by year and county, along with the other minor changes we made last time.
parccAllYears = parccAllYears.drop(columns=['LEA Number', 'School Number', 'School Name', 'Create Date'])
parccAllYears = parccAllYears.rename(index=str, columns={"LEA Name": "County"})
parccAllYears = parccAllYears[parccAllYears.columns.drop(list(parccAllYears.filter(regex='Percent')))]
parccAllYears = parccAllYears[parccAllYears['County'] != 'SEED']
parccAllYears = parccAllYears[parccAllYears['County'] != 'Seed School LEA']
parccAllYears = parccAllYears[parccAllYears['County'] != 'All Public Schools']
parccAllYears = parccAllYears.replace('*', '0')
To make things consistent with the MSA data, we're going to categorize exams as either ELA (English Language Arts) or math. As you may be able to tell, once students get to high school the exam names change from "Mathematics Grade X" to Algebra, Geometry, etc. Therefore, we'll treat anything that has the word "English" in it as an ELA exam, and anything without it as a math exam.
parccAllYears.loc[parccAllYears['Assessment'].str.contains('English'), 'Assessment'] = 'ELA'
parccAllYears.loc[~parccAllYears['Assessment'].str.contains('ELA'), 'Assessment'] = 'Math'
Once again, we'll convert the relevant columns to integers so we can do math stuff, and group them appropriately.
parccAllYears['Tested Count'] = parccAllYears['Tested Count'].astype(int)
parccAllYears[list(parccAllYears.filter(regex='Level'))] = parccAllYears[list(parccAllYears.filter(regex='Level'))].astype(int)
parccAllYears = parccAllYears.groupby(['Academic Year', 'County', 'Assessment']).sum()
parccAllYears = parccAllYears.reset_index()
Same as before: convert test counts into test ratios, then use a little shortcut to drop anything with the word "Level" in order to get rid of the counts.
parccAllYears['1 Ratio'] = parccAllYears['Level 1: Did not yet meet expectations - Count'] / parccAllYears['Tested Count']
parccAllYears['2 Ratio'] = parccAllYears['Level 2: Partially met expectations - Count'] / parccAllYears['Tested Count']
parccAllYears['3 Ratio'] = parccAllYears['Level 3: Approached expectations - Count'] / parccAllYears['Tested Count']
parccAllYears['4 Ratio'] = parccAllYears['Level 4: Met expectations - Count'] / parccAllYears['Tested Count']
parccAllYears['5 Ratio'] = parccAllYears['Level 5: Exceeded expectations - Count'] / parccAllYears['Tested Count']
parccAllYears = parccAllYears[parccAllYears.columns.drop(list(parccAllYears.filter(regex='Level')))]
parccAllYears.head()
Here is the metric of performance we'll use for the PARCC data. Similar to the MSA metric, we consider anything below the median score (in this case, level 3) to be negative and anything above it to be positive. We double the weight of levels 1 and 5 to account for the extremes, then shift and scale to make the result more readable.
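In symbols, writing $r_i$ for the share of tested students who scored at level $i$, the metric below is $\text{Performance} = (-2r_1 - r_2 + r_4 + 2r_5 + 1) \times 100$.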
parccAllYears['Performance'] = (-2*parccAllYears['1 Ratio'] - parccAllYears['2 Ratio'] + \
parccAllYears['4 Ratio'] + 2*parccAllYears['5 Ratio'] + 1) * 100
parccAllYears = parccAllYears[parccAllYears.columns.drop(list(parccAllYears.filter(regex='Ratio')))]
parccAllYears.head()
Back to the same convert-then-plot paradigm.
counties = parccAllYears['County'].unique()
elaDict = {}
mathDict = {}
for c in counties:
    elaDict[c] = []
    mathDict[c] = []
for index, row in parccAllYears.iterrows():
    if row['Assessment'] == 'ELA':
        elaDict[row['County']].append(row['Performance'])
    else:
        mathDict[row['County']].append(row['Performance'])
parccAllYearsELA = pd.DataFrame(elaDict, index = range(2015, 2018))
parccAllYearsMath = pd.DataFrame(mathDict, index = range(2015, 2018))
parccAllYearsELA.iloc[:, :12].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="ELA Exam Performance")
parccAllYearsELA.iloc[:, 12:].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="ELA Exam Performance")
parccAllYearsMath.iloc[:, :12].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Math Exam Performance")
parccAllYearsMath.iloc[:, 12:].plot.line(figsize=(15,8)).set(xlabel="Year", ylabel="Math Exam Performance")
Now let's put it all together, accounting for the change in exam structure across years. Again, pandas saves the day.
allYearsLanguage = pd.concat([msaAllYearsReading, parccAllYearsELA], sort=True)
allYearsMath = pd.concat([msaAllYearsMath, parccAllYearsMath], sort=True)
allYearsMath
That's some mighty attractive data. Let's take a better look. We'll draw a line down the year marker where the exams changed from MSA to PARCC.
plt1 = allYearsLanguage.iloc[:, :12].plot.line(figsize=(15,8))
plt2 = allYearsLanguage.iloc[:, 12:].plot.line(figsize=(15,8))
plt1.set(xlabel="Year", ylabel="Language Exam Performance")
plt2.set(xlabel="Year", ylabel="Language Exam Performance")
plt1.axvline(2015, color='black', linewidth=0.2, label='change')
plt2.axvline(2015, color='black', linewidth=0.2)
plt1 = allYearsMath.iloc[:, :12].plot.line(figsize=(15,8))
plt2 = allYearsMath.iloc[:, 12:].plot.line(figsize=(15,8))
plt1.set(xlabel="Year", ylabel="Math Exam Performance")
plt2.set(xlabel="Year", ylabel="Math Exam Performance")
plt1.axvline(2015, color='black', linewidth=0.2, label='change')
plt2.axvline(2015, color='black', linewidth=0.2)
What do we see here?
Well, the first thing you'll notice is that the change in exams noticeably altered the trajectory of nearly every county. Even Baltimore City fared better, and stopped declining as harshly in performance. The best, like Howard, stay at the top. This is not surprising: testing systems may favor slightly different things, but certain resource advantages will always show their hand in results like these. We'll explore what those resources may be in a bit.
Our goal here is to further explore the distribution of exam success across Maryland, this time focusing on one year.
We'll repeat our collection process, and then analyze not only the mathematical distribution of test scores (using the performance variable we created in the previous section), but also their geographic distribution, in order to see whether counties closer to each other have similar scores, and to relate them to income later on.
PARCC2017 = pd.read_csv("PARCC_2017.csv")
PARCC2017 = PARCC2017.drop(columns=['LEA Number', 'School Number', 'School Name', 'Create Date', 'Assessment', 'Academic Year'])
PARCC2017 = PARCC2017[PARCC2017.columns.drop(list(PARCC2017.filter(regex='Percent')))]
PARCC2017 = PARCC2017.replace('*', '0')
PARCC2017['Tested Count'] = PARCC2017['Tested Count'].astype(int)
PARCC2017[list(PARCC2017.filter(regex='Level'))] = PARCC2017[list(PARCC2017.filter(regex='Level'))].astype(int)
PARCC2017['1 Ratio'] = PARCC2017['Level 1: Did not yet meet expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['2 Ratio'] = PARCC2017['Level 2: Partially met expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['3 Ratio'] = PARCC2017['Level 3: Approached expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['4 Ratio'] = PARCC2017['Level 4: Met expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['5 Ratio'] = PARCC2017['Level 5: Exceeded expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['Performance'] = (-2*PARCC2017['1 Ratio'] - PARCC2017['2 Ratio'] + \
PARCC2017['4 Ratio'] + 2*PARCC2017['5 Ratio'] + 1) * 100
PARCC2017Performance = PARCC2017.copy(deep = True)
distplot = sns.violinplot(x = 'LEA Name', y = 'Performance', data = PARCC2017.iloc[:4085, :]).set_title("Distribution of Student Performance by County")
distplot.figure.set_size_inches(20, 5)
slowplot = sns.violinplot(x = 'LEA Name', y = 'Performance', data = PARCC2017.iloc[4085:, :]).set_title("Distribution of Student Performance by County (Cont.)")
slowplot.figure.set_size_inches(20, 5)
If we look at the distribution of the performance, we'll find that they vary greatly in shape, range and glut (the thick areas).
A skinnier shape indicates a more even distribution of scores, while a shorter, fatter shape shows that more students fell within a certain performance range.
A taller range means that excellent students and terrible students have a larger disparity, and vice versa for a shorter range.
A glut's location indicates where more students are falling in terms of performance.
Some interesting happenings:
Our county (Prince George's) has a pretty normal-looking distribution with an extremely wide range, indicating that it's otherwise normally distributed, but its excellent students are very excellent compared to its worst students.
Montgomery, Howard, Anne Arundel and Calvert are all relatively skinny, as well as the highest performing counties.
SEED and the all-public-schools aggregate have similar distributions, but flipped. The glut placement might dictate the shape of the rest, because both are relatively short and fat.
PARCC2017 = PARCC2017[PARCC2017.columns.drop(list(PARCC2017.filter(regex='Ratio|Performance')))]
PARCC2017 = PARCC2017.groupby('LEA Name').agg(['sum'])
PARCC2017['Lower Ratio'] = (PARCC2017['Level 1: Did not yet meet expectations - Count'] + PARCC2017['Level 2: Partially met expectations - Count']) / PARCC2017['Tested Count']
PARCC2017['Middle Ratio'] = PARCC2017['Level 3: Approached expectations - Count'] / PARCC2017['Tested Count']
PARCC2017['Upper Ratio'] = (PARCC2017['Level 4: Met expectations - Count'] + PARCC2017['Level 5: Exceeded expectations - Count']) / PARCC2017['Tested Count']
PARCC2017 = PARCC2017.reset_index()
PARCC2017.head()
We added three columns to make sure that we could accurately compare counties: Upper, Middle, and Lower Ratio. The Upper and Lower Ratios are calculated by taking all the reported 4 and 5 scores (1 and 2 scores for the Lower) and dividing them by the total number of students who took the exam in that county.
On a slight tangent, we'd like to note that every county (without exception) under-reports its data. This shouldn't be a large issue (more than 97% of results are reported in every county), but it's still worth mentioning, as it could affect some of our numbers:
PARCC2017['Percent Reported'] = PARCC2017['Lower Ratio'] + PARCC2017['Middle Ratio'] + PARCC2017['Upper Ratio']
PARCC2017['Percent Reported']
These are far from bad numbers, it's just slightly funny to see that the overall reporting is still listed as 100%. Moving on.
Let's get rid of those not-counties again.
PARCC2017 = PARCC2017.drop(PARCC2017.index[0])
PARCC2017 = PARCC2017.drop(PARCC2017.index[18])
Time to do some real visualization, strap in.
In order to better visualize the disparity by county and see if there is any spillover, we decided to plot some choropleths to help us see which areas have outstanding PARCC scores (based on 2017 numbers) and which areas are hurting the most.
import warnings;
warnings.simplefilter('ignore')
map_parcc1 = folium.Map(location=[38.9072, -77.0369],tiles = 'cartodbpositron', zoom_start=8)
map_parcc2 = folium.Map(location=[38.9072, -77.0369],tiles = 'cartodbpositron', zoom_start=8)
geo_dist = os.path.join('MD-24-maryland-counties.json')
map_parcc1.choropleth(
geo_data=geo_dist,
name='choropleth',
data= PARCC2017,
columns=['LEA Name', 'Upper Ratio'],
key_on='feature.properties.NAME',
fill_color='YlGn',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Student Rate of Above Average PARCC Scores in 2017'
)
map_parcc2.choropleth(
geo_data=geo_dist,
name='choropleth',
data= PARCC2017,
columns=['LEA Name', 'Lower Ratio'],
key_on='feature.properties.NAME',
fill_color='OrRd',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Student Rate of Below Average PARCC Scores in 2017'
)
As you can see for the above-average scores, there seems to be some geographic influence: four of the darkest regions are all next to each other, and the schools that seem to be having the most trouble surround two of Maryland's biggest and most populous urban areas: DC (the state doesn't actually contain DC, but the counties around it do border it) and Baltimore.
map_parcc1
For the below-average scores, the map is consistent with the hypothesis above. Urban areas seem to be the most affected, and the areas that were severely lacking in high-performing students mostly seem to make up for it with a glut of lower-performing students. Counties with low shares of both high and low performers are explained by average students (like Caroline County), which is consistent with our model.
map_parcc2
Let's gather some data about Maryland family incomes and see if we can find some correlations with the test data. We want to plot it in another choropleth map so we can see whether the differences in median family income (households with children) in 2017 line up with the 2017 PARCC test scores by county.
We are going to use the Maryland census data from 2017 to obtain this info; see the link above.
Census2017 = pd.read_csv("ACS_17_5YR_DP03_with_ann.csv")
Census2017.head()
This is gross, but luckily we need very little of it. Let's slide that first row up into the header slot so it's more easily accessible, and dump everything we don't care about (anything that isn't about the counties or income).
new_header = Census2017.iloc[0]
Census2017 = Census2017[1:]
Census2017.columns = new_header
Census2017 = Census2017.filter(regex='(Median family income|Geography)')
Census2017.head()
Better, but still a little too much. We also need to fix up the data so that it corresponds with the GeoJSON county names.
We want to use a robust statistic to measure the difference in 'average' wealth, so we settled on Median Family Annual Income (MFAI): we wouldn't have to worry about outliers, and we'd get a better picture of the counties' respective wealth without worrying about population size and distribution (data we don't have and can't analyze).
Census2017 = Census2017[Census2017.columns.drop(list(Census2017.filter(regex='(Percent|Margin)')))]
Census2017 = Census2017.rename(index=str, columns ={"Estimate; INCOME AND BENEFITS (IN 2017 INFLATION-ADJUSTED DOLLARS) - Families - Median family income (dollars)" :"Median Family Income",
"Geography" : "County"})
Census2017 ['County'] = Census2017['County'].str.replace(r'(County)?, Maryland', '').str.strip()
Census2017 ['County'] = Census2017['County'].str.replace('city', 'City')
Census2017 ['County'] = Census2017['County'].str.replace('St.', 'Saint')
Census2017.at['3', 'County'] = 'Baltimore County'
Census2017 ['Median Family Income'] = Census2017 ['Median Family Income'].astype(int)
Census2017
We will take the data and put it into another choropleth in the hope of comparing the intensities of key counties.
import warnings;
warnings.simplefilter('ignore')
map_medincome = folium.Map(location=[38.9072, -77.0369], tiles = 'cartodbpositron', zoom_start=8)
geo_dist = os.path.join('MD-24-maryland-counties.json')
map_medincome.choropleth(
geo_data=geo_dist,
name='choropleth',
data= Census2017,
columns=['County', 'Median Family Income'],
key_on='feature.properties.NAME',
fill_color='BuPu',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Family Household Median Annual Income in 2017 by County '
)
map_medincome
If you compare this choropleth to the earlier choropleth of higher scores, there seems to be some correlation. The regions with higher median family income tend to be the highest-performing regions: Montgomery, Howard, Anne Arundel, and Calvert all rank in the upper percentiles according to our previous choropleth.
Conversely, the lowest-performing county is by far the one with the lowest median income. The same goes for Prince George's.
Takeaways:
An interesting county is Worcester, which unexpectedly has a low median income but some of the highest performance of all the counties. We believe the key might be in the distribution: Worcester has a high average but a low range, and its violin plot is short and fat.
That is perhaps an indicator that Worcester's income distribution might also be short and fat, and that the school system may be run in a way that pushes all students to exceed, rather than relying on families having the money for classes outside of school.
Another interesting takeaway: the high-income counties above are also the regions with the widest student performance ranges (see the violin plots above). So it seems as if the areas with the highest median income also have higher peaks and lower valleys. The distributions for all of those areas are skinny as well, indicating that there isn't much of a glut around the average (not a normal distribution).
At this point we have already looked at the income and test score (PARCC) data for 2017, so we want to see if we can find a relationship between the two using OLS, and then see if we can fit a model that speaks to our hypothesis that test scores are too closely related to income to be weighted so heavily in the star ratings.
Using the Performance statistic we created, we will run a linear regression and analyze the R-squared and the p-values to better fit the model.
fulldf = Census2017.merge(PARCC2017Performance.groupby('LEA Name').agg({'Performance': 'sum'}), left_on='County', right_on='LEA Name', how='inner')
fulldf = fulldf[fulldf.columns.drop(list(fulldf.filter(regex='Level|Tested|Ratio')))]
fulldf = fulldf.rename(index=str, columns={'Median Family Income': 'Median_Family_Income'})
fulldf
We stitch the two together by county and remove the unnecessary columns, so that we can use this to plot our linear regression.
When we fit our regression, we want to make note of the F-statistic and make sure it is larger than the critical F value (equivalently, that its p-value is small). If so, we can reject the null hypothesis and declare that this model shows some semblance of positive linearity.
We want a small p-value for our coefficient and a high R-squared (we don't have to worry about the adjusted version in simple linear regression).
relmodel = ols(formula='Performance ~ Median_Family_Income', data=fulldf).fit()
relanova = anova_lm(relmodel)
print(relmodel.summary())
print(relanova)
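If you just want the headline numbers rather than the whole summary table, the fitted results object exposes them directly (these are standard statsmodels result attributes):
print(relmodel.fvalue, relmodel.f_pvalue)  # F-statistic and its p-value
print(relmodel.rsquared)                   # R-squared
print(relmodel.pvalues)                    # per-coefficient p-values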
inc_perf = sns.regplot(x = "Median_Family_Income", y = "Performance", data = fulldf)
inc_perf
Model issues: the p-values for the intercept and coefficient look okay, and since we aren't doing MLR we don't have to worry about multiple p-values, VIFs, or multicollinearity (even though we get that warning).
The data, when plotted, has a relatively low R-squared, which suggests it needs a transformation to make the relationship linear.
We would rather not remove outliers because the dataset is already so small (only 26 points), so we think it is in our best interest to transform the response. A log transform of the response looks most appropriate, based on the exponential shape of the points and the performance of the previous model.
nemodel = ols(formula='np.log(Performance) ~ Median_Family_Income', data=fulldf).fit()
neanova = anova_lm(nemodel)
print(nemodel.summary())
print(neanova)
fulldf['logperf']= np.log(fulldf['Performance'])
inc_perf_log = sns.regplot(x = "Median_Family_Income", y = "logperf", data = fulldf)
Final model results: the R-squared and F value increased and the coefficient p-value decreased, so this is definitely a better model! We can reject the null on both our F test and our p-value test.
So there seems to be a strong correlation between income and PARCC performance for 2017.
The final equation we have for it is log(Performance) = 3.303e-05 $\times$ Median_Family_Income + 6.9427
There is definitely a positive trend (exponential in the raw performance, linear after the log transform).
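To make the model concrete, here's a quick sketch of using it for a point prediction. The $90,000 income is just an illustrative value, and predict() returns log(Performance) because that's what the formula models:
example = pd.DataFrame({'Median_Family_Income': [90000]})  # hypothetical income value
predicted_log_perf = nemodel.predict(example)              # predicted log(Performance)
predicted_perf = np.exp(predicted_log_perf)                # back-transform to the Performance scale
print(predicted_perf)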
Why It's Important
Through our analysis we have shown that there is a definite positive relationship between Maryland students' household incomes and their PARCC scores. That means that if you come from a lower-income area, you tend to do worse on standardized testing, and that's a problem.
According to The Baltimore Sun, the reason it is a problem is that these scores are taken into consideration by other programs and companies, are used to decide which students get extra attention, and take away instructional time from the students. These scores are also used in the calculation of the star ratings, which affect not only the schools themselves but also the surrounding house prices, driving property values down in areas with low star scores.
According to The Herald, the PARCC test is much better than its predecessor, the ISAT (in a study done in Illinois), but regardless there is still a trend, even if the slope of that trend is smaller.
We believe that the best thing to do would be to weight the scores of these exams less in star ratings or property valuations, and instead use them as indicators of which schools need more help and funding, rather than as indicators of excellent schools. We should be giving simple standardized exams that ask only the common core material, accessible to people of all financial backgrounds. The data shows otherwise, however: in 2018, [PARCC scores continued to drop](https://wtop.com/maryland/2018/08/maryland-parcc-scores-students-still-struggle-to-pass-english-math-tests-but-some-scores-tick-up/), and with them, star scores, seriously affecting the lives of students from low-income households.
How We Could Have Improved the Model
We could have improved the model by processing more data points from more years, but that would require cleaning a lot more census data and feeding it into the model, which we didn't have the space or time for. Also, if we plot it over a period of time, more things affect the model, like general inflation and other time-dependent influences that we wanted to avoid.
We could have added the 2018 PARCC data, but we couldn't find a clean version for some reason.
Another thing we could have done would be to add other variables to analyze, but the goal of our project was simply to demonstrate the positive linear relationship between the two quantities and analyze the distributions and geographic placement, not to predict a student's test score based on their individual attributes.
Some research that backs our claim up, but for the entire U.S.:
ACT and The Problem of Underserved Learners (Source: WaPo) <-- This is a really good one.