The hardest part of this project was getting the data into a format I could work with. I didn’t care if it came in a database, an XML file or a spreadsheet – from any of those formats, I can do some magic with. Thus, I had no idea what format my FOIA request for teacher contracts would arrive in. Turns out, they are all booklets; 5 for CFT and 4 for CESP from 2002-2012. FYI – for those interested, my FOIA request was very simple (provide all teacher contracts with the CFT and CESP since January 1st of 2003), and fortunately, FOIA officer Tom Lockman was able to turn around my request in just a couple short business days.
The “collective bargaining agreement” booklets are interesting in and of themselves; without going too far on a tangent, I found it quite interesting that from 2007 to 2009, the number of pages jumped from 40 to 70. But back to the task at hand. The salary schedule is listed in the Appendix of each agreement, so my first job was to scan each page that had a salary schedule – across the 5 books, I had 10 total half pages. But this is where the trouble really starts.
First of all, the format of the salary schedule is different in each book. Some years the net and gross salaries are split out, other years it only lists the gross. For those years in which the net, gross and board paid retirement is listed, the salary schedule spans more than one page. My next task was to take each scanned image (a JPEG file) and extract all the rows and columns of data. I have used “Free OCR” in the past, but this time I elected to upload my images to a website that did the same thing. OCR looks at your image and pulls out the text in a format that is as close to the original as possible. Better yet, the online version allowed me to save the output as a spreadsheet. The good news is that the error rate was relatively low. The bad news is that there were errors, which I had to correct by hand. It’s just tedious work, pouring over the output tables to make sure what I see on the screen matches what I see in the book.
So with all the data now in spreadsheets, I had to clean it up (reducing the output to show only gross salary, the only consistent metric across all salary schedules) and standardize the format so that each salary schedule in the spreadsheet looked exactly the same (even though the books are ALL OVER THE PLACE). I compiled all the cleaned versions into one spreadsheet, and have made that available for everyone else to fact-check me and to play around with on their own – no reason others cannot benefit from my hard work. 🙂 That all took about 12 hours, give or take.
The next step was to aggregate the information in a meaningful way as to visualize it. That was one of my goals from the beginning, to be able to tell a story with the data in a graphical manner. So I created a new worksheet in the spreadsheet that tallies what I mentally called “the four corners” of the salary schedule; from left to right, top to bottom, I have Step 1 Lane BA in the upper left (B5), Step 1 Lane PhD in the upper right (I25), Step 21 Lane BA in the lower left (B25), and Step 21 Lane PhD in the lower right (I25). My first attempt to generate a chart was an abysmal failure – not because of how it looked, but rather that I made some horrible decisions about comparing data. I went back to the drawing board and decided to compare each of the BA, MA and PhD lanes, but this time, Step by Step (literally, each year). In this way, I can show how a person who starts at Year 1 as a teacher with a BA (Step 1 Lane BA) compares in Year 2 (Step 2 Lane BA) against CPI, then again in Year 3 (Step 3 Lane BA), so on so forth all the way to Year 15. That took about another hour or two.
I also generated a chart which shows the “long term” effects of staying in the BA lane; due to the way the salary schedule is structured, the salaries experience 0 step increases after a certain number of years if you are in the BA lane. The graph shows this well.
And finally I did a “heatmap” to show how the salary schedule as a grid (Step vs Lane) changes from year to year. The “hot” colors are relative high positive changes, the “cool” colors are relative high negative changes. Some years when there is very little change overall, it looks like there are lots of hot and cool spots, but that is only relative to the total change for that year. In 2011, the entire map is hot orange – 0 change across the board.