This summer I’ll be working on implementing survey methods in StatsModels thanks to Google Summer of Code (GSOC). I’m excited because I took my first official programming course last semester (C++ seems to stand strong against the test of time) and now I get to learn more about open source software development.
I still remember the first programming course I signed up for during my freshman year at Rice. I dropped it after the first week. It seemed so foreign, and because it was entry level, I was surrounded by people who had already taken AP computer science or used online resources to gain experience beforehand. I got scared, and due to the large class size, I don’t think anyone had the time or energy to even notice I dropped out, let alone try to talk me through it. Now, I get to contribute to StatsModels and to scientists everywhere who need to use survey methods for their data.
I can’t lie, I’m not as prepared as I should be… but that’s only in terms of technical experience. I can learn how to write unit tests, read up on survey methods along the way, etc. I have a great mentor, Kerby Shedden, and I’m prepared to work my butt off. My next update will probably come around the second week of June; this month will be spent studying for my qualifying exam. :‘(
GSOC has about a week left until it’s time to submit all of my work. It will consist of a single PR containing many features that researchers can use when analyzing survey data, and I hope it paves the way for many more survey features in StatsModels. The main .py files include:
- Summary statistics for survey data
  - Mean, total, ratio, and quantile
  - Each statistic’s standard error can be derived from the jackknife, the bootstrap, or the linearized/robust method, following the methodology in the STATA documentation
  - For the bootstrap and jackknife, you can promote privacy by retrieving the replicate weights as a NumPy array. This way, you can attach them to the data, delete the stratum and PSU info, and send it to others to analyze
- Models for survey data
  - Uses StatsModels to create models that take into account design information from a survey, such as strata, clusters, and weights
  - Coefficients get linearized standard errors using methodology specified by STATA and SAS; the user can choose based on their preference for the assumptions and corrections made
  - For jackknife standard errors, STATA methodology is used
- Contingency tables for survey data
  - Essentially calculates a total-like statistic for your data, grouped by one or two provided variables
  - Tests of independence for complex survey data are calculated using methodology from the STATA documentation
All results are currently tested against STATA. This isn’t just porting everything from STATA to Python, though that in and of itself would be a worthy cause: STATA doesn’t calculate bootstrap weights (you can only provide them), and it doesn’t provide standard errors for quantiles. In the future, I am sure that features not supported by STATA will continue to be added (maybe from SAS or R).
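To give a feel for the summary statistics above: the point estimates themselves reduce to weighted sums, much like the snippets later in this blog. Here is a minimal sketch with made-up numbers (the actual API in the PR will look different; only the arithmetic is shown):

```python
import numpy as np

# Toy survey data: observed values and sampling weights
# (each weight is the inverse probability of selection).
data = np.array([4.0, 2.0, 7.0, 5.0])
weights = np.array([10.0, 20.0, 15.0, 5.0])

# Estimated population total: sum of weighted observations.
total = np.dot(weights, data)          # 210.0

# Estimated population mean: weighted total over the sum of weights.
mean = np.dot(weights, data) / np.sum(weights)   # 4.2

# Ratio of two survey variables: weighted total of the numerator
# over the weighted total of the denominator.
denom = np.array([2.0, 1.0, 3.0, 2.0])
ratio = np.dot(weights, data) / np.dot(weights, denom)
```

The standard errors are where the real work is, since they must respect the strata and clusters of the design.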
I learned so much this summer. I came in knowing only R and a bit of C++, so learning Python (NumPy in particular) was fun and really useful. I also learned about unit testing, the balance between code that is readable by me versus readable by others, proper documentation, practical examples of the speed-versus-storage tradeoff, GitHub, etc. I now have a strong respect for software engineers who build large products and have to take into account many edge cases. I would love to do this again, or write code for a company during an internship next year.
This past week was a bit rough, as I also flew to Seattle to attend UW’s summer institute in statistics and modeling in infectious diseases (SISMID), which had sessions from 8:30 to 5:30. I was encouraged to attend by my research adviser, and it was paid for, so it was hard to say no. But it was much harder to go and try to get work done for GSOC as well. I definitely had to skip the last day to get anything done. Anyway, I’ve been implementing SurveyModel, a more abstract version of what I thought SurveyGLM would be. This is because the entire purpose of SurveyGLM is to repeatedly fit a GLM with different weights; that approach should work with any model that has weights, not just GLM. I also need to add a ‘linearized variance’ option when reporting the estimates for SurveyModel, which will be done this upcoming week!
Testing, testing, and more testing. This is what I’ll be doing this week before moving on to SurveyGLM, i.e., letting users fit regression models to their survey data. Before that, I need to add more interface with survey replicate weights. This is for when you want the data further privatized and thus don’t want to supply strata and cluster information. You do the jackknife (mentioned previously) or the bootstrap to get replicate weights, then attach those to the data before you pass it on to the analyst. At the moment, I have a few bugs that I can’t track down. Kind of annoying, but it’s what you’d expect. Wish me luck!
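For the curious, here is roughly how bootstrap replicate weights can be built. One common construction is the Rao–Wu rescaling bootstrap, which resamples PSUs within each stratum; I’m not claiming this exact variant is what ends up in the PR, it’s just an illustration with toy numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy design: stratum and PSU (cluster) labels per observation,
# plus the original sampling weights.
strata = np.array([0, 0, 0, 0, 1, 1, 1, 1])
psu = np.array([0, 0, 1, 1, 0, 0, 1, 1])
weights = np.array([3.0, 3.0, 2.0, 2.0, 4.0, 4.0, 5.0, 5.0])

n_reps = 100
rep_weights = np.empty((n_reps, len(weights)))

for r in range(n_reps):
    w = np.zeros_like(weights)
    for h in np.unique(strata):
        in_h = strata == h
        clusters = np.unique(psu[in_h])
        n_h = len(clusters)
        # Rao-Wu rescaling bootstrap: draw n_h - 1 PSUs with
        # replacement within each stratum, then rescale each
        # cluster's weights by how often it was drawn.
        draws = rng.choice(clusters, size=n_h - 1, replace=True)
        for c in clusters:
            m = int((draws == c).sum())
            w[in_h & (psu == c)] = weights[in_h & (psu == c)] * m * n_h / (n_h - 1)
    rep_weights[r] = w

# rep_weights can now travel with the data, and the stratum/PSU
# labels can be dropped before sharing, preserving privacy.
```

Each row of `rep_weights` plays the role of one resampled survey, so the analyst can compute the statistic once per row and take the spread as the standard error without ever seeing the design variables.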
This week was a busy one. I implemented most of the survey summary statistics and the jackknife to estimate their standard errors. This was my first time learning about the jackknife, and I thought it was an interesting topic to tell you all about. Survey data tends to have strata, with PSUs within the strata that make up subgroups. Now, suppose we want to calculate the mean of our survey data. That entails something along the lines of
```python
np.dot(weights, data) / np.sum(weights)
```
where the weights are the inverse probabilities of each observation being selected. Easy enough. Now, if we want to estimate its SE (standard error) via the jackknife, we do something along the lines of
```
for each stratum:
    for each cluster within that stratum:
        delete that cluster
        re-weight the remaining clusters in the stratum
        np.dot(new_weights, data) / np.sum(new_weights)
center the collection of 'minus one cluster' statistics
do a bit of subtraction, summing, squaring, etc.
```
As you can imagine, this is pretty computationally heavy, but it lets us estimate the variability of our estimator and gives us confidence in whether to use it. I’m used to the bootstrap, which is even more computationally intensive but more popular in research and applications. Thankfully, my advisor knows many NumPy tricks to speed up the computation. I’ll be doing that next week, along with fixing some design issues: I had most things in one class, but it makes sense to break them into separate classes that pass information between one another.
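The jackknife steps above can be sketched as runnable NumPy. This is a minimal delete-one-cluster jackknife for the survey mean on toy data; the within-stratum centering and the (n_h - 1)/n_h factor follow one common convention, and real implementations differ in finite-population corrections:

```python
import numpy as np

def survey_mean(w, y):
    # Weighted mean: weighted total over the sum of weights.
    return np.dot(w, y) / np.sum(w)

# Toy design: stratum and cluster (PSU) labels plus weights.
strata = np.array([0, 0, 0, 0, 1, 1, 1, 1])
psu = np.array([0, 0, 1, 1, 0, 0, 1, 1])
weights = np.array([3.0, 3.0, 2.0, 2.0, 4.0, 4.0, 5.0, 5.0])
y = np.array([4.0, 2.0, 7.0, 5.0, 1.0, 3.0, 6.0, 2.0])

variance = 0.0
for h in np.unique(strata):
    in_h = strata == h
    clusters = np.unique(psu[in_h])
    n_h = len(clusters)
    replicates = []
    for c in clusters:
        # Delete cluster c and re-weight the remaining clusters
        # in this stratum by n_h / (n_h - 1).
        w = weights.copy()
        w[in_h & (psu == c)] = 0.0
        w[in_h & (psu != c)] *= n_h / (n_h - 1)
        replicates.append(survey_mean(w, y))
    replicates = np.asarray(replicates)
    # Center the 'minus one cluster' statistics within the stratum
    # and accumulate the squared deviations.
    variance += (n_h - 1) / n_h * np.sum((replicates - replicates.mean()) ** 2)

se = np.sqrt(variance)
```

The double loop makes the cost obvious: one full re-estimate per cluster, which is exactly why the NumPy vectorization tricks mentioned above matter.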
Another blog post about my journey with GSOC. This week I’m implementing some summary statistics for survey data. Survey methodology quite resembles ordinary statistical methodology, but there are some pervasive subtleties due to the importance of the study design when computing standard errors, tests, and confidence intervals in a defensible manner. Learning what others have done to make robust analyses with their survey data is quite fun. The hard part is, of course, the implementation. I haven’t used Python in quite some time and have very limited experience with it, so things are starting off kind of rough. But I hope the next couple of weeks will go more smoothly.
I can’t lie, the main difficulty is time management. I thought I could work on this project while reading papers and preparing for research this summer, and that’s not the case at all. If your mentor thinks it’ll take “x” hours to do something, you can safely add 5-10 hours to their guess. Plus, I had to recognize that I’ll get more done working 8 hours each day than 6 hours one day and 10 hours another. You don’t want to stop too quickly or you won’t get into “the run of things,” but overworking yourself will lead to sloppy mistakes. Anyway, by my next post I’ll probably be working on some regression methods! Will keep you updated.