Sunday, July 10, 2016

Genius is a Minority, or The Too-Small, Non-Representative Sample (Part III)

This is part 3 of our examination of Simkovic & McIntyre's Economic Value, perhaps the greatest law review article ever saved to a hard drive, the legal scholarship equivalent of James Joyce's Ulysses, becoming more and more intellectually rewarding upon further examination.

Repeatedly, Simkovic & McIntyre have defended the use of SIPP data, standing behind their methodology as all steadfast geniuses do.  To their credit, using SIPP is the sort of move that geniuses make when everyone else is still adding up ABA 509 reports like chumps.

However, using SIPP data to build a sample of the "law degree" population has two major problems:  first, the way Simkovic & McIntyre have used SIPP makes their sample a non-random, non-representative sample of the law graduate population as a whole; and second, SIPP itself is way, way too small to generate conclusions for sub-samples based on later graduation years (as would be required to put forth any viable conclusions about changes over time in the fortunes of law graduates, which Simkovic & McIntyre purport to do).

Let's start by discussing how SIPP works.  SIPP is a (usually) four-year study designed to review income and government program participation.  It surveys somewhere between 14 and 52 thousand households in a series of "waves," with each household asked to recall the preceding four months.  The panels used here began in 1996, 2001, 2004, and 2008.

The questions about "degree" and "law" are asked early (Wave 2), so Simkovic & McIntyre built their sample using people who had a "law degree" on or before roughly 1996, 2001, 2004, and 2008, respectively (although it does appear some individuals reported a law degree in 2009).

Keep in mind that each panel of SIPP draws its sample from a nationwide population, and each sample is considered individually representative and randomly drawn for the population at the time.  Immediately, one should note that they are not (nor could they be) representative in the aggregate.  If you pool independently drawn samples that were taken at different times, and the population changed between the first draw and the last, you no longer have a representative sample.

As a more practical example, consider a high school where, for four years, you randomly select students from the entire student body.  Your final sample is not representative of the high school population at any given time because you're going to have oversampled from the freshman class when the survey began, and relatively undersampled from the first senior class and the last freshman class.

Similarly, consider a person who graduated law school in 1994.  They might have been chosen as one who had a "law degree" in any of the four SIPP panels used by Simkovic & McIntyre.  Meanwhile, a 2007 graduate could be selected in only one of the SIPP panels: 2008's.  So, assuming a constant flow of graduates and similarly sized panels, and without any sort of normalization or procedures to control for the year one received a law degree, Simkovic & McIntyre's sample of "law degree" holders is going to include approximately two times as many 1993 graduates (eligible in all four panels) as 2003 graduates (eligible in only the 2004 and 2008 panels).
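For the skeptical, the panel-counting argument above can be sketched in a few lines of Python.  This is only an illustration under loud assumptions that are mine, not Simkovic & McIntyre's: every panel is treated as the same size, the flow of graduates is constant, and a graduate is "eligible" for a panel if the degree was earned in or before the panel's start year.  The function names are made up for this example.

```python
# Panel start years named in the post.
PANEL_YEARS = [1996, 2001, 2004, 2008]


def eligible_panels(grad_year, panel_years=PANEL_YEARS):
    """Panels in which a graduate of grad_year could already report a law degree.

    Assumption: eligibility simply means grad_year <= panel start year.
    """
    return [panel for panel in panel_years if grad_year <= panel]


def relative_weight(year_a, year_b):
    """How many times more chances year_a graduates get than year_b graduates.

    Assumption: equal-sized panels and equal-sized graduating classes.
    Raises ZeroDivisionError if year_b is eligible for no panel at all.
    """
    return len(eligible_panels(year_a)) / len(eligible_panels(year_b))


for year in (1993, 1994, 2003, 2007):
    print(year, eligible_panels(year))

print("1993 vs 2003:", relative_weight(1993, 2003))
```

Under those assumptions a 1993 or 1994 graduate gets four chances to land in the pooled sample, a 2003 graduate gets two, and a 2007 graduate gets one, which is where the "two times as many" figure comes from.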

If you don't believe me, Simkovic has a chart apparently used in PowerPoint presentations that shows a bell curve over time.  In reality, however, the number of law degrees generally increased year over year during the entire duration of the sample time.  That Simkovic & McIntyre's sample looks nothing like what a random sample of lawyers as of 2008 would look like should be enough by itself to invalidate the survey entirely.

The low sample size for more recent classes causes serious problems.  Consider the 2007 "law degree" graduates.  How many are included in Simkovic's sample?  About 10.  How many existed in real life?  About 43,920.

Would you ever draw a conclusion about the economic fortunes of 43,920 people based on 10?

Thankfully, statistics has an answer to this issue that can provide some guidance.  With a sample size of only 10 people in a population of 43,920, the margin of error is a whopping plus or minus 31 percentage points (!) at the 95% confidence level.  Without belaboring the meaning of confidence intervals, that's a massive amount of uncertainty that no researcher on the planet would present as viable proof of anything.
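For anyone who wants to check the 31-point figure, here is a minimal Python sketch of the standard margin-of-error formula for a proportion, with a finite population correction.  The assumptions are mine: the worst case p = 0.5, z = 1.96 for 95% confidence, and a normal approximation (which is itself generous with only 10 observations).  The function name is hypothetical.

```python
import math


def margin_of_error(n, population, z=1.96, p=0.5):
    """Approximate margin of error for a sample proportion.

    Assumptions: simple random sampling, normal approximation,
    worst-case proportion p = 0.5 unless told otherwise.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    # Finite population correction; nearly 1 when n is tiny relative to population.
    fpc = math.sqrt((population - n) / (population - 1))
    return z * standard_error * fpc


moe = margin_of_error(n=10, population=43_920)
print(f"Margin of error: +/- {moe * 100:.1f} percentage points")
```

With 10 graduates out of 43,920, this comes out to roughly 31 percentage points either way, and the finite population correction barely moves the needle because the sample is such a tiny fraction of the class.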

To draw any sort of reasonable conclusions about the 2007 graduating class as a separate sub-population (i.e., to compare it with other sub-populations), one would need to randomly sample around 380 graduates (95% confidence level and 5% acceptable error).  Even at "low" research standards (90% confidence level and 10% acceptable error), you would have to survey between 60 and 70 graduates from that graduating year cohort.
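The sample sizes above can be roughly reproduced with the usual textbook formula (Cochran's formula plus a finite population correction).  This is a sketch, not necessarily the exact calculator behind the numbers in this post: it assumes worst-case p = 0.5, z = 1.96 for 95% confidence, and z = 1.645 for 90% confidence.  The function name is hypothetical.

```python
import math


def required_sample_size(population, margin, z, p=0.5):
    """Smallest simple random sample that achieves the given margin of error.

    Assumptions: normal approximation, worst-case proportion p = 0.5.
    """
    # Sample size for an effectively infinite population.
    n_infinite = z ** 2 * p * (1 - p) / margin ** 2
    # Adjust downward for a finite population.
    n_finite = n_infinite / (1 + (n_infinite - 1) / population)
    return math.ceil(n_finite)


strict = required_sample_size(43_920, margin=0.05, z=1.96)
lenient = required_sample_size(43_920, margin=0.10, z=1.645)
print("95% confidence, 5% error:", strict)
print("90% confidence, 10% error:", lenient)
```

Either way, the answer is a long way from 10.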

In fact, for most of the years featured in Simkovic & McIntyre's sample, there are simply too few observations to draw any sort of conclusions about the sub-population of graduates for those years taken separately.  While the sample size is large enough to generate usable conclusions for the entire population independent of year graduated (well, if it were randomly drawn), such conclusions would be inescapably meaningless in a practical sense.

This is a classic subsampling problem, made all the worse because Simkovic & McIntyre didn't have any observations at all drawn from 2010 and beyond despite those being the most relevant years for people applying to law school now.

Once we've shown the sample is non-representative of the subject population, and, in any event, too small to draw conclusions about sub-populations when broken down over time, its supposed utility as a tool of prediction for independently defined subgroups is little more than academic bullshit.

At its best, Economic Value only shows that a non-representative group of law graduates did better than a non-representative group of terminal bachelor's holders in the late 20th century.  Would such a conclusion mean anything?  Even be worth publishing?

For most of us, the answer would clearly be "no," and a rather resounding answer at that.  At this point, absent being able to make any bona fide conclusions about sub-populations, with only tepid conclusions even remotely possible, the survey would become an aborted idea of the type that good researchers not gung-ho on publishing for publishing's sake have all the time.

But not Simkovic & McIntyre.  Like the best of our pro se litigants, believing in their convictions without reservation or refined concession of error, they pressed forward.  They doubled down, and published their research anyway, small, non-representative sample size be damned.  They had biases to confirm, after all, and confirm biases they would.  That type of atypical thinking is a hallmark characteristic of both lunacy and genius, and I think we all know that Simkovic & McIntyre clearly fall into the latter camp.

As stated previously, the author completely disclaims any alleged facts in this article, advises the audience that there is a high chance of error due to the author's third-tier education, and encourages the wise reader to forward a link or article that satisfactorily addresses or corrects the above.


  1. Simkovic & McIntyre should withdraw their "article." Immediately.

    Failing that, the publisher should repudiate it. Immediately.

  2. Sampling problems aside, this study is irrelevant because it doesn't focus on the earnings premium (if any) for post-2008 grads. The $1M premium in today's dollars may very well have been true for folks that went to law school 50 years ago, but that means bupkes for whether law school is a wise investment today. I could probably perform a study that shows what a good trade blacksmithing was to get into way back when. Anyone today want to pick up a hammer and an anvil?

  3. This is great discussion, and the reason why there is backlash against the study. Law prides itself on working from first principles, but as we all know there are times where the answer is decided first, then the rationale on how to get to that "answer" is discovered, no matter how you torture the data (or the prior case law).