Tuesday, May 6, 2008

Updated Democratic Superdelegate Predictions: Introducing New Variables and Subtracting Others

If you read Carl Bialik's story about my superdelegate predictions, you'll recall his suggestion that I consider adding one more factor: which candidate the other committed superdelegates in the same state were supporting. This variable was easy enough to create, so I added it, and it turned out to be a very significant predictor. This isn't too surprising, since superdelegates from the same state often face the same considerations when making their decision. In fact, the new variable was so important that it washed out the effects of other predictors I had been using, such as the percentage of the state's population that belongs to a union, the percentage living in urban areas, the percentage with a college degree, and the state's per capita income. Therefore, I am revising the methodology for the predictions to simplify the model and, hopefully, give it greater predictive power at the same time.

I now include just five variables to predict whom superdelegates will endorse: the gender of the superdelegate, the presidential vote in the superdelegate's state or congressional district (for House members), the percentage of the state's superdelegates who are supporting Clinton, whether Clinton or Obama won the state's primary/caucus, and whether the superdelegate made their endorsement before or after Super Tuesday. The last variable accounts for the fact that people who didn't line up behind Clinton early, when she was the front-runner, are far less likely to endorse her now. Indeed, this proves to be a very important predictor in the model (and failing to include it changes the predictions substantially, as I'll show below).
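To make the structure of the five-variable model concrete, here is a minimal sketch of how a predicted probability could be computed for one superdelegate. It assumes a logit link, and every coefficient value, field name, and variable coding below is hypothetical and chosen only for illustration; none of them are the estimates behind the predictions in this post.

```python
import math
from dataclasses import dataclass, asdict


@dataclass
class Superdelegate:
    # Field names and codings are assumptions made for this sketch.
    female: int                         # 1 if the superdelegate is a woman, else 0
    pres_vote: float                    # presidential vote share in the state/district (0-1)
    state_clinton_share: float          # share of the state's committed superdelegates backing Clinton (0-1)
    clinton_won_state: int              # 1 if Clinton won the state's primary/caucus, else 0
    endorsed_before_super_tuesday: int  # 1 if endorsed before Super Tuesday, else 0


# Hypothetical coefficients (NOT estimated from any data).
COEFS = {
    "intercept": -1.0,
    "female": 0.5,
    "pres_vote": -1.0,
    "state_clinton_share": 2.0,
    "clinton_won_state": 0.8,
    "endorsed_before_super_tuesday": 1.5,
}


def prob_clinton(sd, coefs=COEFS):
    """Return the modeled probability that a superdelegate supports Clinton."""
    z = coefs["intercept"] + sum(
        coefs[name] * value for name, value in asdict(sd).items()
    )
    return 1.0 / (1.0 + math.exp(-z))  # logistic link (an assumption)
```

With a positive timing coefficient like the hypothetical one above, two otherwise identical superdelegates get different probabilities depending only on whether they endorsed before Super Tuesday.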

Based on this new model, I have now updated the superdelegate predictions. As always, information on the superdelegates is provided by the Democratic Convention Watch site. In the figure below, I present the distribution of unpledged superdelegates based on their probability of supporting Clinton. Superdelegates who are between 40% and 60% likely to vote for either candidate are labeled "unclear." There are 61 superdelegates in this range. There are 139 unpledged superdelegates who are at least 60% likely to vote for Obama; just 41 unpledged superdelegates are at least 60% likely to vote for Clinton. These predictions suggest that unless something changes dramatically, Obama will be able to cut into and even overtake Clinton's superdelegate lead in the coming weeks and months.
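The grouping used in the figure can be sketched in a few lines. This is only an illustration of the 40%/60% cutoffs described above; the function names are hypothetical, and how a probability of exactly 40% or 60% is treated is an assumption.

```python
from collections import Counter


def label(p_clinton):
    """Bucket a superdelegate by the modeled probability of supporting Clinton."""
    if p_clinton >= 0.60:
        return "Clinton"   # at least 60% likely to support Clinton
    if p_clinton <= 0.40:
        return "Obama"     # at least 60% likely to support Obama
    return "unclear"       # strictly between 40% and 60%


def tally(probabilities):
    """Count how many superdelegates fall into each bucket."""
    return Counter(label(p) for p in probabilities)
```

Applied to the full list of predicted probabilities, a tally like this would produce the three counts reported above.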

The estimates for each unpledged superdelegate are listed here. Note that I am now generating predictions for superdelegates in NY, AR, and IL, which I was not doing previously. Not surprisingly, all unpledged superdelegates in NY and AR are estimated to go for Clinton while all unpledged IL superdelegates are predicted to support Obama.



I do want to return to a point I made above: it matters quite a bit whether the model includes a variable accounting for when a superdelegate made his/her endorsement. This variable records whether a superdelegate endorsed before Super Tuesday or after it (or has not yet endorsed). It is intended to capture the dynamic aspect of the race that led many superdelegates to endorse Clinton before Super Tuesday but caused more to flock to Obama afterward. But what happens if you ignore this factor? The figure below presents predictions from a model that removes the variable accounting for when a superdelegate made his/her decision.

As this figure clearly indicates, the predictions change dramatically when you don't account for the timing of a superdelegate's decision. In this model, 72 unpledged superdelegates are in the "unclear" range, 71 are at least 60% likely to endorse Obama and 98 are at least 60% likely to endorse Clinton. Thus, ignoring the dynamics of the race tends to favor Clinton. However, it is important to note that even in this scenario, Clinton would likely not pick up enough superdelegates to overtake Obama's overall delegate lead.
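For readers who want the comparison in one place, the snippet below simply restates the counts reported for the two models and computes how each group changes when the timing variable is dropped. The dictionary and function names are made up for this sketch.

```python
# Counts of unpledged superdelegates reported in the post for each model.
with_timing = {"Obama": 139, "unclear": 61, "Clinton": 41}
without_timing = {"Obama": 71, "unclear": 72, "Clinton": 98}

# Both models score the same pool of 241 unpledged superdelegates.
assert sum(with_timing.values()) == sum(without_timing.values()) == 241


def shift(before, after):
    """Change in each group's count when moving from one model to the other."""
    return {group: after[group] - before[group] for group in before}


change = shift(with_timing, without_timing)
print(change)  # {'Obama': -68, 'unclear': 11, 'Clinton': 57}
```

Dropping the timing variable moves 68 superdelegates out of the Obama-leaning group, with 57 of that shift landing in the Clinton-leaning group and 11 in the "unclear" range.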


Mike3550 said...

I have thoroughly enjoyed your analysis of the race and your superdelegate predictions. I have one question about the models, since you mention the importance of the variable indicating whether superdelegates made their endorsement before or after Super Tuesday. I wonder what would happen if you added a second dummy variable indicating that the endorsement came after Clinton's win in Ohio. In terms of the dynamics of the race, it seems that Clinton was the early favorite, Obama was the "insurgent" candidate with most of the momentum after Super Tuesday, and Clinton has gained momentum (or at least neutralized Obama's) following the Ohio primary.

Although there might not be enough observations of endorsements after the Ohio primary to come up with stable estimates, it seems like there might be a different calculus now for the remaining superdelegates than those who endorsed between Super Tuesday and Ohio.
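A minimal sketch of how the two timing dummies might be coded follows. The function name is hypothetical, and two choices are assumptions: superdelegates who have not yet endorsed are grouped with the latest period, and an endorsement made on the day of a contest counts as coming before it.

```python
from datetime import date

SUPER_TUESDAY = date(2008, 2, 5)
OHIO_PRIMARY = date(2008, 3, 4)


def timing_dummies(endorsed_on):
    """Return (after_super_tuesday, after_ohio) for one superdelegate.

    endorsed_on is a date, or None if the superdelegate has not yet endorsed.
    """
    after_super_tuesday = int(endorsed_on is None or endorsed_on > SUPER_TUESDAY)
    after_ohio = int(endorsed_on is None or endorsed_on > OHIO_PRIMARY)
    return after_super_tuesday, after_ohio
```

Under this coding, the second dummy separates endorsements made between Super Tuesday and Ohio from everything that came later.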

Thanks again for posting all of this!

Anonymous said...

I haven't seen the formal write-up of your model, so you may well have responded to these two comments already:

1. I assume that you ran a binary choice model on the sample of already committed superdelegates, and then used the resulting parameter estimates to predict the choices of the remaining uncommitted delegates. But did you take into account the real possibility that the remaining uncommitted delegates are different? To address this issue, you would have to run an ancillary "commitment" equation and then use the results of the ancillary equation to compute a "self-selection" correction factor in your binary choice model. As you probably know, this is the standard problem of self-selection in labor economics and other fields.

2. Did you attempt to validate your model by splitting your sample? That is, did you attempt to derive your estimates from half of the superdelegates and then see if the resulting model correctly predicted the choices of the superdelegates in the other half of the sample?
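The split-sample check in point 2 could be sketched as below. This is a generic harness, not the model from the post: `fit` and `predict` are hypothetical callables that the caller supplies, and each record is assumed to be a `(features, label)` pair for an already committed superdelegate.

```python
import random


def split_sample_accuracy(records, fit, predict, seed=0):
    """Fit on a random half of the committed superdelegates, score on the other half.

    records: list of (features, label) pairs.
    fit(train) -> model; predict(model, features) -> predicted label.
    """
    if len(records) < 2:
        raise ValueError("need at least two records to split the sample")
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]
    model = fit(train)
    correct = sum(predict(model, x) == y for x, y in test)
    return correct / len(test)
```

Repeating the split with several seeds would show how much the hold-out accuracy depends on which half happens to be used for fitting.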


Fianchetto, Providence RI

TINAandRON said...

Great write-up. Dang crappy luck for you, though. Right off the bat you blow Heath Shuler. You should have asked me; I could have told you he would vote for the winner of his district, and that would be Hillary. Maybe take him out of the chart.

Brian Schaffner said...

Thanks for the comments everyone.

For those who have been following this, I have actually been generating these models since shortly after Super Tuesday.

You can see the first predictions here:

To answer the question about self-selection, my previous models were Heckman probit selection models. These worked pretty well for a while, but recently the test of rho was not even close to statistically significant, and Stata was having trouble estimating the model at all. This could very well be because I don't have all the variables necessary to correctly specify the selection stage, but whatever the reason, I decided to go back to a simpler model.

As for splitting the sample, I have not done this. However, since I began generating predictions, over 100 superdelegates who were previously undeclared endorsed either Clinton or Obama, and the model did get 70% of those correct.

Here is a link outlining the methodology I was using previously.

Finally, Mike3550's idea about another variable capturing the potential change in dynamics after March 4th is a good one, and I may try it in the next iteration.

You can also see some answers to questions posed by other readers here:

LwPhD said...

70% seems quite good in terms of choosing who will endorse whom. However, I'm curious how accurate your predicted margins have been.

You said you have 70% accuracy on over 100 predictions. So, what was your predicted margin among those 100+ versus the actual margin? If exactly half of the wrong predictions (the 30%) were wrongly called for Clinton and the other half wrongly called for Obama, then your predicted margin would have been exactly correct, even though you got there the wrong way.
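A small worked example shows how individual accuracy and the aggregate margin can come apart. The 20 endorsements below are invented purely for illustration and are not the actual superdelegate data; the function names are likewise hypothetical.

```python
def accuracy(predicted, actual):
    """Share of individual predictions that match the actual endorsement."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def margin(endorsements):
    """Clinton endorsements minus Obama endorsements."""
    return endorsements.count("Clinton") - endorsements.count("Obama")


# Invented example: 10 actual Clinton endorsers followed by 10 actual Obama endorsers.
actual = ["Clinton"] * 10 + ["Obama"] * 10

# Three Clinton endorsers are wrongly called for Obama, and three Obama
# endorsers are wrongly called for Clinton, so the errors cancel in aggregate.
predicted = ["Obama"] * 3 + ["Clinton"] * 7 + ["Clinton"] * 3 + ["Obama"] * 7

print(accuracy(predicted, actual))        # 0.7
print(margin(predicted), margin(actual))  # 0 0
```

Here only 70% of the individual calls are right, yet the predicted margin matches the actual margin exactly because the six misses split evenly between the two candidates.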

Anonymous said...

It makes sense that you can't easily find a plausible exclusion in the binary candidate choice model so as to identify the parameters of the commitment (selection) equation. That's probably why the Heckman selection model has problems converging.

However, you might consider an ordered probit model (oprobit in Stata), where Obama = +1, uncommitted = 0, and Clinton = -1. (As you know, these ordinal values are arbitrary.) This would be an alternative way of using all the information you have, including the fact that some delegates are uncommitted. In fact, you could run the oprobit model on the data post Super-Tuesday and compare the results with an oprobit model run on the most recent data. The comparison would show how the range of the underlying latent variable has narrowed. You would also have an interesting metric of how "close" an uncommitted delegate is to either candidate.
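To illustrate what the ordered probit would produce for a single delegate, the sketch below turns a latent index and two cutpoints into the three category probabilities. The function names and the numeric values in the usage note are hypothetical; in practice the index and cutpoints would come from the fitted model.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def oprobit_probs(xb, cut_low, cut_high):
    """Category probabilities for Clinton (-1), uncommitted (0), Obama (+1).

    xb is the delegate's latent index; cut_low < cut_high are the cutpoints.
    """
    if not cut_low < cut_high:
        raise ValueError("cut_low must be smaller than cut_high")
    p_clinton = norm_cdf(cut_low - xb)
    p_obama = 1.0 - norm_cdf(cut_high - xb)
    p_uncommitted = norm_cdf(cut_high - xb) - norm_cdf(cut_low - xb)
    return {"Clinton": p_clinton, "uncommitted": p_uncommitted, "Obama": p_obama}
```

For example, `oprobit_probs(0.0, -1.0, 1.0)` gives equal Clinton and Obama probabilities, and raising `xb` shifts probability toward Obama. The distance from `xb` to each cutpoint is one way to read off how "close" an uncommitted delegate is to either candidate.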

fianchetto, Providence RI