Model Methodology

I have been modeling political data here at Cabpolitical.com since the 2012 elections. The guiding rationale for my Models has been two-fold: a) analysis that simply averages all polls equally drastically oversimplifies the data; b) Models that rely on too many layers of statistics often overcomplicate the data, creating more error than accuracy. My Predictive Models (which apply to general elections) combine multiple measures of central tendency with my own historic formulas, which I generate for each individual state. My Snapshot Models are used only during the primaries. They use only the measures of central tendency, dropping the historic prong because of the unique nature of primaries. Primary Snapshot Models thus try to show where the race currently stands, rather than attempting to predict an outcome the way a general election Model does.

Both Snapshot and Predictive Models use the following formula structure:

  1. Inclusion of relevant polls. All available scientific polls are included, but some polls (partisan polls, etc.) are re-weighted based on historical accuracy. Polls also drop off when too much time has passed. The cutoff varies with the type of race and the amount of available data. Only in rare cases (states with limited data) do I include polls that are more than a month old. I will list the polls included in a given Model at the bottom of each Model Update post.
  2. With these relevant polls included (and potentially re-weighted), I then calculate a few “central measure” prongs. They are not all truly central measurements per se, but I use the term in a conceptual sense: each prong tries to find an accurate range while minimizing the effect of apparent outliers. These prongs include:
    1. A standard mean score, +;
    2. A standard median score, +;
    3. What I call a ‘high median’ score. This takes the highest polling number for an individual candidate from a reputable poll that is still close to the range of the other central measure scores. Essentially, this rids the data of high-end and low-end outliers, and it reduces the influence of polls that fail to push “undecideds” who are really more like ‘leans,’ +;
    4. A mid-point score that centralizes the range of available data, +;
    5. An Internet or automated central measure data point. Because polling methods are changing, this cycle I’m including an additional measure to capture the differences between those methods. This worked well in my Louisiana 2015 Governor Model, so I’ll be including it in some fashion, state by state, +;
    6. = A composite score that weights all of the above factors, producing a Snapshot Model. Keep in mind that while this is the overall formula, there is variability within each prong, depending on the polls included, the number of polls included, the state, each pollster’s accuracy, polling methods, and the skew of the data, +;
    7. Predictive Models: In addition to the central measure composite score, Predictive Models then factor in historic trends. This includes historic polling error in a given state, coupled with the state’s historic voting patterns. Traditionally, this score is applied to those voters who have yet to be assigned, that is, the voters remaining after the composite score is calculated. This has worked well, though I’ve noticed over the last few cycles that a few states break harder than others (see Kentucky 2014/2015 and Kansas 2014). In the 2012 cycle, I also utilized a Model formula that used the same general foundation but weighted the historic prong more heavily. I’ll continue to analyze both formulas throughout the 2016 cycle and adjust accordingly.
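
To make the structure above more concrete, here is a minimal Python sketch of the steps. It is an illustration only, not the actual Model formula: the poll numbers, the 30-day cutoff, the 3-point tolerance for the ‘high median,’ the equal prong weights, and the function names (`recent_polls`, `central_prongs`, `composite`, `predictive`) are all assumptions made for the example.

```python
from datetime import date
from statistics import mean, median

# Hypothetical polls for one candidate: (end date, share %, weight, automated/Internet?)
POLLS = [
    (date(2016, 10, 20), 47.0, 1.0, False),
    (date(2016, 10, 18), 49.0, 1.0, True),
    (date(2016, 10, 15), 46.0, 0.5, False),  # down-weighted, e.g. a partisan poll
    (date(2016, 10, 10), 48.0, 1.0, True),
    (date(2016, 8, 1), 41.0, 1.0, False),    # older than the cutoff, so it drops off
]

def recent_polls(polls, as_of, max_age_days=30):
    """Step 1: keep only polls that ended within max_age_days of as_of."""
    return [p for p in polls if 0 <= (as_of - p[0]).days <= max_age_days]

def central_prongs(polls, tolerance=3.0):
    """Step 2: compute the five prongs (every definition here is an assumption)."""
    shares = [share for _, share, _, _ in polls]
    total_weight = sum(w for _, _, w, _ in polls)
    weighted_mean = sum(share * w for _, share, w, _ in polls) / total_weight
    med = median(shares)
    # 'High median': highest share that stays within `tolerance` points of the median.
    high_median = max(s for s in shares if abs(s - med) <= tolerance)
    midpoint = (min(shares) + max(shares)) / 2
    automated = [share for _, share, _, auto in polls if auto]
    # Fall back to the median when no automated or Internet polls are available.
    automated_score = mean(automated) if automated else med
    return {
        "mean": weighted_mean,
        "median": med,
        "high_median": high_median,
        "midpoint": midpoint,
        "automated": automated_score,
    }

def composite(prongs, prong_weights=None):
    """Weighted average of the prongs; equal weights unless others are supplied."""
    prong_weights = prong_weights or {name: 1.0 for name in prongs}
    total = sum(prong_weights[name] for name in prongs)
    return sum(prongs[name] * prong_weights[name] for name in prongs) / total

def predictive(snapshot, unassigned, historic_share):
    """Predictive step: give the candidate a historic share of unassigned voters."""
    return snapshot + unassigned * historic_share

kept = recent_polls(POLLS, as_of=date(2016, 10, 25))
prongs = central_prongs(kept)
snapshot = composite(prongs)
# Assume 6 points of voters remain unassigned and history suggests 55% break this way.
projection = predictive(snapshot, unassigned=6.0, historic_share=0.55)
print(f"Snapshot: {snapshot:.1f}%  Predictive: {projection:.1f}%")
```

In practice the prong weights, the cutoff, and the historic share would differ by state and by race, which is where the variability described above comes in.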

This should give you a good sense of how the Models work, as well as the data that goes into them. Each cycle, I re-evaluate the formulas to account for potential variance. I’ll update you whenever I make changes to the overall formula.

Keep in mind, these scores are just scores. All social science survey data carries a margin of error (MOE), reliability issues, etc. So even though the Model attempts to reduce the noise from many of these factors, the difference between 48.7% and 48.2% is likely not that meaningful, statistically speaking. That is why the focus of this website is not just Modeling, but analysis based on that Modeling. I always try to write up as much as possible to give you greater context for what a given Model might mean.
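
As a rough illustration of why a half-point gap is hard to read, the sketch below computes the textbook 95% margin of error for a single poll. It assumes a simple random sample of 800 respondents, which is a hypothetical figure; real polls and the composite Model scores will have different (and harder to quantify) error.

```python
import math

def margin_of_error(share_pct, sample_size, z=1.96):
    """Textbook MOE, in percentage points, for one proportion from a simple random sample."""
    p = share_pct / 100
    return z * math.sqrt(p * (1 - p) / sample_size) * 100

# Hypothetical sample size of 800; z=1.96 corresponds to a 95% confidence level.
moe = margin_of_error(48.7, sample_size=800)
gap = 48.7 - 48.2
print(f"MOE: +/-{moe:.1f} points, gap between candidates: {gap:.1f} points")
```

Under these assumptions the margin of error is several times larger than the gap between the two scores.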