
Math-Based Ratings

Now known as computer ratings, math-based ratings actually predate computers (depending on one's definition of a computer). The Dickinson system started in 1926, Houlgate in 1927, Dunkel in 1929, Boand in 1930, Williamson in 1932, Litkenhous in 1934, and Poling in 1935. All except the Dunkel system are now defunct, and even Dunkel is practically defunct as far as the general public knows. I have covered the Dickinson system separately.

Today there are hundreds of rating formulas out there, probably one for each left-brained football fan, but the most important are the ever-changing systems used by the BCS for its ratings. As of the 2009 season, that would be the following six systems: Massey, Sagarin, Billingsley, Anderson-Hester, Colley, and Wolfe.

The advantage of these systems, of course, is their objectivity and consistency of criteria from year to year. Sometimes a system owner will change his math formula, but when he does, theoretically all of his "national champions" for all years change, so that they are all selected by the same criteria. Because of that, an older copy of the NCAA Records Book may list different "champions" for a system than a newer copy does. But each list should be consistent within its current formula.

The flaws of these systems are many, some more than others. But one flaw they all share lies in their bureaucratic approach to rating college football teams: a bureaucratic system is incapable of recognizing and handling exceptions.

The other flaws are dependent on the system in question, but nearly all rating formulas are based on premises that I, for one, would not agree with in the first place. This can be hard to judge for a system owner who keeps his recipe a closely-guarded secret, but even in that case you can judge the "national champions" his system has selected.

And that is where we see the big fail for computer systems as national championship selectors: their choices are too often clearly ridiculous. As a result, for the most part, no one sees them as constituting national championships.

Math Systems That Are Treated as National Championships

An unfortunate exception is the Dickinson system's 17 "champions" selected for 1924-1940. Some people do see those as legitimate, which is ironic, since it is the most primitive and one of the worst systems of them all.

Also, for some reason, Alabama absurdly recognizes their 1941 Houlgate system selection as a "national championship." Alabama was 9-2 that year, consensus champion Minnesota 8-0. Alabama finished fifth in the SEC that year, and was ranked #20 in the final AP poll (though they would have been higher had the final poll come out after the bowls; I myself would rank them about #10). Worst "national championship" recognized by any school? I would have to think so.

Other schools may claim similar titles, but Alabama's 1941 farce is the most famous case. For modern (post-WWII) years, however, no one recognizes math formulas as national championships. If they did, just using the systems listed in the NCAA Records Book, we would have an additional 42 so-called national champions between 1970 and the present (many of whom lost to the legitimate national champion).

Inability to Handle Exceptions

As I said, the bureaucratic approach of a math formula rating cannot recognize or handle exceptions. In other words, computer ratings do not account for injuries, illness, suspensions, expulsions, or any major losses in team personnel at all. They do not account for the effect of weather on a result. And they do not account for any of the many psychological factors that most of us humans can recognize and account for.

Player Losses

Of course, if a team loses a game because of injuries or other player losses, it shouldn't matter as far as that team's rating goes. A loss is a loss. Obviously that team was lacking in depth, and its rating should decline. But where it does matter is for the opponents that team plays.

As an example, let's say Team A loses to Team B in their finale in overtime. Team B loses ten key starters to mass suspensions for their bowl game, and Team C defeats them there in overtime. For that result alone, any math system will rank Team C higher than Team A, because Team B is judged by a math formula to be the same team regardless of who played. However, this is clearly unfair, as the opponents Team A and Team C faced in this example were, in reality, quite different.

Sometimes the loss of even one player can have a huge impact on how good a team is, as with Oregon in 2007. Through 9 games, Oregon was 8-1 and #2 in the BCS ratings that year, with big wins over final AP top 25 teams Michigan 39-7, Southern Cal 24-17, and Arizona State 35-23. Then they lost QB Dennis Dixon, their offense disappeared, and they lost their last 3 regular season games.

At that point, most systems will see USC and ASU as having lost to an 8-4 team, even though the team they faced was clearly far better than that (Billingsley's system is one exception that I will address next). And Arizona and UCLA, who beat Oregon in their first two games without Dixon, will be given credit for wins far beyond what they should be credited for.

Oregon's last loss, to Oregon State, presents a further complication, because by that point Oregon's offense had recovered, and they scored 31 on Oregon State in a loss and 56 in their bowl win over South Florida. So Oregon State faced an Oregon team that was better than Arizona and UCLA faced, but not quite as good as the one USC and ASU faced. All of this is too complex a set of exceptions for any math formula to properly handle, including Billingsley's.

Billingsley's Approach

Billingsley's approach is different from most in that he only takes strength of schedule into account at the point that games occurred. So in the Oregon 2007 example, his system sees both USC and ASU as losing to a highly-rated Oregon team, and USC and ASU's ratings are unaffected by anything Oregon does thereafter. That is a good thing in this case, at least where USC and ASU are concerned. But this approach has its own problems, both within this Oregon example and in general.

Where Oregon 2007 is concerned, his system still gives Arizona and UCLA enormous undue credit for having beaten a highly ranked Oregon team. It still does not account at all for the major injury underpinning those wins. Furthermore, even though Oregon's offense recovered by the Oregon State game, and Oregon State thus faced a much more powerful Oregon team than Arizona and UCLA did, Billingsley's system gives Oregon State less credit for beating Oregon, as Oregon was much lower-rated by then.

And in a general sense, I think his approach is poor. Let's say a team goes 6-0 against unrated teams, but plays its six strongest opponents to end the season and loses all six games. The first team that beats them gets a big rating boost for it in Billingsley's system. The second gets less. The last gets very little, having beaten a lowly-rated (6-6) team. Even though they all defeated the same team.
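The 6-0 then 0-6 scenario above can be sketched in a few lines of Python. This is a toy model only: it assumes the "credit" for a win is simply the loser's winning percentage, which is not Billingsley's actual formula (his formula is not described here). The function and variable names are made up for illustration.

```python
# Toy model: credit for a win = the beaten team's winning percentage.
# This is an assumption for illustration, NOT Billingsley's real formula.

def win_pct(wins, losses):
    """Winning percentage, or 0.0 if no games have been played."""
    games = wins + losses
    return wins / games if games else 0.0

# Hypothetical team: six wins over unrated teams, then six straight losses.
results = ["W"] * 6 + ["L"] * 6

wins = losses = 0
credit_at_game_time = []  # credit under a point-in-time approach
for result in results:
    if result == "L":
        # The winner is credited with the record the loser held going in.
        credit_at_game_time.append(win_pct(wins, losses))
        losses += 1
    else:
        wins += 1

# A season-long approach credits every winner with the final record.
final_credit = win_pct(wins, losses)

for i, credit in enumerate(credit_at_game_time, start=1):
    print(f"winner {i}: point-in-time {credit:.3f}, end-of-season {final_credit:.3f}")
```

Under these assumptions the first winner is credited with beating a 6-0 team and the last with beating a 6-5 team, while a season-long view credits all six with beating the same 6-6 team.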

In 2009, Alabama opened the season with a 34-24 win over Virginia Tech, who finished 10-3 and very highly rated. But where Alabama is concerned, Billingsley's system ignores everything Virginia Tech did after that game. It had no effect on Alabama's rating. I think we all agree that that is a poor approach, and yet Billingsley is used in the BCS ratings, which have rather a profound effect on the fortunes of so many schools.

In general, it is just a bad idea to ignore so much data in a sport where there isn't enough as it is (due to 120 teams playing 12-game schedules).

Weather

Similar to player losses, if a team loses a game in part due to weather, it is still a loss. But should it be considered equal to a loss unaffected by weather? And should a poor performance in a win played in good weather be judged the same as a poor performance in bad weather? I think not. But math systems cannot account for weather.

For example, let's say that you are trying to rate two teams that are equal in every way, and that each was upset by a losing team. But one of them suffered their upset in a game that was played in a rainstorm or blizzard. You might therefore rightly rank that team higher. But a computer cannot see that. Similarly, a powerful team might win a game in an ice storm, but barely, and see their rating adversely affected by a system that measures performance. That is not fair, and humans can see and account for it.

Psychological Factors

It is easy to dismiss psychological factors as "a bunch of hooey," as my grandmother would say, but statistics will bear them out as real. And they are just a few more things computers cannot account for.

Take a rivalry game. The kind of a game where you can "throw out the records," as Keith Jackson would say. It is, in other words, much tougher than a game against a non-rival of the same record. Just ask 8-0-1 Army, who was tied by 0-8-1 Navy in 1948. Should this result be judged the same as though Army had been tied by 0-8-1 Indiana?

And it isn't just the famous rivalries, like Army-Navy, Ohio State-Michigan, and Auburn-Alabama. Stanford-Cal, South Carolina-Clemson, and Missouri-Kansas are just as heated. And there are a lot of them. Texas has two (Oklahoma and Texas A&M).

Now, just like the above factors, a loss in a rivalry game is still a loss. But for a given year, if you are looking at, for example, a powerful Missouri team that was upset by a mediocre Kansas, and comparing them to a powerful team that was upset by a mediocre non-rival, you might well rank Missouri higher. Because a rivalry game is tougher. But a computer cannot see that.

Teams are also much, much tougher in their last game when their coach has announced his retirement, and often in the last game when their coach has been fired effective at season's end. In recent years we have seen this in Franchione's last game at Texas A&M, Carr's last game at Michigan, and Bowden's last game at Florida State, all big upset wins for the outgoing coach. Computers cannot account for this in judging the opponents those teams defeated.

Humans can also consider factors such as the difficulty of playing well the week before or after a big game, or the effect of a tragedy or public scandal on a team (it can inspire them to play better or depress/distract them into playing worse).

A win is a win and a loss is a loss, but all of these factors can and should affect the degree of impact some wins and losses have on a team's rating. And math systems are incapable in this regard.

Strength of Schedule

Although all math systems attempt to account for strength of schedule, virtually none properly do so. I covered most of this in the Strength of Schedule section of my How to Rate Teams guide.

First of all, let's say that two teams are playing the following schedules, and that all of these rankings are fairly accurate:

Team A: #5, #10, #15, #30, #70, #80, #90, #100, #105, #110
Team B: #30, #40, #45, #50, #55, #60, #65, #70, #72, #73

Which is the tougher schedule? Many math systems will say Team B's is tougher, since its average opponent ranking is 56, and Team A's is 61.5. But which team plays the tougher schedule is entirely dependent on how good Team A and Team B are.

If these teams are top ten teams, then Team A's schedule is vastly tougher than Team B's. It is not even close. If these teams are vying for a national title, and a computer selects Team B based on its average opponent ranking, that is just ridiculous. Team A will have played 3 top 25 teams, and Team B none.
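The arithmetic behind the two schedules can be checked with a short Python sketch. The function names are hypothetical, and averaging opponent rankings is just the simple method described above, not any particular system's formula.

```python
# Opponent rankings for the two example schedules.
team_a = [5, 10, 15, 30, 70, 80, 90, 100, 105, 110]
team_b = [30, 40, 45, 50, 55, 60, 65, 70, 72, 73]

def average_rank(opponents):
    """Mean opponent ranking (a lower number means a 'tougher' average)."""
    return sum(opponents) / len(opponents)

def top_25_count(opponents):
    """Number of opponents ranked in the top 25."""
    return sum(1 for rank in opponents if rank <= 25)

for name, schedule in (("Team A", team_a), ("Team B", team_b)):
    print(name, average_rank(schedule), top_25_count(schedule))
```

The averages favor Team B (56 against 61.5), while the top 25 count favors Team A (3 against 0), which is the disagreement at issue for title contenders.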

For national championship contenders, whether the weak teams on their schedule are #100 or #70 is virtually irrelevant. Yet math systems will see that difference as the same as that between a #10 and #40 opponent.

On the other hand, if these teams are both about #75, then Team B has the far tougher schedule, as all of their opponents are ranked higher, whereas Team A is looking at a 5-5 season. So strength of schedule is very much relative to the power level of the team playing the schedule.
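One rough way to express this relativity is to count how many opponents are ranked above the team playing the schedule. The sketch below is only an illustration under that assumption; the function name and the sample team rankings (#8 and #75) are hypothetical.

```python
# Opponent rankings for the two example schedules.
team_a = [5, 10, 15, 30, 70, 80, 90, 100, 105, 110]
team_b = [30, 40, 45, 50, 55, 60, 65, 70, 72, 73]

def opponents_ranked_above(opponents, team_rank):
    """Count opponents with a better (numerically lower) ranking than the team."""
    return sum(1 for rank in opponents if rank < team_rank)

# Assumed team rankings: a top ten team (#8) and a middling team (#75).
for team_rank in (8, 75):
    a = opponents_ranked_above(team_a, team_rank)
    b = opponents_ranked_above(team_b, team_rank)
    print(f"team ranked #{team_rank}: Team A schedule {a}, Team B schedule {b}")
```

For a #75 team, Team A's schedule has 5 higher-ranked opponents (hence the 5-5 season) and Team B's has 10. For a #8 team, only Team A's schedule contains a higher-ranked opponent at all.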

This principle is the same for a system that judges strength of schedule by win-loss records (though such systems are poorer and prone to other problems as well). If Team A goes unbeaten against teams that finished 9-1, 8-2, 7-3, and seven teams that were 1-9, and Team B defeats teams that finished 6-4, 5-5, and eight teams that finished 4-6, Team B's schedule will be deemed by a math system to be much stronger. But if these teams are national championship contenders, Team A's schedule is far tougher. If they are #60-80 type teams, Team B's schedule is tougher.
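The same comparison can be run on the win-loss example. The sketch below assumes the simplest possible record-based measure, combined opponent winning percentage, which is not taken from any named system; the function names and the choice of 7 wins as a "strong opponent" cutoff are illustrative assumptions.

```python
# Opponent records as (wins, losses) for the two example schedules.
team_a_opps = [(9, 1), (8, 2), (7, 3)] + [(1, 9)] * 7
team_b_opps = [(6, 4), (5, 5)] + [(4, 6)] * 8

def combined_win_pct(opponents):
    """Total opponent wins divided by total opponent games."""
    wins = sum(w for w, _ in opponents)
    games = sum(w + l for w, l in opponents)
    return wins / games

def strong_opponents(opponents, min_wins=7):
    """Count opponents with at least min_wins wins (an assumed cutoff)."""
    return sum(1 for w, _ in opponents if w >= min_wins)

for name, opps in (("Team A", team_a_opps), ("Team B", team_b_opps)):
    print(name, round(combined_win_pct(opps), 3), strong_opponents(opps))
```

Team B's opponents come out well ahead on combined winning percentage (.430 against .310), yet Team A is the only one to have faced any opponent with seven or more wins.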

The Straight Record Fallacy

Systems that simply judge strength of schedule by the straight records of opponents have other problems as well. Typically, there is a big power difference between a 7-3 SEC team and a 7-3 WAC team. But a simple system, particularly the older ones (such as Houlgate), will judge such opponents as the same. And even modern systems that alleviate this problem with a more sophisticated formula can still be affected by it, if to a lesser degree.

This is particularly a problem when systems try to name champions for seasons long ago, when there were fewer intersectional games. A team from the South, for example, could be selected as champion for a season early in the 20th century when neither they nor any of their opponents played a team from outside the South. They just played a lot of regional teams with strong records.

The problem is, major teams from the South fared consistently poorly against other regions at that time. They had a losing record against every other region they played from 1901 to 1920, and a losing record against every region but the West Coast from 1921 to 1929 (3-0-1 against the West).

From 1901 through 1929, major Southern teams were 7-49-3 against the Big 10 region, 17-36-2 against the East, and 3-9-3 against the Missouri Valley (which was far weaker than the Big 10 and East). So when Billingsley's system selects 8-0 Auburn as national champion of 1913, over 7-0 Chicago, 7-0 Notre Dame, 9-0 Harvard, and 8-0 Nebraska, it seems more than a little ridiculous.
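The interregional records just cited can be converted to winning percentages for easier comparison. The sketch below assumes the common convention of counting a tie as half a win; that convention and the names used are assumptions, not something the records themselves specify.

```python
# Major Southern teams' records against other regions, 1901-1929,
# as (wins, losses, ties), taken from the figures cited above.
records = {
    "Big 10 region": (7, 49, 3),
    "East": (17, 36, 2),
    "Missouri Valley": (3, 9, 3),
}

def win_pct(wins, losses, ties):
    """Winning percentage with each tie counted as half a win (assumed convention)."""
    return (wins + 0.5 * ties) / (wins + losses + ties)

for region, record in records.items():
    print(f"{region}: {win_pct(*record):.3f}")
```

Under that convention the South won roughly 14 percent of its games against the Big 10 region, 33 percent against the East, and 30 percent against the Missouri Valley.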

Auburn's all-Southern schedule was strong on its face, with teams that finished 4-3, 6-1-1, 6-1-2, 7-2, 5-3, and 6-2. That is why they won Billingsley's system. But the flaw here is in treating 7-2 Southern teams as equal to 7-2 Big 10 and Eastern region teams, when they very clearly were not equal.

Auburn's opponents played only one game against a power region, and that was 5-3 Vanderbilt losing to Michigan 33-2 (Auburn beat Vandy 14-6). They went 3-0-1 against Southwest teams (irrelevant) and 0-2 against mid-Atlantic team Virginia (relevant only because they lost to a team from a weak region). Virginia was 7-1, but lost their only game against an Eastern team (4-4 Georgetown).

Performance

I covered much of this in the Performance section of my How to Rate Teams guide. Performance is basically score differential. Many, if not most, modern math formulas measure performance, but the BCS does not allow its computer rating systems to take it into account.

I do think math formulas have a lot of trouble properly measuring performance (details to follow), but on the other hand, people voting in polls can (and should) account for performance, so ultimately I don't think it's a good idea to force computer ratings to ignore so much potentially useful data. I have no doubt that Sagarin's original system (his real rating list) is better than the dumbed-down system he uses for the BCS.

But of course, I don't think the BCS should be using computer ratings at all.

Problems With Measuring Performance

Computers cannot account for many exceptions that artificially affect score differential, including things I've already covered, such as player losses, weather, and psychological factors. But unlike humans, math formulas also cannot see how games unfold. For example, if a team is down by 14, and scores a touchdown on a Hail Mary pass on the last play, the final score will look much closer than the actual game was. But the winner was not really threatened.

And on the other end, one team might take a lead with 30 seconds left. The other team, desperate to come back, throws risky passes, and one is returned for a touchdown (a sequence I've seen many times). This final score will not be as close as the actual game was.

Humans can see and account for these things, and computers cannot.

Furthermore, since different teams have different strengths and offensive approaches, it can be difficult to compare them using score differential. Some teams are more ball-control and defense oriented, and others run a no-huddle passing offense even when they are winning big. And some coaches like to run up scores, while others shut it down when they have a big lead. Just because one team wins by an average of 30 and another by an average of 15 doesn't mean that the higher-scoring team is better.

The most important dividing line in performance is the touchdown. A team that wins by 10 has performed better than a team that wins by 7, because the latter team was an unlucky bounce or Hail Mary pass away from a potential loss. When comparing national championship candidates, the main thing to look at is not whether they won by 20 or 30 points a game, but how many times they were threatened. How many close games they had.

All else being equal, a team that has one close game (winning by a touchdown or less) should be rated higher than a team that has 3 close games. Even though the team with 3 close games can have a higher average score differential due to big blowouts in the rest of their games.
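A small Python sketch shows how average margin and close-game count can point in opposite directions. The margins below are invented purely for illustration, and treating "close" as a winning margin of 7 points or fewer is an assumption about where to draw the touchdown line.

```python
# Hypothetical winning margins for two unbeaten teams (made-up data).
team_x = [7, 3, 6, 45, 52, 49, 41, 38, 44, 35, 40, 42]
team_y = [6, 17, 21, 14, 24, 20, 17, 28, 21, 16, 24, 20]

CLOSE_MARGIN = 7  # assumed cutoff: a touchdown or less

def average_margin(margins):
    """Mean winning margin over the season."""
    return sum(margins) / len(margins)

def close_games(margins, cutoff=CLOSE_MARGIN):
    """Number of games decided by the cutoff or fewer points."""
    return sum(1 for m in margins if m <= cutoff)

for name, margins in (("Team X", team_x), ("Team Y", team_y)):
    print(name, average_margin(margins), close_games(margins))
```

Team X has the larger average margin (33.5 against 19.0) but was threatened three times to Team Y's once, so a formula keyed to average score differential would rank the two in the opposite order from the close-game test.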

The Pudding

There is much more to say about the flaws of computer rating systems, and more detail to get into. But this overview is long enough. Let's cut to the chase. Or the pudding, as it were. That's where I've been told the proof is. All you really need to know about computer ratings as national championship selectors is evident in the vast list of silly choices they have made over the years.

2002: Three systems preferred 11-2 Southern Cal to 14-0 Ohio State. Some 60 years from now, I fully expect Southern Cal to officially claim a national championship for 2002.

1998: Sagarin had 11-1 Ohio State over 13-0 Tennessee.

1996: Alderson picked Florida State over Florida, who defeated FSU 52-20 in the national championship game.

1994: 13-0 Nebraska or 12-0 Penn State? Dunkel said 10-1-1 Florida State.

1988 & 1989: Notre Dame beat Miami in '88, but one system selected Miami anyway. Miami won the next year, but a couple of systems took Notre Dame anyway.

1987: Berryman took 11-1 Florida State over 12-0 Miami, who beat them.

1986: Almost all systems tabbed 11-1 Oklahoma #1. 12-0 Penn State beat 11-1 Miami, who beat Oklahoma.

1984: If ever there was titanium-clad proof that Billingsley's computer formula has a problem, it's 1984, when his computer says Brigham Young was the best team in the land. Why bother with a computer at all if it's just programmed to be as dumb as that year's writers?

1980: Three systems would like to give the trophy to 10-2 Oklahoma over 12-0 Georgia.

1976: Five systems like 11-1 Southern Cal over 12-0 Pittsburgh.

1969: Matthews says no to 11-0 Texas. For 11-0 Penn State's sake? No. 10-0-1 Southern Cal? No. For 8-1 Ohio State.

1968: Litkenhous tells us that 8-1-2 Georgia is more worthy than 10-0 Ohio State (as well as 11-0 Penn State).

1961: Poling has 8-0-1 Ohio State over 11-0 Alabama.

1955: Boand pushes 9-1 Michigan State past 11-0 Oklahoma.

1941: Houlgate's aforementioned selection of fifth-place 9-2 SEC team Alabama over 8-0 Minnesota. Not to mention 8-0-1 Notre Dame and 8-0 Duquesne. Or 8-1-1 Mississippi State, the SEC champion who beat Alabama, but lost to Duquesne. And Alabama claims it. They even made "national championship" rings (more than 40 years after the fact). I have seen one.

1940: Williamson stays out of all the debate by picking 10-1 Tennessee over 8-0 Minnesota, 10-0 Stanford, and 11-0 Boston College (who beat Tennessee).

1939: Dickinson stands alone in 8-0-2 Southern Cal's corner, opposite 11-0 Texas A&M. As noted in my separate review of Dickinson, and as you can see in the link above, in recent years USC did some "research" and decided to claim this as a "national championship." Funny stuff.

There are plenty more. But you get the idea.

Conclusion

It may seem as if I dislike math-based ratings, but I don't. It's true that I have no use for the older, simpler systems, but modern ones that measure performance are interesting. I've been reading Sagarin's lists for decades. I just don't see them as legitimate national championship selectors, or even top 25 ratings. They are power ratings.

Power ratings don't care who beat who. They are lists of who the best teams are (or who the formula/selector measures/believes the best teams to be). And the best team doesn't always win. The computers are correct, I believe, that Oklahoma was a better team than Penn State in 1986 (and so was Miami). PSU is the only legitimate #1 team for that season, but that doesn't mean they were the best.

Similarly, there is no doubt in my mind that Miami was better than Ohio State in 2002. But Ohio State is the only legitimate #1 for that year.

Therefore, I think the math rating #1 teams throughout history should be put into a separate list from the human national championship selections in the NCAA Records Book. Math formulas are selecting the best teams (or attempting to, at least), not the national champions, and there is definitely a difference.

For all their limitations, math-based ratings can do a good, objective job of putting together a power rating. And they can be a helpful tool for humans who rate teams. If Sagarin has a 6-4 team at #15, for example, it might be a good idea to at least take a look at that team. Chances are, there's a reason for it.
