The Metric Mirage
The dashboard says green. The building is on fire. How ESG scores improve and emissions rise at the same time — and why both statements are true.
In September 2015, the Dow Jones Sustainability Index named Volkswagen the world’s most sustainable automaker. A score of 91 out of 100. Top marks for compliance, anti-corruption, climate strategy. CEO Martin Winterkorn issued a press release celebrating the achievement.
About a week later, the EPA announced that Volkswagen had installed defeat devices in its diesel cars; VW soon admitted that 11 million vehicles worldwide were affected, some emitting up to 40 times the legally permitted nitrogen oxide levels on the road. Winterkorn resigned by month’s end. Penalties eventually exceeded $30 billion.
VW had been on the Dow Jones Sustainability Index for thirteen years.
The natural reaction is to call this a fraud problem. VW cheated, VW got caught. But the more uncomfortable question is about the system that scored them. The Dow Jones Sustainability Index didn’t make an error. By its own methodology, the rating was accurate. The inputs to VW’s corporate sustainability assessment (the questionnaire, the disclosures, the reported data) all checked out. The score measured what the score was designed to measure.
It just wasn’t designed to measure what was coming out of the tailpipe.
The Agreement Problem
You might assume ESG rating agencies would at least agree with each other. They don’t.
A 2022 study by MIT researchers Berg, Kölbel, and Rigobon analyzed six major ESG rating agencies (KLD, Sustainalytics, Moody’s ESG, S&P Global, Refinitiv, and MSCI) across 924 companies. The average correlation between their ratings was 0.61.
For context: credit ratings from Moody’s and S&P correlate at roughly 0.92. When two credit agencies look at the same company, they almost always reach the same conclusion. When two ESG agencies look at the same company, their verdicts routinely diverge.
The researchers found that 56% of the divergence came from measurement: the agencies looked at the same category and reached different conclusions. Not different categories, not different weightings. Same category. Different answers. Harvard Law School’s forum on corporate governance described ESG ratings as “a compass without direction.”
There are now approximately 140 ESG data providers worldwide. Companies spend an estimated $675,000 per year on climate disclosure alone. The global ESG services market is projected to reach $65 billion by 2027. An enormous industry has been built to measure something that the measurers themselves can’t agree on.
The Score and the Thing the Score Claims to Measure
Bloomberg Businessweek investigated every ESG rating upgrade MSCI awarded to S&P 500 companies between January 2020 and June 2021. They found that roughly half of the upgraded companies hadn’t disclosed their recent greenhouse gas emissions in full; some hadn’t disclosed them at all.
The upgrades weren’t rewards for reducing environmental impact. MSCI doesn’t measure that. It measures the risk that environmental issues pose to a company’s profits. McDonald’s produced 54 million metric tons of CO₂ in 2019 (more than Portugal). MSCI upgraded them anyway, concluding that climate change didn’t threaten the firm’s bottom line.
Meanwhile, the S&P 500 ESG Index included ExxonMobil while excluding Tesla.
A study in Business Strategy and the Environment confirmed the pattern at scale: “high ESG-rated or environment-rated firms do not have lower carbon emissions.” Worse, those firms “are not incentivized to do more for the environment, as they have already been awarded with good publicity.”
The score goes up. The emissions don’t come down. Both are true simultaneously, and neither is lying.
This Isn’t Just ESG
The measurement-outcome gap isn’t unique to sustainability. It shows up anywhere a metric becomes the target instead of a proxy for the target, a dynamic economists know as Goodhart’s law: when a measure becomes a target, it ceases to be a good measure.
In 2012, Medicare launched the Hospital Readmissions Reduction Program, penalizing hospitals for high readmission rates. The logic was straightforward: if patients keep coming back, you’re not treating them well enough. Reduce readmissions, improve care.
Readmissions fell. The metric improved.
A study in JAMA Cardiology examined what happened to the patients. Among 115,245 Medicare beneficiaries across 416 hospitals, 30-day mortality for heart failure patients rose from 7.2% to 8.6%. One-year mortality climbed from 31.3% to 36.3%. Researchers estimated between 5,200 and 10,400 additional deaths per year among heart failure patients alone.
Hospitals had learned to game the metric. They routed returning ER patients into “observation status” instead of formally admitting them. They avoided readmitting patients who needed readmission, because the readmission would hurt their score. The dashboard turned green. The patients got worse.
The same mechanism appeared at the VA, where managers created secret dual-list systems: one official list showing short wait times for Washington, one real list where at least 40 veterans died waiting for care.
To be clear: the problem here isn’t Medicare or the VA. These are vital institutions. The problem is what happens when any system, public or private, substitutes a metric for the thing the metric was supposed to represent. Darrell Huff wrote the book on this, How to Lie with Statistics, in 1954. We keep learning it the hard way.
It appeared at Wells Fargo, where the “Going for Gr-Eight” cross-selling target led employees to open 2 million unauthorized accounts. It appeared in Atlanta public schools, where 178 educators across 44 schools cheated on standardized tests (teachers held weekend pizza parties to erase wrong answers) after No Child Left Behind made test scores the metric that determined funding.
In every case, the numbers improved. In every case, the thing the numbers claimed to represent did not.
The Structural Problem
None of these are stories about bad people. The VW engineers, the hospital administrators, the schoolteachers, the VA managers. They were all responding rationally to what they were being measured on. That’s what makes it structural rather than moral.
When you design a measurement system for reporting rather than for understanding, you get a system that reports well. The metrics are built to be met. The dashboards are built to be green. And the distance between what the numbers say and what’s actually happening grows quietly until something breaks.
The $65 billion ESG measurement industry isn’t measuring whether companies are getting better. It’s measuring whether companies are getting better at being measured.
That’s not the same thing. And the gap between those two sentences is where a lot of what we’ve been told about corporate accountability falls apart.