The effect of preconceptions on the results of machine learning

Machine learning and artificial intelligence offer many possibilities, such as improving medical treatment and diagnosis, identifying potential safety hazards, and advancing scientific research. However, when used inappropriately, data models can also perpetuate inequality or push people and companies toward improving metrics at the expense of actual performance.

In her book “Weapons of Math Destruction”, Cathy O’Neil argues that for a data model to be “healthy”, it should be transparent, continuously updated, and statistically rigorous. In harmful models, data genuinely relevant to the outcome being predicted is often lacking and is substituted with proxies.

One example of a model perpetuating inequality is Correctional Offender Management Profiling for Alternative Sanctions, or COMPAS [1], an algorithm used in the United States to predict prisoner recidivism. It is the product of a for-profit company, Northpointe, and the idea behind it was to make criminal sentencing fairer by removing human bias.

The COMPAS software used defendants’ answers to a questionnaire to predict how likely they were to reoffend. The questionnaire included questions such as:
“Was one of your parents ever sent to jail or prison?”
“How many of your friends/acquaintances are taking drugs illegally?”
“How many of your friends/acquaintances served time in jail or prison?”
“In your neighborhood, have some of your friends or family been crime victims?”
“Do some of the people in your neighborhood feel they need to carry a weapon for protection?”


One can see how these types of questions, when used to assess recidivism, would further marginalize people from less privileged backgrounds, causing them to receive harsher sentences than someone from a more privileged background who committed the same crime. At the same time, someone likely to commit further violent offences might be let off more lightly because of their background, potentially leading to more victims of violent crime.

In 2009, Northpointe co-founder Tim Brennan and colleagues published a validation study reporting that their algorithm had an accuracy rate of 68 percent in a sample of 2,328 people. According to Brennan, “it is difficult to construct a score that doesn’t include items that can be correlated with race — such as poverty, joblessness and social marginalization”, and removing those factors reduced accuracy.

According to an article published by ProPublica [1], black defendants were twice as likely as white defendants to receive a score indicating that they were at high risk of reoffending yet not go on to do so. ProPublica’s analysis of data from Broward County, Florida also showed the opposite mistake for white defendants: they were much more likely to be labeled lower risk but go on to reoffend.

Skin colour                                  White    Black
Labeled Higher Risk, But Didn’t Re-Offend    23.5%    44.9%
Labeled Lower Risk, Yet Did Re-Offend        47.7%    28.0%

Results from ProPublica’s analysis of data from Broward County, Florida
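
To make these two error rates concrete, the sketch below computes them from a set of risk labels and outcomes. It is a minimal illustration: the records and group sizes are invented, not drawn from COMPAS, and only the two metrics themselves (the share of non-reoffenders labeled high risk, and the share of reoffenders labeled low risk) correspond to the quantities in ProPublica’s analysis.

```python
# Minimal sketch: per-group error rates of a binary risk score.
# Each record is a pair (labeled_high_risk, reoffended); all data
# below is invented for illustration, not taken from COMPAS.

def rates(records):
    """Return (accuracy, false positive rate, false negative rate)."""
    fp = sum(1 for high, re in records if high and not re)
    fn = sum(1 for high, re in records if not high and re)
    negatives = sum(1 for _, re in records if not re)  # didn't reoffend
    positives = sum(1 for _, re in records if re)      # did reoffend
    accuracy = 1 - (fp + fn) / len(records)
    return accuracy, fp / negatives, fn / positives

groups = {
    # 16 hypothetical defendants per group, with mirrored error profiles
    "white": [(False, False)] * 6 + [(True, False)] * 2
           + [(False, True)] * 4 + [(True, True)] * 4,
    "black": [(False, False)] * 4 + [(True, False)] * 4
           + [(False, True)] * 2 + [(True, True)] * 6,
}

for name, records in groups.items():
    acc, fpr, fnr = rates(records)
    print(f"{name}: accuracy {acc:.1%}, "
          f"labeled high risk but didn't reoffend {fpr:.1%}, "
          f"labeled low risk yet did reoffend {fnr:.1%}")
```

In this invented data, both groups get an identical overall accuracy while their error rates are mirror images of each other. That is the kind of asymmetry a single accuracy figure, such as Northpointe’s 68 percent, cannot reveal.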

Another example of a model using proxies to measure something immeasurable is the teacher evaluation program IMPACT, developed by education reformer Michelle Rhee in Washington, D.C. According to the Washington Post, “for some teachers, half of their appraisal is contingent on whether students meet predicted improvement targets on standardized tests.” [2] It should be obvious that students’ scores are affected by far more than their teacher. Nevertheless, these evaluations led to the firing of 206 teachers in District of Columbia public schools at the end of the 2009-2010 school year. The algorithmic scores outweighed positive reviews from school administrators and students’ parents, likely leading to the dismissal of competent teachers.
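
To see why such an evaluation is fragile, here is a deliberately simplified sketch of a value-added style score. The growth rule, score scale, and class are all hypothetical, and this is not IMPACT’s actual model; the point is only the structure: the score is the gap between actual and predicted test results, attributed entirely to the teacher.

```python
# Rough illustration of a value-added style teacher score. The growth
# rule and all numbers are invented; IMPACT's actual model is more
# elaborate, but the basic structure (actual test-score growth compared
# to a predicted target, averaged over one class) is similar.

def value_added(prior_scores, current_scores, expected_growth=5.0):
    """Average amount by which students beat a predicted score, where
    the prediction is simply last year's score plus a fixed growth."""
    residuals = [curr - (prior + expected_growth)
                 for prior, curr in zip(prior_scores, current_scores)]
    return sum(residuals) / len(residuals)

# One small class: with so few students, a couple of outliers or a
# distorted baseline can flip the sign of the score.
prior   = [60, 72, 55, 80, 68]
current = [63, 78, 58, 83, 70]
print(f"value-added score: {value_added(prior, current):+.1f}")
```

Anything that moves that gap, such as last year’s baseline, who happens to be in the class, or conditions at home, lands on the teacher’s score whether or not the teacher had any influence over it.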

These are just a couple of examples of big data algorithms perpetuating inequality or otherwise leading to undesirable outcomes. Similar algorithms are used for many other purposes, such as hiring decisions by companies, university admissions, and granting people insurance or loans.

In conclusion, while machine learning opens up many possibilities for advancing scientific understanding and improving society, we have to make sure that models are built on data genuinely relevant to the outcome rather than on proxies. We should also keep in mind the assumptions and oversights embedded in models, especially when their use is widespread and can affect people in life-altering ways. This is only possible when the workings of a model are transparent and the data it uses is continuously updated.

Read more:
[1] Machine Bias, ProPublica: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
[2] 206 low-performing D.C. teachers fired, The Washington Post

About the author

Anna Lohikko is a specialist in advanced data analytics.
