I entered the world of data in the 1990s, before the terms “Big Data” and “Data Scientist” existed. Today, we tend to think of data science as a relatively new phenomenon. But using numbers, data, and algorithms to explain and justify human behavior, and dare I say human value, is a practice as old as mathematics itself. Statistics and data were already shaping American history by the 1890s.
Economic bias in data can be traced back to the late 1800s
The post-Reconstruction United States was a precarious time for race and class. Prudential Life Insurance Company solicited a statistical justification that the American Negro was uninsurable. In 1896, Frederick L. Hoffman, a VP and Statistician with the Prudential Life Insurance Company, published a 330-page article in Publications of the American Economic Association titled “Race Traits and Tendencies of the American Negro.”
This event set a precedent for the relationship between business, society, and data practice. Modern data practitioners encounter its effects and deep consequences every day. As one who belongs to the caste of the proverbial American Negroes, I want to be outraged. I know for a fact that we still choose and process variables and machine learning features that double down on archaic biases to determine value. Sometimes it's intentional; other times the past is so deeply entrenched in the present that it hides within the depths of the data, seemingly unrecognizable.
A closer look at Hoffman’s findings
- “It is not the conditions of life, but in the race traits and tendencies that we find the causes of excessive mortality... For the root of evil lies in the fact of an immense amount of immorality, which is a race trait.”
- “For some generations of the colored element may continue to make decennial gains, but it is very probable that the next thirty years will be the last to show total gains, and then the decrease will be slow but sure until final disappearance.”
Instead of being either dismayed or indignant, I decided to focus intentionally on how and why this practice, and others like it, continue to affect class, gender, sexuality, and denotations of diversity in today’s world, with the goal of disrupting these practices.
Hoffman’s assessments provided a framework for conducting business through the creation of products and financial instruments, with consequences that still affect the societies, communities, and constituents those businesses serve.
It is worth noting that Hoffman had significant accomplishments in the areas of population and public health. Indeed, it was Hoffman’s work that first linked cancer to diet and tobacco. He was a co-founder of both the National Lung Association and the American Cancer Society.
Did people challenge Hoffman’s treatise?
Of course. Two prominent Americans with training in Mathematics and Sociology each penned a response.
W.E.B. Du Bois, in The Health and Physique of the Negro American (1907), wrote, “The careful statistician will immediately see that, while all these different sets of figures give data interest in themselves, they must be used with great care in comparison, because they relate to different classes of people and to widely different conditions of life.”
Kelly Miller, in A Review of Hoffman’s Race Traits and Tendencies of the American Negro (1897), wrote, “But, freedom from conscious personal bias does not relieve the author from the imputation of partiality to his own opinions beyond the warrant of the facts which he has presented.”
Why start with this history lesson?
Because the effects of this debate at the turn of the 20th century were far-reaching. W.E.B. Du Bois proclaimed in his book The Souls of Black Folk (1903), “…for the problem of the Twentieth Century is the problem of the color line.” Shortly after the publication of Hoffman’s treatise, Mutual Life adopted Prudential's algorithms, contributing to the economic segregation of Black Americans. Over subsequent years, these business rules became industry standard, adopted and adapted across the Finance and Insurance industries. These models undergird a practice that became known as redlining: denying someone credit based on their economic context or neighborhood.
I start with history to illustrate the risks of omitting context when wrangling data and bringing unconscious bias to the construction of models.
Have these models really endured that long?
In 2016, David Ingold and Spencer Soper published an article about Amazon’s same-day delivery maps titled “Amazon Doesn’t Consider the Race of Its Customers. Should It?” The premise: “a solely data-driven calculation that looks at numbers instead of people can reinforce long-entrenched inequality in access to retail services.”
A zip code might seem like an innocuous variable or feature, but in a geography with a long history of both physical and economic segregation, it can stand in for race and class, and the results are catastrophic. That is especially true when the segregation has been justified statistically and mathematically for over a hundred years. Our perception that numbers never lie is wrong. From a data science perspective, our top industries (Insurance, Finance, Real Estate, and Retail) were all early adopters of advanced analytics and machine learning, especially where they intersect with Marketing.
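One way to make the zip code concern concrete is to ask how well a single feature lets you guess a protected attribute. The sketch below is illustrative only: the function name `proxy_strength`, the field names, and the toy records are all assumptions made up for this example, not data or code from any real system. It compares a naive baseline (always guess the most common group) with guessing the most common group within each zip code.

```python
from collections import Counter, defaultdict


def proxy_strength(records, feature, protected):
    """Return (baseline, proxy_accuracy) for guessing `protected` from `feature`.

    baseline:       accuracy of always guessing the overall most common group
    proxy_accuracy: accuracy of guessing the most common group per feature value
    """
    overall = Counter(r[protected] for r in records)
    baseline = overall.most_common(1)[0][1] / len(records)

    # Count group membership separately for each value of the feature.
    by_value = defaultdict(Counter)
    for r in records:
        by_value[r[feature]][r[protected]] += 1

    # Within each feature value, the best single guess is the majority group.
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return baseline, correct / len(records)


# Entirely made-up toy records (hypothetical zip codes and group labels).
records = [
    {"zip": "00001", "group": "A"},
    {"zip": "00001", "group": "A"},
    {"zip": "00001", "group": "A"},
    {"zip": "00001", "group": "B"},
    {"zip": "00002", "group": "B"},
    {"zip": "00002", "group": "B"},
    {"zip": "00002", "group": "B"},
    {"zip": "00002", "group": "A"},
]

baseline, accuracy = proxy_strength(records, "zip", "group")
print(f"baseline={baseline:.2f} with_zip={accuracy:.2f}")
```

A large gap between the two numbers suggests the feature carries much of the protected attribute's information, so dropping the protected column alone would not remove it from a model. This is a rough screening check under the stated assumptions, not a full fairness audit.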
The evolution has been devastating, yet predictable. We are trapped in a dangerous feedback loop.
Where do we go from here?
How do we tactically mitigate and reverse the effects of bias in data practice as it relates to feature engineering and modeling? I’ll start with three suggestions:
- Recognize the impact of technology - It can, and will, play a significant role in shaping the way businesses and the country approach increased equality.
- Examine the scope of data governance - There are huge opportunities on many levels to rethink and rework how we as organizations, governments, and people think about data at the individual level.
- Keep data simple - The less we complicate it, the greater our chances to create more diversity, more inclusion, and more advancements for everyone.
If nothing else, 2020 elucidated the fault lines in United States culture. While it's easy to put a tribal spin on what’s happening, it's more interesting to observe it from a systems point of view. In fact, I’d argue that in the industries of Technology and Data, we’ve been largely untouched by the goings-on of the country.
This is not to say that I am personally unscathed. I am the mother of 3 young Black men and a person who has experienced loss due to COVID-19 disproportionately impacting the Black and Brown communities. I say this as a participant in the industry and acknowledge it was a bit easier for us to adapt. Many of us worked from home, maintained employment, received high wages, and lived away from the front-line protests.
We watch the events with rapt attention from inside the Tech and Data bubble. Except, we are complicit. We are part of an ongoing complicity that undergirds the infrastructure of our country with mathematical models and data-driven decision making. Data is the scaffolding that tenuously holds us together, and I believe we, as data practitioners, are waking up to this realization.
We can change harmful data practices
In 2018, an NCBI committee proposed a Data Science Oath.
In June 2020, 1,400 mathematicians, statisticians, and data scientists issued a letter boycotting collaboration with police on predictive techniques aimed at stopping crime before it occurs. The letter addressed the widely documented disparities in how US law enforcement agencies treat people of different races and ethnicities.
We have entered a new era. While oaths and actions are a great first step, it's even more critical to absorb them into governance.
The practice of segmenting populations based on preconceived notions of social stratifications will hopefully be replaced by a more humane contextualization of personhood. As an industry, we’ve invested vast amounts of money into the collection and storage of data. Often, our colleagues and business stakeholders get frustrated with the return value on these investments. We’ve only recently learned how to mine and model the immensity of data we’ve amassed, and it’s still overwhelming.
Yet this - the Variety, the Velocity, the Volume, and the Veracity - is what makes us different. Harnessing the power of the Big Data Vs will help us mitigate the bias of the past century. Data has become a medium for telling the human story, for uncovering truths hidden deep within the facts we comb through every day. If we are brave enough, we can create something new.
Change is coming
In 2019, the Algorithmic Accountability Act was introduced to direct the FTC to develop regulations requiring large firms to conduct impact assessments for existing and new “high-risk automated decision systems.”
This is the initial effort to acknowledge the danger and hold data practitioners accountable. It allows us to start the process of intentionally mitigating the harmful effects of data practice. It’s a framework layered over the Time Value continuum of data that assigns accountability and suggests techniques for resolution.
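What an impact assessment measures in practice is still an open question, but one simple starting point is to compare outcomes across groups. The sketch below computes a disparate impact ratio: the lowest group approval rate divided by the highest. The function names and the toy decisions are hypothetical, and the 0.8 threshold is an assumption borrowed from the "four-fifths" rule of thumb in US employment-selection guidance. It is not something the Algorithmic Accountability Act prescribes.

```python
def selection_rates(decisions):
    """Map each group to its approval rate.

    decisions: iterable of (group, approved) pairs, where approved is a bool.
    """
    totals, approved = {}, {}
    for group, ok in decisions:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + (1 if ok else 0)
    return {g: approved[g] / totals[g] for g in totals}


def disparate_impact_ratio(decisions):
    """Lowest group approval rate divided by the highest (None if undefined)."""
    rates = selection_rates(decisions)
    highest = max(rates.values())
    if highest == 0:
        return None  # nobody was approved, so the ratio is undefined
    return min(rates.values()) / highest


# Made-up model decisions: group A approved 8 of 10, group B approved 4 of 10.
decisions = (
    [("A", True)] * 8 + [("A", False)] * 2
    + [("B", True)] * 4 + [("B", False)] * 6
)

ratio = disparate_impact_ratio(decisions)
THRESHOLD = 0.8  # assumed rule-of-thumb cutoff, not a legal standard for this Act
flagged = ratio is not None and ratio < THRESHOLD
print(f"ratio={ratio:.2f} flagged={flagged}")
```

A single ratio like this cannot explain why outcomes differ, and it says nothing about the historical context behind the data. It can, however, give business stakeholders, data owners, and affected constituents a shared number to start the conversation from.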
I believe we are all concerned with this topic and discussing it with our colleagues. I warmly invite you to have this conversation with me. Perhaps together we can develop an open governance model of deep analytics where business stakeholders, data owners, data practitioners, and affected constituents are all engaged.