Developer Book Club: Weapons of Math Destruction
Good writers make you think; great writers make you want to change the world around you. I would argue without hesitation that Cathy O’Neil is a great writer, and with a PhD in mathematics from Harvard, you can rest assured that she knows what she’s talking about.
Her book, Weapons of Math Destruction, tackles one of the tech world’s hottest phrases of the decade: Big Data.
Artificial Intelligence (AI), mobile, social and Internet of Things (IoT) are driving data complexity, new forms and sources of data. Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include structured, semi-structured and unstructured data, from different sources, and in different sizes from terabytes to zettabytes.
Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process the data with low-latency. And it has one or more of the following characteristics – high volume, high velocity, or high variety. Big data comes from sensors, devices, video/audio, networks, log files, transactional applications, web, and social media – much of it generated in real-time and in a very large-scale.
Analyzing big data allows analysts, researchers, and business users to make better and faster decisions using data that was previously inaccessible or unusable. Using advanced analytics techniques such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing, businesses can analyze previously untapped data sources independent or together with their existing enterprise data to gain new insights resulting in better and faster decisions.
– IBM
Big Data is being used in the modern world for anything and everything you could think of: it’s used to develop marketing campaigns, determine what Netflix suggests you watch next, and predict the spread of diseases, and it’s the reason airline flights are more expensive on certain days (though no one can quite agree on which days for very long). “Data mining” is a hot topic as people try to identify patterns and apply them in an ever-increasing number of fields.
O’Neil isn’t interested in Netflix suggestions, however; she’s far more concerned about the way data is being used to target certain populations through harmful proxy measurements. Proxies can seem like a good idea because they can be used to “measure” things that are subjective (e.g., how responsible a person is), but they turn destructive when the proxy is also indicative of things largely outside a person’s control (e.g., socioeconomic status).
A frequent example of a bad proxy measurement in the book is the use of credit scores to judge people: a bad credit score may be interpreted as a sign that a person lacks financial responsibility, but it could just as easily mean that someone is going through tough times due to circumstances beyond their control. If an employer weeds out an application based on the candidate’s credit score without ever talking to the person, they’re inevitably missing out on potentially good employees while also preventing that person from earning money that could go toward improving their score. If enough employers rely on this proxy measurement, what are the chances that the candidate will ever find a job?
Many of the examples in the book rely on algorithms cutting people out of the equation in order to save time and energy: a school that wants to weed out three-quarters of its applications, a police department that only wants to focus on the most dangerous parts of the city, an employer that only wants to spend time interviewing “qualified” applicants. The people who are denied access (or targeted) often don’t even know why they were disqualified, and the people relying on the algorithms will never know what they’re missing out on for the sake of efficiency. By using these flawed algorithms, those relying on them are acting in a way that is systematically prejudiced against certain groups (systematic in the purest technological sense of the word, because even the humans using them may not totally understand how they were designed).
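To make that mechanism concrete, here’s a toy sketch (my own, not from the book) of what a proxy-based screen looks like in code. The names, scores, and cutoff are all made up; the thing to notice is that the filter records nothing about who it drops or why.

```python
# A toy illustration of a proxy-based screen: the employer only ever
# sees the survivors, and the rejected never learn which rule cut them.
from dataclasses import dataclass

@dataclass
class Applicant:
    name: str
    credit_score: int       # proxy for "responsibility" -- the book's objection
    years_experience: int

applicants = [
    Applicant("A", credit_score=540, years_experience=9),
    Applicant("B", credit_score=720, years_experience=1),
    Applicant("C", credit_score=610, years_experience=6),
]

CREDIT_CUTOFF = 650  # arbitrary threshold, chosen purely for illustration

# The filter never records *why* someone was dropped.
shortlist = [a for a in applicants if a.credit_score >= CREDIT_CUTOFF]

print([a.name for a in shortlist])  # ['B'] -- the most experienced candidates never get an interview
```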
How do bad proxies apply to software engineering?
In reading Weapons of Math Destruction, I began to think about how the applications programmers create can be biased in ways that aren’t commonly thought about but could be equally harmful.
In my university’s Graphics & User Interface (GUI) class we talked about accessibility for all of one class period. The focus was on color schemes and the colorblind: if you put red text on a green background, the site will be nearly useless for about 8% of men of Northern European descent. I’m happy that I learned to take that 8% into consideration, but there are other groups that we aren’t talking about.
One of the groups that’s been haunting me in particular is people who don’t have personal computers. In 2014, a study confirmed that the group most likely to use the internet at a public library is people living at or below the poverty line, who generally do not have a computer at home.
According to the Gates Foundation, in 2010 seventy-seven million people used the internet at public libraries in the United States; that’s nearly a quarter of the population (a huge percentage that was never discussed in any of my computer science classes).
What did those people use the internet for?
- Career needs: searching for jobs, filling out resumes, etc.
- Health issues: learning about illnesses, seeking health providers.
- Education: homework, online classes.
I don’t know a lot about the internet at public libraries, but I would imagine that most of them aren’t operating on Google Fiber. As a developer in a first-world country, I know that I’m personally accustomed to working with the best computers (and acceptable internet) that my company can get its hands on, which means that when I’m testing my software I’m generally not thinking about how long it takes someone on a five-year-old computer to do the exact same thing.
If your website is aimed at providing a service for people, then make sure it’s providing that service for all people. More than once I’ve cursed at a website that takes a minute to render, or just given up on it altogether (I’m looking at you, Invision).
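One low-effort way to keep that user in mind is a crude performance check like the one below. This is a minimal sketch with values I’ve invented: the URL, the assumed link speed, and the time budget are all placeholders, and it only estimates transfer time for the raw payload rather than measuring a full page render.

```python
# A rough "slow connection" sanity check: fetch the page, then estimate
# how long the payload alone would take to download on a strained link.
import time
import urllib.request

PAGE_URL = "https://example.com/"      # hypothetical page under test
SLOW_LINK_BYTES_PER_SEC = 150_000      # stand-in for a shared public-library connection
BUDGET_SECONDS = 10                    # what we decide is acceptable for that user

start = time.monotonic()
with urllib.request.urlopen(PAGE_URL, timeout=30) as resp:
    payload = resp.read()
server_seconds = time.monotonic() - start

estimated_total = server_seconds + len(payload) / SLOW_LINK_BYTES_PER_SEC
print(f"{len(payload)} bytes, roughly {estimated_total:.1f}s on a slow link")

if estimated_total > BUDGET_SECONDS:
    print("Over budget for a public-library connection -- worth investigating.")
```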
Ironically, much like biased algorithms, service-based websites that are created to “help” are also more likely to be slow and to fail their end users.
Chalk it up to capitalism, but it seems like websites that generate major revenue are rarely down, whereas websites offering free or affordable services suffer from poor architecture and poor testing strategies.
Do you remember the Obamacare website that crashed when people tried to sign up for health insurance?
“The problem has been that the website that’s supposed to make it easy to apply for and purchase the insurance is not working the way it should for everybody,” Obama said. “There’s no sugar-coating it. The website has been too slow. People have been getting stuck during the application process. And I think it’s fair to say that nobody is more frustrated by that than I am.”
You know who was probably more frustrated than the President? Single moms using public computers during their lunch hour, trying to sign up for health insurance and unable to when they finally had the time to do so. I know people who will give up access to entire websites just because they forgot their password; how many people do you think ended up not signing up for health insurance because the website’s architecture wasn’t built to scale? We will never know the answer, because much like the employers who use algorithms to cut people out of the application process, developers usually only see the people who made it through successfully. Unless you have monitoring software or very vocal customers, you aren’t going to get much valuable feedback on why people quit halfway through.
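Even basic monitoring can make those silent exits visible. Here’s a minimal sketch of funnel tracking under assumptions of my own (the step names and event log are hypothetical): if you record which step of the flow each user reaches, the drop-off between steps stops being invisible.

```python
# Count how many users reached each step of a signup flow and report the
# drop-off between consecutive steps.
from collections import Counter

STEPS = ["landing", "create_account", "eligibility_form", "plan_selection", "confirmation"]

# Hypothetical event log of (user_id, step_reached); in practice this
# would come from your analytics or application logs.
events = [
    ("u1", "landing"), ("u1", "create_account"), ("u1", "eligibility_form"),
    ("u2", "landing"), ("u2", "create_account"),
    ("u3", "landing"),
]

reached = Counter(step for _, step in events)
for prev, step in zip(STEPS, STEPS[1:]):
    if reached[prev]:
        drop = 1 - reached[step] / reached[prev]
        print(f"{prev} -> {step}: {drop:.0%} of users never made it")
```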
The Obamacare website got a lot of publicity because it involved a large government initiative, but there are other websites that are equally necessary that go down without notice or have unexplained latency issues.
- The website where my boyfriend pays off his student loans was down for at least 24 hours last month, with no information as to why or when it would be back up, and no news coverage; there’s nothing scarier than your bank’s website suddenly going offline.
- I’ve personally been waiting almost a week for a money transfer to my Health Savings Account to go through so I can re-order my prescriptions without having to find time in my schedule to call my insurance and explain why they have to put the charge on two different cards.
- The VA’s website allowing vets to obtain identification cards crashed earlier this year, likely because the site hadn’t been properly load-tested (a minimal sketch of what such a test might look like follows this list).
- And others that never got news coverage but definitely impacted people’s lives.
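On the load-testing point: even a crude concurrent-request script can surface the kind of failure that took the VA site down. The sketch below uses an assumed URL and made-up concurrency numbers; a real load test would use a dedicated tool and run against a staging environment, not production.

```python
# Fire concurrent requests at an endpoint and report successes and worst latency.
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET_URL = "https://example.com/apply"   # hypothetical endpoint under test
CONCURRENT_USERS = 50
REQUESTS_PER_USER = 10

def simulate_user(_):
    ok, latencies = 0, []
    for _ in range(REQUESTS_PER_USER):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(TARGET_URL, timeout=10) as resp:
                resp.read()
                ok += resp.status == 200
        except (urllib.error.URLError, TimeoutError):
            pass  # count as a failure, but still record the elapsed time
        latencies.append(time.monotonic() - start)
    return ok, latencies

with ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
    results = list(pool.map(simulate_user, range(CONCURRENT_USERS)))

total = CONCURRENT_USERS * REQUESTS_PER_USER
successes = sum(ok for ok, _ in results)
worst = max(latency for _, latencies in results for latency in latencies)
print(f"{successes}/{total} requests succeeded, worst latency {worst:.2f}s")
```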
We get it, websites go down. What are you going to do about it?
Besides just being frustrated, we (developers) need to hold ourselves and our websites to higher standards: engage in not just appropriate testing practices, but go overboard in how thorough we are in developing user profiles and testing against them.
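What might “going overboard on user profiles” look like in practice? One option is to write the profiles down as data and run your tests against each one. The profiles and numbers below are invented for illustration; the point is simply to make the assumptions explicit instead of defaulting to a developer workstation.

```python
# Explicit user profiles to test against, instead of assuming everyone
# has a fast machine, fast internet, and unlimited time.
from dataclasses import dataclass

@dataclass
class UserProfile:
    name: str
    bandwidth_kbps: int      # rough connection speed
    device_age_years: int
    session_minutes: int     # how long they realistically have (e.g. a lunch break)

PROFILES = [
    UserProfile("developer workstation", 300_000, 1, 480),
    UserProfile("five-year-old laptop at home", 10_000, 5, 60),
    UserProfile("public library terminal", 3_000, 7, 30),
]

for profile in PROFILES:
    # Hypothetical hook: run your page-load or signup-flow tests with these
    # constraints instead of whatever your own machine happens to have.
    print(f"Testing as: {profile.name} ({profile.bandwidth_kbps} kbps, "
          f"{profile.session_minutes} min session)")
```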
TL;DR
Weapons of Math Destruction by Cathy O’Neil is a great read: quick and simple, but with concepts that are far-reaching in our modern world. You don’t have to have a PhD in mathematics to appreciate her illustrations and writing style. You can buy it here.