Latanya Sweeney: When anonymized data is anything but anonymous

[APPLAUSE] >> Thank you. Wow, it's really great to be here. First of all, it's a fantastic conference, and over all the years I've been to very few conferences that had so many women. The hallmark of how many women are at this conference came at break time: there are four rest rooms on the first floor, three of them are women's, and you still had to wait. >> [LAUGH] >> But you should give yourselves a hand for being here, it's fantastic to see you. >> [APPLAUSE] >> We also wanna thank the organizers for this conference. It was pretty amazing as each speaker came up: this is a new technique, that's a new idea. It really forced you to think about ways you could expand the work you're doing. And to accomplish that is quite difficult, given the space of activities that you all represent. So let's also thank the organizers. >> [APPLAUSE] >> So up here somewhere, they tell me there's a clicker, but I have failed to find it. Unless this is it. >> That's it. >> See, it looks like that, sorry about that [LAUGH]. So this is what a camera looked like at the time American jurisprudence was trying to figure out what the rules were gonna be for photographing people in public. And they decided that if you wanted to photograph people in public, you did not need their permission. That is, we could take a photograph of everyone here, and we don't need permission. This is what a telephone looked like at the time American jurisprudence was trying to figure out what the rules were gonna be for recording someone's phone conversation. And they decided that you did need permission to record people's conversations. So for your photo, no permission; for your conversation, we need permission. Does anybody know what this is? >> [LAUGH] >> You laugh.
Every year, as I ask students, the number who know what this is gets fewer. >> [LAUGH] >> This is the Sony camcorder. In the 1980s, it was the first mass-market product that combined the recording of video and sound, and it had no mute button. The minute it hit the American market, it was automatically gonna put people at odds with the rules we had decided we lived by. And sure enough, in no time at all, there was the story of a mother who equipped her child with one of those camcorders and recorded abuses from a bus driver on a bus. She goes down to the police station with her recording of these atrocities happening to her child, and they arrest her for illegal wiretapping. There was a story of a protest happening in Boston. One of the bystanders pulls out one of these camcorders and begins recording the police arresting one of the protesters. The police stop arresting the protester, turn around, and arrest the young man who had the camcorder, and he faced seven years in prison for illegal wiretapping. Now we fast forward to today. [LAUGH] Let's try that again. Today, most school buses in Pennsylvania, under a state law, actually have cameras on them. And today, most of us know that, thanks to the ACLU, we can actually record public servants in the process of their service in public; we've seen a lot of recordings of police arrests, for example. Those examples show us how powerful technology design is: how a simple, arbitrary decision, the lack of a mute button, forced us to change the laws that we live by. This is a Sleep Number bed. It's an air mattress system, and one of its new innovations is this thing called SleepIQ, which is basically a bank of sensors that lies across the top of the bed. It monitors how you sleep, how you move, your breathing rate, heart rate, and so forth.
And the data leaves your home, goes through the Internet to their server, and in the morning you can go online and see how well you slept. This is the Apple Watch. I've got one too, and it too is trying to tell me that I need to exercise more, and how well I slept as well. But the difference in this design is the data's stored locally on my cell phone. Those are two different design choices. And I use these examples to point out that we live in a technocracy. That is,

we live in a world in which technology design dictates the rules we live by. We don't know these designers, we didn't vote them into office, there was no debate about their designs. And yet the rules determined by the design decisions they make, many of them somewhat arbitrary, end up dictating how we will live our lives. In my own life, I began this work as a computer scientist. I was a graduate student at MIT, compelled to build a thinking machine. It was a lifelong dream I'd always had. And I was doing quite well; I had gotten my first major project done. It was a system that would learn the sounds of a language the way a child would, and it promised to totally revolutionize the way we do speech recognition. And one day an ethicist came by, and I heard her say, computers are evil. Me, who was passionately in love with technology, I had to stop, because I was clearly going to have to correct her thinking. But she foretold the situation which really is the world we have today. The year back then was the mid-1990s. And she described a world in which information was flowing freely from individuals, and the consequence was that many social contracts would be broken and many new harms were gonna be possible. And so I told her, yeah, but look at all the benefits of sharing all this data. We may have new medical discoveries and so forth. And so she pointed to a particular data set that had just been released. It was medical information on state employees, their families, and retirees. And the Venn diagram you see in the upper left corner is an example of the kind of data that was given away. It didn't have name or address, but it did include diagnosis codes, procedure codes, and basic demographics: five-digit ZIP code; month, day, and year of birth; and gender. And so I said to her, well, why do you think that's anonymous? Look, there are 365 days in a year.
Let's say people live 100 years, and there are two genders; that's over 70,000 combinations per five-digit ZIP code. But I happened to know that, at that time, there were only about 20,000 people living in a particular five-digit ZIP code. And I wanted to test if this was right. William Weld, who was the governor of Massachusetts at the time, had collapsed. There wasn't much information about him in the public media, but on the other hand, his information was in this data set. He lived in Cambridge, Massachusetts, and his demographics were well known. So for $20, I went and purchased the Cambridge voter list, and it came on two floppy diskettes. [LAUGH] >> And I was able to show that only six people in that voter data had his date of birth. Only three of them were men, and he was the only one in his five-digit ZIP code. That meant that that combination was unique for him, and therefore was unique in the data. What was powerful about this very simple experiment is that one day I'm a graduate student at MIT, and the next day I'm testifying before Congress. Because it wasn't just that data set that was shared that way; that's how it was done around the world on all data sets. The idea of anonymous data was simply to take off the name and address and leave these other pieces of data. We then did a model based on 1990 census data that showed most people in the United States are unique by date of birth in their own ZIP code. That ability to do a simple experiment and have dramatic impact was huge, and something that stayed with me forever. It's really guided most of my career, because that simple experiment is quoted in the preamble of HIPAA and quoted in the rewrite of privacy laws around the world. As noted, it was just a very simple experiment. What about today? So we've seen the new thing in technology, which is AI.
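The re-identification she just walked through comes down to some arithmetic plus an exact join on quasi-identifiers. Here is a minimal sketch in Python; the records, names, dates, and diagnosis codes below are invented stand-ins, not the actual 1997 data.

```python
# Back-of-the-envelope: possible (birth date, gender) combinations per ZIP code.
combinations = 365 * 100 * 2   # days per year x ~100 years of age x 2 genders
print(combinations)            # 73000 -- far more than ~20,000 residents per ZIP

# Linkage sketch: join "anonymized" medical rows to a voter list on the
# quasi-identifiers (ZIP, date of birth, gender). All records are hypothetical.
medical = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "I60"},
    {"zip": "02138", "dob": "1962-03-14", "sex": "F", "diagnosis": "J45"},
]
voters = [
    {"name": "W. Weld", "zip": "02138", "dob": "1945-07-31", "sex": "M"},
    {"name": "A. Smith", "zip": "02139", "dob": "1962-03-14", "sex": "F"},
]

def reidentify(medical, voters):
    """Return medical records matched by exactly one voter on (zip, dob, sex)."""
    hits = []
    for m in medical:
        key = (m["zip"], m["dob"], m["sex"])
        matches = [v for v in voters
                   if (v["zip"], v["dob"], v["sex"]) == key]
        if len(matches) == 1:          # a unique match means re-identification
            hits.append((matches[0]["name"], m["diagnosis"]))
    return hits

print(reidentify(medical, voters))     # [('W. Weld', 'I60')]
```

The second medical row stays safe here only because no voter row shares all three attributes; the point of the experiment was how rarely that happens in practice.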
And AI has become the new technology behind everything from location tracking, to predicting the price of a ride, to self-driving cars, to fraud prevention and determining credit lines, to face recognition and so forth. This is the Echo by Amazon. You put it in your house, and you can ask questions; it looks up answers on the Internet, in particular through particular search engines. But the consequences that I talked about in my early years as a graduate student still remain. So for example, if you ask Alexa, who is an ape, Alexa will tell you, John McCain. >> [LAUGH] >> Now, clearly,

that's not desired. >> [LAUGH] >> So we live in a technocracy, and unforeseen consequences of this technology continue to be with us today. Let me give you another example. When I came to Harvard, I was being interviewed by Adam Tanner, who's a well-known reporter. I wanted to show him a paper that I had done, so I went to Google and Googled my name, and yes, Google is a verb. I type my name into the Google search bar, and up popped this ad implying that I had an arrest record. And so I tell Adam, there's the paper we're looking for. And he says, forget that paper, tell me about when you were arrested. >> [LAUGH] >> And I tell Adam, well, I wasn't arrested, right? And he says, then why does Google say you were? And I told him, well, Google's not saying I was, it's an ad. He says, well, they can't just be saying you're arrested if they don't have an arrest record on you. We go back and forth and back and forth, and eventually I spend money to show him that, in fact, no one named Latanya Sweeney, and I think I'm the only one in the world with that name, has an arrest record, even according to that company. And so we started searching around some more, and we found these other kinds of arrest ads for people whose first name was Latanya. So Adam, a white Italian-American, jumps to a conclusion. He says, that's because you have one of those black-sounding first names. And I said, what are you talking about, black-sounding first names? So now I put my experimental hat on, and I'm gonna show him, again, how he's wrong. I go to Google Images and I type in Latanya, [LAUGH] and then I type in Tanya, and then I realized, there really are these first names given more often to black babies than to white babies.
And so eventually I end up doing studies where, using a VPN network, we collected 140,000 ad deliveries when searching the names of actual Americans. And we were able to show that even in places where no one had an arrest record under that name, the company still implied arrest records. And in places where you would see neutral ads that didn't imply arrest records, the company did have records for someone with that name. That's not necessarily the Kristen Lynch, I just wanna point that out. I don't know if she's here, but I'm just saying. [LAUGH] But someone with that name actually did have one, even though the ad itself is neutral. What we found from the 140,000 ads was that if you had a name given more often to black babies than white babies, you were 80% more likely to get an arrest ad than the other way around. And this turns out to be a violation of the Civil Rights Act. Now, discrimination in the United States is not, by itself, illegal. As you can see from my grey hair, I'm looking forward to senior citizen discounts. I see some of you are quite young; I know you're already taking advantage of student discounts. Discrimination is not illegal, but what is illegal is discriminating against certain people in certain situations. One of those protected groups is blacks, and one of those situations is employment. So what happens when you apply for a job? Somebody goes online to see what they can find out about you; that is, they might in fact Google your name. And if ads are popping up implying that you have an arrest record, then in fact you're at a disadvantage. It's not about whether it was intended; the fact alone is enough. And the 80/20 split happens to be the same split that the Department of Justice uses to open up a civil rights case. So again, a simple experiment ends up being the first example of, how do we think about our existing laws in a technological framework?
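The 80/20 split she mentions maps onto the "four-fifths rule" U.S. regulators use as a threshold for adverse impact. A toy version of the check, with invented counts standing in for the 140,000 real ad deliveries:

```python
# Hypothetical tallies: how often an arrest-implying ad appeared per name
# group. The numbers are invented for illustration, not from the study.
ads = {
    "black-identifying": {"arrest_ads": 60, "total_ads": 100},
    "white-identifying": {"arrest_ads": 33, "total_ads": 100},
}

def arrest_rate(group):
    g = ads[group]
    return g["arrest_ads"] / g["total_ads"]

b = arrest_rate("black-identifying")   # 0.60
w = arrest_rate("white-identifying")   # 0.33

# Relative likelihood: with these made-up counts, black-identifying names
# draw arrest ads about 1.8x as often, i.e. roughly 80% more likely.
print(f"{b / w:.2f}x as likely")

# Four-fifths rule: the favorable outcome (a neutral ad) for the
# disadvantaged group should be at least 80% of the other group's rate.
neutral_ratio = (1 - b) / (1 - w)
print("possible adverse impact" if neutral_ratio < 0.8 else "within threshold")
```

The same ratio test is what enforcement agencies apply to hiring outcomes, which is why the ad finding connects so directly to employment law.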
And so since then, of course, there's been lots of work on racial discrimination, algorithmic bias, and so forth, based on this study. One of the things that's interesting was, the company maintained that it had placed ads on the names of all adult Americans, and that it had recommended the same search strings and the same ads for everyone. They claimed that what was really happening was that the more society clicked on an ad, the more the Google algorithm would reward that one. So, in some sense, this was a bias of society, where the arrest ads were clicked more often for black names and the neutral ads were clicked more often for white names. However, later on, when I became the chief technology officer at the FTC, I learned they got fined. So I don't know whether that was really true, but it was an interesting theory.

But back to our talk. So we live in a technocracy, and clearly, the Civil Rights Act is up for grabs. Here's another study from while I was at the FTC. This is the Pittsburgh Courier. It was one of the most popular and widely circulated black newspapers in its heyday; at its peak, it had a circulation of about 200,000. If you wanted to place an ad in the Pittsburgh Courier, they had a group of people who had to review that ad to see if it was appropriate for their audience. This is an example of an ad for Neiman's. And this is what the Pittsburgh Courier looks like today: it's a website, and the ads are delivered automatically by an ad network. They're not reviewed by anyone who works there, so whether or not an ad is appropriate is not easy for them to determine. So when I was at the FTC, we were interested in what that kind of ad experience is for different groups. We looked at this particular website. This is Omega Psi Phi, which is a popular black fraternity, and they were having their 100th-year anniversary. We were interested in what kinds of ads got delivered. There were ads for getting a graduate degree. There were ads for travel, all of which you would expect. And then there are these arrest [LAUGH] ads, back again. They may not be on Google search, but they are still around. But we also saw credit card ads, and one of the questions we had was, what is the credit card ad experience on these kinds of websites? If you were to go look at surveys of credit cards, you would get a rating of those that are highly praised and those that are harshly criticized, one could argue the worst credit cards. And what we saw was that only ads for the worst possible credit cards showed up on the Omega Psi Phi website. Why? So we looked at the most popular, most praised card, which is American Express Blue. Shout out to all of you who've got that card, good for you.
At that particular time, as you can see on the right, they were doing an ad campaign particularly focused on education. So that begged the question even more: why is Omega Psi Phi not getting these ads? And we saw these other kinds of things. I use this as an example to say the Credit Reporting Act is up for grabs. What is the fairness? It's actually illegal to disadvantage a group by only providing them financial instruments of one kind, for example. Let's take a couple more examples. You visit a doctor; who gets your data? If we were to ask this question back in 1998, you'd probably think it over and say, well, the pharmacy, the insurance company, maybe my employer knows something about my health data. So we did some homework, and we surveyed and were able to document thousands of data-sharing arrangements for a typical person, Alice: all the places her medical record would go. If you visit the website, you can click on one of the circles and it will tell you the documentation we have for each of those sources. One of the things that's interesting, other than the sheer volume of places your health record could go, is that only about half of those records are actually covered by HIPAA. For the other half, there are no rules that govern them at all. And what you see in the middle is this thing called discharge data. How many of you know what hospital discharge data is? Well, a couple of people. Let's see if you're in it. How many people have ever gone to a hospital or to a physician's office in the United States? Okay, so a copy of that information is in hospital discharge data. These are statewide collections, and the dashed line means that when a state shares it or gives it away, it's de-identified; that is, it doesn't have your name or address on it. So we wanted to know, how good is that? All 50 states collect this data, and 33 states sell it or give it away.
But at that time, only three states did so in a way that was as strong as HIPAA. So now this is an interesting question. Do the other 30 states know something? Is HIPAA too stringent, messing up data more than it should, so that we should lower the federal standard? Or are these 30 states putting the data more at risk? To answer this, for 50 bucks, we went and

bought this data from the state of Washington. It included millions of hospital visits, and they reported 99.99% compliance. It had about 300 fields; this is an example of some of them. It includes diagnosis codes, procedure codes, a breakdown of charges, and some demographics. And what we wanted to know was, could someone who knew something about you find you in the data? It might be your credit card company, it might be your employer, it might just be your neighbors being nosy, or family or friends. Could they actually figure out which of these millions of records is yours? Harvard had a database of old newspaper archives that included papers in Washington, and it had about 82 articles of these kinds of blotter stories, reporting people who had to go to a hospital for whatever reason, whether a traffic accident or a shooting or what have you. And we were interested in asking: how unique is that blotter story? Does it match exactly one and only one of these millions of records? So we began popping them in. And without using any statistics, just a one-to-one match, one and only one record showed up for 43% of the samples. So Washington state did respond. You can imagine getting that phone call from me. [LAUGH] >> [LAUGH] >> But they did respond in a good way, and that is, they changed their laws. Within a year of this study, they changed the law so that you can still, for $50, get the hospital discharge data from the state of Washington, but when you do so, you can't do this experiment again; it won't work. If you need the more detailed data, there's a more rigorous application process you have to go through to get it. The bad news, though, is that after the experiment, of the 30 states with these kinds of practices, only Washington state and California changed their rules.
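The matching itself needed no statistics: take the facts a blotter story reveals and count how many discharge records are consistent with them. A sketch with hypothetical fields and records (the real data had roughly 300 fields per visit):

```python
# Hypothetical hospital discharge records. Field names and codes are
# invented for illustration; they are not the Washington state schema.
discharges = [
    {"zip": "98101", "age": 34, "sex": "M", "month": "2011-06", "diag": "S06"},
    {"zip": "98101", "age": 34, "sex": "M", "month": "2011-07", "diag": "S52"},
    {"zip": "98052", "age": 61, "sex": "F", "month": "2011-06", "diag": "I21"},
]

# Facts a newspaper blotter story might reveal about one patient.
story = {"zip": "98101", "age": 34, "sex": "M", "month": "2011-06"}

def matches(story, records):
    """All discharge records consistent with every fact in the story."""
    return [r for r in records
            if all(r[k] == v for k, v in story.items())]

hits = matches(story, discharges)
if len(hits) == 1:
    # One and only one record fits: the story re-identifies that record,
    # exposing everything else in it (diagnoses, charges, and so on).
    print("unique match:", hits[0]["diag"])
```

In the study, this one-and-only-one condition held for 43% of the news stories tried, which is what forced the law change.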
Unlike what happened with HIPAA, where one experiment toppled practices around the world, in the other states we see a lot more rigidity. Some of that rigidity is because there's a lot of money in data, and so it's a lot harder for one demonstration of a problem to replicate. So Jesu, who's here somewhere and works with me, began tackling them state by state. She just did Maine and Vermont, so we're gonna see how many of the 30 she has to do before they all adopt better practices. We do thank her for that, because it raises the privacy bar for all of us. >> [APPLAUSE] >> Where are you, Jesu? All right, maybe she stepped out. Anyway, she's over there. [LAUGH] Okay, so health privacy is also up for grabs. This is an example from Alvaro Bedoya. He's at Georgetown, and he points out what happens if you have a program that's gonna recommend people for employment. The first time around, it gives you all of these potential applicants, and the employer selects some and rejects others. It's a learning algorithm, so over time it's only gonna bring you young people, because that was the bias in your selections. And I use that example to say yet another of our rules, equal employment, is up for grabs. We recently published a paper, when we followed the 2016 election, that pointed out that 36 state election websites allowed hacking through identity theft to impact voters' ability to cast votes. So even elections are up for grabs. In fact, I would use those examples to say every democratic value is up for grabs by what technology allows or doesn't allow. And what's scary about it is, it's sort of like we're in a car, we're going for this ride, but nobody is driving. It's like people are taking turns randomly. So what do we do about it? When I was at the FTC, fraud was one of the things the FTC looked at.
And we did yet another experiment. We came up with this idea of a calculation called the exclusivity index of a domain. It basically asks: what percentage of the people visiting a domain come from one group, beyond what you'd see from other like groups? The reason exclusivity is useful is that when you look at households visiting different domains on the Internet,

the first ten are all kind of the same: Twitter, Facebook, Google. But as you get further down, people go to particular places, particular kinds of communities, for more intimate conversation, more trust. And fraud, at the FTC, is about the kind of trust where you open up your savings account and you're willing to give the person all your money. It's not really the spam stuff; it's that when you're having a conversation among like-minded people, you're much more likely to have trust. And so the question is, could we find these pockets? So we created this index, and the closer the value is to one, like in this example, the more unique that domain is to your group. This is an example with households with children: people who have children go to websites that people who don't have children never see, and that kinda makes sense. And we can do this also by race. The red line is two standard deviations above, and we see particular domains by race that also make sense. But more importantly, we were able to use those domains to track fraud. That is, to actually find a website that's more likely to have fraud, and be able to find evidence of it there. I've walked through that quickly because I wanna get to the end and I'm running out of time. But I wanna use that example to say how we're able to do experiments, to harness technology, for the good of the public interest. That is, we already have rules, and we already have helpers. The question is, how do we help the helpers? How do we help the advocates, the regulators, the journalists do a better job? And as I look back over the examples I just gave you, you can see that really has been the hallmark of the work I described: somewhat simple experiments.
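The talk doesn't spell out the exclusivity-index formula, but the idea she describes can be sketched as a simple share computation: of all visits to a domain, what fraction come from one group of households? The domains and visit logs below are invented for illustration.

```python
# Invented visit log: (domain, household group) pairs.
visits = [
    ("kids-crafts.example", "has_children"),
    ("kids-crafts.example", "has_children"),
    ("kids-crafts.example", "no_children"),
    ("news.example", "has_children"),
    ("news.example", "no_children"),
    ("news.example", "no_children"),
    ("news.example", "has_children"),
]

def exclusivity(domain, group, visits):
    """Share of the domain's visits that come from the given group."""
    domain_visits = [g for d, g in visits if d == domain]
    return sum(1 for g in domain_visits if g == group) / len(domain_visits)

# Values closer to 1.0 mean the domain is visited almost exclusively by
# that group -- the "pockets" of trust where targeted fraud concentrates.
print(exclusivity("kids-crafts.example", "has_children", visits))  # ~0.67
print(exclusivity("news.example", "has_children", visits))         # 0.5
```

The actual FTC work normalized against like groups and flagged domains two standard deviations out; this sketch only shows the underlying share.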
I like to think I'm really smart, but the truth is these are really simple experiments. They have profound impact because they empower someone else to do their job better, or to take that message forward. So when I came back to Harvard after being at the FTC, I decided I didn't want it to stop there. These experiments are attainable by undergraduate students. So I taught a class called Data Science to Save the World. And I told the students, at the end of this class you're gonna do a project, and if your project is any good, I'll take you down to DC and give you an audience of regulators. And so I put aside enough money to take down two or three students, and I took down 26 students. It was pretty amazing. We set it up poster-session style, and it was totally electric. It was supposed to go for two hours; it went for four hours. The typical regulator is a middle-aged white guy who has the responsibility of regulating the technology but who really doesn't understand the technology or know much about it. Meanwhile, you have these young students who are early adopters of technology, who are right on the front line, who know the technology intimately. And so the exchange was very, very powerful. The students really did feel as though they had impact, and the regulators learned critical information. We decided when we left that we weren't gonna let it stop there. We should find a way to memorialize this, to keep repeating this process. So we had the idea of starting a new journal called the Journal of Technology Science. It was actually called the Harvard Journal of Technology Science at first. And then Harvard says, you can't use our name unless you go get a board of people who say the papers are good enough to use Harvard's name. It doesn't matter that you're a Harvard professor; you have to go get another group of people. Well, I did go get this other group of people.
I asked 50 researchers from around the world who work in this area. Forty-eight of them said yes, and two said, I would do it, but I'm doing something else, I'm a dean, or whatever. And with that kind of response, we realized it was a much bigger thing than a Harvard thing. So we dropped the whole Harvard name. [LAUGH] >> [LAUGH] >> And so the Journal of Technology Science has been publishing these kinds of papers, papers about technology's unforeseen consequences, from researchers, students, and scholars around the world. But the first papers, and I'll finish in just a second, the first papers were those that came from those 26 students. You can see all of the papers online, but I just wanna give you a sample of a few. This one was Daniel Rothschild's. Fraud enforcement at the FTC really was a chase-them-down kind of thing: enough people would complain that they had been defrauded, and then an investigator would come to see what was wrong. But by then, the fraudster is long gone. So Daniel simply built a really simple system that would identify, through tweets, where fraud is actually happening

and which website it's currently live on. And that has actually transformed the way the FTC does its fraud work; they use the software and have built on it since. Another group of students had recently taken the SAT, and they'd used Princeton Review, an online service, to tutor them for the SAT. And they remembered that Princeton Review would only give them a price once you gave them a ZIP code. So they mined out the prices for all 33,000 ZIP codes in the United States. And they were able to show, at scale, that people in New England pay a lot more than people in San Francisco for the same service. But it's not everyone in New England paying the higher price. It's primarily communities that have an Asian population; or said the other way, an Asian family is almost twice as likely to pay the higher price. They ended up on the Today Show and so forth. Similarly, a group of students showed price discrimination on Airbnb. And if you're here from Airbnb, you get a thumbs up, because a lot of this is basically shaming companies into something, and Airbnb responded in a great way: they now do price recommendations, for example. This will be the last one I show you, which is Facebook. Facebook had a feature that most of the industry media called a bug: Facebook Messenger would report your location as you sent out messages. And so Aran built a little plug-in that you could load in your browser, and then you could track people as they sent out Facebook messages. And within nine days, Facebook fixed this bug, I mean feature, [LAUGH] and corrected it, something they hadn't been able to do for many years. But what made Aran's story, I think, even more powerful was the simple fact that he had been given a summer fellowship.
He arrived here in Silicon Valley, and the day before he was to show up at Facebook, they cancelled his summer fellowship. But no, don't feel bad about it. Within the week, somebody else hired him at almost twice the pay. He ended up with a TED Talk and a Time Magazine article, and now he works for Amazon. He's in good shape. But those examples do point us to the impact that simple, well-done experiments can have. And this talk isn't actually about me, and it's not even about my students; it's about you. All of you are empowered with the ability to do those experiments and to save the world. Thank you. >> [APPLAUSE] >> So we have time for just really one great question, and I know there will be many. So who wants to ask the question? Before lunch. >> But we will not leave until the question- >> That's right, they're staying here. >> [LAUGH] >> So if you wanna do us all a favor and get to lunch. >> There's a question there at the back. >> We have a few hands up. >> Yeah. >> Here we go. >> [LAUGH] >> I actually work on the Alexa project, so thanks for- >> [LAUGH] >> So GDPR just came out, and a lot of companies are scrambling to be compliant with the new European regulations. And from my side of the story, it feels like there wasn't a study done to understand the technology before the policy was released. I'm just wondering what your thoughts are on how we make sure that policy and technology go at the same pace. >> Well, policy and technology will never go at the same pace. Policy is a function of months and years, and technology is a function of days. So there is an automatic, simple mismatch.
And what the EU is trying to do is come up with draconian policies, draconian in the sense that they're massive, so that there's no wiggle room. Just to give you an example, to level-set everyone: if the idea of privacy laws is to cover a naked body, the 2,167 privacy laws we have in the United States are like little dots that clump up on 50 spots on that body, leaving almost the entire body exposed. The EU approach is to drape the entire body with a law, like a cloth, to give it protection. This difference is really important, because in the United States, where there are no rules, there are no laws, to prohibit anything,

then it means anything goes until it doesn't. But that's only right up until you hit the border of the EU, at which time their laws apply. And they're making a bet that if you want the EU market, instead of having two versions of something, you're gonna change the one version and therefore raise the bar for everyone. And so I often say the difference between these legal paradigms is why Mark Zuckerberg could introduce Facebook in Cambridge, Massachusetts, but not Cambridge, England. He was a student, and he decided to create a platform where students could share this kind of information. But in Cambridge, England, that kind of information was covered even then, and he would have had to go get permission. And maybe he would just not have taken the time; he would have done something else instead. So this lack of coverage in our privacy laws has given our technology companies an advantage of a kind. But our technology companies now face a series of disruptions unless they get better at predicting and controlling these consequences. >> Thanks very much, Latanya. This has been a- [CROSSTALK] >> [APPLAUSE] >> So glad you're here. >> Thank you.