Following up on its recent decision to “anonymize” its server logs after 18-24 months, Google’s Global Privacy Counsel Peter Fleischer has posted an explanation of why Google captures and retains users’ search queries in the first place. While it is a good step towards greater transparency, it’s a bit unsatisfying. Let’s address each of the three points Fleischer makes.
Improve our services: Search companies like Google are constantly trying to improve the quality of their search services. Analyzing logs data is an important tool to help our engineers refine search quality and build helpful new services. Take the example of Google Spell Checker. Google’s spell checking software automatically looks at your query and checks to see if you are using the most common version of a word’s spelling. If it calculates that you’re likely to generate more relevant search results with an alternative spelling, it will ask “Did you mean: (more common spelling)?” We can offer this service by looking at spelling corrections that people do or do not click on. Similarly, with logs, we can improve our search results: if we know that people are clicking on the #1 result we’re doing something right, and if they’re hitting next page or reformulating their query, we’re doing something wrong. The ability of a search company to continue to improve its services is essential, and represents a normal and expected use of such data.
First, the claim that capturing and processing user data is “normal and expected” is fraught with assumptions about what people’s expectations are for the data-trail they leave behind. Many users don’t know they are leaving a data-trail at all. Many don’t know that such a trail is being collected. Many don’t understand that their data-trails from session to session can be aggregated, mined, and scrutinized. Many don’t understand that their data-trails can be retained for as long as Google wants. You get the point.
Second, and more fundamentally, if the first purpose of collecting user data is to improve services such as Spell Checker or to measure the relevancy of results, there is little need for personally identifiable information to be collected as well. Simply put, to accomplish these goals, Google does not need to collect the IP address associated with a search or click, just the search or click itself.
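To make this point concrete, here is a minimal sketch in Python. The record layout and field names (`ip`, `query`, `clicked_suggestion`) are my own assumptions for illustration; I have no knowledge of Google's actual log format. It simply shows that a count of which spelling suggestions get clicked comes out the same whether or not the IP address is kept.

```python
from collections import Counter

# Hypothetical log records. The field names are assumptions for illustration,
# not Google's actual log schema. The IPs come from documentation-only ranges.
RAW_LOG = [
    {"ip": "192.0.2.10", "query": "recieve", "clicked_suggestion": True},
    {"ip": "192.0.2.11", "query": "recieve", "clicked_suggestion": True},
    {"ip": "198.51.100.7", "query": "receive", "clicked_suggestion": False},
]

# Fields treated as personally identifying in this sketch (an assumption).
IDENTIFYING_FIELDS = {"ip"}


def strip_identifiers(record):
    """Return a copy of the record without the identifying fields."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}


def suggestion_click_counts(records):
    """Count, per query, how often the spelling suggestion was clicked."""
    counts = Counter()
    for record in records:
        if record["clicked_suggestion"]:
            counts[record["query"]] += 1
    return counts


# The same analysis run on a log that never stored the IP address.
anonymous_log = [strip_identifiers(r) for r in RAW_LOG]
```

Running `suggestion_click_counts` on `anonymous_log` and on `RAW_LOG` gives identical counts. Signals that span several requests, such as a user reformulating a query, would need some way to group requests together, which this sketch does not attempt.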
Maintain security and prevent fraud and abuse: It is standard among Internet companies to retain server logs with IP addresses as one of an array of tools to protect the system from security attacks. For example, our computers can analyze logging patterns in order to identify, investigate and defend against malicious access and exploitation attempts. Data protection laws around the world require Internet companies to maintain adequate security measures to protect the personal data of their users. Immediate deletion of IP addresses from our logs would make our systems more vulnerable to security attacks, putting the personal data of our users at greater risk. Historical logs information can also be a useful tool to help us detect and prevent phishing, scripting attacks, and spam, including query click spam and ads click spam.
I’ve been researching and writing about search data retention issues for a while now, and honestly, this is the first time I’ve heard the “security and fraud prevention” argument. I suppose I can understand the usefulness of having logs of server activity in case some kind of security or fraud concern arises, but if that’s the case, the data should only be accessed in such cases. It seems Google has not placed such restrictions on the use of server logs, and I fear this reasoning is meant to scare users into acquiescence more than anything else.
Comply with legal obligations to retain data: Search companies like Google are also subject to laws that sometimes conflict with data protection regulations, like data retention for law enforcement purposes. For example, Google may be subject to the EU Data Retention Directive, which was passed last year, in the wake of the Madrid and London terrorist bombings, to help law enforcement in the investigation and prosecution of “serious crime”. The Directive requires all EU Member States to pass data retention laws by 2009 with retention for periods between 6 and 24 months. Since these laws do not yet exist, and are only now being proposed and debated, it is too early to know the final retention time periods, the jurisdictional impact, and the scope of applicability. It’s therefore too early to state whether such laws would apply to particular Google services, and if so, which ones. In the U.S., the Department of Justice and others have similarly called for 24-month data retention laws.
One can’t argue against the need for Google to comply with data retention laws… other than by fighting to prevent such laws from being enacted.
In sum, I applaud Google for trying to be more transparent about why it collects user data and what it does with it, but it still keeps much in the dark. I renew my call for a formal Google Data Privacy Center, where users can view all the data collected about them, selectively edit and purge that data, see exactly how it is being used and selectively restrict that use, and view any third parties the data is being shared with and selectively restrict such transfers.
UPDATE: It appears I haven’t been critical enough of Google’s third point above. Commenters familiar with the EU Data Retention Directive point out below that the law covers only “communications,” and not necessarily search queries. And Seth Finkelstein points to this Ars Technica post, which reminds us that in the US, of course, data retention laws don’t even exist yet, and certainly wouldn’t be retroactive if passed. So how do laws that apparently don’t exist compel Google to keep user data? Got me….