15 September 2010

Limited data retention through selective data degradation

The social web can only thrive if its participants are willing to share personal data, data about themselves, with each other. So you have an account with some social network (Twitter, del.icio.us, LinkedIn, etc.) in order to allow others to read your Tweets, peruse your presentations or, quite generally, find out who you are and what you do. With the advent of the semantic web, of systems that can make inferences on the basis of the data that are fed to them, this is all the more true. Individual users profit from the services that the web offers them, often for free; the service providers profit mainly from the advertisements that accompany their services. Although there are other business models, this seems to be the prevailing one.

So far so good then. But what if service providers sell the data they have acquired in the course of their business to other providers; or worse still, what if these data end up in the hands of others because of clumsiness (a stolen USB stick, a lost laptop) or criminal intent (hacking servers, bribing personnel)? Admittedly, you may decide to shut down your Facebook account or give up Twittering, but this freedom of choice is absent for many services. What about your loyalty card with your favourite grocery store, which not only registers your purchasing behaviour but also gives you access to sizeable discounts; or your public transportation travel pass, a system recently introduced in The Netherlands, which allows you to travel throughout the country with one card but registers routes and start and end times in its database; or a road use system installed in your car which helps prevent traffic congestion but does so by logging your car's GPS track data in its central database? In each of these examples - and many more can easily be given - data about a person are logged into a database and it is not transparent to the individual who provides the data what the associated privacy risks are.

In 1981 member states of the Council of Europe signed a ‘Convention for the Protection of Individuals with regard to Automatic Processing of Personal Data’. Among other things, it stipulates that no more data may be stored than needed for a particular, identified purpose, and that those data may not be kept for longer than strictly needed. The lack of transparency compels the individual to simply trust the database manager to abide by these rules. Experience teaches us that this trust is often misplaced, even if we ignore cases of intentional theft and accidental loss of data. The issue of whom to trust with what data is a complex one. It touches upon the closely related questions of what data to collect and whom to allow to access them. The other day, I read a PhD thesis that sheds an interesting light on the first of these questions (Harold J.W. van Heerde (2010) Privacy-aware data management by means of data degradation; making private data less sensitive over time. Universiteit Twente).

Ignoring for now the possibility of granting differential access rights, someone can decide to make particular personal data available or decide not to do so. A LinkedIn profile may contain a photo but need not. Something similar goes for the data that are collected through someone's public transport travel pass. If one uses the pass, route and time data will be collected and stored. While a user can still decide to remove or replace her photo, the storage of travel data is fully beyond her control. The point to note here is that the decisions are all-or-none decisions. Someone’s photo is there or it isn’t, travel data are logged or they aren't. There is no middle ground.

Van Heerde shows that a sensible middle ground does exist. He introduces a limited retention principle, meaning that data degrade over time. So, the public transportation database may remain fully intact for a month to allow sending out bills. The data may subsequently be degraded to the level of the route and the day of the week on which someone has travelled, to allow sending out special offers. This level of detail is maintained, say, for a year. After one year only the cumulative frequency of use of each route per day of the week, decoupled from the individual, is still available. This still allows the statistical analysis of travel data, for planning purposes for instance.

Data degradation allows for a more subtle marriage of the interests of the individual (new, better, cheaper services) with those of the service provider (a more efficient and effective business). Of course, there are all sorts of theoretical and practical problems to be dealt with. Van Heerde discusses many of them and also suggests how to solve them. For me, the importance of his contribution is his description of how one may come one step closer to heeding the Council of Europe's admonition to store just enough data for just long enough. This is in the interest of both web service users (aren't we all) and web service providers.
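To make the travel pass example concrete, here is a minimal sketch in Python of such a degradation schedule. It is my own illustration and not Van Heerde's implementation: the names `Trip`, `degrade`, `FULL_DETAIL` and `REDUCED_DETAIL` are hypothetical, and the two retention periods are simply the "month" and the "year" mentioned above.

```python
from collections import Counter
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed retention periods, taken from the travel pass example in the text.
FULL_DETAIL = timedelta(days=30)      # full record kept, e.g. for billing
REDUCED_DETAIL = timedelta(days=365)  # route and weekday kept, e.g. for offers


@dataclass(frozen=True)
class Trip:
    traveller: str
    route: str
    weekday: str
    # Bookkeeping timestamp that drives degradation. A real system would
    # have to coarsen this as well; it is kept exact here for simplicity.
    logged_at: datetime
    start: Optional[datetime] = None  # None once the record is degraded
    end: Optional[datetime] = None    # None once the record is degraded


def degrade(trips: List[Trip], totals: Counter, now: datetime) -> List[Trip]:
    """Apply one degradation pass and return the records that remain.

    Records older than REDUCED_DETAIL are removed and folded into `totals`,
    a count per (route, weekday) that no longer refers to any traveller.
    """
    kept = []
    for trip in trips:
        age = now - trip.logged_at
        if age <= FULL_DETAIL:
            kept.append(trip)  # level 1: untouched
        elif age <= REDUCED_DETAIL:
            # level 2: drop the exact start and end times
            kept.append(replace(trip, start=None, end=None))
        else:
            # level 3: only an anonymous aggregate survives
            totals[(trip.route, trip.weekday)] += 1
    return kept
```

Running `degrade` periodically over the stored trips, with one long-lived `Counter` for the aggregates, would move each record through the three levels as it ages. The sketch deliberately ignores the hard parts, such as making sure that degraded values are really gone from logs and backups.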