21 July 2011

Privacy and your online learner identity

This post is prompted by an article I happened to read in the Chronicle of Higher Education of May 15th, 2011, entitled 'Why privacy matters even if you have nothing to hide', written by Daniel Solove, a professor of law at George Washington University. It is a prequel to a book called Nothing to Hide. My interest in it stems from an article Adriana Berlanga and I wrote about online learner identities. The question we address there is how best to balance the need to know as much as possible about a lifelong learner, in order to offer him or her the best possible learning arrangements (in an online learning environment), with the justified worry that yielding all those data may easily invade that person's privacy.

Fundamental to our argument is the observation, made by many, that the online realm or cyberspace is increasingly a place where we lead our social lives, including our lives as (lifelong) learners and workers. Consequently, we need to build online identities. We dubbed such an identity an online learner identity in so far as it should allow us to 'live' in networked environments geared for learning and professional development (Learning Networks, if you like). However, since these identities are fragmented across the various social networking sites out there (Facebook, Google, Ning, LinkedIn, ...), it is difficult for an individual user to build, let alone maintain, such an identity. One needs to update various sites repeatedly and, even harder, one needs to imagine what big picture of oneself emerges this way. So technical solutions may be attempted that allow data to be exchanged automatically between those sites. Perhaps a kind of dashboard that aggregates data from various sources is a good idea. (This assumes the hosting parties would allow that, which does not go without saying, as sharing with such a dashboard site lowers their traffic and thus is not in their interest.) Also, a learning perspective is needed to dictate what data the dashboard should collect. Past education, for instance, seems more important than the kinds of movies one likes.

However, there is another issue that is inextricably linked to these technical and learning-theoretical issues. It is whether we as users of such a dashboard do indeed want to aggregate our existing fragmented identities. It does not go without saying that we do. Facebook, for instance, once was merely a fun site but has increasingly earned itself a bad reputation for revealing ever more data about its users without asking them explicitly beforehand. And every service Google offers us for free betrays Google's hunger for our (profiling) data. This should not come as a surprise, of course. Somebody has to foot the bill for the services provided to us. It turns out that we ourselves do so by giving up our data for free, allowing the Facebooks and Googles of this world to make money through targeted advertising and the selling of profiling data to third parties. But we need at least to ask whether this is the way we want it, for Facebook and Google but also for dashboard-like services that ostensibly only have the best intentions. At face value, this question is about privacy. Solove's article helps us understand it that way.

His point of departure is the often voiced argument that if you have nothing to hide, it is ok for the government to know anything there is to know about you. The counterargument is that this constitutes an invasion of your privacy. Parenthetically, in the discussion that follows the article someone rightly points out that privacy is a human right (Article 12 of the Universal Declaration of Human Rights) granted to you by birth, and that invasions thereof are a privilege that needs to be granted through proper argument, even to governments. However, to make the counterargument stick we need to understand what privacy is. Solove attempts to delineate the notion by using two metaphors, a quite ingenious move in my view. Some aspects of privacy are addressed by George Orwell in his novel Nineteen Eighty-Four, which describes an omnipresent state that watches our every step and stores it in huge databases. This is the surveillance aspect of privacy. The other metaphor comes from Franz Kafka's Der Prozess (The Trial). This is about someone who has to stand trial but has no idea what he is accused of, nor is he allowed access to the accusations and the reasoning behind them. This aspect of privacy Solove calls information processing. It addresses the government as a bureaucracy, which lacks transparency and refuses to be accountable for what it does with those data. He then argues: 'the problems [with privacy invasions] are not just Orwellian but Kafkaesque. Government information-gathering programs are problematic even if no information that people want to hide is uncovered. In The Trial, the problem is not inhibited behavior but rather a suffocating powerlessness and vulnerability created by the court system's use of personal data and its denial to the protagonist of any knowledge of or participation in the process. The harms are bureaucratic ones: indifference, error, abuse, frustration, and lack of transparency and accountability.'

So, one should not so much worry about the mere storage of data, that which George Orwell denounced, but about the subsequent processing of them in opaque ways, that which worried Franz Kafka so much. Processing comes in several forms. Aggregation is one of them: 'the fusion of small bits of seemingly innocuous data'. Aggregation may be objected to since the picture of someone that emerges after aggregation is not apparent in the constituent bits. 'The whole is more than the sum of its parts' sums this up nicely. Exclusion, preventing people 'from having knowledge about how information about them is being used' and barring them 'from accessing and correcting errors in that data', is another form. Exclusion goes to the heart of the Kafka objection. Job applicants whose application was turned down because they were unable to remove online pictures taken of them in a moment of weakness understand full well the harm exclusion can do. This problem is exacerbated when secondary use is made of those data, as the route from the misuse back to the data source is then even harder to trace. Distortion is a third form of data processing: stored data necessarily show only part of a personality, which may lead to a distorted picture of that person. When first impressions matter, as in job interviews, distortion can do much harm.

In the case of a learner's online identity, Adriana and I argue against the fragmentation of someone's identity across the various social media sites in existence. This is a variation of the distortion argument. Even if we admit that people may have good reasons to maintain several, separate online identities (one for work, one or more for leisure activities), what such an identity looks like should be under the identified person's control and only his or her control. After all, only that person can oversee the degree and kind of allowable distortion. Thus, the practical argument we leveled against fragmentation proves to have a privacy aspect as well. This brings us to the exclusion argument. People need to have access to the data stored about them to correct those data, extend them, prune them, etc. In our paper, we offered a practical argument for this, arguing that people should be able to build an online identity qua learner that suits their learning and professional development best. This argument too turns out to have a privacy twist to it, namely that control over one's data is a matter of principle (privacy) and not only of convenience. And finally, the defragmentation that we argued for is of course a form of aggregation. However interesting the technical challenges of overcoming fragmentation may be, and however useful defragmentation may be from a learning perspective, it inevitably also impacts our privacy. That is the key value of Solove's argument.

Solove thus exposes the nothing-to-hide argument as too simplistic. Privacy is a multifaceted thing; the nothing-to-hide argument only addresses the data surveillance aspect of it, not the data processing aspect. Data processing itself is complex, encompassing such things as aggregation, exclusion and distortion. Any one of these impinges on efforts to arrive at the consolidated online learner identity we argued for in our paper. Solove, in focusing on debunking the nothing-to-hide argument, does not offer any solutions for safeguarding someone's privacy against the aggregation, exclusion and distortion of their data. But perhaps this cannot be discussed in general terms; perhaps it can only be understood in a concrete case such as building a consolidated digital identity for learners. If so, his refined understanding of what privacy is about should help us do so. It should help us to reap the benefits of online learning while giving due attention to the privacy challenges that come in its wake.

October 22, 2012. Note added after publication: It has come to my attention that there is an EU funded, 7th framework project that goes by the name of Trusted architecture for securely shared services (TAS3). I quote from their summary: 'TAS3 will develop and implement an architecture with trusted services to manage and process distributed personal information. [...] TAS3 will focus an instantiation of this architecture in the employability and e-health sector allowing users and service providers in these two sectors to manage the lifelong generated personal employability and e-health information of the individuals involved.' This sounds like an architecture that should also work for online learner identities, even though TAS3 will focus on data in offline databases and we are more interested in online databases (behind social media interfaces). Second, the EIfEL team has published a blog post with the intriguing title: 'To create a trustworthy Internet respectful of our privacy, shouldn't we simply make our personal data public?' Without going into detail, their solution is to spread your personal data over various sites, but anonymously. You as the owner keep a bundle of private keys through which you can grant access to those data in a piecemeal fashion. This way, you can allow access to whomever you want and deny it to everybody else. Quite ingenious, although I am not sure Facebook and Google would like the idea of only having uninformative bits and pieces of your personal profile data hidden behind an alias. Even so, Google just said they are considering allowing aliases on their Google+ service.

10 July 2011

Educational Data Mining Conference 2011, Eindhoven

Since it was practically around the corner and I'd been wanting to acquaint myself with the latest news on educational data mining for some time already, I decided to spend three days at the Educational Data Mining conference, which was held in Eindhoven, July 6 through 8, 2011. Having sat it all out, I have to say that my feelings are mixed: I saw very good stuff and some work that makes you wonder. A couple of general observations first, then some details on a few papers and posters. A confession at the outset: I am interested in informal (non-formal) kinds of learning, so I was specifically on the lookout for uses of data mining that would foster this kind of learning.

First, the EDM community is heavily dominated by people of US extraction. That inevitably brings a bias in that 'educational' is surreptitiously being defined as 'in accordance with the US educational system'. This is not necessarily bad, but it is something to keep in mind. Second, and perhaps as a consequence of this, data mining seems almost congruent with intelligent tutoring systems (ITSs). Even though the title of a paper may suggest something different, ITSs are never far away. Third, and most importantly in my view, the conference's take on what data there are to mine is a very narrow one. This is connected to its narrow view of what constitutes education: school-based, teacher-led formal learning, with no concept of other forms of learning. This may simply be a choice, which already narrows down the field. However, it gets worse, as within the confines of formal learning the sole educational model is that of the teacher as the sage on the stage, who may be assisted by ITSs that relieve him or her of some of the drudgery of repeatedly having to answer the same questions. I am exaggerating, true, but not all that much. My main problem with this is that to the extent that EDM is successful, it acts as a conserving force, reinforcing received testing methods and having little eye for educational innovation. It follows that I do not see EDM, as espoused at the conference, as an innovation of education; at best it offers innovative methods to support traditional forms of learning.

That being out of the way, there were several papers and posters of interest to be seen and heard at the conference. A few observations on just three of them. First, Kelly Wauters et al. from K.U. Leuven discussed a novel means of rating proficiency in their 'Monitoring Learners' Proficiency: Weight Adaptation in the Elo Rating System'. They took the Elo rating system that chess players use, modified it, and applied it to rate proficiency on learning items. If you want to sequence learning items adaptively, you not only need to know how 'difficult' the items are, but also how good someone is at particular ones. That way, you can provide learners with items whose difficulty matches their proficiency. Second, and breaking away from tradition, Worsley and Blikstein from Stanford also worry about proficiency or expertise. They wonder 'What is an Expert?' and seek an answer in the use of 'Learning Analytics to identify emergent markers of expertise through automated speech, sentiment and sketch analysis'. Thus they look at, say, speech utterances and sketches to acquire an impression of someone's expertise in a particular subject. Interestingly, both novices and experts display little uncertainty: the former because they are sure they do not know, the latter because they are sure they do. Both (short) papers are fun for their innovativeness. They are also useful in the context of informal learning (in, say, Learning Networks), as they provide means to characterize learners' expertise and thus to help them better.
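To make the Elo idea above a little more tangible, here is a minimal sketch in Python of how a plain Elo update could rate a learner against a learning item. It is not the method of Wauters et al.: their contribution concerns adapting the weight, whereas this sketch assumes a constant weight k and borrows the chess convention of a base-10 logistic curve with a scale of 400. All function names and numbers are hypothetical and chosen for illustration only.

```python
from typing import Mapping, Tuple


def expected_score(ability: float, difficulty: float) -> float:
    """Estimated probability that the learner answers the item correctly.

    Assumption: the base-10 logistic curve with scale 400 used in chess.
    """
    return 1.0 / (1.0 + 10 ** ((difficulty - ability) / 400.0))


def update(ability: float, difficulty: float, correct: bool,
           k: float = 32.0) -> Tuple[float, float]:
    """Return the new (ability, difficulty) after one answer.

    Assumption: a constant weight k; the learner and the item are treated
    as two 'players', so what one gains the other loses.
    """
    outcome = 1.0 if correct else 0.0
    delta = k * (outcome - expected_score(ability, difficulty))
    return ability + delta, difficulty - delta


def pick_item(ability: float, items: Mapping[str, float]) -> str:
    """Choose the item whose difficulty rating is closest to the ability."""
    return min(items, key=lambda name: abs(items[name] - ability))


if __name__ == "__main__":
    learner = 1500.0
    items = {"fractions": 1200.0, "ratios": 1480.0, "calculus": 1900.0}
    chosen = pick_item(learner, items)
    learner, items[chosen] = update(learner, items[chosen], correct=True)
    print(chosen, round(learner, 1), round(items[chosen], 1))
```

A correct answer nudges the learner's rating up and the item's difficulty down, so that the next item can again be picked close to the learner's current level.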

Third and finally, there was a nice poster by Anna Lea Dyckhoff from the computer supported learning group, informatics at RWTH Aachen, practically our neighbours at OUNL. She is developing a learning analytics toolkit (eLAT), still in its infancy, that allows teachers to gauge their students' interaction with the content in Personal Learning Environments (PLEs). I am not sure whether the use of the term teacher in connection with a PLE is entirely fortunate - after all, if PLEs are really personal they must by definition also refer to informal learning situations in which the role of teachers is not self-evident. However, such toolkits are very valuable as they provide a means to help learners who self-guide their learning, or so I would hope. In this same vein, R. Pedraza-Perez et al. from Cordoba, Spain offer a Java desktop tool to mine Moodle log files, and García-Saiz et al. from Cantabria have built an E-Learning Webminer (EIWM) that, by discovering students' profiles, is intended to help them navigate and work in distance taught courses.