The GDPR forces us to look at our data and categorise it as personal, personally identifiable or everything else (keeping in mind that what was once impersonal can become personally identifiable in association), but often we don't question why we collect and store this information. Existing Data Protection legislation already requires that only necessary data be collected, and that it be kept only for as long as it is necessary. Rarely do we consider whether data or metadata is useful in itself once we add it to our model and data stores. Often we start collecting it for some future use which is neither clear, decided nor planned; and once we have it we keep it because it's data and must be valuable.
I'm suggesting that we not collect common personal categorisation data unless there is an overriding need, and that in the overwhelming majority of cases there is no such need. This thought was provoked most recently by this Tweet.
My initial response was:
Because it is straightforward to not engage in writing systems and applications that can be used to aid prejudice and foster division, but it's very hard to avoid modelling and designing into data stores categorisations that can be used in ways which are prejudicial to the owners of that personal data.
But what about needing to know who is affected by this or that prejudice and persecution so we can protect them and improve their lot? Surely we need to collect information to identify those parts of the population that need help? But do you need to count in order to know what is the right way to treat everyone? Is the act of counting and identifying itself the wrong?
For specific needs does someone really need to identify themselves as disabled, or are they really an individual with a requirement?
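As a minimal sketch of that distinction in a data model, a record can hold the concrete requirements a person has asked for rather than a category label. Everything here (the `Booking` class, the `rooms_matching` helper, the room names and features) is hypothetical and only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Booking:
    """Hypothetical booking record that stores requirements, not categories."""
    reference: str
    # Concrete things the service must provide; no 'disabled' flag is stored.
    requirements: Set[str] = field(default_factory=set)


def rooms_matching(booking: Booking, rooms: Dict[str, Set[str]]) -> List[str]:
    """Return the rooms whose features cover every stated requirement."""
    return sorted(
        name for name, features in rooms.items()
        if booking.requirements <= features
    )


# Illustrative data only.
rooms = {
    "A1": {"step-free access", "hearing loop"},
    "B2": {"hearing loop"},
}
booking = Booking("R-100", {"step-free access"})
print(rooms_matching(booking, rooms))  # ['A1']
```

The service can meet the need without ever recording why the person has it.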
Personally, I don't categorise myself in any ethnic or religious category on any form and would avoid gender and age if I could. As a modeller and architect, would I argue against collecting this data? I would now.
I would employ all the arguments about not collecting personal data unnecessarily. Does your application/system require gender to be relevant? Really? Age? Should the provision of public services need ethnic data? And so on.
Really ask if each one of these metadata categories is necessary, bearing in mind that each of the categories will likely come from a control list, plus 'other' perhaps. What purpose will be served? If it's for some broad population statistical use, ask how this category gives meaningful information that actually matters in a statistic. Take Male/Female (I won't fall back on self-described gender issues to begin with; the traditional simple case should suffice): how does it help knowing someone ticked either box? Will they buy or be interested in a different product or service? Will they want different information? Will the content be filtered?
If the answer is yes, I'd ask the question: so you wouldn't sell to someone of the wrong gender? Would you only show pink bikes to girls? Carbon fibre drop handlebars to boys? I'd hope not, so how does your case differ?
Are the services, products and information that interest me in any way absolutely connected with my age, gender, ethnicity or mobility? I don't think so. The actual services, products and information might very well have particular characteristics that I want to filter and search by, but not necessarily because I possess or share those characteristics.
If you're building recommenders, why limit, filter or weight your recommendations based upon any of these largely loose, non-authoritative categories? Aren't the behaviour and content far more important? Quite a while ago now, over ten years, we were involved in building a streaming personal video platform and the Advertising people wanted us to include something like 200+ questions on the user's likes and dislikes. That got rejected fairly quickly just in contemplating the registration funnel, but it did betray the assumptions of the time about how to collect and slice and dice population data: analyse the individual, get as much stuff about the individual from the individual and apply that to the content, product or service. It came straight out of the publishing industry, with the cards for subscribers to punch or circle their characteristics.
Now of course we apply it the other way round: we take the behaviour of the individual and their peers or group, their history of success, failure or abandonment, and the content, product or service chosen by that group, along with many other dimensions of behaviour, and apply it to new and evolving content, products and services. All that Big Data stuff. And we don't really need all that categorisation up front; in fact it could skew the results badly.
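To make the behaviour-first approach concrete, here is a deliberately tiny sketch of a recommender that looks only at what people chose, weighting other users by how many choices they share. It is an illustration under assumed data, not how any production system works; the `recommend` function, the user ids and the items are all made up.

```python
from collections import Counter
from typing import Dict, List, Set


def recommend(user: str, history: Dict[str, Set[str]], top_n: int = 3) -> List[str]:
    """Suggest items chosen by users whose history overlaps with this user's.

    Only behaviour (who chose what) is consulted; no age, gender or other
    categorisation is stored or used.
    """
    seen = history.get(user, set())
    scores: Counter = Counter()
    for other, items in history.items():
        if other == user:
            continue
        overlap = len(seen & items)  # shared choices act as the similarity weight
        if overlap == 0:
            continue
        for item in items - seen:
            scores[item] += overlap
    # Highest score first, then by name so the order is stable.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


# Illustrative data only.
history = {
    "u1": {"road bike", "helmet"},
    "u2": {"road bike", "helmet", "drop handlebars"},
    "u3": {"helmet", "pink bike"},
    "u4": {"novel"},
}
print(recommend("u1", history))  # ['drop handlebars', 'pink bike']
```

Nothing in the input says who "u1" is, only what they did, and the suggestions follow from that.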
But what of those organisations whose data sets were modelled and collected well before this Big Data magic, and which hold all this carefully researched (or not) data? Has it been reevaluated? Or is it sitting there in the rest of the data, consciously or unconsciously splitting your data sets into what might be irrelevant and even misleading subsets?
I think a great many of these characteristics should not be collected and stored and they should all be reevaluated periodically. I include official forms in this; actually, I especially include official forms in this.
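One way to make that periodic reevaluation routine is to keep a small register of every collected field alongside its stated purpose and the date it was last reviewed, and to flag anything with no purpose or an overdue review. The sketch below assumes a one-year review interval and invented field names; both are my assumptions rather than anything prescribed by legislation.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class FieldRecord:
    """Hypothetical register entry for one collected field."""
    name: str
    purpose: Optional[str]  # why the field is collected; None if nobody can say
    last_reviewed: date


def fields_to_question(
    fields: List[FieldRecord],
    today: date,
    max_age: timedelta = timedelta(days=365),  # assumed review interval
) -> List[str]:
    """Return names of fields with no stated purpose or an overdue review."""
    return [
        f.name for f in fields
        if not f.purpose or today - f.last_reviewed > max_age
    ]


# Illustrative data only.
register = [
    FieldRecord("email", "account login", date(2024, 1, 10)),
    FieldRecord("gender", None, date(2024, 1, 10)),
    FieldRecord("age_band", "legal age check", date(2021, 6, 1)),
]
print(fields_to_question(register, today=date(2024, 6, 1)))  # ['gender', 'age_band']
```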