Can bad data be good data? Reflections upon the Consumer Data Research Council Partner Forum

Image from https://www.cdrc.ac.uk/wp-content/uploads/2015/04/sustainability121714.jpg

by Ed Dargan*

The Consumer Data Research Council (CDRC), established by the ESRC, held the CDRC Data Partner Forum on 6th May at the Saïd Business School, University of Oxford. The key aim of the CDRC is to help organisations maximise the potential of innovation by opening up their data to trusted researchers, so that they can provide solutions that drive economic growth and improve our society. During the day, the presentations were based around three themes: missing data, data sources and research design.

For the retail demand modellers, three things were seen as important: including seasonal demand (especially for seaside locations), accounting for natural barriers, and basing travel times on real journey times. It was useful to see how different data values were being clustered to form classifications, as this is something that needs to be done with the footfall data available to IPM in the big data project we are just about to start with Springboard.

Data representation was an important theme. Missing data, both spatial and temporal, was identified as a challenge, and a number of techniques were presented to ‘fill in’ the gaps. A recurring problem was the use of census data, which is fixed at a point in time, when analysing concurrent data that is updated more frequently. Also identified was the accuracy problem of end-user supplied outcome codes, in this case failed delivery reasons.

With any spatial and temporal data, there is the challenge of providing a digestible visual display. With so much data available, this was acknowledged as a challenge that most of the presenters using geographical mappings faced.

As a data source, supermarket loyalty cards were discussed. Interestingly, it was found that loyalty card usage was least likely to occur for small and frequent purchases, no matter what type of store was visited or the socio-demographic classification of the customer. The map of users of a store showed a more dispersed geographical spread around the UK than expected. This highlighted the problem of customers failing to update their home address details when moving home and the subsequent difficulties in interpreting loyalty card spatial data.

However, when problems in the data were identified, this fed into the recurring observation that so-called bad data, that is, data identified statistically as problematic, should not always be removed or cleansed using missing data techniques. Instead, this so-called bad data could be the most interesting data of all for a researcher and/or commercial organisation. For example, studying people who don’t update their loyalty card details could lead to some very useful insights into such customers. Perhaps they are a very profitable segment?

Useful resources identified during the presentations included http://maps.cdrc.ac.uk, which includes views of geodemographic, retail and general metrics for the larger towns and cities. Various views are provided. One that seemed a useful barometer of high street health was the retail view, which for some towns (presumably only a few have the data available) shows changes to retailer types and vacancy rates over a set period of time.

Overall, it was a very good day. The presentations were very interesting and there was also the opportunity to meet and mix with other academics and business representatives.

Below is a list of the sessions and presentations:

Session 1: Missing Data and Missing People

Thomas Waddington – Modelling the temporal variation in supermarket revenue estimates

Eusebio Odiari – Infilling missing values in consumer Big Data

Michail Pavlis – The geography of non-delivery

Emily Sheard – Enumerating the ambient population in the context of crime

Guy Lansley, Chrysanthi Kollia – The spatio-temporal geodemographics of youth

Session 2: Novel Data Sources and their Geographic Integration

Nik Lomax and Martin Clarke – Home owner mobility: assessing distance and geodemographic consistency using consumer data

Hai Nguyen, Oliver O’Brien – Naming conventions and ethnicity

Guy Lansley, Wen Li – Areas and activities: integrating consumer registers

Alyson Lloyd, James Cheshire, Roberto Murcio – How representative are high street retailer data?

Anastasia Ushakova – Temporal patterns of energy consumption and vulnerable consumers

Tim Rains – Data linkage of store loyalty cards

Session 3: Big Data and Research Design

Alex Singleton, Bala Soundararaj: Dynamic high streets – SmartStreetSensor

Mark Birkin – Spatial microsimulation, big data and policy analysis: an example from the UK travel market consumer data

Phani Chintakayala – Do green attitudes and demographics drive sustainable product consumption?


*Ed Dargan is a PhD Student at the Institute of Place Management, Manchester Metropolitan University.

This article was first published on Prof Cathy Parker’s blog
