Generating datasets with the same summary stats but very different graphs

I saw this on Boing Boing thanks to Cory, who found it via jwz. At Cory’s book talk last night in DC, he made an observation about morality and algorithms: data is the more salient part of the questions around algorithmic transparency and fairness. This research shows how far data can be stretched and pulled while still maintaining some high-level coherence.

I may try to tease out an essay on big data, algorithms, and morality, asking whether we can formulate a framework, similar to Big O and space/time complexity, that helps us reason about these issues. Are there ways to characterize the variance and bias in data that could produce telling differences under the same algorithm?

Datasets which are identical over a number of statistical properties, yet produce dissimilar graphs, are frequently used to illustrate the importance of graphical representations when exploring data. This paper presents a novel method for generating such datasets, along with several examples. Our technique varies from previous approaches in that new datasets are iteratively generated from a seed dataset through random perturbations of individual data points, and can be directed towards a desired outcome through a simulated annealing optimization strategy.

Source: Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing | Autodesk Research
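
To get a feel for the method the abstract describes, here is a minimal Python sketch of the idea as I read it: nudge one point at a time, keep the nudge only if the summary statistics stay put, and let a cooling "temperature" occasionally accept moves that do not help. This is not the authors' code. The function names, the circle target, the linear cooling schedule, and the 0.01 tolerance are all my own assumptions, chosen for illustration.

```python
import math
import random


def summary(points):
    """Return (mean_x, mean_y, sd_x, sd_y, pearson_r) for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return (mx, my,
            math.sqrt(sxx / (n - 1)), math.sqrt(syy / (n - 1)),
            sxy / math.sqrt(sxx * syy))


def same_stats(a, b, tol=0.01):
    """True if every statistic differs by less than tol (an assumed tolerance)."""
    return all(abs(p - q) < tol for p, q in zip(a, b))


def circle_distance(point, cx, cy, radius):
    """Distance from a point to the nearest spot on the target circle."""
    return abs(math.hypot(point[0] - cx, point[1] - cy) - radius)


def anneal(seed_points, iterations=20000, step=0.1, radius=1.5,
           t_start=0.4, t_end=0.01, rng=None):
    """Pull points toward a circle while holding the seed's summary stats fixed.

    Hypothetical sketch: the target shape and all defaults are assumptions.
    """
    rng = rng or random.Random(0)
    target = summary(seed_points)
    cx, cy = target[0], target[1]  # center the circle on the data's mean
    points = list(seed_points)
    for i in range(iterations):
        # Linear cooling: early on, accept more moves that do not help.
        temp = t_start + (t_end - t_start) * i / iterations
        idx = rng.randrange(len(points))
        old = points[idx]
        new = (old[0] + rng.gauss(0, step), old[1] + rng.gauss(0, step))
        closer = (circle_distance(new, cx, cy, radius)
                  < circle_distance(old, cx, cy, radius))
        if not (closer or rng.random() < temp):
            continue
        points[idx] = new
        # Undo the move if it drifts the statistics away from the seed's.
        if not same_stats(summary(points), target):
            points[idx] = old
    return points
```

Calling `anneal(seed)` on a list of `(x, y)` tuples returns a new list of the same length whose means, standard deviations, and correlation stay within the tolerance of the seed's, even as the points migrate toward the ring. Plotting the two lists side by side is the quickest way to see the effect.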
