Tuesday, March 15, 2011

Deb Roy on wordling junior's life

Stumbled upon a fascinating, data- and visualization-intensive line of longitudinal work underway at Deb Roy's group at the MIT Media Lab, which prompted me to share some quick thoughts. Watch it here:
[Embedded video: Deb Roy's TED talk, "The birth of a word"]
The talk is brilliantly structured: it starts with an emotional moment, walks through some stunning data visualizations, transitions (with no small amount of chutzpah) into a pitch for Bluefin Labs, his startup, and ends on a note that is personal yet transcendental.

Scientifically, the work itself is, IMHO, best treated as a glimpse into the kinds of hypotheses that can now be tested, rather than a definitive statement on language acquisition patterns in children. Further, I've only watched the TED talk, and 18 minutes is not enough time to spell out caveats, even obvious ones. I imagine that among the various controls, they would in fact have segmented and separately treated adult-to-adult vs. adult-to-child utterances of 'water', since implicit and explicit learning presumably have different mechanisms. But I'll refrain from speculating further without digging deeper.

Technically, managing parallel feeds of audiovisual data does not seem straightforward in the least. Further, segmenting humans from cluttered fisheye scenes, or target words from natural speech, across 100 TB of recordings, is still a pretty big deal despite 50 years of AI research. Besides, it may well be the first graph in a TED talk with error bars! Respect!

Commercially, social media is one big-ticket application, but couldn't the same infrastructure be applied to support decisions in interviews or boardrooms, or evaluations in high-end schools or crèches? What else?

But what captured my fascination the most was the power of big data to accelerate discovery at an unprecedented rate.

On large datasets and fishing expeditions

Scientists are just beginning to appreciate the power of trawling the world for extremely large datasets and subsequently testing various hypotheses on small subsets of the data. This approach (sometimes derogatorily called a fishing expedition) stands in stark contrast to the classical scientific method, where an a priori hypothesis dictates the experimental design and the analysis procedure. The LHC experiment is one such expedition, fishing for postulated elementary particles.

To illustrate the power of the fishing-expedition approach, imagine if Deb had decided a priori that he wanted to study word utterance length over time. In a conventional longitudinal experiment he might have chosen a subset of words used in natural conversation, asked several child–caregiver pairs to come into a studio for an hour a day, and recorded their speech. He might then have analyzed the data, observed this U-shaped phenomenon, and reported it in a journal on language acquisition, where it would have promptly gathered dust. Even potentially interested colleagues might think several times before replicating the study with a different set of words, or correlating word utterance length with spatial context, since acquiring funding and approvals for a longitudinal study would be prohibitively difficult. Instead, with Deb's fishing-expedition dataset, which might become publicly available someday, any armchair scientist with computational resources can ask their own creative follow-up questions of the data with minimal barrier to entry!
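As a concrete (and entirely hypothetical) sketch of what that low barrier could look like: suppose the corpus were someday exported as a flat table of utterances with, say, columns for the word, a timestamp, and the utterance duration. None of these file or column names come from the actual project; they are invented for illustration. A few lines of Python would then suffice to re-ask the utterance-length-over-time question for any word:

    # Hypothetical follow-up analysis; the file name and columns are invented,
    # not the actual Speechome data format.
    import pandas as pd

    utterances = pd.read_csv("speechome_utterances.csv", parse_dates=["timestamp"])

    # How does the time taken to utter 'water' change month by month?
    water = utterances[utterances["word"] == "water"]
    monthly = water.set_index("timestamp")["duration_sec"].resample("M").mean()

    print(monthly)   # mean utterance duration per month
    monthly.plot()   # eyeball the 'gaga' -> 'water' trajectory

Swap in a different word, or group by the room in which the utterance was recorded, and you have a new study for the price of an afternoon.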

However, fishing expeditions have their downsides. First, and perhaps most obvious, is the risk involved: what if there are no fish to be found? In other words, what if a generic experimental design is not powerful enough to eliminate alternative hypotheses and leave only the ones being tested standing? Second, what if something is found but it is hard to tell with any confidence whether that something is, in fact, a fish? Put differently, does the subset of data relevant to a particular hypothesis being tested (such as a correlation between two variables) give us enough statistical power to falsify the corresponding null hypothesis? Last (and perhaps the hardest to spot), fishing expeditions may encourage scientists to operate in a complete vacuum of hypotheses (as opposed to designing for a multiplicity of hypotheses).
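The second worry is easy to demonstrate with a toy example: trawl a big enough table of pure noise for pairwise correlations and a healthy fraction will clear p < 0.05 by chance alone, unless the sheer number of hypotheses tested is accounted for. A rough sketch in Python (synthetic data, nothing to do with the Speechome corpus):

    # Fishing in pure noise: with enough variable pairs, 'significant'
    # correlations appear by chance alone.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 50))   # 200 samples of 50 unrelated variables

    hits = 0
    pairs = 0
    for i in range(50):
        for j in range(i + 1, 50):
            pairs += 1
            r, p = stats.pearsonr(data[:, i], data[:, j])
            if p < 0.05:
                hits += 1

    print(f"{hits} of {pairs} pairs look 'correlated' at p < 0.05")
    # Expect roughly 5% of the 1225 pairs -- fish-shaped driftwood, every one of them.

Corrections such as Bonferroni or false discovery rate control exist precisely to keep expeditions like this honest.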

What are some other fishing expeditions and their successes and failures? How is the Human Genome Project different from the LHC experiment?