Like many of my recent pieces, this one stems from a Quora question: "What are some good examples of unstructured data domains where much more work is needed?" This is an interesting question. Businesses love to talk about the big data role of unstructured data -- so do I -- and the technical problems are far from solved.
The question isn't specific, so it allows a wide-ranging response. I’ll take it as an invitation to look at unstructured data domains in three senses of the term: data types and sources (obviously), deep information content (very important), and challenges posed by two of the much-talked-about big data Vs, variety and velocity (the third being volume).
More than text
Text is the most software-accessible and analytically useful form of unstructured data, but unstructured data comprises much more than just text. Electronically stored images, audio, and video are similarly unstructured. To a computer, a piece of content in these data domains is typically just a mass of bits. Descriptive metadata, minimally capturing file type and encoding, provides enough information to support content display and editing, but Photoshop, iTunes, and the Chrome Web browser don’t have a clue about the meaning of the information they render.
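To make the "mass of bits" point concrete, here is a minimal Python sketch of what descriptive metadata amounts to at its most basic: matching a file's leading bytes against a few well-known format signatures. The helper name `sniff_media_type` and the short signature table are illustrative assumptions, not part of any particular product, and the list is far from complete.

```python
# Hypothetical helper: guess a media type from a file's leading "magic" bytes.
# It identifies the container format only; it learns nothing about what the
# image or recording actually depicts.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def sniff_media_type(data: bytes) -> str:
    """Return a best-guess media type for the given leading bytes."""
    for magic, media_type in SIGNATURES:
        if data.startswith(magic):
            return media_type
    # WAV files are RIFF containers whose form type (bytes 8-11) is "WAVE".
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "audio/wav"
    # Unknown content: just a mass of bits.
    return "application/octet-stream"
```

A check like this is enough to choose a viewer or a decoder, which is roughly where the software's understanding of the content ends.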
To extract and exploit meaning -- parsing facts and opinions from a written sentence, recognizing a face in an image, detecting emotion in a recorded voice -- you need analytical technologies. Text analytics, particularly natural-language processing of text, is far more advanced than analogous techniques applied to images, audio, and video. Our first set of much-more-work-needed data domains, then, consists of nontextual forms of unstructured data.
What information do we seek in these sources?
- Images: We wish to determine "features" that include people, places, and objects, along with image context and in-image interrelationships (e.g., President Obama is delivering the 2012 State of the Union address, with the vice president and House speaker in the background). We also would like to infer information about the image-capture apparatus.
- Audio: We wish to determine who or what is heard in an audio stream; what activities, speech, emotions, etc. are captured; what interrelationships occur among the things recorded; and how the sound producers and their relationships change over the course of the recording. Speech is a special case of audio. It can be treated as spoken text with extra-textual information, such as emotion inferred from tone, pace, and volume.
- Video: This type of information is the cross-product of the information content of images and audio.
We’re nowhere close to doing a great job in these unstructured data domains. (And maybe, in a few years, we’ll be recording smells, tastes, and other sensations as unstructured data.) Work is still needed on text, too.
Going deep
Much more work is needed to go deep and get at the rich information content of unstructured sources. We focus a lot on the entities named in source materials (people, companies, locations, etc.) and not enough on facts, events, relationships, and fine-grained attributes such as sentiment and identity. More attention is now going to opinions, attitudes, and emotions (the subject matter of my upcoming Sentiment Analysis Symposium), which can be derived from text but also from speech and images.
In addition, much more work is needed on targeting genre (whether the material is explaining, informing, arguing, etc.), identity (what language use and style, or the captured physical characteristics, tell us about someone’s age, sex, and cultural or economic background), and narrative (how topics and attitudes change in the course of a multiple-participant exchange).
Lastly, putting aside volume, the two other big data Vs that define unstructured data domains -- velocity and variety -- need much more work.
Right time, right way
Data velocity refers to high rates of data production and transmission. Competitive pressure is making the ability to transform and analyze high-velocity data "in flight," whether in real time or at the "right time," into a big data imperative. Applications include online commerce, customer service and support, reputation management, and news-influenced automated trading.
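As a rough illustration of analyzing data "in flight," the Python sketch below keeps a running count of events (say, brand mentions) seen within a recent time window, updating as each event arrives rather than after the data has been stored. The class name `SlidingWindowCounter` and its interface are assumptions made for this example, and it assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events seen within the last `window_seconds` seconds.

    Illustrative only: assumes timestamps arrive in non-decreasing order.
    """

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp: float) -> int:
        """Record one event and return the count currently inside the window."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Discard events that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)
```

A reputation-management application might raise an alert when the value returned by `add` crosses a threshold, which is one simple reading of "right-time" analysis.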
Handling variety is similarly important. Enterprises need access to a spectrum of social, online, and enterprise data sources, and they need to handle (acquire, store, analyze, and disseminate) data of all types, including the unstructured sorts I’ve cited. They also need to be able to link data across sources and types the right way.
So we have a right-time velocity imperative that extends to as-produced unstructured data, and we have a right-way imperative: integrated, not siloed.
Let’s put a spin on the "right way" concept as it relates to data domains, unstructured or otherwise. Our analytical tools have been doing only half the job. We’ve been collecting and crunching data and delivering insights via interfaces. Yet dashboards, tables, and visualizations are often not the best way, or the right way, to deliver information. Take this article. It’s a written narrative. The panel I attended yesterday conveyed information via spoken presentations and conversations, along with presentation slides and nonverbal speaker gestures and expressions, all of which are "unstructured" in a machine sense but readily understandable by us humans, given the linguistic and cultural models we unconsciously apply to understand one another.
Another "right way" -- an unstructured data domain where much more work is needed -- involves natural-language generation by machines and more natural interaction of all "unstructured" sorts. Much more work is needed, indeed!