Scouting the Next Frontiers for 'Unstructured' Data

Like many of my recent pieces, this one stems from a Quora question: "What are some good examples of unstructured data domains where much more work is needed?" This is an interesting question. Businesses love to talk about the big data role of unstructured data -- so do I -- and the technical problems are far from solved.

The question isn't specific, so it allows a wide-ranging response. I'll take it as an invitation to look at unstructured data domains in three senses of the term: data types and sources (obviously), deep information content (very important), and challenges posed by two of the much-talked-about big data Vs, variety and velocity (the third being volume).

More than text
Text is the most software-accessible and analytically useful form of unstructured data, but unstructured data comprises much more than just text. Electronically stored images, audio, and video are similarly unstructured. To a computer, a piece of content in these data domains is typically just a mass of bits. Descriptive metadata, minimally capturing file type and encoding, provides enough information to support content display and editing, but Photoshop, iTunes, and the Chrome Web browser don't have a clue about the meaning of the information they render.

To extract and exploit meaning -- parsing facts and opinions from a written sentence, recognizing a face in an image, detecting emotion in a recorded voice -- you need analytical technologies. Text analytics, particularly natural-language processing of text, is far more advanced than analogous techniques applied to images, audio, and video. Our first set of much-more-work-needed data domains, then, consists of nontextual forms of unstructured data.

What information do we seek in these sources?

  • Images: We wish to determine "features" that include people, places, and objects, along with image context and in-image interrelationships (e.g., President Obama is delivering the 2012 State of the Union address, with the vice president and House speaker in the background). We also would like to infer information about the image-capture apparatus.
  • Audio: We wish to determine who or what is heard in an audio stream; what activities, speech, emotions, etc. are captured; what interrelationships occur among the things recorded; and how the sound producers and their relationships change over the course of the recording. Speech is an audio special case. It can be treated as spoken text with extra-textual information, such as emotion inferred from tone, pace, and volume.
  • Video: This type of information is the cross-product of the information content of images and audio.

We're nowhere close to doing a great job in these unstructured data domains. (And maybe, in a few years, we'll be recording smells, tastes, and other sensations as unstructured data.) Work is still needed on text, too.

Going deep
Much more work is needed to go deep and get at the rich information content of unstructured sources. We focus a lot on the entities named in source materials (people, companies, locations, etc.) and not enough on facts, events, relationships, and fine-grain attributes such as sentiment and identity. More attention is going to opinions, attitudes, and emotions (and they are subject matter for my upcoming Sentiment Analysis Symposium), which can be derived from text but also from speech and images.

In addition, much more work is needed on targeting genre (whether the material is explaining, informing, arguing, etc.), identity (what language use and style, or the captured physical characteristics, tell us about someone's age, sex, and cultural or economic background), and narrative (how topics and attitudes change in the course of a multiple-participant exchange).

Lastly, putting aside volume, the two other big data Vs that define unstructured data domains -- velocity and variety -- need much more work.

Right time, right way
Data velocity refers to high rates of data production and transmission. Competitive pressure is making the ability to transform and analyze high-velocity data "in flight," whether in real-time or "right-time," into a big data imperative. Applications include online commerce, customer service and support, reputation management, and news-influenced automated trading.
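To make "in flight" a bit more concrete, here is a minimal Python sketch of analyzing messages as they arrive rather than after they have been stored. Everything in it is an assumption for illustration: the `Message` and `SlidingWindowMonitor` names, the tiny `NEGATIVE_WORDS` lexicon, and the premise that messages arrive in timestamp order. A production system would use a real streaming platform and a real sentiment model, but the shape of the idea is the same: keep only a recent window of data and update the answer with each new item.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    timestamp: float  # seconds since some arbitrary start
    text: str


# Hypothetical toy lexicon; a real system would use a trained sentiment model.
NEGATIVE_WORDS = {"broken", "refund", "terrible", "outage"}


class SlidingWindowMonitor:
    """Tracks the share of negative messages over the last `window_seconds`.

    Assumes messages are observed in non-decreasing timestamp order.
    """

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._window = deque()  # entries are (timestamp, is_negative)
        self._negative_count = 0

    def observe(self, message: Message) -> float:
        # Crude tokenization: split on whitespace, trim common punctuation.
        words = {w.strip(".,!?").lower() for w in message.text.split()}
        is_negative = bool(words & NEGATIVE_WORDS)

        self._window.append((message.timestamp, is_negative))
        self._negative_count += is_negative

        # Evict messages that have fallen out of the time window.
        cutoff = message.timestamp - self.window_seconds
        while self._window[0][0] < cutoff:
            _, old_negative = self._window.popleft()
            self._negative_count -= old_negative

        # The message just added is always inside the window, so no zero division.
        return self._negative_count / len(self._window)


monitor = SlidingWindowMonitor(window_seconds=60)
share_1 = monitor.observe(Message(0, "Great service"))
share_2 = monitor.observe(Message(30, "My order arrived broken!"))
share_3 = monitor.observe(Message(100, "Still waiting on a refund."))
```

Each call to `observe` returns the current share of negative messages in the window, which a customer-support or reputation-management application could compare against a threshold to raise an alert at the "right time."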

Handling variety is similarly important. Enterprises need access to a spectrum of social, online, and enterprise data sources, and they need to handle (acquire, store, analyze, and disseminate) data of all types, including the unstructured sorts I've cited. You also need to be able to link data across sources and types the right way.

So we have a right-time velocity imperative that extends to as-produced unstructured data, and we have a right-way imperative: integrated, not siloed.

Let's put a spin on the "right way" concept as it relates to data domains, unstructured or otherwise. Our analytical tools have been doing only half the job. We've been collecting and crunching data and delivering insights via interfaces. Yet dashboards, tables, and visualizations are often not the best way, or the right way, to deliver information. Take this article. It's a written narrative. The panel I attended yesterday conveyed information via spoken presentations and conversations, along with presentation slides and nonverbal speaker gestures and expressions, all of which are "unstructured" in a machine sense but readily understandable by us humans given the linguistic, cultural models we unconsciously apply to understand one another.

Another "right way" -- an unstructured data domain where much more work is needed -- involves natural-language generation by machines and more natural interaction of all "unstructured" sorts. Much more work is needed, indeed!

Seth Grimes,

Seth Grimes is a technology strategy consultant and a recognized expert on business intelligence and text analytics. He is a long-time contributor at TechWeb's InformationWeek and a member of Internet Evolution's ThinkerNet. He is founding chair of the Sentiment Analysis Symposium and the Text Analytics Summit. Seth founded Washington-based Alta Plana Corporation in 1997. He consults, writes, and speaks internationally on information-systems strategy, data management and analysis systems, industry trends, and emerging analytical technologies.

Please visit Seth's on-line business card for more information, and follow Seth on Twitter at @sethgrimes.


Re: Identity vs People
  • 4/9/2012 11:07:31 PM

Shawn, is privacy still something one navigates? I see it more, nowadays, as something that one party ("the user") simply surrenders in exchange for vague, untrustworthy, mutable representations from another party ("the service provider"). In the words of Prophet Scott, "get over it."

Re: Identity vs People
  • 4/9/2012 9:18:04 PM

Hi Seth Breedlove,

True enough! The question is how long it will take us to get there and what privacy and other issues will need to be navigated in the process.

Re: Video Analytics.
  • 4/9/2012 8:58:43 PM

Hey Shawn, thanks for the heads up. I will keep an eye out for that. I will be interested to know if they use many of the same methods that video service providers use for MPEG QoS.

Re: Identity vs People
  • 4/9/2012 8:57:49 PM

Hi Seth,

You said:

A person's identity can be seen as an assemblage of personas or, if you prefer, masks or roles that come into play at different times, and also a person has attributes that can be mined or inferred from various media in which he or she is mentioned or involved.

Another interesting way to look at measuring online personas. I guess I'm wondering how we use this kind of information and to what extent we can measure identities on the social Web.

Re: Video Analytics.
  • 4/9/2012 8:37:48 PM

Hi Bulk,

In fact, a post coming up soon will look at how face recognition analytics uses still images to evaluate video content.

Re: Identity vs People
  • 4/9/2012 4:48:38 PM

About six billion photos are posted on Facebook each month.  That's a lot of information about how we interact with each other, what we are wearing, doing and buying.  I would imagine being able to study that data would be one of the next frontiers. 

Re: Identity vs People
  • 4/9/2012 1:28:16 PM

Thanks Jennifer.

There's a hierarchy of personal identifiers from named entities (e.g., Jennifer Roberts) to co-references (Ms Roberts) to anaphora ("she," "her"). You could be unambiguously designated by a Uniform Resource Identifier (URI). But sometimes we want more than an explicit Who. A person's identity can be seen as an assemblage of personas or, if you prefer, masks or roles that come into play at different times, and also a person has attributes that can be mined or inferred from various media in which he or she is mentioned or involved. Actually, attributes can be extracted even if the particular person cannot be identified, and we can create models that segment or describe classes of individuals by characteristics. All this, to me, is identity analytics.

Video Analytics.
  • 4/9/2012 12:54:57 PM

Video is a difficult medium to break down. Currently there are companies that provide analytics on video QoS, and some companies offer a very robust suite of tools with their MPEG monitoring solution.

From what I have seen of this kind of functionality, it would not be much of a stretch to make the leap to getting the data you are talking about from video. I think the key would be to break this down frame by frame; it would just be a lot of data.

Identity vs People
  • 4/9/2012 11:16:51 AM

Hi Seth,


Can you provide a bit more clarity around your idea of identity? I read this line:

"We focus a lot on the entities named in source materials (people, companies, locations, etc.) and not enough on facts, events, relationships, and fine-grain attributes such as sentiment and identity."

and wasn't sure what you meant by "identity" in contrast to 'people' as source materials. Do you have some other concept of identity beyond the idea of 'people'?


Great post and thanks for sharing!

-Jennifer @collectual

Greatest efforts thus far?
  • 4/9/2012 9:46:58 AM

Thanks for the overview of the changes facing us in unstructured data, Seth. I wonder if you could share any specific case studies of progress made in any of these areas so far, just to get a bit more specific.