Last month, the Text Analytics World conference (a spin-off from Predictive Analytics World) took place here in San Francisco. I was struck by the two very different kinds of talks that took place.
A few, including my own on how to evaluate text analytics software (which I also wrote about here), were on text analytics, but the majority focused on either sentiment analysis (including one from the study that predicted general stock market direction based on Twitter sentiment -- fascinating) or on some form of text mining.
The emphasis on text mining was understandable given the predictive analytics parentage, but it sparked two ideas. The first is that text mining and text analytics are significantly different in terms of focus and methods. The second is that few people have attempted to explore the integration of text analytics (TA) and text mining (TM).
TM is mostly natural language processing (NLP) software analysis looking for patterns based on large numbers with little human input. TA, on the other hand, usually has a large human input (building categorization rules, taxonomies, etc.) and relies more on deeper conceptual analysis of documents. The split between the two is similar to the two “systems” of the brain detailed by psychologist Daniel Kahneman in his recent book, Thinking, Fast and Slow.
But in the human brain, we have learned to use both systems -- so why not in the realm of text?
It is potentially a big topic, but as a start, TA and TM can enrich each other. A basic method is to use TA as a pre-processing step before running TM analyses. TA can find structure in unstructured text, opening up huge new areas of possible applications for TM. For example, TA can find additional metadata within text in a variety of ways. In a current project, we are using a combination of categorization rules and extraction rules to populate fields such as Title, Author, Publisher, Date, JournalTitle, and Keywords. These values are in the text rather than being specified in metadata fields.
The rules are somewhat context dependent. For example, the title is the first sentence in one set of documents, while titles follow the text "Title" in another set. In a third set, title is the sentence preceding a code such as NJTR. By building a combination of rules, the software can find values for a wide range of fields in unstructured (or semi-structured) text. In addition, entity extraction can populate fields like Organizations or People or Products, and categorization rules can populate the most difficult of fields, Subject.
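As a rough illustration of the context-dependent title rules described above, here is a minimal Python sketch. It is not the rule language of any particular text analytics product. The function names, the regular expressions, the sample documents, and the choice to try the rules in a fixed priority order are all assumptions made for the example.

```python
import re

# Naive sentence splitter: break on whitespace that follows ., ! or ?
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    return [s.strip() for s in SENTENCE_END.split(text.strip()) if s.strip()]


def title_after_label(text):
    # Rule 1: the title follows the literal text "Title" at the start of a line.
    m = re.search(r"^Title\b[ \t]*:?[ \t]*(.+)$", text, flags=re.MULTILINE)
    return m.group(1).strip() if m else None


def title_before_code(text, code="NJTR"):
    # Rule 2: the title is the sentence preceding a code such as NJTR.
    sentences = split_sentences(text)
    for previous, current in zip(sentences, sentences[1:]):
        if current.startswith(code):
            return previous.rstrip(".")
    return None


def title_first_sentence(text):
    # Rule 3: the title is simply the first sentence of the document.
    sentences = split_sentences(text)
    return sentences[0].rstrip(".") if sentences else None


def extract_title(text):
    # Assumption: try the most specific rule first, fall back to the most general.
    for rule in (title_after_label, title_before_code, title_first_sentence):
        value = rule(text)
        if value:
            return value
    return None


# Hypothetical sample documents, one per convention.
labeled = "Author: J. Smith\nTitle: Groundwater Trends in 2010\nBody text here."
coded = "Annual Water Quality Report. NJTR 2011-04. The report covers three counties."
plain = "Soil Erosion on Coastal Farms. This study examines runoff."

for doc in (labeled, coded, plain):
    print(extract_title(doc))
```

In a real project the rule to apply would more likely be chosen per document set than by fallback order, and the same pattern would be repeated for fields such as Author or Date.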
Once the TA rules have populated these metadata fields, TM has a whole new set of variables to use in its analysis.
On the other hand, TM can help build everything from taxonomies to categorization rules associated with those taxonomies. For example, on one project we used text mining to create clusters of co-occurring terms, then drilled down two levels to find new terms for important topics not covered in an existing taxonomy and thus candidates for the new taxonomy. In another technique, we mapped frequent terms to other variables like date or journal or taxonomy node and looked for changes in terminology over time or for candidate categorization terms to add to specific taxonomy nodes.
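The second technique, mapping frequent terms to another variable and looking for changes in terminology, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the documents, the `year` field, the tiny stopword list, and the function names are all hypothetical, and a real text mining tool would use far more robust tokenization and statistics.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"and", "the", "for", "of"}


def tokenize(text):
    # Lowercase alphabetic tokens, minus stopwords.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def term_counts_by_group(docs, key):
    # Map each value of `key` (a date, journal, or taxonomy node) to term counts.
    counts = defaultdict(Counter)
    for doc in docs:
        counts[doc[key]].update(tokenize(doc["text"]))
    return counts


def emerging_terms(counts, earlier, later, min_count=2):
    # Terms that are frequent in the later group and absent from the earlier one.
    return sorted(
        term
        for term, n in counts[later].items()
        if n >= min_count and term not in counts[earlier]
    )


# Hypothetical sample documents.
docs = [
    {"year": 2010, "text": "Managed care plans and managed care costs"},
    {"year": 2010, "text": "Managed care enrollment"},
    {"year": 2012, "text": "Accountable care organizations replace managed care"},
    {"year": 2012, "text": "Accountable care organizations and accountable care payments"},
]

counts = term_counts_by_group(docs, "year")
print(emerging_terms(counts, 2010, 2012))
```

Terms surfaced this way are only candidates. A person still decides whether each one belongs in the taxonomy.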
A typical manual method for building categorization rules is to start with a few high-level terms like Healthcare or Medicare, run the rule against selected documents, and then open the highest-scoring documents and look for concepts and phrases that you believe might be unique to that taxonomy node. Then repeat the process until you get better accuracy.
TM can speed this process, particularly by mapping frequent terms to every taxonomy node. Although you still need to scan manually for candidates to use in categorization rules, scanning lists of preselected terms is much easier than scanning entire documents. In addition, TM can give you terms that show up in documents associated with a specific taxonomy node but not in other closely related nodes. This can improve accuracy and help fine-tune the rules.
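To make the last point concrete, the following Python sketch finds terms that recur in documents assigned to one taxonomy node but never appear in documents from a closely related node. It is a simplified stand-in for what a text mining package would do. The node names, sample documents, `min_docs` threshold, and function names are assumptions for the example.

```python
import re
from collections import Counter


def doc_terms(text):
    # The set of distinct lowercase terms in one document.
    return set(re.findall(r"[a-z]+", text.lower()))


def distinctive_terms(target_docs, related_docs, min_docs=2):
    # Count how many target documents each term appears in (document frequency).
    target_df = Counter()
    for doc in target_docs:
        target_df.update(doc_terms(doc))
    # Collect every term seen in the closely related node.
    related_vocab = set()
    for doc in related_docs:
        related_vocab |= doc_terms(doc)
    # Keep terms that recur in the target node and never occur in the related one.
    return sorted(
        term
        for term, n in target_df.items()
        if n >= min_docs and term not in related_vocab
    )


# Hypothetical documents already assigned to two closely related nodes.
medicare_docs = [
    "Medicare enrollment rules for seniors",
    "Seniors and Medicare premiums",
    "Medicare premiums rise for seniors",
]
medicaid_docs = [
    "Medicaid enrollment rules for low income families",
    "State Medicaid premiums and eligibility",
]

print(distinctive_terms(medicare_docs, medicaid_docs))
```

Shared terms such as "premiums" and "enrollment" drop out, leaving a short list of candidates for a person to review before adding any to a categorization rule.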
TA and TM can enrich each other in many other ways, too, but some will have to wait until the availability of software that can fully integrate text analytics and text mining. Soon, I hope.
Tom, I like the hybrid content management/text analytics model you propose. I would think that'd be an approach that would be appealing to content authors and editors, too -- not too burdensome.
Beth, I always get nervous when software vendors talk about everything being automatic. There is always a human touch – it just depends where and how. Sometimes it is afterwards like your example, and sometimes it is prior – setting up the categorization rules or statistical models or a combination of the two. One model I've pushed for years in terms of finding information is the hybrid model of content management software with integrated text analytics capabilities helping authors and editors to add structure (metadata) to documents as they are published – using the strengths of both human and machine.
Beth, absolutely agree on the need for an interdisciplinary approach – in most things, but certainly for anything to do with getting value from text. Hmm, we now have knowledge management departments in organizations, what about text analytics departments? There would be a need for interdisciplinary cooperation between the TA department and other departments like IT, BI, CI, business units, and even sales. And also, the TA department should be interdisciplinary – NLP experts, statisticians, linguists, programmers or script writers, librarian-taxonomists, cognitive modelers, predictive analysis experts, and business analysts alongside. (Can you tell I was just reading about Bell Labs where they had theorists and practitioners sitting next to each other?)
And the model could scale down by having some people with multiple skills so even smaller organizations could have their own TA department.
Right now, there are new departments / titles growing out of social media (Customer Experience analysts, etc.), so who knows, maybe a TA department with connections throughout the organization might become a reality?
I guess the kind of app I would like to see is something that can take the mining results from an analysis of the business environment, or a country's political environment, and then feed that into an internal program to generate a list of possible outcomes internal to a particular business. I know this is not very acute, but as an example: what if you pushed the reality of an earthquake in Japan six months ago and the program came back with, "hey go buy some hard drives now, you won't be able to get them soon"?
Tom, thanks so much for this explanation. The IT as platform analogy is perfect, and the TA/TM skills distinction an important one to note. I could see how more purposeful use could come from cross-functional teams and lots more knowledge sharing within companies on their TA, TM, BI, IT and business goals!
Tom, I recall talking with the director of information and semantics management for The Tribune Co. not so long ago about the need for the human touch, which I think applies to both sentiment analysis and ontology as you've described. He cited a few examples where the machine learning just isn't adequate. One of his examples:
While the Trib's text analytics platform might watch out for the term "Barack Obama" and the word "president" in the same sentence, what's it do when a new nickname -- say, President Numero Uno -- surfaces? "We might not find 'President Numero Uno' appear in a news article," says DeWeese, referring to the Trib's chief business case for text analytics, "but boy, when people start commenting, that's what they're calling him."
If you're trying to deliver all relevant content or collect sentiments about President Obama, any comments referencing that latest nickname would surely be among the most desirable but likely missed.
"You have to know or somehow be able to see this -- it has to get on your radar, and that's where the initial human analysis comes into play. Humans create that bridge to the machine."
Certainly Homeland Security has lots of people doing text analysis – they bought copies of every single vendor's TA/TM software out there – which is why companies finally stopped citing them as a customer. And yes, all those other fields have some TA/TM capabilities that they are using – but the question is how are they using them? Sentiment analysis became its own "field" for a lot of reasons (social media hype and promise of more sales, etc.) but also because it uses a specific set of words (the sentiment dictionary). The methods used in sentiment analysis are really a combination of TM and TA when it's done right, so it is a difficult but interesting task to figure out which, if any, of these specific application areas will take off and become its own field.
I'm giving a talk on future directions in social media next week at the Social Media Analytics Summit in San Francisco in which I talk about expertise analysis and behavior prediction as two possibilities, but I'm not sure if they have the potential for becoming their own field like sentiment analysis – a combination of scale of interest along with specific techniques and/or terminology. Have to give that one some more thought.
It seems to me that the examples you cite are more application areas than different fields (yes, it's a somewhat arbitrary line).
One immediate example is sentiment analysis, which more often than not is done with mostly text mining approaches (statistical) but with a large dictionary of sentiment terms. The vendors like to focus on the strength of their software out of the box and downplay the need for customization. But in this case, customization is basically adding categorization to look at the context around various sentiment statements. This context is what is needed to distinguish things like sarcasm, but also even to distinguish things like conditionals ("I'd like 'this feature' if it had better design"), which can come out as a positive "like this feature".
As far as ontology development, the situation is similar in that you are trying to extract "facts" or triples which can also be ambiguous in a number of ways. First, in terms of identifying the entities (classic example is Ford the car, the person, the company, and a way to cross a stream). Second, in identifying the relationships between those entities where you have some of the same kinds of issues as with sentiment – conditionals, etc.
As an aside, I'd like to see ontology development have a text mining piece in it that would function like the taxonomy example I used earlier.
Beth, on silo effect – I don't see it so much as a traditional silo, but rather that TA/TM is a platform technology and most organizations are organized around specific applications. So while, for example, Business Intelligence depends on TA/TM, they don't consider it as fundamental to what they do and the unfortunate effect of that is to downplay its importance and be satisfied with simple, one-dimensional approaches.
In some ways it is like IT as a platform for building applications, but in this case the platform involves the actual language of the business functions that want to have the application. This sets up an unusual situation, which is that business people would not assume they could develop an IT application because it involves a specialized language, but too often think that because they know the business language, they can develop a TA application. However, this overlooks the fact that having expertise in a given subject doesn't mean that the person has the expertise to categorize/organize/analyze the contents of that expertise.
As far as the TA/TM split, that does seem to be somewhat of a silo effect – but in this case a silo based on different skills such as mathematics and language or knowledge organization (I'd say librarian but that title carries such negative connotations that I hesitate to use it these days). The trick is to figure out how to enable communication between silos without losing the benefits of a silo (they do exist for a reason).