Last month, the Text Analytics World conference (a spin-off from Predictive Analytics World) took place here in San Francisco. I was struck by the two very different kinds of talks on the program.
A few, including my own talk on how to evaluate text analytics software (which I also wrote about here), covered text analytics, but the majority focused either on sentiment analysis (including one on a study that predicted general stock market direction from Twitter sentiment -- fascinating) or on some form of text mining.
The emphasis on text mining was understandable given the predictive analytics parentage, but it sparked two ideas. The first is that text mining and text analytics are significantly different in terms of focus and methods. The second is that few people have attempted to explore the integration of text analytics (TA) and text mining (TM).
TM is mostly natural language processing (NLP) software looking for statistical patterns across large numbers of documents, with little human input. TA, on the other hand, usually involves substantial human input (building categorization rules, taxonomies, etc.) and relies more on deeper conceptual analysis of individual documents. The split between the two is similar to the two "systems" of the brain detailed by psychologist Daniel Kahneman in his recent book, Thinking, Fast and Slow.
But in the human brain, we have learned to use both systems -- so why not in the realm of text?
It is potentially a big topic, but as a start, TA and TM can enrich each other. A basic method is to use TA as a pre-processing step before running TM analyses. TA can find structure in unstructured text, opening up huge new areas of possible applications for TM. For example, TA can find additional metadata within text in a variety of ways. In a current project, we are using a combination of categorization rules and extraction rules to populate fields such as Title, Author, Publisher, Date, JournalTitle, and Keywords. These values are in the text rather than being specified in metadata fields.
The rules are somewhat context dependent. For example, the title is the first sentence in one set of documents, while titles follow the text "Title" in another set. In a third set, the title is the sentence preceding a code such as NJTR. By building a combination of rules, the software can find values for a wide range of fields in unstructured (or semi-structured) text. In addition, entity extraction can populate fields like Organizations, People, or Products, and categorization rules can populate the most difficult field of all, Subject.
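To make the idea concrete, here is a minimal sketch of such context-dependent rules in Python. The three conventions (a "Title" label, a sentence preceding a code like NJTR, and a first-sentence fallback) come from the examples above; the function name, regexes, and ordering are my own illustration, not the actual software used on the project:

```python
import re

def extract_title(text):
    """Try context-dependent rules, most specific first, to find a title."""
    # Rule 1: a "Title:" label precedes the title on its own line.
    m = re.search(r"(?im)^Title:\s*(.+)$", text)
    if m:
        return m.group(1).strip()
    # Rule 2: the title is the sentence preceding a code such as NJTR.
    m = re.search(r"([^.\n]+)\.\s*NJTR", text)
    if m:
        return m.group(1).strip()
    # Rule 3 (fallback): the title is the first sentence of the document.
    first = text.strip().split(".")[0].strip()
    return first or None

# One rule set handles documents that follow different conventions.
print(extract_title("Title: Evaluating Text Analytics Software\nBody..."))
```

In practice each rule would be scoped to the document set it applies to, but even a flat first-match-wins chain like this shows how TA turns free text into filled metadata fields.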
Once the TA rules have populated these metadata fields, TM has a whole new set of variables to use in its analysis.
On the other hand, TM can help build everything from taxonomies to categorization rules associated with those taxonomies. For example, on one project we used text mining to create clusters of co-occurring terms, then drilled down two levels to find new terms for important topics not covered in an existing taxonomy and thus candidates for the new taxonomy. In another technique, we mapped frequent terms to other variables like date or journal or taxonomy node and looked for changes in terminology over time or for candidate categorization terms to add to specific taxonomy nodes.
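The clustering step above starts from term co-occurrence counts. A toy sketch of that first pass, in pure Python (the corpus and stopword list are invented for illustration; a real project would use proper tokenization and thousands of documents):

```python
from collections import Counter
from itertools import combinations

# Toy corpus: two documents per topic, purely illustrative.
docs = [
    "medicare coverage rules for hospital care",
    "hospital billing under medicare coverage",
    "stock sentiment signals from twitter data",
    "twitter sentiment and market data analysis",
]
stopwords = {"for", "under", "from", "and", "the"}

# Count how many documents each term pair co-occurs in.
pair_counts = Counter()
for doc in docs:
    terms = sorted({t for t in doc.split() if t not in stopwords})
    pair_counts.update(combinations(terms, 2))

# Pairs that co-occur in two or more documents are candidate cluster
# seeds -- the starting point for drilling down to new taxonomy terms.
seeds = sorted(pair for pair, n in pair_counts.items() if n >= 2)
print(seeds)
```

Drilling down "two levels," as described above, then means repeating the same counting within each cluster's documents to surface narrower terms.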
A typical manual method for building categorization rules is to start with a few high-level terms like Healthcare or Medicare, run the rule against selected documents, and then open the highest-scoring documents and look for concepts and phrases that you believe might be unique to that taxonomy node. Then repeat the process until you get better accuracy.
TM can speed this process, which also benefits from mapping frequent terms to every taxonomy node. Although you still need to run manual scans for candidate terms to use in categorization rules, scanning lists of preselected terms is much easier than scanning entire documents. In addition, TM can give you terms that show up in documents associated with a specific taxonomy node but not in other, closely related nodes. This can improve accuracy and help fine-tune the rules.
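Finding terms that distinguish one node from a closely related sibling can be as simple as comparing relative term frequencies. A hedged sketch, with invented example documents and an arbitrary 3x threshold (real projects would use larger corpora and a statistical measure such as log-odds):

```python
from collections import Counter

# Invented mini-corpora for two closely related taxonomy nodes.
node_docs = ["medicare part d premiums", "medicare enrollment premiums rise"]
sibling_docs = ["medicaid enrollment expands", "state medicaid waivers"]

def term_freqs(docs):
    """Relative frequency of each term across a set of documents."""
    counts = Counter(t for d in docs for t in d.split())
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

node_f = term_freqs(node_docs)
sib_f = term_freqs(sibling_docs)

# Keep terms at least 3x more frequent under this node than its sibling:
# these are candidates for that node's categorization rule.
distinctive = sorted(t for t, f in node_f.items() if f > 3 * sib_f.get(t, 0))
print(distinctive)
```

A shared term like "enrollment" drops out because it appears under both nodes, which is exactly the fine-tuning benefit described above.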
TA and TM can enrich each other in many other ways, too, but some will have to wait until the availability of software that can fully integrate text analytics and text mining. Soon, I hope.