“What is Content Analytics?, Alex”

“The technology behind Watson represents the future of data management and analytics.  In the real world, this technology will help us uncover insights in everything from traffic to healthcare.”

– John Cohn, IBM Fellow, IBM Systems and Technology Group

How can the same technology used to play Jeopardy! give you better business insight?

Why Watson matters

Start by understanding that IBM Watson DeepQA is the world’s most advanced question answering machine.  It uncovers answers by understanding the meaning buried in the context of a natural language question.  By combining advanced Natural Language Processing (NLP) and DeepQA automatic question answering technology, Watson represents the future of content and data management, analytics, and systems design.  IBM Watson leverages core content analysis, along with a number of other advanced technologies, to arrive at a single, precise answer within a very short period of time.  The business applications for this technology are limitless, starting with clinical healthcare, customer care, government intelligence and beyond.  I covered the technology side of Watson in my previous posting 10 Things You Need to Know About the Technology Behind Watson.

Amazingly, Watson works like the human brain to analyze the content of a Jeopardy! question.  First, it tries to understand the question to determine what is being asked, which means analyzing the natural language text.  Next, it tries to find reasoned answers by analyzing a wide variety of disparate content, mostly in the form of natural language documents.  Finally, Watson assesses the relative likelihood that the answers it found are correct, based on a confidence rating.
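To make those three steps concrete, here is a toy Python sketch.  It is not how DeepQA is implemented; the stopword list, the two passages and the keyword-overlap "confidence" are all assumptions made up for illustration.

```python
import re
from dataclasses import dataclass

# Illustrative stopword list (an assumption); real question analysis is far richer.
STOPWORDS = {"a", "an", "the", "is", "was", "which", "what", "who", "that"}


@dataclass
class Candidate:
    answer: str
    evidence: str
    confidence: float


def analyze_question(question):
    """Step 1: reduce the question to its content keywords."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return {w for w in words if w not in STOPWORDS}


def find_candidates(keywords, passages):
    """Steps 2 and 3: treat each passage as evidence for one answer and
    score it by the fraction of question keywords the passage contains."""
    candidates = []
    for answer, text in passages.items():
        passage_words = set(re.findall(r"[a-z0-9]+", text.lower()))
        confidence = len(keywords & passage_words) / len(keywords) if keywords else 0.0
        candidates.append(Candidate(answer, text, confidence))
    # Highest confidence first.
    return sorted(candidates, key=lambda c: c.confidence, reverse=True)


# Hypothetical mini-corpus standing in for natural language documents.
PASSAGES = {
    "Deep Blue": "Deep Blue is the IBM chess-playing system that beat a grand master.",
    "Watson": "Watson is the IBM question answering system that plays Jeopardy!",
}

keywords = analyze_question("Which IBM system beat a chess grand master?")
ranked = find_candidates(keywords, PASSAGES)
print(ranked[0].answer, round(ranked[0].confidence, 2))
```

Simple keyword overlap is exactly the kind of shallow matching that breaks down on puns and ambiguity, which is why the real system needs much deeper analysis at each step.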

A great example of the challenge is described by Stephen Baker in his book Final Jeopardy: Man vs. Machine and the Quest to Know Everything: ‘When 60 Minutes premiered, this man was U.S. President.’  Traditionally it’s been difficult for a computer to understand what ‘premiered’ means and that it’s associated with a date.  To a computer, ‘premiere’ could also mean ‘premier’.  Is the question about a person’s title or a production opening?  Then it has to figure out the date when an entity called ‘60 Minutes’ premiered, and then find out who was the ‘U.S. President’ at that time.  In short, it requires a ton of contextual understanding.

I am not talking about search here.  This is far beyond what search tools can do.  A recent Forrester report, Take Control Of Your Content, states that 45% of the US workforce spends three or more hours a week just searching for information.  This is completely inefficient.  See my previous posting Goodbye Search … It’s About Finding Answers … Enter Watson vs. Jeopardy! for more on this topic.

Natural Language Processing (NLP) can be leveraged in any situation where text is involved. Besides answering questions, it can help improve enterprise search results or even develop an understanding of the insight hidden in the content itself.  Watson leverages the power of NLP as the cornerstone to translate interactions between computers and human (natural) languages.

NLP involves a series of steps that make text understandable (or computable).  A critical step, lexical analysis, is the process of converting a sequence of characters into a set of tokens.  Subsequent steps leverage these tokens to perform entity extraction (people, places, things), concept identification (person A belongs to organization B) and the annotation of documents with this and other information.  A feature of IBM Content Analytics (known as LanguageWare) performs the lexical analysis function in Watson as part of natural language processing.
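As a rough illustration of those steps, the Python sketch below tokenizes the 60 Minutes clue and then annotates entities by dictionary lookup.  This is not LanguageWare or its API; the token pattern, the tiny gazetteer and its labels (TITLE, ROLE) are assumptions invented for the example.

```python
import re

# Token pattern (an assumption): dotted abbreviations such as "U.S.",
# runs of digits, runs of letters, or any single punctuation character.
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\d+|[A-Za-z]+|[^\sA-Za-z\d]")

# Hypothetical gazetteer mapping token sequences to entity labels.
GAZETTEER = {
    ("60", "Minutes"): "TITLE",
    ("U.S.", "President"): "ROLE",
}


def tokenize(text):
    """Lexical analysis: convert a sequence of characters into tokens."""
    return TOKEN_PATTERN.findall(text)


def annotate(tokens, gazetteer):
    """Entity extraction by longest-match lookup against the gazetteer.
    Returns (entity text, label, index of first token) tuples."""
    annotations = []
    max_len = max(len(key) for key in gazetteer)
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in gazetteer:
                annotations.append((" ".join(key), gazetteer[key], i))
                i += n
                break
        else:
            # No entity starts at this token; move on.
            i += 1
    return annotations


clue = "When 60 Minutes premiered, this man was U.S. President."
tokens = tokenize(clue)
entities = annotate(tokens, GAZETTEER)
print(tokens)
print(entities)
```

Notice that the tokenizer alone cannot tell that "60 Minutes" is a title rather than a duration; that knowledge only arrives with the annotation step, which is why the later stages matter so much.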

Why this matters to your business

Jeopardy! poses a set of contextual information challenges similar to those found in the business world today:

  • Over 80 percent of information being stored is unstructured (that is, text based).
  • Understanding that 80-plus percent isn’t simple.  As in Jeopardy!, subtle meaning, irony, riddles, acronyms, abbreviations and other complexities all present computing challenges not found with structured data when you try to derive meaning and insight.  This is where natural language processing (NLP) comes in.

The same core NLP technology used in Watson is available now to deliver business value by unlocking the insights trapped in the massive amounts of unstructured information spread across your many systems and formats.  Understanding the content, context and value of this unstructured information presents an enormous opportunity for your business.  This is already being done in a number of industries by leveraging IBM Content Analytics.

IBM Content Analytics (ICA) itself is a platform to derive rapid insight.  It can transform raw information into business insight quickly without building models or deploying complex systems, enabling all knowledge workers to derive insight in hours or days … not weeks or months.  It helps address industry specific problems such as healthcare treatment effectiveness, fraud detection, product defect detection, public safety concerns, customer satisfaction and churn, crime and terrorism prevention and more.  Here are some actual customer examples:

Healthcare Research – Like most healthcare providers, BJC Healthcare had a treasure trove of historical information trapped in unstructured clinical notes and diagnostic reports containing essential information for the study of disease progression, treatment effectiveness and long-term outcomes.  Their existing Biomedical Informatics (BMI) resources were disjointed and non-interoperable, available only to a small fraction of researchers, and frequently redundant, with no capability to tap into the wealth of research information trapped in unstructured clinical notes, diagnostic reports and the like.

With IBM Content Analytics, BJC and university researchers are now able to analyze unstructured information to answer key questions that were previously unanswerable.  Questions like: Does the patient smoke?  How often and for how long?  If smoke free, for how long?  What home medications is the patient taking?  What is the patient sent home with?  What was the diagnosis and what procedures were performed on the patient?  BJC now has deeper insight into medical information and can uncover trends and patterns within their content, to provide better healthcare to their patients.

Customer Satisfaction – Identifying customer satisfaction trends about products, services and personnel is critical to most businesses.  The Hertz Corporation and Mindshare Technologies, a leading provider of enterprise feedback solutions, are using IBM Content Analytics software to examine customer survey data, including text messages, to better identify car and equipment rental performance levels, pinpoint problems and make the adjustments needed to improve customer satisfaction.

By using IBM Content Analytics, companies like Hertz can drive new marketing campaigns or modify their products and services to meet the demands of their customers. “Hertz gathers an amazing amount of customer insight daily, including thousands of comments from web surveys, emails and text messages. We wanted to leverage this insight at both the strategic level and the local level to drive operational improvements,” said Joe Eckroth, Chief Information Officer, the Hertz Corporation.

For more information about ICA at Hertz: http://www-03.ibm.com/press/us/en/pressrelease/32859.wss

Research Analytics – To North Carolina State University, the essence of a university is more than education – it is the advancement and dissemination of knowledge in all its forms.  One of the main issues faced by NC State was dealing with the vast number of data sources available to them.  The university sought a solution to efficiently mine and analyze vast quantities of data to better identify companies that could bring NC State’s research to the public.  The objective was a solution designed to parse the content of thousands of unstructured information sources, perform data and text analytics and produce a focused set of useful results.

Using IBM Content Analytics, NC State was able to reduce the time needed to find target companies from months to days.  The result is the identification of new commercialization opportunities, with tests yielding a 300 percent increase in the number of candidates.  By obtaining insight into their extensive content sources, NC State’s Office of Technology Transfer was able to find more effective ways to license technologies created through research conducted at the university. “What makes the solution so powerful is its ability to go beyond conventional online search methods by factoring context into its results.” – Billy Houghteling, executive director, NC State Office of Technology Transfer.

For more information about ICA at NC State: http://www-01.ibm.com/software/success/cssdb.nsf/CS/SSAO-8DFLBX?OpenDocument&Site=software&cty=en_us

You can put the technology of tomorrow to work for you today, by leveraging the same IBM Content Analytics capability helping to power Watson.  To learn more about all the IBM ECM products utilizing Watson technology, please visit these sites:

IBM Content Analytics: http://www-01.ibm.com/software/data/content-management/analytics/

IBM Classification Module: http://www-01.ibm.com/software/data/content-management/classification/

IBM eDiscovery Analyzer: http://www-01.ibm.com/software/data/content-management/products/ediscovery-analyzer/

IBM OmniFind Enterprise Edition: http://www-01.ibm.com/software/data/enterprise-search/omnifind-enterprise/

You can also check out the IBM Content Analytics Resource Center or watch the “what it is and why it matters” video.

I’ll be at the Jeopardy! viewing party in Washington, DC on February 15th and 16th … hope to see you there.  In the meantime, leave me your thoughts and questions below.

10 Things You Need to Know About the Technology Behind Watson

What is so fascinating about a Computer System vs. Quiz Show?  The popularity of America’s favorite quiz show, Jeopardy!, stems from the unique challenges it poses to its contestants: the breadth of topics; the puns, metaphors, and slang in the questions; the speed it takes to buzz and answer.

These factors make Jeopardy! the perfect testing ground for Watson, the IBM computing system that can understand the complexities of human language and return a single, precise answer to a question.

Next month, IBM’s Watson will play Jeopardy! (on live network TV) with two of the all-time champions.  IBM offered a press sneak peek this week at a practice round that included Alex Trebek.  After seeing the clips, I am getting excited and am convinced this technology breakthrough is something special.  Here’s what you need to know:

1.  What is Watson?

Watson is the name for IBM’s Question Answering (QA) computing system, built by a team of IBM Research scientists and university collaborators who set out to accomplish a grand challenge – to build a computing system that rivals a human’s ability to answer questions posed in natural language with speed, accuracy and confidence. It leverages Natural Language Processing (or NLP) to process extreme volumes of text.

Watson is powered by an IBM POWER7 platform to handle the massive analytics at the speeds required to analyze complex language and deliver correct responses to natural language clues.  The system is a combination of current and new IBM technologies optimized to meet the specialized demands of processing an enormous number of concurrent tasks while analyzing content in real time.

2.  What is Natural Language Processing?

Natural language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. It describes a set of linguistic, statistical, and machine learning techniques that allow text to be analyzed and key information extracted for other uses such as Question Answering or Content Analytics.

3.  What are QA and DeepQA?

Question Answering (QA) is the task of automatically answering a question posed in natural language. It involves first trying to understand the question to determine what is being asked, then analyzing a wide variety of disparate content, mostly in the form of natural language documents, to find reasoned answers, and finally assessing, based on the evidence, the relative likelihood that the found answers are correct. Collections can vary from small local document collections, to internal organization documents, to compiled newswire reports, to the World Wide Web. QA is regarded as the next step beyond current search engines.

DeepQA goes well beyond simple question reformulation or keyword analyses. Queries that include disambiguation, unfamiliar syntax, spatially or temporally constrained questions – or simply bad question framing – require a deeper level of content and text analysis.

4.  What is unique about the QA implementation for Watson?

Competing with humans on Jeopardy! poses an additional set of challenges, including the variety of question types and styles, the broad and varied range of topics, and the demand for high degrees of confidence and speed.  Together, these required a whole new approach to the problem.

5.  How does QA technology compare to document search?

The key difference between QA technology and document search is that document search takes a keyword query and returns a list of documents, ranked in order of relevance to the query (often based on popularity and page ranking), while QA technology takes a question expressed in natural language, seeks to understand it in much greater detail, and returns a precise answer to the question.
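A toy Python sketch can make that contrast visible.  The three one-sentence "documents" and both functions are assumptions for illustration only; real search engines and real QA systems are vastly more sophisticated, and this QA function only handles questions of the form "What is X?".

```python
import re

# Hypothetical document collection (file names and text are made up).
DOCS = {
    "deep-blue.txt": "Deep Blue is a chess-playing system built by IBM.",
    "watson.txt": "Watson is a question answering system built by IBM.",
    "jeopardy.txt": "Jeopardy! is a quiz show hosted by Alex Trebek.",
}


def words(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def document_search(query, docs):
    """Document search: keyword query in, ranked list of documents out."""
    query_words = words(query)
    scored = [(len(query_words & words(text)), name) for name, text in docs.items()]
    # Best overlap first; ties broken alphabetically by document name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored if score > 0]


def answer_question(question, docs):
    """QA: natural language question in, one precise answer out (or None)."""
    match = re.fullmatch(r"What is (.+)\?", question.strip())
    if not match:
        return None
    subject = match.group(1)
    pattern = re.compile(re.escape(subject) + r" is (.+?)\.")
    for text in docs.values():
        found = pattern.search(text)
        if found:
            return found.group(1)
    return None


hits = document_search("IBM system", DOCS)
answer = answer_question("What is Watson?", DOCS)
print(hits)
print(answer)
```

The search call hands back documents that you still have to read, while the QA call hands back the answer itself.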

I touched on the frustrations of search in my previous posting Goodbye Search … It’s About Finding Answers … Enter Watson vs. Jeopardy!

6.  How does Watson compare to the chess-playing system, Deep Blue?

Deep Blue demonstrated that computers can solve problems once thought the exclusive domain of human intelligence, albeit in perhaps very different ways than humans do.  Deep Blue was an amazing achievement in the application of compute power to an extraordinarily challenging but computationally well-defined and well-bounded game.  By searching and evaluating a huge space of possible chess board configurations, Deep Blue had the compute power to beat a grandmaster.

Watson faces a challenge that is entirely open-ended and defies the sort of well-bounded mathematical formulation that fits a game like chess.  Watson has to operate in the near limitless, ambiguous and highly contextual domain of human language and knowledge.  Ultimately Watson’s scientific goal is to demonstrate how computers can get at the meaning behind a natural language question and infer precise answers from huge volumes of content, with justifications that ultimately make sense to humans.

Rather than challenging the human to search a vast mathematical space, the Watson project challenges the computer to operate in human terms.  Watson strives to understand and answer human questions and to know when it does and doesn’t know the answer.  The capability to assess its own knowledge and abilities, something humans find relatively easy, is exceedingly difficult for computers.
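One simple way to picture "knowing when it doesn’t know" is a confidence threshold: answer only when the best candidate is confident enough, otherwise stay silent.  The Python sketch below is an assumption for illustration, not Watson’s actual decision logic; the function name, the 0.5 threshold and the confidence numbers are all made up.

```python
def decide(candidates, threshold=0.5):
    """Return the most confident answer, or None to abstain.

    candidates maps each candidate answer to a confidence between 0 and 1.
    The default threshold of 0.5 is an arbitrary illustrative value.
    """
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    return best if candidates[best] >= threshold else None


# One clear winner: answer.
confident = decide({"answer A": 0.82, "answer B": 0.11})

# Nothing stands out: abstain rather than guess.
unsure = decide({"answer A": 0.31, "answer B": 0.29})

print(confident, unsure)
```

On Jeopardy! a wrong answer costs money, so staying silent when no candidate clears the bar can be the better move.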

7.  How would this QA technology be used in a business setting?

DeepQA technology provides humans with a powerful tool for their information gathering and decision support.  One of many possible scenarios could be for the end user to enter their question in natural language form, much as if they were asking another person, and for the system to sift through vast amounts of potential evidence to return a ranked list of the most compelling, precise answers along with links to supporting or refuting evidence.  Other important scenarios will use DeepQA to analyze a collection of content and data representing a problem, for example a technical support problem or a medical case.  DeepQA will start to search for a solution, gathering and assessing evidence from many disparate data sources and engaging human users to provide the missing pieces of information that can help arrive at a solution or, in the case of medicine, a differential diagnosis.

In addition, these answers would include summaries of their justifying or supporting evidence, allowing the user to quickly assess the evidence and select the correct answer.

Business applications include Customer Relationship Management, Regulatory Compliance, Contact Centers, Help Desks, Web Self-Service, Business Intelligence and more.  These applications will demand a deep understanding of users’ questions and analysis of huge volumes of natural language, structured and semi-structured content to rapidly deliver and justify precise, succinct, and high-confidence answers.

8.  What is the role of Unstructured Information Management Architecture (UIMA) in DeepQA and the Watson project?

Unstructured Information Management Architecture (UIMA) is the IBM-developed open-source framework for analysis of unstructured content, such as natural language text, speech, images and video, which Watson uses to integrate and deploy a broad collection of deep analysis algorithms over vast amounts of content.

A number of IBM ECM products are based on and leverage UIMA today. IBM Content Analytics, IBM OmniFind Enterprise Edition, IBM eDiscovery Analyzer and IBM Classification Module all are powered by, or benefit from, natural language processing and UIMA.

9.  Are any Enterprise Content Management (ECM) technologies actually part of Watson?

Yes, IBM Content Analytics is part of Watson.  After the question is asked, the text needs to be processed using natural language processing.  IBM Content Analytics (LanguageWare) and other techniques (secret sauce) are used to process the text, and understand the question, as part of the complex processing required to fully answer questions with confidence.  I will tackle this issue in more detail in my next blog posting.

IBM Content Analytics (or ICA) is a content analysis platform used to derive rapid insight from content and data.  It can transform raw information into business insight quickly without building models or deploying complex systems, enabling businesses to derive insight in hours or days … not weeks or months.  It’s easy to use and designed for any knowledge worker who needs to search and explore content.  ICA can be extended for deeper insights by integrating with Cognos, SPSS, InfoSphere, Netezza and other Business Intelligence, Analytics and Data Warehouse systems.

The ICA product itself includes tooling (LanguageWare) which is used to customize NLP processing and build industry or customer specific models and solutions.  This capability is at the core of natural language processing and is the very same ICA capability that is used in Watson.

10.  Who is going to win on February 14-16th?

My prediction … Watson is.

I watched the video of the practice rounds yesterday, and Watson performed impressively against Ken Jennings and Brad Rutter (the two other contestants, both all-time champions).

So … who won the Jeopardy! practice round?

Watson won handily …


Watson’s score was $4,400, beating Jennings by $1,000 and nearly quadrupling Rutter’s score.

IBM will donate 100% of Watson’s winnings to charity, while Rutter and Jennings said they will each donate 50% of their prizes. 

I am going to host a viewing party for colleagues, friends and family.  This is going to be exciting and fun … I can’t wait.  As an IBMer, I’ll be rooting for Watson to win but not for the obvious reason.  My rooting is really about my passion for the amazing technology breakthrough and the power of content analytics. 

Who do you think will win?  Leave me your thoughts below.