10 Things You Need to Know About the Technology Behind Watson

What is so fascinating about a computer system competing on a quiz show?  The popularity of America’s favorite quiz show, Jeopardy!, stems from the unique challenges it poses to its contestants: the breadth of topics; the puns, metaphors, and slang in the questions; the speed required to buzz in and answer.

These factors make Jeopardy! the perfect testing ground for Watson, the IBM computing system that can understand the complexities of human language and return a single, precise answer to a question.

Next month, IBM’s Watson will play Jeopardy! (on live network TV) with two of the all-time champions.  IBM offered a press sneak peek this week at a practice round that included Alex Trebek.  After seeing the clips, I am getting excited and am convinced this technology breakthrough is something special.  Here’s what you need to know:

1.  What is Watson?

Watson is the name for IBM’s Question Answering (QA) computing system, built by a team of IBM Research scientists and university collaborators who set out to accomplish a grand challenge – to build a computing system that rivals a human’s ability to answer questions posed in natural language with speed, accuracy and confidence. It leverages Natural Language Processing (or NLP) to process extreme volumes of text.

Watson is powered by an IBM POWER7 platform to handle the massive analytics at the speeds required to analyze complex language and deliver correct responses to natural language clues.  The system is a combination of current and new IBM technologies optimized to meet the specialized demands of running an enormous number of concurrent tasks over large volumes of content while analyzing that content in real time.

2.  What is Natural Language Processing?

Natural language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. It describes a set of linguistic, statistical, and machine learning techniques that allow text to be analyzed and key information extracted for other uses such as Question Answering or Content Analytics.

3.  What are QA and DeepQA?

Question Answering (QA) is the task of automatically answering a question posed in natural language. It involves three steps: first, understanding the question to determine what is being asked; then, analyzing a wide variety of disparate content, mostly in the form of natural language documents, to find reasoned answers; and finally, assessing, based on the evidence, the relative likelihood that the answers found are correct. Collections can vary from small local document collections, to internal organization documents, to compiled newswire reports, to the World Wide Web. QA is regarded as the next step beyond current search engines.
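To make those three steps concrete, here is a minimal Python sketch. It is a toy illustration only and not how Watson or DeepQA actually works: the three-sentence corpus, the stopword list, the capitalized-phrase candidate rule and all function names are my own assumptions.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for a real document collection.
CORPUS = [
    "Deep Blue was a chess computer built by IBM.",
    "Watson is a question answering system built by IBM Research.",
    "Jeopardy! is a quiz show hosted by Alex Trebek.",
]

# Assumed stopword list; a real system would use far richer language analysis.
STOPWORDS = {"a", "an", "the", "is", "was", "by", "what", "who", "which", "of"}


def tokenize(text):
    """Lowercase the text and keep only the non-stopword tokens."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def answer_question(question, corpus=CORPUS):
    """Return (candidate, relative likelihood) pairs, best candidate first."""
    # Step 1: "understand" the question (here: just reduce it to keywords).
    keywords = set(tokenize(question))

    # Step 2: search the content for candidate answers.
    scores = Counter()
    for passage in corpus:
        overlap = len(keywords & set(tokenize(passage)))
        if overlap == 0:
            continue
        # Assumption: capitalized phrases are the candidate answers.
        for phrase in re.findall(r"[A-Z][A-Za-z]*(?:\s[A-Z][A-Za-z]*)*", passage):
            if keywords & set(tokenize(phrase)):
                continue  # skip phrases that merely repeat the question
            # The passage's keyword overlap counts as evidence for the phrase.
            scores[phrase] += overlap

    # Step 3: turn the evidence into relative likelihoods.
    total = sum(scores.values())
    if total == 0:
        return []
    return [(cand, score / total) for cand, score in scores.most_common()]


if __name__ == "__main__":
    for candidate, likelihood in answer_question("Which chess computer was built by IBM?"):
        print(f"{candidate}: {likelihood:.2f}")
```

With this toy corpus, the question about the chess computer ranks "Deep Blue" ahead of "Watson", because the Deep Blue sentence shares more keywords with the question.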

DeepQA goes well beyond simple question reformulation or keyword analyses. Questions that need disambiguation, use unfamiliar syntax, carry spatial or temporal constraints – or are simply badly framed – call for a deeper level of content and text analysis.

4.  What is unique about the QA implementation for Watson?

Competing with humans on Jeopardy! poses an additional set of challenges, including the variety of question types and styles, the broad and varied range of topics, and the demand for high degrees of confidence and speed.  Together, these required a whole new approach to the problem.

5.  How does QA technology compare to document search?

The key difference between QA technology and document search is that document search takes a keyword query and returns a list of documents, ranked in order of relevance to the query (often based on popularity and page ranking), while QA technology takes a question expressed in natural language, seeks to understand it in much greater detail, and returns a precise answer to the question.
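The contrast is easiest to see in what each approach hands back. The sketch below is a deliberately tiny Python illustration under my own assumptions: the three documents, the function names and the single "Who hosts X?" question pattern are all made up for the example, and real search engines and QA systems are far more sophisticated.

```python
import re

# Hypothetical document collection, for illustration only.
DOCS = {
    "doc1": "Alex Trebek has hosted Jeopardy! since 1984.",
    "doc2": "Jeopardy! is a quiz show with clues phrased as answers.",
    "doc3": "Chess is a board game for two players.",
}


def words(text):
    """Set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def document_search(query):
    """Keyword search: return matching document ids, most keyword overlap first."""
    q = words(query)
    scored = [(len(q & words(body)), doc_id) for doc_id, body in DOCS.items()]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [doc_id for score, doc_id in ranked if score > 0]


def question_answering(question):
    """Toy QA: handles only 'Who hosts/hosted X?' and returns one name, or None."""
    asked = re.match(r"who host(?:s|ed) (.+?)\??$", question.strip(), re.IGNORECASE)
    if not asked:
        return None
    show = asked.group(1).lower()
    for body in DOCS.values():
        found = re.search(r"^(.+?) has hosted (.+?) since", body)
        if found and found.group(2).lower() == show:
            return found.group(1)
    return None


if __name__ == "__main__":
    print(document_search("jeopardy host"))        # a list of documents to read
    print(question_answering("Who hosts Jeopardy!?"))  # a single answer
```

The search function leaves the reading to you; the QA function does the reading and returns just the name.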

I touched on the frustrations of search in my previous posting, “Goodbye Search … It’s About Finding Answers … Enter Watson vs. Jeopardy.”

6.  How does Watson compare to the chess-playing system, Deep Blue?

Deep Blue demonstrated that computers can solve problems once thought the exclusive domain of human intelligence, albeit perhaps in very different ways than humans do.  Deep Blue was an amazing achievement in the application of compute power to an extraordinarily challenging but computationally well-defined and well-bounded game.  By searching and evaluating a huge space of possible chess board configurations, Deep Blue had the compute power to beat a grandmaster.
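For readers curious about what "searching and evaluating a space of configurations" looks like, here is a small Python sketch of exhaustive game-tree search (minimax). It is not Deep Blue's implementation; I have swapped chess for a much smaller game of my own choosing (one pile of stones, each player removes 1 to 3, whoever takes the last stone wins) so that the whole space fits in a few lines. The function names are hypothetical.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # a player may remove 1, 2 or 3 stones per turn


@lru_cache(maxsize=None)
def minimax(stones):
    """Return +1 if the player to move can force a win, otherwise -1."""
    if stones == 0:
        # The previous player took the last stone, so the player to move has lost.
        return -1
    # Try every legal move; the opponent's result is negated to get ours.
    return max(-minimax(stones - take) for take in MOVES if take <= stones)


def best_move(stones):
    """Pick the move with the best outcome for the player to move (stones >= 1)."""
    legal = [take for take in MOVES if take <= stones]
    return max(legal, key=lambda take: -minimax(stones - take))


if __name__ == "__main__":
    for pile in range(1, 9):
        outcome = "win" if minimax(pile) == 1 else "loss"
        print(f"{pile} stones: {outcome} for the player to move, take {best_move(pile)}")
```

The game is well-bounded, so the program can simply look at every possibility. That is exactly the property that natural language questions lack.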

Watson faces a challenge that is entirely open-ended and defies the sort of well-bounded mathematical formulation that fits a game like Chess.  Watson has to operate in the near limitless, ambiguous and highly contextual domain of human language and knowledge.  Ultimately Watson’s scientific goal is to demonstrate how computers can get at the meaning behind a natural language question and infer precise answers from huge volumes of content, with justifications that ultimately make sense to humans.

Rather than challenging the human to search a vast mathematical space, the Watson project challenges the computer to operate in human terms.  Watson strives to understand and answer human questions and to know when it does and doesn’t know the answer.  The capability to assess its own knowledge and abilities, something humans find relatively easy, is exceedingly difficult for computers.
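One simple way to picture "knowing when it doesn't know" is a confidence threshold: only answer when the best candidate is confident enough. The Python sketch below is my own illustration of that idea; the function name, the 0.5 threshold and the sample confidence values are assumptions, not Watson's actual strategy or numbers.

```python
def decide_to_buzz(candidates, threshold=0.5):
    """Return the best answer if its confidence clears the threshold, else None.

    candidates: list of (answer, confidence) pairs, confidence between 0 and 1.
    threshold:  assumed cut-off below which the system stays silent.
    """
    if not candidates:
        return None
    best_answer, best_confidence = max(candidates, key=lambda c: c[1])
    return best_answer if best_confidence >= threshold else None


if __name__ == "__main__":
    # Made-up confidence values for illustration.
    print(decide_to_buzz([("Deep Blue", 0.8), ("Watson", 0.1)]))  # confident: answer
    print(decide_to_buzz([("Chicago", 0.3), ("Toronto", 0.2)]))   # unsure: stay silent
```

The hard part, of course, is producing confidence values that can be trusted in the first place.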

7.  How would this QA technology be used in a business setting?

DeepQA technology provides humans with a powerful tool for their information gathering and decision support.  One of many possible scenarios could be for the end user to enter their question in natural language form, much as if they were asking another person, and for the system to sift through vast amounts of potential evidence to return a ranked list of the most compelling, precise answers along with links to supporting or refuting evidence.  Other important scenarios will use DeepQA to analyze a collection of content and data representing a problem, for example a technical support problem or a medical case.  DeepQA will start to search for a solution by gathering and assessing evidence from many disparate data sources, engaging human users to provide the missing pieces of information that can help arrive at a solution or, in the case of medicine, a differential diagnosis.

In addition, these answers would include summaries of their justifying or supporting evidence, allowing the user to quickly assess the evidence and select the correct answer.

Business applications include Customer Relationship Management, Regulatory Compliance, Contact Centers, Help Desks, Web Self-Service, Business Intelligence and more.  These applications will demand a deep understanding of users’ questions and analysis of huge volumes of natural language, structured and semi-structured content to rapidly deliver and justify precise, succinct, and high-confidence answers.

8.  What is the role of Unstructured Information Management Architecture (UIMA) in DeepQA and the Watson project?

Unstructured Information Management Architecture (UIMA) is the IBM-developed open-source framework for analysis of unstructured content, such as natural language text, speech, images and video, which Watson uses to integrate and deploy a broad collection of deep analysis algorithms over vast amounts of content.

A number of IBM ECM products are based on and leverage UIMA today. IBM Content Analytics, IBM OmniFind Enterprise Edition, IBM eDiscovery Analyzer and IBM Classification Module all are powered by, or benefit from, natural language processing and UIMA.

9.  Are any Enterprise Content Management (ECM) technologies actually part of Watson?

Yes, IBM Content Analytics is part of Watson.  After the question is asked, the text needs to be processed using natural language processing.  IBM Content Analytics (LanguageWare) and other techniques (secret sauce) are used to process the text, and understand the question, as part of the complex processing required to fully answer questions with confidence.  I will tackle this issue in more detail in my next blog posting.

IBM Content Analytics (or ICA) is a content analysis platform used to derive rapid insight from content and data.  It can transform raw information into business insight quickly without building models or deploying complex systems, enabling businesses to derive insight in hours or days … not weeks or months.  It’s easy to use and designed for any knowledge worker who needs to search and explore content.  ICA can be extended for deeper insights by integrating with Cognos, SPSS, InfoSphere, Netezza and other Business Intelligence, Analytics and Data Warehouse systems.

The ICA product itself includes tooling (LanguageWare) which is used to customize NLP processing and build industry- or customer-specific models and solutions.  This capability is at the core of natural language processing and is the very same ICA capability that is used in Watson.

10.  Who is going to win on February 14-16th?

My prediction … Watson is.

I watched the video of the practice round yesterday, and Watson performed impressively against Ken Jennings and Brad Rutter (its two opponents and the all-time champions).

So … who won the Jeopardy! practice round?

Watson won handily …

http://www.youtube.com/watch?v=12rNbGf2Wwo

Watson’s score was $4,400, beating Jennings by $1,000 and nearly quadrupling Rutter’s score.

IBM will donate 100% of Watson’s winnings to charity, while Rutter and Jennings said they will each donate 50% of their prizes. 

I am going to host a viewing party for colleagues, friends and family.  This is going to be exciting and fun … I can’t wait.  As an IBMer, I’ll be rooting for Watson to win but not for the obvious reason.  My rooting is really about my passion for the amazing technology breakthrough and the power of content analytics. 

Who do you think will win?  Leave me your thoughts below.