10 Things You Need to Know About the Technology Behind Watson

What is so fascinating about a Computer System vs. Quiz Show?  The popularity of America’s favorite quiz show, Jeopardy!, stems from the unique challenges it poses to its contestants: the breadth of topics; the puns, metaphors, and slang in the questions; the speed it takes to buzz and answer.

These factors make Jeopardy! the perfect testing ground for Watson, the IBM computing system that can understand the complexities of human language and return a single, precise answer to a question.

Next month, IBM’s Watson will play Jeopardy! (on live network TV) against two of the all-time champions.  IBM offered the press a sneak peek this week at a practice round that included Alex Trebek.  After seeing the clips, I am getting excited and am convinced this technology breakthrough is something special.  Here’s what you need to know:

1.  What is Watson?

Watson is the name for IBM’s Question Answering (QA) computing system, built by a team of IBM Research scientists and university collaborators who set out to accomplish a grand challenge – to build a computing system that rivals a human’s ability to answer questions posed in natural language with speed, accuracy and confidence. It leverages Natural Language Processing (or NLP) to process extreme volumes of text.

Watson is powered by an IBM POWER7 platform to handle the massive analytics at the speeds required to analyze complex language and deliver correct responses to natural language clues.  The system is a combination of current and new IBM technologies optimized to meet the specialized demands of processing an enormous number of concurrent tasks and an enormous volume of content, while analyzing that content in real time.

2.  What is Natural Language Processing?

Natural language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. It describes a set of linguistic, statistical, and machine learning techniques that allow text to be analyzed and key information extracted for other uses such as Question Answering or Content Analytics.

3.  What are QA and DeepQA?

Question Answering (QA) is the task of automatically answering a question posed in natural language. It involves first trying to understand the question to determine what is being asked, then analyzing a wide variety of disparate content, mostly in the form of natural language documents, to find reasoned answers, and finally assessing, based on the evidence, the relative likelihood that the found answers are correct. The content collections can vary from small local document collections, to internal organization documents, to compiled newswire reports, to the World Wide Web. QA is regarded as the next step beyond current search engines.
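To make the three steps above concrete, here is a toy sketch in Python. It is only an illustration and not how Watson or DeepQA actually works: the three-sentence corpus, the stopword list, the capitalized-word heuristic and the keyword-overlap scoring are all assumptions I made up for the example.

```python
import re
from collections import Counter

# Hypothetical toy corpus; a real QA system draws on far larger collections.
CORPUS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Seine flows through Paris.",
]

# Hypothetical stopword list, just large enough for this example.
STOPWORDS = {"the", "is", "of", "what", "which", "a", "an", "through"}


def keywords(text):
    """Step 1 (understand the question): reduce text to its content words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def candidate_answers(question, corpus):
    """Step 2 (find answers): take capitalized words from passages that
    share at least one keyword with the question, weighted by the overlap."""
    q_terms = set(keywords(question))
    candidates = Counter()
    for passage in corpus:
        overlap = len(q_terms & set(keywords(passage)))
        if overlap == 0:
            continue
        for word in re.findall(r"\b[A-Z][a-z]+\b", passage):
            lowered = word.lower()
            if lowered not in q_terms and lowered not in STOPWORDS:
                candidates[word] += overlap
    return candidates


def answer(question, corpus):
    """Step 3 (assess likelihood): normalize the evidence scores into
    relative confidences and rank the candidates."""
    candidates = candidate_answers(question, corpus)
    total = sum(candidates.values())
    if total == 0:
        return []
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(text, score / total) for text, score in ranked]


if __name__ == "__main__":
    for text, confidence in answer("What is the capital of France?", CORPUS):
        print(f"{text}: {confidence:.2f}")
```

Even in this tiny example the "right" answer only wins because more of the evidence lines up behind it, which is the spirit of the third step.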

DeepQA goes well beyond simple question reformulation or keyword analyses. Questions that involve ambiguity, unfamiliar syntax, spatial or temporal constraints – or simply bad question framing – require a deeper level of content and text analysis.

4.  What is unique about the QA implementation for Watson?

Competing with humans on Jeopardy! poses an additional set of challenges: the variety of question types and styles, the broad and varied range of topics, and the demand for high degrees of confidence and speed. Together, these required a whole new approach to the problem.

5.  How does QA technology compare to document search?

The key difference between QA technology and document search is that document search takes a keyword query and returns a list of documents, ranked in order of relevance to the query (often based on popularity and page ranking), while QA technology takes a question expressed in natural language, seeks to understand it in much greater detail, and returns a precise answer to the question.
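Here is a small Python sketch of that contrast. Everything in it is hypothetical: the three-document collection, the keyword-count ranking and the single "capital of" pattern are stand-ins I chose for illustration, not the ranking of any real search engine or the method Watson uses.

```python
import re

# Hypothetical mini collection: title -> text.
DOCS = {
    "France overview": "Paris is the capital of France. France is in Europe.",
    "Germany overview": "Berlin is the capital of Germany.",
    "Travel tips": "Pack light when you visit a capital city.",
}


def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_search(query, docs):
    """Document search: keyword query in, ranked list of documents out."""
    terms = set(tokens(query))
    scores = {
        title: sum(1 for t in tokens(body) if t in terms)
        for title, body in docs.items()
    }
    hits = [(title, score) for title, score in scores.items() if score > 0]
    return [title for title, _ in sorted(hits, key=lambda h: (-h[1], h[0]))]


def question_answering(question, docs):
    """QA: natural language question in, one short answer out.
    This toy version only handles 'capital of X' questions."""
    match = re.search(r"capital of (\w+)", question, re.IGNORECASE)
    if not match:
        return None
    country = match.group(1)
    for body in docs.values():
        found = re.search(r"(\w+) is the capital of " + re.escape(country), body)
        if found:
            return found.group(1)
    return None


if __name__ == "__main__":
    print(document_search("capital France", DOCS))
    print(question_answering("What is the capital of France?", DOCS))
```

The search function hands back documents you still have to read, while the QA function hands back the answer itself.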

I touched on the frustrations of search in my previous posting, Goodbye Search … It’s About Finding Answers … Enter Watson vs. Jeopardy!

6.  How does Watson compare to the chess-playing system, Deep Blue?

Deep Blue demonstrated that computers can solve problems once thought the exclusive domain of human intelligence, albeit in perhaps very different ways than humans do.  Deep Blue was an amazing achievement in the application of compute power to an extraordinarily challenging but computationally well-defined and well-bounded game.  By searching and evaluating a huge space of possible chess board configurations, Deep Blue had the compute power to beat a grandmaster.

Watson faces a challenge that is entirely open-ended and defies the sort of well-bounded mathematical formulation that fits a game like Chess.  Watson has to operate in the near limitless, ambiguous and highly contextual domain of human language and knowledge.  Ultimately Watson’s scientific goal is to demonstrate how computers can get at the meaning behind a natural language question and infer precise answers from huge volumes of content, with justifications that ultimately make sense to humans.

Rather than challenging the human to search a vast mathematical space, the Watson project challenges the computer to operate in human terms.  Watson strives to understand and answer human questions and to know when it does and doesn’t know the answer.  The capability to assess its own knowledge and abilities, something humans find relatively easy, is exceedingly difficult for computers.
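The idea of "knowing when it doesn't know" can be sketched as a simple confidence threshold. This is my own minimal illustration in Python; the function name, the 0.5 threshold and the made-up confidence values are assumptions, not details of how Watson decides whether to buzz in.

```python
def decide(candidates, threshold=0.5):
    """Return the most confident answer, or None to stay silent.

    candidates: dict mapping answer text -> confidence between 0 and 1.
    threshold:  hypothetical minimum confidence required to answer.
    """
    if not candidates:
        return None
    best, confidence = max(candidates.items(), key=lambda kv: kv[1])
    return best if confidence >= threshold else None


if __name__ == "__main__":
    # Confident enough to answer.
    print(decide({"Paris": 0.8, "Lyon": 0.1}))
    # Best guess is too weak, so the system declines to answer.
    print(decide({"Paris": 0.3, "Lyon": 0.25}))
```

The hard part in practice is producing confidence numbers that deserve to be trusted, which this sketch simply takes as given.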

7.  How would this QA technology be used in a business setting?

DeepQA technology provides humans with a powerful tool for their information gathering and decision support.  One of many possible scenarios could be for the end user to enter their question in natural language form, much as if they were asking another person, and for the system to sift through vast amounts of potential evidence to return a ranked list of the most compelling, precise answers along with links to supporting or refuting evidence.  Other important scenarios will use DeepQA to analyze a collection of content and data representing a problem, for example a technical support problem or a medical case.  DeepQA will start to search for a solution, gathering and assessing evidence from many disparate data sources and engaging human users to help provide the missing pieces of information that can help arrive at a solution or, in the case of medicine, a differential diagnosis.

In addition, these answers would include summaries of their justifying or supporting evidence, allowing the user to quickly assess the evidence and select the correct answer.
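One way to picture such a result is as a ranked list of answers, each carrying its own evidence. The Python sketch below is purely illustrative: the class names, fields, source identifiers and the help desk example are all hypothetical and are not taken from any IBM product.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    source: str     # hypothetical document identifier or link
    snippet: str    # short summary of what the source says
    supports: bool  # True if it supports the answer, False if it refutes it


@dataclass
class Answer:
    text: str
    confidence: float
    evidence: list = field(default_factory=list)


def summarize(answers, top_n=3):
    """Rank answers by confidence and summarize the evidence for each."""
    ranked = sorted(answers, key=lambda a: a.confidence, reverse=True)[:top_n]
    lines = []
    for a in ranked:
        supporting = sum(1 for e in a.evidence if e.supports)
        refuting = len(a.evidence) - supporting
        lines.append(
            f"{a.text} ({a.confidence:.0%}): "
            f"{supporting} supporting, {refuting} refuting"
        )
    return lines


if __name__ == "__main__":
    answers = [
        Answer("Reset the router", 0.35,
               [Evidence("kb-2", "Reset did not help in a similar case", False)]),
        Answer("Replace the cable", 0.60,
               [Evidence("kb-1", "Faulty cable caused the same symptom", True),
                Evidence("kb-3", "Cable swap resolved the outage", True)]),
    ]
    for line in summarize(answers):
        print(line)
```

Keeping the evidence attached to each answer is what lets the user check the reasoning instead of taking the ranking on faith.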

Business applications include Customer Relationship Management, Regulatory Compliance, Contact Centers, Help Desks, Web Self-Service, Business Intelligence and more.  These applications will demand a deep understanding of users’ questions and analysis of huge volumes of natural language, structured and semi-structured content to rapidly deliver and justify precise, succinct, and high-confidence answers.

8.  What is the role of Unstructured Information Management Architecture (UIMA) in DeepQA and the Watson project?

Unstructured Information Management Architecture (UIMA) is the IBM-developed open-source framework for analysis of unstructured content, such as natural language text, speech, images and video, which Watson uses to integrate and deploy a broad collection of deep analysis algorithms over vast amounts of content.

A number of IBM ECM products are based on and leverage UIMA today. IBM Content Analytics, IBM OmniFind Enterprise Edition, IBM eDiscovery Analyzer and IBM Classification Module are all powered by, or benefit from, natural language processing and UIMA.

9.  Are any Enterprise Content Management (ECM) technologies actually part of Watson?

Yes, IBM Content Analytics is part of Watson.  After the question is asked, the text needs to be processed using natural language processing.  IBM Content Analytics (LanguageWare) and other techniques (secret sauce) are used to process the text, and understand the question, as part of the complex processing required to fully answer questions with confidence.  I will tackle this issue in more detail in my next blog posting.

IBM Content Analytics (or ICA) is a content analysis platform used to derive rapid insight from content and data.  It can transform raw information into business insight quickly without building models or deploying complex systems, enabling businesses to derive insight in hours or days … not weeks or months.  It’s easy to use and designed for any knowledge worker who needs to search and explore content.  ICA can be extended for deeper insights by integrating with Cognos, SPSS, InfoSphere, Netezza and other Business Intelligence, Analytics and Data Warehouse systems.

The ICA product itself includes tooling (LanguageWare) which is used to customize NLP processing and build industry or customer specific models and solutions.  This capability is at the core of natural language processing and is the very same ICA capability that is used in Watson.

10.  Who is going to win on February 14-16?

My prediction … Watson is.

I watched the video of the practice round yesterday, and Watson performed impressively against Ken Jennings and Brad Rutter (the two contestants and the all-time champions).

So … who won the Jeopardy! practice round?

Watson won handily …

http://www.youtube.com/watch?v=12rNbGf2Wwo

Watson’s score was $4,400, beating Jennings by $1,000 and nearly quadrupling Rutter’s score.

IBM will donate 100% of Watson’s winnings to charity, while Rutter and Jennings said they will each donate 50% of their prizes. 

I am going to host a viewing party for colleagues, friends and family.  This is going to be exciting and fun … I can’t wait.  As an IBMer, I’ll be rooting for Watson to win but not for the obvious reason.  My rooting is really about my passion for the amazing technology breakthrough and the power of content analytics. 

Who do you think will win?  Leave me your thoughts below.

Goodbye Search … It’s About Finding Answers … Enter Watson vs. Jeopardy!

Does anyone really like searching for stuff?  It conjures up images of looking through old boxes in the attic to find that one thing you can never seem to lay your hands on.  Recently, I went looking for my junior high school yearbook when someone “friended” me on Facebook and I couldn’t remember them.  The experience was exasperating. I looked through at least 20 boxes of stuff, started sneezing from the dust, and never found the darn yearbook.  As a result, I am still not sure I was actually in the same science class as this person.  The experience reminded me of today’s enterprise search limitations.  I blogged about this recently as part of my Top 10 Pet Peeves for 2010.

If you think about it … no one actually likes the searching part.  It’s no fun, nor is it intuitive.  You have to figure out a “query” or “search string” and hope for the best.  Maybe you’ll get lucky and maybe not.  It’s what I call the “search and hope” model, and it can be even more frustrating than my attic experience (I feel a sneeze coming on).

In an AIIM Industry Watch Survey earlier this year, one of the key findings was that 72% of the people surveyed said it’s harder, or much harder, to find information and documents held on their own internal systems compared to the Web.  That makes you scratch your head for sure.

In the end, no one “wants” to search anyway … it’s the thing we seek that we care about, and not the searching process.  All I wanted was an answer to my question, which was to see if I could remember this former classmate.

IBM has been working on systems to find answers since the 1950s, when the first steps were taken with research on machine-based learning.  More than 50 years (and many millions) later, we have history being made.  An IBM computing system (Watson) will play Jeopardy! live on television against Ken Jennings and Brad Rutter, the two all-time most successful contestants, in a series of battles to be aired February 14-16. The series will feature two matches to see if a machine can compete by interpreting real-language questions, in the Jeopardy! format, by using text analysis (natural language processing), automated classification and other technologies to find the correct answers.  Here is a brief overview of Watson.

Watson must find the answers in the same timeframe as the two former champs by processing and understanding the question, researching the possible answers, determining the response and answering more quickly than the two former champs … plus it has to be right. WOW!

Jeopardy! is the No. 1-rated quiz show in syndication, with more than 9 million daily viewers. Watson has already passed the test that Jeopardy! contestants take to make it on the show and has been warming up by competing against other former Jeopardy! players.  The top prize for the contest is $1 million, with $300,000 for second place and $200,000 for third. Jennings and Rutter plan to donate half their winnings to charity.  IBM will donate all of its winnings to charity.

I can’t wait to see this. I suspect my fascination has to do with my being involved with content analytics as part of my job at IBM.  Or maybe it’s just about the coolest thing ever.

Either way, finding answers sure beats searching and hoping … and this ought to be very very interesting.

Here is a deeper explanation of the DeepQA technology behind Watson for those who are as fascinated by this as I am.