IBM at 100: A Computer Called Watson

Watson is an efficient analytical engine that pulls many sources of data together in real time, leverages natural language processing, discovers insights, and determines a degree of confidence in each answer.

In my continuing series of IBM at 100 achievements, I saved the Watson achievement posting for today. In a historic event beginning tonight, in February 2011, IBM’s Watson computer will compete on Jeopardy! against the TV quiz show’s two biggest all-time champions. Watson is a supercomputer running software called DeepQA, developed by IBM Research. While the grand challenge driving the project is to win on Jeopardy!, the broader goal of Watson is to create a new generation of technology that can find answers in unstructured data more effectively than standard search technology.

Watson does a remarkable job of understanding a tricky question and finding the best answer. IBM’s scientists have been quick to say that Watson does not actually think. “The goal is not to model the human brain,” said David Ferrucci, who spent 15 years working at IBM Research on natural language problems and finding answers amid unstructured information. “The goal is to build a computer that can be more effective in understanding and interacting in natural language, but not necessarily the same way humans do it.”

Computers have never been good at finding answers. Search engines don’t answer a question–they deliver thousands of search results that match keywords. University researchers and company engineers have long worked on question-answering software, but even the very best could only comprehend and answer simple, straightforward questions (How many Oscars did Elizabeth Taylor win?) and would typically still get them wrong nearly one third of the time. That wasn’t good enough to be useful, much less beat Jeopardy! champions.

The questions on this show are full of subtlety, puns and wordplay—the sorts of things that delight humans but choke computers. “What is The Black Death of a Salesman?” is the correct response to the Jeopardy! clue, “Colorful fourteenth century plague that became a hit play by Arthur Miller.” The only way to get to that answer is to put together pieces of information from various sources, because the exact answer is not likely to be written anywhere.

Watson leverages IBM Content Analytics for part of its natural language processing. Watson runs on a cluster of Power 750™ servers—ten racks holding 90 servers, for a total of 2,880 processor cores. It’s really a room lined with black cabinets stuffed with thousands of processor cores plus storage systems that can hold the equivalent of about one million books’ worth of information. Over a period of years, Watson was fed mountains of information, including text from commercial sources, such as the World Book Encyclopedia, and sources that allow open copying of their content, such as Wikipedia and books from Project Gutenberg. Learn more about the technology under the covers in my previous posting 10 Things You Need to Know About the Technology Behind Watson.

When a question is put to Watson, more than 100 algorithms analyze the question in different ways and find many different plausible answers–all at the same time. Another set of algorithms ranks the answers and gives each a score. For each possible answer, Watson finds evidence that may support or refute it. So for each of hundreds of possible answers, it finds hundreds of pieces of evidence and then, with hundreds of algorithms, scores the degree to which the evidence supports the answer. The answer with the best evidence assessment earns the most confidence and becomes Watson’s response. However, during a Jeopardy! game, if the highest-ranking possible answer isn’t rated high enough to give Watson sufficient confidence, Watson decides not to buzz in and risk losing money if it’s wrong. The Watson computer does all of this in about three seconds.
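As a rough illustration, the flow just described (generate many candidate answers, score each against evidence, and only buzz in when the top answer clears a confidence threshold) can be sketched in a few lines. This is a toy sketch, not IBM’s DeepQA; every function name and number here is invented for illustration.

```python
def rank_candidates(candidates, evidence_scorers):
    """Score each candidate answer with every evidence scorer and
    average the results into a single confidence value."""
    ranked = []
    for answer in candidates:
        scores = [scorer(answer) for scorer in evidence_scorers]
        confidence = sum(scores) / len(scores)
        ranked.append((answer, confidence))
    # Highest-confidence answer first.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked

def decide_to_buzz(ranked, threshold=0.5):
    """Buzz in only if the best answer's confidence clears the
    threshold; otherwise stay silent rather than risk losing money."""
    best_answer, confidence = ranked[0]
    return best_answer if confidence >= threshold else None
```

In the real system the scorers number in the hundreds and run in parallel, but the shape of the decision is the same: rank by aggregated evidence, then apply a buzz threshold.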

By late 2010, in practice games at IBM Research in Yorktown Heights, N.Y., Watson was good enough at finding the correct answers to win about 70 percent of games against former Jeopardy! champions. Then in early 2011, Watson went up against Jeopardy! superstars Ken Jennings and Brad Rutter.

Watson’s question-answering technology is expected to evolve into a commercial product. “I want to create something that I can take into every other retail industry, in the transportation industry, you name it,” John Kelly, who runs IBM Research, told The New York Times. “Any place where time is critical and you need to get advanced state-of-the-art information to the front decision-makers. Computers need to go from just being back-office calculating machines to improving the intelligence of people making decisions.”

When you’re looking for an answer to a question, where do you turn? If you’re like most people these days, you go to a computer, phone or mobile device, and type your question into a search engine. You’re rewarded with a list of links to websites where you might find your answer. If that doesn’t work, you revise your search terms until you find the answer. We’ve come a long way since the time of phone calls and visits to the library to find answers.

But what if you could just ask your computer the question, and get an actual answer rather than a list of documents or websites? Question answering (QA) computing systems are being developed to understand simple questions posed in natural language, and provide the answers in textual form. You ask “What is the capital of Russia?” The computer answers “Moscow,” based on the information that has been loaded into it.

IBM is taking this one step further, developing the Watson computer to understand the actual meaning behind words, distinguish between relevant and irrelevant content, and ultimately demonstrate enough confidence to deliver precise final answers. Because of its deeper understanding of language, it can process and answer more complex questions that include the puns, irony and riddles common in natural language. On February 14–16, 2011, IBM’s Watson computer will be put to the test, competing in three episodes of Jeopardy! against the two most successful players in the quiz show’s history: Ken Jennings and Brad Rutter.

The full text of this article can be found on IBM at 100: http://www.ibm.com/ibm100/us/en/icons/watson/

As for me … I am anxiously waiting to see what happens starting tonight. See my previous blog postings on Watson: “What is Content Analytics?, Alex”, 10 Things You Need to Know About the Technology Behind Watson, and Goodbye Search … It’s About Finding Answers … Enter Watson vs. Jeopardy!

Good luck tonight to Watson, Ken Jennings and Brad Rutter … may the best man win (so to speak)!

Introducing IBM at 100: Patents and Innovation

With the looming Jeopardy! challenge competition involving IBM Watson, I am feeling proud of my association with IBM, in part because IBM is an icon of business. As a tribute, I plan to re-post a few of the notable achievements by IBM and IBMers from the past 100 years, as an attempt to put the company’s century of contributions into perspective. Has IBM made a difference in our world … our planet? What kind of impact has IBM had on the world? Is it really a smarter planet as a result of the past 100 years?

I hope to answer these and other questions through these posts.  A dedicated website has these postings and much more about IBM’s past 100 years.   There is also a great overview video.  Check back often.  New stories will be added throughout the centennial year.  Let’s start with Patents and Innovation … a cornerstone of IBM’s heritage and reputation.

IBM’s 100 Icons of Progress

In the span of a century, IBM has evolved from a small business that made scales, time clocks and tabulating machines to a globally integrated enterprise with 400,000 employees and a strong vision for the future. The stories that have emerged throughout our history are complex tales of big risks, lessons learned and discoveries that have transformed the way we work and live. These 100 iconic moments—these Icons of Progress—demonstrate our faith in science, our pursuit of knowledge and our belief that together we can make the world work better.

Patents and Innovation

By hiring engineer and inventor James W. Bryce in 1917, Thomas Watson Sr. showed his commitment to pure inventing. Bryce and his team established IBM as a long-term leader in the development and protection of intellectual property. By 1929, 90 percent of IBM’s products were the result of Watson’s investments in R&D. In 1940, the team invented a method for adding and subtracting using vacuum tubes—a basic building block of the fully electronic computers that transformed business in the 1950s. This pattern—using innovation to create intellectual property—shaped IBM’s history.

On January 26, 1939, James W. Bryce, IBM’s chief engineer, dictated a two-page letter to Thomas J. Watson, Sr., the company’s president. It was an update on the research and patents he had been working on. Today, the remarkable letter serves as a window into IBM’s long-held role as a leader in the development and protection of intellectual property.

Bryce was one of the most prolific inventors in American history, racking up more than 500 U.S. and foreign patents by the end of his career. In his letter to Watson, he described six projects, each of which would be considered a signature life achievement for the average person. They included research into magnetic recording of data, an investigation into the use of light rays in computing and plans with Harvard University for what would become one of the first digital computers. But another project was perhaps most significant. Wrote Bryce: “We have been carrying on an investigation in connection with the development of computing devices which do not employ the usual adding wheels, but instead use electronic effects and employ tubes similar to those used in radio work.”

The investigation bore fruit. On January 15, 1940, Arthur H. Dickinson, Bryce’s top associate and a world-beating inventor in his own right, submitted an application for a patent for “certain improvements in accounting apparatus.” In fact, the patent represented a turning point in computing history. Dickinson, under Bryce’s supervision, had invented a method for adding and subtracting using vacuum tubes—a basic building block of the fully electronic computers that began to appear in the 1940s and transformed the world of business in the 1950s.

This pattern—using innovation to create intellectual property—is evident throughout IBM’s history. Indeed, intellectual property has been strategically important at IBM since before it was IBM.

The full text of this article can be found on IBM at 100: http://www.ibm.com/ibm100/us/en/icons/patents/

“What is Content Analytics?, Alex”

“The technology behind Watson represents the future of data management and analytics.  In the real world, this technology will help us uncover insights in everything from traffic to healthcare.”

– John Cohn, IBM Fellow, IBM Systems and Technology Group

How can the same technology used to play Jeopardy! give you better business insight?

Why Watson matters

You have to start by understanding that IBM Watson DeepQA is the world’s most advanced question answering machine. It uncovers answers by understanding the meaning buried in the context of a natural language question. By combining advanced Natural Language Processing (NLP) and DeepQA automatic question answering technology, Watson represents the future of content and data management, analytics, and systems design. IBM Watson leverages core content analysis, along with a number of other advanced technologies, to arrive at a single, precise answer within a very short period of time. The business applications for this technology are limitless, starting with clinical healthcare, customer care, government intelligence and beyond. I covered the technology side of Watson in my previous posting 10 Things You Need to Know About the Technology Behind Watson.

Watson analyzes the content of a Jeopardy! question in stages, much as a person would. First, it tries to understand the question to determine what is being asked; to do so, it must analyze the natural language text. Next, it tries to find reasoned answers by analyzing a wide variety of disparate content, mostly in the form of natural language documents. Finally, Watson assesses the relative likelihood that the answers it found are correct, based on a confidence rating.

A great example of the challenge is described by Stephen Baker in his book Final Jeopardy: Man vs. Machine and the Quest to Know Everything: “When 60 Minutes premiered, this man was U.S. President.” Traditionally, it has been difficult for a computer to understand what “premiered” means and that it is associated with a date. To a computer, “premiere” could also mean “premier”. Is the clue about a person’s title or a production opening? Then it has to figure out the date when an entity called “60 Minutes” premiered, and then find out who was U.S. President at that time. In short, it requires a great deal of contextual understanding.
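The decomposition Baker describes can be mimicked in miniature: first resolve one sub-question (when did “60 Minutes” premiere?), then use its result to answer the second (who held the office of U.S. President on that date?). The tiny “knowledge base” below is invented for illustration only; the real system consults mountains of unstructured text rather than a hand-built table.

```python
# Toy facts standing in for a real corpus.
PREMIERES = {"60 Minutes": 1968}
PRESIDENTS = [  # (start_year, end_year_exclusive, name)
    (1963, 1969, "Lyndon B. Johnson"),
    (1969, 1974, "Richard Nixon"),
]

def president_when_premiered(show):
    """Chain two sub-questions: premiere date, then office holder."""
    year = PREMIERES[show]               # sub-question 1: when did it premiere?
    for start, end, name in PRESIDENTS:  # sub-question 2: who was president then?
        if start <= year < end:
            return name
    return None
```

The hard part for Watson is not the lookup itself but recognizing, from raw language, that this two-step chain is what the clue is asking for.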

I am not talking about search here.  This is far beyond what search tools can do.  A recent Forrester report, Take Control Of Your Content, states that 45% of the US workforce spends three or more hours a week just searching for information.  This is completely inefficient.  See my previous posting Goodbye Search … It’s About Finding Answers … Enter Watson vs. Jeopardy! for more on this topic.

Natural Language Processing (NLP) can be leveraged in any situation where text is involved. Besides answering questions, it can help improve enterprise search results or even develop an understanding of the insight hidden in the content itself.  Watson leverages the power of NLP as the cornerstone to translate interactions between computers and human (natural) languages.

NLP involves a series of steps that make text understandable (or computable). A critical step, lexical analysis, is the process of converting a sequence of characters into a set of tokens. Subsequent steps leverage these tokens to perform entity extraction (people, places, things), concept identification (person A belongs to organization B) and the annotation of documents with this and other information. A component of IBM Content Analytics (known as LanguageWare) performs the lexical analysis function in Watson as part of natural language processing.
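A minimal sketch of these first steps, in illustrative Python (this is not LanguageWare): lexical analysis splits raw characters into tokens, and a later, deliberately naive step tags capitalized tokens as candidate entities. Real NLP pipelines use far more sophisticated linguistic rules and dictionaries.

```python
import re

def tokenize(text):
    """Lexical analysis: convert a character sequence into word tokens."""
    return re.findall(r"[A-Za-z0-9']+", text)

def extract_entities(tokens):
    """Naive entity extraction: flag capitalized tokens after the
    sentence-initial word as candidate people, places or things."""
    return [t for t in tokens[1:] if t[0].isupper()]
```

Later stages would then link these candidates to known concepts and annotate the source document with the results.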

Why this matters to your business

Jeopardy! poses a similar set of contextual information challenges as those found in the business world today:

  • Over 80 percent of the information being stored is unstructured (i.e., text based).
  • Understanding that 80-plus percent isn’t simple. As with Jeopardy!, subtle meaning, irony, riddles, acronyms, abbreviations and other complexities all present unique computing challenges, not found with structured data, when deriving meaning and insight. This is where natural language processing (NLP) comes in.

The same core NLP technology used in Watson is available now, delivering business value by unlocking the insights trapped in the massive amounts of unstructured information in the many systems and formats you have today. Understanding the content, context and value of this unstructured information presents an enormous opportunity for your business. This is already being done in a number of industries by leveraging IBM Content Analytics.

IBM Content Analytics (ICA) itself is a platform for deriving rapid insight. It can transform raw information into business insight quickly, without building models or deploying complex systems, enabling all knowledge workers to derive insight in hours or days … not weeks or months. It helps address industry-specific problems such as healthcare treatment effectiveness, fraud detection, product defect detection, public safety concerns, customer satisfaction and churn, crime and terrorism prevention, and more. Here are some actual customer examples:

Healthcare Research – Like most healthcare providers, BJC HealthCare had a treasure trove of historical information trapped in unstructured clinical notes and diagnostic reports containing essential information for the study of disease progression, treatment effectiveness and long-term outcomes. Their existing Biomedical Informatics (BMI) resources were disjointed, non-interoperable, available only to a small fraction of researchers, and frequently redundant, with no capability to tap into the wealth of research information trapped in unstructured clinical notes, diagnostic reports and the like.

With IBM Content Analytics, BJC and university researchers are now able to analyze unstructured information to answer key questions that were previously out of reach. Questions like: Does the patient smoke? How often and for how long? If smoke free, for how long? What home medications is the patient taking? What is the patient sent home with? What was the diagnosis, and what procedures were performed on the patient? BJC now has deeper insight into medical information and can uncover trends and patterns within their content to provide better healthcare to their patients.

Customer Satisfaction – Identifying customer satisfaction trends about products, services and personnel is critical to most businesses.  The Hertz Corporation and Mindshare Technologies, a leading provider of enterprise feedback solutions, are using IBM Content Analytics software to examine customer survey data, including text messages, to better identify car and equipment rental performance levels for pinpointing and making the necessary adjustments to improve customer satisfaction levels.

By using IBM Content Analytics, companies like Hertz can drive new marketing campaigns or modify their products and services to meet the demands of their customers. “Hertz gathers an amazing amount of customer insight daily, including thousands of comments from web surveys, emails and text messages. We wanted to leverage this insight at both the strategic level and the local level to drive operational improvements,” said Joe Eckroth, Chief Information Officer, the Hertz Corporation.

For more information about ICA at Hertz: http://www-03.ibm.com/press/us/en/pressrelease/32859.wss

Research Analytics – To North Carolina State University, the essence of a university is more than education – it is the advancement and dissemination of knowledge in all its forms.  One of the main issues faced by NC State was dealing with the vast number of data sources available to them.  The university sought a solution to efficiently mine and analyze vast quantities of data to better identify companies that could bring NC State’s research to the public.  The objective was a solution designed to parse the content of thousands of unstructured information sources, perform data and text analytics and produce a focused set of useful results.

Using IBM Content Analytics, NC State was able to reduce the time needed to find target companies from months to days.  The result is the identification of new commercialization opportunities, with tests yielding a 300 percent increase in the number of candidates.  By obtaining insight into their extensive content sources, NC State’s Office of Technology Transfer was able to find more effective ways to license technologies created through research conducted at the university. “What makes the solution so powerful is its ability to go beyond conventional online search methods by factoring context into its results.” – Billy Houghteling, executive director, NC State Office of Technology Transfer.

For more information about ICA at NC State: http://www-01.ibm.com/software/success/cssdb.nsf/CS/SSAO-8DFLBX?OpenDocument&Site=software&cty=en_us

You can put the technology of tomorrow to work for you today, by leveraging the same IBM Content Analytics capability helping to power Watson.  To learn more about all the IBM ECM products utilizing Watson technology, please visit these sites:

IBM Content Analytics: http://www-01.ibm.com/software/data/content-management/analytics/

IBM Classification Module: http://www-01.ibm.com/software/data/content-management/classification/

IBM eDiscovery Analyzer: http://www-01.ibm.com/software/data/content-management/products/ediscovery-analyzer/

IBM OmniFind Enterprise Edition: http://www-01.ibm.com/software/data/enterprise-search/omnifind-enterprise/

You can also check out the IBM Content Analytics Resource Center or watch the “what it is and why it matters” video.

I’ll be at the Jeopardy! viewing party in Washington, DC on February 15th and 16th … hope to see you there. In the meantime, leave me your thoughts and questions below.

WikiLeaks Disclosures … A Wakeup Call for Records Management

Earlier in my professional career, I used to hit the snooze button 4 or 5 times every morning when the alarm went off. I did this for years until I realized it was the root cause of being late to work and getting my wrists slapped far too often. It seems simple, but we all hit the snooze button even though we know the repercussions. Guess what … the repercussions are getting worse.

For years, the federal government has been hitting the snooze button on electronic records management. The GAO has been critical of the federal government’s ability to manage records and information, saying there is “little assurance that [federal] agencies are effectively managing records, including e-mail records, throughout their life cycle.” During the past few administrations, similar GAO reports and/or embarrassing public information-mismanagement incidents have reminded us (and not in a good way) of the importance of good recordkeeping and document control. You may recall incidents over missing emails involving both the Bush and Clinton administrations. Now we have WikiLeaks blabbing to the world with embarrassing disclosures of State Department and military documents, taking the impact of information mismanagement to a whole new level of public embarrassment, exposure and risk. Although it should not be surprising to anyone that this is happening, considering the previous incidents and GAO warnings, it has still caused quite a stir and had a measurable impact. Corporations should see this as a cautionary tale and a sign of things to come … so start preparing now.

Start by asking yourself: what would happen if your sensitive business records were made publicly available and the entire world were talking, blogging and tweeting about them? For most organizations, this is a very scary thought. Fortunately, there are solutions and best practices available today to protect enterprises from these scenarios.

Implement Electronic Records Management: Update your document control policies to include the handling of sensitive information, including official records. Do you even have an Information Lifecycle Governance strategy today? Start by getting the key stakeholders from Legal, Records and IT involved, at a minimum, and ensure you have top-down executive support. Implement an electronic records program and system based on an ECM repository you can trust (see my two earlier blogs on trusting repositories). This will put the proper controls, security and policy enforcement in place to govern information over its lifespan, including defensible disposition. Getting rid of information when you are supposed to dramatically reduces the risk of improper disclosure. Although implementing a records management system has many benefits, including reducing eDiscovery costs and risks, it is also the cornerstone of preventing information from falling into the wrong hands. Standards (DoD 5015.02-STD, ISO 15489), best practices (ARMA GARP) and communities (CGOC) exist to guide and accelerate the process. Records management can be complemented by Information Rights Management and/or Data Loss Prevention (DLP) technology for enhanced security and control options.
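The “defensible disposition” idea above can be sketched in a few lines: each record class carries a retention period, and records that have outlived it are flagged for disposal. The record classes and retention periods below are invented for illustration; real retention schedules come from your legal and records teams, not from code.

```python
from datetime import date, timedelta

# Hypothetical retention schedule: how long each record class is kept.
RETENTION = {
    "email": timedelta(days=365 * 3),     # e.g., keep email 3 years
    "contract": timedelta(days=365 * 7),  # e.g., keep contracts 7 years
}

def due_for_disposition(record_class, created, today=None):
    """Return True if a record has outlived its retention period and
    is therefore a candidate for defensible disposal."""
    today = today or date.today()
    return today - created > RETENTION[record_class]
```

Disposing of records on schedule, and being able to prove you did so consistently, is what makes disposition “defensible” and shrinks the pool of material that could leak in the first place.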

Leverage Content Analytics: Use content analytics to understand employee sentiment and to detect patterns of behavior that could lead to intentional disclosure of information. These technologies leverage text and content analytics to identify disgruntled employees before an incident occurs, enabling proactive investigation and management of potentially troublesome situations. They can also serve as background for any investigation that may happen in the event of an incident. Enterprises should proactively monitor for these risks and situations … as an ounce of prevention is worth a pound of cure. Content analytics can also be extended with predictive analytics to evaluate the probability of an incident and the associated exposure.

Leverage Advanced Case Management: Investigating and remediating any risk or fraud scenario requires advanced case management. These case-centric investigations are almost always ad hoc processes with unpredictable twists and turns. You need the ad hoc, collaborative nature of advanced case management to serve as a process backbone as the case proceeds and ultimately concludes. Having built-in audit trails, records management and governance ensures transparency into the process and minimizes the chance of any hanky-panky. Enterprises should consider advanced case management solutions that integrate with ECM repositories and records management for any content-centric investigation.

This adds up to one simple call to action … stop hitting the snooze button and take action. Any enterprise could be a target and ultimately a victim. The stakes are higher than ever before. Leverage solutions like records management, content analytics and advanced case management to improve your organization’s ability to secure, control and retain documents while monitoring for and remediating potentially risky disclosure situations.

Leave me your thoughts and ideas. I’ll read and respond later … after I am done hitting the snooze button a few times (kidding of course).