IBM at 100: TAKMI, Bringing Order to Unstructured Data

As most of you know … I have been periodically posting some of the really fascinating top 100 innovations of the past 100 years as part of IBM’s Centennial celebration.

This one is special to me as it represents what is possible for the future of ECM.  I wasn’t around for tabulating machines and punch cards but have long been fascinated by the technology developments in the management and use of content.  As impressive as Watson is … it is only the most recent step in a long journey IBM has been pursuing to help computers better understood natural language and unstructured information.

As most of you probably don’t know … this journey started over 50 years ago in 1957 when IBM published the first research on this subject entitled A Statistical Approach to Mechanized Encoding and Searching of Literary InformationFinally … something in this industry older then I am!

Unstructured Information Management Architecture (UIMA)

Another key breakthrough by IBM in this area was the invention of UIMA.  Now an Apache Open Source project and OASIS standard, UIMA is an open, industrial-strength platform for unstructured information analysis and search.  It is the only open standard for text based processing and applications.  I plan to write more on UIMA in a future blog but I mention it here because it was an important step forward for the industry, Watson and TAKMI (now known as IBM Content Analytics).

TAKMI

In 1997, IBM researchers at the company’s Tokyo Research Laboratory pioneered a prototype for a powerful new tool capable of analyzing text. The system, known as TAKMI (for Text Analysis and Knowledge Mining), was a watershed development: for the first time, researchers could efficiently capture and utilize the wealth of buried knowledge residing in enormous volumes of text. The lead researcher was Tetsuya Nasukawa.

Over the past 100 years, IBM has had a lot of pretty important inventions but this one takes the cake for me.  Nasukawa-san once said,

“I didn’t invent TAKMI to do something humans could do, better.  I wanted TAKMI to do something that humans could not do.”

In other words, he wanted to invent something humans couldn’t see or do on their own … and isn’t that the whole point and value of technology anyway?

By 1997, text was searchable, if you knew what to look for. But the challenge was to understand what was inside these growing information volumes and know how to take advantage of the massive textual content that you could not read through and digest.

The development of TAKMI quietly set the stage for the coming transformation in business intelligence. Prior to 1997, the field of analytics dealt strictly with numerical and other “structured” data—the type of tagged information that is housed in fixed fields within databases, spreadsheets and other data collections, and that can be analyzed by standard statistical data mining methods.

The technological clout of TAKMI lay in its ability to read “unstructured” data—the data and metadata found in the words, grammar and other textual elements comprising everything from books, journals, text messages and emails, to health records and audio and video files. Analysts today estimate that 80 to 90 percent of any organization’s data is unstructured. And with the rising use of interactive web technologies, such as blogs and social media platforms, churning out ever-expanding volumes of content, that data is growing at a rate of 40 to 60 percent per year.

The key for the success was natural language processing (NLP) technology. Most of the data mining researchers were treating English text data as a bag of words by extracting words from character strings based on white spaces. However, since Japanese text data does not contain white spaces as word separators, IBM researchers in Tokyo applied NLP for extracting words, analyzing their grammatical features, and identifying relationships among words. Such in-depth analysis led to better results in text mining. That’s why the leading-edge text mining technology originated in Japan.

The complete article on TAKMI can be found at http://www.ibm.com/ibm100/us/en/icons/takmi/

Fast forward to today.  IBM has since commercialized TAKMI as IBM Content Analytics (ICA), a platform to derive rapid insight.  It can transform raw information into business insight quickly without building models or deploying complex systems enabling all knowledge workers to derive insight in hours or days … not weeks or months.  It helps address industry specific problems such as healthcare treatment effectiveness, fraud detection, product defect detection, public safety concerns, customer satisfaction and churn, crime and terrorism prevention and more.

I’d like to personally congratulate Nasukawa-san and the entire team behind TAKMI (and ICA) for such an amazing achievement … and for making the list.  Selected team members who contributed to TAKMI are Tetsuya Nasukawa, Kohichi Takeda, Hideo Watanabe, Shiho Ogino, Akiko Murakami, Hiroshi Kanayama, Hironori Takeuchi, Issei Yoshida, Yuta Tsuboi and Daisuke Takuma.

It’s a shining example of the best form of innovation … the kind that enables us to do something not previously possible.  Being recognized along with other amazing achievements like the UPC code, the floppy disk, magnetic stripe technology, laser eye surgery, the scanning tunneling microscope, fractal geometry, human genomics mapping is really amazing.

This type of enabling innovation is the future of Enterprise Content Management.  It will be fun and exciting to see if TAKMI (Content Analytics) has the same kind of impact on computing as the UPC code has had on retail shopping … or as laser eye surgery has had on vision care.

What do you think?  As always, leave for your thoughts and comments.

Other similar postings:

Watson and The Future of ECM

“What is Content Analytics?, Alex”

10 Things You Need to Know About the Technology Behind Watson

Goodbye Search … It’s About Finding Answers … Enter Watson vs. Jeopardy!