
Introduction to Semantic Information Processing January 11, 2008

Posted by Jeff in enterprise 2.0, Technology.

One of the major research areas now appearing in the IT industry is called “semantic information processing”: the building and use of semantic information repositories.  A semantic information repository is a data collection that links concepts and names together.  We have probably all seen the need for this when performing Internet searches.  Suppose that we search for “chip”: this could mean a search for a semiconductor chip, a potato chip, or a person named Chip.  From the context, we can determine the meaning.

For instance, if I wrote the search:  “tell me about the nutrition value in chips”, I am probably talking about potato chips, since they are the only kind of chip that is food.

If I wrote the search “collect sales of the Intel T7200 processor chip”, the words “Intel” and “processor” would mean that I am talking about a computer chip or a semiconductor chip.
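The kind of context-based disambiguation in the two searches above can be sketched in a few lines. This is only a toy illustration; the sense categories and hint words are invented for the example, not taken from any real search engine.

```python
# Each candidate sense of "chip" gets a set of hint words (invented here).
SENSE_HINTS = {
    "semiconductor chip": {"intel", "processor", "cpu", "silicon", "transistor"},
    "potato chip": {"nutrition", "food", "snack", "calories", "salt"},
    "person named chip": {"met", "friend", "colleague", "mr", "mrs"},
}

def disambiguate(query: str) -> str:
    """Pick the sense whose hint words overlap the query the most."""
    words = set(query.lower().split())
    best = max(SENSE_HINTS, key=lambda sense: len(words & SENSE_HINTS[sense]))
    if not words & SENSE_HINTS[best]:
        return "ambiguous"  # no contextual evidence either way
    return best

print(disambiguate("tell me about the nutrition value in chips"))
# potato chip
print(disambiguate("collect sales of the Intel T7200 processor chip"))
# semiconductor chip
```

Real systems use statistical models trained on large corpora rather than hand-written hint lists, but the principle is the same: surrounding words select the sense.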

The intent is to enhance the usability and usefulness of the Web and its interconnected resources.

Historical Context

Early efforts in semantics were the knowledge representation and machine understanding efforts of the AI field in the late 60’s to late 80’s.  During this period, many researchers focused on how to represent knowledge such as scenes in stories, where character Jack is sitting in a house, Jack is married to Jill, Jack has a job of fetching water, water is located at the well, the well is located on a hill, and the hill is behind the house.  Each of these phrases connects two nouns with a relationship.
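The scene above can be written down directly as (subject, relation, object) triples, which is the same shape that RDF later standardized. A minimal sketch, using only the story's own names:

```python
# The Jack-and-Jill scene as noun-relation-noun triples.
facts = [
    ("Jack", "sits_in", "house"),
    ("Jack", "married_to", "Jill"),
    ("Jack", "has_job", "fetching water"),
    ("water", "located_at", "well"),
    ("well", "located_on", "hill"),
    ("hill", "behind", "house"),
]

def relations_of(subject):
    """Everything the knowledge base asserts about one noun."""
    return [(rel, obj) for s, rel, obj in facts if s == subject]

print(relations_of("Jack"))
# [('sits_in', 'house'), ('married_to', 'Jill'), ('has_job', 'fetching water')]
```

A program understanding a story about Jack could consult this structure to resolve references ("the well" is the one on the hill) instead of treating the text as bare strings.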

Such knowledge context would then be used to guide the understanding of text or speech.  While many interesting presentations and demonstrations were given, the effort faded when the “AI bubble” burst in the mid-to-late 1980’s.

Around the late 1990’s, some new efforts to organize semantic knowledge appeared based on the use of descriptive technologies such as Resource Description Framework (RDF) and Web Ontology Language (OWL), and the data-centric, customizable Extensible Markup Language (XML).  All take advantage of the hyperlinks already present in web-based content.

There was a very important article, “The Semantic Web”, published in Scientific American in 2001 by Tim Berners-Lee, James Hendler, and Ora Lassila.

Recent Products

With the web plus today’s faster computers, the semantic information concept has been brought back.

One of the most interesting products in this area is called “Twine”, by Radar Networks. According to the company’s description, Twine uses extremely advanced machine learning and natural-language processing algorithms that give it capabilities beyond anything that relies on manual tagging. The tool uses a combination of natural-language algorithms to automatically extract key concepts from collections of text, essentially tagging them automatically. These algorithms adroitly handle ambiguous sets of words, determining, for example, whether J.P. Morgan is a person or a company, depending on the context. And, the company claims, Twine can find the subject of a text even if a keyword is never mentioned, by using statistical machine learning to compare the text with data sources such as Wikipedia.

See also Powerset’s presentation at ISWC.

Data Sources for a Semantic Processing Application

One of the data sources used is WordNet, a lexical database populated with 137,543 word-matching pairs.

These applications require, in part or whole, data that is available for sharing either within or across an enterprise. Represented in RDF, this data can be generated from a standard database, mined from existing Web sources, or produced as markup of document content.

Machine-readable vocabularies for describing these data sets or documents are likewise required. The core of many Semantic Web applications is an ontology, a machine-readable domain description, defined in RDFS or OWL. These vocabularies can range from a simple “thesaurus of terms” to an elaborate expression of the complex relationships among the terms or rule sets for recognizing patterns within the data.
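The range from a simple thesaurus to a structured ontology can be sketched in miniature. The terms and hierarchy below are invented for illustration; a real ontology would express the is-a relationships in RDFS (`rdfs:subClassOf`) or OWL rather than a Python dict.

```python
# Simplest vocabulary: a thesaurus mapping a preferred term to synonyms.
thesaurus = {
    "customer": {"client", "account holder", "buyer"},
    "supplier": {"vendor", "provider"},
}

# One step richer: an is-a hierarchy (child class -> parent class),
# analogous to rdfs:subClassOf assertions.
subclass_of = {
    "preferred_customer": "customer",
    "customer": "business_partner",
    "supplier": "business_partner",
}

def ancestors(cls):
    """Walk the is-a chain upward, like computing a subClassOf closure."""
    chain = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        chain.append(cls)
    return chain

print(ancestors("preferred_customer"))
# ['customer', 'business_partner']
```

An application can then reason that anything true of a business partner also applies to a preferred customer, which is exactly the kind of inference an OWL reasoner performs at scale.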

The advent of RDF query languages has made it possible to create three-tiered Semantic Web applications similar to standard Web applications.  In these applications, queries are issued from the middle tier to the semantic repositories on the back tier.
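In a real deployment the middle tier would send a SPARQL query to a triple store; as a stand-in, the sketch below matches a single triple pattern against an in-memory store. The data and predicate names are invented for the example.

```python
# A tiny back-tier "repository" of triples (invented example data).
store = [
    ("Intel_T7200", "type", "processor_chip"),
    ("Intel_T7200", "made_by", "Intel"),
    ("Lays_Classic", "type", "potato_chip"),
]

def match(pattern):
    """Return variable bindings for one triple pattern.

    Pattern elements starting with '?' are variables; others must
    match the stored triple exactly, as in a SPARQL basic pattern.
    """
    results = []
    for triple in store:
        binding = {}
        for pat, val in zip(pattern, triple):
            if pat.startswith("?"):
                binding[pat] = val
            elif pat != val:
                break
        else:
            results.append(binding)
    return results

# Miniature of: SELECT ?x WHERE { ?x type processor_chip }
print(match(("?x", "type", "processor_chip")))
# [{'?x': 'Intel_T7200'}]
```

The middle tier would translate user requests into such queries and render the bindings, just as a conventional web application translates requests into SQL.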

However, there is a three-way challenge that is holding up the implementation of semantic web systems:

  • motivating companies or governments to release data
  • motivating ontology designers to build and share domain descriptions
  • motivating Web application developers to explore Semantic-Web-based applications

Web 3.0

We have all heard of Web 2.0, so what would Web 3.0 be?

Some of the best forecasts that I have seen match the above discussion.  A Web 3.0 application is a Web 2.0 application that has knowledge and “thinks”.

The concept of semantic information processing has also been appearing in the current IT world.  In one of Bob Parker’s presentations, he describes the role of a “Semantic Information Repository” as:

“essential to improving decision making will be the ability to organize all types of information.  At the heart of the repository for large organizations will be an operational data store that can organize large volumes of transactional data into hierarchical, analytic friendly forms.  The data store should be augmented by effective master data management that can provide a harmonized view of key subject matters like suppliers, products, assets, customers, and employees in the context of the value chain being monitored.  The ability to bring some structure to unstructured content like documents completes the repository.”

Some of the places where semantic resolution of information is proving important:

  • Data cleansing – find product name equivalences
  • Business process management – it becomes easier to write a business process if all of the terms and references have been reduced to standards by a semantic pre-processing layer in the runtime engine.
  • Business intelligence – it becomes easier to generate intelligence and conclusions if the corresponding data sets and events have been standardized through a semantic processing step.
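The data-cleansing case above can be sketched as a canonicalization step run before analysis. The product-name variants here are made up; a production system would derive the equivalence table from the semantic repository rather than hard-coding it.

```python
# Map raw product-name variants to one canonical form (invented examples).
equivalences = {
    "intel t7200": "Intel T7200",
    "t7200 core 2 duo": "Intel T7200",
    "intel(r) t7200 cpu": "Intel T7200",
}

def canonical(name: str) -> str:
    """Return the canonical product name, or the input if none is known."""
    return equivalences.get(name.strip().lower(), name)

print(canonical("  Intel T7200 "))     # Intel T7200
print(canonical("T7200 Core 2 Duo"))   # Intel T7200
print(canonical("AMD Athlon"))         # AMD Athlon (no mapping known)
```

Once every record refers to “Intel T7200” the same way, downstream sales roll-ups and BI queries can aggregate correctly instead of splitting one product across three spellings.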