Sandeepan Banerjee





The Information Grid - XML and Databases Moving Toward Convergence


SOA and Business Process Management
SOAs underpin application virtualization in the Information Grid. The foundation of SOA is a set of independent, well-defined encapsulations of software functionality that can be invoked over a network from heterogeneous platforms and execution environments. An SOA connects these independent services toward a larger purpose and, where the services must occur in a particular order, orchestrates their execution in the correct sequence; languages such as Business Process Execution Language (BPEL) provide a standard for orchestrating processes into complex business flows.
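
As a rough illustration of orchestration, the sketch below chains three stand-in services in a fixed order, passing each one's output to the next. It is not BPEL and makes no network calls; the service names (check_credit, reserve_stock, send_invoice), the message fields, and the credit limit are all hypothetical.

```python
from typing import Callable, Dict, List

# A "service" here is any callable that takes a message and returns a new one.
# In a real SOA each would be a network invocation; these are local stand-ins.
Service = Callable[[Dict], Dict]


def check_credit(order: Dict) -> Dict:
    # Hypothetical rule: orders up to 1000 pass the credit check.
    return {**order, "credit_ok": order["amount"] <= 1000}


def reserve_stock(order: Dict) -> Dict:
    # Stock is reserved only if the credit check passed.
    return {**order, "reserved": order["credit_ok"]}


def send_invoice(order: Dict) -> Dict:
    # An invoice is issued only for reserved orders.
    return {**order, "invoiced": order["reserved"]}


def orchestrate(steps: List[Service], message: Dict) -> Dict:
    """Invoke each service in order, feeding each result to the next step."""
    for step in steps:
        message = step(message)
    return message


result = orchestrate(
    [check_credit, reserve_stock, send_invoice],
    {"order_id": 1, "amount": 250},
)
print(result)
```

The services know nothing about one another; only the orchestrator fixes the sequence, which is the property that lets the same services be recombined into other flows.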

SOAs are implemented using XML-based Web services standards. Web services have succeeded where earlier distributed computing architectures failed, for three reasons: simpler standards, broader adoption, and looser coupling. Web services are not only based on simpler standards (HTTP, SOAP) than, say, CORBA, but they have also been broadly incorporated into packaged software and adopted by companies across many industries.

Repository: Metadata, Schema, and Service Management
The Semantic Web community has found that the main obstacle to effectively sharing and reusing data on the Web is the lack of machine-readable standards for the semantics (i.e., meanings) of Web (typically HTML-based) content. Thus, the Semantic Web is often associated with specific XML-based standards for semantics, such as Resource Description Framework (RDF) and Web Ontology Language (OWL). Within enterprises, almost every product or service is looking to provide an "XML out" that publishes data in a self-describing, standard way for use by other applications. From financial reporting (XBRL) to Web site feeds (RSS) to legal information exchange (LegalXML), XML is the dominant standard for interchange today. In addition to the exchange format standards, management standards such as eXtensible Access Control Markup Language (XACML) for access control and eXtensible rights Markup Language (XrML) for digital intellectual property rights management are also emerging.

The Information Grid also requires semantic information to make each data resource accessible to any process in the Application Grid without requiring any a priori coupling between the data resources and the Application Grid processes. In practice, this relies on metadata describing the meaning of data and relationships among data elements, as well as the implementation of exchange formats and management standards.

The relational database was one early implementation of metadata technology. Unlike its predecessors, the network and hierarchical databases, in which all relationships between data had to be predetermined, the relational database enabled flexible yet predictable access to a general-purpose information resource.

XML is the next evolution in the world of metadata. The brain of an Information Grid is an XML Metadata Repository. This repository (which may be physically distributed across nodes and disks) keeps track of the information in the grid. It helps organize all of the resources participating in the Information Grid into hierarchical relationships. For example, the invoice records sitting in database-A logically belong to a folder named Customer sitting on file-system-B, a description of that customer is to be found in CRM-application-C, and the latest interaction is recorded in e-mail-server-D, with the connections being deduced automatically from XML tags carried by the data.
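
A minimal sketch of how such connections might be deduced from tags, assuming each system publishes a small XML fragment that carries a customer element with an id attribute. The fragments, element names, and identifiers are invented for illustration; only the four system names come from the example above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical XML fragments, one per participating system.
FRAGMENTS = {
    "database-A": '<invoice><customer id="C42"/><total>310.00</total></invoice>',
    "file-system-B": '<folder name="Customer"><customer id="C42"/></folder>',
    "CRM-application-C": '<profile><customer id="C42"/><name>Acme</name></profile>',
    "e-mail-server-D": '<message><customer id="C7"/><subject>Hello</subject></message>',
}


def build_index(fragments):
    """Map each customer id to the (system, resource type) pairs that mention it."""
    index = defaultdict(list)
    for source, xml_text in fragments.items():
        root = ET.fromstring(xml_text)
        # The shared <customer id="..."> tag is what links otherwise unrelated assets.
        for elem in root.iter("customer"):
            index[elem.get("id")].append((source, root.tag))
    return dict(index)


index = build_index(FRAGMENTS)
for customer_id, resources in index.items():
    print(customer_id, resources)
```

No system needs to know about the others here; the relationship exists only in the index, which is the role the text assigns to the repository.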

The metadata can form different kinds of ontologies. Ontologies specify concepts, and they can be about not only a domain (a defect can be an actual failure or a possible failure), but also tasks (how to compute a possible failure's probability), personalization (different views of a defect, from the legal, support, marketing, or finance perspectives), argumentation (why the defect data was collected, why it was modelled in the way it was, and who agrees with it and who dissents), and so on.

The repository also provides services such as event management (what to do if the customer is deleted), business rules (how to determine if the customer qualifies for a volume discount), versioning (issue a new version of the last invoice reflecting the volume discount), access control (who can see the customer's credit card number), and so on. In any distributed system, names are used to refer to objects such as computers, services, or data. Typical naming services, such as the international X.500 naming scheme or DNS (the Internet's scheme), provide a uniform namespace across the grid. The XML Repository also supports a standard naming service.
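
The sketch below shows, under heavy simplification, how three of these services could look in code: access control as a per-field role list, event management as registered handlers, and a business rule as a plain function. The Repository class, role names, event name, and discount threshold are all assumptions made for illustration and do not describe any particular product.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Repository:
    # Field name -> roles allowed to read it (hypothetical policy).
    acl: Dict[str, Set[str]] = field(default_factory=dict)
    # Event name -> handlers to run when the event fires.
    handlers: Dict[str, List[Callable[[dict], None]]] = field(default_factory=dict)

    def can_read(self, role: str, field_name: str) -> bool:
        # Fields with no ACL entry are readable by every role.
        allowed = self.acl.get(field_name)
        return allowed is None or role in allowed

    def view(self, role: str, record: dict) -> dict:
        """Access control: return only the fields this role may see."""
        return {k: v for k, v in record.items() if self.can_read(role, k)}

    def on(self, event: str, handler: Callable[[dict], None]) -> None:
        self.handlers.setdefault(event, []).append(handler)

    def fire(self, event: str, payload: dict) -> None:
        """Event management: run every handler registered for the event."""
        for handler in self.handlers.get(event, []):
            handler(payload)


def qualifies_for_volume_discount(order_total: float, threshold: float = 10_000.0) -> bool:
    # Business rule with an assumed threshold.
    return order_total >= threshold


repo = Repository(acl={"credit_card": {"finance"}})
customer = {"name": "Acme", "credit_card": "XXXX-1111"}

audit_log: List[str] = []
repo.on("customer_deleted", lambda payload: audit_log.append(payload["name"]))
repo.fire("customer_deleted", customer)

print(repo.view("support", customer))
print(repo.view("finance", customer))
print(audit_log, qualifies_for_volume_discount(12_000.0))
```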

The latest generation of relational databases has now evolved to include the XML data model, and several of them support the XML Schema standard for defining exchange formats. Today's best databases also include built-in XML Metadata Repository functionality, thus supporting event management, business rules, versioning, access control, and rights management.

Semantic Crawlers, Search, and Query
On the Web, search engines deploy crawlers or spiders to deduce metadata about HTML pages and index them so that keyword searches can be performed across Web sites. Search servers provide the same opportunity within a grid. On the Information Grid, semantic crawlers extract metadata from the assets as they are crawled (exploiting markup and also employing various heuristics), and the best ones can induce relationships between items through text mining techniques. While crawling across messages in an e-mail server, a semantic crawler might deduce that the presence of a word such as "complaint" or "refund" in an e-mail message indicates an unhappy customer. Later, when a query is initiated for a customer name through a keyword search interface, the search server can color-code the search results, thereby indicating how "happy" the customer is. This is the business intelligence value of the Information Grid showing through.
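
A toy version of that heuristic might look like the following. The two negative terms come from the example above; the sample messages, customer names, and the thresholds that map counts to colors are invented for illustration, and a real crawler would rely on far richer text mining.

```python
import re
from collections import Counter

# Terms taken from the example in the text; the list is deliberately tiny.
NEGATIVE_TERMS = {"complaint", "refund"}


def crawl(messages):
    """Count, per customer, the messages that contain any negative term."""
    flagged = Counter()
    for customer, body in messages:
        words = set(re.findall(r"[a-z]+", body.lower()))
        if words & NEGATIVE_TERMS:
            flagged[customer] += 1
    return flagged


def color(customer, flagged):
    """Map a flagged-message count to a display color (assumed thresholds)."""
    n = flagged.get(customer, 0)
    if n == 0:
        return "green"
    return "amber" if n == 1 else "red"


# Hypothetical mailbox contents as (customer, message body) pairs.
MESSAGES = [
    ("Acme", "I want a refund for order 17."),
    ("Acme", "Filing a formal complaint today."),
    ("Globex", "Thanks for the quick delivery!"),
    ("Initech", "Refund received, thank you."),
]

flagged = crawl(MESSAGES)
for name in ("Acme", "Globex", "Initech"):
    print(name, color(name, flagged))
```

The last message shows the limits of keyword matching: a satisfied customer who mentions a refund is still flagged, which is why the better crawlers combine markup, heuristics, and text mining rather than word lists alone.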

Currently, search engines are poor examples of semantic processing: typically, two different users with the same query will get the same result, even if one was searching for an insect (cricket) and the other for a game (cricket). Humans can generally understand which hit is about what, but automations built around search hit lists fail because of this high semantic ambiguity. Ideally, the search query will be qualified by the user's context, the data described by the creator's context, and the two matched to give unambiguous results.
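
The idea of matching the user's context against the creator's context can be sketched as a filter applied on top of a plain keyword search. The Document type, the context labels ("entomology", "sport"), and the sample texts are hypothetical; real systems would derive context from profiles and metadata rather than a single hand-written label.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Document:
    title: str
    text: str
    context: str  # supplied by the creator (hypothetical labels)


DOCS = [
    Document("Field crickets", "The cricket is a chirping insect.", "entomology"),
    Document("Test match report", "England won the cricket match.", "sport"),
]


def search(keyword: str, docs: List[Document], user_context: Optional[str] = None) -> List[str]:
    """Keyword search, optionally narrowed to documents matching the user's context."""
    hits = [d for d in docs if keyword.lower() in d.text.lower()]
    if user_context is not None:
        # Keep only hits whose creator-supplied context matches the user's.
        hits = [d for d in hits if d.context == user_context]
    return [d.title for d in hits]


print(search("cricket", DOCS))            # ambiguous: both documents match
print(search("cricket", DOCS, "sport"))   # qualified by the user's context
```

Without a context, both users receive the same two hits; with one, the result list becomes unambiguous enough for an automated consumer to act on.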

Within an intranet, the crawlers also need to be able to respect and enforce security, information lifecycle management (ILM), and privacy policies. The best search servers today combine security, the semantic relevance of returned results, and the ability to intelligently present as much contextual information as exists.


Sandeepan Banerjee is director of product management in Oracle's Server Technologies division. He is responsible for SQL, XML, and Text Search infrastructure, and especially their convergence into one platform for all data. Sandeepan has worked with database technologies for over 15 years, and the majority of them have been with Oracle.


Most Recent Comments
Sandeepan Banerjee 06/28/05 12:46:11 PM EDT

The Information Grid - XML and Databases Moving Toward Convergence. Two somewhat contrary-sounding drivers fuel the emerging renaissance in enterprise data management - virtualization and convergence. Virtualization is a framework for dividing up the resources of an organization into multiple execution environments through the application of one or more technologies such as hardware clustering, software partitioning, application modularization, emulation, and so on. Convergence, on the other hand, tries to bring diverse information assets - databases, mail stores, documents - under unified management. The coming Information Grid unites these opposing drivers.