CAM features
Extraction - extract knowledge from content
The information extraction process is broken down into two steps: Split (optional) and Extract. "Split" breaks the input down into multiple parts: a corpus into a set of documents, a single document into separate sections, etc. "Extract" uses information extraction tools to process the input and extract entities.
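The two extraction steps can be illustrated with a minimal Python sketch. It is not CA Manager's implementation: the function names `split` and `extract`, the blank-line splitting rule, and the capitalized-word heuristic are all assumptions chosen for illustration, standing in for real information extraction tools.

```python
import re


def split(corpus: str) -> list[str]:
    """Hypothetical Split step: cut the input into sections on blank lines."""
    return [part.strip() for part in re.split(r"\n\s*\n", corpus) if part.strip()]


def extract(section: str) -> list[dict]:
    """Hypothetical Extract step: treat runs of capitalized words as entities.

    A real deployment would call an information extraction tool here; the
    regular expression is only a placeholder.
    """
    pattern = r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b"
    return [
        {"text": m.group(0), "start": m.start(), "end": m.end()}
        for m in re.finditer(pattern, section)
    ]


def run_extraction(corpus: str) -> list[list[dict]]:
    """Apply Split, then Extract on each resulting section."""
    return [extract(section) for section in split(corpus)]
```

Because Split is optional, `extract` can also be called directly on an unsplit input.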
Consolidation - reconcile extracted knowledge with content metadata
The information consolidation process is broken down into three steps: Merge, Control, and Infer (optional).
"Merge" sends multi-criteria queries to the semantic repository to retrieve the URI of an entity or annotation, and then eliminates duplicates within the Common Analysis Structure (CAS). The CAS defines a high-level annotation schema that needs to be specified for each new CA Manager application.
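As a rough sketch of the Merge step, the Python code below looks up each annotation by two criteria (label and type) and drops duplicates. The in-memory dictionary standing in for the semantic repository, the list standing in for the CAS, and the names `find_uri` and `merge` are assumptions made for this example, not CA Manager's actual interfaces.

```python
def find_uri(repository: dict, label: str, entity_type: str):
    """Return the URI of the first entry matching both criteria, else None."""
    for uri, record in repository.items():
        if record["label"].lower() == label.lower() and record["type"] == entity_type:
            return uri
    return None


def merge(cas: list[dict], repository: dict) -> list[dict]:
    """Attach repository URIs to annotations and eliminate duplicates."""
    merged = []
    seen = set()
    for annotation in cas:
        uri = find_uri(repository, annotation["label"], annotation["type"])
        # Known entities are deduplicated by URI, unknown ones by (label, type).
        key = uri if uri is not None else (annotation["label"].lower(), annotation["type"])
        if key in seen:
            continue
        seen.add(key)
        merged.append({**annotation, "uri": uri})
    return merged
```

Annotations with no match keep a `uri` of `None`, which marks them as candidates for new instances in this sketch.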
"Control" verifies that the extracted entity or annotation is valid against the ontology model. The verification covers multiple parameters, such as domains and ranges, cardinalities, date formats and temporal information, and number formats and metric systems. For instance, if the extracted entity was merged with an existing instance in the preceding step, the algorithms examine the properties of the extracted entity: are these property types authorized for the entity's class? Do these properties already exist in the merged instance? Do they have the same values? If not, which value is the right one, especially when dealing with thesaurus values such as geographical locations, or with time values such as dates? The algorithms try to resolve these issues automatically; when that is not possible, they mark the new entity or annotation with the "invalid" metadata. All invalid statements are stored in the semantic repository on the server so that they can be retrieved and manually validated, if required by the target application.
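The kind of checks performed by Control can be sketched as follows. The toy `ONTOLOGY` dictionary, its `max` and `datatype` keys, and the `control` function are hypothetical; the sketch covers only three of the checks named above (authorized properties, cardinality, date format) and assumes ISO 8601 dates.

```python
from datetime import date

# Hypothetical ontology model: allowed properties per class, with a maximum
# cardinality and an expected datatype for each property.
ONTOLOGY = {
    "Person": {
        "name": {"max": 1, "datatype": "string"},
        "birthDate": {"max": 1, "datatype": "date"},
    }
}


def control(entity: dict, ontology: dict = ONTOLOGY) -> dict:
    """Validate an entity and flag it with 'invalid' metadata on failure."""
    errors = []
    allowed = ontology.get(entity["type"], {})
    for prop, values in entity["properties"].items():
        rule = allowed.get(prop)
        if rule is None:
            errors.append(f"{prop}: not authorized for class {entity['type']}")
            continue
        if len(values) > rule["max"]:
            errors.append(f"{prop}: cardinality {len(values)} exceeds {rule['max']}")
        if rule["datatype"] == "date":
            for value in values:
                try:
                    date.fromisoformat(value)
                except ValueError:
                    errors.append(f"{prop}: '{value}' is not an ISO 8601 date")
    return {**entity, "invalid": bool(errors), "errors": errors}
```

In this sketch, entities flagged as invalid are returned rather than discarded, mirroring the idea that invalid statements are kept for later manual validation.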
"Infer" uses a reasoning engine to apply inference rules in order to discover new entities or new relations between them, as well as to control the overall coherence and quality of the semantic repository.
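To illustrate what applying an inference rule can look like, the sketch below runs a single transitivity rule to a fixed point over a set of triples. It is a toy forward-chaining loop, not the reasoning engine used by CA Manager; the function name `infer_transitive` and the `locatedIn` predicate in the usage example are assumptions.

```python
def infer_transitive(triples: set, predicate: str) -> set:
    """Return the new triples implied by treating `predicate` as transitive.

    Rule applied repeatedly until nothing changes:
    (a, p, b) and (b, p, c)  =>  (a, p, c)
    """
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, p1, b) in list(facts):
            if p1 != predicate:
                continue
            for (c, p2, d) in list(facts):
                if p2 == predicate and b == c and (a, predicate, d) not in facts:
                    facts.add((a, predicate, d))
                    changed = True
    # Only the inferred relations are returned, not the original ones.
    return facts - set(triples)
```

For example, given `("Montmartre", "locatedIn", "Paris")` and `("Paris", "locatedIn", "France")`, the function returns the single new relation `("Montmartre", "locatedIn", "France")`.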
Storage - export or store the reconciled knowledge
The information storage process is broken down into two steps: Serialize and Store (optional). "Serialize" parses the enriched and consolidated annotation schema in order to generate an output in the requested application format (XML, RDF, OWL, NewsML, CityGML, etc.). "Store" is optional, as it depends on whether the application processes the serialized format directly or stores the results in a knowledge store (such as ITM) and/or in a dedicated annotation server (such as Sesame).
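A Serialize step that dispatches on the requested format might be sketched as below. The annotation dictionaries, the XML element names, and the functions `to_xml`, `to_ntriples`, and `serialize` are assumptions for illustration; only two simple output formats are shown (a generic XML layout and RDF in N-Triples syntax), and the Store step is left out.

```python
import xml.etree.ElementTree as ET

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def to_xml(annotations: list[dict]) -> str:
    """Serialize annotations to a simple, made-up XML layout."""
    root = ET.Element("annotations")
    for ann in annotations:
        element = ET.SubElement(root, "entity", {"uri": ann["uri"], "type": ann["type"]})
        element.text = ann["label"]
    return ET.tostring(root, encoding="unicode")


def to_ntriples(annotations: list[dict]) -> str:
    """Serialize each annotation label as one RDF triple in N-Triples syntax."""
    lines = []
    for ann in annotations:
        # Escape backslashes and double quotes inside the literal.
        label = ann["label"].replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{ann["uri"]}> <{RDFS_LABEL}> "{label}" .')
    return "\n".join(lines)


# Hypothetical registry mapping a requested format name to its serializer.
SERIALIZERS = {"xml": to_xml, "ntriples": to_ntriples}


def serialize(annotations: list[dict], fmt: str) -> str:
    """Generate output in the requested format."""
    return SERIALIZERS[fmt](annotations)
```

An application that consumes the serialized string directly would stop here; one that uses a knowledge store or annotation server would pass the result on to it.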
