A document created with a word processor is a collection of character sequences and embedded objects interspersed with formatting information, which makes the content difficult to access. Many such documents nevertheless follow an underlying structure, whether the document is a resume created by an individual or project documentation created by a team.
IBM Content Harvester allows you to harvest such unstructured, formatted documents by:
* Extracting the content
* Cleansing sensitive information
* Tagging based on user-defined names
* Querying for selected tags, and
* Publishing information of interest in any open format.
Using rules, you specify the regions of content that are of interest in terms of textual markers, the tag to assign to the extracted content, and the terms to cleanse from it. The content is then cleansed and tagged. The resulting output is an XML file that can be queried with XQuery for any assigned tag and published in an open format such as HTML using XSL transforms.
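The flow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not IBM Content Harvester's actual rule syntax or API: the `Rule` class, the `harvest` function, the marker strings, and the sample resume are all hypothetical. The standard library's `ElementTree` path lookup stands in for an XQuery query, and simple string formatting stands in for an XSL transform.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from html import escape


@dataclass
class Rule:
    """Hypothetical rule; not the product's real rule format."""
    tag: str            # tag name to assign to the extracted region
    start_marker: str   # textual marker that opens the region
    end_marker: str     # textual marker that closes the region
    cleanse_terms: list = field(default_factory=list)  # terms to remove


def harvest(text: str, rules: list) -> ET.Element:
    """Extract, cleanse, and tag regions of `text`; return an XML tree."""
    root = ET.Element("document")
    for rule in rules:
        # Match the shortest span between the two markers, across lines.
        pattern = (re.escape(rule.start_marker) + r"(.*?)"
                   + re.escape(rule.end_marker))
        for match in re.finditer(pattern, text, flags=re.DOTALL):
            content = match.group(1)
            for term in rule.cleanse_terms:
                content = content.replace(term, "")
            # Collapse whitespace left behind by the removed terms.
            content = " ".join(content.split())
            ET.SubElement(root, rule.tag).text = content
    return root


# Made-up sample document with a recognizable underlying structure.
resume = (
    "Name: Jane Doe\n"
    "Phone: 555-0100\n"
    "Skills: XML, XSLT, Java\n"
    "Experience: Five years of document engineering\n"
)

rules = [
    Rule(tag="skills", start_marker="Skills:", end_marker="\n"),
    Rule(tag="contact", start_marker="Name:", end_marker="Skills:",
         cleanse_terms=["Phone:", "555-0100"]),
]

tree = harvest(resume, rules)
xml_output = ET.tostring(tree, encoding="unicode")

# Query one assigned tag (stand-in for an XQuery expression).
skills = [element.text for element in tree.findall("skills")]

# Publish the query result as HTML (stand-in for an XSL transform).
html_page = "<ul>" + "".join(f"<li>{escape(s)}</li>" for s in skills) + "</ul>"

print(xml_output)
print(html_page)
```

In this sketch the phone number is treated as sensitive, so it never reaches the XML output, while the `skills` region is tagged and can be queried and published on its own.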