Mastering Norconex Importer

Norconex Importer is an open-source Java library and command-line application designed to parse, extract, and manipulate data from various file formats (HTML, PDF, Word, etc.) into plain text. Mastering it means understanding its architecture, lifecycle handlers, and how it forms the core processing engine for Norconex Web and Filesystem Crawlers.

You can run it as a standalone command-line tool or embed it in a custom Java ETL pipeline; in both cases it can be configured through XML.

⚙️ Core Architecture & Execution Flow

The Importer processes files using a strict lifecycle split into two phases: Pre-parsing (acting on the raw, native file format) and Post-parsing (acting on the extracted plain text).
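The two-phase lifecycle described above can be sketched in a few lines of Java. This is only an illustration of the ordering (pre-parse handlers, then the parser, then post-parse handlers). The class `TwoPhasePipelineSketch`, its methods, and the toy regex-based "parser" are hypothetical and are not part of the Norconex Importer API, where parsing is delegated to Apache Tika.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Illustrative only: these types are hypothetical and are not the Norconex Importer API.
final class TwoPhasePipelineSketch {

    private final List<UnaryOperator<String>> preParse = new ArrayList<>();
    private final List<UnaryOperator<String>> postParse = new ArrayList<>();
    private final UnaryOperator<String> parser;

    TwoPhasePipelineSketch(UnaryOperator<String> parser) {
        this.parser = parser;
    }

    // Registers a handler that sees the raw, native content.
    TwoPhasePipelineSketch pre(UnaryOperator<String> handler) {
        preParse.add(handler);
        return this;
    }

    // Registers a handler that sees the extracted plain text.
    TwoPhasePipelineSketch post(UnaryOperator<String> handler) {
        postParse.add(handler);
        return this;
    }

    String run(String rawContent) {
        String content = rawContent;
        for (UnaryOperator<String> handler : preParse) {
            content = handler.apply(content);   // phase 1: native format
        }
        content = parser.apply(content);        // parsing step
        for (UnaryOperator<String> handler : postParse) {
            content = handler.apply(content);   // phase 2: plain text
        }
        return content;
    }

    static TwoPhasePipelineSketch demo() {
        // Toy "parser": drops markup and normalizes whitespace (a real parser does far more).
        UnaryOperator<String> toyParser =
                html -> html.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
        return new TwoPhasePipelineSketch(toyParser)
                .pre(html -> html.replaceAll("(?s)<nav>.*?</nav>", ""))
                .post(text -> text.replaceAll("\\bworld\\b", "reader"));
    }

    public static void main(String[] args) {
        String raw = "<html><nav>Home | About</nav><p>Hello   world</p></html>";
        System.out.println(demo().run(raw)); // prints "Hello reader"
    }
}
```

The point of the split is that the navigation block can only be removed while the markup still exists, whereas the word replacement is simplest once only plain text remains.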

[Raw Document File]
         │
         ▼
┌────────────────────────────────────────┐
│ 1. Pre-Parse Handlers                  │ ◄── Clean native format (e.g., strip HTML tags)
└────────────────────────────────────────┘
         │
         ▼
┌────────────────────────────────────────┐
│ 2. Document Parser (Apache Tika Core)  │ ◄── Extract plain text & initial metadata
└────────────────────────────────────────┘
         │
         ▼
┌────────────────────────────────────────┐
│ 3. Post-Parse Handlers                 │ ◄── Manipulate text, regex replace, add tags
└────────────────────────────────────────┘
         │
         ▼
[Clean Text & Metadata Output]

🛠️ The Four Essential Handlers

To master the tool, you need to know the four IImporterHandler types. Each can be used through the built-in implementations configured in XML, or implemented as a custom Java class.

IDocumentFilter: Accepts or rejects a document based on its properties or content. For instance, you can drop files matching a certain regular expression.

IDocumentTagger: Modifies, injects, or deletes metadata fields. Utilities like KeepOnlyTagger and DeleteTagger ensure your output contains only relevant data fields.

IDocumentTransformer: Directly alters the text content. This is used for search-and-replace actions or stripping repetitive strings.

IDocumentSplitter: Breaks a single composite file down into multiple distinct documents (e.g., splitting a multi-page document into pages or a large spreadsheet into rows).

🚀 Key Master-Level Features

Advanced XML Flow Control: The configuration schema allows structural logic via tags such as <if>, <ifNot>, <condition>, <then>, and <else>. You can execute specific taggers or transformers only if a document meets specific criteria.
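The conditional execution described above amounts to wrapping handlers in an if/then/else. The Java sketch below shows that logic in isolation, on a plain metadata map. `ConditionalFlowSketch`, `ifThenElse`, and the `docKind` field are hypothetical names chosen for illustration. In the Importer itself the same decision is expressed declaratively in the XML configuration, not through this code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Illustrative only: a hand-rolled if/then/else around "taggers"; not the Norconex API.
final class ConditionalFlowSketch {

    // Builds a handler that runs thenBranch when the condition holds, otherwise elseBranch.
    static Consumer<Map<String, String>> ifThenElse(
            Predicate<Map<String, String>> condition,
            Consumer<Map<String, String>> thenBranch,
            Consumer<Map<String, String>> elseBranch) {
        return meta -> {
            if (condition.test(meta)) {
                thenBranch.accept(meta);
            } else {
                elseBranch.accept(meta);
            }
        };
    }

    // Tags a document's metadata differently depending on its content type.
    static Map<String, String> tag(String contentType) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("document.contentType", contentType);

        Consumer<Map<String, String>> flow = ifThenElse(
                m -> "application/pdf".equals(m.get("document.contentType")),
                m -> m.put("docKind", "pdf"),     // hypothetical field
                m -> m.put("docKind", "other"));
        flow.accept(meta);
        return meta;
    }

    public static void main(String[] args) {
        System.out.println(tag("application/pdf").get("docKind")); // prints "pdf"
        System.out.println(tag("text/html").get("docKind"));       // prints "other"
    }
}
```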

DOM-Based Transformations: When dealing with HTML or XML files, the DOMDeleteTransformer lets you use CSS-like selectors to strip navigation bars, headers, or footers before parsing, ensuring clean text extraction.
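To make the idea of deleting page furniture before parsing concrete, here is a rough sketch that uses only the JDK. It is not the DOMDeleteTransformer: it swaps CSS-like selectors for an XPath expression, and it assumes well-formed XHTML input because the JDK XML parser rejects sloppy HTML. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: XPath-based element removal with the JDK, not the Norconex handler.
final class DomDeleteSketch {

    // Removes every node matched by the XPath expression and returns the remaining markup.
    static String deleteMatching(String xhtml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            for (int i = 0; i < hits.getLength(); i++) {
                Node node = hits.item(i);
                node.getParentNode().removeChild(node);
            }

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.METHOD, "xml");
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not process markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><nav>Menu</nav><p>Article text</p>"
                + "<footer>Copyright</footer></body></html>";
        // Drops the navigation and footer elements, keeping the article paragraph.
        System.out.println(deleteMatching(page, "//nav | //footer"));
    }
}
```

Doing this before text extraction matters because once the markup is gone, menu and footer text can no longer be told apart from the article body.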

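Embedded OCR Capabilities: Through its Apache Tika integration, the Importer can run optical character recognition (OCR) on images and scanned PDFs to extract their text. This is not automatic out of the box: it relies on a Tesseract OCR installation that the parser configuration points to.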

Extensibility: You can drop custom compiled .class files into the Importer’s /classes directory, or JAR files into its /lib directory. They are picked up from the classpath when the application starts, so no rebuild of the Importer is required.

📂 Configuration Example

You can manage the engine using an XML file. The snippet below outlines how to use both pre-parse and post-parse steps to isolate and tag incoming files:

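[XML configuration snippet: most of the markup did not survive on this page. The recoverable parts are the numeric value 1073741824, a regular-expression replacement, and a comma-separated list of fields.]

    1073741824

    <fromValue>https?://(.*?)(/.*|$)</fromValue>
    <toValue>$1</toValue>

    title,description,body,SourceDomain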
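As a sanity check on the two ideas this configuration combines, the sketch below replays them in plain Java. It assumes the pattern is `https?://(.*?)(/.*|$)` with replacement `$1`, and that the field list `title,description,body,SourceDomain` names the fields to keep. Treating `document.reference` as the source of the URL, and the `DomainTagSketch` class itself, are assumptions made for illustration. This is not how the Importer is invoked.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: replays the regex replacement and the keep-only step on a plain map.
final class DomainTagSketch {

    // Reduces a URL to its host part (including any port) with the assumed pattern.
    static String sourceDomain(String reference) {
        return reference.replaceAll("https?://(.*?)(/.*|$)", "$1");
    }

    // Returns a copy of the metadata containing only the listed fields.
    static Map<String, String> keepOnly(Map<String, String> meta, List<String> fields) {
        Map<String, String> kept = new LinkedHashMap<>(meta);
        kept.keySet().retainAll(fields);
        return kept;
    }

    static Map<String, String> process(String reference, String title) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("document.reference", reference); // assumed source field
        meta.put("title", title);
        meta.put("SourceDomain", sourceDomain(reference));
        return keepOnly(meta, Arrays.asList("title", "description", "body", "SourceDomain"));
    }

    public static void main(String[] args) {
        System.out.println(sourceDomain("https://www.example.com/docs/page.html")); // prints "www.example.com"
        System.out.println(sourceDomain("http://example.org"));                     // prints "example.org"
        System.out.println(process("https://www.example.com/docs/page.html", "Docs").keySet());
        // prints "[title, SourceDomain]"
    }
}
```

The lazy group `(.*?)` stops at the first `/` after the scheme, or at the end of the string when the URL has no path, which is why both sample URLs reduce to their host.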
