Working with a Document

Kodexa is a document processing platform that lets developers work with documents in a structured way. This guide walks through the basics of creating, saving, loading, and working with Kodexa Documents.

Creating a Document

The first step in working with Kodexa is typically creating a new document. Let's assume we have a document in PDF format:

from kodexa import Document

my_document = Document.from_file('example.pdf')

This creates an empty document that references the PDF file. The PDF has not been parsed yet; the document only carries the metadata Kodexa needs to locate the original file.

Saving a Document

You can save a Kodexa Document to a file or a store. By default, documents are saved in the Kodexa format, which is a SQLite database:

my_document.to_kddb('my-document.kddb')
my_document.close()

By convention, we use the .kddb extension for Kodexa Document Database files.
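
Because a .kddb file is a SQLite database, general SQLite tooling can open it. The sketch below uses Python's built-in sqlite3 module to list the tables in such a file. The internal Kodexa schema is not described in this guide, so the example builds a stand-in file with a hypothetical table (example_table) rather than assuming any real table names.

```python
import os
import sqlite3
import tempfile


def list_tables(path):
    """Return the names of all tables in the SQLite file at `path`."""
    connection = sqlite3.connect(path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


# Build a stand-in database; a real .kddb file would already exist on disk.
workdir = tempfile.mkdtemp()
stand_in_path = os.path.join(workdir, "stand-in.kddb")
connection = sqlite3.connect(stand_in_path)
# Hypothetical table, used only so the listing has something to show.
connection.execute("CREATE TABLE example_table (id INTEGER PRIMARY KEY)")
connection.commit()
connection.close()

tables = list_tables(stand_in_path)
print(tables)
```

Pointing list_tables at a file written by to_kddb should work the same way, although the tables it reports depend on the Kodexa version.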

Loading a Document

To load a previously saved Kodexa Document:

another_document = Document.from_kddb('my-document.kddb')
another_document.close()

Detached Documents

Sometimes you may want to make changes to a document without affecting the original file. For this, you can load the document in detached mode:

detached_document = Document.from_kddb('my-document.kddb', detached=True)
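
One way to picture detached mode is as working on a private copy of the file. The sketch below illustrates that general pattern with the standard library only. It is an assumption about the idea, not a description of how Kodexa implements detached=True, and the file names, the notes table, and the run helper are all hypothetical.

```python
import os
import shutil
import sqlite3
import tempfile


def run(path, sql, params=()):
    """Execute one statement against the SQLite file and return any rows."""
    connection = sqlite3.connect(path)
    try:
        rows = connection.execute(sql, params).fetchall()
        connection.commit()
        return rows
    finally:
        connection.close()


workdir = tempfile.mkdtemp()
original_path = os.path.join(workdir, "original.kddb")
detached_path = os.path.join(workdir, "detached-copy.kddb")

# Create a stand-in original file (hypothetical schema).
run(original_path, "CREATE TABLE notes (body TEXT)")
run(original_path, "INSERT INTO notes VALUES (?)", ("original",))

# "Detach" by copying the file, then change only the copy.
shutil.copyfile(original_path, detached_path)
run(detached_path, "UPDATE notes SET body = ?", ("edited",))

original_rows = run(original_path, "SELECT body FROM notes")
detached_rows = run(detached_path, "SELECT body FROM notes")
print(original_rows, detached_rows)
```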

Anatomy of a Kodexa Document

The Kodexa Document Model represents both structured and unstructured documents. At its core is a Document object that holds metadata and a hierarchical tree of ContentNodes, each of which can have features attached.

Let's explore the key components of the model.

Core Components

Document Structure

graph TD
    A[Document] --> B[Metadata]
    A --> C[Content Node Tree]
    A --> D[Source Metadata]
    A --> E[Labels]
    C --> F[Root Content Node]
    F --> G[Child Node 1]
    F --> H[Child Node 2]
    G --> I[Grandchild Node]

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style C fill:#bbf,stroke:#333,stroke-width:2px
    style F fill:#bfb,stroke:#333,stroke-width:2px

A Kodexa Document consists of:

Document Metadata: Flexible dictionary-based metadata about the document
Content Node Tree: Hierarchical structure of content nodes
Source Metadata: Information about the document's origin
Labels: Document-level labels
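
The four parts listed above can be pictured as fields on a single object. The following is a minimal stand-in built from dataclasses, not the Kodexa classes themselves; SketchDocument, SketchNode, and the original_filename key are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchNode:
    """Stand-in for a content node: a type, optional content, and children."""
    node_type: str
    content: Optional[str] = None
    children: List["SketchNode"] = field(default_factory=list)


@dataclass
class SketchDocument:
    """Stand-in for a document holding the four parts described above."""
    metadata: Dict[str, Any] = field(default_factory=dict)   # document metadata
    source: Dict[str, Any] = field(default_factory=dict)     # source metadata
    labels: List[str] = field(default_factory=list)          # document-level labels
    content_node: Optional[SketchNode] = None                # root of the node tree


sketch = SketchDocument(
    metadata={"title": "Example"},
    source={"original_filename": "example.pdf"},  # hypothetical key
    labels=["invoice"],
    content_node=SketchNode("root", children=[SketchNode("page", "Page content")]),
)

first_page = sketch.content_node.children[0]
print(first_page.node_type, first_page.content)
```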

Content Nodes

ContentNodes are the building blocks of the document structure. Each ContentNode represents a logical section of content and has the following properties:

graph LR
    A[ContentNode] --> B[Node Type]
    A --> C[Content]
    A --> D[Features]
    A --> E[Children]
    A --> F[UUID]
    A --> G[Index]

    style A fill:#f9f,stroke:#333,stroke-width:2px

Key attributes:

node_type: Identifies the type of node (e.g., 'page', 'line', 'cell')
content: The actual content of the node
features: List of attached features (metadata, tags, etc.)
children: Child nodes in the hierarchy
uuid: Unique identifier
index: Position among siblings
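
To make the uuid and index attributes concrete, here is a small stand-in node whose add_child method assigns each child its position among its siblings. ToyNode is a hypothetical class for illustration; the real ContentNode may assign identifiers and indexes differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from uuid import uuid4


@dataclass
class ToyNode:
    node_type: str
    content: Optional[str] = None
    features: list = field(default_factory=list)
    children: List["ToyNode"] = field(default_factory=list)
    # Each node gets its own random identifier when it is created.
    uuid: str = field(default_factory=lambda: str(uuid4()))
    index: int = 0

    def add_child(self, child: "ToyNode") -> "ToyNode":
        """Append `child` and record its position among the existing siblings."""
        child.index = len(self.children)
        self.children.append(child)
        return child


line = ToyNode("line")
first = line.add_child(ToyNode("word", "Hello"))
second = line.add_child(ToyNode("word", "world"))

print([(child.index, child.content) for child in line.children])
```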

Features

Features are flexible metadata containers attached to ContentNodes. They come in different types:

graph TD
    A[Feature Types] --> B[Tags]
    A --> C[Spatial]
    A --> D[Custom Features]

    B --> E[Named Tags]
    B --> F[Tag Values]

    C --> G[Bounding Boxes]
    C --> H[Coordinates]

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style C fill:#bfb,stroke:#333,stroke-width:2px

Each feature has:

feature_type: Category of the feature (e.g., 'tag', 'spatial')
name: Identifier for the feature
value: The feature's data
single: Boolean indicating whether the value is a single value or a collection
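
The four fields above can be sketched as a small record plus a lookup keyed on feature type and name. ToyFeature and this get_feature_value function are illustrative stand-ins, and the feature names used here ("bbox", "keyword") are assumptions, not names confirmed by this guide.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class ToyFeature:
    feature_type: str   # category, e.g. 'tag' or 'spatial'
    name: str           # identifier within that category
    value: Any          # the feature's data
    single: bool = True # False when the value is a collection of values


def get_feature_value(features: List[ToyFeature], feature_type: str, name: str):
    """Return the value of the first feature matching type and name, else None."""
    for feature in features:
        if feature.feature_type == feature_type and feature.name == name:
            return feature.value
    return None


features = [
    ToyFeature("tag", "paragraph", "body"),
    ToyFeature("spatial", "bbox", [10, 20, 100, 200]),
    ToyFeature("tag", "keyword", ["total", "amount"], single=False),
]

print(get_feature_value(features, "tag", "paragraph"))
print(get_feature_value(features, "spatial", "bbox"))
```

Keying on the pair (feature_type, name) is what lets a tag and a spatial feature share a name without colliding.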

Working with Documents

Creating Documents

# Create a new document
doc = Document()

# Create from text
doc = Document.from_text("Some content")

# Create from file
doc = Document.from_file("path/to/file")

Adding Content

# Create a root node
root = doc.create_node(node_type="root")
doc.content_node = root

# Add child nodes
page = doc.create_node(node_type="page", content="Page content")
root.add_child(page)

Working with Features

# Add a tag feature (`node` is any existing ContentNode, such as `page` above)
node.add_feature("tag", "paragraph", "body")

# Add spatial information
node.set_bbox([10, 20, 100, 200])

# Get feature value
value = node.get_feature_value("tag", "paragraph")

Node Navigation and Selection

The document model provides powerful ways to navigate and select nodes:

Direct Navigation:
- get_children(): Get immediate child nodes
- get_parent(): Get parent node
- next_node(): Get next sibling
- previous_node(): Get previous sibling
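
The sibling and parent relationships behind these methods can be sketched with a toy tree. NavNode is a hypothetical stand-in that reuses the method names from the list above; the real methods may accept extra arguments or behave differently at the edges of the tree.

```python
class NavNode:
    """Toy node that tracks its parent so siblings can be found."""

    def __init__(self, node_type, content=None):
        self.node_type = node_type
        self.content = content
        self.parent = None
        self.children = []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def get_children(self):
        return list(self.children)

    def get_parent(self):
        return self.parent

    def _sibling(self, offset):
        # A root node has no siblings.
        if self.parent is None:
            return None
        siblings = self.parent.children
        position = siblings.index(self) + offset
        if 0 <= position < len(siblings):
            return siblings[position]
        return None

    def next_node(self):
        return self._sibling(1)

    def previous_node(self):
        return self._sibling(-1)


page = NavNode("page")
first_line = page.add_child(NavNode("line", "First"))
second_line = page.add_child(NavNode("line", "Second"))

print(first_line.next_node().content)
print(second_line.previous_node().content)
```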

Selector-based Navigation:

# Select all nodes of type 'page'
pages = document.select("//page")

# Select nodes with specific tags
tagged = document.select("//*[hasTag('paragraph')]")

Best Practices

Node Types: Use consistent node types throughout your document to make selection and processing easier
Features:
- Use features to add metadata rather than modifying node content
- Keep feature names consistent across your application
- Use appropriate feature types for different kinds of metadata
Content Structure:
- Maintain a logical hierarchy that reflects the document's structure
- Use indexes appropriately to maintain node order
- Consider using virtual nodes for sparse content
Performance:
- Use selectors efficiently
- Batch operations when possible
- Consider using KDDB format for large documents

Error Handling

Processing errors can be recorded on a document through the ContentException class:

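# Assumption: ContentException is importable from the kodexa package, like Document
from kodexa import ContentException

# Add an exception to the document
doc.add_exception(ContentException(
    exception_type="validation",
    message="Invalid content structure",
    severity="ERROR"
))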

Metadata

This is a dictionary containing metadata about the document, such as its source, title, and author:

print(my_document.metadata)

SourceMetadata

This contains metadata about where the document came from. Connectors use it to give you access to the original source document:

print(my_document.source)

Working with Document Content

Kodexa uses a powerful selector syntax to find and manipulate content within documents.

Selectors work similarly to CSS selectors or XPath, allowing you to build queries that can be executed on a document instance.

Basic Selector Example

To find all content nodes whose content matches "Name":

nodes = document.select('//*[contentRegex("Name")]')

This returns an iterator of the matching content nodes.

Selector Syntax

The selector syntax is composed of two parts:

Axis & Node Type: Defines how to navigate the tree structure.
Predicate: Further filters the selected nodes based on conditions.

Axis Examples

//: Current node and all of its descendants
/: Root node
.: Current node (or root if from the document)
./line/.: All nodes of type line under the current node
parent::line: Any node in the parent structure of this node that is of node type line
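
The difference between the // axis and a child step such as ./line can be shown on a toy tree of dictionaries. This sketch assumes XPath-like semantics, where // visits the current node and everything below it and ./line visits only immediate children. The helper names are hypothetical and this is not the Kodexa selector engine.

```python
def node(node_type, content=None, children=()):
    """Build a toy node as a plain dictionary."""
    return {"type": node_type, "content": content, "children": list(children)}


def descendants_or_self(start):
    """Mimic `//`: yield the start node, then every node below it, depth first."""
    yield start
    for child in start["children"]:
        yield from descendants_or_self(child)


def children_of_type(start, node_type):
    """Mimic `./<type>`: immediate children with the given node type only."""
    return [child for child in start["children"] if child["type"] == node_type]


root = node("page", children=[
    node("line", "Name: Ada", [node("word", "Name:"), node("word", "Ada")]),
    node("line", "Total: 42"),
])

all_types = [n["type"] for n in descendants_or_self(root)]
direct_lines = children_of_type(root, "line")

print(all_types)
print([line["content"] for line in direct_lines])
```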

Predicate Functions

Predicates can use various functions, such as:

contentRegex: Matches content against a regular expression
typeRegex: Matches node type name against a regular expression
hasTag: Checks if a node has a specific tag
hasFeature: Checks if a node has a specific feature
content: Returns the content of the node
uuid: Returns the UUID of the node
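
Conceptually, a predicate is a test applied to each candidate node. The sketch below writes three of the functions above as plain Python functions over toy dictionaries. It uses re.search for matching, which is an assumption: this guide does not state whether Kodexa anchors the pattern. The snake_case helper names are hypothetical.

```python
import re


def content_regex(node, pattern):
    """True when the node has content and the pattern is found in it."""
    content = node["content"]
    return content is not None and re.search(pattern, content) is not None


def type_regex(node, pattern):
    """True when the pattern is found in the node type name."""
    return re.search(pattern, node["type"]) is not None


def has_tag(node, tag_name):
    """True when the node carries a tag with the given name."""
    return tag_name in node["tags"]


nodes = [
    {"type": "line", "content": "Name", "tags": {"label"}},
    {"type": "line", "content": "Ada Lovelace", "tags": {"value"}},
    {"type": "cell", "content": None, "tags": set()},
]

name_nodes = [n["content"] for n in nodes if content_regex(n, "Name")]
line_nodes = [n["content"] for n in nodes if type_regex(n, "^li")]
value_nodes = [n["content"] for n in nodes if has_tag(n, "value")]

print(name_nodes, line_nodes, value_nodes)
```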

Operators

Operators can be used to combine functions:

|: Union of the results of both sides
=: Tests whether both sides are equal
and: Boolean AND operation
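
The three operators can be illustrated with plain Python over a toy list of nodes. This is a sketch of the logic only, with hypothetical helper names; it does not parse selector strings and makes no claim about the order in which Kodexa returns a union.

```python
nodes = [
    {"id": 1, "type": "line", "tags": {"paragraph"}},
    {"id": 2, "type": "line", "tags": set()},
    {"id": 3, "type": "cell", "tags": {"paragraph"}},
]


def select(predicate):
    """Return the ids of all toy nodes that satisfy the predicate."""
    return [n["id"] for n in nodes if predicate(n)]


def is_line(n):
    return n["type"] == "line"      # `=`: compare a property with a literal


def has_paragraph(n):
    return "paragraph" in n["tags"]


def union(left, right):
    """`|`: combine two result lists, dropping duplicates."""
    combined = []
    for node_id in left + right:
        if node_id not in combined:
            combined.append(node_id)
    return combined


# `and`: both conditions must hold for the same node.
both = select(lambda n: is_line(n) and has_paragraph(n))

# `|`: a node qualifies if it appears in either result.
either = union(select(is_line), select(has_paragraph))

print(both, either)
```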