Kodexa is a document processing platform that lets developers work with documents in a structured way. This guide walks through the basics of creating, saving, loading, and working with Kodexa Documents.

Creating a Document

The first step in working with Kodexa is typically creating a new document. Let's assume we have a document in PDF format:

from kodexa import Document

my_document = Document.from_file('example.pdf')

This step creates an empty document with a reference to the PDF file. The document has not been parsed yet; it only carries the metadata Kodexa needs to locate the original file.

Saving a Document

You can save a Kodexa Document to a file or a store. By default, documents are saved in the Kodexa format, which is a SQLite database:

my_document.to_kddb('my-document.kddb')
my_document.close()

By convention, we use the .kddb extension for Kodexa Document Database files.
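Because the saved file is a SQLite database, it can in principle be inspected with any SQLite client. The sketch below uses only Python's standard sqlite3 module and a stand-in file so that it runs without Kodexa installed. The table name example_nodes is hypothetical and is not the real .kddb schema, which this guide does not describe.

```python
import os
import sqlite3
import tempfile


def list_tables(path):
    """Return the names of all tables in the SQLite file at `path`."""
    connection = sqlite3.connect(path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


# Build a stand-in database so the sketch is self-contained.
# `example_nodes` is a hypothetical table, not the real .kddb schema.
demo_path = os.path.join(tempfile.mkdtemp(), "stand-in.kddb")
connection = sqlite3.connect(demo_path)
connection.execute("CREATE TABLE example_nodes (id INTEGER PRIMARY KEY, content TEXT)")
connection.commit()
connection.close()

print(list_tables(demo_path))
```

Pointing list_tables at a real .kddb file should list whatever tables Kodexa created, assuming the file is not open for writing elsewhere.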

Loading a Document

To load a previously saved Kodexa Document:

another_document = Document.from_kddb('my-document.kddb')
another_document.close()

Detached Documents

Sometimes you may want to make changes to a document without affecting the original file. For this, you can load the document in detached mode:

detached_document = Document.from_kddb('my-document.kddb', detached=True)
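The idea behind detached mode can be illustrated with plain files: work on a private copy so that the original is never written to. The sketch below is only an analogy built on the standard library. It does not show how Kodexa implements detached=True, and the notes table and the open_detached helper are hypothetical.

```python
import os
import shutil
import sqlite3
import tempfile


def count_rows(path):
    """Count rows in the stand-in `notes` table of the database at `path`."""
    connection = sqlite3.connect(path)
    try:
        return connection.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
    finally:
        connection.close()


def open_detached(path):
    """Copy the database to a temporary file and return (connection, copy_path)."""
    handle, copy_path = tempfile.mkstemp(suffix=".kddb")
    os.close(handle)
    shutil.copyfile(path, copy_path)
    return sqlite3.connect(copy_path), copy_path


# Stand-in original file; the `notes` table is hypothetical.
original_path = os.path.join(tempfile.mkdtemp(), "original.kddb")
original = sqlite3.connect(original_path)
original.execute("CREATE TABLE notes (body TEXT)")
original.execute("INSERT INTO notes VALUES ('first')")
original.commit()
original.close()

# Changes made through the detached copy never reach the original file.
detached, copy_path = open_detached(original_path)
detached.execute("INSERT INTO notes VALUES ('second')")
detached.commit()
detached.close()

print(count_rows(original_path), count_rows(copy_path))
```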

Anatomy of a Kodexa Document

The Kodexa Document Model provides a flexible way to represent structured and unstructured documents. At its core, it consists of a Document object that contains metadata and a hierarchical tree of ContentNodes, each of which can have features attached.

Let's explore the key components of the model.

Core Components

Document Structure

graph TD
    A[Document] --> B[Metadata]
    A --> C[Content Node Tree]
    A --> D[Source Metadata]
    A --> E[Labels]
    C --> F[Root Content Node]
    F --> G[Child Node 1]
    F --> H[Child Node 2]
    G --> I[Grandchild Node]

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style C fill:#bbf,stroke:#333,stroke-width:2px
    style F fill:#bfb,stroke:#333,stroke-width:2px

A Kodexa Document consists of:

  1. Document Metadata: Flexible dictionary-based metadata about the document
  2. Content Node Tree: Hierarchical structure of content nodes
  3. Source Metadata: Information about the document's origin
  4. Labels: Document-level labels

Content Nodes

ContentNodes are the building blocks of the document structure. Each ContentNode represents a logical section of content and has the following properties:

graph LR
    A[ContentNode] --> B[Node Type]
    A --> C[Content]
    A --> D[Features]
    A --> E[Children]
    A --> F[UUID]
    A --> G[Index]

    style A fill:#f9f,stroke:#333,stroke-width:2px

Key attributes, as shown in the diagram above: node type, content, features, children, UUID, and index.
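To make the shape of a node concrete, here is a minimal stand-in written as a plain dataclass. It is not the Kodexa ContentNode class. The name SketchNode is hypothetical, and treating index as the node's position among its siblings is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from uuid import uuid4


@dataclass
class SketchNode:
    """Hypothetical stand-in mirroring the properties in the diagram."""

    node_type: str
    content: Optional[str] = None
    features: dict = field(default_factory=dict)
    children: List["SketchNode"] = field(default_factory=list)
    uuid: str = field(default_factory=lambda: str(uuid4()))
    index: int = 0

    def add_child(self, child):
        # Assumption: index records the child's position among its siblings.
        child.index = len(self.children)
        self.children.append(child)
        return child


root = SketchNode(node_type="root")
root.add_child(SketchNode(node_type="page", content="Page one"))
root.add_child(SketchNode(node_type="page", content="Page two"))

for page in root.children:
    print(page.index, page.node_type, page.content)
```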

Features

Features are flexible metadata containers attached to ContentNodes. They come in different types:

graph TD
    A[Feature Types] --> B[Tags]
    A --> C[Spatial]
    A --> D[Custom Features]

    B --> E[Named Tags]
    B --> F[Tag Values]

    C --> G[Bounding Boxes]
    C --> H[Coordinates]

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style C fill:#bfb,stroke:#333,stroke-width:2px

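Each feature has a type, a name, and a value, as in the add_feature("tag", "paragraph", "body") call shown under Working with Features.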
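One way to picture features is as values stored under a (type, name) pair, which mirrors the argument order of the add_feature call used in this guide. The sketch below is a plain-Python illustration. FeatureStore and has_tag are hypothetical names, and the keying scheme is an assumption rather than Kodexa's internal storage.

```python
from typing import Any, Dict, Tuple


class FeatureStore:
    """Hypothetical container keyed by (feature_type, name)."""

    def __init__(self):
        self._features: Dict[Tuple[str, str], Any] = {}

    def add_feature(self, feature_type, name, value):
        self._features[(feature_type, name)] = value

    def get_feature_value(self, feature_type, name, default=None):
        return self._features.get((feature_type, name), default)

    def has_tag(self, name):
        # A tag is treated here as a feature whose type is "tag".
        return ("tag", name) in self._features


store = FeatureStore()
store.add_feature("tag", "paragraph", "body")
store.add_feature("spatial", "bbox", [10, 20, 100, 200])

print(store.get_feature_value("tag", "paragraph"))
print(store.has_tag("heading"))
```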

Working with Documents

Creating Documents

# Create a new document
doc = Document()

# Create from text
doc = Document.from_text("Some content")

# Create from file
doc = Document.from_file("path/to/file")

Adding Content

# Create a root node
root = doc.create_node(node_type="root")
doc.content_node = root

# Add child nodes
page = doc.create_node(node_type="page", content="Page content")
root.add_child(page)

Working with Features

# Here, node is any ContentNode, for example the page node created above
# Add a tag feature
node.add_feature("tag", "paragraph", "body")

# Add spatial information
node.set_bbox([10, 20, 100, 200])

# Get feature value
value = node.get_feature_value("tag", "paragraph")

Node Navigation and Selection

The document model provides powerful ways to navigate and select nodes:

  1. Direct Navigation:

  2. Selector-based Navigation:

    # Select all nodes of type 'page'
    pages = document.select("//page")
    
    # Select nodes with specific tags
    tagged = document.select("//*[hasTag('paragraph')]")
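Conceptually, a selector such as //page visits every node in the tree and keeps those that match. The sketch below imitates that with nested dictionaries and a depth-first walk. It is a simplified illustration, not Kodexa's selector engine, and the walk, select_by_type, and select_tagged helpers are hypothetical.

```python
def walk(node):
    """Yield `node` and every descendant, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def select_by_type(root, node_type):
    """Rough analogue of //node_type; "*" matches any node type."""
    return [n for n in walk(root) if node_type == "*" or n["type"] == node_type]


def select_tagged(root, tag):
    """Rough analogue of //*[hasTag('tag')]."""
    return [n for n in walk(root) if tag in n.get("tags", ())]


tree = {
    "type": "root",
    "children": [
        {"type": "page", "children": [
            {"type": "line", "tags": {"paragraph"}},
            {"type": "line"},
        ]},
        {"type": "page", "children": []},
    ],
}

pages = select_by_type(tree, "page")
tagged = select_tagged(tree, "paragraph")
print(len(pages), len(tagged))
```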
    
    

Best Practices

  1. Node Types: Use consistent node types throughout your document to make selection and processing easier
  2. Features:
  3. Content Structure:
  4. Performance:

Error Handling

The document model records processing errors on the document itself through the ContentException class:

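# Assumed import path; check it against your installed kodexa version
from kodexa.model import ContentException

# Add an exception to the document
doc.add_exception(ContentException(
    exception_type="validation",
    message="Invalid content structure",
    severity="ERROR"
))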
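Recording a problem on the document, instead of raising it, lets processing continue while the issues are kept for later review. The sketch below shows that pattern with plain dataclasses. RecordedException and DocumentSketch are hypothetical stand-ins, and the "WARNING" severity is an assumed value that this guide does not list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecordedException:
    """Hypothetical record with the same three fields as the call above."""

    exception_type: str
    message: str
    severity: str = "ERROR"


class DocumentSketch:
    """Hypothetical document that collects exceptions rather than raising."""

    def __init__(self):
        self.exceptions: List[RecordedException] = []

    def add_exception(self, exception):
        self.exceptions.append(exception)

    def errors(self):
        return [e for e in self.exceptions if e.severity == "ERROR"]


sketch = DocumentSketch()
sketch.add_exception(RecordedException("validation", "Invalid content structure"))
# "WARNING" is an assumed severity level used only for illustration.
sketch.add_exception(RecordedException("validation", "Empty page", severity="WARNING"))

print(len(sketch.exceptions), len(sketch.errors()))
```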

Metadata

This is a dictionary containing metadata about the document, such as the source, title, and author:

print(my_document.metadata)

SourceMetadata

This contains metadata about the source document and works with connectors to allow you to access the original source document:

print(my_document.source)

Working with Document Content

Kodexa uses a powerful selector syntax to find and manipulate content within documents.

Selectors work similarly to CSS selectors or XPath, allowing you to build queries that can be executed on a document instance.

Basic Selector Example

To find all content nodes whose content matches the regular expression "Name":

nodes = document.select('//*[contentRegex("Name")]')

This returns an iterator of the matching content nodes.
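The filtering that contentRegex performs can be approximated in plain Python with the re module. This sketch makes two assumptions that the guide does not confirm: the pattern may match anywhere in the content (re.search), and nodes without content are skipped. The matching_contents helper is hypothetical.

```python
import re


def matching_contents(contents, pattern):
    """Lazily yield each content string that the pattern matches."""
    compiled = re.compile(pattern)
    for content in contents:
        # Assumption: nodes with no content never match.
        if content is not None and compiled.search(content):
            yield content


contents = ["Name", "First Name:", "Address", None]
matches = matching_contents(contents, "Name")

for content in matches:
    print(content)
```

Like any Python iterator, the result can be consumed only once; wrap it in list(...) if the matches are needed more than once.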

Selector Syntax

The selector syntax is composed of several parts:

  1. Axis & Node Type: Defines how to navigate the tree structure.
  2. Predicate: Further filters the selected nodes based on conditions.
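The two parts can be seen by splitting the selectors used in this guide into axis, node type, and predicate. The regular expression below is a deliberately small sketch that handles only single-step selectors of the form shown here. It is not the real Kodexa grammar, and split_selector is a hypothetical helper.

```python
import re

# Single step only: an axis ("//" or "/"), a node type or "*",
# and an optional predicate in square brackets.
SELECTOR_STEP = re.compile(
    r"(?P<axis>//|/)(?P<node_type>\*|[\w-]+)(?:\[(?P<predicate>.*)\])?"
)


def split_selector(selector):
    """Split a single-step selector into its axis, node type, and predicate."""
    match = SELECTOR_STEP.fullmatch(selector)
    if match is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    return match.groupdict()


for selector in ("//page", "//*[hasTag('paragraph')]", '//*[contentRegex("Name")]'):
    print(split_selector(selector))
```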

Axis Examples

Predicate Functions

Predicates can use various functions, such as hasTag and contentRegex, both of which appear in the examples above.

Operators

Operators can be used to combine functions: