Kodexa is a powerful document processing platform that allows developers to work with documents in a structured and efficient manner. This guide will walk you through the basics of creating, saving, loading, and working with Kodexa Documents.
The first step in working with Kodexa is typically creating a new document. Let's assume we have a document in PDF format:
from kodexa import Document
my_document = Document.from_file('example.pdf')
This step creates an empty document with a reference to the PDF file. At this point, the document hasn't been parsed, but metadata has been added to allow Kodexa to understand where to find the original document.
You can save a Kodexa Document to a file or a store. By default, documents are saved in the Kodexa format, which is a SQLite database:
my_document.to_kddb('my-document.kddb')
my_document.close()
By convention, we use the .kddb extension for Kodexa Document Database files.
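Because the saved file is described above as a SQLite database, it can in principle be inspected with Python's standard sqlite3 module. The sketch below is a minimal illustration under that assumption: list_tables is a hypothetical helper, not part of Kodexa, and the source does not say which tables a real .kddb file contains.

```python
import sqlite3
from pathlib import Path


def list_tables(path: str) -> list[str]:
    """Return the names of all tables in the SQLite file at `path`."""
    # Open read-only so that inspecting a file can never modify or create it.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        conn.close()
```

Calling list_tables('my-document.kddb') on a saved document would then show how the file is organized, without going through the Kodexa API.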
To load a previously saved Kodexa Document:
another_document = Document.from_kddb('my-document.kddb')
another_document.close()
Sometimes you may want to make changes to a document without affecting the original file. For this, you can load the document in detached mode:
detached_document = Document.from_kddb('my-document.kddb', detached=True)
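The idea behind detached mode, working on the document without touching the original file, can be pictured as operating on a private copy. The sketch below shows that idea with the standard library only. detached_copy is a hypothetical helper, and nothing here is a claim about how Kodexa implements detached=True internally.

```python
import shutil
import tempfile
from pathlib import Path


def detached_copy(path: str) -> Path:
    """Copy the file at `path` into a fresh temporary directory.

    Returns the path of the copy; changes made to the copy never reach
    the original file.
    """
    source = Path(path)
    target = Path(tempfile.mkdtemp(prefix="detached-")) / source.name
    shutil.copy2(source, target)  # copy2 also preserves file metadata
    return target
```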
The Kodexa Document Model provides a flexible way to represent structured and unstructured documents. At its core, it consists of a Document object that contains metadata and a hierarchical tree of ContentNodes, each of which can have features attached to it.
Let's explore the key components of the model.
graph TD
A[Document] --> B[Metadata]
A --> C[Content Node Tree]
A --> D[Source Metadata]
A --> E[Labels]
C --> F[Root Content Node]
F --> G[Child Node 1]
F --> H[Child Node 2]
G --> I[Grandchild Node]
style A fill:#f9f,stroke:#333,stroke-width:2px
style C fill:#bbf,stroke:#333,stroke-width:2px
style F fill:#bfb,stroke:#333,stroke-width:2px
As the diagram above shows, a Kodexa Document consists of metadata, a content node tree, source metadata, and labels.
ContentNodes are the building blocks of the document structure. Each ContentNode represents a logical section of content and has the following properties:
graph LR
A[ContentNode] --> B[Node Type]
A --> C[Content]
A --> D[Features]
A --> E[Children]
A --> F[UUID]
A --> G[Index]
style A fill:#f9f,stroke:#333,stroke-width:2px
Key attributes, as shown in the diagram above: node type, content, features, children, UUID, and index.
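To make the shape of a ContentNode concrete, the attributes from the diagram can be sketched as a plain Python dataclass. SketchNode is a hypothetical stand-in, not the Kodexa class. In particular, treating index as the node's position among its siblings is an assumption made for this illustration.

```python
import uuid as uuid_lib
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchNode:
    """Hypothetical stand-in mirroring the attributes in the diagram."""

    node_type: str
    content: Optional[str] = None
    features: Dict[str, Any] = field(default_factory=dict)
    children: List["SketchNode"] = field(default_factory=list)
    uuid: str = field(default_factory=lambda: str(uuid_lib.uuid4()))
    index: int = 0  # assumption: position among siblings

    def add_child(self, child: "SketchNode") -> "SketchNode":
        """Append `child` and record its position among the siblings."""
        child.index = len(self.children)
        self.children.append(child)
        return child


root = SketchNode("root")
page = root.add_child(SketchNode("page", content="Page content"))
```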
Features are flexible metadata containers attached to ContentNodes. They come in different types:
graph TD
A[Feature Types] --> B[Tags]
A --> C[Spatial]
A --> D[Custom Features]
B --> E[Named Tags]
B --> F[Tag Values]
C --> G[Bounding Boxes]
C --> H[Coordinates]
style A fill:#f9f,stroke:#333,stroke-width:2px
style B fill:#bbf,stroke:#333,stroke-width:2px
style C fill:#bfb,stroke:#333,stroke-width:2px
Each feature has a type, a name, and a value.
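One way to picture features is as values stored under a (type, name) key. The sketch below assumes that layout, which matches the argument order of the add_feature and get_feature_value calls in this guide. SketchFeature and FeatureStore are hypothetical names, not Kodexa classes.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class SketchFeature:
    """Hypothetical feature record: a type, a name, and a value."""

    feature_type: str
    name: str
    value: Any


class FeatureStore:
    """Hypothetical container that keys features by (type, name)."""

    def __init__(self) -> None:
        self._features: Dict[Tuple[str, str], SketchFeature] = {}

    def add_feature(self, feature_type: str, name: str, value: Any) -> SketchFeature:
        # Adding the same (type, name) again replaces the earlier value.
        feature = SketchFeature(feature_type, name, value)
        self._features[(feature_type, name)] = feature
        return feature

    def get_feature_value(self, feature_type: str, name: str) -> Optional[Any]:
        # Return None when no such feature has been added.
        feature = self._features.get((feature_type, name))
        return None if feature is None else feature.value


store = FeatureStore()
store.add_feature("tag", "paragraph", "body")
```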
# Create a new document
doc = Document()
# Create from text
doc = Document.from_text("Some content")
# Create from file
doc = Document.from_file("path/to/file")
# Create a root node
root = doc.create_node(node_type="root")
doc.content_node = root
# Add child nodes
page = doc.create_node(node_type="page", content="Page content")
root.add_child(page)
# Add a tag feature to a node (here, the page node created above)
page.add_feature("tag", "paragraph", "body")
# Add spatial information
page.set_bbox([10, 20, 100, 200])
# Get feature value
value = page.get_feature_value("tag", "paragraph")
The document model provides powerful ways to navigate and select nodes:
Direct Navigation:
get_children()
: Get immediate child nodes
get_parent()
: Get parent node
next_node()
: Get next sibling
previous_node()
: Get previous sibling
Selector-based Navigation:
# Select all nodes of type 'page'
pages = document.select("//page")
# Select nodes with specific tags
tagged = document.select("//*[hasTag('paragraph')]")
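The direct navigation methods listed above can be pictured on a small stand-in tree. NavNode below is hypothetical and only borrows the method names from this guide. It is a sketch of parent, child, and sibling traversal, not the Kodexa implementation.

```python
from typing import List, Optional


class NavNode:
    """Hypothetical tree node supporting parent, child, and sibling navigation."""

    def __init__(self, node_type: str, parent: Optional["NavNode"] = None) -> None:
        self.node_type = node_type
        self._parent = parent
        self._children: List["NavNode"] = []
        if parent is not None:
            parent._children.append(self)

    def get_children(self) -> List["NavNode"]:
        return list(self._children)

    def get_parent(self) -> Optional["NavNode"]:
        return self._parent

    def _sibling(self, offset: int) -> Optional["NavNode"]:
        # The root has no siblings.
        if self._parent is None:
            return None
        siblings = self._parent._children
        position = siblings.index(self) + offset
        return siblings[position] if 0 <= position < len(siblings) else None

    def next_node(self) -> Optional["NavNode"]:
        return self._sibling(1)

    def previous_node(self) -> Optional["NavNode"]:
        return self._sibling(-1)


root = NavNode("root")
first = NavNode("page", parent=root)
second = NavNode("page", parent=root)
```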
The document model includes robust error handling through the ContentException class:
# Add an exception to the document
doc.add_exception(ContentException(
exception_type="validation",
message="Invalid content structure",
severity="ERROR"
))
This is a dictionary containing metadata about the document, such as the source, title, author, etc:
print(my_document.metadata)
This contains metadata about the source document and works with connectors to allow you to access the original source document:
print(my_document.source)
Kodexa uses a powerful selector syntax to find and manipulate content within documents.
Selectors work similarly to CSS selectors or XPath, allowing you to build queries that can be executed on a document instance.
To find all content nodes whose content matches the regular expression "Name":
nodes = document.select('//*[contentRegex("Name")]')
This returns an iterator of the matching content nodes.
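Conceptually, a selector such as the one above walks the node tree and keeps the nodes whose content matches the pattern. The sketch below reproduces that behaviour on a hypothetical TextNode class with a generator, so the result is also an iterator. It uses re.search, which matches anywhere in the content. Whether Kodexa anchors the pattern is not stated in this guide, so treat that choice as an assumption.

```python
import re
from typing import Iterator, List, Optional


class TextNode:
    """Hypothetical stand-in for a content node: a type, content, and children."""

    def __init__(
        self,
        node_type: str,
        content: Optional[str] = None,
        children: Optional[List["TextNode"]] = None,
    ) -> None:
        self.node_type = node_type
        self.content = content
        self.children = children or []


def walk(node: TextNode) -> Iterator[TextNode]:
    """Yield `node` and every node below it, depth-first."""
    yield node
    for child in node.children:
        yield from walk(child)


def select_content_regex(root: TextNode, pattern: str) -> Iterator[TextNode]:
    """Lazily yield the nodes under `root` whose content matches `pattern`."""
    compiled = re.compile(pattern)
    for node in walk(root):
        # Nodes without content can never match.
        if node.content is not None and compiled.search(node.content):
            yield node
```

Because the result is lazy, wrap it in list(...) if you need to iterate over the matches more than once.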
The selector syntax is composed of several parts:
//
: Current node and all children
/
: Root node
.
: Current node (or root if from the document)
./line/.
: All nodes of type line under the current node
parent::line
: Any node in the parent structure of this node that is of node type line
Predicates can use various functions, such as:
contentRegex
: Matches content against a regular expression
typeRegex
: Matches node type name against a regular expression
hasTag
: Checks if a node has a specific tag
hasFeature
: Checks if a node has a specific feature
content
: Returns the content of the node
uuid
: Returns the UUID of the node
Operators can be used to combine functions:
|
: Union the results of two sides
=
: Test that two sides are equal
and
: Boolean AND operation
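The meaning of the three operators can be illustrated in plain Python, with nodes modelled as dictionaries. This is only a sketch of the semantics: has_tag, content_regex, select, and union are hypothetical helpers, and dropping duplicates from a union is an assumption rather than documented Kodexa behaviour.

```python
import re


def has_tag(node: dict, tag: str) -> bool:
    return tag in node["tags"]


def content_regex(node: dict, pattern: str) -> bool:
    return re.search(pattern, node["content"]) is not None


def select(nodes: list, predicate) -> list:
    """Keep the nodes for which `predicate` is true."""
    return [node for node in nodes if predicate(node)]


def union(left: list, right: list) -> list:
    """Combine two result lists, keeping first-seen order without duplicates."""
    seen, combined = set(), []
    for node in left + right:
        if id(node) not in seen:
            seen.add(id(node))
            combined.append(node)
    return combined


nodes = [
    {"type": "line", "content": "Name: Ada", "tags": {"paragraph"}},
    {"type": "line", "content": "Name: Bob", "tags": set()},
    {"type": "word", "content": "Total", "tags": {"paragraph"}},
]

# "=" tests equality, here between the node type and a literal.
lines = select(nodes, lambda n: n["type"] == "line")
# "and" requires both predicates to hold for the same node.
both = select(nodes, lambda n: has_tag(n, "paragraph") and content_regex(n, "Name"))
# "|" unions the results of two independent selections.
either = union(
    select(nodes, lambda n: has_tag(n, "paragraph")),
    select(nodes, lambda n: content_regex(n, "Name")),
)
```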