ORM Model Patterns¶

Purpose¶

Define SQLAlchemy ORM conventions for all Nextpoint service modules: naming, column types, composite keys for deduplication, soft deletes, timestamps, JSON columns, and the builder pattern for entity construction.

Model Catalog (documentloader)¶

18 models across three categories:

Core Entities¶

Model	Table	Primary Key	Purpose
Exhibits	`exhibits`	`id` (auto)	Documents in the case
Attachments	`attachments`	`id` (auto)	File pages/versions of exhibits
Batches	`batches`	`id` (auto)	Import batch lifecycle
Npcases	`npcases`	`id` (auto)	Case (tenant) metadata
Labels	`labels`	`id` (auto)	Classification labels
Custodians	`custodians`	`id` (auto)	Document custodians
Tags	`tags`	`id` (auto)	Metadata tags
CustomFields	`custom_fields`	`id` (auto)	User-defined fields

Relationship Tables¶

Model	Table	Purpose
Taggings	`taggings`	Exhibit ↔ Tag junction
CustodianExhibits	`custodian_exhibits`	Custodian ↔ Exhibit junction
BatchSources	`batch_sources`	Batch ↔ Source (polymorphic)
DocumentRelations	`document_relations`	Exhibit parent/child relationships
DocumentRelationsInBatch	`document_relations_in_batch`	Temp table for batch processing
DepositionVolumeExhibits	`deposition_volume_exhibits`	Deposition ↔ Exhibit junction
ExhDesignations	`exh_designations`	Exhibit ↔ Label with metadata

Deduplication & Processing¶

Model	Table	Composite PK	Purpose
ProcessCheckpoints	`process_checkpoints`	(npcase_id, batch_id, nge_document_id)	Pipeline resume state
DocDedupe	`doc_dedupe`	(npcase_id, message_id, bcc, md5, doc_type)	Content deduplication
TagDedupe	`tag_dedupe`	(npcase_id, name, custom_field_id)	Tag deduplication (SHA256 name)

Composite Primary Keys for Idempotency¶

Composite PKs serve as distributed locks and deduplication keys:

class ProcessCheckpoints(Base):
    __tablename__ = "process_checkpoints"

    # Composite PK — only one checkpoint per document per batch
    npcase_id = Column(Integer, primary_key=True)
    batch_id = Column(Integer, primary_key=True)
    nge_document_id = Column(String(255), primary_key=True)

    checkpoint_id = Column(Integer)
    exhibit_ids = Column(String(2000))     # Comma-separated IDs
    index_documents = Column(Boolean)
    created_at_gmt = Column(DateTime)

class DocDedupe(Base):
    __tablename__ = "doc_dedupe"

    # 5-column composite PK for content deduplication
    npcase_id = Column(Integer, primary_key=True, nullable=False)
    message_id = Column(String(255), primary_key=True)
    bcc = Column(String(2000), primary_key=True)
    md5 = Column(String(255), primary_key=True)
    doc_type = Column(String(255), primary_key=True, nullable=False)

    created_at_gmt = Column(DateTime)

    __table_args__ = (
        UniqueConstraint('npcase_id', 'message_id', 'bcc', 'md5', 'doc_type',
                         name='_doc_dedupe_uc'),
    )

class TagDedupe(Base):
    __tablename__ = "tag_dedupe"

    # Name column stores SHA256 hash of full tag content (255 char limit)
    npcase_id = Column(Integer, primary_key=True)
    name = Column(String(256), primary_key=True)
    custom_field_id = Column(Integer, primary_key=True)

    created_at_gmt = Column(DateTime)

Timestamp Conventions¶

Known inconsistency — the codebase has two timestamp naming patterns:

# Pattern 1: _gmt suffix (preferred for new code)
created_at_gmt = Column(DateTime)
updated_at_gmt = Column(DateTime)
deleted_at_gmt = Column(DateTime)        # Soft delete
loader_status_updated_at_gmt = Column(DateTime)

# Pattern 2: Legacy Rails convention (existing tables)
created_on = Column(DateTime)            # Exhibits
updated_on = Column(DateTime)            # ExhDesignations, Attachments
created_at = Column(DateTime)            # Batches (no _gmt suffix)

Rule for new models: Always use _gmt suffix to indicate UTC storage.

Soft Delete Pattern¶

Soft deletes use a nullable DateTime column. NULL = active, timestamp = deleted:

class Exhibits(Base):
    # ...
    delete_at_gmt = Column(DateTime)     # NULL = active

class Npcases(Base):
    # ...
    deleted_at_gmt = Column(DateTime)    # Note: different column name

class Batches(Base):
    # ...
    delete_at_gmt = Column(DateTime)

When querying: Always filter WHERE delete_at_gmt IS NULL unless explicitly including deleted records.

JSON Columns¶

For flexible, schema-less metadata:

class Npcases(Base):
    # ...
    near_dupe_info = Column(JSON)        # Near-duplicate configuration
    bulk_redaction_info = Column(JSON)    # Bulk redaction settings

Usage rules: - JSON columns for configuration/metadata that varies per case - NOT for data that needs to be queried or indexed - Always validate structure in Python before writing

Builder Pattern for Entity Construction¶

Map validated schema data to ORM model instances:

# core/models/db_models.py

def build_exhibit(exhibit_data: ExhibitData) -> Exhibits:
    """Construct an Exhibits instance from validated schema data."""
    return Exhibits(
        prefix=exhibit_data.prefix,
        number=exhibit_data.number,
        npcase_id=exhibit_data.npcase_id,
        user_id=exhibit_data.user_id,
        batch_id=exhibit_data.batch_id,
        notes=exhibit_data.notes,
        author=exhibit_data.author,
        shortcut=exhibit_data.shortcut,
        document_date=exhibit_data.document_date,
        document_type=exhibit_data.document_type,
        # ... all fields mapped
        verified_page_count=None,  # Set during processing, not creation
        created_on=exhibit_data.created_on,
        updated_at_gmt=exhibit_data.updated_at_gmt,
    )

Why a builder? The ExhibitData schema has fields that don't map 1:1 to the ORM model (e.g., file_name, display_name are used for processing but not stored in the exhibits table). The builder function is the single place that handles this mapping.

Validation Schemas¶

Input data validated with dataclasses before ORM construction:

# core/models/schemas.py

@dataclass
class ExhibitData:
    """Validated input for exhibit creation."""
    npcase_id: int
    prefix: Optional[str]
    number: Optional[str]
    user_id: int
    batch_id: int
    notes: Optional[str]
    author: Optional[str]
    # ... 40+ fields covering:
    # - Document metadata (title, date, type)
    # - Email metadata (message_id, author, sent/received dates)
    # - Office metadata (last_author, app_name, created/modified dates)
    # - Processing fields (family_id, nge_document_id, content_hash)
    # - File info (extension, name, display_name)

@dataclass
class ProcessCheckpointParams:
    """Parameters for checkpoint create/update."""
    npcase_id: int
    batch_id: int
    nge_document_id: str
    created_at_gmt: Optional[datetime] = None
    checkpoint_id: Optional[int] = None
    exhibit_ids: Optional[str] = None
    index_documents: Optional[bool] = None

class ExhibitCreateResponse(TypedDict):
    """Return type from exhibit creation."""
    update_es: bool
    exhibit_ids: list[int]

Index Conventions¶

Explicit indexes for query performance:

class DocumentRelations(Base):
    __tablename__ = "document_relations"
    # ...
    __table_args__ = (
        Index('index_document_relations_on_from_id_and_relation', 'from_id', 'relation'),
        Index('index_document_relations_on_to_id_and_relation', 'to_id', 'relation'),
    )

class DocumentRelationsInBatch(Base):
    __tablename__ = "document_relations_in_batch"
    # ...
    __table_args__ = (
        Index('index_document_relations_old_pk',
              'npcase_id', 'batch_id', 'child_document_id', 'parent_document_id',
              unique=True),
    )

class CustodianExhibits(Base):
    __tablename__ = "custodian_exhibits"
    # ...
    __table_args__ = (
        Index('index_custodian_exhibits_on_custodian_id', 'custodian_id'),
        Index('index_custodian_exhibits_on_exhibit_id', 'exhibit_id'),
        Index('index_custodian_exhibits_on_user_id', 'user_id'),
    )

Polymorphic Relationships¶

BatchSources uses a type column for polymorphic associations:

class BatchSources(Base):
    __tablename__ = "batch_sources"

    id = Column(Integer, primary_key=True, autoincrement=True)
    batch_id = Column(Integer)
    item_id = Column(Integer)        # FK to different tables
    item_type = Column(String(255))  # "Exhibit", "Attachment", etc.

Key Rules¶

Composite PKs for deduplication — use database constraints, not application logic
_gmt suffix on new timestamps — indicates UTC, distinguishes from legacy columns
Soft delete = nullable DateTime — never use boolean flags for soft deletes
Builder functions for complex entities — keep mapping logic in one place
Dataclass schemas for validation — validate before constructing ORM instances
Explicit indexes — don't rely on ORM auto-indexing, declare what you need
JSON columns for config only — not for queryable or indexable data
Comma-separated IDs are technical debt — use JSON arrays or junction tables for new models
UniqueConstraint + composite PK — belt and suspenders for deduplication tables

Ask the Architecture ×

Ask questions about Nextpoint architecture, patterns, rules, or any module. Powered by Claude Opus 4.6.