Skip to content

ORM Model Patterns

Purpose

Define SQLAlchemy ORM conventions for all Nextpoint service modules: naming, column types, composite keys for deduplication, soft deletes, timestamps, JSON columns, and the builder pattern for entity construction.

Model Catalog (documentloader)

18 models across three categories:

Core Entities

Model Table Primary Key Purpose
Exhibits exhibits id (auto) Documents in the case
Attachments attachments id (auto) File pages/versions of exhibits
Batches batches id (auto) Import batch lifecycle
Npcases npcases id (auto) Case (tenant) metadata
Labels labels id (auto) Classification labels
Custodians custodians id (auto) Document custodians
Tags tags id (auto) Metadata tags
CustomFields custom_fields id (auto) User-defined fields

Relationship Tables

Model Table Purpose
Taggings taggings Exhibit ↔ Tag junction
CustodianExhibits custodian_exhibits Custodian ↔ Exhibit junction
BatchSources batch_sources Batch ↔ Source (polymorphic)
DocumentRelations document_relations Exhibit parent/child relationships
DocumentRelationsInBatch document_relations_in_batch Temp table for batch processing
DepositionVolumeExhibits deposition_volume_exhibits Deposition ↔ Exhibit junction
ExhDesignations exh_designations Exhibit ↔ Label with metadata

Deduplication & Processing

Model Table Composite PK Purpose
ProcessCheckpoints process_checkpoints (npcase_id, batch_id, nge_document_id) Pipeline resume state
DocDedupe doc_dedupe (npcase_id, message_id, bcc, md5, doc_type) Content deduplication
TagDedupe tag_dedupe (npcase_id, name, custom_field_id) Tag deduplication (SHA256 name)

Composite Primary Keys for Idempotency

Composite PKs serve as distributed locks and deduplication keys:

class ProcessCheckpoints(Base):
    __tablename__ = "process_checkpoints"

    # Composite PK — only one checkpoint per document per batch
    npcase_id = Column(Integer, primary_key=True)
    batch_id = Column(Integer, primary_key=True)
    nge_document_id = Column(String(255), primary_key=True)

    checkpoint_id = Column(Integer)
    exhibit_ids = Column(String(2000))     # Comma-separated IDs
    index_documents = Column(Boolean)
    created_at_gmt = Column(DateTime)
class DocDedupe(Base):
    __tablename__ = "doc_dedupe"

    # 5-column composite PK for content deduplication
    npcase_id = Column(Integer, primary_key=True, nullable=False)
    message_id = Column(String(255), primary_key=True)
    bcc = Column(String(2000), primary_key=True)
    md5 = Column(String(255), primary_key=True)
    doc_type = Column(String(255), primary_key=True, nullable=False)

    created_at_gmt = Column(DateTime)

    __table_args__ = (
        UniqueConstraint('npcase_id', 'message_id', 'bcc', 'md5', 'doc_type',
                         name='_doc_dedupe_uc'),
    )
class TagDedupe(Base):
    __tablename__ = "tag_dedupe"

    # Name column stores SHA256 hash of full tag content (255 char limit)
    npcase_id = Column(Integer, primary_key=True)
    name = Column(String(256), primary_key=True)
    custom_field_id = Column(Integer, primary_key=True)

    created_at_gmt = Column(DateTime)

Timestamp Conventions

Known inconsistency — the codebase has two timestamp naming patterns:

# Pattern 1: _gmt suffix (preferred for new code)
created_at_gmt = Column(DateTime)
updated_at_gmt = Column(DateTime)
deleted_at_gmt = Column(DateTime)        # Soft delete
loader_status_updated_at_gmt = Column(DateTime)

# Pattern 2: Legacy Rails convention (existing tables)
created_on = Column(DateTime)            # Exhibits
updated_on = Column(DateTime)            # ExhDesignations, Attachments
created_at = Column(DateTime)            # Batches (no _gmt suffix)

Rule for new models: Always use _gmt suffix to indicate UTC storage.

Soft Delete Pattern

Soft deletes use a nullable DateTime column. NULL = active, timestamp = deleted:

class Exhibits(Base):
    # ...
    delete_at_gmt = Column(DateTime)     # NULL = active

class Npcases(Base):
    # ...
    deleted_at_gmt = Column(DateTime)    # Note: different column name

class Batches(Base):
    # ...
    delete_at_gmt = Column(DateTime)

When querying: Always filter WHERE delete_at_gmt IS NULL unless explicitly including deleted records.

JSON Columns

For flexible, schema-less metadata:

class Npcases(Base):
    # ...
    near_dupe_info = Column(JSON)        # Near-duplicate configuration
    bulk_redaction_info = Column(JSON)    # Bulk redaction settings

Usage rules: - JSON columns for configuration/metadata that varies per case - NOT for data that needs to be queried or indexed - Always validate structure in Python before writing

Builder Pattern for Entity Construction

Map validated schema data to ORM model instances:

# core/models/db_models.py

def build_exhibit(exhibit_data: ExhibitData) -> Exhibits:
    """Construct an Exhibits instance from validated schema data."""
    return Exhibits(
        prefix=exhibit_data.prefix,
        number=exhibit_data.number,
        npcase_id=exhibit_data.npcase_id,
        user_id=exhibit_data.user_id,
        batch_id=exhibit_data.batch_id,
        notes=exhibit_data.notes,
        author=exhibit_data.author,
        shortcut=exhibit_data.shortcut,
        document_date=exhibit_data.document_date,
        document_type=exhibit_data.document_type,
        # ... all fields mapped
        verified_page_count=None,  # Set during processing, not creation
        created_on=exhibit_data.created_on,
        updated_at_gmt=exhibit_data.updated_at_gmt,
    )

Why a builder? The ExhibitData schema has fields that don't map 1:1 to the ORM model (e.g., file_name, display_name are used for processing but not stored in the exhibits table). The builder function is the single place that handles this mapping.

Validation Schemas

Input data validated with dataclasses before ORM construction:

# core/models/schemas.py

@dataclass
class ExhibitData:
    """Validated input for exhibit creation."""
    npcase_id: int
    prefix: Optional[str]
    number: Optional[str]
    user_id: int
    batch_id: int
    notes: Optional[str]
    author: Optional[str]
    # ... 40+ fields covering:
    # - Document metadata (title, date, type)
    # - Email metadata (message_id, author, sent/received dates)
    # - Office metadata (last_author, app_name, created/modified dates)
    # - Processing fields (family_id, nge_document_id, content_hash)
    # - File info (extension, name, display_name)

@dataclass
class ProcessCheckpointParams:
    """Parameters for checkpoint create/update."""
    npcase_id: int
    batch_id: int
    nge_document_id: str
    created_at_gmt: Optional[datetime] = None
    checkpoint_id: Optional[int] = None
    exhibit_ids: Optional[str] = None
    index_documents: Optional[bool] = None

class ExhibitCreateResponse(TypedDict):
    """Return type from exhibit creation."""
    update_es: bool
    exhibit_ids: list[int]

Index Conventions

Explicit indexes for query performance:

class DocumentRelations(Base):
    __tablename__ = "document_relations"
    # ...
    __table_args__ = (
        Index('index_document_relations_on_from_id_and_relation', 'from_id', 'relation'),
        Index('index_document_relations_on_to_id_and_relation', 'to_id', 'relation'),
    )

class DocumentRelationsInBatch(Base):
    __tablename__ = "document_relations_in_batch"
    # ...
    __table_args__ = (
        Index('index_document_relations_old_pk',
              'npcase_id', 'batch_id', 'child_document_id', 'parent_document_id',
              unique=True),
    )

class CustodianExhibits(Base):
    __tablename__ = "custodian_exhibits"
    # ...
    __table_args__ = (
        Index('index_custodian_exhibits_on_custodian_id', 'custodian_id'),
        Index('index_custodian_exhibits_on_exhibit_id', 'exhibit_id'),
        Index('index_custodian_exhibits_on_user_id', 'user_id'),
    )

Polymorphic Relationships

BatchSources uses a type column for polymorphic associations:

class BatchSources(Base):
    __tablename__ = "batch_sources"

    id = Column(Integer, primary_key=True, autoincrement=True)
    batch_id = Column(Integer)
    item_id = Column(Integer)        # FK to different tables
    item_type = Column(String(255))  # "Exhibit", "Attachment", etc.

Key Rules

  1. Composite PKs for deduplication — use database constraints, not application logic
  2. _gmt suffix on new timestamps — indicates UTC, distinguishes from legacy columns
  3. Soft delete = nullable DateTime — never use boolean flags for soft deletes
  4. Builder functions for complex entities — keep mapping logic in one place
  5. Dataclass schemas for validation — validate before constructing ORM instances
  6. Explicit indexes — don't rely on ORM auto-indexing, declare what you need
  7. JSON columns for config only — not for queryable or indexable data
  8. Comma-separated IDs are technical debt — use JSON arrays or junction tables for new models
  9. UniqueConstraint + composite PK — belt and suspenders for deduplication tables
Ask the Architecture ×

Ask questions about Nextpoint architecture, patterns, rules, or any module. Powered by Claude Opus 4.6.