
Reference Implementation: shared_libs (Legacy)

Overview

The shared_libs repo is a flat collection of 77+ Ruby modules providing the foundational infrastructure used by both the Rails monolith and the workers repo. It is not a gem — files are loaded via require_relative or symlinked into consuming repos as lib/shared/.

This library provides: the Nextpoint API client (HMAC-SHA1 authenticated), S3 operations with local caching, AWS credential management, document format parsers (DAT, LEF, CMS, MBOX), file conversion tools (Ghostscript, TIFF), connection pooling, token encryption, email sending, and Nutrient (PSPDFKit) integration.

Architecture

shared_libs/
├── Gemfile                          # aws-sdk-s3, charlock_holmes, gd2-ffij, nokogiri, oj
├── nextpoint.rb                     # Core module — environment detection, config access
├── nextpoint_api.rb                 # HMAC-SHA1 API client (1127 lines, XML-based RPC)
├── nextpoint_s3.rb                  # S3 operations (933 lines, caching, multipart, dedup)
├── nextpoint_s3_client.rb           # Low-level S3 client wrapper (list, copy, download)
├── nextpoint_s3_deleter.rb          # S3 deletion wrapper
├── aws_multipart_data_upload.rb     # Multipart upload from StringIO (in-memory data)
├── nextpoint_aws_credentials.rb     # AWS credential strategy (dev/test/prod)
├── nextpoint_sqs.rb                 # SQS message sending (small/large queue routing)
├── nextpoint_sqs_client.rb          # Aws::SQS::Client wrapper
├── nextpoint_ecs.rb                 # ECS service scaling
├── nextpoint_ecs_client.rb          # Aws::ECS::Client wrapper
├── nextpoint_ssm.rb                 # SSM Parameter Store access
├── nextpoint_sm_client.rb           # Secrets Manager client
├── nextpoint_nutrient.rb            # Nutrient (PSPDFKit) PDF rendering (25KB)
├── nextpoint_emailer.rb             # SES/SMTP email with dedup and pirate mode
├── nextpoint_zendesk.rb             # Zendesk ticket creation
├── resource_pool.rb                 # Thread-safe connection pooling
├── locksmith.rb                     # AES-128-CBC token encryption for URLs
├── hash_compactor.rb                # Hash → encrypted token serialization
├── global_npcase_id_handler.rb      # Per-EC2-instance case ID file locking
├── trapped_shell.rb                 # Safe forked process execution with timeout/memory
├── fasthttp.rb                      # Ruby Net::HTTP performance patches
├── load_file.rb                     # Litigation load file parser (CSV/DAT)
├── csv_parser.rb                    # CSV encoding detection (CharlockHolmes)
├── dat.rb                           # Concordance DAT format parser
├── lef_converter.rb                 # LiveNote LEF archive extraction
├── cms_converter.rb                 # Microsoft JET/Access CMS transcript extraction
├── livenote_parser.rb               # LiveNote PTF binary format parser
├── ghostscript_converter.rb         # PDF → TIFF/PNG via Ghostscript
├── tiff_converter.rb                # TIFF manipulation (split, convert, compress)
├── exhibit_image_helper.rb          # Image file management and format routing (471 lines)
├── gd2_extras.rb                    # GD2 image library extensions (quantization, palettes)
├── png_info.rb                      # PNG metadata extraction
├── utf8_encoder.rb                  # UTF-8 conversion with CharlockHolmes detection
├── expansive_hash_calculator.rb     # MD5 hash calculation for deduplication
├── huffman_decoder.rb               # Generic Huffman code decoding
├── denist.rb                        # NIST NSRL hash database lookup (de-NISTing)
├── thread_safe_singleton.rb         # Mutex-protected singleton mixin
├── server_beacon.rb                 # Server identity tracking
├── pirate.rb                        # Pirate-speak for staging email subjects
├── roman_numerals.rb                # Roman numeral conversion
├── extensioned_tempfile.rb          # Tempfile with custom extensions
├── shared_constants.rb              # Production template enums
├── production_custom_delimiters.rb  # Load file delimiter config
├── preserve_logs_on_s3.rb           # Log rotation to S3
├── dump_debug_info_on_alarm.rb      # Debug info on SIGALRM
├── email/
│   └── mbox.rb                      # MBOX format parser (Enumerable)
├── disk/
│   └── image.rb                     # Disk image mounting (raw, EWF forensic)
└── test/
    └── unit/                        # 20 Minitest files

Pattern Mapping

| Pattern | shared_libs Implementation | NGE Equivalent |
| --- | --- | --- |
| API authentication | HMAC-SHA1 signing (Date + method + path → Base64 digest) | Direct DB access; HMAC-SHA1 for documentpageservice API |
| API protocol | XML-based RPC via Net::HTTP with XmlSimple parsing | N/A; NGE uses direct database access |
| Connection pooling | ResourcePool: mutex-based, configurable limit, block-based release | SQLAlchemy session pools, Lambda connection reuse |
| S3 operations | NextPointS3: local file cache (MD5 filenames), ETag freshness, mkdir locking | shell/utils/s3_ops.py with boto3 |
| S3 path convention | /bucket/case_{npcase_id}/{model_type}/{unique_id}/{filename} | Same convention preserved in NGE |
| S3 upload | Server-side AES256 encryption, 5 retries with 5*N backoff, multipart for StringIO | boto3 upload with SSE |
| S3 deletion prevention | delete raises error; must use PendingDelete table | N/A; S3 lifecycle policies |
| AWS credentials | Dev: ~/.aws/credentials; Test: YAML; Prod: instance profile | Lambda execution role; ECS task role |
| Multi-tenant case ID | global_npcase_id_handler.rb: file-based per-EC2 case locking | Per-case DB schema: {base}_case_{case_id} |
| Token encryption | Locksmith: AES-128-CBC with URL-safe Base64 encoding | JWT tokens for Nutrient; Secrets Manager for keys |
| Process execution | TrappedShell: fork/exec with signal trapping, timeout, memory limit | Lambda timeout; ECS resource limits |
| Encoding detection | CharlockHolmes (ICU-based) with UTF-8 fallback | Same approach in documentloader |
| Document format parsing | DAT (Concordance), LEF (LiveNote), CMS (Access DB), MBOX | documentloader handles these via Hyland Filters |
| File conversion | Ghostscript (PDF→image), TIFF tools, GD2/ImageMagick | Hyland Filters, Apache PDFBox, Nutrient |
| Email sending | SES SMTP via MailFactory, dedup (10-min window), pirate mode | SNS notifications; no direct email in NGE |
| Content hashing | MD5 for deduplication (expansive_hash_calculator.rb) | SHA256 in documentloader |
| NIST deduplication | denist.rb: SQLite3 NSRL hash lookup from S3 | Part of documentloader dedup pipeline |
| Structured logging | $nextpoint_global_logger with environment prefixes | JSON structured logging in CloudWatch |
| Configuration | YAML files per environment (nextpoint_global.yml, etc.) | Environment variables + Secrets Manager |
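
The "Token encryption" row can be illustrated with a minimal sketch built on Ruby's standard OpenSSL and Base64 libraries. The module name TokenCipherSketch, the choice to prepend the IV to the ciphertext, and the UTF-8 plaintext are assumptions made for illustration; Locksmith's actual token layout may differ.

```ruby
require 'openssl'
require 'base64'

# Hypothetical stand-in for Locksmith; the IV-prefix layout is an assumption.
module TokenCipherSketch
  CIPHER = 'aes-128-cbc'
  IV_LENGTH = 16

  # Encrypts plaintext and returns a URL-safe Base64 token (IV + ciphertext).
  def self.encrypt(plaintext, key)
    cipher = OpenSSL::Cipher.new(CIPHER)
    cipher.encrypt
    cipher.key = key        # must be exactly 16 bytes for AES-128
    iv = cipher.random_iv   # fresh random IV for every token
    Base64.urlsafe_encode64(iv + cipher.update(plaintext) + cipher.final)
  end

  # Reverses encrypt: splits off the IV, then decrypts the remainder.
  def self.decrypt(token, key)
    raw = Base64.urlsafe_decode64(token)
    cipher = OpenSSL::Cipher.new(CIPHER)
    cipher.decrypt
    cipher.key = key
    cipher.iv = raw[0, IV_LENGTH]
    plain = cipher.update(raw[IV_LENGTH..-1]) + cipher.final
    plain.force_encoding('UTF-8') # assumption: plaintext was UTF-8
  end
end
```

URL-safe Base64 swaps "+" and "/" for "-" and "_", so the token can be placed in a query string without further escaping of those characters.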

Key Components

NextPointAPI Client (nextpoint_api.rb)

The central API client used by workers to communicate with the Rails app. 1127 lines.

Authentication: HMAC-SHA1 signing

  • Signs: "#{date}#{method}#{path}" with shared secret key
  • Header: API-Authorization: #{Base64.encode64(HMAC-SHA1(key, string))}
  • Also supports AES-128-CBC encryption/decryption for sensitive payload data
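
The signing step described above can be sketched with Ruby's standard library. The function name sign_request and its argument order are hypothetical. The sketch uses Base64.strict_encode64 rather than Base64.encode64 so the header value carries no trailing newline; whether the real client strips that newline is an assumption.

```ruby
require 'openssl'
require 'base64'

# Hypothetical helper: builds the value for the API-Authorization header.
# The string to sign is date + method + path, with no separators.
def sign_request(secret_key, date, http_method, path)
  string_to_sign = "#{date}#{http_method}#{path}"
  digest = OpenSSL::HMAC.digest('SHA1', secret_key, string_to_sign)
  Base64.strict_encode64(digest) # 20-byte digest encodes to 28 characters
end
```

A caller would compute the signature once per request and send it alongside the same date value, so the server can rebuild the identical string and compare digests.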

Protocol: XML-based RPC

  • All requests: POST with XML body via XmlSimple.xml_out()
  • All responses: XML parsed via XmlSimple.xml_in()
  • Response wrapped in NextPointAPI::Record, which provides dynamic attribute access via method_missing
  • Type casting: XML types (integer, boolean, datetime, yaml, binary) → Ruby types
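
The dynamic attribute access and type casting mentioned above might look roughly like the following. RecordSketch and cast_xml_value are hypothetical names, only three of the listed types are handled, and the accepted boolean spellings are an assumption; this is not the real NextPointAPI::Record.

```ruby
require 'time'

# Hypothetical response wrapper exposing hash keys as reader methods.
class RecordSketch
  def initialize(attributes)
    @attributes = attributes.transform_keys(&:to_s)
  end

  # record.page_count looks up the "page_count" key; unknown names still
  # raise NoMethodError through super.
  def method_missing(name, *args)
    key = name.to_s
    return @attributes[key] if args.empty? && @attributes.key?(key)

    super
  end

  # Keeps respond_to? consistent with method_missing.
  def respond_to_missing?(name, include_private = false)
    @attributes.key?(name.to_s) || super
  end
end

# Casts an XML string value according to its declared type.
def cast_xml_value(value, type)
  case type
  when 'integer'  then Integer(value, 10)
  when 'boolean'  then %w[true 1].include?(value.strip.downcase)
  when 'datetime' then Time.parse(value).utc
  else value # unknown types are passed through unchanged
  end
end
```

Defining respond_to_missing? alongside method_missing is the usual Ruby convention, since it lets callers probe for an attribute before reading it.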

Connection management:

  • ResourcePool with mutex synchronization
  • Retry: 5 attempts with exponential backoff for 502/503/504 responses
  • Max response time: 180 seconds
  • Background pinger thread for keepalive
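
A mutex-based pool with a configurable limit and block-based release, as described for ResourcePool, could be sketched as follows. PoolSketch and its with_resource method are hypothetical; the real class's API and waiting behavior are not shown in this document.

```ruby
# Hypothetical mutex-based pool; ResourcePool's real API may differ.
class PoolSketch
  def initialize(limit:, &factory)
    @limit = limit
    @factory = factory
    @idle = []
    @created = 0
    @mutex = Mutex.new
    @available = ConditionVariable.new
  end

  # Checks out a resource, yields it, and always returns it to the pool.
  def with_resource
    resource = checkout
    yield resource
  ensure
    checkin(resource) if resource
  end

  private

  def checkout
    @mutex.synchronize do
      loop do
        return @idle.pop unless @idle.empty?

        if @created < @limit
          @created += 1
          return @factory.call
        end
        @available.wait(@mutex) # block until another thread checks in
      end
    end
  end

  def checkin(resource)
    @mutex.synchronize do
      @idle.push(resource)
      @available.signal
    end
  end
end
```

Releasing in an ensure clause means a connection goes back to the pool even when the caller's block raises, which is the main benefit of the block-based style.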

Domain methods:

  • Worker lifecycle: register_as_worker, shutdown_worker, ping
  • Jobs: get_next_job, create_job, update_job, buffer_box_work_request
  • Documents: create_exhibit, update_exhibit, create_attachment, update_attachment
  • Batches: create_batch, update_batch, add_batch_part
  • Search: OCR text limit of 8MB (MAX_ALLOWED_SEARCH_TEXT)

NextPointS3 (nextpoint_s3.rb)

The S3 operations layer. 933 lines.

S3 path convention (s3_path_for(info) method):

/{bucket}/case_{npcase_id}/{model_type}/{unique_id}/{filename}

Detailed model type routing:

  • attachment → /exhibits/{exhibit_id}/ or /wire/document_{id}/ or /zips/
  • native_placeholder → /exhibits/{id}/native-placeholder-{uid}/
  • deposition → /depositions/{uid}/
  • transcript → /transcripts/{uid}/
  • document → /wire/wire/document_{id}/{uid}/
  • export, video, docshare, batch, ai_assistant, ai_summary

Filename sanitization: strips non-alphanumeric chars, prefixes reserved names (current, source, preview, presentation) with original_
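
The path convention and sanitization rule can be combined in a short sketch. The names sanitize_filename and s3_path_sketch are hypothetical. Two details are assumptions: dots, hyphens, and underscores are kept so extensions survive, and the reserved-name check applies to the name without its extension. The sketch also ignores the per-model-type routing listed above.

```ruby
RESERVED_NAMES = %w[current source preview presentation].freeze

# Assumption: dots, hyphens, and underscores survive so extensions stay intact.
def sanitize_filename(name)
  cleaned = name.gsub(/[^A-Za-z0-9._-]/, '')
  base = File.basename(cleaned, '.*')
  RESERVED_NAMES.include?(base.downcase) ? "original_#{cleaned}" : cleaned
end

# Builds the generic /{bucket}/case_{npcase_id}/... path for one object.
def s3_path_sketch(bucket:, npcase_id:, model_type:, unique_id:, filename:)
  "/#{bucket}/case_#{npcase_id}/#{model_type}/#{unique_id}/#{sanitize_filename(filename)}"
end
```

Prefixing reserved names presumably keeps user uploads from colliding with derived files that use those names in the same folder.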

Local caching: Downloads cached in /tmp/np_caches/np_s3_cache/ using MD5-based filenames. Cross-process safety via mkdir-based locking (atomic on POSIX). ETag-based freshness checks.
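
The caching scheme relies on Dir.mkdir being atomic: it either creates the directory or raises Errno::EEXIST, so only one process can hold the lock. A minimal sketch follows, with hypothetical helper names and with the ETag freshness check left out. Hashing the S3 key to get the cache filename is an assumption about what the MD5 is computed over.

```ruby
require 'digest'
require 'fileutils'

# Cache file name derived from the S3 key (assumed input to the MD5).
def cache_path_for(cache_dir, s3_key)
  File.join(cache_dir, Digest::MD5.hexdigest(s3_key))
end

# Holds a directory-based lock while the block runs.
def with_mkdir_lock(lock_dir, wait: 0.05, timeout: 10)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  begin
    Dir.mkdir(lock_dir) # atomic: succeeds for exactly one caller
  rescue Errno::EEXIST
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
      raise "timed out waiting for #{lock_dir}"
    end
    sleep wait
    retry
  end
  begin
    yield
  ensure
    Dir.rmdir(lock_dir) # release the lock even if the block raised
  end
end

# Returns the cached path; the block supplies the body only on a cache miss.
def fetch_cached(cache_dir, s3_key)
  FileUtils.mkdir_p(cache_dir)
  path = cache_path_for(cache_dir, s3_key)
  with_mkdir_lock("#{path}.lock") do
    File.binwrite(path, yield) unless File.exist?(path)
  end
  path
end
```

In a fuller version the block would download from S3, and the existence check would also compare the stored ETag against the remote one before trusting the cached copy.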

Upload patterns:

  • Server-side AES256 encryption on all uploads
  • 5 retries with 5 * attempt second backoff
  • Multipart upload for large files (automatic via SDK)
  • aws_multipart_data_upload.rb extends SDK for StringIO (in-memory data from ZIP entries)
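
The "5 * attempt" backoff can be sketched as a small wrapper. The name with_upload_retries and the sleeper parameter (a hook so tests need not actually sleep) are hypothetical, and rescuing StandardError is a simplification; the real code may retry only on specific S3 errors.

```ruby
# Retries the block up to max_attempts times, waiting 5 * attempt seconds
# after each failure. The final failure is re-raised to the caller.
def with_upload_retries(max_attempts: 5, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    raise if attempt >= max_attempts

    sleeper.call(5 * attempt) # 5s, 10s, 15s, 20s
    retry
  end
end
```

In real use the block would wrap the SDK upload call, for example Aws::S3::Client#put_object with server_side_encryption set to 'AES256'.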

Deletion prevention: delete and delete_objects methods raise errors. All deletions must go through the PendingDelete database table — a safety mechanism to prevent accidental data loss.
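
The guard can be pictured as follows. GuardedS3Sketch, DeletionNotAllowed, and queue_for_deletion are hypothetical names, and an in-memory array stands in for the PendingDelete table; the real error type and table interface are not shown in this document.

```ruby
class DeletionNotAllowed < StandardError; end

# Hypothetical sketch of the deletion guard.
class GuardedS3Sketch
  attr_reader :pending_deletes

  def initialize
    @pending_deletes = [] # stands in for the PendingDelete table
  end

  # Direct deletion is always refused.
  def delete(_key)
    raise DeletionNotAllowed, 'direct S3 deletes are disabled; queue the key instead'
  end
  alias delete_objects delete

  # Records the key so a separate, audited process can remove it later.
  def queue_for_deletion(key)
    @pending_deletes << key
  end
end
```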

IAM user management: Creates/deletes per-case IAM users with scoped S3 policies for direct case folder access.

Configuration Pattern

Three-tier configuration hierarchy:

  1. nextpoint.rb: core module providing Nextpoint.config(), Nextpoint.domain, Nextpoint.deployment_id, Nextpoint.deployment_name
  2. nextpoint_shared_globals.rb: loads nextpoint_shared_global.yml per environment
  3. YAML config files: nextpoint_api.yml, nextpoint_s3.yml, nextpoint_mail.yml (per-environment settings: host, port, credentials, buckets)
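
Per-environment YAML loading might be sketched like this. The helper name load_env_config and the assumption that each file is keyed by environment name at the top level are illustrative; the actual file layout is not shown here.

```ruby
require 'yaml'

# Parses YAML text and returns the settings for one environment.
# Assumption: top-level keys are environment names.
def load_env_config(yaml_text, environment)
  all = YAML.safe_load(yaml_text, aliases: true) || {}
  all.fetch(environment) { raise KeyError, "no config for #{environment}" }
end
```

Failing loudly on a missing environment avoids silently running with empty settings.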

AWS region → deployment mapping (in nextpoint_ssm.rb):

  • us-east-1 → c2
  • us-west-1 → c4
  • ca-central-1 → c5

Process Execution (trapped_shell.rb)

Safe external process execution for document conversion tools:

  • Fork/exec with signal trapping (TERM/INT ignored in child)
  • Configurable timeout enforcement
  • Memory limit enforcement (checks RSS via ps)
  • Non-blocking output reading
  • Status callbacks for progress reporting

Used by workers for LibreOffice, Ghostscript, FFmpeg, Tesseract, etc.
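
The timeout and non-blocking read behavior can be sketched with Process.spawn and IO.select. The function name run_with_timeout is hypothetical. The sketch omits the memory limit, the signal handling in the child, and the status callbacks, so it covers only part of what TrappedShell is described as doing.

```ruby
# Runs argv, collecting combined stdout/stderr; kills the child on timeout.
# Hypothetical sketch, not TrappedShell's implementation.
def run_with_timeout(argv, timeout:)
  reader, writer = IO.pipe
  pid = Process.spawn(*argv, out: writer, err: writer)
  writer.close # parent keeps only the read end
  output = String.new
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  timed_out = false
  loop do
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    if remaining <= 0
      timed_out = true
      Process.kill('KILL', pid)
      break
    end
    # Wait briefly for output so the deadline is rechecked regularly.
    next unless IO.select([reader], nil, nil, [remaining, 0.1].min)

    chunk = reader.read_nonblock(4096, exception: false)
    break if chunk.nil? # EOF: the child closed its output
    output << chunk unless chunk == :wait_readable
  end
  _, status = Process.wait2(pid) # reap the child to avoid a zombie
  reader.close
  { output: output, status: status, timed_out: timed_out }
end
```

One limitation of this sketch: a child that closes its output but keeps running would block in Process.wait2 past the deadline.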

Document Format Parsers

eDiscovery-specific format support:

| Format | File | Purpose |
| --- | --- | --- |
| Concordance DAT | dat.rb | Field separator \x14, quote char þ (thorn) → standard CSV |
| LiveNote LEF | lef_converter.rb | Password-protected ZIP (livenote), extracts PTF/TXT/VID/XML |
| LiveNote PTF | livenote_parser.rb | Binary format parsing (blocks, values, QuickMarks) |
| CMS (Access DB) | cms_converter.rb | JET database via extract_cms_transcript binary |
| MBOX | email/mbox.rb | Standard MBOX with From line splitting |
| CSV | csv_parser.rb | Encoding detection via CharlockHolmes |
| Load files | load_file.rb | Unified CSV/DAT parser with field mapping |
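
As an illustration of the Concordance DAT row, one line can be split on \x14 and stripped of its þ quotes as below. The function name parse_dat_line is hypothetical. The sketch assumes the line is already UTF-8 and that fields contain no embedded newlines or separators; dat.rb may handle both.

```ruby
FIELD_SEPARATOR = "\x14"
QUOTE_CHAR = "\u00FE" # þ (thorn)

# Splits one DAT line into unquoted field values.
def parse_dat_line(line)
  # The -1 limit keeps trailing empty fields instead of dropping them.
  line.chomp.split(FIELD_SEPARATOR, -1).map do |field|
    field.delete_prefix(QUOTE_CHAR).delete_suffix(QUOTE_CHAR)
  end
end
```

In ISO-8859-1 and Windows-1252 the thorn is the single byte 0xFE, which is one reason encoding detection has to run before parsing a file that is not UTF-8.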

Integration Points with NGE

S3 Path Convention (Shared)

Both Legacy and NGE use the same S3 path structure. This is a critical integration point — NGE modules read/write files at paths that Legacy code also accesses:

s3://{bucket}/case_{npcase_id}/attachment/{unique_id}/{filename}
s3://{bucket}/case_{npcase_id}/export/{unique_id}/{filename}

HMAC-SHA1 Authentication (Shared)

The nextpoint_api.rb HMAC-SHA1 pattern is also used by NGE's documentpageservice when calling back to the Rails API. The signing algorithm is: Base64(HMAC-SHA1(secret_key, date + method + path))

AWS Credential Strategy

In production, Legacy uses instance profiles on EC2 (development and test read ~/.aws/credentials and a YAML file, respectively). NGE uses Lambda execution roles and ECS task roles. Both rely on IAM for S3/SQS/SNS access, with no hardcoded credentials.

Patterns to Preserve vs Deprecate

Preserve

  • S3 path convention — consistent across Legacy and NGE
  • HMAC-SHA1 auth — still used by documentpageservice API calls
  • Content hashing for dedup — MD5 in Legacy, SHA256 in NGE (same concept)
  • Encoding detection — CharlockHolmes approach carried forward
  • Connection pooling (ResourcePool) — pattern is sound, implementation differs
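
The content-hashing concept is the same regardless of algorithm, which a sketch with a pluggable digest makes concrete. The function name content_hash and its parameters are hypothetical; expansive_hash_calculator.rb may hash different inputs.

```ruby
require 'digest'

# Streams the file in chunks so large files are never fully loaded into memory.
def content_hash(path, algorithm: Digest::MD5, chunk_size: 1 << 20)
  digest = algorithm.new
  File.open(path, 'rb') do |io|
    while (chunk = io.read(chunk_size))
      digest.update(chunk)
    end
  end
  digest.hexdigest
end
```

Passing Digest::SHA256 as the algorithm gives the NGE-style hash with no other change, though hashes from the two algorithms are not comparable with each other.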

Deprecate

  • XML-based API protocol — replaced by direct DB access in NGE
  • YAML configuration files — replaced by env vars + Secrets Manager
  • File-based case locking — replaced by per-case DB schema naming
  • Local file caching with mkdir locks — Lambda has ephemeral /tmp; ECS uses EFS
  • PendingDelete table for S3 — NGE uses S3 lifecycle policies
  • IAM user per case — legacy pattern for direct S3 access; NGE uses presigned URLs
  • GD2 image library — replaced by Nutrient (PSPDFKit)
  • Ruby 1.8 compatibility patches (fasthttp.rb) — dead code

Key File Locations

| File | Purpose |
| --- | --- |
| nextpoint_api.rb | HMAC-SHA1 API client (1127 lines) |
| nextpoint_s3.rb | S3 operations with caching (933 lines) |
| nextpoint_aws_credentials.rb | AWS credential strategy |
| resource_pool.rb | Thread-safe connection pooling |
| global_npcase_id_handler.rb | Per-EC2 case ID locking |
| locksmith.rb | AES-128-CBC token encryption |
| trapped_shell.rb | Safe process execution |
| exhibit_image_helper.rb | Image format routing (471 lines) |
| nextpoint_nutrient.rb | Nutrient/PSPDFKit integration (25KB) |
| load_file.rb | Load file parser |
| dat.rb | Concordance DAT format |
| lef_converter.rb | LiveNote LEF extraction |