Reference Implementation: eda (Data Mining Backend)

Overview

The EDA (Electronic Data Analysis) backend, internally called Data Mining (DM), is a separate product from the Nextpoint eDiscovery/Litigation platform. It is a serverless data processing platform that ingests documents into S3, extracts metadata (fingerprinting, text extraction, ancestry tracking), runs search scans (including full-text search via dtSearch), generates reports, and exports results.

This is NOT an NGE service module. It has its own architecture, deployment system, and AWS accounts. It predates the architecture patterns defined in this repo.

EDRM Stages: 4 (Collection), 5 (Processing), 7 (Analysis), 8 (Production).

Architecture

eda/
├── src/
│   ├── shared/                          # Shared Ruby library
│   │   ├── handler.rb                   # Lambda handler framework
│   │   ├── trackable/                   # S3-based process tracking with Step Functions
│   │   ├── s3_helper.rb                 # S3 operations
│   │   ├── search/                      # Search/scan logic
│   │   ├── metadata/                    # Document metadata extraction
│   │   ├── parsers/                     # File format parsers
│   │   ├── fingerprinting/              # Document fingerprinting
│   │   ├── athena/                      # Athena query helpers
│   │   ├── nge/                         # NGE extractor integration
│   │   ├── reports/                     # Report generation
│   │   ├── exports/                     # Export logic
│   │   ├── custodians/                  # Custodian assignment
│   │   └── thread_pool.rb               # Multi-threaded deployment
│   ├── lambda/functions/                # ~39 Lambda functions
│   │   ├── api/                         # API Gateway handlers
│   │   │   ├── batch/                   # Batch CRUD
│   │   │   ├── scan/                    # Scan CRUD
│   │   │   ├── report/                  # Report CRUD
│   │   │   ├── export/                  # Export CRUD
│   │   │   ├── etl/                     # ETL operations
│   │   │   ├── custodian-assignment/    # Custodian CRUD
│   │   │   ├── project/                 # Project management
│   │   │   ├── machine-learning/        # ML operations
│   │   │   └── info/                    # System info
│   │   ├── scan-files/                  # File-level scanning (Lambda)
│   │   ├── scan-batches/                # Batch-level scanning (Lambda)
│   │   ├── textract/                    # AWS Textract integration
│   │   └── process-completion/          # Step Functions completion handler
│   └── batch/                           # AWS Batch job containers
│       ├── batch-processor/             # Main batch processing
│       ├── copier/                      # File copying
│       ├── exporter/                    # Export generation
│       ├── reporter/                    # Report generation
│       ├── scan-batch/                  # Batch scanning
│       └── ml-processor/                # Machine learning
├── rakelib/                             # 18 Rake files — full infrastructure management
│   ├── lambda.rake                      # 39+ Lambda function definitions and deployment
│   ├── batch.rake                       # AWS Batch job definitions
│   ├── s3.rake                          # S3 bucket management
│   ├── sqs.rake                         # SQS queue management
│   ├── sns.rake                         # SNS topic management
│   ├── iam.rake                         # IAM role management
│   ├── ec2.rake                         # VPC/networking
│   ├── glue.rake                        # Glue catalog/crawlers
│   ├── step_functions.rake              # Step Functions state machines
│   ├── api_gateway.rake                 # API Gateway management
│   ├── eventbridge.rake                 # EventBridge rules
│   ├── ses.rake                         # SES email
│   └── service_quotas.rake              # Service limit management
├── config.toml.example                  # Environment configuration
├── Gemfile                              # Ruby 3.4, aws-sdk, etc.
└── Rakefile                             # Entry point

Stack

  • Language: Ruby 3.4 (Lambda), Python (Glue ETL), Java (dtSearch Lambda layer)
  • Compute: Lambda (API + event-driven), AWS Batch (heavy processing)
  • Search: dtSearch (full-text, via Java Lambda layer)
  • AI/ML: AWS Textract (OCR), Rekognition (image), Transcribe (audio), Translate
  • Data: S3 (primary), Glue catalog, Athena
  • Orchestration: Step Functions (process tracking), EventBridge
  • API: API Gateway (REST)
  • Infrastructure: Custom Rake-based deployment (NOT CDK/CloudFormation)
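To make the Rake-based deployment model concrete, here is a minimal sketch of what a task in the style of rakelib/lambda.rake might look like. It is not the project's actual code: the DeploySketch module, the FUNCTIONS table, the memory and timeout values, and the "<name>-<env>" naming scheme are all assumptions made for illustration. The only AWS SDK call used is update_function_configuration on Aws::Lambda::Client, and the client is passed in so the helper can be exercised without AWS access.

```ruby
require "rake"

# Hypothetical deployment helper; every name and setting here is illustrative.
module DeploySketch
  # Function names are taken from the tree above; sizes and timeouts are made up.
  FUNCTIONS = {
    "scan-files"   => { memory_size: 1024, timeout: 900 },
    "scan-batches" => { memory_size: 512,  timeout: 300 }
  }.freeze

  # Applies each definition on its own thread. `client` is assumed to respond
  # to update_function_configuration the way Aws::Lambda::Client does.
  def self.apply(client, functions, env)
    threads = functions.map do |name, settings|
      Thread.new do
        # "<name>-<env>" is an assumed naming scheme.
        client.update_function_configuration(function_name: "#{name}-#{env}", **settings)
      end
    end
    threads.each(&:join)
    functions.keys
  end
end

namespace :lambda do
  desc "Apply configuration for every declared function"
  task :deploy, [:env] do |_task, args|
    require "aws-sdk-lambda" # loaded lazily so this file parses without the gem
    DeploySketch.apply(Aws::Lambda::Client.new, DeploySketch::FUNCTIONS, args[:env])
  end
end
```

Under these assumptions the task would be invoked as rake "lambda:deploy[dev]". Keeping the definitions in a plain hash and the SDK call in one helper is one way to get declarative function definitions without CDK or CloudFormation.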

Key Differences from NGE Architecture

Aspect           | EDA/Data Mining                     | NGE Service Modules
-----------------|-------------------------------------|--------------------------------
Product          | Separate Data Mining product        | Nextpoint eDiscovery/Litigation
Architecture     | shared/ + lambda/ + batch/          | Hexagonal core/ + shell/
Infrastructure   | Rake tasks calling AWS SDK          | AWS CDK (TypeScript)
Compute          | Lambda + AWS Batch hybrid           | Lambda or ECS Fargate
Database         | S3-based (no MySQL)                 | Per-case MySQL
Search           | dtSearch (Java native)              | Elasticsearch 7.x
Process tracking | S3-based Trackable + Step Functions | Checkpoint pipeline (DB-based)
Auth             | API Gateway + IAM                   | HMAC-SHA1 / IAM roles
AWS accounts     | Own org (nextpoint-data-mining-*)   | Shared Nextpoint accounts
Multi-tenant     | Per-client AWS accounts             | Per-case MySQL databases
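Because process tracking here is S3-based rather than database-backed, a status change amounts to rewriting a small object. The sketch below illustrates that idea only; it is not the code in src/shared/trackable/ and it leaves out the Step Functions side entirely. The TrackedProcess class, the state list, and the processes/<id>/status.json key layout are assumptions. The store argument is duck-typed to the put_object and get_object keyword shape of Aws::S3::Client, and MemoryStore is an in-memory stand-in so the example runs without AWS.

```ruby
require "json"
require "stringio"
require "time"

# Hypothetical sketch of S3-backed process tracking. `store` is any object
# with put_object(bucket:, key:, body:) and get_object(bucket:, key:),
# the same keyword shape as Aws::S3::Client.
class TrackedProcess
  STATES = %w[pending running succeeded failed].freeze # assumed state set

  def initialize(store:, bucket:, process_id:)
    @store = store
    @bucket = bucket
    @key = "processes/#{process_id}/status.json" # assumed key layout
  end

  # Overwrites the whole status document; S3 has no partial update.
  def transition(state, detail = {})
    raise ArgumentError, "unknown state #{state}" unless STATES.include?(state)

    doc = { state: state, updated_at: Time.now.utc.iso8601, detail: detail }
    @store.put_object(bucket: @bucket, key: @key, body: JSON.generate(doc))
    doc
  end

  # Reads the latest status document back as a Hash with string keys.
  def status
    JSON.parse(@store.get_object(bucket: @bucket, key: @key).body.read)
  end
end

# In-memory stand-in so the sketch runs without AWS credentials.
class MemoryStore
  Response = Struct.new(:body)

  def initialize
    @objects = {}
  end

  def put_object(bucket:, key:, body:)
    @objects[[bucket, key]] = body
  end

  def get_object(bucket:, key:)
    Response.new(StringIO.new(@objects.fetch([bucket, key])))
  end
end

process = TrackedProcess.new(store: MemoryStore.new,
                             bucket: "nextpoint-dm-system-dev",
                             process_id: "batch-1")
process.transition("running")
process.status["state"] # => "running"
```

The trade-off compared with the DB-based checkpoint pipeline is that there are no transactions or queries over many processes; listing or aggregating would have to go through key prefixes or a catalog such as Athena.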

Integration with Nextpoint Platform

  • NGE extractor: Queries Athena/Glue catalog for extraction job statuses (EXTRACTOR_ENV links to extractor environment)
  • S3 buckets: nextpoint-dm-data-{env}, nextpoint-dm-system-{env}, nextpoint-dm-user-{env}
  • Front-end: Coupled to eda-front-end via CORS_ALLOWED_ORIGIN and shared API Gateway
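The integration points above are mostly naming and configuration conventions, which a small helper can capture. The following is a sketch under assumptions: the DmConfig module and its method names are hypothetical, and only the three bucket name patterns and the CORS_ALLOWED_ORIGIN variable come from this page. The response hash follows the API Gateway Lambda proxy shape (statusCode, headers, body).

```ruby
require "json"

# Hypothetical helper: the bucket name patterns and CORS_ALLOWED_ORIGIN come
# from this page; the module and method names are illustrative only.
module DmConfig
  BUCKET_KINDS = %w[data system user].freeze

  # Expands nextpoint-dm-{kind}-{env} for each known bucket kind.
  def self.bucket_names(env)
    BUCKET_KINDS.to_h { |kind| [kind.to_sym, "nextpoint-dm-#{kind}-#{env}"] }
  end

  # Builds an API Gateway proxy-style response that admits only the
  # configured front-end origin. Raises KeyError if the variable is unset.
  def self.cors_response(status, payload, env: ENV)
    {
      statusCode: status,
      headers: {
        "Content-Type" => "application/json",
        "Access-Control-Allow-Origin" => env.fetch("CORS_ALLOWED_ORIGIN")
      },
      body: JSON.generate(payload)
    }
  end
end

DmConfig.bucket_names("dev")[:data] # => "nextpoint-dm-data-dev"
```

Failing fast on a missing CORS_ALLOWED_ORIGIN is a deliberate choice in this sketch: a silently missing header would surface later as an opaque browser error in the front-end.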

Key File Locations

File                            | Purpose
--------------------------------|----------------------------------
src/shared/handler.rb           | Lambda handler framework
src/shared/trackable/process.rb | S3-based process tracking
rakelib/lambda.rake             | All Lambda function definitions
src/lambda/functions/           | ~39 Lambda function source files
src/batch/                      | AWS Batch job containers
config.toml.example             | Environment config template