Reference Implementation: eda (Data Mining Backend)

Overview

The EDA (Electronic Data Analysis) backend, internally called Data Mining (DM), is a separate product from the Nextpoint eDiscovery/Litigation platform. It is a serverless data processing platform that ingests documents into S3, extracts metadata (fingerprinting, text extraction, ancestry tracking), runs search scans (including full-text search via dtSearch), generates reports, and exports results.

This is NOT an NGE service module. It has its own architecture, deployment system, and AWS accounts. It predates the architecture patterns defined in this repo.

EDRM Stages: 4 (Collection), 5 (Processing), 7 (Analysis), 8 (Production).

Architecture

eda/
├── src/
│   ├── shared/                          # Shared Ruby library
│   │   ├── handler.rb                   # Lambda handler framework
│   │   ├── trackable/                   # S3-based process tracking with Step Functions
│   │   ├── s3_helper.rb                 # S3 operations
│   │   ├── search/                      # Search/scan logic
│   │   ├── metadata/                    # Document metadata extraction
│   │   ├── parsers/                     # File format parsers
│   │   ├── fingerprinting/              # Document fingerprinting
│   │   ├── athena/                      # Athena query helpers
│   │   ├── nge/                         # NGE extractor integration
│   │   ├── reports/                     # Report generation
│   │   ├── exports/                     # Export logic
│   │   ├── custodians/                  # Custodian assignment
│   │   └── thread_pool.rb               # Multi-threaded deployment
│   ├── lambda/functions/                # ~39 Lambda functions
│   │   ├── api/                         # API Gateway handlers
│   │   │   ├── batch/                   # Batch CRUD
│   │   │   ├── scan/                    # Scan CRUD
│   │   │   ├── report/                  # Report CRUD
│   │   │   ├── export/                  # Export CRUD
│   │   │   ├── etl/                     # ETL operations
│   │   │   ├── custodian-assignment/    # Custodian CRUD
│   │   │   ├── project/                 # Project management
│   │   │   ├── machine-learning/        # ML operations
│   │   │   └── info/                    # System info
│   │   ├── scan-files/                  # File-level scanning (Lambda)
│   │   ├── scan-batches/                # Batch-level scanning (Lambda)
│   │   ├── textract/                    # AWS Textract integration
│   │   └── process-completion/          # Step Functions completion handler
│   └── batch/                           # AWS Batch job containers
│       ├── batch-processor/             # Main batch processing
│       ├── copier/                      # File copying
│       ├── exporter/                    # Export generation
│       ├── reporter/                    # Report generation
│       ├── scan-batch/                  # Batch scanning
│       └── ml-processor/                # Machine learning
├── rakelib/                             # 18 Rake files — full infrastructure management
│   ├── lambda.rake                      # 39+ Lambda function definitions and deployment
│   ├── batch.rake                       # AWS Batch job definitions
│   ├── s3.rake                          # S3 bucket management
│   ├── sqs.rake                         # SQS queue management
│   ├── sns.rake                         # SNS topic management
│   ├── iam.rake                         # IAM role management
│   ├── ec2.rake                         # VPC/networking
│   ├── glue.rake                        # Glue catalog/crawlers
│   ├── step_functions.rake              # Step Functions state machines
│   ├── api_gateway.rake                 # API Gateway management
│   ├── eventbridge.rake                 # EventBridge rules
│   ├── ses.rake                         # SES email
│   └── service_quotas.rake              # Service limit management
├── config.toml.example                  # Environment configuration
├── Gemfile                              # Ruby 3.4, aws-sdk, etc.
└── Rakefile                             # Entry point
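
To make the role of `src/shared/handler.rb` more concrete, here is a minimal sketch of what a shared Lambda handler wrapper could look like. It is not the actual framework: `with_handler`, `show_batch`, and the error mapping are hypothetical names and choices. Only the `event:`/`context:` handler signature and the API Gateway proxy fields (`body`, `pathParameters`, `statusCode`) are standard.

```ruby
require 'json'

# Hypothetical wrapper in the spirit of src/shared/handler.rb.
# Parses the API Gateway proxy event, yields the decoded body and path
# parameters, and converts the block result (or an error) into a response.
def with_handler(event)
  body = event['body'] ? JSON.parse(event['body']) : {}
  result = yield(body, event['pathParameters'] || {})
  { 'statusCode' => 200, 'body' => JSON.generate(result) }
rescue JSON::ParserError
  { 'statusCode' => 400, 'body' => JSON.generate({ 'error' => 'invalid JSON body' }) }
rescue StandardError => e
  { 'statusCode' => 500, 'body' => JSON.generate({ 'error' => e.message }) }
end

# Example API handler for a batch lookup endpoint (hypothetical).
def show_batch(event:, context:)
  with_handler(event) do |_body, params|
    { 'batch_id' => params.fetch('id') }
  end
end
```

Centralizing parsing and error mapping this way would keep each of the functions under `src/lambda/functions/api/` down to its own logic.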

Stack

Language: Ruby 3.4 (Lambda), Python (Glue ETL), Java (dtSearch Lambda layer)
Compute: Lambda (API + event-driven), AWS Batch (heavy processing)
Search: dtSearch (full-text, via Java Lambda layer)
AI/ML: AWS Textract (OCR), Rekognition (image), Transcribe (audio), Translate
Data: S3 (primary), Glue catalog, Athena
Orchestration: Step Functions (process tracking), EventBridge
API: API Gateway (REST)
Infrastructure: Custom Rake-based deployment (NOT CDK/CloudFormation)
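
As an illustration of the Rake-based approach, the sketch below turns a table of function definitions into the arguments for `Aws::Lambda::Client#create_function`. The table contents, the `name-env` naming scheme, and the `lambda/<name>.zip` key layout are assumptions made for the example; the real definitions live in `rakelib/lambda.rake`.

```ruby
# Hypothetical function table; values are illustrative only.
FUNCTIONS = {
  'scan-files' => { handler: 'handler.run', timeout: 900, memory_size: 1024 }
}.freeze

# Build the keyword arguments for Aws::Lambda::Client#create_function.
def create_function_params(name, env:, role_arn:, code_bucket:)
  defn = FUNCTIONS.fetch(name)
  {
    function_name: "#{name}-#{env}",   # assumed naming scheme
    runtime: 'ruby3.4',
    role: role_arn,
    handler: defn[:handler],
    timeout: defn[:timeout],
    memory_size: defn[:memory_size],
    code: { s3_bucket: code_bucket, s3_key: "lambda/#{name}.zip" }
  }
end

# `client` is anything that responds to create_function. In real use it
# would be Aws::Lambda::Client.new from the aws-sdk-lambda gem.
def deploy_function(client, name, **opts)
  client.create_function(**create_function_params(name, **opts))
end
```

A Rake task would then only need to loop over the table and call `deploy_function` for each entry. Passing the client in also makes the parameter building easy to check without touching AWS.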

Key Differences from NGE Architecture

| Aspect | EDA/Data Mining | NGE Service Modules |
| --- | --- | --- |
| Product | Separate Data Mining product | Nextpoint eDiscovery/Litigation |
| Architecture | `shared/` + `lambda/` + `batch/` | Hexagonal `core/` + `shell/` |
| Infrastructure | Rake tasks calling AWS SDK | AWS CDK (TypeScript) |
| Compute | Lambda + AWS Batch hybrid | Lambda or ECS Fargate |
| Database | S3-based (no MySQL) | Per-case MySQL |
| Search | dtSearch (Java native) | Elasticsearch 7.x |
| Process tracking | S3-based Trackable + Step Functions | Checkpoint pipeline (DB-based) |
| Auth | API Gateway + IAM | HMAC-SHA1 / IAM roles |
| AWS accounts | Own org (`nextpoint-data-mining-*`) | Shared Nextpoint accounts |
| Multi-tenant | Per-client AWS accounts | Per-case MySQL databases |
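
The "S3-based Trackable" row is the least familiar idea in this comparison, so here is a rough sketch of how process state can be kept as one JSON object per process instead of a database row. The class names, key layout, and field names are all hypothetical, and an in-memory store stands in for S3 so the example runs on its own.

```ruby
require 'json'
require 'time'

# In-memory stand-in for an S3 bucket (key => body). Real code would go
# through an aws-sdk-s3 client instead.
class MemoryStore
  def initialize
    @objects = {}
  end

  def put(key, body)
    @objects[key] = body
  end

  def get(key)
    @objects.fetch(key)
  end
end

# Hypothetical tracker: one JSON document per process, overwritten on
# every state change.
class ProcessTracker
  def initialize(store, prefix: 'processes')
    @store = store
    @prefix = prefix
  end

  def start(id, type:)
    write(id, { 'id' => id, 'type' => type, 'status' => 'running',
                'started_at' => Time.now.utc.iso8601 })
  end

  def complete(id, status: 'succeeded')
    write(id, read(id).merge({ 'status' => status,
                               'finished_at' => Time.now.utc.iso8601 }))
  end

  def read(id)
    JSON.parse(@store.get(key_for(id)))
  end

  private

  def key_for(id)
    "#{@prefix}/#{id}.json"
  end

  def write(id, record)
    @store.put(key_for(id), JSON.generate(record))
    record
  end
end
```

In a design like this, a completion handler such as `process-completion/` would presumably be the caller of `complete` once the Step Functions execution finishes. That wiring is an assumption, not something this page documents.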

Integration with Nextpoint Platform

NGE extractor: EDA queries the Athena/Glue catalog for extraction job statuses; EXTRACTOR_ENV identifies the extractor environment it links to
S3 buckets: nextpoint-dm-data-{env}, nextpoint-dm-system-{env}, nextpoint-dm-user-{env}
Front-end: Coupled to eda-front-end via CORS_ALLOWED_ORIGIN and shared API Gateway
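
The naming conventions above can be captured in a couple of small helpers. This is only a sketch: `dm_buckets` and `cors_headers` are hypothetical names, and it assumes CORS_ALLOWED_ORIGIN holds a single origin that is echoed back in the `Access-Control-Allow-Origin` response header.

```ruby
# Derive the three bucket names for an environment from the documented
# nextpoint-dm-{kind}-{env} pattern.
def dm_buckets(env)
  %w[data system user].to_h do |kind|
    [kind.to_sym, "nextpoint-dm-#{kind}-#{env}"]
  end
end

# Build the CORS response header from configuration. Accepts any
# Hash-like object so it can be exercised without touching ENV.
def cors_headers(env_vars = ENV)
  { 'Access-Control-Allow-Origin' => env_vars.fetch('CORS_ALLOWED_ORIGIN') }
end
```

For example, `dm_buckets('staging')[:data]` evaluates to `nextpoint-dm-data-staging`. Using `fetch` makes a missing CORS_ALLOWED_ORIGIN fail loudly instead of silently returning a nil header.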

Key File Locations

| File | Purpose |
| --- | --- |
| `src/shared/handler.rb` | Lambda handler framework |
| `src/shared/trackable/process.rb` | S3-based process tracking |
| `rakelib/lambda.rake` | All Lambda function definitions |
| `src/lambda/functions/` | ~39 Lambda function source files |
| `src/batch/` | AWS Batch job containers |
| `config.toml.example` | Environment config template |
