Reference Implementation: eda (Data Mining Backend)
Overview
The EDA (Electronic Data Analysis) backend, internally called Data Mining (DM), is a separate product from the Nextpoint eDiscovery/Litigation platform. It is a serverless data processing platform that ingests documents into S3, extracts metadata (fingerprinting, text extraction, ancestry tracking), runs search scans (including full-text search via dtSearch), generates reports, and exports results.
This is NOT an NGE service module. It has its own architecture, deployment system, and AWS accounts. It predates the architecture patterns defined in this repo.
EDRM Stages: 4 (Collection), 5 (Processing), 7 (Analysis), 8 (Production).
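The overview lists fingerprinting as one of the processing steps, but this page does not say how `fingerprinting/` computes a fingerprint. As a rough illustration only, the sketch below assumes a fingerprint is a SHA-256 digest of the document bytes and uses it to group exact duplicates. The names `fingerprint` and `duplicate_groups` are hypothetical and do not come from the eda codebase.

```ruby
require "digest"

# Hypothetical helper: content fingerprint as a SHA-256 hex digest.
# Assumption: the real fingerprinting/ module may use a different algorithm.
def fingerprint(bytes)
  Digest::SHA256.hexdigest(bytes)
end

# Group document names that share a fingerprint (exact duplicates).
# `docs` is a Hash of name => raw bytes.
def duplicate_groups(docs)
  docs.group_by { |_name, bytes| fingerprint(bytes) }
      .values
      .map { |pairs| pairs.map(&:first) }
      .select { |names| names.size > 1 }
end
```

A content hash only detects byte-identical files; near-duplicate detection would need a different technique.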
Architecture
eda/
├── src/
│   ├── shared/ # Shared Ruby library
│   │   ├── handler.rb # Lambda handler framework
│   │   ├── trackable/ # S3-based process tracking with Step Functions
│   │   ├── s3_helper.rb # S3 operations
│   │   ├── search/ # Search/scan logic
│   │   ├── metadata/ # Document metadata extraction
│   │   ├── parsers/ # File format parsers
│   │   ├── fingerprinting/ # Document fingerprinting
│   │   ├── athena/ # Athena query helpers
│   │   ├── nge/ # NGE extractor integration
│   │   ├── reports/ # Report generation
│   │   ├── exports/ # Export logic
│   │   ├── custodians/ # Custodian assignment
│   │   └── thread_pool.rb # Multi-threaded deployment
│   ├── lambda/functions/ # ~39 Lambda functions
│   │   ├── api/ # API Gateway handlers
│   │   │   ├── batch/ # Batch CRUD
│   │   │   ├── scan/ # Scan CRUD
│   │   │   ├── report/ # Report CRUD
│   │   │   ├── export/ # Export CRUD
│   │   │   ├── etl/ # ETL operations
│   │   │   ├── custodian-assignment/ # Custodian CRUD
│   │   │   ├── project/ # Project management
│   │   │   ├── machine-learning/ # ML operations
│   │   │   └── info/ # System info
│   │   ├── scan-files/ # File-level scanning (Lambda)
│   │   ├── scan-batches/ # Batch-level scanning (Lambda)
│   │   ├── textract/ # AWS Textract integration
│   │   └── process-completion/ # Step Functions completion handler
│   └── batch/ # AWS Batch job containers
│       ├── batch-processor/ # Main batch processing
│       ├── copier/ # File copying
│       ├── exporter/ # Export generation
│       ├── reporter/ # Report generation
│       ├── scan-batch/ # Batch scanning
│       └── ml-processor/ # Machine learning
├── rakelib/ # 18 Rake files (full infrastructure management)
│   ├── lambda.rake # 39+ Lambda function definitions and deployment
│   ├── batch.rake # AWS Batch job definitions
│   ├── s3.rake # S3 bucket management
│   ├── sqs.rake # SQS queue management
│   ├── sns.rake # SNS topic management
│   ├── iam.rake # IAM role management
│   ├── ec2.rake # VPC/networking
│   ├── glue.rake # Glue catalog/crawlers
│   ├── step_functions.rake # Step Functions state machines
│   ├── api_gateway.rake # API Gateway management
│   ├── eventbridge.rake # EventBridge rules
│   ├── ses.rake # SES email
│   └── service_quotas.rake # Service limit management
├── config.toml.example # Environment configuration
├── Gemfile # Ruby 3.4, aws-sdk, etc.
└── Rakefile # Entry point
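The tree describes `src/shared/handler.rb` as a Lambda handler framework shared by the API functions, but its contents are not shown here. The sketch below is one plausible shape for such a wrapper, not the actual implementation: it parses the request body, runs a block, and returns an API Gateway proxy-style response. The `Handler` module, the `create_batch` function, and the `"*"` fallback for `CORS_ALLOWED_ORIGIN` are assumptions made for illustration. The `event:`/`context:` keyword signature is the one the AWS Lambda Ruby runtime uses.

```ruby
require "json"

# Hypothetical wrapper; the real src/shared/handler.rb is not shown in this document.
module Handler
  # Parse the JSON body, run the block, and wrap the result in an
  # API Gateway proxy-style response. Errors become 400 or 500 responses.
  def self.call(event)
    body = event["body"] ? JSON.parse(event["body"]) : {}
    respond(200, yield(body))
  rescue JSON::ParserError
    respond(400, { "error" => "invalid JSON body" })
  rescue StandardError => e
    respond(500, { "error" => e.message })
  end

  def self.respond(status, payload)
    {
      "statusCode" => status,
      "headers" => {
        "Content-Type" => "application/json",
        # Assumption: fall back to "*" when CORS_ALLOWED_ORIGIN is unset.
        "Access-Control-Allow-Origin" => ENV.fetch("CORS_ALLOWED_ORIGIN", "*")
      },
      "body" => JSON.generate(payload)
    }
  end
end

# Hypothetical Lambda entry point built on the wrapper.
def create_batch(event:, context:)
  Handler.call(event) { |body| { "batch" => body.fetch("name") } }
end
```

Centralizing response formatting this way keeps each function under `lambda/functions/api/` down to its own logic, which is presumably the motivation for a shared handler.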
Stack
- Language: Ruby 3.4 (Lambda), Python (Glue ETL), Java (dtSearch Lambda layer)
- Compute: Lambda (API + event-driven), AWS Batch (heavy processing)
- Search: dtSearch (full-text, via Java Lambda layer)
- AI/ML: AWS Textract (OCR), Rekognition (image), Transcribe (audio), Translate
- Data: S3 (primary), Glue catalog, Athena
- Orchestration: Step Functions (process tracking), EventBridge
- API: API Gateway (REST)
- Infrastructure: Custom Rake-based deployment (NOT CDK/CloudFormation)
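Because infrastructure is managed by Rake tasks that call the AWS SDK directly, each task has to handle idempotency itself rather than relying on CloudFormation. The following is a minimal sketch of that pattern for a queue, assuming a helper that a task in `rakelib/sqs.rake` could call. `ensure_queue`, `FakeSqs`, and the queue name in the usage note are hypothetical. `list_queues(queue_name_prefix:)` and `create_queue(queue_name:)` mirror real `Aws::SQS::Client` methods, and the fake client stands in for it so the sketch runs without credentials.

```ruby
# Hypothetical helper: create the queue only if it does not exist yet.
# `client` is expected to respond like Aws::SQS::Client.
def ensure_queue(client, name)
  existing = client.list_queues(queue_name_prefix: name).queue_urls
  match = existing.find { |url| url.end_with?("/#{name}") }
  return [:exists, match] if match

  [:created, client.create_queue(queue_name: name).queue_url]
end

# In-memory stand-in for Aws::SQS::Client (only the two calls used above).
class FakeSqs
  ListResponse = Struct.new(:queue_urls)
  CreateResponse = Struct.new(:queue_url)

  def initialize
    @urls = []
  end

  def list_queues(queue_name_prefix:)
    ListResponse.new(@urls.select { |u| u.split("/").last.start_with?(queue_name_prefix) })
  end

  def create_queue(queue_name:)
    url = "https://sqs.example.invalid/000000000000/#{queue_name}"
    @urls << url
    CreateResponse.new(url)
  end
end
```

Calling `ensure_queue(FakeSqs.new, "dm-scan")` twice on the same client returns a `:created` result first and an `:exists` result second, which is the behavior a re-runnable deploy task needs.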
Key Differences from NGE Architecture
| Aspect | EDA/Data Mining | NGE Service Modules |
|---|---|---|
| Product | Separate Data Mining product | Nextpoint eDiscovery/Litigation |
| Architecture | `shared/` + `lambda/` + `batch/` | Hexagonal `core/` + `shell/` |
| Infrastructure | Rake tasks calling AWS SDK | AWS CDK (TypeScript) |
| Compute | Lambda + AWS Batch hybrid | Lambda or ECS Fargate |
| Database | S3-based (no MySQL) | Per-case MySQL |
| Search | dtSearch (Java native) | Elasticsearch 7.x |
| Process tracking | S3-based Trackable + Step Functions | Checkpoint pipeline (DB-based) |
| Auth | API Gateway + IAM | HMAC-SHA1 / IAM roles |
| AWS accounts | Own org (`nextpoint-data-mining-*`) | Shared Nextpoint accounts |
| Multi-tenant | Per-client AWS accounts | Per-case MySQL databases |
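The process-tracking row is the least familiar difference for readers coming from NGE: state lives in S3 objects instead of database rows. The sketch below illustrates the general idea only. The key layout `processes/{id}/status.json`, the `ProcessTracker` class, and the bucket name are assumptions, not the layout used by `src/shared/trackable/`. `put_object` and `get_object` mirror the real `Aws::S3::Client` calls, with an in-memory stand-in so the example runs locally.

```ruby
require "json"
require "stringio"
require "time"

# In-memory stand-in for the two Aws::S3::Client calls used below.
class FakeS3
  Response = Struct.new(:body)

  def initialize
    @objects = {}
  end

  def put_object(bucket:, key:, body:)
    @objects[[bucket, key]] = body
  end

  def get_object(bucket:, key:)
    Response.new(StringIO.new(@objects.fetch([bucket, key])))
  end
end

# Hypothetical tracker: one JSON status document per process, overwritten on update.
class ProcessTracker
  def initialize(client, bucket)
    @client = client
    @bucket = bucket
  end

  # Assumed key layout; the real Trackable layout is not documented here.
  def key_for(id)
    "processes/#{id}/status.json"
  end

  def update(id, status)
    doc = { "id" => id, "status" => status, "updated_at" => Time.now.utc.iso8601 }
    @client.put_object(bucket: @bucket, key: key_for(id), body: JSON.generate(doc))
    doc
  end

  def status(id)
    JSON.parse(@client.get_object(bucket: @bucket, key: key_for(id)).body.read)["status"]
  end
end
```

A design like this needs no database connection from Lambda or Batch, at the cost of having no transactions across multiple status objects.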
Integration with Nextpoint Platform
- NGE extractor: Queries Athena/Glue catalog for extraction job statuses (`EXTRACTOR_ENV` links to extractor environment)
- S3 buckets: `nextpoint-dm-data-{env}`, `nextpoint-dm-system-{env}`, `nextpoint-dm-user-{env}`
- Front-end: Coupled to `eda-front-end` via `CORS_ALLOWED_ORIGIN` and shared API Gateway
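The bucket naming pattern above can be captured in a small helper so that environment names are substituted in one place. This is a sketch under stated assumptions: `dm_bucket`, `dm_buckets`, and `BUCKET_KINDS` are hypothetical names, and the set of valid `env` values is not given in this document.

```ruby
# The three bucket kinds listed above.
BUCKET_KINDS = %w[data system user].freeze

# Build one bucket name from the documented pattern nextpoint-dm-{kind}-{env}.
def dm_bucket(kind, env)
  raise ArgumentError, "unknown bucket kind: #{kind}" unless BUCKET_KINDS.include?(kind.to_s)

  "nextpoint-dm-#{kind}-#{env}"
end

# All bucket names for one environment, keyed by kind.
def dm_buckets(env)
  BUCKET_KINDS.to_h { |kind| [kind, dm_bucket(kind, env)] }
end
```

For example, `dm_bucket(:data, "staging")` yields `nextpoint-dm-data-staging`, where `staging` is only a placeholder environment name.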
Key File Locations
| File | Purpose |
|---|---|
| `src/shared/handler.rb` | Lambda handler framework |
| `src/shared/trackable/process.rb` | S3-based process tracking |
| `rakelib/lambda.rake` | All Lambda function definitions |
| `src/lambda/functions/` | ~39 Lambda function source files |
| `src/batch/` | AWS Batch job containers |
| `config.toml.example` | Environment config template |