Reference Implementation: unzipservice

Archive extraction service for the Nextpoint eDiscovery platform. Extracts ZIP, RAR, 7Z, TAR, GZIP, and BZIP2 archives as an ECS Fargate task, uploading extracted contents to S3 with status reporting and error notifications.

Architecture Overview

unzipservice is a standalone Java ECS Fargate task — not an event-driven SNS/SQS service module. It receives a single extraction job via environment variables, processes one archive file, uploads extracted contents to S3, and reports status. This is the simplest module in the NGE platform: pure data-in → data-out with S3 and SES as the only AWS dependencies.

Why ECS Instead of Lambda?

| Concern | Lambda | ECS Fargate (chosen) |
|---|---|---|
| Archive size | 512 MB /tmp | Configurable ephemeral (up to 200 GB) |
| Execution time | 15 min max | No timeout limit |
| Native tools | Lambda layers only | Full OS (unar for RAR) |
| Memory | 10 GB max | Up to 30 GB |
| Nested extraction | Would require chaining | Single process handles all |

Architecture Tree

unzipservice/
├── src/main/java/com/nextpoint/unzipservice/
│   ├── UnzipserviceApplication.java          # Entry point (Spring Boot main)
│   ├── service/
│   │   ├── ArchiveHandler.java               # Orchestrator: download → detect → extract → upload
│   │   ├── UnArchiverService.java            # Interface for format extractors
│   │   ├── MailingService.java               # Error notifications via AWS SES
│   │   ├── ReportingService.java             # Status JSON updates to S3
│   │   └── implementation/
│   │       ├── ZipManagerServiceImpl.java    # ZIP extraction (zip4j)
│   │       ├── RarManagerServiceImpl.java    # RAR extraction (unar CLI)
│   │       ├── S7ZipManagerServiceImpl.java  # 7Z extraction (commons-compress)
│   │       ├── TarManagerServiceImpl.java    # TAR/TAR.GZ/TAR.BZ2 (commons-compress)
│   │       ├── GzipManagerServiceImpl.java   # GZIP single-file (commons-compress)
│   │       └── BzipManagerServiceImpl.java   # BZIP2 single-file (commons-compress)
│   └── utils/
│       └── S3Connector.java                  # S3 download/upload via TransferManager
├── src/test/java/.../service/
│   ├── ArchiveHandlerTest.java               # Commented out (build failures)
│   ├── MailingServiceTest.java               # Mock SES client
│   ├── ReportingServiceTest.java             # Mock S3 + file I/O
│   └── implementation/
│       ├── ZipManagerServiceImplTest.java    # Real test archives
│       ├── RarManagerServiceImplTest.java    # Command builder tests only
│       ├── TarManagerServiceImplTest.java    # TempDir fixtures
│       └── S7ZipManagerServiceImplTest.java  # Real test archives
├── pom.xml                                   # Maven: Java 21, Spring Boot 3.2.4
├── Dockerfile                                # amazoncorretto-21 + unar + locale
└── buildScript.sh                            # Maven → Docker → ECR push

Language Stack

| Component | Language | Runtime | Purpose |
|---|---|---|---|
| Archive Service | Java 21 | ECS Fargate (Spring Boot) | Archive extraction orchestration |
| Lambda | — | — | Not used: no API/trigger layer (direct ECS RunTask) |
| CDK | — | — | Not used: no infrastructure-as-code in this repo |

Request Flow

Upstream Service (documentloader or manual trigger)
ECS RunTask (environment variable overrides)
  ├── unzipFileList = JSON array of archive descriptors
  └── userDetails = JSON with notification/reporting config
UnzipserviceApplication.main()
ArchiveHandler.handleArchive()
  ├── 1. Parse env vars → extract first file descriptor
  ├── 2. Download archive from S3 → local filesystem
  ├── 3. Detect MIME type (Apache Tika)
  ├── 4. Route to format-specific extractor
  ├── 5. Extract to local destination directory
  ├── 6. Delete original archive from local
  ├── 7. Upload all extracted files to S3 (recursive)
  ├── 8. Update status report JSON on S3
  ├── 9. Send error email (if applicable) via SES
  └── 10. Cleanup local directory + close S3 connections

Supported Archive Formats

| Format | MIME Type | Library | Password Support | Notes |
|---|---|---|---|---|
| ZIP | application/zip | zip4j 2.11.5 | Yes | Custom retry logic (MAX_ATTEMPTS=5 for null LocalFileHeader) |
| RAR | application/x-rar-compressed | unar CLI | Yes | Shell exec: Runtime.getRuntime().exec() |
| 7Z | application/x-7z-compressed | commons-compress 1.25.0 | Yes | SevenZFile with password char array |
| TAR | application/x-tar | commons-compress | No | Detects .tar.gz and .tar.bz2 by extension |
| GZIP | application/gzip | commons-compress | No | Single-file decompression only |
| BZIP2 | application/x-bzip2 | commons-compress | No | Single-file decompression only |

File type detection: Apache Tika (tika-core 2.8.0) with special-case override for RAR files (Tika sometimes misdetects RAR).
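The detect-then-override routing described above can be sketched without the Tika dependency; class names match the implementation tree, but the method and map names here are illustrative, not the actual unzipservice identifiers:

```java
import java.util.Locale;
import java.util.Map;

public class FormatRouter {

    // MIME types taken from the format table above.
    private static final Map<String, String> EXTRACTORS = Map.of(
        "application/zip",              "ZipManagerServiceImpl",
        "application/x-rar-compressed", "RarManagerServiceImpl",
        "application/x-7z-compressed",  "S7ZipManagerServiceImpl",
        "application/x-tar",            "TarManagerServiceImpl",
        "application/gzip",             "GzipManagerServiceImpl",
        "application/x-bzip2",          "BzipManagerServiceImpl");

    /** Picks an extractor, trusting the .rar extension over the detector. */
    public static String routeExtractor(String detectedMime, String fileName) {
        // Special case: Tika sometimes misdetects RAR, so the extension wins.
        if (fileName.toLowerCase(Locale.ROOT).endsWith(".rar")) {
            return EXTRACTORS.get("application/x-rar-compressed");
        }
        String extractor = EXTRACTORS.get(detectedMime);
        if (extractor == null) {
            throw new IllegalArgumentException("Unsupported MIME type: " + detectedMime);
        }
        return extractor;
    }
}
```

In the real service the `detectedMime` argument would come from `Tika.detect()` on the downloaded file.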

Excluded entries: __MACOSX/ directories and .DS_Store files filtered during extraction.
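A minimal sketch of that exclusion filter (the method name is hypothetical; the real check lives inside each extractor):

```java
public class EntryFilter {

    /** True for macOS metadata entries that should be skipped during extraction. */
    public static boolean isExcluded(String entryName) {
        return entryName.startsWith("__MACOSX/")
            || entryName.contains("/__MACOSX/")
            || entryName.endsWith(".DS_Store");
    }
}
```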

Error Classification

String-based error classification (not exception hierarchy):

| Status | Detection | User Message |
|---|---|---|
| complete | Extraction succeeded | (no email) |
| no_password | Exception contains "password" + empty password field | "A password is required, but was not provided." |
| bad_password | Exception contains "password" + password provided | "The password entered is incorrect." |
| partial_extracted | "Unexpected end of input stream" | "Some files might not be extracted properly." |
| error | Any other exception | Generic error message |

Password detection: Regex (?i).*\bpassword\b.* applied to exception message string — fragile but covers zip4j, unar, and commons-compress messages.
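The classification rules in the table reduce to a small function; this is a hedged reconstruction from the table and regex above, not the actual class:

```java
import java.util.regex.Pattern;

public class ErrorClassifier {

    // The same fragile message-matching regex the service uses.
    private static final Pattern PASSWORD = Pattern.compile("(?i).*\\bpassword\\b.*");

    /** Maps an extraction exception message + supplied password to a status string. */
    public static String classify(String exceptionMessage, String password) {
        if (exceptionMessage == null) {
            return "error";
        }
        if (PASSWORD.matcher(exceptionMessage).matches()) {
            // Same symptom, two causes: distinguish by whether a password was given.
            return (password == null || password.isEmpty()) ? "no_password" : "bad_password";
        }
        if (exceptionMessage.contains("Unexpected end of input stream")) {
            return "partial_extracted";
        }
        return "error";
    }
}
```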

Status Reporting

JSON file updated on S3 at userDetails.reportKey:

{
  "status": "complete|error|no_password|bad_password|partial_extracted",
  "percentDone": 100
}

Status is updated at two points:

  1. On success: complete with percentDone: 100
  2. On error: the error status with percentDone: 0
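Rendering that two-field document is trivial; a sketch of the JSON the service writes (method name hypothetical — the real code uploads this string to `userDetails.reportKey` via S3Connector):

```java
public class StatusReport {

    /** Builds the status JSON written to the report key on S3. */
    public static String toJson(String status, int percentDone) {
        return String.format("{\"status\": \"%s\", \"percentDone\": %d}", status, percentDone);
    }
}
```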

Security Patterns

Path Traversal Protection (Zip Slip)

Every extractor validates extracted file paths against the destination directory:

private File newFile(File destDir, String entryName) throws IOException {
    File destFile = new File(destDir, entryName);
    String destDirPath = destDir.getCanonicalPath();
    String destFilePath = destFile.getCanonicalPath();
    if (!destFilePath.startsWith(destDirPath + File.separator)) {
        throw new IOException("Entry is outside target dir: " + entryName);
    }
    return destFile;
}

This protects against malicious archive entries like ../../etc/passwd.

Password Handling

  • Passwords are Base64-encoded in the environment variable
  • Decoded at runtime before passing to extractors
  • Not logged in structured output (though all env vars are dumped at startup — see review below)
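The decode step is standard `java.util.Base64`; a minimal sketch (method name illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PasswordDecoder {

    /** Decodes the Base64-encoded password from the environment variable. */
    public static String decode(String encoded) {
        if (encoded == null || encoded.isEmpty()) {
            return "";  // empty password → "no_password" classification downstream
        }
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }
}
```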

S3 Operations

S3Connector wraps AWS SDK v1 TransferManager:

| Operation | Method | Behavior |
|---|---|---|
| Download | objectPuller() | TransferManager.download() + waitForCompletion() (blocking) |
| Upload file | objectPusher(isFile=true) | TransferManager.upload() (single file) |
| Upload directory | objectPusher(isFile=false) | TransferManager.uploadDirectory() (recursive) |

TransferManager lifecycle: Created once per ArchiveHandler invocation, shutdownNow() in finally block.
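That create-once, shutdown-in-finally lifecycle maps naturally onto AutoCloseable. The class below is a generic stand-in (not the real S3Connector, which wraps the AWS SDK v1 TransferManager) showing the pattern:

```java
public class TransferSession implements AutoCloseable {

    private boolean open = true;

    public boolean isOpen() {
        return open;
    }

    public void download(String key) {
        if (!open) {
            throw new IllegalStateException("session closed");
        }
        // real code: transferManager.download(bucket, key, file).waitForCompletion();
    }

    @Override
    public void close() {
        open = false; // real code: transferManager.shutdownNow();
    }
}
```

With try-with-resources (`try (TransferSession s = new TransferSession()) { ... }`) the explicit finally block in ArchiveHandler would become unnecessary.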

Pattern Mapping

| Architecture Pattern | unzipservice Implementation | Notes |
|---|---|---|
| SNS Event Publishing | Not used | No events published or consumed |
| SQS Handler | Not used | Triggered via ECS env vars |
| Exception Hierarchy | Not used | String-based error classification |
| Hexagonal core/shell | Not followed | Monolithic ArchiveHandler |
| Multi-tenancy | Case ID in strings only | No database access |
| Idempotent Handlers | Not implemented | No dedup; same task can re-extract |
| Checkpoint Pipeline | Not used | Single-shot execution |
| Config Management | Environment variables (JSON) | No config files beyond application.properties |
| CDK Infrastructure | Not in this repo | Deployed via buildScript.sh + manual ECR push |

Divergences from Standard Architecture

| Standard Pattern | unzipservice Divergence | Reason |
|---|---|---|
| Python 3.10+ | Java 21 (Spring Boot 3.2.4) | Mature archive libraries (zip4j, commons-compress) |
| Event-driven SNS/SQS | ECS RunTask with env var overrides | One-shot job, no message flow |
| Lambda handlers | ECS Fargate task | Archive size/time exceeds Lambda limits |
| Hexagonal core/shell | Monolithic orchestrator class | Small scope; single responsibility |
| Exception hierarchy | String pattern matching on error messages | No SQS requeue semantics to control |
| Multi-tenant MySQL | No database | Status via S3 JSON files |
| CDK infrastructure | Manual Docker build + ECR push | No IaC in repo |
| Structured JSON logging | Log4j2 console pattern | No cross-service correlation fields |
| AWS SDK v2 | AWS SDK v1 (1.11.1000) | Legacy; should migrate |

Pre-Deployment Architectural Review

P0 — Blockers

1. Environment Variable Dump at Startup

UnzipserviceApplication.main() dumps ALL environment variables to stdout, including Base64-encoded passwords. In ECS Fargate, stdout goes to CloudWatch Logs, exposing credentials.

Fix: Remove env var dump or redact sensitive fields.
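One possible shape for the redaction fix — mask any variable whose name looks sensitive before logging (the key-name patterns here are assumptions; since the passwords actually live inside the unzipFileList/userDetails JSON values, value-level redaction would also be needed):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

public class EnvRedactor {

    private static final Pattern SENSITIVE =
        Pattern.compile("(?i)(password|secret|token|credential|key)");

    /** Returns a copy of the environment map with sensitive values masked. */
    public static Map<String, String> redact(Map<String, String> env) {
        Map<String, String> safe = new TreeMap<>();
        env.forEach((name, value) ->
            safe.put(name, SENSITIVE.matcher(name).find() ? "***REDACTED***" : value));
        return safe;
    }
}
```

Logging `EnvRedactor.redact(System.getenv())` instead of the raw map keeps the startup diagnostics without shipping credentials to CloudWatch.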

2. Only First File Processed

unzipFileList.getFirst() — array input but only first element is ever processed. If caller sends multiple archives, the rest are silently ignored.

Fix: Either iterate over all files or validate input is single-element.

3. Hardcoded Local Path

Extraction directory is src/main/unziproom/ — a relative path that only works during development. In ECS Fargate, this resolves to the container's working directory, which may have limited space.

Fix: Use /tmp or a configurable ephemeral storage path.
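A sketch of that fix — the env var name `EXTRACT_DIR` and the `/tmp/unziproom` default are hypothetical choices, not existing configuration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExtractionDir {

    /** Resolves the extraction root from configuration, defaulting under /tmp. */
    public static Path resolve(String configured) {
        return Path.of(configured == null || configured.isBlank()
            ? "/tmp/unziproom"
            : configured);
    }

    /** Resolves and creates the directory (no-op if it already exists). */
    public static Path prepare(String configured) throws IOException {
        return Files.createDirectories(resolve(configured));
    }
}
```

Called as `ExtractionDir.prepare(System.getenv("EXTRACT_DIR"))`, this lands extraction on Fargate ephemeral storage regardless of the container's working directory.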

P1 — High Priority

4. No CDK Infrastructure

No infrastructure-as-code. Deployment is via buildScript.sh (manual Docker build + ECR push). No reproducible stack definition.

Fix: Add CDK stack defining ECS task definition, IAM roles, ECR repository, and CloudWatch log group.

5. AWS SDK v1 (Legacy)

Uses AWS SDK 1.11.1000 — end-of-support. Missing: connection pooling, async clients, credential provider chain improvements.

Fix: Migrate to AWS SDK v2 (software.amazon.awssdk).

6. Hardcoded SES Region

MailingService uses Regions.US_EAST_1 hardcoded — won't work for ca-central-1 deployments if SES is region-specific.

Fix: Read region from environment or use default provider chain.

7. Shell Command Injection Risk (RAR)

RAR extraction uses Runtime.getRuntime().exec() with constructed command string. If file paths contain special characters, this could be exploited.

Fix: Use ProcessBuilder with argument array (already partially done but verify escaping).
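A sketch of the safe invocation — building the command as an argument list means paths and passwords are never interpreted by a shell (flags per unar's documented options: -f force overwrite, -o output directory, -p password; exact flags used by the service are an assumption):

```java
import java.util.ArrayList;
import java.util.List;

public class RarCommand {

    /** Builds the unar invocation as an argument array for ProcessBuilder. */
    public static List<String> build(String archivePath, String destDir, String password) {
        List<String> cmd = new ArrayList<>(List.of("unar", "-f", "-o", destDir));
        if (password != null && !password.isEmpty()) {
            cmd.add("-p");
            cmd.add(password);
        }
        cmd.add(archivePath);
        return cmd;
    }
}
```

Usage: `new ProcessBuilder(RarCommand.build(path, dest, pwd)).start()` — each element is passed to execve verbatim, so `"; rm -rf /"` in a filename stays a filename.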

8. ArchiveHandler Tests Commented Out

Main orchestration test is disabled with "TODO: failing while building image." No integration test coverage for the primary flow.

9. Spring Boot Unused

Spring Boot 3.2.4 is included as parent POM but not actually used — no @SpringBootApplication scan, no dependency injection, no web server. Adds ~15 MB to the container image for no benefit.

Fix: Remove Spring Boot dependency; use plain Java main class.

P2 — Medium Priority

10. No Retry or Recovery

Single-shot execution with no retry. If S3 upload fails mid-way, extracted files may be partially uploaded with no way to resume.

11. No Nested Archive Support

If a ZIP contains another ZIP, the inner archive is uploaded as-is (not recursively extracted). May be intentional for eDiscovery (preserving original structure).

12. No File Size Limits

No guard against extracting a zip bomb (small archive → massive expansion). Could exhaust ECS ephemeral storage.
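A minimal guard of the kind that could close this gap — cap total extracted bytes and the expansion ratio against the archive's on-disk size (the limits and class are illustrative, not existing code):

```java
public class ExpansionGuard {

    private final long archiveSize;
    private final long maxTotalBytes;
    private final long maxRatio;
    private long written;

    public ExpansionGuard(long archiveSize, long maxTotalBytes, long maxRatio) {
        this.archiveSize = archiveSize;
        this.maxTotalBytes = maxTotalBytes;
        this.maxRatio = maxRatio;
    }

    /** Call with each chunk written during extraction; throws once limits are exceeded. */
    public void account(long bytes) {
        written += bytes;
        if (written > maxTotalBytes) {
            throw new IllegalStateException("extraction exceeds byte cap: " + written);
        }
        if (archiveSize > 0 && written / archiveSize > maxRatio) {
            throw new IllegalStateException("suspicious compression ratio for archive");
        }
    }
}
```

Each extractor's copy loop would call `account(bytesRead)`; the thrown exception would then surface through the existing error-classification path.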

13. Blocking S3 Operations

TransferManager download/upload are synchronous blocking calls. For large archives with many files, this could be optimized with async uploads.

Architecture Compliance Summary

| Requirement | Status | Details |
|---|---|---|
| Event-driven via SNS | ❌ N/A | ECS task, env var triggered — by design |
| Idempotent handlers | ❌ N/A | One-shot ECS task, not event handler |
| Exception hierarchy | ❌ FAIL | String matching instead of typed exceptions |
| Multi-tenant DB | ❌ N/A | No database interaction |
| No secrets in code | ⚠️ WARN | Passwords in env vars dumped to CloudWatch |
| IAM least privilege | ⚠️ UNKNOWN | No CDK/IAM definitions in repo |
| Testing | ⚠️ WARN | Core tests commented out; format tests exist |
| CDK infrastructure | ❌ FAIL | No IaC — manual build script only |
| Structured logging | ❌ FAIL | Log4j2 pattern format, no JSON structure |

Lessons Learned

  1. ECS Fargate is the right choice for archive extraction — Lambda's 512 MB /tmp and 15-minute timeout are insufficient for production archive workloads (100 GB+ archives with tens of thousands of files).

  2. Path traversal protection is essential — zip-slip attacks are a real threat when extracting user-uploaded archives in legal proceedings. Every extractor must validate output paths.

  3. Format-specific libraries outperform generic solutions — zip4j handles ZIP passwords and encryption better than commons-compress; unar handles RAR better than any Java library.

  4. Status reporting via S3 JSON is simple but limited — works for basic pass/fail, but lacks granular progress tracking (e.g., "extracted 500 of 10,000 files"). For long-running extractions, consider DynamoDB or EventBridge for real-time status updates.

  5. Spring Boot overhead is unjustified for ECS tasks — a plain Java main class with explicit dependency wiring would reduce image size and startup time without losing any functionality.
