Reference Implementation: unzipservice¶
Archive extraction service for the Nextpoint eDiscovery platform. Extracts ZIP, RAR, 7Z, TAR, GZIP, and BZIP2 archives as an ECS Fargate task, uploading extracted contents to S3 with status reporting and error notifications.
Architecture Overview¶
unzipservice is a standalone Java ECS Fargate task — not an event-driven SNS/SQS service module. It receives a single extraction job via environment variables, processes one archive file, uploads extracted contents to S3, and reports status. This is the simplest module in the NGE platform: pure data-in → data-out with S3 and SES as the only AWS dependencies.
Why ECS Instead of Lambda?¶
| Concern | Lambda | ECS Fargate (chosen) |
|---|---|---|
| Archive size | 512 MB /tmp default (10 GB max) | Configurable ephemeral (up to 200 GB) |
| Execution time | 15 min max | No timeout limit |
| Native tools | Lambda layers only | Full OS (unar for RAR) |
| Memory | 10 GB max | Up to 30 GB |
| Nested extraction | Would require chaining | Single process handles all |
Architecture Tree¶
```
unzipservice/
├── src/main/java/com/nextpoint/unzipservice/
│   ├── UnzipserviceApplication.java        # Entry point (Spring Boot main)
│   ├── service/
│   │   ├── ArchiveHandler.java             # Orchestrator: download → detect → extract → upload
│   │   ├── UnArchiverService.java          # Interface for format extractors
│   │   ├── MailingService.java             # Error notifications via AWS SES
│   │   ├── ReportingService.java           # Status JSON updates to S3
│   │   └── implementation/
│   │       ├── ZipManagerServiceImpl.java  # ZIP extraction (zip4j)
│   │       ├── RarManagerServiceImpl.java  # RAR extraction (unar CLI)
│   │       ├── S7ZipManagerServiceImpl.java # 7Z extraction (commons-compress)
│   │       ├── TarManagerServiceImpl.java  # TAR/TAR.GZ/TAR.BZ2 (commons-compress)
│   │       ├── GzipManagerServiceImpl.java # GZIP single-file (commons-compress)
│   │       └── BzipManagerServiceImpl.java # BZIP2 single-file (commons-compress)
│   └── utils/
│       └── S3Connector.java                # S3 download/upload via TransferManager
├── src/test/java/.../service/
│   ├── ArchiveHandlerTest.java             # Commented out (build failures)
│   ├── MailingServiceTest.java             # Mock SES client
│   ├── ReportingServiceTest.java           # Mock S3 + file I/O
│   └── implementation/
│       ├── ZipManagerServiceImplTest.java  # Real test archives
│       ├── RarManagerServiceImplTest.java  # Command builder tests only
│       ├── TarManagerServiceImplTest.java  # TempDir fixtures
│       └── S7ZipManagerServiceImplTest.java # Real test archives
├── pom.xml                                 # Maven: Java 21, Spring Boot 3.2.4
├── Dockerfile                              # amazoncorretto-21 + unar + locale
└── buildScript.sh                          # Maven → Docker → ECR push
```
Language Stack¶
| Component | Language | Runtime | Purpose |
|---|---|---|---|
| Archive Service | Java 21 | ECS Fargate (Spring Boot) | Archive extraction orchestration |
| No Lambda | — | — | No API/trigger layer (direct ECS RunTask) |
| No CDK | — | — | No infrastructure-as-code in this repo |
Request Flow¶
Upstream Service (documentloader or manual trigger)
│
▼
ECS RunTask (environment variable overrides)
├── unzipFileList = JSON array of archive descriptors
└── userDetails = JSON with notification/reporting config
│
▼
UnzipserviceApplication.main()
│
▼
ArchiveHandler.handleArchive()
├── 1. Parse env vars → extract first file descriptor
├── 2. Download archive from S3 → local filesystem
├── 3. Detect MIME type (Apache Tika)
├── 4. Route to format-specific extractor
├── 5. Extract to local destination directory
├── 6. Delete original archive from local
├── 7. Upload all extracted files to S3 (recursive)
├── 8. Update status report JSON on S3
├── 9. Send error email (if applicable) via SES
└── 10. Cleanup local directory + close S3 connections
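The ten-step flow above can be sketched as a try/catch/finally skeleton: the happy path runs steps 1–8, an error diverts to steps 8–9 in the catch, and cleanup (step 10) always runs in the finally block. All class, method, and step names here are illustrative, not the real implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of ArchiveHandler.handleArchive() control flow.
public class HandleArchiveSketch {
    public static List<String> run(boolean failDuringExtract) {
        List<String> steps = new ArrayList<>();
        try {
            steps.add("parse-env");            // 1. parse unzipFileList / userDetails
            steps.add("download-from-s3");     // 2. archive → local filesystem
            steps.add("detect-mime");          // 3. Apache Tika
            steps.add("route-extractor");      // 4. format-specific extractor
            if (failDuringExtract) throw new RuntimeException("bad password");
            steps.add("extract");              // 5.
            steps.add("delete-local-archive"); // 6.
            steps.add("upload-to-s3");         // 7. recursive upload
            steps.add("report-complete");      // 8. status JSON on S3
        } catch (RuntimeException e) {
            steps.add("report-error");         // 8'. status JSON on S3
            steps.add("send-error-email");     // 9. SES (if applicable)
        } finally {
            steps.add("cleanup");              // 10. local dir + S3 connections
        }
        return steps;
    }
}
```

The key property is that cleanup is unconditional: whether extraction succeeds or fails, the local directory is removed and S3 connections are closed.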
Supported Archive Formats¶
| Format | MIME Type | Library | Password Support | Notes |
|---|---|---|---|---|
| ZIP | application/zip | zip4j 2.11.5 | ✅ | Custom retry logic (MAX_ATTEMPTS=5 for null LocalFileHeader) |
| RAR | application/x-rar-compressed | unar CLI | ✅ | Shell exec: Runtime.getRuntime().exec() |
| 7Z | application/x-7z-compressed | commons-compress 1.25.0 | ✅ | SevenZFile with password char array |
| TAR | application/x-tar | commons-compress | ❌ | Detects .tar.gz and .tar.bz2 by extension |
| GZIP | application/gzip | commons-compress | ❌ | Single-file decompression only |
| BZIP2 | application/x-bzip2 | commons-compress | ❌ | Single-file decompression only |
File type detection: Apache Tika (tika-core 2.8.0) with special-case
override for RAR files (Tika sometimes misdetects RAR).
Excluded entries: __MACOSX/ directories and .DS_Store files filtered
during extraction.
Error Classification¶
String-based error classification (not exception hierarchy):
| Status | Detection | User Message |
|---|---|---|
| complete | Extraction succeeded | (no email) |
| no_password | Exception contains "password" + empty password field | "A password is required, but was not provided." |
| bad_password | Exception contains "password" + password provided | "The password entered is incorrect." |
| partial_extracted | "Unexpected end of input stream" | "Some files might not be extracted properly." |
| error | Any other exception | Generic error message |
Password detection: Regex (?i).*\bpassword\b.* applied to exception
message string — fragile but covers zip4j, unar, and commons-compress messages.
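Putting the table and the regex together, the classification logic might look like this sketch. The regex and status strings come from this document; the class and method names are illustrative.

```java
import java.util.regex.Pattern;

// Illustrative sketch of the string-based error classification.
public class ErrorClassifier {
    private static final Pattern PASSWORD = Pattern.compile("(?i).*\\bpassword\\b.*");

    public static String classify(String exceptionMessage, String providedPassword) {
        if (exceptionMessage == null) return "error";
        if (PASSWORD.matcher(exceptionMessage).matches()) {
            // Same symptom, two causes: distinguish by whether a password was supplied.
            return (providedPassword == null || providedPassword.isEmpty())
                    ? "no_password" : "bad_password";
        }
        if (exceptionMessage.contains("Unexpected end of input stream")) {
            return "partial_extracted";
        }
        return "error";
    }
}
```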
Status Reporting¶
A status report JSON file is written to S3 at the key given by userDetails.reportKey. The status is updated at two points:
1. On success: complete with percentDone: 100
2. On error: the error type with percentDone: 0
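A minimal sketch of what such a report might contain. Only the status value and percentDone are documented here; the key names themselves, and any other fields the real report carries, are assumptions.

```json
{
  "status": "complete",
  "percentDone": 100
}
```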
Security Patterns¶
Path Traversal Protection (Zip Slip)¶
Every extractor validates extracted file paths against the destination directory:
```java
private File newFile(File destDir, String entryName) throws IOException {
    File destFile = new File(destDir, entryName);
    String destDirPath = destDir.getCanonicalPath();
    String destFilePath = destFile.getCanonicalPath();
    if (!destFilePath.startsWith(destDirPath + File.separator)) {
        throw new IOException("Entry is outside target dir: " + entryName);
    }
    return destFile;
}
```
This protects against malicious archive entries like ../../etc/passwd.
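To see the guard in action, the check can be exercised directly: a traversal entry canonicalizes outside the destination and is rejected, while a normal entry resolves inside it. The wrapper class and the /tmp/unziproom path below are illustrative, not from the service.

```java
import java.io.File;
import java.io.IOException;

// Demonstration of the zip-slip guard shown above.
public class ZipSlipGuardDemo {
    static File newFile(File destDir, String entryName) throws IOException {
        File destFile = new File(destDir, entryName);
        String destDirPath = destDir.getCanonicalPath();
        String destFilePath = destFile.getCanonicalPath();
        if (!destFilePath.startsWith(destDirPath + File.separator)) {
            throw new IOException("Entry is outside target dir: " + entryName);
        }
        return destFile;
    }

    // Returns true when the guard rejects the entry name.
    public static boolean isRejected(String entryName) {
        try {
            newFile(new File("/tmp/unziproom"), entryName);
            return false;
        } catch (IOException e) {
            return true;
        }
    }
}
```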
Password Handling¶
- Passwords are Base64-encoded in the environment variable
- Decoded at runtime before passing to extractors
- Not logged in structured output (though all env vars are dumped at startup — see review below)
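The decode step can be sketched as below. Returning a char[] matches what extractors such as commons-compress's SevenZFile accept, and lets the caller clear the array after use; the class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative sketch: decode the Base64-encoded password from the env var.
public class PasswordDecoder {
    public static char[] decode(String base64Password) {
        if (base64Password == null || base64Password.isEmpty()) return new char[0];
        byte[] raw = Base64.getDecoder().decode(base64Password);
        // char[] rather than String so the value can be zeroed after extraction.
        return new String(raw, StandardCharsets.UTF_8).toCharArray();
    }
}
```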
S3 Operations¶
S3Connector wraps AWS SDK v1 TransferManager:
| Operation | Method | Behavior |
|---|---|---|
| Download | objectPuller() | TransferManager.download() + waitForCompletion() (blocking) |
| Upload file | objectPusher(isFile=true) | TransferManager.upload() (single file) |
| Upload directory | objectPusher(isFile=false) | TransferManager.uploadDirectory() (recursive) |
TransferManager lifecycle: Created once per ArchiveHandler invocation,
shutdownNow() in finally block.
Pattern Mapping¶
| Architecture Pattern | unzipservice Implementation | Notes |
|---|---|---|
| SNS Event Publishing | Not used | No events published or consumed |
| SQS Handler | Not used | Triggered via ECS env vars |
| Exception Hierarchy | Not used | String-based error classification |
| Hexagonal core/shell | Not followed | Monolithic ArchiveHandler |
| Multi-tenancy | Case ID in strings only | No database access |
| Idempotent Handlers | Not implemented | No dedup; same task can re-extract |
| Checkpoint Pipeline | Not used | Single-shot execution |
| Config Management | Environment variables (JSON) | No config files beyond application.properties |
| CDK Infrastructure | Not in this repo | Deployed via buildScript.sh + manual ECR push |
Divergences from Standard Architecture¶
| Standard Pattern | unzipservice Divergence | Reason |
|---|---|---|
| Python 3.10+ | Java 21 (Spring Boot 3.2.4) | Mature archive libraries (zip4j, commons-compress) |
| Event-driven SNS/SQS | ECS RunTask with env var overrides | One-shot job, no message flow |
| Lambda handlers | ECS Fargate task | Archive size/time exceeds Lambda limits |
| Hexagonal core/shell | Monolithic orchestrator class | Small scope; single responsibility |
| Exception hierarchy | String pattern matching on error messages | No SQS requeue semantics to control |
| Multi-tenant MySQL | No database | Status via S3 JSON files |
| CDK infrastructure | Manual Docker build + ECR push | No IaC in repo |
| Structured JSON logging | Log4j2 console pattern | No cross-service correlation fields |
| AWS SDK v2 | AWS SDK v1 (1.11.1000) | Legacy; should migrate |
Pre-Deployment Architectural Review¶
P0 — Blockers¶
1. Environment Variable Dump at Startup¶
UnzipserviceApplication.main() dumps ALL environment variables to stdout,
including Base64-encoded passwords. In ECS Fargate, stdout goes to CloudWatch
Logs, exposing credentials.
Fix: Remove env var dump or redact sensitive fields.
2. Only First File Processed¶
unzipFileList.getFirst() — array input but only first element is ever
processed. If caller sends multiple archives, the rest are silently ignored.
Fix: Either iterate over all files or validate input is single-element.
3. Hardcoded Local Path¶
Extraction directory is src/main/unziproom/ — a relative path that only
works during development. In ECS Fargate, this resolves to the container's
working directory, which may have limited space.
Fix: Use /tmp or a configurable ephemeral storage path.
P1 — High Priority¶
4. No CDK Infrastructure¶
No infrastructure-as-code. Deployment is via buildScript.sh (manual Docker
build + ECR push). No reproducible stack definition.
Fix: Add CDK stack defining ECS task definition, IAM roles, ECR repository, and CloudWatch log group.
5. AWS SDK v1 (Legacy)¶
Uses AWS SDK 1.11.1000 — end-of-support. Missing: connection pooling, async clients, credential provider chain improvements.
Fix: Migrate to AWS SDK v2 (software.amazon.awssdk).
6. Hardcoded SES Region¶
MailingService uses Regions.US_EAST_1 hardcoded — won't work for
ca-central-1 deployments if SES is region-specific.
Fix: Read region from environment or use default provider chain.
7. Shell Command Injection Risk (RAR)¶
RAR extraction uses Runtime.getRuntime().exec() with constructed command
string. If file paths contain special characters, this could be exploited.
Fix: Use ProcessBuilder with argument array (already partially done but verify escaping).
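The suggested fix can be sketched as building the command as an argument list, so file names containing spaces or shell metacharacters are passed to the process verbatim and never interpreted by a shell. The -o (output directory), -p (password), and -f (force overwrite) flags are real unar options, but the exact command the service issues may differ.

```java
import java.util.List;

// Illustrative sketch: build the unar invocation as an argument array.
public class RarCommandBuilder {
    public static List<String> build(String archivePath, String destDir, String password) {
        if (password == null || password.isEmpty()) {
            return List.of("unar", "-f", "-o", destDir, archivePath);
        }
        return List.of("unar", "-f", "-o", destDir, "-p", password, archivePath);
    }
}
```

The resulting list would be launched via `new ProcessBuilder(cmd).start()`, which executes the binary directly rather than through `sh -c`.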
8. ArchiveHandler Tests Commented Out¶
Main orchestration test is disabled with "TODO: failing while building image." No integration test coverage for the primary flow.
9. Spring Boot Unused¶
Spring Boot 3.2.4 is included as parent POM but not actually used — no
@SpringBootApplication scan, no dependency injection, no web server.
Adds ~15 MB to the container image for no benefit.
Fix: Remove Spring Boot dependency; use plain Java main class.
P2 — Medium Priority¶
10. No Retry or Recovery¶
Single-shot execution with no retry. If S3 upload fails mid-way, extracted files may be partially uploaded with no way to resume.
11. No Nested Archive Support¶
If a ZIP contains another ZIP, the inner archive is uploaded as-is (not recursively extracted). May be intentional for eDiscovery (preserving original structure).
12. No File Size Limits¶
No guard against extracting a zip bomb (small archive → massive expansion). Could exhaust ECS ephemeral storage.
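One possible guard, not present in the current implementation, is to cap the cumulative bytes written across all extracted entries and abort once the cap is exceeded. The class name and limit handling below are illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Illustrative sketch: bound total extracted bytes to defuse zip bombs.
public class BoundedExtractor {
    private final long maxTotalBytes;
    private long written;

    public BoundedExtractor(long maxTotalBytes) {
        this.maxTotalBytes = maxTotalBytes;
    }

    // Copy one entry's stream, counting against the shared budget.
    public void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            written += n;
            if (written > maxTotalBytes) {
                throw new IOException("Extraction exceeds limit of " + maxTotalBytes + " bytes");
            }
            out.write(buf, 0, n);
        }
    }
}
```

Checking the running total inside the copy loop matters: the archive's declared entry sizes cannot be trusted, only the bytes actually produced by decompression.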
13. Blocking S3 Operations¶
TransferManager download/upload are synchronous blocking calls. For large archives with many files, this could be optimized with async uploads.
Architecture Compliance Summary¶
| Requirement | Status | Details |
|---|---|---|
| Event-driven via SNS | ❌ N/A | ECS task, env var triggered — by design |
| Idempotent handlers | ❌ N/A | One-shot ECS task, not event handler |
| Exception hierarchy | ❌ FAIL | String matching instead of typed exceptions |
| Multi-tenant DB | ❌ N/A | No database interaction |
| No secrets in code | ⚠️ WARN | Passwords in env vars dumped to CloudWatch |
| IAM least privilege | ⚠️ UNKNOWN | No CDK/IAM definitions in repo |
| Testing | ⚠️ WARN | Core tests commented out; format tests exist |
| CDK infrastructure | ❌ FAIL | No IaC — manual build script only |
| Structured logging | ❌ FAIL | Log4j2 pattern format, no JSON structure |
Lessons Learned¶
- ECS Fargate is the right choice for archive extraction — Lambda's ephemeral storage (10 GB max) and 15-minute timeout are insufficient for production archive workloads (100 GB+ archives with tens of thousands of files).
- Path traversal protection is essential — zip-slip attacks are a real threat when extracting user-uploaded archives in legal proceedings. Every extractor must validate output paths.
- Format-specific libraries outperform generic solutions — zip4j handles ZIP passwords and encryption better than commons-compress; unar handles RAR better than any Java library.
- Status reporting via S3 JSON is simple but limited — it works for basic pass/fail, but lacks granular progress tracking (e.g., "extracted 500 of 10,000 files"). For long-running extractions, consider DynamoDB or EventBridge for real-time status updates.
- Spring Boot overhead is unjustified for ECS tasks — a plain Java main class with explicit dependency wiring would reduce image size and startup time without losing any functionality.