Reference Implementation: unzipservice

Archive extraction service for the Nextpoint eDiscovery platform. Extracts ZIP, RAR, 7Z, TAR, GZIP, and BZIP2 archives as an ECS Fargate task, uploading extracted contents to S3 with status reporting and error notifications.

Architecture Overview

unzipservice is a standalone Java ECS Fargate task — not an event-driven SNS/SQS service module. It receives a single extraction job via environment variables, processes one archive file, uploads extracted contents to S3, and reports status. This is the simplest module in the NGE platform: pure data-in → data-out with S3 and SES as the only AWS dependencies.

Why ECS Instead of Lambda?

| Concern | Lambda | ECS Fargate (chosen) |
|---|---|---|
| Archive size | 512 MB /tmp | Configurable ephemeral (up to 200 GB) |
| Execution time | 15 min max | No timeout limit |
| Native tools | Lambda layers only | Full OS (unar for RAR) |
| Memory | 10 GB max | Up to 30 GB |
| Nested extraction | Would require chaining | Single process handles all |

Architecture Tree

unzipservice/
├── src/main/java/com/nextpoint/unzipservice/
│   ├── UnzipserviceApplication.java          # Entry point (Spring Boot main)
│   ├── service/
│   │   ├── ArchiveHandler.java               # Orchestrator: download → detect → extract → upload
│   │   ├── UnArchiverService.java            # Interface for format extractors
│   │   ├── MailingService.java               # Error notifications via AWS SES
│   │   ├── ReportingService.java             # Status JSON updates to S3
│   │   └── implementation/
│   │       ├── ZipManagerServiceImpl.java    # ZIP extraction (zip4j)
│   │       ├── RarManagerServiceImpl.java    # RAR extraction (unar CLI)
│   │       ├── S7ZipManagerServiceImpl.java  # 7Z extraction (commons-compress)
│   │       ├── TarManagerServiceImpl.java    # TAR/TAR.GZ/TAR.BZ2 (commons-compress)
│   │       ├── GzipManagerServiceImpl.java   # GZIP single-file (commons-compress)
│   │       └── BzipManagerServiceImpl.java   # BZIP2 single-file (commons-compress)
│   └── utils/
│       └── S3Connector.java                  # S3 download/upload via TransferManager
├── src/test/java/.../service/
│   ├── ArchiveHandlerTest.java               # Commented out (build failures)
│   ├── MailingServiceTest.java               # Mock SES client
│   ├── ReportingServiceTest.java             # Mock S3 + file I/O
│   └── implementation/
│       ├── ZipManagerServiceImplTest.java    # Real test archives
│       ├── RarManagerServiceImplTest.java    # Command builder tests only
│       ├── TarManagerServiceImplTest.java    # TempDir fixtures
│       └── S7ZipManagerServiceImplTest.java  # Real test archives
├── pom.xml                                   # Maven: Java 21, Spring Boot 3.2.4
├── Dockerfile                                # amazoncorretto-21 + unar + locale
└── buildScript.sh                            # Maven → Docker → ECR push

Language Stack

| Component | Language | Runtime | Purpose |
|---|---|---|---|
| Archive Service | Java 21 | ECS Fargate (Spring Boot) | Archive extraction orchestration |
| Lambda | — | — | Not used: no API/trigger layer (direct ECS RunTask) |
| CDK | — | — | Not used: no infrastructure-as-code in this repo |

Request Flow

Upstream Service (documentloader or manual trigger)
ECS RunTask (environment variable overrides)
  ├── unzipFileList = JSON array of archive descriptors
  └── userDetails = JSON with notification/reporting config
UnzipserviceApplication.main()
ArchiveHandler.handleArchive()
  ├── 1. Parse env vars → extract first file descriptor
  ├── 2. Download archive from S3 → local filesystem
  ├── 3. Detect MIME type (Apache Tika)
  ├── 4. Route to format-specific extractor
  ├── 5. Extract to local destination directory
  ├── 6. Delete original archive from local
  ├── 7. Upload all extracted files to S3 (recursive)
  ├── 8. Update status report JSON on S3
  ├── 9. Send error email (if applicable) via SES
  └── 10. Cleanup local directory + close S3 connections

Supported Archive Formats

| Format | MIME Type | Library | Password Support | Notes |
|---|---|---|---|---|
| ZIP | application/zip | zip4j 2.11.5 | Yes | Custom retry logic (MAX_ATTEMPTS=5 for null LocalFileHeader) |
| RAR | application/x-rar-compressed | unar CLI | Yes | Shell exec: Runtime.getRuntime().exec() |
| 7Z | application/x-7z-compressed | commons-compress 1.25.0 | Yes | SevenZFile with password char array |
| TAR | application/x-tar | commons-compress | No | Detects .tar.gz and .tar.bz2 by extension |
| GZIP | application/gzip | commons-compress | No | Single-file decompression only |
| BZIP2 | application/x-bzip2 | commons-compress | No | Single-file decompression only |

File type detection: Apache Tika (tika-core 2.8.0) with special-case override for RAR files (Tika sometimes misdetects RAR).
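The detect-then-override routing described above can be sketched without the Tika dependency; class names match the implementation tree, but the method and map names here are illustrative, not the actual unzipservice identifiers:

```java
import java.util.Locale;
import java.util.Map;

public class FormatRouter {

    // MIME types taken from the format table above.
    private static final Map<String, String> EXTRACTORS = Map.of(
        "application/zip",              "ZipManagerServiceImpl",
        "application/x-rar-compressed", "RarManagerServiceImpl",
        "application/x-7z-compressed",  "S7ZipManagerServiceImpl",
        "application/x-tar",            "TarManagerServiceImpl",
        "application/gzip",             "GzipManagerServiceImpl",
        "application/x-bzip2",          "BzipManagerServiceImpl");

    /** Picks an extractor, trusting the .rar extension over the detector. */
    public static String routeExtractor(String detectedMime, String fileName) {
        // Special case: Tika sometimes misdetects RAR, so the extension wins.
        if (fileName.toLowerCase(Locale.ROOT).endsWith(".rar")) {
            return EXTRACTORS.get("application/x-rar-compressed");
        }
        String extractor = EXTRACTORS.get(detectedMime);
        if (extractor == null) {
            throw new IllegalArgumentException("Unsupported MIME type: " + detectedMime);
        }
        return extractor;
    }
}
```

In the real service the `detectedMime` argument would come from `Tika.detect()` on the downloaded file.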

Excluded entries: __MACOSX/ directories and .DS_Store files filtered during extraction.
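A minimal sketch of that exclusion filter (the method name is hypothetical; the real check lives inside each extractor):

```java
public class EntryFilter {

    /** True for macOS metadata entries that should be skipped during extraction. */
    public static boolean isExcluded(String entryName) {
        return entryName.startsWith("__MACOSX/")
            || entryName.contains("/__MACOSX/")
            || entryName.endsWith(".DS_Store");
    }
}
```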

Error Classification

String-based error classification (not exception hierarchy):

| Status | Detection | User Message |
|---|---|---|
| complete | Extraction succeeded | (no email) |
| no_password | Exception contains "password" + empty password field | "A password is required, but was not provided." |
| bad_password | Exception contains "password" + password provided | "The password entered is incorrect." |
| partial_extracted | "Unexpected end of input stream" | "Some files might not be extracted properly." |
| error | Any other exception | Generic error message |

Password detection: Regex (?i).*\bpassword\b.* applied to exception message string — fragile but covers zip4j, unar, and commons-compress messages.
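The classification rules in the table reduce to a small function; this is a hedged reconstruction from the table and regex above, not the actual class:

```java
import java.util.regex.Pattern;

public class ErrorClassifier {

    // The same fragile message-matching regex the service uses.
    private static final Pattern PASSWORD = Pattern.compile("(?i).*\\bpassword\\b.*");

    /** Maps an extraction exception message + supplied password to a status string. */
    public static String classify(String exceptionMessage, String password) {
        if (exceptionMessage == null) {
            return "error";
        }
        if (PASSWORD.matcher(exceptionMessage).matches()) {
            // Same symptom, two causes: distinguish by whether a password was given.
            return (password == null || password.isEmpty()) ? "no_password" : "bad_password";
        }
        if (exceptionMessage.contains("Unexpected end of input stream")) {
            return "partial_extracted";
        }
        return "error";
    }
}
```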

Status Reporting

JSON file updated on S3 at userDetails.reportKey:

{
  "status": "complete|error|no_password|bad_password|partial_extracted",
  "percentDone": 100
}

Status is updated at two points:

  1. On success: complete with percentDone: 100
  2. On error: the error status with percentDone: 0
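Rendering that two-field document is trivial; a sketch of the JSON the service writes (method name hypothetical — the real code uploads this string to `userDetails.reportKey` via S3Connector):

```java
public class StatusReport {

    /** Builds the status JSON written to the report key on S3. */
    public static String toJson(String status, int percentDone) {
        return String.format("{\"status\": \"%s\", \"percentDone\": %d}", status, percentDone);
    }
}
```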

Security Patterns

Path Traversal Protection (Zip Slip)

Every extractor validates extracted file paths against the destination directory:

private File newFile(File destDir, String entryName) throws IOException {
    File destFile = new File(destDir, entryName);
    String destDirPath = destDir.getCanonicalPath();
    String destFilePath = destFile.getCanonicalPath();
    if (!destFilePath.startsWith(destDirPath + File.separator)) {
        throw new IOException("Entry is outside target dir: " + entryName);
    }
    return destFile;
}

This protects against malicious archive entries like ../../etc/passwd.

Password Handling

  • Passwords are Base64-encoded in the environment variable
  • Decoded at runtime before passing to extractors
  • Not logged in structured output (though all env vars are dumped at startup — see review below)
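The decode step is standard `java.util.Base64`; a minimal sketch (method name illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PasswordDecoder {

    /** Decodes the Base64-encoded password from the environment variable. */
    public static String decode(String encoded) {
        if (encoded == null || encoded.isEmpty()) {
            return "";  // empty password → "no_password" classification downstream
        }
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }
}
```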

S3 Operations

S3Connector wraps AWS SDK v1 TransferManager:

| Operation | Method | Behavior |
|---|---|---|
| Download | objectPuller() | TransferManager.download() + waitForCompletion() (blocking) |
| Upload file | objectPusher(isFile=true) | TransferManager.upload() (single file) |
| Upload directory | objectPusher(isFile=false) | TransferManager.uploadDirectory() (recursive) |

TransferManager lifecycle: Created once per ArchiveHandler invocation, shutdownNow() in finally block.
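That create-once, shutdown-in-finally lifecycle maps naturally onto AutoCloseable. The class below is a generic stand-in (not the real S3Connector, which wraps the AWS SDK v1 TransferManager) showing the pattern:

```java
public class TransferSession implements AutoCloseable {

    private boolean open = true;

    public boolean isOpen() {
        return open;
    }

    public void download(String key) {
        if (!open) {
            throw new IllegalStateException("session closed");
        }
        // real code: transferManager.download(bucket, key, file).waitForCompletion();
    }

    @Override
    public void close() {
        open = false; // real code: transferManager.shutdownNow();
    }
}
```

With try-with-resources (`try (TransferSession s = new TransferSession()) { ... }`) the explicit finally block in ArchiveHandler would become unnecessary.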

Pattern Mapping

| Architecture Pattern | unzipservice Implementation | Notes |
|---|---|---|
| SNS Event Publishing | Not used | No events published or consumed |
| SQS Handler | Not used | Triggered via ECS env vars |
| Exception Hierarchy | Not used | String-based error classification |
| Hexagonal core/shell | Not followed | Monolithic ArchiveHandler |
| Multi-tenancy | Case ID in strings only | No database access |
| Idempotent Handlers | Not implemented | No dedup; same task can re-extract |
| Checkpoint Pipeline | Not used | Single-shot execution |
| Config Management | Environment variables (JSON) | No config files beyond application.properties |
| CDK Infrastructure | Not in this repo | Deployed via buildScript.sh + manual ECR push |

Divergences from Standard Architecture

| Standard Pattern | unzipservice Divergence | Reason |
|---|---|---|
| Python 3.10+ | Java 21 (Spring Boot 3.2.4) | Mature archive libraries (zip4j, commons-compress) |
| Event-driven SNS/SQS | ECS RunTask with env var overrides | One-shot job, no message flow |
| Lambda handlers | ECS Fargate task | Archive size/time exceeds Lambda limits |
| Hexagonal core/shell | Monolithic orchestrator class | Small scope; single responsibility |
| Exception hierarchy | String pattern matching on error messages | No SQS requeue semantics to control |
| Multi-tenant MySQL | No database | Status via S3 JSON files |
| CDK infrastructure | Manual Docker build + ECR push | No IaC in repo |
| Structured JSON logging | Log4j2 console pattern | No cross-service correlation fields |
| AWS SDK v2 | AWS SDK v1 (1.11.1000) | Legacy; should migrate |

Pre-Deployment Architectural Review

P0 — Blockers

1. Environment Variable Dump at Startup

UnzipserviceApplication.main() dumps ALL environment variables to stdout, including Base64-encoded passwords. In ECS Fargate, stdout goes to CloudWatch Logs, exposing credentials.

Fix: Remove env var dump or redact sensitive fields.
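One possible shape for the redaction fix — mask any variable whose name looks sensitive before logging (the key-name patterns here are assumptions; since the passwords actually live inside the unzipFileList/userDetails JSON values, value-level redaction would also be needed):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

public class EnvRedactor {

    private static final Pattern SENSITIVE =
        Pattern.compile("(?i)(password|secret|token|credential|key)");

    /** Returns a copy of the environment map with sensitive values masked. */
    public static Map<String, String> redact(Map<String, String> env) {
        Map<String, String> safe = new TreeMap<>();
        env.forEach((name, value) ->
            safe.put(name, SENSITIVE.matcher(name).find() ? "***REDACTED***" : value));
        return safe;
    }
}
```

Logging `EnvRedactor.redact(System.getenv())` instead of the raw map keeps the startup diagnostics without shipping credentials to CloudWatch.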

2. Only First File Processed

unzipFileList.getFirst() — array input but only first element is ever processed. If caller sends multiple archives, the rest are silently ignored.

Fix: Either iterate over all files or validate input is single-element.

3. Hardcoded Local Path

Extraction directory is src/main/unziproom/ — a relative path that only works during development. In ECS Fargate, this resolves to the container's working directory, which may have limited space.

Fix: Use /tmp or a configurable ephemeral storage path.
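A sketch of that fix — the env var name `EXTRACT_DIR` and the `/tmp/unziproom` default are hypothetical choices, not existing configuration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExtractionDir {

    /** Resolves the extraction root from configuration, defaulting under /tmp. */
    public static Path resolve(String configured) {
        return Path.of(configured == null || configured.isBlank()
            ? "/tmp/unziproom"
            : configured);
    }

    /** Resolves and creates the directory (no-op if it already exists). */
    public static Path prepare(String configured) throws IOException {
        return Files.createDirectories(resolve(configured));
    }
}
```

Called as `ExtractionDir.prepare(System.getenv("EXTRACT_DIR"))`, this lands extraction on Fargate ephemeral storage regardless of the container's working directory.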

P1 — High Priority

4. No CDK Infrastructure

No infrastructure-as-code. Deployment is via buildScript.sh (manual Docker build + ECR push). No reproducible stack definition.

Fix: Add CDK stack defining ECS task definition, IAM roles, ECR repository, and CloudWatch log group.

5. AWS SDK v1 (Legacy)

Uses AWS SDK 1.11.1000 — end-of-support. Missing: connection pooling, async clients, credential provider chain improvements.

Fix: Migrate to AWS SDK v2 (software.amazon.awssdk).

6. Hardcoded SES Region

MailingService uses Regions.US_EAST_1 hardcoded — won't work for ca-central-1 deployments if SES is region-specific.

Fix: Read region from environment or use default provider chain.

7. Shell Command Injection Risk (RAR)

RAR extraction uses Runtime.getRuntime().exec() with constructed command string. If file paths contain special characters, this could be exploited.

Fix: Use ProcessBuilder with argument array (already partially done but verify escaping).
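A sketch of the safe invocation — building the command as an argument list means paths and passwords are never interpreted by a shell (flags per unar's documented options: -f force overwrite, -o output directory, -p password; exact flags used by the service are an assumption):

```java
import java.util.ArrayList;
import java.util.List;

public class RarCommand {

    /** Builds the unar invocation as an argument array for ProcessBuilder. */
    public static List<String> build(String archivePath, String destDir, String password) {
        List<String> cmd = new ArrayList<>(List.of("unar", "-f", "-o", destDir));
        if (password != null && !password.isEmpty()) {
            cmd.add("-p");
            cmd.add(password);
        }
        cmd.add(archivePath);
        return cmd;
    }
}
```

Usage: `new ProcessBuilder(RarCommand.build(path, dest, pwd)).start()` — each element is passed to execve verbatim, so `"; rm -rf /"` in a filename stays a filename.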

8. ArchiveHandler Tests Commented Out

Main orchestration test is disabled with "TODO: failing while building image." No integration test coverage for the primary flow.

9. Spring Boot Unused

Spring Boot 3.2.4 is included as parent POM but not actually used — no @SpringBootApplication scan, no dependency injection, no web server. Adds ~15 MB to the container image for no benefit.

Fix: Remove Spring Boot dependency; use plain Java main class.

P2 — Medium Priority

10. No Retry or Recovery

Single-shot execution with no retry. If S3 upload fails mid-way, extracted files may be partially uploaded with no way to resume.

11. No Nested Archive Support

If a ZIP contains another ZIP, the inner archive is uploaded as-is (not recursively extracted). May be intentional for eDiscovery (preserving original structure).

12. No File Size Limits

No guard against extracting a zip bomb (small archive → massive expansion). Could exhaust ECS ephemeral storage.
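A minimal guard of the kind that could close this gap — cap total extracted bytes and the expansion ratio against the archive's on-disk size (the limits and class are illustrative, not existing code):

```java
public class ExpansionGuard {

    private final long archiveSize;
    private final long maxTotalBytes;
    private final long maxRatio;
    private long written;

    public ExpansionGuard(long archiveSize, long maxTotalBytes, long maxRatio) {
        this.archiveSize = archiveSize;
        this.maxTotalBytes = maxTotalBytes;
        this.maxRatio = maxRatio;
    }

    /** Call with each chunk written during extraction; throws once limits are exceeded. */
    public void account(long bytes) {
        written += bytes;
        if (written > maxTotalBytes) {
            throw new IllegalStateException("extraction exceeds byte cap: " + written);
        }
        if (archiveSize > 0 && written / archiveSize > maxRatio) {
            throw new IllegalStateException("suspicious compression ratio for archive");
        }
    }
}
```

Each extractor's copy loop would call `account(bytesRead)`; the thrown exception would then surface through the existing error-classification path.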

13. Blocking S3 Operations

TransferManager download/upload are synchronous blocking calls. For large archives with many files, this could be optimized with async uploads.

Architecture Compliance Summary

| Requirement | Status | Details |
|---|---|---|
| Event-driven via SNS | ❌ N/A | ECS task, env var triggered — by design |
| Idempotent handlers | ❌ N/A | One-shot ECS task, not event handler |
| Exception hierarchy | ❌ FAIL | String matching instead of typed exceptions |
| Multi-tenant DB | ❌ N/A | No database interaction |
| No secrets in code | ⚠️ WARN | Passwords in env vars dumped to CloudWatch |
| IAM least privilege | ⚠️ UNKNOWN | No CDK/IAM definitions in repo |
| Testing | ⚠️ WARN | Core tests commented out; format tests exist |
| CDK infrastructure | ❌ FAIL | No IaC — manual build script only |
| Structured logging | ❌ FAIL | Log4j2 pattern format, no JSON structure |

Lessons Learned

  1. ECS Fargate is the right choice for archive extraction — Lambda's 512 MB /tmp and 15-minute timeout are insufficient for production archive workloads (100 GB+ archives with tens of thousands of files).

  2. Path traversal protection is essential — zip-slip attacks are a real threat when extracting user-uploaded archives in legal proceedings. Every extractor must validate output paths.

  3. Format-specific libraries outperform generic solutions — zip4j handles ZIP passwords and encryption better than commons-compress; unar handles RAR better than any Java library.

  4. Status reporting via S3 JSON is simple but limited — works for basic pass/fail, but lacks granular progress tracking (e.g., "extracted 500 of 10,000 files"). For long-running extractions, consider DynamoDB or EventBridge for real-time status updates.

  5. Spring Boot overhead is unjustified for ECS tasks — a plain Java main class with explicit dependency wiring would reduce image size and startup time without losing any functionality.
