
Reference Implementation: neardupe (POC)

Overview

Near-duplicate detection POC using Locality-Sensitive Hashing (LSH) on AWS EMR. Migrates the existing Databricks-based near-dupe workflow to EMR for ~52% cost savings. Identifies document pairs with Jaccard distance below a configurable threshold.

EDRM Stage: 7 (Analysis) — near-duplicate detection for document review. Suite: Common (both Discovery and Litigation). Status: POC — not yet integrated into the platform.

Current Production Process (Databricks)

1. Manually push search text files to S3
2. Run Databricks notebook (LSH algorithm)
3. Calculate Jaccard distance between documents
4. Filter pairs under threshold
5. Copy results (sim_final.csv) to S3 per case

POC Architecture (EMR)

neardupe/
├── scripts/
│   ├── lsh_core.py              # Core LSH algorithm (PySpark, Databricks-free)
│   └── run_neardupe.py          # EMR job runner (CLI argument parsing)
├── config/
│   └── emr-cluster.json         # EMR cluster config (16 workers, c5.4xlarge)
├── deployment/
│   └── deploy.sh                # Upload scripts → create cluster → submit job
└── README.md

Algorithm Pipeline

S3 text files → Read documents
Tokenize (RegexTokenizer) → Remove stopwords → N-grams (3)
Vectorize (HashingTF, 4M features) → Filter empty vectors
MinHashLSH (3 hash tables) → Approximate nearest neighbors
Jaccard distance calculation → Filter pairs < threshold (0.2)
Optional: MD5 exact dedup (removes identical docs before LSH)
Output: sim_final.csv — (doc1, doc2, distance) pairs
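The quantity being thresholded is plain Jaccard distance over n-gram shingles. A minimal pure-Python sketch of that measure (illustrative only; the actual pipeline computes it via MinHashLSH over hashed TF vectors):

```python
def ngrams(text, n=3):
    """Word-level n-grams, mirroring Tokenize -> NGram(3) in the pipeline."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; pairs below the threshold are near-duplicates."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

doc1 = ngrams("the quick brown fox jumps over the lazy dog")
doc2 = ngrams("the quick brown fox leaps over the lazy dog")
d = jaccard_distance(doc1, doc2)
# d == 0.6 here: one changed word breaks three trigrams, so the pair
# is similar but still above the default 0.2 distance threshold
```

With 3-gram shingling, a single substituted word disturbs up to three shingles, which is why the default threshold is fairly strict.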

Stack

  • Language: Python (PySpark)
  • Compute: AWS EMR (Spark on YARN)
  • ML: PySpark MLlib (MinHashLSH, HashingTF, RegexTokenizer, StopWordsRemover, NGram)
  • Storage: S3 (input text files, intermediate Parquet, output CSV)
  • Deploy: Shell script (deploy.sh)

Configuration

Parameter            Default    Description
--sim-threshold      0.2        Jaccard distance threshold (lower = more similar)
--num-tables         3          Number of LSH hash tables (more = better recall, slower)
--ngram-order        3          N-gram size for tokenization
--vocab-size         262144     Vocabulary size for the count vectorizer
--num-features       4194304    Feature space for the hashing vectorizer
--skew-threshold     500        Bucket size that triggers salted joins (handles data skew)
--vectorizer-type    hashing    count or hashing
--dedupe-exact-md5   true       MD5 exact dedup before LSH
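A hypothetical sketch of the CLI surface these parameters imply, using argparse; flag names and defaults are taken from the table above, but the actual parser in run_neardupe.py is not shown in this doc:

```python
import argparse

# Hypothetical argparse sketch; flags and defaults mirror the table above.
parser = argparse.ArgumentParser(description="Near-duplicate detection (LSH) job")
parser.add_argument("--sim-threshold", type=float, default=0.2)
parser.add_argument("--num-tables", type=int, default=3)
parser.add_argument("--ngram-order", type=int, default=3)
parser.add_argument("--vocab-size", type=int, default=262144)
parser.add_argument("--num-features", type=int, default=4194304)
parser.add_argument("--skew-threshold", type=int, default=500)
parser.add_argument("--vectorizer-type", choices=["count", "hashing"], default="hashing")
parser.add_argument("--dedupe-exact-md5", action=argparse.BooleanOptionalAction, default=True)

args = parser.parse_args([])  # no flags given: all defaults apply
```

`BooleanOptionalAction` (Python 3.9+) is an assumption here; it gives a paired `--no-dedupe-exact-md5` flag for free.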

EMR Cluster

Component         Spec
Master            m5.4xlarge (16 vCPUs, 64 GB)
Workers           16 × c5.4xlarge (16 vCPUs, 32 GB) — compute optimized
Total cores       240 (16 workers × 3 executors × 5 cores)
Executor memory   8g + 1g overhead + 1g off-heap
Driver            20g memory, 8 cores
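The core and memory totals are simple arithmetic; a quick sanity check, assuming the stated 3 executors per worker at 5 cores each:

```python
workers = 16
executors_per_worker = 3
cores_per_executor = 5

total_cores = workers * executors_per_worker * cores_per_executor          # 240

# Each c5.4xlarge worker has 16 vCPUs; 3 executors × 5 cores = 15,
# leaving 1 vCPU per node for YARN and the OS.
executor_cores_per_node = executors_per_worker * cores_per_executor        # 15

# Memory per node: 3 executors × (8g heap + 1g overhead + 1g off-heap)
# = 30 GB of the node's 32 GB.
memory_per_node_gb = executors_per_worker * (8 + 1 + 1)                    # 30
```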

S3 Layout

s3://{bucket}/
├── data/{case}/                  # Input text files + output sim_final.csv
├── scratch/{case}/               # Intermediate Parquet files
│   ├── raw/                      # Raw document data
│   ├── tokens/                   # Tokenized text
│   ├── stopwords/                # After stopword removal
│   ├── ngrams/                   # N-gram features
│   ├── vectorizer/               # TF vectors
│   ├── lsh/                      # LSH hash buckets
│   └── sim/                      # Similarity pairs
├── neardupe/scripts/             # PySpark code
└── emr-logs/                     # Cluster logs

Cost Comparison

Dataset   Documents   Time     Databricks   EMR      Savings
Small     76K         10 min   ~$4.10       $1.94    52%
Medium    273K        23 min   ~$9.40       $4.47    52%
Large     1M          83 min   ~$33.80      $16.10   52%

Key Design Decisions

LSH over Brute-Force Comparison

Brute-force pairwise comparison is O(n²). LSH reduces this to approximate nearest neighbors by hashing similar documents to the same buckets. Multiple hash tables (default 3) improve recall while keeping cost manageable.
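To see why bucketing beats brute force, here is a toy pure-Python min-hash bucketing sketch. This is not PySpark's MinHashLSH implementation (hash functions and banding are simplified for illustration): only documents that collide in at least one table are compared.

```python
import random

def minhash_signature(shingles, num_hashes=3, seed=42):
    """One min-hash per table: the minimum of a salted hash over the shingle set."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, s)) for s in shingles) for salt in salts]

def candidate_pairs(docs, num_hashes=3):
    """Group docs whose signatures collide in any table; compare only those."""
    buckets = {}
    for doc_id, shingles in docs.items():
        for table, h in enumerate(minhash_signature(shingles, num_hashes)):
            buckets.setdefault((table, h), []).append(doc_id)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add(tuple(sorted((members[i], members[j]))))
    return pairs

docs = {
    "a": {"the quick brown", "quick brown fox"},
    "b": {"the quick brown", "quick brown fox"},   # identical to a
    "c": {"completely different", "shingle set"},
}
pairs = candidate_pairs(docs)
# identical docs a and b always collide; c is very unlikely to pair with either
```

Each extra table is another chance for similar documents to collide, which is the recall/cost trade-off behind --num-tables.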

Salted Joins for Data Skew

When LSH buckets are highly skewed (one bucket has many documents), joins become slow. The --skew-threshold parameter triggers salted joins — adding random salt to large buckets to distribute the join across partitions.
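A minimal sketch of the salting idea in plain Python, assuming an illustrative fan-out of 8 salts (the real job's fan-out is not stated here): rows in oversized buckets get a random salt appended to the join key, and the other join side is replicated once per salt value.

```python
import random

SKEW_THRESHOLD = 500   # mirrors --skew-threshold
NUM_SALTS = 8          # illustrative fan-out, an assumption
rng = random.Random(0)

def salt_key(bucket_key, bucket_size):
    """For oversized LSH buckets, append a random salt so rows spread
    across NUM_SALTS partitions instead of piling onto one."""
    if bucket_size <= SKEW_THRESHOLD:
        return (bucket_key, 0)
    return (bucket_key, rng.randrange(NUM_SALTS))

def explode_for_join(bucket_key):
    """Replicate the small side of the join once per salt value so every
    salted partition still finds its match."""
    return [(bucket_key, salt) for salt in range(NUM_SALTS)]
```

The trade-off: the small side is duplicated NUM_SALTS times, but the hot bucket's work is spread over NUM_SALTS tasks instead of one straggler.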

Hashing Vectorizer over Count Vectorizer

Default is hashing (HashingTF) rather than count (CountVectorizer) because:

  • No vocabulary fitting step is needed (faster)
  • The fixed feature space (4M) scales to any corpus size
  • The slight accuracy trade-off is acceptable for near-dupe detection
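The "no vocabulary fitting" point comes from the hashing trick: a token's column index is just its hash modulo the feature space. A toy illustration using Python's built-in hash (PySpark's HashingTF uses MurmurHash3, so this is a simplification):

```python
NUM_FEATURES = 4_194_304  # matches --num-features (4M)

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Term-frequency vector as a sparse dict: index = hash(token) % num_features.
    No vocabulary pass is needed, so any corpus maps into a fixed-size space;
    occasional hash collisions are the accuracy trade-off mentioned above."""
    vec = {}
    for tok in tokens:
        idx = hash(tok) % num_features
        vec[idx] = vec.get(idx, 0) + 1
    return vec

vec = hashing_tf(["quick", "brown", "quick"])
# both occurrences of "quick" land in the same slot
```

A count vectorizer would instead need a full pass over the corpus to fit the vocabulary before any vector can be produced.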

EMR over Databricks

Migrating from Databricks to EMR for:

  • 52% cost reduction (no DBU markup)
  • Compute-optimized instances (c5.4xlarge vs memory-optimized)
  • Native AWS integration (S3, CloudWatch, IAM)
  • Command-line automation (no notebook UI dependency)

Integration with Nextpoint Platform

Current: Manual process — search text pushed to S3, Databricks notebook run, results copied back.

Future: Could be triggered by Rails (similar to search-hit-report-backend) via Lambda → EMR step submission, with results written to near_dupe_info JSON column on exhibits table.

Key File Locations

File                      Purpose
scripts/lsh_core.py       Core LSH algorithm — LSH class with full pipeline
scripts/run_neardupe.py   EMR job runner with CLI argument parsing
config/emr-cluster.json   EMR cluster specification
deployment/deploy.sh      Automated deployment script