Reference Implementation: neardupe (POC)¶
Overview¶
Near-duplicate detection POC using Locality-Sensitive Hashing (LSH) on AWS EMR. Migrates the existing Databricks-based near-dupe workflow to EMR for ~52% cost savings. Identifies document pairs with Jaccard distance below a configurable threshold.
EDRM Stage: 7 (Analysis) — near-duplicate detection for document review. Suite: Common (both Discovery and Litigation). Status: POC — not yet integrated into the platform.
Current Production Process (Databricks)¶
1. Manually push search text files to S3
2. Run Databricks notebook (LSH algorithm)
3. Calculate Jaccard distance between documents
4. Filter pairs under threshold
5. Copy results (sim_final.csv) to S3 per case
POC Architecture (EMR)¶
neardupe/
├── scripts/
│ ├── lsh_core.py # Core LSH algorithm (PySpark, Databricks-free)
│ └── run_neardupe.py # EMR job runner (CLI argument parsing)
├── config/
│ └── emr-cluster.json # EMR cluster config (16 workers, c5.4xlarge)
├── deployment/
│ └── deploy.sh # Upload scripts → create cluster → submit job
└── README.md
Algorithm Pipeline¶
S3 text files → Read documents
│
▼
Tokenize (RegexTokenizer) → Remove stopwords → N-grams (3)
│
▼
Vectorize (HashingTF, 4M features) → Filter empty vectors
│
▼
MinHashLSH (3 hash tables) → Approximate nearest neighbors
│
▼
Jaccard distance calculation → Filter pairs < threshold (0.2)
│
▼
Optional: MD5 exact dedup (removes identical docs before LSH)
│
▼
Output: sim_final.csv — (doc1, doc2, distance) pairs
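The pipeline above can be illustrated with a dependency-free sketch of the underlying MinHash/Jaccard math (a conceptual toy, not the lsh_core.py implementation; the real job uses PySpark's RegexTokenizer, NGram, HashingTF, and MinHashLSH):

```python
import hashlib
import re

def shingles(text, n=3):
    """Tokenize and build word n-grams (mirrors RegexTokenizer + NGram, n=3)."""
    tokens = re.findall(r"\w+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def minhash_signature(shingle_set, num_hashes=12):
    """One min-hash per seeded hash function; similar sets agree on many
    positions, which is what lets LSH bucket them together."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return sig

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; pairs below the threshold are near-dupes."""
    return 1.0 - len(a & b) / len(a | b)

doc1 = "the quick brown fox jumps over the lazy dog near the river"
doc2 = "the quick brown fox jumps over the lazy cat near the river"
s1, s2 = shingles(doc1), shingles(doc2)
d = jaccard_distance(s1, s2)  # 7 of 13 trigrams shared -> 6/13 ≈ 0.46
```

A one-word change perturbs only the three trigrams that contain it, so heavily overlapping documents keep a small Jaccard distance even at large scale.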
Stack¶
- Language: Python (PySpark)
- Compute: AWS EMR (Spark on YARN)
- ML: PySpark MLlib (MinHashLSH, HashingTF, RegexTokenizer, StopWordsRemover, NGram)
- Storage: S3 (input text files, intermediate Parquet, output CSV)
- Deploy: Shell script (deploy.sh)
Configuration¶
| Parameter | Default | Description |
|---|---|---|
| `--sim-threshold` | 0.2 | Jaccard distance threshold (lower = more similar) |
| `--num-tables` | 3 | Number of LSH hash tables (more = better recall, slower) |
| `--ngram-order` | 3 | N-gram size for tokenization |
| `--vocab-size` | 262144 | Vocabulary size for the count vectorizer |
| `--num-features` | 4194304 | Feature space size for the hashing vectorizer |
| `--skew-threshold` | 500 | Bucket size that triggers salted joins (handles data skew) |
| `--vectorizer-type` | hashing | `count` or `hashing` |
| `--dedupe-exact-md5` | true | Run MD5 exact dedup before LSH |
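The parameters above map naturally onto an argparse CLI. This is a sketch mirroring the documented flags and defaults; the actual wiring in run_neardupe.py may differ:

```python
import argparse

def build_parser():
    """CLI sketch for the neardupe job, using the defaults from the table."""
    p = argparse.ArgumentParser(description="Near-duplicate detection via LSH")
    p.add_argument("--sim-threshold", type=float, default=0.2,
                   help="Jaccard distance cutoff (lower = more similar)")
    p.add_argument("--num-tables", type=int, default=3,
                   help="LSH hash tables (more = better recall, slower)")
    p.add_argument("--ngram-order", type=int, default=3,
                   help="N-gram size for tokenization")
    p.add_argument("--vocab-size", type=int, default=262144,
                   help="Vocabulary size for the count vectorizer")
    p.add_argument("--num-features", type=int, default=4194304,
                   help="Feature space size for the hashing vectorizer")
    p.add_argument("--skew-threshold", type=int, default=500,
                   help="Bucket size that triggers salted joins")
    p.add_argument("--vectorizer-type", choices=["count", "hashing"],
                   default="hashing")
    p.add_argument("--dedupe-exact-md5", choices=["true", "false"],
                   default="true",
                   help="Run MD5 exact dedup before LSH")
    return p

args = build_parser().parse_args([])  # no overrides: all defaults
```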
EMR Cluster¶
| Component | Spec |
|---|---|
| Master | m5.4xlarge (16 vCPUs, 64 GB) |
| Workers | 16 × c5.4xlarge (16 vCPUs, 32 GB) — compute optimized |
| Total cores | 240 (16 workers × 3 executors × 5 cores) |
| Executor memory | 8g + 1g overhead + 1g off-heap |
| Driver memory | 20g, 8 cores |
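The sizing in the table can be sanity-checked with quick arithmetic (a back-of-envelope check using the listed specs, not the contents of emr-cluster.json):

```python
# Worker fleet as specified: 16 x c5.4xlarge (16 vCPUs, 32 GB each)
workers = 16
executors_per_worker = 3
cores_per_executor = 5
node_mem_gb = 32
node_vcpus = 16

# Per-executor memory footprint: 8g heap + 1g overhead + 1g off-heap
heap_gb, overhead_gb, offheap_gb = 8, 1, 1
executor_footprint_gb = heap_gb + overhead_gb + offheap_gb  # 10 GB each

total_cores = workers * executors_per_worker * cores_per_executor  # 240
mem_per_node_gb = executors_per_worker * executor_footprint_gb     # 30 of 32
cores_per_node = executors_per_worker * cores_per_executor         # 15 of 16
```

Each worker commits 30 of its 32 GB and 15 of its 16 vCPUs, leaving headroom for the OS and YARN daemons.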
S3 Layout¶
s3://{bucket}/
├── data/{case}/ # Input text files + output sim_final.csv
├── scratch/{case}/ # Intermediate Parquet files
│ ├── raw/ # Raw document data
│ ├── tokens/ # Tokenized text
│ ├── stopwords/ # After stopword removal
│ ├── ngrams/ # N-gram features
│ ├── vectorizer/ # TF vectors
│ ├── lsh/ # LSH hash buckets
│ └── sim/ # Similarity pairs
├── neardupe/scripts/ # PySpark code
└── emr-logs/ # Cluster logs
Cost Comparison¶
| Dataset | Documents | Time | Databricks | EMR | Savings |
|---|---|---|---|---|---|
| Small | 76K | 10 min | ~$4.10 | $1.94 | 52% |
| Medium | 273K | 23 min | ~$9.40 | $4.47 | 52% |
| Large | 1M | 83 min | ~$33.80 | $16.10 | 52% |
Key Design Decisions¶
LSH over Brute-Force Comparison¶
Brute-force pairwise comparison is O(n²). LSH reduces this to approximate nearest neighbors by hashing similar documents to the same buckets. Multiple hash tables (default 3) improve recall while keeping cost manageable.
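A toy illustration of the pair-count reduction, using a hypothetical single-table bucket assignment (the real job unions candidates across its 3 hash tables):

```python
from collections import defaultdict
from itertools import combinations

def brute_force_pairs(n):
    """All-pairs comparison count: O(n^2)."""
    return n * (n - 1) // 2

def bucketed_pairs(doc_buckets):
    """Only documents that share an LSH bucket become candidate pairs."""
    buckets = defaultdict(list)
    for doc, bucket in doc_buckets.items():
        buckets[bucket].append(doc)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return len(pairs)

# 6 documents hashed into 3 buckets (toy example)
assignment = {"d1": 0, "d2": 0, "d3": 1, "d4": 1, "d5": 1, "d6": 2}
# brute force: 15 comparisons; bucketed: 1 + 3 + 0 = 4 candidate pairs
```

The gap widens with corpus size: at 1M documents brute force is ~5 × 10¹¹ comparisons, while LSH only compares within buckets.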
Salted Joins for Data Skew¶
When LSH buckets are highly skewed (one bucket has many documents), joins
become slow. The --skew-threshold parameter triggers salted joins — adding
random salt to large buckets to distribute the join across partitions.
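The salting idea can be sketched as a key transformation (a conceptual sketch of what `--skew-threshold` triggers; the lsh_core.py implementation may differ):

```python
import random

def salt_key(bucket_id, bucket_size, skew_threshold=500, salts=8):
    """Join key for one row. Small buckets keep a single sub-key; hot
    buckets get a random salt so their rows spread across `salts`
    sub-keys (and thus across partitions) during the join."""
    if bucket_size <= skew_threshold:
        return (bucket_id, 0)
    return (bucket_id, random.randrange(salts))
```

The other side of the join replicates each hot-bucket row once per salt value, so every salted sub-key still finds all of its matches; the cost of that replication is traded for eliminating the single straggler partition.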
Hashing Vectorizer over Count Vectorizer¶
Default is hashing (HashingTF) rather than count (CountVectorizer) because:
- No vocabulary fitting step needed (faster)
- Fixed feature space (4M) scales to any corpus size
- Slight accuracy trade-off acceptable for near-dupe detection
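The "no fitting step" property comes from the hashing trick: a token maps straight to a column index, so no vocabulary pass over the corpus is needed. A minimal sketch (MD5 here for illustration; Spark's HashingTF actually uses MurmurHash3):

```python
import hashlib

def hashed_index(token, num_features=4_194_304):
    """Hashing trick: deterministic token -> column index mapping.
    The feature space is fixed regardless of corpus size; the trade-off
    is rare collisions (distinct tokens landing on the same index)."""
    digest = hashlib.md5(token.encode()).hexdigest()
    return int(digest, 16) % num_features
```

With 4M features and n-gram tokens, collisions are rare enough that they barely move Jaccard distances, which is why the slight accuracy trade-off is acceptable for near-dupe detection.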
EMR over Databricks¶
Migrating from Databricks to EMR for:
- 52% cost reduction (no DBU markup)
- Compute-optimized instances (c5.4xlarge vs memory-optimized)
- Native AWS integration (S3, CloudWatch, IAM)
- Command-line automation (no notebook UI dependency)
Integration with Nextpoint Platform¶
Current: Manual process — search text pushed to S3, Databricks notebook run, results copied back.
Future: Could be triggered by Rails (similar to search-hit-report-backend) via
Lambda → EMR step submission, with results written to a near_dupe_info JSON
column on the exhibits table.
Key File Locations¶
| File | Purpose |
|---|---|
| `scripts/lsh_core.py` | Core LSH algorithm — LSH class with full pipeline |
| `scripts/run_neardupe.py` | EMR job runner with CLI argument parsing |
| `config/emr-cluster.json` | EMR cluster specification |
| `deployment/deploy.sh` | Automated deployment script |