
Reference Implementation: neardupe (POC)

Overview

Near-duplicate detection POC using Locality-Sensitive Hashing (LSH) on AWS EMR. Migrates the existing Databricks-based near-dupe workflow to EMR for ~52% cost savings. Identifies document pairs with Jaccard distance below a configurable threshold.

EDRM Stage: 7 (Analysis) — near-duplicate detection for document review. Suite: Common (both Discovery and Litigation). Status: POC — not yet integrated into the platform.

Current Production Process (Databricks)

1. Manually push search text files to S3
2. Run Databricks notebook (LSH algorithm)
3. Calculate Jaccard distance between documents
4. Filter pairs under threshold
5. Copy results (sim_final.csv) to S3 per case

POC Architecture (EMR)

neardupe/
├── scripts/
│   ├── lsh_core.py              # Core LSH algorithm (PySpark, Databricks-free)
│   └── run_neardupe.py          # EMR job runner (CLI argument parsing)
├── config/
│   └── emr-cluster.json         # EMR cluster config (16 workers, c5.4xlarge)
├── deployment/
│   └── deploy.sh                # Upload scripts → create cluster → submit job
└── README.md

Algorithm Pipeline

S3 text files → Read documents
Tokenize (RegexTokenizer) → Remove stopwords → N-grams (3)
Vectorize (HashingTF, 4M features) → Filter empty vectors
MinHashLSH (3 hash tables) → Approximate nearest neighbors
Jaccard distance calculation → Filter pairs < threshold (0.2)
Optional: MD5 exact dedup (removes identical docs before LSH)
Output: sim_final.csv — (doc1, doc2, distance) pairs
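The quantity being thresholded is plain Jaccard distance over n-gram shingles. A minimal pure-Python sketch of that measure (illustrative only; the actual pipeline computes it via MinHashLSH over hashed TF vectors):

```python
def ngrams(text, n=3):
    """Word-level n-grams, mirroring Tokenize -> NGram(3) in the pipeline."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; pairs below the threshold are near-duplicates."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

doc1 = ngrams("the quick brown fox jumps over the lazy dog")
doc2 = ngrams("the quick brown fox leaps over the lazy dog")
d = jaccard_distance(doc1, doc2)
# d == 0.6 here: one changed word breaks three trigrams, so the pair
# is similar but still above the default 0.2 distance threshold
```

With 3-gram shingling, a single substituted word disturbs up to three shingles, which is why the default threshold is fairly strict.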

Stack

  • Language: Python (PySpark)
  • Compute: AWS EMR (Spark on YARN)
  • ML: PySpark MLlib (MinHashLSH, HashingTF, RegexTokenizer, StopWordsRemover, NGram)
  • Storage: S3 (input text files, intermediate Parquet, output CSV)
  • Deploy: Shell script (deploy.sh)

Configuration

Parameter            Default    Description
--sim-threshold      0.2        Jaccard distance threshold (lower = more similar)
--num-tables         3          Number of LSH hash tables (more = better recall, slower)
--ngram-order        3          N-gram size for tokenization
--vocab-size         262144     Vocabulary size for the count vectorizer
--num-features       4194304    Feature space for the hashing vectorizer
--skew-threshold     500        Bucket size that triggers salted joins (handles data skew)
--vectorizer-type    hashing    count or hashing
--dedupe-exact-md5   true       MD5 exact dedup before LSH
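A hypothetical sketch of the CLI surface these parameters imply, using argparse; flag names and defaults are taken from the table above, but the actual parser in run_neardupe.py is not shown in this doc:

```python
import argparse

# Hypothetical argparse sketch; flags and defaults mirror the table above.
parser = argparse.ArgumentParser(description="Near-duplicate detection (LSH) job")
parser.add_argument("--sim-threshold", type=float, default=0.2)
parser.add_argument("--num-tables", type=int, default=3)
parser.add_argument("--ngram-order", type=int, default=3)
parser.add_argument("--vocab-size", type=int, default=262144)
parser.add_argument("--num-features", type=int, default=4194304)
parser.add_argument("--skew-threshold", type=int, default=500)
parser.add_argument("--vectorizer-type", choices=["count", "hashing"], default="hashing")
parser.add_argument("--dedupe-exact-md5", action=argparse.BooleanOptionalAction, default=True)

args = parser.parse_args([])  # no flags given: all defaults apply
```

`BooleanOptionalAction` (Python 3.9+) is an assumption here; it gives a paired `--no-dedupe-exact-md5` flag for free.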

EMR Cluster

Component         Spec
Master            m5.4xlarge (16 vCPUs, 64 GB)
Workers           16 × c5.4xlarge (16 vCPUs, 32 GB) — compute optimized
Total cores       240 (16 workers × 3 executors × 5 cores)
Executor memory   8g + 1g overhead + 1g off-heap
Driver            20g memory, 8 cores
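The core and memory totals are simple arithmetic; a quick sanity check, assuming the stated 3 executors per worker at 5 cores each:

```python
workers = 16
executors_per_worker = 3
cores_per_executor = 5

total_cores = workers * executors_per_worker * cores_per_executor          # 240

# Each c5.4xlarge worker has 16 vCPUs; 3 executors × 5 cores = 15,
# leaving 1 vCPU per node for YARN and the OS.
executor_cores_per_node = executors_per_worker * cores_per_executor        # 15

# Memory per node: 3 executors × (8g heap + 1g overhead + 1g off-heap)
# = 30 GB of the node's 32 GB.
memory_per_node_gb = executors_per_worker * (8 + 1 + 1)                    # 30
```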

S3 Layout

s3://{bucket}/
├── data/{case}/                  # Input text files + output sim_final.csv
├── scratch/{case}/               # Intermediate Parquet files
│   ├── raw/                      # Raw document data
│   ├── tokens/                   # Tokenized text
│   ├── stopwords/                # After stopword removal
│   ├── ngrams/                   # N-gram features
│   ├── vectorizer/               # TF vectors
│   ├── lsh/                      # LSH hash buckets
│   └── sim/                      # Similarity pairs
├── neardupe/scripts/             # PySpark code
└── emr-logs/                     # Cluster logs

Cost Comparison

Dataset   Documents   Time     Databricks   EMR      Savings
Small     76K         10 min   ~$4.10       $1.94    52%
Medium    273K        23 min   ~$9.40       $4.47    52%
Large     1M          83 min   ~$33.80      $16.10   52%

Key Design Decisions

LSH over Brute-Force Comparison

Brute-force pairwise comparison is O(n²). LSH reduces this to approximate nearest neighbors by hashing similar documents to the same buckets. Multiple hash tables (default 3) improve recall while keeping cost manageable.
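To see why bucketing beats brute force, here is a toy pure-Python min-hash bucketing sketch. This is not PySpark's MinHashLSH implementation (hash functions and banding are simplified for illustration): only documents that collide in at least one table are compared.

```python
import random

def minhash_signature(shingles, num_hashes=3, seed=42):
    """One min-hash per table: the minimum of a salted hash over the shingle set."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, s)) for s in shingles) for salt in salts]

def candidate_pairs(docs, num_hashes=3):
    """Group docs whose signatures collide in any table; compare only those."""
    buckets = {}
    for doc_id, shingles in docs.items():
        for table, h in enumerate(minhash_signature(shingles, num_hashes)):
            buckets.setdefault((table, h), []).append(doc_id)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add(tuple(sorted((members[i], members[j]))))
    return pairs

docs = {
    "a": {"the quick brown", "quick brown fox"},
    "b": {"the quick brown", "quick brown fox"},   # identical to a
    "c": {"completely different", "shingle set"},
}
pairs = candidate_pairs(docs)
# identical docs a and b always collide; c is very unlikely to pair with either
```

Each extra table is another chance for similar documents to collide, which is the recall/cost trade-off behind --num-tables.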

Salted Joins for Data Skew

When LSH buckets are highly skewed (one bucket has many documents), joins become slow. The --skew-threshold parameter triggers salted joins — adding random salt to large buckets to distribute the join across partitions.
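A minimal sketch of the salting idea in plain Python, assuming an illustrative fan-out of 8 salts (the real job's fan-out is not stated here): rows in oversized buckets get a random salt appended to the join key, and the other join side is replicated once per salt value.

```python
import random

SKEW_THRESHOLD = 500   # mirrors --skew-threshold
NUM_SALTS = 8          # illustrative fan-out, an assumption
rng = random.Random(0)

def salt_key(bucket_key, bucket_size):
    """For oversized LSH buckets, append a random salt so rows spread
    across NUM_SALTS partitions instead of piling onto one."""
    if bucket_size <= SKEW_THRESHOLD:
        return (bucket_key, 0)
    return (bucket_key, rng.randrange(NUM_SALTS))

def explode_for_join(bucket_key):
    """Replicate the small side of the join once per salt value so every
    salted partition still finds its match."""
    return [(bucket_key, salt) for salt in range(NUM_SALTS)]
```

The trade-off: the small side is duplicated NUM_SALTS times, but the hot bucket's work is spread over NUM_SALTS tasks instead of one straggler.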

Hashing Vectorizer over Count Vectorizer

Default is hashing (HashingTF) rather than count (CountVectorizer) because:

  • No vocabulary fitting step is needed (faster)
  • The fixed feature space (4M) scales to any corpus size
  • The slight accuracy trade-off is acceptable for near-dupe detection
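The "no vocabulary fitting" point comes from the hashing trick: a token's column index is just its hash modulo the feature space. A toy illustration using Python's built-in hash (PySpark's HashingTF uses MurmurHash3, so this is a simplification):

```python
NUM_FEATURES = 4_194_304  # matches --num-features (4M)

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Term-frequency vector as a sparse dict: index = hash(token) % num_features.
    No vocabulary pass is needed, so any corpus maps into a fixed-size space;
    occasional hash collisions are the accuracy trade-off mentioned above."""
    vec = {}
    for tok in tokens:
        idx = hash(tok) % num_features
        vec[idx] = vec.get(idx, 0) + 1
    return vec

vec = hashing_tf(["quick", "brown", "quick"])
# both occurrences of "quick" land in the same slot
```

A count vectorizer would instead need a full pass over the corpus to fit the vocabulary before any vector can be produced.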

EMR over Databricks

Migrating from Databricks to EMR for:

  • 52% cost reduction (no DBU markup)
  • Compute-optimized instances (c5.4xlarge vs memory-optimized)
  • Native AWS integration (S3, CloudWatch, IAM)
  • Command-line automation (no notebook UI dependency)

Integration with Nextpoint Platform

Current: Manual process — search text pushed to S3, Databricks notebook run, results copied back.

Future: Could be triggered by Rails (similar to search-hit-report-backend) via Lambda → EMR step submission, with results written to near_dupe_info JSON column on exhibits table.

Key File Locations

File                      Purpose
scripts/lsh_core.py       Core LSH algorithm — LSH class with full pipeline
scripts/run_neardupe.py   EMR job runner with CLI argument parsing
config/emr-cluster.json   EMR cluster specification
deployment/deploy.sh      Automated deployment script