Reference Implementation: search-hit-report-backend¶
Overview¶
The Search Hit Report Backend is a serverless service that generates search hit reports for the Nextpoint platform. Given a set of search terms, it executes them against Elasticsearch, identifies matching documents, converts results to Parquet for analytics, sets up Athena tables, and calls back to Rails to apply bulk actions.
EDRM Stage: 7 (Analysis) — search hit analysis and bulk tagging. Suite: Common (both Discovery and Litigation). Maturity: Early/prototype — TODOs, hardcoded staging values, shell-script deploy.
Architecture¶
search-hit-report-backend/
├── lambda-functions/
│ └── query-function/
│ ├── lambda_function.rb # Main Lambda handler (Ruby 3.3.7)
│ ├── lib/
│ │ ├── elasticsearch.rb # ES client — multi-search with PIT deep pagination
│ │ ├── elasticsearch/
│ │ │ └── results.rb # ES hits → CSV conversion
│ │ ├── athena.rb # Athena client — create DB, tables, run queries
│ │ ├── athena/
│ │ │ ├── query.rb # Query execution with polling
│ │ │ └── query/results.rb # Result fetching from S3
│ │ ├── glue/
│ │ │ ├── etl.rb # Glue ETL job launcher
│ │ │ └── etl/job.rb # Job status polling and await
│ │ ├── nextpoint_api.rb # HMAC-SHA1 XML API client for Rails callback
│ │ └── nextpoint_api/
│ │ └── request.rb # API request builder with HMAC signing
│ ├── deploy.sh # Full deploy script (Docker, ECR, IAM, Lambda)
│ ├── Dockerfile # Ruby 3.3 Lambda container
│ ├── Gemfile # aws-sdk-lambda, elasticsearch ~> 7.0, xml-simple
│ └── run_shr.rb # Rails console test harness
├── glue/
│ ├── scripts/
│ │ ├── convert_search_hits_to_parquet.py # PySpark: CSV → Parquet (search hits)
│ │ └── convert_exhibits_to_parquet.py # PySpark: CSV → Parquet (exhibits)
│ └── glue.sh # Glue deployment script
└── utils/
└── nextpoint_s3.sh # Region/env S3 bucket name resolution
Processing Pipeline¶
Single Lambda invocation (15-minute timeout, 4GB memory):
1. Setup Athena database + external tables (Parquet → SQL)
│
2. Start Glue ETL job: exhibits CSV → Parquet (runs in parallel with 3-5)
│
3. Open Elasticsearch Point-in-Time (PIT) cursor on case exhibit index
│
4. Execute all search queries via ES _msearch API
(deep pagination: 10K per page, search_after cursor)
Collect matching exhibit IDs per search term
│
5. Write search hit CSVs to S3: case_{id}/search_hit_reporting/csv/search_hits/
│
6. Start Glue ETL job: search hit CSVs → Parquet
│
7. Await completion of both Glue jobs
│
8. POST to Rails /documents/bulk_update (HMAC-SHA1 signed XML)
with S3 location of results + pass-through params (tags, folders)
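Steps 3–4 rely on Elasticsearch PIT (point-in-time) deep pagination: a stable sort plus a `search_after` cursor lets each query page past the 10K window. A minimal sketch of how one page's request body might be assembled — the helper name and payload shape are assumptions; the real client lives in `lib/elasticsearch.rb`:

```ruby
# Hypothetical builder for one page of a PIT search body.
# pit_id:       cursor returned by the open-PIT call (step 3)
# query:        one pre-parsed search-term query
# search_after: sort values of the last hit on the previous page (nil on page 1)
def pit_page_body(pit_id:, query:, search_after: nil, page_size: 10_000)
  body = {
    size: page_size,
    query: query,
    pit: { id: pit_id, keep_alive: "1m" },
    sort: [{ _shard_doc: "asc" }],   # stable tie-breaker required for search_after
    _source: false                   # only exhibit IDs are needed downstream
  }
  body[:search_after] = search_after if search_after
  body
end

# Page 1 has no cursor; page 2 resumes from the last hit's sort values.
page1 = pit_page_body(pit_id: "abc", query: { match: { text: "contract" } })
page2 = pit_page_body(pit_id: "abc",
                      query: { match: { text: "contract" } },
                      search_after: [10_000])
```

The loop repeats until a page returns fewer than `page_size` hits, then the PIT is closed.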
Pattern Mapping¶
| Pattern | search-hit-report Implementation | Standard NGE Pattern |
|---|---|---|
| Language | Ruby 3.3.7 (Lambda) + Python (Glue) | Python 3.10+ |
| Invocation | Lambda async invoke (Event) from Rails | SNS/SQS event-driven |
| Search | ES 7.x multi-search with PIT deep pagination | N/A |
| ETL | AWS Glue PySpark (CSV → Parquet) | AWS Glue (documentexchanger uses for DB ETL) |
| Analytics | Athena external tables over Parquet | Athena (PSM uses for event queries) |
| Rails callback | HMAC-SHA1 XML API (same as Legacy workers) | Direct DB access |
| Infrastructure | Shell-script deploy (no CDK) | AWS CDK (TypeScript) |
| Architecture | Monolithic lambda_function.rb | Hexagonal core/shell |
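The Athena pattern above amounts to a `CREATE DATABASE` plus external tables whose `LOCATION` points at the Parquet output in S3. A hedged sketch of what the table DDL might look like — the table name, columns, and Parquet path are illustrative assumptions; the real DDL is built in `lib/athena.rb`:

```ruby
# Hypothetical DDL builder for a search-hits external table.
# Column names and the parquet/ S3 layout are guesses, not the real schema.
def search_hits_table_ddl(database:, bucket:, npcase_id:)
  location = "s3://#{bucket}/case_#{npcase_id}/search_hit_reporting/parquet/search_hits/"
  <<~SQL
    CREATE EXTERNAL TABLE IF NOT EXISTS #{database}.search_hits (
      exhibit_id  bigint,
      search_term string
    )
    STORED AS PARQUET
    LOCATION '#{location}'
  SQL
end
```

The handler would submit this via `Aws::Athena::Client#start_query_execution` and poll for completion, which is what `lib/athena/query.rb` wraps.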
Integration with Rails¶
Invoked from Rails (SearchHitReportSearcherJob):
# Rails Sidekiq job invokes Lambda asynchronously
Aws::Lambda::Client.new.invoke(
function_name: "search-hit-report-query-function-#{env}",
invocation_type: 'Event',
payload: {
npcase_id: case_id,
search_terms: [...], # Pre-parsed ES queries
search_hit_report_id: report_id,
search_hit_params: { tags: [...], folders: [...] }
}.to_json
)
Callbacks to Rails:
- POST /documents/bulk_update with HMAC-SHA1 signed XML body
- Passes S3 path to results + search_hit_params for bulk tagging/foldering
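The callback reuses the Legacy-worker HMAC-SHA1 scheme. A minimal signing sketch, assuming a canonical string of verb, path, timestamp, and body digest — the actual canonical string and header names may differ; the real builder is `lib/nextpoint_api/request.rb`:

```ruby
require "openssl"
require "base64"
require "time"

# Sign an XML request body with HMAC-SHA1, Legacy-worker style.
# The canonical-string layout and header names below are assumptions;
# check lib/nextpoint_api/request.rb for the real scheme.
def sign_request(secret:, verb:, path:, body:, timestamp: Time.now.utc.iso8601)
  body_digest = OpenSSL::Digest::SHA1.hexdigest(body)
  canonical   = [verb.upcase, path, timestamp, body_digest].join("\n")
  signature   = Base64.strict_encode64(
    OpenSSL::HMAC.digest("SHA1", secret, canonical)
  )
  { "X-NP-Timestamp" => timestamp, "X-NP-Signature" => signature }
end

headers = sign_request(secret: ENV.fetch("NP_API_SECRET", "dev-secret"),
                       verb: "POST",
                       path: "/documents/bulk_update",
                       body: "<bulk_update/>")
```

Rails recomputes the signature from its copy of the secret and rejects the request on mismatch, so no credentials travel with the payload.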
Shared infrastructure:
- Same ES cluster (index: {env}_{npcase_id}_exhibits)
- Same S3 bucket ({prefix}-trialmanager-{env})
- S3 path convention: case_{npcase_id}/search_hit_reporting/
Divergences from Standard NGE Patterns¶
| Aspect | search-hit-report | Standard NGE |
|---|---|---|
| Language | Ruby (matches Rails, not NGE Python) | Python |
| Deploy | Shell scripts + AWS CLI | CDK |
| Architecture | No core/shell boundary | Hexagonal |
| API communication | HMAC-SHA1 XML (Legacy pattern) | Direct DB or SNS events |
| Testing | No automated tests | pytest |
| Maturity | Prototype (TODOs, hardcoded values) | Production |
Key File Locations¶
| File | Purpose |
|---|---|
| lambda-functions/query-function/lambda_function.rb | Main Lambda handler |
| lambda-functions/query-function/lib/elasticsearch.rb | ES multi-search with PIT |
| lambda-functions/query-function/lib/nextpoint_api.rb | HMAC-SHA1 Rails callback |
| lambda-functions/query-function/lib/athena.rb | Athena database/table management |
| lambda-functions/query-function/lib/glue/etl.rb | Glue job launcher |
| glue/scripts/convert_search_hits_to_parquet.py | PySpark ETL script |
| lambda-functions/query-function/run_shr.rb | Rails invocation example |