Reference Implementation: search-hit-report-backend

Overview

The Search Hit Report Backend is a serverless service that generates search hit reports for the Nextpoint platform. Given a set of search terms, it executes them against Elasticsearch, identifies matching documents, converts results to Parquet for analytics, sets up Athena tables, and calls back to Rails to apply bulk actions.

- EDRM Stage: 7 (Analysis) — search hit analysis and bulk tagging.
- Suite: Common (both Discovery and Litigation).
- Maturity: Early/prototype — TODOs, hardcoded staging values, shell-script deploy.

Architecture

search-hit-report-backend/
├── lambda-functions/
│   └── query-function/
│       ├── lambda_function.rb           # Main Lambda handler (Ruby 3.3.7)
│       ├── lib/
│       │   ├── elasticsearch.rb         # ES client — multi-search with PIT deep pagination
│       │   ├── elasticsearch/
│       │   │   └── results.rb           # ES hits → CSV conversion
│       │   ├── athena.rb               # Athena client — create DB, tables, run queries
│       │   ├── athena/
│       │   │   ├── query.rb            # Query execution with polling
│       │   │   └── query/results.rb    # Result fetching from S3
│       │   ├── glue/
│       │   │   ├── etl.rb             # Glue ETL job launcher
│       │   │   └── etl/job.rb         # Job status polling and await
│       │   ├── nextpoint_api.rb        # HMAC-SHA1 XML API client for Rails callback
│       │   └── nextpoint_api/
│       │       └── request.rb          # API request builder with HMAC signing
│       ├── deploy.sh                    # Full deploy script (Docker, ECR, IAM, Lambda)
│       ├── Dockerfile                   # Ruby 3.3 Lambda container
│       ├── Gemfile                      # aws-sdk-lambda, elasticsearch ~> 7.0, xml-simple
│       └── run_shr.rb                  # Rails console test harness
├── glue/
│   ├── scripts/
│   │   ├── convert_search_hits_to_parquet.py  # PySpark: CSV → Parquet (search hits)
│   │   └── convert_exhibits_to_parquet.py     # PySpark: CSV → Parquet (exhibits)
│   └── glue.sh                         # Glue deployment script
└── utils/
    └── nextpoint_s3.sh                  # Region/env S3 bucket name resolution

Processing Pipeline

Single Lambda invocation (15-minute timeout, 4GB memory):

1. Setup Athena database + external tables (Parquet → SQL)
2. Start Glue ETL job: exhibits CSV → Parquet (runs in parallel with steps 3-5)
3. Open Elasticsearch Point-in-Time (PIT) cursor on case exhibit index
4. Execute all search queries via ES _msearch API
   (deep pagination: 10K per page, search_after cursor)
   Collect matching exhibit IDs per search term
5. Write search hit CSVs to S3: case_{id}/search_hit_reporting/csv/search_hits/
6. Start Glue ETL job: search hit CSVs → Parquet
7. Await both Glue jobs completion
8. POST to Rails /documents/bulk_update (HMAC-SHA1 signed XML)
   with S3 location of results + pass-through params (tags, folders)
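
Steps 3-4 depend on PIT deep pagination because a plain from/size search caps out at 10,000 hits. The cursor loop can be sketched as below; the client is duck-typed so an elasticsearch-ruby 7.x client can be passed in, but the method and field names here are illustrative, not taken from lib/elasticsearch.rb:

```ruby
PAGE_SIZE = 10_000

# Collect all matching document IDs for one query using a Point-in-Time
# cursor with search_after pagination. `client` is any object exposing the
# elasticsearch-ruby PIT API (open_point_in_time, search, close_point_in_time).
def collect_hit_ids(client, index:, query:)
  pit_id = client.open_point_in_time(index: index, keep_alive: "5m")["id"]
  ids = []
  search_after = nil
  loop do
    body = {
      size: PAGE_SIZE,
      query: query,
      pit: { id: pit_id, keep_alive: "5m" },
      sort: [{ _shard_doc: "asc" }]  # shard-doc tiebreaker keeps search_after stable
    }
    body[:search_after] = search_after if search_after
    hits = client.search(body: body).dig("hits", "hits")
    break if hits.empty?
    ids.concat(hits.map { |h| h["_id"] })
    search_after = hits.last["sort"]  # cursor for the next page
  end
  client.close_point_in_time(body: { id: pit_id })
  ids
end
```

The real implementation batches multiple search terms through _msearch; the PIT/search_after mechanics per query are the same.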

Pattern Mapping

| Pattern | search-hit-report Implementation | Standard NGE Pattern |
|---|---|---|
| Language | Ruby 3.3.7 (Lambda) + Python (Glue) | Python 3.10+ |
| Invocation | Lambda async invoke (Event) from Rails | SNS/SQS event-driven |
| Search | ES 7.x multi-search with PIT deep pagination | N/A |
| ETL | AWS Glue PySpark (CSV → Parquet) | AWS Glue (documentexchanger uses it for DB ETL) |
| Analytics | Athena external tables over Parquet | Athena (PSM uses it for event queries) |
| Rails callback | HMAC-SHA1 XML API (same as Legacy workers) | Direct DB access |
| Infrastructure | Shell-script deploy (no CDK) | AWS CDK (TypeScript) |
| Architecture | Monolithic lambda_function.rb | Hexagonal core/shell |
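
The launch-and-await Glue pattern (pipeline steps 2, 6, and 7) reduces to start_job_run plus polling get_job_run until a terminal state. A hedged sketch: the client is duck-typed so an Aws::Glue::Client can be passed in, but the job names, intervals, and timeout below are assumptions, not values from lib/glue/etl.rb:

```ruby
# Terminal Glue job-run states per the Glue API.
GLUE_TERMINAL_STATES = %w[SUCCEEDED FAILED STOPPED TIMEOUT ERROR].freeze

# Start a Glue job and block until it reaches a terminal state.
# `glue` can be an Aws::Glue::Client or anything with the same interface.
# max_wait of 840s keeps the wait inside the 15-minute Lambda timeout.
def run_and_await(glue, job_name, arguments: {}, interval: 15, max_wait: 840)
  run_id = glue.start_job_run(job_name: job_name, arguments: arguments).job_run_id
  waited = 0
  loop do
    state = glue.get_job_run(job_name: job_name, run_id: run_id).job_run.job_run_state
    return state if GLUE_TERMINAL_STATES.include?(state)
    raise "Glue job #{job_name} still #{state} after #{max_wait}s" if waited >= max_wait
    sleep interval
    waited += interval
  end
end
```

In the actual pipeline both jobs are started before awaiting, so the exhibits conversion overlaps the Elasticsearch queries.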

Integration with Rails

Invoked from Rails (SearchHitReportSearcherJob):

# Rails Sidekiq job invokes Lambda asynchronously
Aws::Lambda::Client.new.invoke(
  function_name: "search-hit-report-query-function-#{env}",
  invocation_type: 'Event',
  payload: {
    npcase_id: case_id,
    search_terms: [...],  # Pre-parsed ES queries
    search_hit_report_id: report_id,
    search_hit_params: { tags: [...], folders: [...] }
  }.to_json
)

Callbacks to Rails:

- POST /documents/bulk_update with HMAC-SHA1 signed XML body
- Passes S3 path to results plus search_hit_params for bulk tagging/foldering
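
HMAC-SHA1 request signing of the kind used by the Legacy workers can be sketched with the Ruby stdlib. The canonical-string layout and header names below are assumptions for illustration; the actual wire format lives in lib/nextpoint_api/request.rb:

```ruby
require "openssl"
require "base64"
require "time"

# Build signed headers for an XML API request. The signature covers the HTTP
# verb, path, an MD5 of the body, and a timestamp, HMAC-SHA1'd with a shared
# secret. Canonical-string layout and header names are illustrative.
def signed_headers(secret, verb:, path:, body:, timestamp: Time.now.httpdate)
  canonical = [verb, path, OpenSSL::Digest::MD5.hexdigest(body), timestamp].join("\n")
  signature = Base64.strict_encode64(OpenSSL::HMAC.digest("SHA1", secret, canonical))
  {
    "Date"          => timestamp,
    "Content-Type"  => "application/xml",
    "Authorization" => "NP #{signature}"
  }
end
```

Rails recomputes the same HMAC over the received request and rejects the callback if the signatures differ.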

Shared infrastructure:

- Same ES cluster (index: {env}_{npcase_id}_exhibits)
- Same S3 bucket ({prefix}-trialmanager-{env})
- S3 path convention: case_{npcase_id}/search_hit_reporting/
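
The shared naming conventions are simple enough to spell out as one-liners; these helper names are hypothetical, shown only to make the interpolation explicit:

```ruby
# Hypothetical helpers expressing the shared naming conventions above.
def exhibit_index(env, npcase_id)
  "#{env}_#{npcase_id}_exhibits"
end

def trialmanager_bucket(prefix, env)
  "#{prefix}-trialmanager-#{env}"
end

def report_prefix(npcase_id)
  "case_#{npcase_id}/search_hit_reporting/"
end
```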

Divergences from Standard NGE Patterns

| Aspect | search-hit-report | Standard NGE |
|---|---|---|
| Language | Ruby (matches Rails, not NGE Python) | Python |
| Deploy | Shell scripts + AWS CLI | CDK |
| Architecture | No core/shell boundary | Hexagonal |
| API communication | HMAC-SHA1 XML (Legacy pattern) | Direct DB or SNS events |
| Testing | No automated tests | pytest |
| Maturity | Prototype (TODOs, hardcoded values) | Production |

Key File Locations

| File | Purpose |
|---|---|
| lambda-functions/query-function/lambda_function.rb | Main Lambda handler |
| lambda-functions/query-function/lib/elasticsearch.rb | ES multi-search with PIT |
| lambda-functions/query-function/lib/nextpoint_api.rb | HMAC-SHA1 Rails callback |
| lambda-functions/query-function/lib/athena.rb | Athena database/table management |
| lambda-functions/query-function/lib/glue/etl.rb | Glue job launcher |
| glue/scripts/convert_search_hits_to_parquet.py | PySpark ETL script |
| lambda-functions/query-function/run_shr.rb | Rails invocation example |