Reference Implementation: search-hit-report-backend

Overview

The Search Hit Report Backend is a serverless service that generates search hit reports for the Nextpoint platform. Given a set of search terms, it executes them against Elasticsearch, identifies matching documents, converts results to Parquet for analytics, sets up Athena tables, and calls back to Rails to apply bulk actions.

- EDRM Stage: 7 (Analysis) — search hit analysis and bulk tagging.
- Suite: Common (both Discovery and Litigation).
- Maturity: Early/prototype — TODOs, hardcoded staging values, shell-script deploy.

Architecture

search-hit-report-backend/
├── lambda-functions/
│   └── query-function/
│       ├── lambda_function.rb           # Main Lambda handler (Ruby 3.3.7)
│       ├── lib/
│       │   ├── elasticsearch.rb         # ES client — multi-search with PIT deep pagination
│       │   ├── elasticsearch/
│       │   │   └── results.rb           # ES hits → CSV conversion
│       │   ├── athena.rb               # Athena client — create DB, tables, run queries
│       │   ├── athena/
│       │   │   ├── query.rb            # Query execution with polling
│       │   │   └── query/results.rb    # Result fetching from S3
│       │   ├── glue/
│       │   │   ├── etl.rb             # Glue ETL job launcher
│       │   │   └── etl/job.rb         # Job status polling and await
│       │   ├── nextpoint_api.rb        # HMAC-SHA1 XML API client for Rails callback
│       │   └── nextpoint_api/
│       │       └── request.rb          # API request builder with HMAC signing
│       ├── deploy.sh                    # Full deploy script (Docker, ECR, IAM, Lambda)
│       ├── Dockerfile                   # Ruby 3.3 Lambda container
│       ├── Gemfile                      # aws-sdk-lambda, elasticsearch ~> 7.0, xml-simple
│       └── run_shr.rb                  # Rails console test harness
├── glue/
│   ├── scripts/
│   │   ├── convert_search_hits_to_parquet.py  # PySpark: CSV → Parquet (search hits)
│   │   └── convert_exhibits_to_parquet.py     # PySpark: CSV → Parquet (exhibits)
│   └── glue.sh                         # Glue deployment script
└── utils/
    └── nextpoint_s3.sh                  # Region/env S3 bucket name resolution

Processing Pipeline

Single Lambda invocation (15-minute timeout, 4GB memory):

1. Setup Athena database + external tables (Parquet → SQL)
2. Start Glue ETL job: exhibits CSV → Parquet (runs in parallel with steps 3-5)
3. Open Elasticsearch Point-in-Time (PIT) cursor on case exhibit index
4. Execute all search queries via ES _msearch API
   (deep pagination: 10K per page, search_after cursor)
   Collect matching exhibit IDs per search term
5. Write search hit CSVs to S3: case_{id}/search_hit_reporting/csv/search_hits/
6. Start Glue ETL job: search hit CSVs → Parquet
7. Await both Glue jobs completion
8. POST to Rails /documents/bulk_update (HMAC-SHA1 signed XML)
   with S3 location of results + pass-through params (tags, folders)
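
Steps 3-4 depend on PIT deep pagination because a plain from/size search caps out at 10,000 hits. The cursor loop can be sketched as below; the client is duck-typed so an elasticsearch-ruby 7.x client can be passed in, but the method and field names here are illustrative, not taken from lib/elasticsearch.rb:

```ruby
PAGE_SIZE = 10_000

# Collect all matching document IDs for one query using a Point-in-Time
# cursor with search_after pagination. `client` is any object exposing the
# elasticsearch-ruby PIT API (open_point_in_time, search, close_point_in_time).
def collect_hit_ids(client, index:, query:)
  pit_id = client.open_point_in_time(index: index, keep_alive: "5m")["id"]
  ids = []
  search_after = nil
  loop do
    body = {
      size: PAGE_SIZE,
      query: query,
      pit: { id: pit_id, keep_alive: "5m" },
      sort: [{ _shard_doc: "asc" }]  # shard-doc tiebreaker keeps search_after stable
    }
    body[:search_after] = search_after if search_after
    hits = client.search(body: body).dig("hits", "hits")
    break if hits.empty?
    ids.concat(hits.map { |h| h["_id"] })
    search_after = hits.last["sort"]  # cursor for the next page
  end
  client.close_point_in_time(body: { id: pit_id })
  ids
end
```

The real implementation batches multiple search terms through _msearch; the PIT/search_after mechanics per query are the same.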

Pattern Mapping

| Pattern | search-hit-report Implementation | Standard NGE Pattern |
|---|---|---|
| Language | Ruby 3.3.7 (Lambda) + Python (Glue) | Python 3.10+ |
| Invocation | Lambda async invoke (Event) from Rails | SNS/SQS event-driven |
| Search | ES 7.x multi-search with PIT deep pagination | N/A |
| ETL | AWS Glue PySpark (CSV → Parquet) | AWS Glue (documentexchanger uses it for DB ETL) |
| Analytics | Athena external tables over Parquet | Athena (PSM uses it for event queries) |
| Rails callback | HMAC-SHA1 XML API (same as Legacy workers) | Direct DB access |
| Infrastructure | Shell-script deploy (no CDK) | AWS CDK (TypeScript) |
| Architecture | Monolithic lambda_function.rb | Hexagonal core/shell |
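
The launch-and-await Glue pattern (pipeline steps 2, 6, and 7) reduces to start_job_run plus polling get_job_run until a terminal state. A hedged sketch: the client is duck-typed so an Aws::Glue::Client can be passed in, but the job names, intervals, and timeout below are assumptions, not values from lib/glue/etl.rb:

```ruby
# Terminal Glue job-run states per the Glue API.
GLUE_TERMINAL_STATES = %w[SUCCEEDED FAILED STOPPED TIMEOUT ERROR].freeze

# Start a Glue job and block until it reaches a terminal state.
# `glue` can be an Aws::Glue::Client or anything with the same interface.
# max_wait of 840s keeps the wait inside the 15-minute Lambda timeout.
def run_and_await(glue, job_name, arguments: {}, interval: 15, max_wait: 840)
  run_id = glue.start_job_run(job_name: job_name, arguments: arguments).job_run_id
  waited = 0
  loop do
    state = glue.get_job_run(job_name: job_name, run_id: run_id).job_run.job_run_state
    return state if GLUE_TERMINAL_STATES.include?(state)
    raise "Glue job #{job_name} still #{state} after #{max_wait}s" if waited >= max_wait
    sleep interval
    waited += interval
  end
end
```

In the actual pipeline both jobs are started before awaiting, so the exhibits conversion overlaps the Elasticsearch queries.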

Integration with Rails

Invoked from Rails (SearchHitReportSearcherJob):

# Rails Sidekiq job invokes Lambda asynchronously
Aws::Lambda::Client.new.invoke(
  function_name: "search-hit-report-query-function-#{env}",
  invocation_type: 'Event',
  payload: {
    npcase_id: case_id,
    search_terms: [...],  # Pre-parsed ES queries
    search_hit_report_id: report_id,
    search_hit_params: { tags: [...], folders: [...] }
  }.to_json
)

Callbacks to Rails:

- POST /documents/bulk_update with HMAC-SHA1 signed XML body
- Passes S3 path to results plus search_hit_params for bulk tagging/foldering
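
HMAC-SHA1 request signing of the kind used by the Legacy workers can be sketched with the Ruby stdlib. The canonical-string layout and header names below are assumptions for illustration; the actual wire format lives in lib/nextpoint_api/request.rb:

```ruby
require "openssl"
require "base64"
require "time"

# Build signed headers for an XML API request. The signature covers the HTTP
# verb, path, an MD5 of the body, and a timestamp, HMAC-SHA1'd with a shared
# secret. Canonical-string layout and header names are illustrative.
def signed_headers(secret, verb:, path:, body:, timestamp: Time.now.httpdate)
  canonical = [verb, path, OpenSSL::Digest::MD5.hexdigest(body), timestamp].join("\n")
  signature = Base64.strict_encode64(OpenSSL::HMAC.digest("SHA1", secret, canonical))
  {
    "Date"          => timestamp,
    "Content-Type"  => "application/xml",
    "Authorization" => "NP #{signature}"
  }
end
```

Rails recomputes the same HMAC over the received request and rejects the callback if the signatures differ.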

Shared infrastructure:

- Same ES cluster (index: {env}_{npcase_id}_exhibits)
- Same S3 bucket ({prefix}-trialmanager-{env})
- S3 path convention: case_{npcase_id}/search_hit_reporting/
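
The shared naming conventions are simple enough to spell out as one-liners; these helper names are hypothetical, shown only to make the interpolation explicit:

```ruby
# Hypothetical helpers expressing the shared naming conventions above.
def exhibit_index(env, npcase_id)
  "#{env}_#{npcase_id}_exhibits"
end

def trialmanager_bucket(prefix, env)
  "#{prefix}-trialmanager-#{env}"
end

def report_prefix(npcase_id)
  "case_#{npcase_id}/search_hit_reporting/"
end
```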

Divergences from Standard NGE Patterns

| Aspect | search-hit-report | Standard NGE |
|---|---|---|
| Language | Ruby (matches Rails, not NGE Python) | Python |
| Deploy | Shell scripts + AWS CLI | CDK |
| Architecture | No core/shell boundary | Hexagonal |
| API communication | HMAC-SHA1 XML (Legacy pattern) | Direct DB or SNS events |
| Testing | No automated tests | pytest |
| Maturity | Prototype (TODOs, hardcoded values) | Production |

Key File Locations

| File | Purpose |
|---|---|
| lambda-functions/query-function/lambda_function.rb | Main Lambda handler |
| lambda-functions/query-function/lib/elasticsearch.rb | ES multi-search with PIT |
| lambda-functions/query-function/lib/nextpoint_api.rb | HMAC-SHA1 Rails callback |
| lambda-functions/query-function/lib/athena.rb | Athena database/table management |
| lambda-functions/query-function/lib/glue/etl.rb | Glue job launcher |
| glue/scripts/convert_search_hits_to_parquet.py | PySpark ETL script |
| lambda-functions/query-function/run_shr.rb | Rails invocation example |