Fix NeMo Microservices Installation Issues and Add Helm Hooks #22

Open
Hadar301 wants to merge 15 commits into main from fix/nemo-deployment

@Hadar301
Contributor

This PR fixes critical installation issues in the NeMo Microservices deployment and adds automated Helm hooks to replace manual bash script operations, enabling a cleaner declarative installation process.

Problems Solved

  1. Evaluator CrashLoopBackOff Issue
    Problem: NemoEvaluator pods fail with Init:CrashLoopBackOff because the evaluator-db-migration init container is missing the EVALUATOR_IMAGE environment variable.
    Root Cause: A bug in NIM Operator v3.0.2: it fails to propagate EVALUATOR_IMAGE from the NemoEvaluator CRD to the deployment's init containers.
    Solution: Added a Helm post-install hook that patches the evaluator deployment to add the missing environment variable.
  2. Evaluator NIM Proxy Port Mismatch
    Problem: The evaluator cannot connect to NIM models because the default NIM proxy port changed from 8000 to 80.
    Root Cause: Recent changes set the default to port 80 (for KServe InferenceService), which breaks NIMPipeline deployments that use port 8000.
    Solution: Changed the default back to 8000 in deploy/nemo-instances/values.yaml:239.
  3. Optional Model Deployments
    Problem: The embedding and retriever models are deployed by default and consume two GPUs (one each) even when they are not needed.
    Solution: Added deployment control flags (deployEmbeddingModel, deployRetrieverModel), both disabled by default.
  4. Manual Installation Steps
    Problem: Installation required manually running bash functions for CRD adoption, RBAC patching, and SCC binding.
    Solution: Converted the bash functions to Helm hooks whose execution order is set by hook weights.
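
For illustration, the post-install hook from item 1 could look roughly like the Job below. This is a minimal sketch under stated assumptions, not the manifest from this PR: the Job and ServiceAccount names, the kubectl image, the deployment name nemoevaluator-sample, and the .Values.evaluator.image key are all hypothetical. Only the hook type, the weight (5), the init container name, and the variable name come from the description above.

```yaml
# Hypothetical sketch of a post-install hook Job. Names marked "assumed" are not from the PR.
apiVersion: batch/v1
kind: Job
metadata:
  name: evaluator-init-patch                  # assumed name
  annotations:
    "helm.sh/hook": post-install
    "helm.sh/hook-weight": "5"                # runs after the weight-0 RBAC hook
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  backoffLimit: 3
  template:
    spec:
      serviceAccountName: evaluator-patch     # assumed; created by the RBAC hook
      restartPolicy: Never
      containers:
        - name: patch
          image: registry.example.com/kubectl:latest   # placeholder; any image with kubectl and sh
          command:
            - /bin/sh
            - -c
            - |
              # Strategic merge patch: init containers and env entries are merged by name,
              # so only the missing variable is added to evaluator-db-migration.
              kubectl patch deployment nemoevaluator-sample \
                -n {{ .Release.Namespace }} \
                --type strategic \
                -p '{"spec":{"template":{"spec":{"initContainers":[{"name":"evaluator-db-migration","env":[{"name":"EVALUATOR_IMAGE","value":"{{ .Values.evaluator.image }}"}]}]}}}}'
```

The ServiceAccount used here would need permission to get and patch deployments in the release namespace, which is presumably what the accompanying RBAC hook provides.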

Changes Made
Files Modified

  1. deploy/nemo-instances/values.yaml (renamed from values.yaml.sample)
    Line 239: Changed evaluator NIM proxy port default from "80" to "8000"
    Added lines 281-284: New deployment control flags

      deployEmbeddingModel: false # Controls nimCacheEmbedding + nimPipelineEmbedding (1 GPU)
      deployRetrieverModel: false # Controls nimCacheRetriever + nimPipelineRetriever (1 GPU)

  2. deploy/nemo-infra/values.yaml (renamed from values.yaml.sample)
    Renamed from .sample to the standard Helm values.yaml pattern
  3. deploy/nemo-instances/templates/ - New Helm Hook Files
    Evaluator Init Container Patch:
      evaluator-init-patch-job.yaml - Post-install hook (weight 5) that adds the EVALUATOR_IMAGE env var to the evaluator deployment
      evaluator-patch-rbac.yaml - RBAC resources (weight 0) for the patch job
    SCC Binding:
      post-install-bind-scc-job.yaml - Post-install hook (weight 10) that binds the nemocustomizer-sample SA to nemo-customizer-scc
      post-install-bind-scc-rbac.yaml - RBAC resources (weight 5)
    Resource Adoption:
      pre-install-adopt-resources-job.yaml - Pre-install hook (weight -15) that adopts existing NeMo CRDs and cluster resources
      pre-install-adopt-resources-rbac.yaml - RBAC resources (weight -20)
  4. deploy/nemo-infra/templates/ - New Helm Hook Files
    CRD Adoption:
      pre-install-adopt-crds-job.yaml - Pre-install hook (weight -15) that adopts Argo Workflows and Volcano CRDs
      pre-install-adopt-crds-rbac.yaml - RBAC resources (weight -20)
    SCC Binding:
      post-install-bind-scc-job.yaml - Post-install hook (weight 5) for SCC binding
      post-install-bind-scc-rbac.yaml - RBAC resources (weight 0)
  5. Conditional Wrapping - Model Templates
    Wrapped with deployment control flags:
      nimcache-embedding.yaml - {{- if .Values.deployEmbeddingModel }}
      nimpipeline-embedding.yaml - {{- if .Values.deployEmbeddingModel }}
      nimcache-retriever.yaml - {{- if .Values.deployRetrieverModel }}
      nimpipeline-retriever.yaml - {{- if .Values.deployRetrieverModel }}
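
The conditional wrapping of the model templates might look roughly like this. It is a sketch only: the nimCacheEmbedding.name and nimCacheEmbedding.spec values keys and the API version are assumptions, and the real template body is not shown in this description. The flag name and the opening guard are taken from the list above.

```yaml
# templates/nimcache-embedding.yaml (sketch; the resource body is a placeholder)
{{- if .Values.deployEmbeddingModel }}
# API version assumed for the NIM Operator's NIMCache kind.
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  # Assumed values key for the resource name.
  name: {{ .Values.nimCacheEmbedding.name }}
spec:
  # Assumed values key holding the NIMCache spec.
  {{- toYaml .Values.nimCacheEmbedding.spec | nindent 2 }}
{{- end }}
```

With the flag left at its default of false, Helm renders nothing for this file, so no embedding cache is created and no GPU is claimed. Setting the flag to true (for example with --set deployEmbeddingModel=true at install time) renders the resource again.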

Helm Hook Execution Order

  • nemo-infra

    • Pre-install (weight -20): Create RBAC resources
    • Pre-install (weight -15): Adopt CRDs job
    • Main install (not a hook, so no weight applies): Deploy regular resources
    • Post-install (weight 0): Create SCC binding RBAC
    • Post-install (weight 5): Run SCC binding job

  • nemo-instances

    • Pre-install (weight -20): Create RBAC resources
    • Pre-install (weight -15): Adopt resources job (deletes existing NeMo CRDs, adopts cluster resources)
    • Main install (not a hook, so no weight applies): Deploy NeMo CRDs and resources
    • Post-install (weight 0): Create evaluator patch RBAC
    • Post-install (weight 5): Run evaluator patch job (adds EVALUATOR_IMAGE)
    • Post-install (weight 5): Create SCC binding RBAC
    • Post-install (weight 10): Run SCC binding job
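
To make the weight ordering concrete, the nemo-infra pre-install pair could be sketched as below. Helm runs hooks of the same type in ascending weight order, so the ServiceAccount (weight -20) exists before the Job (weight -15) starts. Everything named here is an assumption for illustration: the adopt-crds name, the kubectl image, and the choice of workflows.argoproj.io as the example CRD. The label and annotation keys are the ownership metadata Helm checks before it takes over an existing resource, which is one common way to do adoption and may differ from what the PR's job does. The ClusterRole and ClusterRoleBinding the Job would need are omitted for brevity.

```yaml
# Hypothetical sketch of weight-ordered pre-install hooks; names and image are assumed.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: adopt-crds                            # assumed name
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "-20"              # lower weights run first
---
apiVersion: batch/v1
kind: Job
metadata:
  name: adopt-crds                            # assumed name
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "-15"              # runs after the weight -20 RBAC hook
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      serviceAccountName: adopt-crds
      restartPolicy: Never
      containers:
        - name: adopt
          image: registry.example.com/kubectl:latest   # placeholder; any image with kubectl and sh
          command:
            - /bin/sh
            - -c
            - |
              # If the CRD already exists, mark it as owned by this release so Helm can adopt it.
              crd=workflows.argoproj.io
              if kubectl get crd "$crd" >/dev/null 2>&1; then
                kubectl label crd "$crd" app.kubernetes.io/managed-by=Helm --overwrite
                kubectl annotate crd "$crd" --overwrite \
                  meta.helm.sh/release-name={{ .Release.Name }} \
                  meta.helm.sh/release-namespace={{ .Release.Namespace }}
              fi
```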

@Hadar301 Hadar301 self-assigned this Apr 15, 2026
@Hadar301 Hadar301 requested review from Prudhvivuda, rhkp and swati-kale and removed request for Prudhvivuda April 16, 2026 15:30
@Hadar301 Hadar301 marked this pull request as ready for review April 16, 2026 15:31