Feat/bge m3 migration #184
Open
ODenteAzul wants to merge 8 commits into
added 8 commits
June 16, 2026 10:44
Add support for migrating from mpnet-768d to BGE-M3-1024d embeddings with zero-downtime dual-index strategy.
Database (PostgreSQL):
- Migration 004: Add content_embedding (1024d) + embedding_model_version
- Rename existing content_embedding → content_embedding_legacy (768d)
- Create HNSW index for new BGE-M3 embeddings
- Add migration tracking index for batch processing
Typesense:
- Update collection schema to support dual embeddings (768d + 1024d)
- Add embedding_model_version field for tracking
- Update indexer to sync both embedding fields
Models:
- Add content_embedding_legacy, content_embedding, embedding_model_version
- Update News and NewsInsert Pydantic models
Migration Strategy:
- New articles: use BGE-M3 (1024d) immediately
- Existing articles: gradual migration via DAG (10k/day)
- Collection will be recreated with new schema (requires manual step)
Rollback:
- scripts/migrations/004_rollback.sql to revert if needed
Related:
- destaquesgovbr/embeddings#1 (API changes)
- destaquesgovbr/data-science#1 (model validation)
- #175
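A minimal sketch of how the gradual migration could pick its daily batch using embedding_model_version. The version string "bge-m3" and the helper name pending_batch are assumptions for illustration, not taken from the PR; rows are plain dicts standing in for database records.

```python
from typing import Iterable

TARGET_VERSION = "bge-m3"  # hypothetical value stored in embedding_model_version
DAILY_BATCH = 10_000       # from the commit message: 10k articles/day


def pending_batch(rows: Iterable[dict], limit: int = DAILY_BATCH) -> list:
    """Return up to `limit` rows that have not been migrated to the target model."""
    batch = []
    for row in rows:
        # A missing or different version means the row still has only the legacy embedding.
        if row.get("embedding_model_version") != TARGET_VERSION:
            batch.append(row)
            if len(batch) >= limit:
                break
    return batch
```

In a real database the same filter would presumably run as a query served by the migration tracking index rather than in Python.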
Add complete offline migration pipeline for mpnet → BGE-M3 embeddings
using GPU (EC2 L4).
Scripts:
- migrate_to_bge_m3.py: Main migration script with GPU support
  - generate: Create embeddings from dump
  - upload: Bulk upload to PostgreSQL
  - full: Complete pipeline
  - Features: checkpoints, resume, progress bars, error handling
- dump_articles_for_migration.sql: Export articles from PostgreSQL
- csv_to_parquet.py: Convert CSV → Parquet (compression + speed)
- test_local.sh: Local validation script with sample data
- requirements.txt: Python dependencies
- README.md: Complete documentation (30+ pages)
Architecture:
- Offline processing (zero impact on production API)
- GPU L4: ~200-300 articles/s (vs ~1-2 on CPU)
- Total time: 15-25h for 300k articles (vs 30 days with DAG)
- Cost: $0 (EC2 already exists)
Usage:
    # Quick test
    ./test_local.sh
    # Full pipeline
    python migrate_to_bge_m3.py full \
        --input artigos_para_migrar.parquet \
        --database-url "$DATABASE_URL"
Related: #175
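The checkpoint/resume feature listed above can be sketched as follows. This is an illustration under assumptions, not the script's actual code: the function name, the JSON checkpoint layout, and the embed_batch callback are all hypothetical.

```python
import json
import os
from typing import Callable, Sequence


def process_with_checkpoint(
    items: Sequence[str],
    embed_batch: Callable[[Sequence[str]], None],
    checkpoint_path: str,
    batch_size: int = 256,
) -> int:
    """Process items in batches, saving the next offset after each batch.

    Returns the number of items processed in this run.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        # Resume from the offset recorded by a previous (possibly interrupted) run.
        with open(checkpoint_path, encoding="utf-8") as fh:
            start = json.load(fh)["next_offset"]

    processed = 0
    for offset in range(start, len(items), batch_size):
        batch = items[offset:offset + batch_size]
        embed_batch(batch)
        # Write to a temp file, then rename, so a crash never leaves a partial checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump({"next_offset": offset + len(batch)}, fh)
        os.replace(tmp_path, checkpoint_path)
        processed += len(batch)
    return processed
```

Because the checkpoint is written only after a batch succeeds, a failed batch is retried in full on the next run, so the downstream write would need to tolerate reprocessing.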
Add scripts to facilitate local testing of embeddings migration:
- setup_local_test.sh: Automated setup script
  - Creates test database (govbrnews_test)
  - Restores SQL dump
  - Applies migration 004
  - Shows statistics
- QUICKSTART.md: Step-by-step guide
  - Option 1: Automated script
  - Option 2: Manual steps
  - End-to-end test
  - Troubleshooting
Workflow:
1. ./setup_local_test.sh (restore dump + apply migration)
2. Export articles for migration
3. Test embedding generation with GPU/CPU
4. Upload back to local DB
5. Validate results
This allows testing the complete pipeline locally before running on EC2 L4 with production data.
Related: #175
Fix paths to:
- Dump file: ../data_dump → ../../data_dump
- Migration: ../migrations → ../../scripts/migrations
Detect available PostgreSQL user (current user or postgres) and use it for all psql/createdb commands.
Fixes: 'role lpmoraes does not exist' error
- Remove hard dependency on pv (progress viewer)
- Fall back to direct psql < dump.sql when pv is not available
- Add exit code checking and error handling
- Provide manual command if restore fails
Fixes: dump silently failing when pv command not found
Fix TypeError when uploading embeddings: convert numpy.ndarray to list before psql insert. Tested with 100 articles: all uploaded successfully.
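A minimal sketch of the kind of conversion this fix describes, assuming 1-D embeddings. The helper names are hypothetical, and the bracketed text form follows the pgvector literal convention, which is an assumption about the target column type. The code checks for `.tolist()` by duck typing, so it accepts a numpy.ndarray without importing numpy.

```python
from typing import Any, List


def to_plain_list(embedding: Any) -> List[float]:
    """Convert a 1-D array-like embedding to a plain list of Python floats."""
    # numpy.ndarray exposes .tolist(); other iterables fall back to list().
    values = embedding.tolist() if hasattr(embedding, "tolist") else list(embedding)
    return [float(v) for v in values]


def to_vector_literal(embedding: Any) -> str:
    """Format an embedding as a bracketed text literal such as '[0.5,2.0]'."""
    return "[" + ",".join(repr(v) for v in to_plain_list(embedding)) + "]"
```

Passing either the plain list or the text literal avoids handing a raw ndarray to a database driver, which may not know how to adapt it.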
Increase content preview from 500 chars to 24,000 chars to better utilize BGE-M3's 8192 token capacity.
Changes:
- Add MAX_CHARS = 24000 constant
- Use available_chars calculation for content
- Add safety truncation at the end
- Update docstring with BGE-M3 limits
Rationale:
- BGE-M3 supports 8192 tokens (~32k chars)
- Database has articles with content up to 7MB
- Previous 500 char limit was too conservative
- 24k chars ≈ 8k tokens (conservative estimate)
Impact:
- Better embeddings for long articles without summary
- No change for articles with summary (already good)
- Stays within model limits (safety truncation)
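The truncation logic described in this commit might look roughly like the sketch below. Only MAX_CHARS and the available_chars idea come from the commit message; the function name, the separator, and the rule "use the summary when present, otherwise truncated content" are assumptions inferred from the Impact notes.

```python
from typing import Optional

MAX_CHARS = 24_000   # from the commit message
SEPARATOR = "\n\n"   # hypothetical separator between title and body


def build_embedding_text(title: str, summary: Optional[str], content: Optional[str]) -> str:
    """Assemble the text sent to the embedding model, capped at MAX_CHARS."""
    title = (title or "").strip()
    parts = [title]
    if summary and summary.strip():
        # Articles with a summary keep using it unchanged.
        parts.append(summary.strip())
    else:
        # Budget for content is whatever the title and separator leave over.
        available_chars = max(0, MAX_CHARS - len(title) - len(SEPARATOR))
        body = (content or "")[:available_chars]
        if body:
            parts.append(body)
    text = SEPARATOR.join(p for p in parts if p)
    # Safety truncation in case the title or summary alone exceeds the cap.
    return text[:MAX_CHARS]
```

The two ratios in the rationale correspond to roughly 4 characters per token (8192 tokens ≈ 32k chars) and a more cautious 3 characters per token (24k chars ≈ 8k tokens). A character cap is only a proxy, so the tokenizer's own truncation would still be the final guard.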
Title:
Description:
2. Run Migration on EC2 L4
Estimated time: ~20-30 minutes for 334k articles
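As a quick arithmetic check, the estimate above matches the GPU throughput quoted in the commit messages (~200-300 articles/s). The sketch covers only the embedding step; dump, upload, and model loading time are not included. It also shows the DAG pace of 10k articles/day for comparison.

```python
def minutes_needed(articles: int, articles_per_second: float) -> float:
    """Minutes to embed `articles` at a steady throughput."""
    return articles / articles_per_second / 60


ARTICLES = 334_000

fast = minutes_needed(ARTICLES, 300)  # about 18.6 minutes
slow = minutes_needed(ARTICLES, 200)  # about 27.8 minutes
dag_days = ARTICLES / 10_000          # 33.4 days at 10k articles/day

print(f"GPU: {fast:.1f}-{slow:.1f} min, DAG: {dag_days:.1f} days")
```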
3. Recreate Typesense Collection
Follow the steps in PLANO_MIGRACAO_BGE_M3.md (infra repo)
Breaking Changes
None. Dual-index strategy:
Main Files
Expected Performance
Documentation
See the full documentation in:
- scripts/embeddings-migration/README.md
- scripts/embeddings-migration/QUICKSTART.md
- PLANO_MIGRACAO_BGE_M3.md (infra repo)