
Feature: Automatic Document Language Detection and Locale Parametrization #11

@dgomesbr

Summary

Implement automatic language detection for uploaded PDF documents and dynamically apply the appropriate locale settings to Adobe PDF Services API calls for optimal processing results across multiple languages.

🎯 Motivation

Currently, the PDF accessibility processing pipeline uses a hardcoded English locale (en-US) for all documents. This limits the effectiveness of Adobe PDF Services' autotagging and extraction capabilities for non-English documents, particularly for languages like Spanish, Catalan, French, German, and others that have specific linguistic rules and accessibility requirements.

✨ Features Implemented

1. Automatic Language Detection

  • AWS Comprehend Integration: Utilizes AWS Comprehend's DetectDominantLanguage API to analyze document content
  • Smart Text Sampling: Extracts text from the first 5 pages of the PDF for language analysis
  • Confidence Thresholding: Only applies detected language if confidence score ≥ 70%
  • Graceful Fallbacks: Defaults to English (en-US) for low-confidence detections or errors
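The thresholding and fallback behavior listed above could be sketched roughly as follows. This is a minimal illustration, not the code in docker_autotag/autotag.py: `pick_language`, `CONFIDENCE_THRESHOLD`, and the injected `client` argument are hypothetical names. The only real API assumed is the boto3 Comprehend client's `detect_dominant_language(Text=...)`, whose response carries a `Languages` list of `LanguageCode`/`Score` entries.

```python
DEFAULT_LANGUAGE = "en"        # fallback language from the issue text (maps to en-US)
CONFIDENCE_THRESHOLD = 0.70    # "confidence score >= 70%" from the issue text


def pick_language(client, text, threshold=CONFIDENCE_THRESHOLD):
    """Return the dominant language code, or the default on low confidence or error.

    `client` is expected to behave like boto3.client("comprehend").
    """
    try:
        response = client.detect_dominant_language(Text=text)
    except Exception:
        # Any API or network failure falls back to English.
        return DEFAULT_LANGUAGE

    languages = response.get("Languages", [])
    if not languages:
        return DEFAULT_LANGUAGE

    # Comprehend may return several candidates; keep the highest-scoring one.
    best = max(languages, key=lambda lang: lang["Score"])
    if best["Score"] >= threshold:
        return best["LanguageCode"]
    return DEFAULT_LANGUAGE
```

Passing the client in as an argument, rather than creating it inside the function, keeps the fallback logic testable with a stub and without AWS credentials.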

2. Comprehensive Language Support

Supports 30+ languages with proper locale mapping:

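| Language   | AWS Code | Adobe Locale | Region             |
|------------|----------|--------------|--------------------|
| English    | en       | en-US        | United States      |
| Spanish    | es       | es-ES        | Spain              |
| Catalan    | ca       | ca-ES        | Spain              |
| French     | fr       | fr-FR        | France             |
| German     | de       | de-DE        | Germany            |
| Italian    | it       | it-IT        | Italy              |
| Portuguese | pt       | pt-BR        | Brazil             |
| Japanese   | ja       | ja-JP        | Japan              |
| Chinese    | zh       | zh-CN        | China (Simplified) |

And 20+ more...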
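
As an illustration of the mapping in the table above, a dictionary lookup with an en-US default might look like this. It is only a sketch: `LANGUAGE_TO_LOCALE` and `to_adobe_locale` are hypothetical names, and the dictionary holds only the nine rows listed here, not the full set the issue refers to.

```python
# AWS Comprehend language code -> Adobe PDF Services locale.
# Only the rows shown in the table; the remaining mappings are not listed there.
LANGUAGE_TO_LOCALE = {
    "en": "en-US",
    "es": "es-ES",
    "ca": "ca-ES",
    "fr": "fr-FR",
    "de": "de-DE",
    "it": "it-IT",
    "pt": "pt-BR",
    "ja": "ja-JP",
    "zh": "zh-CN",
}

DEFAULT_LOCALE = "en-US"


def to_adobe_locale(language_code):
    """Map a Comprehend language code to a locale, defaulting to en-US when unmapped."""
    return LANGUAGE_TO_LOCALE.get(language_code, DEFAULT_LOCALE)
```

Keeping the mapping as plain data means a new language is a one-line addition, which matches the extensibility goal stated later in the issue.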

3. Integrated Processing Pipeline

  • Autotagging: Applies detected locale to AutotagPDFParams for language-aware accessibility tagging
  • Text Extraction: Uses detected locale in ExtractPDFParams for improved text and table extraction
  • PDF Metadata: Sets document language metadata consistently across the pipeline

4. Enhanced Error Handling & Logging

  • Comprehensive logging of detection process and confidence scores
  • Truncates the sampled text to 5000 bytes before calling AWS Comprehend, to stay within the API's input size limit
  • Manages insufficient text scenarios gracefully
  • Detailed error reporting for troubleshooting
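
The size-limit and insufficient-text handling above could be approached along these lines. This is a sketch under assumptions: `prepare_sample` is a hypothetical helper, the 5000-byte cap is the figure quoted in this issue, and the 20-character minimum is an assumed cutoff for "too little text", not a value taken from the implementation.

```python
MAX_TEXT_BYTES = 5000   # cap quoted in the issue text
MIN_TEXT_CHARS = 20     # assumed minimum sample size; adjust as needed


def prepare_sample(text, max_bytes=MAX_TEXT_BYTES, min_chars=MIN_TEXT_CHARS):
    """Return text trimmed to max_bytes of UTF-8, or None if there is too little text."""
    # Collapse the irregular whitespace that PDF text extraction tends to leave.
    cleaned = " ".join(text.split())
    if len(cleaned) < min_chars:
        return None  # e.g. scanned or image-only PDFs

    encoded = cleaned.encode("utf-8")
    if len(encoded) <= max_bytes:
        return cleaned

    # Slicing bytes can split a multi-byte character; errors="ignore"
    # drops the dangling partial character at the end.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

Truncating by bytes rather than characters matters for non-English text, where accented or CJK characters take two or three bytes each in UTF-8.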

🔧 Technical Implementation

Core Components Added:

1. Language Detection Function

def detect_document_language(pdf_path, filename):
    """
    Detect the dominant language in a PDF document using AWS Comprehend.
    Returns Adobe PDF Services locale code (e.g., 'es-ES', 'ca-ES', 'en-US')
    """

2. Updated API Functions

  • autotag_pdf_with_options() - Now accepts detected_locale parameter
  • extract_api() - Now accepts detected_locale parameter
  • set_language_comprehend() - Enhanced to use detected locale for PDF metadata

3. Language-to-Locale Mapping

Comprehensive mapping dictionary from AWS Comprehend language codes to Adobe PDF Services locale codes.

Infrastructure Changes:

AWS CDK Updates (app.py):

  • IAM Permissions: Added comprehend:DetectDominantLanguage permission to ECS task role
  • Environment Variables: Removed the hardcoded PDF_LOCALE value, so the locale is now detected per document by default
  • Backward Compatibility: Setting PDF_LOCALE manually still overrides the detected locale
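
For reference, the IAM permission described above corresponds to a policy statement like the one below, shown as a plain policy document rather than the CDK code in app.py (which is not reproduced in this issue). `COMPREHEND_POLICY` and `render_policy` are illustrative names. In CDK the same statement would typically be built with `aws_iam.PolicyStatement(actions=[...], resources=["*"])` and attached to the task role.

```python
import json

# Policy statement equivalent to the permission described above.
# DetectDominantLanguage does not act on a specific resource, hence "*".
COMPREHEND_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["comprehend:DetectDominantLanguage"],
            "Resource": "*",
        }
    ],
}


def render_policy(policy=COMPREHEND_POLICY):
    """Serialize the policy document as indented JSON."""
    return json.dumps(policy, indent=2)
```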

📊 Processing Flow

graph TD
    A[PDF Upload] --> B[Download from S3]
    B --> C[Extract Text from First 5 Pages]
    C --> D[AWS Comprehend Language Detection]
    D --> E{Confidence ≥ 70%?}
    E -->|Yes| F[Map to Adobe Locale]
    E -->|No| G[Default to en-US]
    F --> H[Apply Locale to Adobe APIs]
    G --> H
    H --> I[Autotagging with Locale]
    H --> J[Text Extraction with Locale]
    I --> K[Set PDF Language Metadata]
    J --> K
    K --> L[Upload Processed PDF]

🧪 Testing Scenarios

Test Cases to Validate:

  1. Spanish Documents: Verify es-ES locale detection and application
  2. Catalan Documents: Verify ca-ES locale detection and application
  3. Mixed Language Documents: Test confidence thresholding
  4. Scanned/Image PDFs: Handle insufficient text scenarios
  5. Very Short Documents: Test minimum text requirements
  6. Error Scenarios: AWS Comprehend API failures, network issues
  7. Backward Compatibility: Manual locale override still works

Expected Improvements:

  • Better Accessibility Tagging: Language-specific heading detection and structure analysis
  • Improved Text Extraction: Better handling of language-specific characters and formatting
  • Enhanced Metadata: Proper language metadata in final PDF documents
  • Compliance: Better WCAG 2.1 compliance for non-English documents

📈 Benefits

For Users:

  • Automatic Processing: No manual language configuration required
  • Better Accuracy: Language-aware processing improves accessibility tagging quality
  • Multi-language Support: Seamless handling of documents in 30+ languages
  • Consistent Results: Standardized locale application across all processing steps

For Developers:

  • Maintainable Code: Clean separation of language detection logic
  • Extensible Design: Easy to add new language mappings
  • Comprehensive Logging: Detailed insights into language detection process
  • Error Resilience: Robust fallback mechanisms

🔍 Monitoring & Observability

Key Metrics to Track:

  • Language detection confidence scores
  • Distribution of detected languages
  • Fallback to default locale frequency
  • AWS Comprehend API usage and costs
  • Processing time impact

Log Messages Added:

  • Detected language: {code} (confidence: {score})
  • Using locale for autotagging: {locale}
  • Using locale for extraction: {locale}
  • Language set to {code} (from detected locale: {locale})
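
One way these messages might be emitted with the standard logging module is sketched below. The logger name and `log_detection` are hypothetical, and the two-decimal confidence format is an assumption. The message templates themselves are the ones listed above.

```python
import logging

logger = logging.getLogger("autotag.language")  # hypothetical logger name


def log_detection(code, score, locale):
    """Emit the detection log lines using lazy %-style formatting."""
    logger.info("Detected language: %s (confidence: %.2f)", code, score)
    logger.info("Using locale for autotagging: %s", locale)
    logger.info("Using locale for extraction: %s", locale)
    logger.info("Language set to %s (from detected locale: %s)", code, locale)
```

Passing the values as arguments instead of pre-formatting the string keeps the language code and confidence available as structured fields on the log record, which helps when building the metrics listed above.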

🚀 Deployment Notes

Prerequisites:

  • AWS Comprehend service availability in deployment region
  • Updated IAM permissions for ECS task role
  • No additional environment variables required

Rollback Plan:

  • Set PDF_LOCALE environment variable to force specific locale
  • Previous hardcoded behavior can be restored by setting PDF_LOCALE=en-US
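
The override behavior in the rollback plan could be expressed as a small precedence rule. This is a sketch, not the shipped logic: `resolve_locale` is a hypothetical helper, and it assumes a non-empty PDF_LOCALE always wins over the detected locale.

```python
import os

DEFAULT_LOCALE = "en-US"


def resolve_locale(detected_locale, environ=os.environ):
    """Prefer a non-empty PDF_LOCALE override, then the detected locale, then en-US."""
    override = environ.get("PDF_LOCALE", "").strip()
    if override:
        return override
    return detected_locale or DEFAULT_LOCALE
```

With this ordering, `resolve_locale("es-ES", {"PDF_LOCALE": "en-US"})` returns the override, so setting PDF_LOCALE=en-US restores the previous behavior without a code change.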

🔮 Future Enhancements

Potential Improvements:

  1. Language Detection Caching: Cache results for similar documents
  2. Multi-language Documents: Handle documents with multiple languages
  3. Custom Language Models: Support for domain-specific language detection
  4. User Override Interface: Allow manual language selection in frontend
  5. Language-specific Processing Rules: Customize processing based on detected language
  6. Analytics Dashboard: Visualize language distribution and processing metrics

📝 Files Modified

Core Changes:

  • docker_autotag/autotag.py: Added language detection and locale parametrization
  • app.py: Updated IAM permissions and removed hardcoded locale

Key Functions Added/Modified:

  • detect_document_language() - New function for language detection
  • autotag_pdf_with_options() - Added locale parameter
  • extract_api() - Added locale parameter
  • set_language_comprehend() - Enhanced with locale support
  • main() - Integrated language detection workflow

🏷️ Labels

enhancement, language-support, aws-comprehend, adobe-pdf-services, accessibility, internationalization, i18n

🔗 Related Issues

  • Addresses need for multi-language document processing
  • Improves accessibility compliance for non-English documents
  • Enhances Adobe PDF Services API utilization

Priority: High
Complexity: Medium
Impact: High - Significantly improves processing quality for non-English documents
