Feature: Automatic Document Language Detection and Locale Parametrization

## Summary
Detect the language of each uploaded PDF automatically and pass the matching locale to Adobe PDF Services API calls, so that non-English documents are processed with language-appropriate settings.

## 🎯 Motivation
Currently, the PDF accessibility processing pipeline uses a hardcoded English locale (`en-US`) for all documents. This limits the effectiveness of Adobe PDF Services' autotagging and extraction for non-English documents, particularly those in Spanish, Catalan, French, or German, which have their own linguistic rules and accessibility requirements.

## ✨ Features Implemented

### 1. **Automatic Language Detection**
- **AWS Comprehend Integration**: Utilizes AWS Comprehend's `DetectDominantLanguage` API to analyze document content
- **Smart Text Sampling**: Extracts text from the first 5 pages of the PDF for language analysis
- **Confidence Thresholding**: Applies the detected language only when the confidence score is ≥ 70%
- **Graceful Fallbacks**: Defaults to English (`en-US`) for low-confidence detections or errors
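
The detection and fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in `autotag.py`: `pick_language`, `CONFIDENCE_THRESHOLD`, `DEFAULT_LANGUAGE`, and the injected `client` argument are hypothetical names. The only real API assumed is the boto3 Comprehend client's `detect_dominant_language(Text=...)`, whose response carries a `Languages` list of `LanguageCode`/`Score` entries.

```python
CONFIDENCE_THRESHOLD = 0.70  # assumed to correspond to the 70% threshold above
DEFAULT_LANGUAGE = "en"      # maps to the en-US fallback


def pick_language(client, text):
    """Return (language_code, score), falling back to English when unsure.

    `client` is any object exposing Comprehend's
    detect_dominant_language(Text=...), e.g. boto3.client("comprehend").
    """
    try:
        response = client.detect_dominant_language(Text=text)
    except Exception:
        # Broad on purpose in this sketch: any API failure means fallback.
        return DEFAULT_LANGUAGE, 0.0

    languages = response.get("Languages", [])
    if not languages:
        return DEFAULT_LANGUAGE, 0.0

    # Take the highest-scoring candidate and apply the threshold.
    best = max(languages, key=lambda lang: lang["Score"])
    if best["Score"] < CONFIDENCE_THRESHOLD:
        return DEFAULT_LANGUAGE, best["Score"]
    return best["LanguageCode"], best["Score"]
```

Passing the client in as an argument keeps the function easy to exercise with a stub in unit tests, without network access.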

### 2. **Comprehensive Language Support**
Supports 30+ languages with proper locale mapping:

| Language | AWS Code | Adobe Locale | Region |
|----------|----------|--------------|---------|
| English | `en` | `en-US` | United States |
| Spanish | `es` | `es-ES` | Spain |
| Catalan | `ca` | `ca-ES` | Spain |
| French | `fr` | `fr-FR` | France |
| German | `de` | `de-DE` | Germany |
| Italian | `it` | `it-IT` | Italy |
| Portuguese | `pt` | `pt-BR` | Brazil |
| Japanese | `ja` | `ja-JP` | Japan |
| Chinese | `zh` | `zh-CN` | China (Simplified) |
| And 20+ more... | | | |

### 3. **Integrated Processing Pipeline**
- **Autotagging**: Applies detected locale to `AutotagPDFParams` for language-aware accessibility tagging
- **Text Extraction**: Uses detected locale in `ExtractPDFParams` for improved text and table extraction
- **PDF Metadata**: Sets document language metadata consistently across the pipeline

### 4. **Enhanced Error Handling & Logging**
- Comprehensive logging of detection process and confidence scores
- Truncates sampled text to 5000 bytes to stay within AWS Comprehend request size limits
- Manages insufficient text scenarios gracefully
- Detailed error reporting for troubleshooting
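
As an illustration of the size-limit and insufficient-text handling listed above, the sketch below trims a text sample to a byte budget without splitting a multi-byte character. `prepare_sample`, `MAX_TEXT_BYTES`, and `MIN_TEXT_CHARS` are hypothetical names; the 5000-byte figure is the one quoted in this issue, and the 20-character minimum is an assumption rather than a value taken from the implementation.

```python
MAX_TEXT_BYTES = 5000  # limit quoted in this issue
MIN_TEXT_CHARS = 20    # assumed minimum for a meaningful detection


def prepare_sample(text):
    """Return text trimmed to MAX_TEXT_BYTES of UTF-8, or None if too short."""
    cleaned = " ".join(text.split())  # collapse runs of whitespace
    if len(cleaned) < MIN_TEXT_CHARS:
        return None  # caller should fall back to the default locale

    encoded = cleaned.encode("utf-8")
    if len(encoded) <= MAX_TEXT_BYTES:
        return cleaned

    # errors="ignore" drops a trailing character cut in half by the slice.
    return encoded[:MAX_TEXT_BYTES].decode("utf-8", errors="ignore")
```

Truncating on bytes rather than characters matters for scripts such as Japanese or Chinese, where one character takes several bytes in UTF-8.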

## 🔧 Technical Implementation

### Core Components Added:

#### 1. Language Detection Function
```python
def detect_document_language(pdf_path, filename):
    """
    Detect the dominant language in a PDF document using AWS Comprehend.
    Returns Adobe PDF Services locale code (e.g., 'es-ES', 'ca-ES', 'en-US')
    """
```

#### 2. Updated API Functions
- `autotag_pdf_with_options()` - Now accepts `detected_locale` parameter
- `extract_api()` - Now accepts `detected_locale` parameter
- `set_language_comprehend()` - Enhanced to use detected locale for PDF metadata
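
For the metadata step, one plausible way to derive a language code from the detected locale is to take its primary subtag. The helper below is a hypothetical sketch (`language_from_locale` is not a function named in this issue) and makes no claim about how `set_language_comprehend()` actually does it.

```python
def language_from_locale(locale, default="en"):
    """Return the primary language subtag of a locale such as 'es-ES'."""
    if not locale:
        return default
    # Accept both 'es-ES' and 'es_ES' spellings.
    primary = locale.replace("_", "-").split("-", 1)[0].strip().lower()
    return primary or default
```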

#### 3. Language-to-Locale Mapping
Comprehensive mapping dictionary from AWS Comprehend language codes to Adobe PDF Services locale codes.
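
A minimal sketch of such a mapping, limited to the nine pairs listed in the table in this issue (the remaining entries are not reproduced here). `LANGUAGE_TO_LOCALE`, `DEFAULT_LOCALE`, and `to_adobe_locale` are illustrative names, not necessarily those used in `autotag.py`.

```python
# Only the pairs shown in this issue's table; the full mapping has more entries.
LANGUAGE_TO_LOCALE = {
    "en": "en-US",
    "es": "es-ES",
    "ca": "ca-ES",
    "fr": "fr-FR",
    "de": "de-DE",
    "it": "it-IT",
    "pt": "pt-BR",
    "ja": "ja-JP",
    "zh": "zh-CN",
}
DEFAULT_LOCALE = "en-US"


def to_adobe_locale(language_code):
    """Map a detected language code to a locale, defaulting to en-US."""
    return LANGUAGE_TO_LOCALE.get(language_code.lower(), DEFAULT_LOCALE)
```

Keeping the mapping in a single dictionary means that supporting a new language is a one-line change.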

### Infrastructure Changes:

#### AWS CDK Updates (`app.py`):
- **IAM Permissions**: Added `comprehend:DetectDominantLanguage` permission to ECS task role
- **Environment Variables**: Removed hardcoded `PDF_LOCALE` environment variable
- **Backward Compatibility**: Maintains support for manual locale override via the optional `PDF_LOCALE` environment variable
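
For reference, the added permission corresponds to an IAM statement along these lines. It is written here as a plain Python dictionary so it can be inspected without the CDK installed; in the CDK app it would more likely be expressed with `iam.PolicyStatement(actions=[...], resources=[...])` and attached to the task role. The wildcard resource is an assumption, on the basis that this action does not operate on a specific Comprehend resource.

```python
import json

# Statement granting only the language-detection call.
comprehend_statement = {
    "Effect": "Allow",
    "Action": ["comprehend:DetectDominantLanguage"],
    "Resource": "*",  # assumption: no resource-level scoping for this action
}

policy_document = {
    "Version": "2012-10-17",
    "Statement": [comprehend_statement],
}

print(json.dumps(policy_document, indent=2))
```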

## 📊 Processing Flow

```mermaid
graph TD
    A[PDF Upload] --> B[Download from S3]
    B --> C[Extract Text from First 5 Pages]
    C --> D[AWS Comprehend Language Detection]
    D --> E{Confidence ≥ 70%?}
    E -->|Yes| F[Map to Adobe Locale]
    E -->|No| G[Default to en-US]
    F --> H[Apply Locale to Adobe APIs]
    G --> H
    H --> I[Autotagging with Locale]
    H --> J[Text Extraction with Locale]
    I --> K[Set PDF Language Metadata]
    J --> K
    K --> L[Upload Processed PDF]
```

## 🧪 Testing Scenarios

### Test Cases to Validate:
1. **Spanish Documents**: Verify `es-ES` locale detection and application
2. **Catalan Documents**: Verify `ca-ES` locale detection and application
3. **Mixed Language Documents**: Test confidence thresholding
4. **Scanned/Image PDFs**: Handle insufficient text scenarios
5. **Very Short Documents**: Test minimum text requirements
6. **Error Scenarios**: AWS Comprehend API failures, network issues
7. **Backward Compatibility**: Manual locale override still works

### Expected Improvements:
- **Better Accessibility Tagging**: Language-specific heading detection and structure analysis
- **Improved Text Extraction**: Better handling of language-specific characters and formatting
- **Enhanced Metadata**: Proper language metadata in final PDF documents
- **Compliance**: Better WCAG 2.1 compliance for non-English documents

## 📈 Benefits

### For Users:
- **Automatic Processing**: No manual language configuration required
- **Better Accuracy**: Language-aware processing improves accessibility tagging quality
- **Multi-language Support**: Seamless handling of documents in 30+ languages
- **Consistent Results**: Standardized locale application across all processing steps

### For Developers:
- **Maintainable Code**: Clean separation of language detection logic
- **Extensible Design**: Easy to add new language mappings
- **Comprehensive Logging**: Detailed insights into language detection process
- **Error Resilience**: Robust fallback mechanisms

## 🔍 Monitoring & Observability

### Key Metrics to Track:
- Language detection confidence scores
- Distribution of detected languages
- Fallback to default locale frequency
- AWS Comprehend API usage and costs
- Processing time impact
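
Several of these metrics can be derived from per-document detection records. The sketch below assumes each record is a dictionary with hypothetical keys `language`, `score`, and `fell_back`; how such records are collected (logs, CloudWatch, a database) is left open.

```python
from collections import Counter


def summarize_detections(records):
    """Aggregate language distribution, fallback rate, and mean confidence."""
    records = list(records)
    total = len(records)
    if total == 0:
        return {"total": 0, "by_language": {}, "fallback_rate": 0.0, "mean_score": 0.0}

    by_language = Counter(r["language"] for r in records)
    fallbacks = sum(1 for r in records if r["fell_back"])
    mean_score = sum(r["score"] for r in records) / total

    return {
        "total": total,
        "by_language": dict(by_language),
        "fallback_rate": fallbacks / total,
        "mean_score": mean_score,
    }
```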

### Log Messages Added:
- `Detected language: {code} (confidence: {score})`
- `Using locale for autotagging: {locale}`
- `Using locale for extraction: {locale}`
- `Language set to {code} (from detected locale: {locale})`
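
A sketch of how these messages might be emitted with the standard `logging` module. The logger name, the helper `log_detection`, and the two-decimal formatting of the score are assumptions; only the message texts come from the list above.

```python
import logging

logger = logging.getLogger("autotag.language")  # hypothetical logger name


def log_detection(code, score, locale):
    """Emit the detection and locale messages for one document."""
    logger.info("Detected language: %s (confidence: %.2f)", code, score)
    logger.info("Using locale for autotagging: %s", locale)
    logger.info("Using locale for extraction: %s", locale)
    logger.info("Language set to %s (from detected locale: %s)", code, locale)
```

Using `%s` placeholders instead of f-strings defers formatting until a handler actually accepts the record.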

## 🚀 Deployment Notes

### Prerequisites:
- AWS Comprehend service availability in deployment region
- Updated IAM permissions for ECS task role
- No additional environment variables required

### Rollback Plan:
- Set `PDF_LOCALE` environment variable to force specific locale
- Previous hardcoded behavior can be restored by setting `PDF_LOCALE=en-US`
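
The override described in this rollback plan could be resolved as in the sketch below, where a non-empty `PDF_LOCALE` takes precedence over the detected locale. `resolve_locale` is a hypothetical helper name; the `environ` parameter exists only to make the function easy to test.

```python
import os


def resolve_locale(detected_locale, environ=os.environ):
    """Prefer a manual PDF_LOCALE override; otherwise use the detected locale."""
    override = environ.get("PDF_LOCALE", "").strip()
    return override or detected_locale
```

With `PDF_LOCALE=en-US` set on the task, every document would be processed as before this change, regardless of what detection reports.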

## 🔮 Future Enhancements

### Potential Improvements:
1. **Language Detection Caching**: Cache results for similar documents
2. **Multi-language Documents**: Handle documents with multiple languages
3. **Custom Language Models**: Support for domain-specific language detection
4. **User Override Interface**: Allow manual language selection in frontend
5. **Language-specific Processing Rules**: Customize processing based on detected language
6. **Analytics Dashboard**: Visualize language distribution and processing metrics

## 📝 Files Modified

### Core Changes:
- `docker_autotag/autotag.py`: Added language detection and locale parametrization
- `app.py`: Updated IAM permissions and removed hardcoded locale

### Key Functions Added/Modified:
- `detect_document_language()` - New function for language detection
- `autotag_pdf_with_options()` - Added locale parameter
- `extract_api()` - Added locale parameter
- `set_language_comprehend()` - Enhanced with locale support
- `main()` - Integrated language detection workflow

---

## 🏷️ Labels
`enhancement` `language-support` `aws-comprehend` `adobe-pdf-services` `accessibility` `internationalization` `i18n`

## 🔗 Related Issues
- Addresses need for multi-language document processing
- Improves accessibility compliance for non-English documents
- Enhances Adobe PDF Services API utilization

---

**Priority**: High  
**Complexity**: Medium  
**Impact**: High - Significantly improves processing quality for non-English documents
