feat(vision): add auxiliary vision model for image description fallback #157

Open
Johell1NS wants to merge 1 commit into grinev:main from Johell1NS:feat/vision-model-fallback

@Johell1NS

What

Adds an auxiliary vision model fallback. When the user sends an image to a model that does not support image input, the bot:

  1. Detects that the current model lacks vision capability
  2. Sends the image to a vision model configured by the user (via /setvision)
  3. Has the vision model describe the image using a dynamic prompt: if a caption is provided, the description is framed around the user's question
  4. Forwards the textual description to the main model, which continues processing

Everything is automatic — no user confirmation required. If no vision model is configured, the bot behaves exactly as before (error message + text-only fallback).
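
The flow above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual handler code: every name here (`chooseImageRoute`, `prepareMainModelInput`, `ModelState`, the `warnUnsupported` flag) is hypothetical, and the way the description is combined with the caption is an assumption.

```typescript
// Hypothetical sketch of the routing described above; none of these
// names are taken from the PR's source.
type ImageRoute = "direct" | "vision-fallback" | "text-only";

interface ModelState {
  mainModelSupportsImage: boolean;
  visionModel?: string; // set via /setvision; absent if not configured
}

// Decide how an incoming image should be handled.
function chooseImageRoute(state: ModelState): ImageRoute {
  if (state.mainModelSupportsImage) return "direct";
  return state.visionModel ? "vision-fallback" : "text-only";
}

interface MainModelInput {
  text: string;
  attachImage: boolean;
  warnUnsupported: boolean; // previous behaviour: error message + text only
}

// Injected so the sketch does not depend on any real SDK.
type Describe = (image: Uint8Array, caption?: string) => Promise<string>;

async function prepareMainModelInput(
  state: ModelState,
  image: Uint8Array,
  caption: string | undefined,
  describe: Describe,
): Promise<MainModelInput> {
  const route = chooseImageRoute(state);
  if (route === "direct") {
    return { text: caption ?? "", attachImage: true, warnUnsupported: false };
  }
  if (route === "text-only") {
    return { text: caption ?? "", attachImage: false, warnUnsupported: true };
  }
  // Vision fallback: replace the image with its textual description.
  const description = await describe(image, caption);
  const text = caption
    ? `${caption}\n\nImage description: ${description}`
    : `Image description: ${description}`;
  return { text, attachImage: false, warnUnsupported: false };
}
```

Keeping the route decision in a pure function makes the three cases (vision-capable model, fallback, no vision model configured) easy to test independently of Telegram or the model backend.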

Closes #151

Commands

  • /setvision — shows a menu to select a vision model. If already set, shows the current model with "Change" and "Clear" options
  • The vision model selection persists in settings.json across restarts
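
As a rough illustration of the command's state handling, the sketch below models the persisted setting and the menu options. Only the `currentVisionModel` field name and the "Change" / "Clear" labels come from this PR; the `Settings` shape, the `currentModel` field, the string model identifier, the "Select" label, and all function names are assumptions made for the example.

```typescript
// Hypothetical settings shape; only currentVisionModel is named in the PR.
interface Settings {
  currentModel?: string; // assumed field, for illustration only
  currentVisionModel?: string; // assumed to be a plain string identifier
}

// Return a copy of the settings with the vision model selected.
function setVisionModel(settings: Settings, model: string): Settings {
  return { ...settings, currentVisionModel: model };
}

// Return a copy of the settings with the vision model removed.
function clearVisionModel(settings: Settings): Settings {
  const next: Settings = { ...settings };
  delete next.currentVisionModel;
  return next;
}

// Options /setvision would offer: "Change" and "Clear" when a model is
// already set, otherwise a plain selection entry ("Select" is a placeholder).
function visionMenuActions(settings: Settings): string[] {
  return settings.currentVisionModel ? ["Change", "Clear"] : ["Select"];
}

// Persisting to settings.json amounts to a JSON round trip.
function roundTrip(settings: Settings): Settings {
  return JSON.parse(JSON.stringify(settings)) as Settings;
}
```

Because the field is optional, settings files written before this change would still load, and the absent field would simply mean "no vision model configured".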

Technical changes

3 new files:

  • src/app/services/vision-model-service.ts — service: describeImage() creates a temporary session, calls session.prompt() synchronously with the vision model, extracts the text description, deletes the session
  • src/bot/commands/setvision-command.ts — /setvision command handler
  • src/bot/callbacks/vision-model-callback-handler.ts — inline menu callback for vision model selection
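
The lifecycle described for describeImage() (create a temporary session, prompt, extract text, delete) might look roughly like this. The `VisionClient` interface is a hypothetical stand-in, not the real SDK surface, and the part types are simplified assumptions; the point of the sketch is the `try`/`finally` cleanup around the temporary session.

```typescript
// Hypothetical stand-in for the session API; NOT the real SDK types.
interface TextPart { type: "text"; text: string }
interface OtherPart { type: "other" }
type Part = TextPart | OtherPart;

interface VisionClient {
  createSession(parentId: string): Promise<{ id: string }>;
  prompt(
    sessionId: string,
    model: string,
    prompt: string,
    image: Uint8Array,
  ): Promise<{ parts: Part[] }>;
  deleteSession(sessionId: string): Promise<void>;
}

// Keep only the text parts of a reply and join them into one description.
function extractText(parts: Part[]): string {
  return parts
    .filter((p): p is TextPart => p.type === "text")
    .map((p) => p.text)
    .join("\n")
    .trim();
}

async function describeImage(
  client: VisionClient,
  opts: { parentSessionId: string; visionModel: string; prompt: string; image: Uint8Array },
): Promise<string> {
  // Temporary session, created as a child of the foreground session.
  const session = await client.createSession(opts.parentSessionId);
  try {
    // Wait for the full reply; the description is needed before continuing.
    const reply = await client.prompt(session.id, opts.visionModel, opts.prompt, opts.image);
    return extractText(reply.parts);
  } finally {
    // Delete the temporary session even if the prompt call throws.
    await client.deleteSession(session.id);
  }
}
```

Whether the actual service guards the cleanup this way is not stated in the PR; the `finally` block is one way to avoid leaking temporary sessions when the vision model call fails.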

17 modified files:

  • src/app/types/settings.ts / src/app/stores/settings-store.ts — new currentVisionModel field
  • src/bot/handlers/photo-handler.ts — vision fallback in !supportsImage branch
  • src/bot/handlers/document-handler.ts — fallback for image documents
  • src/bot/handlers/media-group-handler.ts — fallback for media groups with images
  • src/bot/routers/command-router.ts / callback-router.ts — command and callback registration
  • src/bot/menus/inline-menu.ts — added "vision" menu kind
  • src/bot/commands/definitions.ts — command list registration
  • src/i18n/*.ts — 11 new i18n keys across 7 languages

Notes

  • The temporary vision session is created as a child of the foreground session to suppress background session notifications
  • The vision model call uses session.prompt() (synchronous) rather than promptAsync(), since the full response is needed to continue the flow
  • The image description prompt is dynamic: if the user wrote a caption, the prompt is "The user asked: [caption]. Please describe the image focusing on the most relevant aspects." — otherwise "Please describe this image in detail."
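
The dynamic prompt from the last note can be sketched as a small pure function. The two prompt strings are quoted from the note above; the function name and the decision to treat a whitespace-only caption as "no caption" are assumptions of this sketch.

```typescript
// Prompt wording is quoted from the PR notes; the function name and the
// whitespace handling are assumptions made for this sketch.
function buildVisionPrompt(caption?: string): string {
  const trimmed = caption?.trim();
  if (trimmed) {
    return `The user asked: ${trimmed}. Please describe the image focusing on the most relevant aspects.`;
  }
  return "Please describe this image in detail.";
}
```

For example, `buildVisionPrompt("what is on the sign")` yields the caption-aware prompt, while `buildVisionPrompt()` falls back to the generic one.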

When the selected model lacks image input capability, the bot now
automatically uses a user-configurable auxiliary vision model to
describe images. The textual description is then fed to the main
model, enabling non-vision models to process image-based queries.

Changes:
- Add /setvision command to select a vision-capable model via inline menu
- Add vision-model-service with synchronous describeImage() using
  session.prompt() for temporary vision sessions
- Modify photo, document, and media-group handlers to fall back
  to vision model when main model lacks image support
- Vision sessions are created as children of the foreground session
  to suppress background session notifications
- Add 11 new i18n keys across all 7 supported languages
- Persist vision model selection in settings.json

Closes grinev#151
Johell1NS pushed a commit to Johell1NS/opencode-telegram-bot that referenced this pull request Jun 16, 2026
Development

Successfully merging this pull request may close these issues.

feat: automatic vision fallback — use an auxiliary model to describe images when the main model lacks vision support
