
feat(api): Add image multimodal support for LLMNode #17372

Draft · wants to merge 7 commits into `main` from `feat/support-image-generate-for-gemini`

Conversation

QuantumGhost (Collaborator) commented Apr 3, 2025

Summary

Enhance `LLMNode` with multimodal capability, introducing support for
image outputs.

This implementation extracts base64-encoded images from LLM responses,
saves them to the storage service, and records the file metadata in the
`ToolFile` table. In conversations, these images are rendered as
markdown inline images.
Additionally, the images are included in the `LLMNode`'s output as
file variables, enabling subsequent nodes in the workflow to use them.
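
For reference, a minimal sketch of that flow in Python. All names here (the `ToolFile` fields, the `storage.save` interface, the key layout) are illustrative assumptions, not the actual Dify APIs:

```python
import base64
import mimetypes
import uuid
from dataclasses import dataclass


@dataclass
class ToolFile:
    """Minimal stand-in for the real `ToolFile` ORM model; the actual
    schema has more columns than shown here."""
    file_key: str
    mime_type: str
    size: int


def save_multimodal_image(base64_data: str, mime_type: str, storage) -> ToolFile:
    """Decode a base64 image emitted by the LLM, persist it via the
    storage service, and return the metadata destined for `ToolFile`.

    `storage` is any object exposing a `save(key, data)` method, a
    hypothetical interface rather than the exact Dify storage API.
    """
    data = base64.b64decode(base64_data)
    extension = mimetypes.guess_extension(mime_type) or ".bin"
    # Store the decoded bytes under a unique key.
    file_key = f"tools/{uuid.uuid4()}{extension}"
    storage.save(file_key, data)
    # The caller records this metadata in the `ToolFile` table and also
    # surfaces the file as an output variable of the `LLMNode`, so later
    # workflow nodes can consume it. In conversations the file is rendered
    # as a markdown inline image, e.g. ![image](<signed file url>).
    return ToolFile(file_key=file_key, mime_type=mime_type, size=len(data))
```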

To integrate file outputs into workflows, adjustments to the frontend code
are necessary.

For multimodal output functionality, updates to related model configurations
are required. Currently, this capability has been applied exclusively to
Google's Gemini models.

Closes #15814.

Screenshots

Before / After (screenshots omitted)

The image is shown twice; I don't know why (possibly an issue in the frontend code?).

To use the multimodal output capability, the Gemini model configurations must be updated. A related PR will be submitted later.

Checklist

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran `dev/reformat` (backend) and `cd web && npx lint-staged` (frontend) to appease the lint gods

laipz8200 and others added 7 commits March 27, 2025 11:23
Signed-off-by: -LAN- <[email protected]>

# Conflicts:
#	api/core/model_runtime/entities/message_entities.py
Enhance `LLMNode` with multimodal capability, introducing support for
image outputs.

This implementation extracts base64-encoded images from LLM responses,
saves them to the storage service, and records the file metadata in the
`ToolFile` table. In conversations, these images are rendered as
markdown-based inline images.
Additionally, the images are included in the LLMNode's output as
file variables, enabling subsequent nodes in the workflow to utilize them.

To integrate file outputs into workflows, adjustments to the frontend code
are necessary.

For multimodal output functionality, updates to related model configurations
are required. Currently, this capability has been applied exclusively to
Google's Gemini models.
Add a detailed notice to guide contributors on avoiding direct usage
of the global variable `models.engine.db`. Instead, they are encouraged
to apply dependency injection to improve code readability, testability,
and maintainability.
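
A sketch of the pattern this commit asks for, assuming SQLAlchemy-style sessions; the class and method names are hypothetical, and only the injection pattern is the point:

```python
class ToolFileRepository:
    """Receives a SQLAlchemy session factory through its constructor
    instead of importing the global `models.engine.db`."""

    def __init__(self, session_factory):
        self._session_factory = session_factory

    def add(self, tool_file) -> None:
        # Open a short-lived session from the injected factory rather
        # than reaching for module-level global state.
        with self._session_factory() as session:
            session.add(tool_file)
            session.commit()


# Production wiring might look like:
#   from sqlalchemy import create_engine
#   from sqlalchemy.orm import Session
#   repo = ToolFileRepository(lambda: Session(create_engine("postgresql://...")))
# A test can instead pass a factory bound to an in-memory SQLite engine.
```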
Clarify that `ToolFile` now stores not only metadata for files generated
by agents but also metadata for files produced by various nodes in a workflow.

For instance, it includes metadata for multimodal output files generated
by an `LLMNode`.
@QuantumGhost force-pushed the feat/support-image-generate-for-gemini branch 2 times, most recently from f78a7db to b8672a6 on April 9, 2025 at 11:51
Development

Successfully merging this pull request may close these issues.

[Feature Request] Support Gemini’s New Multimodal Output