
.Net: Add support for audio, pdf, doc, and docx to chat prompt parser #11919


Open
wants to merge 9 commits into
base: main

Conversation

glorious-beard
Contributor

@glorious-beard glorious-beard commented May 6, 2025

Motivation and Context

Why is this change required?

This change lets template parsers such as the YAML parser embed content types other than text and images for LLMs that support them, such as PDFs for OpenAI and DOCX files for Claude. Without this capability, functions whose prompts have attachments would have to build their chat history manually in code.

What problem does it solve?

See above

What scenario does it contribute to?

Use of additional content types beyond visuals and audio in user messages.

Open Issues Addressed

Description

Chat Prompt Parser

To preserve backward compatibility, I added new content types rather than consolidating the existing binary content types, so that LLM chat service providers can opt in to them. This also reduces the chance of breaking existing code.

Three new content types are added:

  • PdfContent for PDF files. Uses the tag "<pdf>". Allows for Base64 data URIs or standard URIs, similar to ImageContent.
  • DocContent for MS Word .doc files. Uses the tag "<doc>". Allows for Base64 data URIs or standard URIs, similar to ImageContent.
  • DocxContent for MS Word .docx files. Uses the tag "<docx>". Allows for Base64 data URIs or standard URIs, similar to ImageContent.

(NOTE: DocContent and DocxContent are separate mainly because they have different MIME types and different content formats. They could easily be consolidated into a single tag, leaving the LLM provider to distinguish between "doc" and "docx" files. Alternatively, I could also see the case for dropping ".doc" support and requiring the caller to use ".docx" only.)

In addition, the following two content types are now parsed from the XML:

  • AudioContent - Parses the tag "<audio>" with either Base64 data URIs or standard URIs, similar to ImageContent.
  • BinaryContent - Parses the tag "<file>" with either Base64 data URIs or standard URIs, similar to ImageContent.

Here is a sample:

            
<message role='user'>
  This part will be discarded upon parsing
  <text>Make sense of this random assortment of stuff.</text>
  <image>https://fake-link-to-image/</image>
  <audio>data:audio/wav;base64,UklGRiQAAABXQVZFZm10IBAAAAABAAEAIlYAAACABAAZGF0YVgAAAAA</audio>
  <pdf>data:application/pdf;base64,JVBERi0xLjQKJeLjz9MKMyAwIG9iago8PC9UeXBlL1hSZWYvUGFnZXMgNiAwIFIKL1R5cGUvUGFnZS9NZWRpYUJveCBbMCAwIDQ4MCA1MF0KL0NvbnRlbnRzIDw8L0V4dEdTdGF0ZSA8PC9JRCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GMSA8PC9GMiA8PC9GMyA8PC9GNCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GNSA8PC9GNiA8PC9GNyBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GOCAvPj4KZW5kb2JqCjEwIDAgb2JqCjw8L1R5cGUvUGFnZS9NYWRlYUJveCBbMCAwIDQ4MCA1MF0KL0NvbnRlbnRzIDw8L0V4dEdTdGF0ZSA8PC9JRCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GMSA8PC9GMiA8PC9GMyA8PC9GNCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GNSA8PC9GNiA8PC9GNyBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GOCAvPj4KZW5kb2JqCjEwIDAgb2JqCjw8L1R5cGUvUGFnZS9NYWRlYUJveCBbMCAwIDQ4MCA1MF0KL0NvbnRlbnRzIDw8L0V4dEdTdGF0ZSA8PC9JRCBbPDwvTGVuZ3RoIDQ4XQovRm9udCA8PC9GMSA8PC9G</pdf>
  <pdf>https://fake-link-to-pdf/</pdf>
  <doc>data:application/msword;base64,UEsDBBQAAAAIAI+Q1k5a2gAAABQAAAAIAAAAbmFtZS5kb2N4VVQJAAD9AAAACwAAAB4AAAAAA==</doc>
  <doc>https://fake-link-to-doc/</doc>
  <docx>data:application/vnd.openxmlformats-officedocument.wordprocessingml.document;base64,UEsDBBQAAAAIAI+Q1k5a2gAAABQAAAAIAAAAbmFtZS5kb2N4VVQJAAD9AAAACwAAAB4AAAAAA==</docx>
  <docx>https://fake-link-to-docx/</docx>
  <file>data:application/octet-stream;base64,UEsDBBQAAAAIAI+Q1k5a2gAAABQAAAAIAAAAbmFtZS5kb2N4VVQJAAD9AAAACwAAAB4AAAAAA==</file>
  <file>https://fake-link-to-binary/</file>
  This part will also be discarded upon parsing
</message>

Amazon Bedrock

Modified the Converse API request generator to handle the subset of binary content supported by Amazon Bedrock (PDF, DOC, DOCX, and Image), as documented here.

OpenAI

Modified the client to handle PDF content, audio content, and file references when generating a request to an OpenAI (or OpenAI-compatible) client.

Contribution Checklist

@glorious-beard glorious-beard requested a review from a team as a code owner May 6, 2025 20:38
@glorious-beard glorious-beard changed the title Glorious-beard/11044-expand-chat-prompt-parser .Net: Add support for audio, pdf, doc, and docx to chat prompt parser May 6, 2025
@markwallace-microsoft markwallace-microsoft added .NET Issue or Pull requests regarding .NET code kernel Issues or pull requests impacting the core kernel kernel.core labels May 8, 2025
@RogerBarreto RogerBarreto removed the needs discussion Issues that require discussion by the internal Semantic Kernel team before proceeding label May 30, 2025
@RogerBarreto
Member

@glorious-beard I updated the proposal to be more abstract, since it applies to the SemanticKernel.Abstraction package.

Because there will be many different types of documents and binary files, it is better to stay broad rather than specific: avoid introducing special content types and reuse the existing ones that already work.

Given that, I updated the logic to accept a mimetype attribute, as in <binary mimetype="type/subtype"/>, to cover the scenarios where you provide a URI.

For data URI content, the MIME type is picked up automatically from the data:mimeType prefix.
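
As a rough illustration of the rule described in this comment, here is a minimal sketch of how a MIME type could be resolved from either a data URI or an explicit attribute. It is not the code in this PR: `ContentUriSketch` and `Resolve` are hypothetical names, and falling back to the attribute when the data URI carries no media type is an assumption made for the example.

```csharp
#nullable enable
using System;

// Hypothetical helper for illustration only; not part of Semantic Kernel.
public static class ContentUriSketch
{
    // Returns the MIME type to use and whether the content is a data URI.
    public static (string? MimeType, bool IsDataUri) Resolve(string content, string? mimeTypeAttribute)
    {
        const string Prefix = "data:";

        if (content.StartsWith(Prefix, StringComparison.OrdinalIgnoreCase))
        {
            int comma = content.IndexOf(',');
            if (comma < 0)
            {
                throw new FormatException("Data URI is missing the ',' separator.");
            }

            // The header looks like "application/pdf;base64"; the media type
            // is everything before the first ';'.
            string header = content.Substring(Prefix.Length, comma - Prefix.Length);
            string mime = header.Split(';')[0];

            // Assumption: if the data URI has no media type, use the attribute.
            return (mime.Length > 0 ? mime : mimeTypeAttribute, true);
        }

        // Plain URI: the only source of a MIME type is the attribute.
        return (mimeTypeAttribute, false);
    }
}
```

With this sketch, a data URI needs no attribute, while a plain URI only gets a MIME type if the caller supplies one, which matches the behavior described above.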

@RogerBarreto
Member

Updated PR Description

Motivation and Context

Enhance the Chat Prompt XML parsing capability to also support audio and documents.

Description

The following two content types are now supported in the Chat Prompt XML:

  • AudioContent - Parses the tag <audio mimetype="type/subtype"> with either Base64 data URIs or standard URIs, similar to ImageContent.
  • BinaryContent - Parses the tag <binary mimetype="type/subtype"> with either Base64 data URIs or standard URIs, similar to ImageContent.

The mimetype attribute is optional, and can be omitted for Base64 data URIs.

Here is a sample:

<message role='user'>
  This part will be discarded upon parsing
  <text>Summarize all the contents I provided in this message.</text>
  <image mimetype="image/png">https://fake-link-to-image/</image>
  <audio>data:audio/wav;base64,UklGRiQAAAB...</audio>
  <binary>data:application/pdf;base64,UklGRiQAAAB...</binary>
  <binary mimetype="application/pdf">https://fake-link-to-pdf/</binary>
  <binary>data:application/msword;base64,UklGRiQAAAB...</binary>
  <binary mimetype="application/octet-stream">https://fake-link-to-binary/</binary>
</message>
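
To make the shape of this sample concrete, here is a small sketch that reads such a message with `System.Xml.Linq` and lists each child tag, its optional `mimetype` attribute, and its value. It is only an illustration under stated assumptions: `PromptXmlSketch` and `ReadParts` are hypothetical names, and the actual parser in this PR may be implemented quite differently.

```csharp
#nullable enable
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only; not the parser used by Semantic Kernel.
public static class PromptXmlSketch
{
    // Returns one entry per child element of <message>. Loose text that sits
    // directly inside <message> is not an element, so it is skipped here.
    public static (string Tag, string? MimeType, string Value)[] ReadParts(string xml)
    {
        XElement message = XElement.Parse(xml);

        return message.Elements()
            .Select(e => (
                Tag: e.Name.LocalName,
                // The explicit conversion yields null when the attribute is absent.
                MimeType: (string?)e.Attribute("mimetype"),
                Value: e.Value.Trim()))
            .ToArray();
    }
}
```

Each returned entry could then be mapped to a content object by tag name, with the `mimetype` attribute only needed when the value is a plain URI rather than a data URI.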

Contribution Checklist

/// <param name="content">Base64 encoded content or URI.</param>
/// <param name="mimeType">Optional MIME type of the content.</param>
/// <returns>A new instance of <typeparamref name="T"/> with <paramref name="content"/></returns>
private static T CreateBinaryContent<T>(string content, string? mimeType) where T : BinaryContent, new()
Member


nit: move this private method down the file after the public one so that all private methods are grouped together.

Labels
ai connector Anything related to AI connectors kernel.core kernel Issues or pull requests impacting the core kernel .NET Issue or Pull requests regarding .NET code
Development

Successfully merging this pull request may close these issues.

Expanding ChatPromptParser to handle other content types
4 participants