.Net: Add support for audio, pdf, doc, and docx to chat prompt parser #11919
@glorious-beard I updated the proposal to be abstract as this is applied to the As we will have many different types of documents and binary files, it is better to be broader and less specific: rather than introducing any special content types, use the existing ones we already have that work. Given that updated the logic to accept a For
Updated PR Description
Motivation and Context
Enhance the Chat Prompt XML parsing capability to also support audio and documents.
Description
The following 2 contents are now supported from the Chat Prompt XML:
Here is a sample:
<message role='user'>
This part will be discarded upon parsing
<text>Summarize all the contents I provided in this message.</text>
<image mimetype="image/png">https://fake-link-to-image/</image>
<audio>data:audio/wav;base64,UklGRiQAAAB...</audio>
<binary>data:application/pdf;base64,UklGRiQAAAB...</binary>
<binary mimetype="application/pdf">https://fake-link-to-pdf/</binary>
<binary>data:application/msword;base64,UklGRiQAAAB...</binary>
<binary mimetype="octet/stream">https://fake-link-to-binary/</binary>
</message>
Contribution Checklist
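The sample above can be read as "keep the recognized child elements of `<message>`, drop everything else". The following is a minimal sketch of that idea using `System.Xml.Linq`; it is not the parser from this PR. `ParsedItem`, `PromptSketch`, and `ParseMessage` are hypothetical names introduced only for illustration, and the set of recognized tags is assumed from the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical stand-in for a parsed content item; not a Semantic Kernel type.
public sealed record ParsedItem(string Tag, string? MimeType, string Value);

public static class PromptSketch
{
    // Tags this sketch recognizes, assumed from the sample message.
    private static readonly HashSet<string> KnownTags =
        new(StringComparer.OrdinalIgnoreCase) { "text", "image", "audio", "binary" };

    // Returns one item per recognized child element of <message>.
    // Text placed directly under <message> is a text node, not a child
    // element, so it never reaches the loop and is effectively discarded.
    public static List<ParsedItem> ParseMessage(string xml)
    {
        XElement message = XElement.Parse(xml);
        var items = new List<ParsedItem>();

        foreach (XElement child in message.Elements())
        {
            string tag = child.Name.LocalName;
            if (!KnownTags.Contains(tag))
            {
                continue; // Unknown tags are skipped in this sketch.
            }

            // The mimetype attribute is optional, as in the sample.
            string? mimeType = child.Attribute("mimetype")?.Value;
            items.Add(new ParsedItem(tag, mimeType, child.Value.Trim()));
        }

        return items;
    }
}
```

Run against the sample message, this sketch would yield one item per `<text>`, `<image>`, `<audio>`, and `<binary>` element, and the loose line "This part will be discarded upon parsing" would not appear in the result.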
/// <param name="content">Base64 encoded content or URI.</param>
/// <param name="mimeType">Optional MIME type of the content.</param>
/// <returns>A new instance of <typeparamref name="T"/> with <paramref name="content"/></returns>
private static T CreateBinaryContent<T>(string content, string? mimeType) where T : BinaryContent, new()
nit: move this private method down the file after the public one so that all private methods are grouped together.
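For readers without the diff open, here is one way a helper with the signature quoted above might tell a Base64 data URI from a regular URI. This is a hedged sketch, not the body from the PR: `SketchBinaryContent` is a hypothetical stand-in defined here so the example compiles on its own, and falling back to the media type embedded in the data URI when no `mimeType` is given is an assumption.

```csharp
using System;

// Hypothetical stand-in for a binary content type; it is NOT the
// Semantic Kernel BinaryContent class and only mirrors the idea.
public class SketchBinaryContent
{
    public string? DataUri { get; set; }
    public Uri? Uri { get; set; }
    public string? MimeType { get; set; }
}

public static class BinaryContentSketch
{
    private const string DataPrefix = "data:";

    // Creates a T from either a data URI ("data:<mediatype>;base64,<data>")
    // or an absolute URI such as "https://...".
    public static T Create<T>(string content, string? mimeType)
        where T : SketchBinaryContent, new()
    {
        var result = new T();
        result.MimeType = mimeType;

        if (content.StartsWith(DataPrefix, StringComparison.OrdinalIgnoreCase))
        {
            result.DataUri = content;

            // Assumption: when no MIME type is passed, reuse the media type
            // that sits between "data:" and the first ';' or ','.
            if (mimeType is null)
            {
                int end = content.IndexOfAny(new[] { ';', ',' });
                if (end > DataPrefix.Length)
                {
                    result.MimeType = content.Substring(DataPrefix.Length, end - DataPrefix.Length);
                }
            }
        }
        else
        {
            // Throws UriFormatException if content is not an absolute URI.
            result.Uri = new Uri(content);
        }

        return result;
    }
}
```

The generic constraint `where T : ..., new()` is what lets a single helper serve several content types, which matches the shape of the signature under review.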
Motivation and Context
Why is this change required?
This change allows template parsers like the YAML parser to embed content types other than just text and images for LLMs that support additional content types, like PDFs for OpenAI and DOCXs for Claude. Without this capability, functions with prompts that have attachments would have to manually build their chat history in code.
What problem does it solve?
See above
What scenario does it contribute to?
Use of additional content types beyond visuals and audio in user messages.
Open Issues Addressed
Description
Chat Prompt Parser
To preserve backward compatibility, rather than consolidating binary content types, I chose to go with adding additional content types so that LLM chat service providers could opt-in to new content types. It also reduces the chances of breaking existing code.
3 new content types are created:
PdfContent for PDF files. Uses the tag "<pdf>". Allows for Base64 data URIs or standard URIs, similar to ImageContent.
DocContent for MS Word .doc files. Uses the tag "<doc>". Allows for Base64 data URIs or standard URIs, similar to ImageContent.
DocxContent for MS Word .docx files. Uses the tag "<docx>". Allows for Base64 data URIs or standard URIs, similar to ImageContent.
(NOTE: DocContent and DocxContent are mainly separate because they have different MIME types and different content formats, though they could easily be consolidated into a single tag and just let the LLM provider handle distinguishing between "doc" and "docx" files. Alternately, I could also see the case for dropping ".doc" support and requiring the caller to only use ".docx".)
In addition, the following 2 contents are now parsed from the XML:
AudioContent - Parses the tag "<audio>" with either Base64 data URIs or standard URIs, similar to ImageContent.
BinaryContent - Parses the tag "<file>" with either Base64 data URIs or standard URIs, similar to ImageContent.
Here is a sample:
Amazon Bedrock
Modified the Converse API request generator to handle the subset of binary content supported by Amazon Bedrock (PDF, DOC, DOCX, and Image), as documented here.
OpenAI
Modified the client to handle PDF content, audio content, and file references when generating a request to an OpenAI (or OpenAI compatible) client.
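To make the Amazon Bedrock part of this description more concrete, the sketch below shows one plausible way to map the MIME types of the three document kinds onto short format names ("pdf", "doc", "docx"). It is illustrative only and is not the code from this PR: `DocumentFormatSketch` and `ToDocumentFormat` are hypothetical names, and returning `null` for anything else is an assumption about how unsupported content might be signalled.

```csharp
using System;
using System.Collections.Generic;

public static class DocumentFormatSketch
{
    // Standard MIME types for PDF, legacy Word (.doc), and Word (.docx).
    private static readonly Dictionary<string, string> FormatsByMimeType =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["application/pdf"] = "pdf",
            ["application/msword"] = "doc",
            ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"] = "docx",
        };

    // Returns the short format name for a supported document MIME type,
    // or null when the type is missing or not in the table (assumption:
    // the caller decides whether to skip the content or raise an error).
    public static string? ToDocumentFormat(string? mimeType)
    {
        if (mimeType is not null && FormatsByMimeType.TryGetValue(mimeType, out string? format))
        {
            return format;
        }

        return null;
    }
}
```

Keeping the table in one place makes the ".doc" versus ".docx" distinction raised in the note above a one-line decision for each provider.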
Contribution Checklist