Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect Content-Length header with StringIO body #6917

Open
mgraczyk opened this issue Mar 21, 2025 · 5 comments
Open

Incorrect Content-Length header with StringIO body #6917

mgraczyk opened this issue Mar 21, 2025 · 5 comments

Comments

@mgraczyk
Copy link

mgraczyk commented Mar 21, 2025

When requests is used with an io.StringIO as the data type, and the body contains characters whose utf-8 encoding is multiple bytes, the Content-Length header is set incorrectly.

Looking at the implementation of super_len, it appears that io.StringIO has its length measured using seek and tell.
It has been implemented that way since June 2016 (af7729f).

It looks like this was fixed for str inputs in #6586 in 2023 but was never fixed for io.StringIO

I am happy to send a PR if the implementation is straightforward.
Off the top of my head I don't know how to count the bytes in a utf-8 encoded StringIO without copying, and previous PRs have tried to avoid a copy in super_len

Expected Result

Content-Length should match the number of bytes sent when using io.StringIO

Actual Result

Content-Length is the length of the string, not the bytes sent.

Reproduction Steps

Run the following script, which shows the problem in detail

import io
import requests
from urllib3.connection import HTTPConnection
from requests.utils import super_len
from requests.models import PreparedRequest

# A string that is 1 character but 4 bytes in UTF-8.
# requests will always send 4 bytes, but the Content-Length header depends on the type passed as `data`.
value = "💩"

body_types = [{
  "name": "str",
  "value": value,
}, {
  "name": "bytes",
  "value": value.encode("utf-8"),
}, {
  "name": "io.BytesIO",
  "value": io.BytesIO(value.encode("utf-8")),
},
{
  "name": "io.StringIO",
  "value": io.StringIO(value),
}]

print("## Super Len")
for body_type in body_types:
  print("Body Type:", body_type["name"])
  print("Super Len:", super_len(body_type["value"]))
  p = PreparedRequest()
  p.prepare(
      method="POST",
      url="http://example.com",
      data=body_type["value"],
  )
  print("Prepared Headers:", p.headers)

# Monkey patch to print the data sent.
old_send = HTTPConnection.send
def new_send(self, data):
  print("Sending Data:", data)
  old_send(self, data)
HTTPConnection.send = new_send


print("## Requests")
for body_type in body_types:
  r = requests.post(
    "http://example.com",
    data=body_type["value"],
  )

Here is my output:

## Super Len
Body Type: str
Super Len: 4
Prepared Headers: {'Content-Length': '4'}
Body Type: bytes
Super Len: 4
Prepared Headers: {'Content-Length': '4'}
Body Type: io.BytesIO
Super Len: 4
Prepared Headers: {'Content-Length': '4'}
Body Type: io.StringIO
Super Len: 1
Prepared Headers: {'Content-Length': '1'}
## Requests
Sending Data: b'POST / HTTP/1.1\r\nHost: example.com\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\nContent-Length: 4\r\n\r\n'
Sending Data: b'\xf0\x9f\x92\xa9'
Sending Data: b'POST / HTTP/1.1\r\nHost: example.com\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\nContent-Length: 4\r\n\r\n'
Sending Data: b'\xf0\x9f\x92\xa9'
Sending Data: b'POST / HTTP/1.1\r\nHost: example.com\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\nContent-Length: 4\r\n\r\n'
Sending Data: b'\xf0\x9f\x92\xa9'
Sending Data: b'POST / HTTP/1.1\r\nHost: example.com\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\nContent-Length: 1\r\n\r\n'
Sending Data: b'\xf0\x9f\x92\xa9'

Notice that the request is always the same, except that the Content-Length header is 1 instead of 4 when io.StringIO is used.

System Information

{
  "chardet": {
    "version": "5.2.0"
  },
  "charset_normalizer": {
    "version": "3.4.1"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "3.10"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.12.9"
  },
  "platform": {
    "release": "22.6.0",
    "system": "Darwin"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.32.3"
  },
  "system_ssl": {
    "version": "30400010"
  },
  "urllib3": {
    "version": "2.3.0"
  },
  "using_charset_normalizer": false,
  "using_pyopenssl": false
}
@sigmavirus24
Copy link
Contributor

To quote myself from 2023

Bytes are the language of the internet

We cannot silently do a copy like this into a BytesIO largely because we can only guess the encoding, as we discussed and eventually did in #6586 and also because any silent copy that has happened in the past has been an unwelcome surprise for some in places they wish to not be surprised. (For example, the existing behaviour of multipart/form-data encoding creates a large in-memory string in some cases which is then given to http.client.) Introducing another case like that is not acceptable.

I'd be happy to include a warning that is emitted unconditionally when we see an io.StringIO passed in with the caution that multi-byte characters may not be properly measured for length. We can also add an admonition to the documentation for this to ensure that we warn people away from this but I see no good way to practically fix this in the same way as #6586.

@mgraczyk
Copy link
Author

mgraczyk commented Mar 21, 2025

@sigmavirus24

Thanks for the quick reply.

I understand the concerns, but I don't fully agree.

we can only guess the encoding

As I understand it, the problem is not that we can't guess the encoding. The problem is that we decide for the user to encode as utf-8, but then write the wrong size to the header. You can see that the sent bytes are utf-8, I believe that conversion happens in urllib3 but I'm not certain.

And isn't this the same situation for str?
We don't know what encoding the user wants their str to be, so we put it on the wire as utf-8 and set the length accordingly.
Couldn't we do the same for StringIO?

no good way to practically fix this

What would go wrong if you did something like the following

if isinstance(o, io.StringIO):
  current_position = o.tell()
  try:
    total_size = len(o.read().encode("utf-8"))
  finally:
    o.seek(current_position)

This would use more memory but only during this call, so it seems unlikely to increase the high-water mark for the call to requests.post, and I believe that could be mitigated by processing in chunks (I'm not a unicode expert, this requires sum(len(c.encode()) for c in chunk(o) == len(o.encode()) which I'm not sure is easy)

@sigmavirus24
Copy link
Contributor

As I understand it, the problem is not that we can't guess the encoding.

No, what I'm saying is that in order to translate this to bytes to get the correct encoding and what the server may be expecting, we have to guess at the correct length.

You can see that the sent bytes are utf-8, I believe that conversion happens in urllib3 but I'm not certain.

I don't recall if we try to pre-process it, if urllib3 does it, or if (this is my instinctual guess) http.client is doing that.

And isn't this the same situation for str?

No. With file-like objects, it's passed down to http.client and read in chunks. What it does from there is a totally different problem. We could write a wrapper around io.StringIO to just-in-time guess all of this but that would always force chunked transfer encoding

@mgraczyk
Copy link
Author

to get the correct encoding and what the server may be expecting, we have to guess at the correct length.

requests always converts small StringIO inputs to utf-8 encoding (at least when an encoding is not otherwise specified). It does not guess anywhere in this case, it just encodes as utf-8 and sends it.
Since it always uses utf-8, it does not need to guess at the correct length. The length of the http payload is always the size of the utf-8 encoded StringIO.

Is my understanding wrong that Content-Length is the size, in bytes, of the payload?

I don't recall if we try to pre-process it, if urllib3 does it, or if (this is my instinctual guess) http.client is doing that.

It looks like urllib3 is doing it, but regardless it is always sent as utf-8, so it seems reasonable to use the utf-8 length.

With file-like objects

I mean literal str, not a file-like object.
I am saying that str is identical to StringIO in this regard. You don't know what encoding the user intended, so you choose to use utf-8 for str, and super_len chooses to use the length of the utf-8 encoding. I don't understand why StringIO should be treated differently from str, it's just a wrapper around str right?

@mgraczyk
Copy link
Author

mgraczyk commented Mar 21, 2025

A very concise demonstration of the problem:

Would anyone expect these two requests to be different?

value = "💩"

print("REQUEST 1")
print("=" * 20)
r = requests.post(
  "http://example.com",
  data=value,
)
print("=" * 20)

print("REQUEST 2")
print("=" * 20)
r = requests.post(
  "http://example.com",
  data=io.StringIO(value),
)
print("=" * 20)

In fact, they are different:

REQUEST 1
====================
Sending Data: POST / HTTP/1.1
Host: example.com
User-Agent: python-requests/2.32.3
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Content-Length: 4


Sending Data: 💩
====================
REQUEST 2
====================
Sending Data: POST / HTTP/1.1
Host: example.com
User-Agent: python-requests/2.32.3
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Content-Length: 1


Sending Data: 💩
====================

And I believe the second one is just wrong, Content-Length is 1 but the payload is 4 bytes.

Reading the RFC, I believe the current request violates this part of RFP 9110:
https://www.rfc-editor.org/rfc/rfc9110#section-8.6-12

hackspaces added a commit to hackspaces/requests that referenced this issue Mar 22, 2025
Correctly calculate Content-Length for io.StringIO objects containing
multi-byte characters by measuring the UTF-8 encoded byte length.

Added test case with emoji character to verify the fix.

Fixes: psf#6917
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants