Implement HTTP(s) support #468

pjbull · 2024-09-01T11:21:30Z

Initial pass at http.

implemented with standard library
tested against Python http.server
will try writing/deleting files with PUT/DELETE in case your server allows it
will try parsing "directories" from the returned HTML; this seems to grab the right things from Python's http.server, and from my googling it looks like it should also work for Apache and nginx file servers (users can override this method for their own server)
Bonus fix, use .anchor instead of .cloud_prefix where appropriate. For cloud providers, .anchor is the same as .cloud_prefix (e.g., s3://). For http, the anchor should be the server http://example.com and the .cloud_prefix should be http://. Constructing new paths should use .anchor
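
For illustration, parsing "directories" out of a returned listing page could be done with the standard library roughly as follows. This is a minimal sketch, not the code in this PR: `LinkCollector`, `list_children`, and the sample HTML are hypothetical, and the filtering of sort and parent links is an assumption about typical Apache/nginx-style listings.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a directory-listing page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def list_children(dir_url, html):
    """Return absolute child URLs found in a listing page for dir_url."""
    parser = LinkCollector()
    parser.feed(html)
    children = []
    for href in parser.hrefs:
        # Assumption: skip column-sort links ("?C=N;O=D") and parent links
        if href.startswith("?") or href.startswith("../"):
            continue
        children.append(urljoin(dir_url, href))
    return children


# Hypothetical listing, loosely modeled on a simple file server's output
SAMPLE_LISTING = """
<ul>
<li><a href="dirC/">dirC/</a></li>
<li><a href="fileA.txt">fileA.txt</a></li>
<li><a href="../">Parent</a></li>
</ul>
"""

children = list_children("http://example.com/root/", SAMPLE_LISTING)
```

Servers that render listings differently would need their own filter, which is why making the parsing method overridable matters.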

Limitations/caveats:

Assumes that url must have suffix (e.g., http://example.com/file.txt) to be a file; if no suffix, assumes dir. This is definitely not true of real-world URLs, but maybe is an ok assumption for anything serving files?
Assumes urls must end in / to be directories (user can pass custom test for urls if they have a different scenario)

Left to do:

Add https
Expose all the parsed url components publicly (not behind HttpPath._url).
Update default file + dir function to be based on a trailing slash
Allow also overriding the file vs. dir function
Add tests for http specific functions
Turn off noisy logs for test http server by default
Add tests that url search strings and fragments are persisted correctly
Documentation (table, caveats, docstrings, headers, auth (add cookies), files v dirs, etc.)
HTTP Test suite often passes locally, but not always; need to debug and make less flaky

Closes #455

github-actions · 2024-09-01T11:22:41Z

🚀 Deployed on https://deploy-preview-468--gallant-agnesi-5f7bb3.netlify.app

jayqi · 2024-09-04T15:52:41Z

> Bonus fix, use .anchor instead of .cloud_prefix where appropriate. For cloud providers, .anchor is the same as .cloud_prefix (e.g., s3://). For http, the anchor should be the server http://example.com and the .cloud_prefix should be http://. Constructing new paths should use .anchor

We shouldn't call this "anchor". Anchor has a different meaning in a close enough context that I think this will be confusing. In particular, "anchor" colloquially refers to "anchor tags" in HTML documents, which you can reference within a URL with fragment identifiers #someref. The part preceding the :// is called "scheme" in URL/URI vocabulary and we should probably stick with that if we're going to call it anything. (Reference)

> Assumes that url must have suffix (e.g., http://example.com/file.txt) to be a file; if no suffix, assumes dir. This is definitely not true of real-world URLs, but maybe is an ok assumption for anything serving files?

I'm not sure I like this. There are often genuine files without file extensions, usually plain-text files, e.g., README, LICENSE, Makefile.

My impression is that, while it's not absolute, the most common convention and default for many web servers and frameworks is to use a trailing slash explicitly to serve directories.

pjbull · 2024-09-04T21:27:37Z

> We shouldn't call this "anchor". Anchor has a different meaning in a close enough context that I think this will be confusing.

Hm, yeah, I can see this is potentially confusing, but we're not calling it "anchor" arbitrarily. We're looking for the right analogy to populate the existing .anchor pathlib property. In the pathlib context, .anchor is the concatenation of drive + root. We're using it to refer to scheme + netloc (not just the scheme), which, reasoning by analogy, feels about right. The difference for the cloud providers is that the scheme serves as both the root and the drive (e.g., listing s3:// lists the buckets you have access to). IMO docs can cover this source of potential confusion without too much worry, especially since anything referring to an anchor in a URL is most often called the "fragment".
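
To make the analogy concrete, here is a small sketch comparing pathlib's `.anchor` with scheme + netloc for an http URL. The helpers `http_anchor` and `http_prefix` are hypothetical names used only for illustration, not the properties implemented in this PR, and they only cover the http case (for s3:// the anchor described above is the scheme alone).

```python
from pathlib import PurePosixPath, PureWindowsPath
from urllib.parse import urlparse


def http_anchor(url):
    """Scheme + netloc, the URL analogue of pathlib's drive + root."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"


def http_prefix(url):
    """Scheme only, matching the .cloud_prefix described in this PR."""
    return f"{urlparse(url).scheme}://"


# pathlib: anchor is drive + root
windows_anchor = PureWindowsPath("C:/Users/me/file.txt").anchor
posix_anchor = PurePosixPath("/usr/bin/python").anchor

# http: anchor is the server, prefix is just the scheme
anchor = http_anchor("http://example.com/dir/file.txt")
prefix = http_prefix("http://example.com/dir/file.txt")
```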

> use a trailing slash explicitly to serve directories.

Yeah, this was the other default rule I considered—I think you're right it's probably a more reliable default. That is how the python server works (except for the root, which does not redirect to the slash). Also planning for the method to be overridable so people can configure for their particular servers if it is different.
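
A rough sketch of the two default rules under discussion, to show where they disagree. Both function names are hypothetical, and treating an empty path (the bare server root) as a directory is an assumption made here to cover the root case mentioned above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_dir_by_suffix(url):
    """Suffix rule: no suffix on the last path segment means directory."""
    return PurePosixPath(urlparse(url).path).suffix == ""


def is_dir_by_trailing_slash(url):
    """Trailing-slash rule: a path ending in "/" means directory."""
    path = urlparse(url).path
    # Assumption: an empty path (bare server root) counts as a directory
    return path == "" or path.endswith("/")


# Extensionless files such as README are where the two rules differ
readme = "http://example.com/README"
suffix_says_dir = is_dir_by_suffix(readme)
slash_says_dir = is_dir_by_trailing_slash(readme)
```

Under the suffix rule the README URL is treated as a directory, while the trailing-slash rule treats it as a file.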

MattOates · 2024-10-09T08:37:45Z

So one thing to be aware of is your code assumes no one would ever do HttpPath("http://username:password@host/path/file.txt") for basic auth.

url.netloc is the string "username:password@host" in this instance. I'm currently struggling to see how, in the cloudpathlib design, I can actually inject the auth from the URL into the client too. It looks like you would always have to explicitly provide a client, which doesn't feel great from the user side. Perhaps we could introduce some standard parameters for user/password auth (for http/sftp/ftp, etc.) that get passed to the client from the URL when the netloc contains an @?

pjbull · 2024-10-14T22:17:47Z

> So one thing to be aware of is your code assumes no one would ever do HttpPath("http://username:password@host/path/file.txt") for basic auth.

@MattOates I guess I was thinking that this just sticks around on that path and any paths derived from it (since we don't mess with netloc) and should "just work." Does it not? Maybe there are just some bugs to clean up (sorry, haven't had a chance to poke at it and confirm/deny).

> Looks more like you would always have to explicitly provide a client which feels not incredible from the user side.

I think that we could (1) support env vars for basic auth like most of the cloud providers have for auth, (2) point people towards how to set the default client, or (3) recommend an explicit client.
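
As a sketch of option (1), basic-auth credentials could be read from environment variables and wired into a standard-library opener. The variable names `HTTP_BASIC_USER` and `HTTP_BASIC_PASSWORD` and the function `build_basic_auth_opener` are hypothetical; nothing here is an existing cloudpathlib setting, and how such an opener would be handed to the client is left open.

```python
import os
import urllib.request


def build_basic_auth_opener(base_url, env=os.environ):
    """Build a urllib opener, adding basic auth if credentials are set."""
    # Hypothetical variable names, chosen only for this example
    user = env.get("HTTP_BASIC_USER")
    password = env.get("HTTP_BASIC_PASSWORD")

    handlers = []
    if user is not None and password is not None:
        manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        # realm=None makes the credentials apply to any realm under base_url
        manager.add_password(None, base_url, user, password)
        handlers.append(urllib.request.HTTPBasicAuthHandler(manager))

    return urllib.request.build_opener(*handlers)


opener = build_basic_auth_opener(
    "http://example.com/",
    env={"HTTP_BASIC_USER": "user", "HTTP_BASIC_PASSWORD": "secret"},
)
```

Keeping credentials out of the URL this way also avoids the question of what a path's string representation should show.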

jayqi · 2024-10-18T01:32:25Z

> I guess I was thinking that this just sticks around on that path and any paths derived from it

Does that mean we're going to print out the username and password in plaintext in string representations?

pjbull · 2025-02-14T20:39:30Z

> Does that mean we're going to print out the username and password in plaintext in string representations?

Yeah, FWIW urllib is also happy to just print that out. We could be more conservative, but I'm not inclined to add a special case if the standard lib doesn't.

In [1]: from urllib.parse import urlparse

In [2]: urlparse("http://user:[email protected]")
Out[3]: ParseResult(scheme='http', netloc='user:[email protected]', path='', params='', query='', fragment='')
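
For reference, the parsed result also exposes the credentials separately, so a more conservative representation would be possible if it were ever wanted. The sketch below is only an illustration of that option, not something this PR does; `redact_credentials` and the example URL are hypothetical.

```python
from urllib.parse import urlparse


def redact_credentials(url):
    """Return url with any user:password@ portion of the netloc removed."""
    parts = urlparse(url)
    if parts.username is None and parts.password is None:
        return url

    # Rebuild the netloc from host and port only.
    # Caveat: hostname is lowercased and IPv6 brackets are not restored.
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return parts._replace(netloc=host).geturl()


parsed = urlparse("http://user:secret@example.com:8080/path/file.txt")
redacted = redact_credentials("http://user:secret@example.com:8080/path/file.txt")
```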

codecov · 2025-02-14T21:36:27Z

Codecov Report

Attention: Patch coverage is 96.07843% with 8 lines in your changes missing coverage. Please review.

Project coverage is 93.6%. Comparing base (6a6d3c5) to head (0554ecd).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| cloudpathlib/http/httpclient.py | 94.1% | 6 Missing ⚠️ |
| cloudpathlib/http/httppath.py | 97.6% | 2 Missing ⚠️ |

Additional details and impacted files

@@           Coverage Diff            @@
##           master    #468     +/-   ##
========================================
+ Coverage    93.4%   93.6%   +0.2%     
========================================
  Files          23      26      +3     
  Lines        1800    1996    +196     
========================================
+ Hits         1682    1870    +188     
- Misses        118     126      +8

| Files with missing lines | Coverage Δ | |
|---|---|---|
| cloudpathlib/__init__.py | `93.7% <100.0%> (+0.8%)` | ⬆️ |
| cloudpathlib/cloudpath.py | `94.5% <100.0%> (+0.3%)` | ⬆️ |
| cloudpathlib/http/__init__.py | `100.0% <100.0%> (ø)` | |
| cloudpathlib/http/httppath.py | `97.6% <97.6%> (ø)` | |
| cloudpathlib/http/httpclient.py | `94.1% <94.1%> (ø)` | |

... and 2 files with indirect coverage changes

pjbull · 2025-04-22T05:36:28Z

OK @jayqi, this is fully working and ready for review!

jayqi · 2025-05-02T22:36:18Z

tests/test_cloudpath_file_io.py

-        (root / "fileA").write_text("fileA")
+        (root / "dirC" / "dirD" / "fileD.txt").write_text("fileD")
+        (root / "dirC" / "fileC.txt").write_text("fileC")
+        (root / "fileA.txt").write_text("fileA")


Are we still covering cases elsewhere where a file has no extension?

pjbull · 2025-05-03

Will make sure we have some.

jayqi · 2025-05-02T22:42:19Z

tests/test_cloudpath_file_io.py

    cloud_root.mkdir()

    _make_glob_directory(cloud_root)

-    local_root = tmp_path / "glob-tests"
+    local_root = tmp_path / "glob-tests/"


I'm generally nervous about all of these tests that add trailing slashes to all of our paths. Are we at risk of changing what our tests are covering for the cloud providers? Maybe it's fine, Codecov doesn't say our coverage is going down.

pjbull · 2025-05-03

I'll make sure we have explicit tests for that.

pjbull · 2025-05-03

Making sure we still recognize dirs properly is tested here:

cloudpathlib/tests/test_cloudpath_file_io.py

Lines 326 to 347 in 6a6d3c5

    def test_is_dir_is_file(rig, tmp_path):
        # test on directories
        dir_slash = rig.create_cloud_path("dir_0/")
        dir_no_slash = rig.create_cloud_path("dir_0")
        dir_nested_slash = rig.create_cloud_path("dir_1/dir_1_0/")
        dir_nested_no_slash = rig.create_cloud_path("dir_1/dir_1_0")

        for test_case in [dir_slash, dir_no_slash, dir_nested_slash, dir_nested_no_slash]:
            assert test_case.is_dir()
            assert not test_case.is_file()

        file = rig.create_cloud_path("dir_0/file0_0.txt")
        file_nested = rig.create_cloud_path("dir_1/dir_1_0/file_1_0_0.txt")

        for test_case in [file, file_nested]:
            assert test_case.is_file()
            assert not test_case.is_dir()

        # does not exist (same behavior as pathlib.Path that does not exist)
        non_existent = rig.create_cloud_path("dir_0/not_a_file")
        assert not non_existent.is_file()
        assert not non_existent.is_dir()

jayqi · 2025-05-02T22:42:42Z

cloudpathlib/http/httpclient.py

+    def _move_file(self, src: HttpPath, dst: HttpPath, remove_src: bool = True) -> HttpPath:
+        # .fspath will download the file so the local version can be uploaded
+        self._upload_file(src.fspath, dst)
+        if remove_src:
+            self._remove(src)
+        return dst


Makes me nervous that this is not done as a transaction. Is there a way to check PUT or DELETE support up front?

pjbull · 2025-05-03

_upload_file will throw if it doesn't get a 200 response on finishing, so it should never be the case that we remove a file that was not uploaded. The only risk is that the subsequent _remove fails (and raises) when a server does not support it. In that case, failing pre-emptively in _move_file rather than raising from _remove does not seem worth an extra server call.

We could catch and log a warning that it was not removed before reraising?

jayqi · 2025-05-04

Having a clear error message that indicates the file was copied seems worth it to me. Having a partial move feels like a surprising thing that could be confusing and could cause inconsistency or other correctness issues in certain cases.
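
One way the suggestion above could look, as a standalone sketch rather than the client code in this PR: the upload and remove steps are passed in as plain callables, and `PartialMoveError` and `move_file` are hypothetical names. The upload is assumed to raise on failure, as described for _upload_file.

```python
class PartialMoveError(RuntimeError):
    """Raised when the copy succeeded but the source could not be removed."""


def move_file(upload, remove, src, dst, remove_src=True):
    """Copy src to dst, then remove src; report a partial move clearly."""
    # Assumed to raise on failure, so the source is never removed
    # unless the upload completed.
    upload(src, dst)

    if remove_src:
        try:
            remove(src)
        except Exception as exc:
            # Keep the original error as the cause so it is not hidden
            raise PartialMoveError(
                f"Copied {src} to {dst}, but removing the source failed; "
                "both files now exist."
            ) from exc
    return dst
```

Chaining with `from exc` keeps the server's original failure visible while the message states that the destination was already written.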

pjbull force-pushed the 455-http branch from 256d99b to db0b813 Compare September 13, 2024 10:38

pjbull mentioned this pull request Sep 19, 2024

Add FTP backend #26

Open

TomNicholas mentioned this pull request Nov 22, 2024

Paths as URIs zarr-developers/VirtualiZarr#243

Merged

25 tasks

pjbull force-pushed the 455-http branch from 8dddc7f to fd0ab3f Compare February 17, 2025 22:44

pjbull added 15 commits April 20, 2025 21:14

improve http docs

698ab4a

add table

7064f34

lint

402f4fe

try skipping http rigs on windows in CI

14fd932

more stable tests

01305b3

test flakiness

e44c495

refresh cert

16d0137

flaky test fix

4fafb7e

simplify test servers

d0e819a

possibly?

86f5847

redo certs for 127.0.0.1

3126f61

update command

8cbdff3

Remove pytz and adjust sleep

1443ff5

update rigs

da280b6

update missing timestap

b71e40c

pjbull force-pushed the 455-http branch from ebeba8b to b71e40c Compare April 21, 2025 04:15

more resilient

c259f6e

sleepier

ed53b45

Tweaks

09a07d3

changelog

7208f76

pjbull changed the title ~~WIP: Implement HTTP~~ WIP: Implement HTTP(s) support Apr 22, 2025

pjbull changed the title ~~WIP: Implement HTTP(s) support~~ Implement HTTP(s) support Apr 22, 2025

jayqi reviewed May 2, 2025

Add explicit filename tests

0554ecd
