revisit "subpath"'s "serializer" and "parser" sections #448

jkowalleck · 2025-04-06T10:47:10Z

While the subpath description in the spec is clear, after the PRs in this issue (#379), I still feel that the "serializer" and "parser" sections need some work:

purl-spec/PURL-SPECIFICATION.rst

Lines 334 to 344 in d3beaa3

- If the ``subpath`` is not empty and not composed only of empty, '.' and '..'

segments:

- Append '#' to the ``purl``

- Strip the ``subpath`` from leading and trailing '/'

- Split this on '/' as segments

- Discard empty, '.' and '..' segments

- Percent-encode each segment

- UTF-8-encode each segment if needed in your programming language

- Join the segments with '/'

- Append this to the ``purl``

Serializer: if subpath MUST NOT contain empty segments and segments equal to . and .., where are these segments coming from? Ditto for leading and trailing / (which I consider a particular case of empty segments).

purl-spec/PURL-SPECIFICATION.rst

Lines 358 to 368 in d3beaa3

- Split the ``purl`` string once from right on '#'

- The left side is the ``remainder``

- Strip the right side from leading and trailing '/'

- Split this on '/'

- Discard any empty string segment from that split

- Percent-decode each segment

- Discard any '.' or '..' segment from that split

- UTF-8-decode each segment if needed in your programming language

- Join segments back with a '/'

- This is the ``subpath``

Parser: it doesn't specify how to deal with an encoded solidus /:

../.. will result in an empty subpath.

..%2F.. will result in a single ../.. segment, that will result in the same subpath as two .. segments.

Originally posted by @ppkarwasz in #379

The text was updated successfully, but these errors were encountered:

jkowalleck · 2025-04-06T11:45:52Z

While the subpath description in the spec is clear, after the PRs in this issue, I still feel that the "serializer" and "parser" sections need some work:

purl-spec/PURL-SPECIFICATION.rst

Lines 334 to 344 in d3beaa3

If the subpath is not empty and not composed only of empty, '.' and '..'
segments:

Append '#' to the purl

Strip the subpath from leading and trailing '/'

Split this on '/' as segments

Discard empty, '.' and '..' segments

Percent-encode each segment

UTF-8-encode each segment if needed in your programming language

Join the segments with '/'

Append this to the purl

Serializer: if subpath MUST NOT contain empty segments and segments equal to . and .., where are these segments coming from? Ditto for leading and trailing / (which I consider a particular case of empty segments).

Where the "forbidden" path-segments might come from does not matter for the process, does it?

I don't want to go into the details for the sake of this ticket. If needed, we could open an extra discussion based on the following practical example:
I could craft a yarn setup which subpackages and workspaces, where the packages depend on another and uses relative paths like file:.//../packages/other/.
For this example, some unsanitized purl could look like pkg:npm/[email protected]#packages/foo/.//../packages/other/. In this particular case, the implementation that gathered the path-data would resolve these paths, before handing the purl components over to a purl-library, to prevent well-expected information loss.
Anyway, such a path-resolve process is up to the original process that gathered the data, not to the purl-library that implements the spec. THe library simply follows the rules: no empty segments, no ./...

purl-spec/PURL-SPECIFICATION.rst

Lines 358 to 368 in d3beaa3

Split the purl string once from right on '#'

The left side is the remainder

Strip the right side from leading and trailing '/'

Split this on '/'

Discard any empty string segment from that split

Percent-decode each segment

Discard any '.' or '..' segment from that split

UTF-8-decode each segment if needed in your programming language

Join segments back with a '/'

This is the subpath

Parser: it doesn't specify how to deal with an encoded solidus /:

../.. will result in an empty subpath.

true. is this unexpected?

..%2F.. will result in a single ../.. segment, that will result in the same subpath as two .. segments.

let me step this through:
given subpath: ..%2F..

..%2F.. | stripped leading/trailing /
{..%2F..} | split on /
{..%2F..} | discarded empty segments
{../..} | percent-decode segments
{../..} | discarded ./.. segments
{../..} | utf8 decoded segments
../.. | joined with a slach(/)
resulting subpath: ../.. -- QED

per spec, after percent-decoding, there must not be a / in any segment

purl-spec/PURL-SPECIFICATION.rst

Lines 213 to 215 in d3beaa3

    
           - When percent-decoded, a segment: 
        
             - MUST NOT contain a '/'

issues i see:

there must not be a / in any segment
possible solution A: add an instruction/step in the instructions, right after "Percent-decode each segment":

If any segment contains a '/', then treat this PURL as invalid.
possible solution B: add an instruction/step in the instructions, right after "Percent-decode each segment":

Discard any segment that contains a '/' from that split
possible solution X: add other solutions in the comments.
the process joins on / to produce a final result.
- Joining path segments is not in the domain of PURL.
- I think the spec should stop after the path-segments are clear. Joining segments on FileSystem/OperatingSystem dependent path separators (/, \, >, etc - see https://en.wikipedia.org/wiki/Path_(computing) for more) is out of scope of the spec.
- The same with environment/FileSystem dependent character escapes.
- I've drafted a PR to discuss the details: subpath: parsing stops after segments are clear #449

ppkarwasz · 2025-04-07T16:17:08Z

purl-spec/PURL-SPECIFICATION.rst
Lines 334 to 344 in d3beaa3

If the subpath is not empty and not composed only of empty, '.' and '..'
segments:

Append '#' to the purl

Strip the subpath from leading and trailing '/'

Split this on '/' as segments

Discard empty, '.' and '..' segments

Percent-encode each segment

UTF-8-encode each segment if needed in your programming language

Join the segments with '/'

Append this to the purl

Serializer: if subpath MUST NOT contain empty segments and segments equal to . and .., where are these segments coming from? Ditto for leading and trailing / (which I consider a particular case of empty segments).

Where the "forbidden" path-segments might come from does not matter for the process, does it?

The serializer starts with a subpath that follows the PURL spec. Unless the Package URL library allows the creation of objects without validating them the subpath string will not have any "forbidden" path-segments. So the serialization algorithm can be simplified:

Append # to the purl
Split the subpath on / as segments
UTF-8 encode each segment
Percent-encode each segment
Join the segments with /
Append this to the purl

Parser: it doesn't specify how to deal with an encoded solidus /:

../.. will result in an empty subpath.

true. is this unexpected?

For this example maybe not, but the current description will also parse foo/../bar as foo/bar instead of the expected bar.
When dealing with .. segments I would expect the parser to:

either apply some normalization algorithm like the one described in Remove Dot Segments of RFC 3986.
or throw an exception.

..%2F.. will result in a single ../.. segment, that will result in the same subpath as two .. segments.

let me step this through: given subpath: ..%2F..
1. `..%2F..` | stripped leading/trailing `/`

2. {`..%2F..`} | split on `/`

3. {`..%2F..`} | discarded empty segments

4. {`../..`} | percent-decode segments

5. {`../..`} | discarded `.`/`..` segments

6. {`../..`} | utf8 decoded segments

7. `../..` | joined with a slash(`/`)
   resulting subpath: `../..` -- QED
per spec, after percent-decoding, there must not be a / in any segment

At this point I would expect the parser to throw an exception. I don't see any natural situation, when an application will append ..%2F.. (instead of ../..) to a subpath.
On the other hand encoding / might be used for some supply chain attacks to fool developers into believing they are using a different package that was is actually used

pkg:golang/github.com/gorilla/context@234fd47e07d1004f0aed9c#..%2F..%2Fmy_malicious_gorilla/context

jkowalleck · 2025-04-09T07:13:11Z

@matt-phylum wrote in #379 (comment)

The ../.. vs ..%2F.. problem is why URL implementations do not automatically percent decode the path component. WHATWG URL even specifies that during URL parsing, the parser is required to percent encode characters.

althonos/packageurl.rs, package-url/packageurl-java, phylum-dev/purl, sonatype/package-url-java return an error if the subpath contains %2F because they cannot represent such a subpath.

package-url/packageurl-ruby does not decode %2F in subpaths (or any other encoded characters?).

All other implementations tested convert %2F into /.

Of those implementations, counting only implementations that implement the ./.. segment rules at all, giterlizzi/perl-URI-PackageURL, package-url/packageurl-java, package-url/packageurl-php, package-url/packageurl-swift behave incorrectly for pkg:generic/a#..%2F.., first parsing as having subpath ../.. and then formatting as pkg:generic/a which has no subpath. maennchen/purl, package-url/packageurl-js, package-url/packageurl-python parse as having subpath ../.., which may be incorrect but at least it's consistent.

Note: for my testing, I am using code that treats the subpath as a string rather than a list of strings. I think there is at least one PURL implementation that natively handles subpaths as a list of strings. It's possible that such an implementation would parse ..%2F.. as ["../.."] and format ["../.."] as ..%2F.., which could be correct behavior, but the test harness code would break such an implementation when converting the list into a single string and back, and it's likely this same breakage would occur in software using that implementation because the subpath represents a path and paths are almost universally handled as strings and it's likely if somebody wants to generate a path to a file they will join the segments without checking for embedded path separator characters.

jkowalleck added Ecma specification Work on the core specification PURL subpath component labels Apr 6, 2025

jkowalleck added this to the 1.0 milestone Apr 6, 2025

jkowalleck mentioned this issue Apr 6, 2025

Clarify spec for subpath #379

Open

6 tasks

github-project-automation bot added this to Core PURL spec Apr 6, 2025

jkowalleck mentioned this issue Apr 6, 2025

subpath: parsing stops after segments are clear #449

Open

jkowalleck mentioned this issue Apr 9, 2025

Namespace/subpath: simplify parsing #452

Closed

jkowalleck linked a pull request Apr 9, 2025 that will close this issue

Namespace/subpath: simplify parsing #452

Closed

ppkarwasz mentioned this issue Apr 10, 2025

Subpath: simplify parsing #456

Open

johnmhoran changed the title ~~revistit "subpath"'s "serializer" and "parser" sections~~ revisit "subpath"'s "serializer" and "parser" sections Apr 11, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

revisit "subpath"'s "serializer" and "parser" sections #448

revisit "subpath"'s "serializer" and "parser" sections #448

jkowalleck commented Apr 6, 2025 •

edited

Loading

jkowalleck commented Apr 6, 2025 •

edited

Loading

ppkarwasz commented Apr 7, 2025

jkowalleck commented Apr 9, 2025 •

edited

Loading

revisit "subpath"'s "serializer" and "parser" sections #448

revisit "subpath"'s "serializer" and "parser" sections #448

Comments

jkowalleck commented Apr 6, 2025 • edited Loading

jkowalleck commented Apr 6, 2025 • edited Loading

ppkarwasz commented Apr 7, 2025

jkowalleck commented Apr 9, 2025 • edited Loading

jkowalleck commented Apr 6, 2025 •

edited

Loading

jkowalleck commented Apr 6, 2025 •

edited

Loading

jkowalleck commented Apr 9, 2025 •

edited

Loading