Skip to content

revisit "subpath"'s "serializer" and "parser" sections #448

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jkowalleck opened this issue Apr 6, 2025 · 3 comments
Open

revisit "subpath"'s "serializer" and "parser" sections #448

jkowalleck opened this issue Apr 6, 2025 · 3 comments
Labels
Ecma specification Work on the core specification PURL subpath component
Milestone

Comments

@jkowalleck
Copy link
Member

jkowalleck commented Apr 6, 2025

While the subpath description in the spec is clear, after the PRs in this issue (#379), I still feel that the "serializer" and "parser" sections need some work:

- If the ``subpath`` is not empty and not composed only of empty, '.' and '..'
segments:
- Append '#' to the ``purl``
- Strip the ``subpath`` from leading and trailing '/'
- Split this on '/' as segments
- Discard empty, '.' and '..' segments
- Percent-encode each segment
- UTF-8-encode each segment if needed in your programming language
- Join the segments with '/'
- Append this to the ``purl``

Serializer: if subpath MUST NOT contain empty segments and segments equal to . and .., where are these segments coming from? Ditto for leading and trailing / (which I consider a particular case of empty segments).

- Split the ``purl`` string once from right on '#'
- The left side is the ``remainder``
- Strip the right side from leading and trailing '/'
- Split this on '/'
- Discard any empty string segment from that split
- Percent-decode each segment
- Discard any '.' or '..' segment from that split
- UTF-8-decode each segment if needed in your programming language
- Join segments back with a '/'
- This is the ``subpath``

Parser: it doesn't specify how to deal with an encoded solidus /:

  • ../.. will result in an empty subpath.
  • ..%2F.. will result in a single ../.. segment, that will result in the same subpath as two .. segments.

Originally posted by @ppkarwasz in #379

@jkowalleck jkowalleck added Ecma specification Work on the core specification PURL subpath component labels Apr 6, 2025
@jkowalleck jkowalleck added this to the 1.0 milestone Apr 6, 2025
@jkowalleck
Copy link
Member Author

jkowalleck commented Apr 6, 2025

While the subpath description in the spec is clear, after the PRs in this issue, I still feel that the "serializer" and "parser" sections need some work:

purl-spec/PURL-SPECIFICATION.rst

Lines 334 to 344 in d3beaa3

  • If the subpath is not empty and not composed only of empty, '.' and '..'
    segments:

    • Append '#' to the purl
    • Strip the subpath from leading and trailing '/'
    • Split this on '/' as segments
    • Discard empty, '.' and '..' segments
    • Percent-encode each segment
    • UTF-8-encode each segment if needed in your programming language
    • Join the segments with '/'
    • Append this to the purl

Serializer: if subpath MUST NOT contain empty segments and segments equal to . and .., where are these segments coming from? Ditto for leading and trailing / (which I consider a particular case of empty segments).

Where the "forbidden" path-segments might come from does not matter for the process, does it?

I don't want to go into the details for the sake of this ticket. If needed, we could open an extra discussion based on the following practical example:
I could craft a yarn setup which subpackages and workspaces, where the packages depend on another and uses relative paths like file:.//../packages/other/.
For this example, some unsanitized purl could look like pkg:npm/[email protected]#packages/foo/.//../packages/other/. In this particular case, the implementation that gathered the path-data would resolve these paths, before handing the purl components over to a purl-library, to prevent well-expected information loss.
Anyway, such a path-resolve process is up to the original process that gathered the data, not to the purl-library that implements the spec. THe library simply follows the rules: no empty segments, no ./...

purl-spec/PURL-SPECIFICATION.rst

Lines 358 to 368 in d3beaa3

  • Split the purl string once from right on '#'

    • The left side is the remainder
    • Strip the right side from leading and trailing '/'
    • Split this on '/'
    • Discard any empty string segment from that split
    • Percent-decode each segment
    • Discard any '.' or '..' segment from that split
    • UTF-8-decode each segment if needed in your programming language
    • Join segments back with a '/'
    • This is the subpath

Parser: it doesn't specify how to deal with an encoded solidus /:

  • ../.. will result in an empty subpath.

true. is this unexpected?

  • ..%2F.. will result in a single ../.. segment, that will result in the same subpath as two .. segments.

let me step this through:
given subpath: ..%2F..

  1. ..%2F.. | stripped leading/trailing /
  2. {..%2F..} | split on /
  3. {..%2F..} | discarded empty segments
  4. {../..} | percent-decode segments
  5. {../..} | discarded ./.. segments
  6. {../..} | utf8 decoded segments
  7. ../.. | joined with a slach(/)
    resulting subpath: ../.. -- QED

per spec, after percent-decoding, there must not be a / in any segment

- When percent-decoded, a segment:
- MUST NOT contain a '/'

issues i see:

  • there must not be a / in any segment
  • possible solution A: add an instruction/step in the instructions, right after "Percent-decode each segment":

    If any segment contains a '/', then treat this PURL as invalid.

  • possible solution B: add an instruction/step in the instructions, right after "Percent-decode each segment":

    Discard any segment that contains a '/' from that split

  • possible solution X: add other solutions in the comments.
  • the process joins on / to produce a final result.

@ppkarwasz
Copy link

purl-spec/PURL-SPECIFICATION.rst
Lines 334 to 344 in d3beaa3

  • If the subpath is not empty and not composed only of empty, '.' and '..'
    segments:

    • Append '#' to the purl
    • Strip the subpath from leading and trailing '/'
    • Split this on '/' as segments
    • Discard empty, '.' and '..' segments
    • Percent-encode each segment
    • UTF-8-encode each segment if needed in your programming language
    • Join the segments with '/'
    • Append this to the purl

Serializer: if subpath MUST NOT contain empty segments and segments equal to . and .., where are these segments coming from? Ditto for leading and trailing / (which I consider a particular case of empty segments).

Where the "forbidden" path-segments might come from does not matter for the process, does it?

The serializer starts with a subpath that follows the PURL spec. Unless the Package URL library allows the creation of objects without validating them the subpath string will not have any "forbidden" path-segments. So the serialization algorithm can be simplified:

  • Append # to the purl
  • Split the subpath on / as segments
  • UTF-8 encode each segment
  • Percent-encode each segment
  • Join the segments with /
  • Append this to the purl

Parser: it doesn't specify how to deal with an encoded solidus /:

  • ../.. will result in an empty subpath.

true. is this unexpected?

For this example maybe not, but the current description will also parse foo/../bar as foo/bar instead of the expected bar.
When dealing with .. segments I would expect the parser to:

  • either apply some normalization algorithm like the one described in Remove Dot Segments of RFC 3986.
  • or throw an exception.
  • ..%2F.. will result in a single ../.. segment, that will result in the same subpath as two .. segments.

let me step this through: given subpath: ..%2F..

1. `..%2F..` | stripped leading/trailing `/`

2. {`..%2F..`} | split on `/`

3. {`..%2F..`} | discarded empty segments

4. {`../..`} | percent-decode segments

5. {`../..`} | discarded `.`/`..` segments

6. {`../..`} | utf8 decoded segments

7. `../..` | joined with a slash(`/`)
   resulting subpath: `../..` -- QED

per spec, after percent-decoding, there must not be a / in any segment

At this point I would expect the parser to throw an exception. I don't see any natural situation, when an application will append ..%2F.. (instead of ../..) to a subpath.
On the other hand encoding / might be used for some supply chain attacks to fool developers into believing they are using a different package that was is actually used

pkg:golang/github.com/gorilla/context@234fd47e07d1004f0aed9c#..%2F..%2Fmy_malicious_gorilla/context

@jkowalleck
Copy link
Member Author

jkowalleck commented Apr 9, 2025

@matt-phylum wrote in #379 (comment)

The ../.. vs ..%2F.. problem is why URL implementations do not automatically percent decode the path component. WHATWG URL even specifies that during URL parsing, the parser is required to percent encode characters.

althonos/packageurl.rs, package-url/packageurl-java, phylum-dev/purl, sonatype/package-url-java return an error if the subpath contains %2F because they cannot represent such a subpath.

package-url/packageurl-ruby does not decode %2F in subpaths (or any other encoded characters?).

All other implementations tested convert %2F into /.

Of those implementations, counting only implementations that implement the ./.. segment rules at all, giterlizzi/perl-URI-PackageURL, package-url/packageurl-java, package-url/packageurl-php, package-url/packageurl-swift behave incorrectly for pkg:generic/a#..%2F.., first parsing as having subpath ../.. and then formatting as pkg:generic/a which has no subpath. maennchen/purl, package-url/packageurl-js, package-url/packageurl-python parse as having subpath ../.., which may be incorrect but at least it's consistent.

Note: for my testing, I am using code that treats the subpath as a string rather than a list of strings. I think there is at least one PURL implementation that natively handles subpaths as a list of strings. It's possible that such an implementation would parse ..%2F.. as ["../.."] and format ["../.."] as ..%2F.., which could be correct behavior, but the test harness code would break such an implementation when converting the list into a single string and back, and it's likely this same breakage would occur in software using that implementation because the subpath represents a path and paths are almost universally handled as strings and it's likely if somebody wants to generate a path to a file they will join the segments without checking for embedded path separator characters.

@jkowalleck jkowalleck linked a pull request Apr 9, 2025 that will close this issue
@johnmhoran johnmhoran changed the title revistit "subpath"'s "serializer" and "parser" sections revisit "subpath"'s "serializer" and "parser" sections Apr 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ecma specification Work on the core specification PURL subpath component
Projects
Status: No status
Development

Successfully merging a pull request may close this issue.

2 participants