revisit test-suite #409


Open · wants to merge 1 commit into base: main

Conversation

jkowalleck
Member

@jkowalleck jkowalleck commented Mar 6, 2025

fixes #407

aspects:

contains an array of objects. Each object represents a test with these key/value
pairs, some of which may not be normalized:

- parsing the test canonical ``purl`` then re-building a ``purl`` from these parsed
components should return the test canonical ``purl``
- parsing the test ``purl`` should return the components parsed from the test
canonical ``purl``
- parsing the test ``purl`` then re-building a ``purl`` from these parsed components
should return the test canonical ``purl``
- building a ``purl`` from the test components should return the test canonical ``purl``

fixes package-url#407

Signed-off-by: Jan Kowalleck <[email protected]>
@jkowalleck
Member Author

@dwalluck, @matt-phylum, et al.
please review

@dwalluck

dwalluck commented Mar 6, 2025

@dwalluck, @matt-phylum, et al.
please review

Yes, thank you. I think that this clears things up, and I hope you understand that I had a point with #404, though my assumption was apparently the opposite of what was intended. Feel free to close #404 when this is merged.

Thanks.

@dwalluck

dwalluck commented Mar 6, 2025

@matt-phylum I am going to try to test this with code first. I don't know if you want to as well.

It could be that my parsing algorithm is wrong, or it could be the tests. I got three failures.

Is version supposed to be null in these two or is it the part after the '@'?

 {
    "description": "a scheme is always required",
    "purl": "[email protected]",
    "canonical_purl": "[email protected]",
    "type": null,
    "namespace": null,
    "name": "EnterpriseLibrary.Common",
    "version": null,
    "qualifiers": null,
    "subpath": null,
    "is_invalid": true
  },
  {
    "description": "a name is required",
    "purl": "pkg:maven/@1.3.4",
    "canonical_purl": "pkg:maven/@1.3.4",
    "type": "maven",
    "namespace": null,
    "name": null,
    "version": null,
    "qualifiers": null,
    "subpath": null,
    "is_invalid": true
  },

Also, I am not sure how to parse this one to end up with the components given. Maybe type is meant to be "pkg%3Amaven", not "maven", since this purl is technically missing the scheme.

  {
    "description": "invalid encoded colon : between scheme and type",
    "purl": "pkg%3Amaven/org.apache.commons/io",
    "canonical_purl": null,
    "type": "maven",
    "namespace": "org.apache.commons",
    "name": "io",
    "version": null,
    "qualifiers": null,
    "subpath": null,
    "is_invalid": true
  }

@matt-phylum
Contributor

Those are all invalid so your parsing algorithm is supposed to fail. In the second case the parser should decode the version before failing, but I don't think it affects the behavior of the test. If the PURL is invalid, you can't parse it, and if the components are invalid you can't format it, so there's nothing left to do.
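For illustration, a harness might treat an "is_invalid": true entry roughly like this (a minimal sketch only; parse_purl and build_purl are placeholders for whatever the implementation under test exposes, and the expected error type will differ per library):

def expect_failure(fn, *args, **kwargs):
    # Placeholder check: a real harness would catch the library's specific error type.
    try:
        fn(*args, **kwargs)
    except Exception:
        return True
    return False

def check_invalid_entry(entry, parse_purl, build_purl):
    # Parsing the invalid test purl string should fail...
    assert expect_failure(parse_purl, entry["purl"])
    # ...and building a purl from the invalid components should fail too,
    # so there is nothing further to assert about the component values.
    assert expect_failure(
        build_purl,
        type=entry["type"], namespace=entry["namespace"], name=entry["name"],
        version=entry["version"], qualifiers=entry["qualifiers"], subpath=entry["subpath"],
    )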

@dwalluck

dwalluck commented Mar 7, 2025

In any case, the first two appear to be missing version, no? The parser will fail, but it parses from the right, so it should have the version component (assuming it doesn't check for the scheme first; otherwise name would be null, too).

@matt-phylum
Contributor

Most parsers do not return an invalid PURL so there is no way to assert the version.

@dwalluck

dwalluck commented Mar 7, 2025

One more: yes, it is a test with "is_invalid": true, but the parsing should be consistent. The spec specifies the order of parsing: the entire part after '?' should be the qualifiers, which have no valid keys. The only thing it should parse is the type (a rough sketch after the JSON below illustrates the split order).

 {
    "description": "check for invalid character in type",
    "purl": "pkg:n&g?inx/[email protected]",
    "canonical_purl": null,
    "type": "n&g?inx",
    "namespace": null,
    "name": "nginx",
    "version": "0.8.9",
    "qualifiers": null,
    "subpath": null,
    "is_invalid": true
  }
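A rough sketch of the split order the spec describes for this input (plain string handling for illustration, not any particular implementation):

# "pkg:n&g?inx/[email protected]" split as the spec describes: '#' from the right,
# then '?' from the right, then ':' from the left.
s = "pkg:n&g?inx/[email protected]"

subpath = None
if "#" in s:
    s, _, subpath = s.rpartition("#")     # no '#' here, so no subpath

qualifiers = None
if "?" in s:
    s, _, qualifiers = s.rpartition("?")  # everything after '?' is the qualifiers string

scheme, _, remainder = s.partition(":")

print(qualifiers)  # "inx/[email protected]" -- no key=value pairs with a valid key
print(remainder)   # "n&g" -- all that is left to read a type (and nothing else) from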

@dwalluck

dwalluck commented Mar 7, 2025

This pull request makes the valid purl components consistent with the original purl. Awesome!

However, I have now found inconsistencies in the invalid purl components.

They should be made consistent with the parsing algorithm, or everything should be left null. The test author should not make up what they "think" the components should be without following the parsing algorithm laid out in the specification.

@dwalluck

dwalluck commented Mar 7, 2025

I can be clearer. The test does not test what it thinks it does. It actually tests an invalid qualifier key, not an invalid type. It could test an invalid type if it used a character other than '#' or '?', I think. So, purl->fields is testing the qualifier keys, but fields->purl is testing the type.

@matt-phylum
Contributor

That's not how the tests work. The type is an input to formatting only. It has nothing to do with parsing.

@dwalluck

dwalluck commented Mar 7, 2025

That's not how the tests work. The type is an input to formatting only. It has nothing to do with parsing.

Sorry, what does "input to formatting" mean?

If a purl is valid, then the fields refer to the purl.
If the purl is invalid, then the fields refer to what?

Even though that purl is invalid, the fields there don't refer to that purl URI. I would have thought it would look more like:

 {
    "description": "check for invalid character in qualifier key",
    "purl": "pkg:n&g?inx/[email protected]",
    "canonical_purl": null,
    "type": "n&g",
    "namespace": null,
    "name": null,
    "version": null,
    "qualifiers": {"inx/[email protected]": ""},
    "subpath": null,
    "is_invalid": true
  },

@matt-phylum
Contributor

Sorry, what does "input to formatting" mean?

The values are input to the formatting algorithm described here:

How to build ``purl`` string from its components

If a purl is valid, then the fields refer to the purl.

They do not.

If the PURL is valid, then using those values as input to build a PURL string results in the canonical PURL. Parsing the canonical PURL does not necessarily produce the same values again.
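A hedged illustration of that direction of data flow, with hypothetical parse_purl/build_purl standing in for the implementation under test (the values are made up for the example, not taken from the test file):

def check_build_then_parse(build_purl, parse_purl):
    # Unnormalized input components...
    components = {"type": "GOLANG", "namespace": "google.golang.org", "name": "genproto"}

    # ...build to the canonical PURL (the type is lowercased during normalization)...
    built = build_purl(**components)
    assert built == "pkg:golang/google.golang.org/genproto"

    # ...but parsing that canonical PURL yields the normalized type,
    # not the original input value "GOLANG".
    assert parse_purl(built)["type"] == "golang"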

@matt-phylum
Contributor

There is also a problem with this case:

{
"description": "invalid encoded colon : between scheme and type",
"purl": "pkg%3Amaven/org.apache.commons/io",
"canonical_purl": null,
"type": "maven",
"namespace": "org.apache.commons",
"name": "io",
"version": null,
"qualifiers": null,
"subpath": null,
"is_invalid": true
},

The PURL is invalid, and is_invalid is true, but the type, namespace, name, version, qualifiers, and subpath are all valid. This results in every implementation failing. Should type be null?

@dwalluck

dwalluck commented Mar 7, 2025

They do not.

I meant purl as in the non-canonical purl field in the JSON. This pull request changed them so that they all (the valid ones) match, except for the few invalid cases that I mentioned.

That said, even for invalid purls the spec still defines a consistent parse result, so it would be possible for implementations to get the same results (or not), depending on whether they fail fast or not.

@jkowalleck
Member Author

Hm, I see.

Let's iterate and improve one thing at a time. This one is basically about valid test data.

There is an open ticket to discuss how invalid test cases must be structured. As soon as that discussion comes to a conclusion, corresponding action can follow.

matt-phylum added a commit to phylum-dev/purl that referenced this pull request Mar 11, 2025
@matt-phylum matt-phylum mentioned this pull request Mar 11, 2025
@dwalluck

There is an open ticket to discuss how invalid test cases must be structured. As soon as that discussion comes to a conclusion, corresponding action can follow.

Thanks. I am not seeing any issues with the valid test data, except around when something should be encoded or not. For example, ':' appears at times both unencoded and encoded (encoded in the version, but unencoded in the repository_url key).

As a rule, should the standalone test components always appear decoded, as in these valid examples?

  {
    "purl": "pkg:GOLANG/google.golang.org/genproto@abcdedf#/googleapis/%2E%2E/api/annotations/",
    "subpath": "/googleapis/../api/annotations/",
    "is_invalid": false
  }

or

 {
    "purl": "pkg:docker/customer/dockerimage@sha256%3A244fd47e07d1004f0aed9c?repository_url=gcr.io",
    "canonical_purl": "pkg:docker/customer/dockerimage@sha256%3A244fd47e07d1004f0aed9c?repository_url=gcr.io",
    "version": "sha256:244fd47e07d1004f0aed9c",
    "is_invalid": false
  }

@matt-phylum
Contributor

The standalone components are not encoded. If they contain percent signs, those are literal characters that are supposed to be encoded as %25 when building, not treated as percent encoding and decoded.

"namespace": "google.golang.org",
"name": "genproto",
"version": null,
"qualifiers": null,
"subpath": "googleapis/api/annotations",
"subpath": "/googleapis/api/annotations/",
Member

The subpath is spec'd as normalized by stripping slashes:
"Leading and trailing slashes '/' are not significant and SHOULD be stripped in the canonical form." ... We need to clarify what the relationship between the test fields is first.

Member Author

@jkowalleck jkowalleck Mar 12, 2025

We need to clarify what the relationship between the test fields is first.

Is this not clear from the current spec?

To test ``purl`` parsing and building, a tool can use this test suite and for
every listed test object, run these tests:
- parsing the test canonical ``purl`` then re-building a ``purl`` from these parsed
components should return the test canonical ``purl``
- parsing the test ``purl`` should return the components parsed from the test
canonical ``purl``
- parsing the test ``purl`` then re-building a ``purl`` from these parsed components
should return the test canonical ``purl``
- building a ``purl`` from the test components should return the test canonical ``purl``

It is stated that the test fields are used only as an input for crafting a purl, and the result is expected to be the canonical purl (roughly what the harness sketch below does).

building a purl from the test components should return the test canonical purl
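A minimal harness sketch for a valid entry, assuming hypothetical parse_purl/build_purl functions where parse_purl returns a dict keyed like the test fields:

def run_valid_entry(entry, parse_purl, build_purl):
    canonical = entry["canonical_purl"]
    keys = ("type", "namespace", "name", "version", "qualifiers", "subpath")
    components = {k: entry[k] for k in keys}

    # 1. parse the canonical purl, re-build, and get the canonical purl back
    assert build_purl(**parse_purl(canonical)) == canonical
    # 2. parsing the test purl returns the components parsed from the canonical purl
    assert parse_purl(entry["purl"]) == parse_purl(canonical)
    # 3. parse the test purl, re-build, and get the canonical purl
    assert build_purl(**parse_purl(entry["purl"])) == canonical
    # 4. building from the test components returns the canonical purl
    assert build_purl(**components) == canonical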

Contributor

This field should have the unnormalized form for the tests to work properly.

contains an array of objects. Each object represents a test with these key/value
pairs, some of which may not be normalized:

- parsing the test canonical ``purl`` then re-building a ``purl`` from these parsed
components should return the test canonical ``purl``
- parsing the test ``purl`` should return the components parsed from the test
canonical ``purl``
- parsing the test ``purl`` then re-building a ``purl`` from these parsed components
should return the test canonical ``purl``
- building a ``purl`` from the test components should return the test canonical ``purl``

If these fields contain the normalized form, then the tests do not ensure that normalization happens when building a PURL from components. An implementation that normalizes only during parsing would incorrectly pass the current version of the test, because the PURL is built from pre-normalized components that require no further normalization; with the proposed components, that implementation would correctly fail.

Of course, some implementations of the test suite have this backwards: they parse the canonical PURL and assert that its fields match the components here. Those implementations will not work with the new version of the file, but they also do not ensure that normalization happens when building a PURL from components, because they have no way to obtain the unnormalized components of the purl field to use as inputs.
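To make that concrete, here is a hedged sketch of the failure mode with a deliberately broken builder that relies on normalization having happened at parse time:

def build_purl_without_normalizing(type, namespace, name):
    # Hypothetical buggy builder: no lowercasing, no percent-encoding;
    # it assumes its inputs were already normalized by a parser.
    return f"pkg:{type}/{namespace}/{name}"

canonical = "pkg:golang/google.golang.org/genproto"

# With pre-normalized test components the bug is invisible: the test passes.
assert build_purl_without_normalizing("golang", "google.golang.org", "genproto") == canonical

# With unnormalized components, as proposed here, the built string no longer
# matches the canonical purl, so the test fails as it should.
assert build_purl_without_normalizing("GOLANG", "google.golang.org", "genproto") != canonical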

Member

@pombredanne pombredanne left a comment

@jkowalleck Thanks. The tests should/will follow the updates of the core spec, so merging these should come afterwards to make sure we are not merging test updates that may end up being inconsistent with the streamlined spec! :]

Also, we have case normalization rules already in the spec, and this PR seems to revert all of these tests? I am not sure why we would do this, as it goes against the way things are specified and implemented.

Now the "is_invalid" field in tests is likely not enough as this is mixing things up and this should be clarified. This is not binary choice after all.

There are multiple cases to consider:

  1. Strict validation failing if the input is not a canonical PURL
  2. Flexible parsing where the PURL may not be strictly valid, but can be parsed and normalized to recover its components, including case normalization
  3. Handling of individual components as encoded or decoded
  4. Round-trips when we parse/unparse a PURL

And likely a few more. We must clarify this before updating tests.

I am leaving this open for now until we can revisit when the spec is clear.

@pombredanne
Member

@matt-phylum re: #409 (comment)

You wrote:

The standalone components are not encoded.

Agreed, individual components should be decoded.

If they contain percent signs, those are supposed to be encoded as %25, not decoded like percent encoding.

I am not sure what this means. An example would be helpful.

Also, FWIW, we should be clear on double encoding and avoid this if at all possible.

matt-phylum added a commit to phylum-dev/purl that referenced this pull request Mar 12, 2025
* update test suite

* fix warning

* align test generation with spec

see package-url/purl-spec#409
@matt-phylum
Contributor

matt-phylum commented Mar 12, 2025

I am not sure what this means. An example would be helpful.

Example:

{
  "description": "Maven Central is too permissive",
  "purl": "pkg:maven/net.databinder/dispatch-http%[email protected]",
  "canonical_purl": "pkg:maven/net.databinder/dispatch-http%[email protected]",
  "type": "maven",
  "namespace": "net.databinder",
  "name": "dispatch-http%2Bjson_2.7.3",
  "version": "0.6.0",
  "is_invalid": false
}

The name field contains %2B because that's really the name of the package, likely due to some mistake, since the POM doesn't match: https://repo.maven.apache.org/maven2/net/databinder/dispatch-http%252Bjson_2.7.3/0.6.0/ . The %2B is not to be decoded by the test implementation before using the name to construct a PURL. The name used when building the PURL should be exactly as specified in the JSON.
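A hedged sketch of just that encoding step, using urllib.parse.quote for illustration (a real implementation chooses its own per-component safe set):

from urllib.parse import quote

name = "dispatch-http%2Bjson_2.7.3"  # literal value from the JSON; the '%' is part of the name

# When building, the literal '%' must itself be percent-encoded, not interpreted:
encoded = quote(name, safe="")
assert encoded == "dispatch-http%252Bjson_2.7.3"

purl = f"pkg:maven/net.databinder/{encoded}@0.6.0"
assert purl == "pkg:maven/net.databinder/dispatch-http%[email protected]"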

Another example:

{
  "description": "PURLs are ASCII",
  "purl": "pkg:nuget/史密斯图wpf控件@1.0.3",
  "canonical_purl": "pkg:nuget/%E5%8F%B2%E5%AF%86%E6%96%AF%E5%9B%BEwpf%E6%8E%A7%E4%BB%[email protected]",
  "type": "nuget",
  "name": "\u53f2\u5bc6\u65af\u56fewpf\u63a7\u4ef6",
  "version": "1.0.3",
  "is_invalid": false
}

(https://www.nuget.org/packages/史密斯图wpf控件/1.0.3)

In case there is any confusion, this is referring to the value in the JSON, and any JSON escape sequences do need to be processed when parsing the JSON file to obtain that value. Normally this happens automatically because you're using a proper JSON parser to read the JSON file.

These tests, or tests similar to them, should probably be included in the test suite: there is one PURL implementation that does not escape %, there are two PURL implementations that fail to parse Unicode inputs (probably; it could be a problem with my test harness for one of them), and there is one implementation that generates invalid percent encoding for Unicode characters.

@dwalluck

4. Round-trips when we parse/unparse a PURL

What started me on all this is that the Java code was comparing the fields to the canonical purl. This pull request changes that so that the fields now match the original purl.

Ideally, I had wanted to be able to round-trip both the original purl and the canonical purl, but the current Java code doesn't support this, as it normalizes on input rather than later. If the component name "Foo" is valid, but the normalized component name is "foo", I ideally want the API PackageURL::getName to return "Foo", and PackageURL::normalize::getName to return "foo".

If I have an original (non-normalized) purl, I am interested in checking that the parsing, validation, and normalization are all correct. To make a long comment short: there's only one set of fields right now, so only one of the two purls in each test can really be verified by round-trip, I think.

@matt-phylum
Contributor

This is getting off topic, but PackageURL::getName returning an unnormalized name by default is an API footgun. The user normally wants the normalized name. Having the parse and format routines handle some canonicalization (percent decoding and encoding) but not other normalization is going to surprise somebody when they get a PURL that is not normalized. This is especially true if somebody writes something like new PackageURL(input).getQualifiers().get("checksums") and the PURL has "Checksums" so the qualifier isn't found. I'd recommend having the unnormalized form available only by explicit request.
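A hedged sketch of that footgun, with a hypothetical parse_purl that preserves the original qualifier-key case:

def find_checksums(parse_purl, purl):
    # Hypothetical: if the parser does not lowercase qualifier keys,
    # a case-sensitive lookup silently returns nothing for a PURL with "Checksums=...".
    qualifiers = parse_purl(purl).get("qualifiers") or {}
    return qualifiers.get("checksums")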

AFAIK most PURL implementations do not expose the unnormalized form, and I wouldn't expect this test suite to have special tests for implementations that do expose it (many of these tests are about the normalization process anyway). However, since the purpose of these individual parts is to verify that building a PURL from unnormalized parts results in the canonical PURL the same as parsing the non-canonical PURL, I don't see a reason why most tests wouldn't have the unnormalized (but fully percent decoded) value as it appears in the non-canonical PURL, in which case you can (at least for the most part) use this same set of values to write tests for the unnormalized PURL logic.

To make a long comment short: there's only one set of fields right now, so only one of the two purls in each test can really be verified by round-trip, I think.

There are two sets of values, or three for implementations that don't immediately normalize. You obtain the other values by parsing the PURL strings.

@dwalluck

as parsing the non-canonical PURL, I don't see a reason why most tests wouldn't have the unnormalized (but fully percent decoded) value as it appears in the non-canonical PURL, in which case you can (at least for the most part) use this same set of values to write tests for the unnormalized PURL logic.

I think that, as of this pull request, that is the case: the individual components are in the unnormalized form.

The issue I see with normalization is that a lot of it is type-specific. The tests currently contain a lot of type-specific tests where type-specific normalization is required.

This is "strange" in the sense that for an implementation to produce the normal form, it has to be aware of all types, and what about when new types are added that it doesn't know about?

@matt-phylum
Contributor

This is "strange" in the sense that for an implementation to produce the normal form, it has to be aware of all types, and what about when new types are added that it doesn't know about?

That's #38.

@dwalluck

That's #38.

Thanks, I also noticed that you recently posted #408. How does this differ?

@matt-phylum
Contributor

#408 is more to do with percent encoding. The current spec is always talking about what a canonical PURL looks like, but if you format the components

{
  "type": "generic",
  "namespace": "!\"$%&'()*+,:;=?@[\\]^{|}~",
  "name": "!\"$%&'()*+,:;=?@[\\]^{|}~",
  "version":"!\"$%&'()*+,:;=?@[\\]^{|}~",
  "qualifiers": {
    "q": "!\"$%&'()*+,:;=?@[\\]^{|}~"
  },
  "subpath":"!\"$%&'()*+,:;=?@[\\]^{|}~"
}

you get back 10 different answers (and 1 unexpected error). Getting exactly the right behavior requires a test that all characters encode correctly, and passing that test will likely require PURL implementations to ship their own percent encoding routines. It's much easier to generate a PURL that every implementation can understand, because it encodes at least the minimum set of characters, than to generate exactly the canonical PURL.
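For example (a hedged sketch using urllib.parse; the safe sets are arbitrary choices for illustration, not the spec's):

from urllib.parse import quote, unquote

value = "!\"$%&'()*+,:;=?@[\\]^{|}~"

# Two of many defensible choices of which characters to leave literal in a name segment:
strict = quote(value, safe="")              # encode everything outside the unreserved set
loose = quote(value, safe="!$&'()*+,;=@~")  # leave some sub-delimiters literal

# Both decode back to the same component value...
assert unquote(strict) == value and unquote(loose) == value
# ...but they are different strings, so at most one of them can match a
# byte-for-byte "canonical" PURL segment.
assert strict != loose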

@johnmhoran
Member

@jkowalleck In light of @pombredanne's comments, I'll defer my review as well until we've completed the core-spec updating.

Contributor

@jeremylong jeremylong left a comment

LGTM
