revisit test-suite #409


Open · wants to merge 1 commit into base: main

Conversation

jkowalleck
Member

@jkowalleck jkowalleck commented Mar 6, 2025

fixes #407

aspects:

contains an array of objects. Each object represents a test with these key/value
pairs, some of which may not be normalized:

- parsing the test canonical ``purl`` then re-building a ``purl`` from these parsed
components should return the test canonical ``purl``
- parsing the test ``purl`` should return the components parsed from the test
canonical ``purl``
- parsing the test ``purl`` then re-building a ``purl`` from these parsed components
should return the test canonical ``purl``
- building a ``purl`` from the test components should return the test canonical ``purl``

fixes package-url#407

Signed-off-by: Jan Kowalleck <[email protected]>
@jkowalleck
Member Author

@dwalluck, @matt-phylum, et al.
please review

@dwalluck

dwalluck commented Mar 6, 2025

@dwalluck, @matt-phylum, et al.
please review

Yes, thank you. I think that this clears things up, and I hope you understand that I had a point with #404, though my assumption was apparently the opposite of what was intended. Feel free to close #404 when this is merged.

Thanks.

@dwalluck

dwalluck commented Mar 6, 2025

@matt-phylum I am going to try to test this with code first. I don't know if you want to as well.

It could be that my parsing algorithm is wrong, or it could be the tests. I got three failures.

Is version supposed to be null in these two or is it the part after the '@'?

 {
    "description": "a scheme is always required",
    "purl": "[email protected]",
    "canonical_purl": "[email protected]",
    "type": null,
    "namespace": null,
    "name": "EnterpriseLibrary.Common",
    "version": null,
    "qualifiers": null,
    "subpath": null,
    "is_invalid": true
  },
  {
    "description": "a name is required",
    "purl": "pkg:maven/@1.3.4",
    "canonical_purl": "pkg:maven/@1.3.4",
    "type": "maven",
    "namespace": null,
    "name": null,
    "version": null,
    "qualifiers": null,
    "subpath": null,
    "is_invalid": true
  },

Also, I am not sure how to parse this one to end up with the components given. Maybe type is meant to be "pkg%3Amaven", not "maven", since this purl is technically missing the scheme.

  {
    "description": "invalid encoded colon : between scheme and type",
    "purl": "pkg%3Amaven/org.apache.commons/io",
    "canonical_purl": null,
    "type": "maven",
    "namespace": "org.apache.commons",
    "name": "io",
    "version": null,
    "qualifiers": null,
    "subpath": null,
    "is_invalid": true
  }

@matt-phylum
Contributor

Those are all invalid so your parsing algorithm is supposed to fail. In the second case the parser should decode the version before failing, but I don't think it affects the behavior of the test. If the PURL is invalid, you can't parse it, and if the components are invalid you can't format it, so there's nothing left to do.
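For illustration, a harness might treat an "is_invalid": true entry roughly like this (a minimal sketch only; parse_purl and build_purl are placeholders for whatever the implementation under test exposes, and the expected error type will differ per library):

def expect_failure(fn, *args, **kwargs):
    # Placeholder check: a real harness would catch the library's specific error type.
    try:
        fn(*args, **kwargs)
    except Exception:
        return True
    return False

def check_invalid_entry(entry, parse_purl, build_purl):
    # Parsing the invalid test purl string should fail...
    assert expect_failure(parse_purl, entry["purl"])
    # ...and building a purl from the invalid components should fail too,
    # so there is nothing further to assert about the component values.
    assert expect_failure(
        build_purl,
        type=entry["type"], namespace=entry["namespace"], name=entry["name"],
        version=entry["version"], qualifiers=entry["qualifiers"], subpath=entry["subpath"],
    )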

@dwalluck

dwalluck commented Mar 7, 2025

In any case, the first two appear to be missing version, no? The parser will fail, but it parses from the right, so it should have the version component (assuming it doesn't check for the scheme first; otherwise name would be null, too).

@matt-phylum
Contributor

Most parsers do not return an invalid PURL so there is no way to assert the version.

@dwalluck

dwalluck commented Mar 7, 2025

One more: yes, it is a test with "is_invalid": true, but the parsing should be consistent. The spec specifies the order of parsing: the entire part after '?' should be the qualifiers, which have no valid keys. The only thing it should parse is the type (a rough sketch after the JSON below illustrates the split order).

 {
    "description": "check for invalid character in type",
    "purl": "pkg:n&g?inx/[email protected]",
    "canonical_purl": null,
    "type": "n&g?inx",
    "namespace": null,
    "name": "nginx",
    "version": "0.8.9",
    "qualifiers": null,
    "subpath": null,
    "is_invalid": true
  }
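A rough sketch of the split order the spec describes for this input (plain string handling for illustration, not any particular implementation):

# "pkg:n&g?inx/[email protected]" split as the spec describes: '#' from the right,
# then '?' from the right, then ':' from the left.
s = "pkg:n&g?inx/[email protected]"

subpath = None
if "#" in s:
    s, _, subpath = s.rpartition("#")     # no '#' here, so no subpath

qualifiers = None
if "?" in s:
    s, _, qualifiers = s.rpartition("?")  # everything after '?' is the qualifiers string

scheme, _, remainder = s.partition(":")

print(qualifiers)  # "inx/[email protected]" -- no key=value pairs with a valid key
print(remainder)   # "n&g" -- all that is left to read a type (and nothing else) from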

@dwalluck

dwalluck commented Mar 7, 2025

This pull request makes the valid purl components consistent with the original purl. Awesome!

However, I have now found inconsistencies in the invalid purl components.

They should be made consistent with the parsing algorithm, or everything should be left null. The test author should not make up what they "think" the components should be without following the parsing algorithm laid out in the specification.

@dwalluck

dwalluck commented Mar 7, 2025

I can be clearer. The test does not test what it thinks it does. It actually tests an invalid qualifier key, not an invalid type. It could test an invalid type if it used a character other than '#' or '?', I think. So, purl->fields is testing the qualifier keys, but fields->purl is testing the type.

@matt-phylum
Contributor

That's not how the tests work. The type is an input to formatting only. It has nothing to do with parsing.

@dwalluck

dwalluck commented Mar 7, 2025

That's not how the tests work. The type is an input to formatting only. It has nothing to do with parsing.

Sorry, what does "input to formatting" mean?

If a purl is valid, then the fields refer to the purl.
If the purl is invalid, then the fields refer to what?

Even though that purl is invalid, the fields there don't refer to that purl URI. I would have thought it would look more like:

 {
    "description": "check for invalid character in qualifier key",
    "purl": "pkg:n&g?inx/[email protected]",
    "canonical_purl": null,
    "type": "n&g",
    "namespace": null,
    "name": null,
    "version": null,
    "qualifiers": {"inx/[email protected]": ""},
    "subpath": null,
    "is_invalid": true
  },

@matt-phylum
Contributor

Sorry, what does "input to formatting" mean?

The values are input to the formatting algorithm described here:

How to build ``purl`` string from its components

If a purl is valid, then the fields refer to the purl.

They do not.

If the PURL is valid, then using those values as input to build a PURL string results in the canonical PURL. Parsing the canonical PURL does not necessarily produce the same values again.
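A hedged illustration of that direction of data flow, with hypothetical parse_purl/build_purl standing in for the implementation under test (the values are made up for the example, not taken from the test file):

def check_build_then_parse(build_purl, parse_purl):
    # Unnormalized input components...
    components = {"type": "GOLANG", "namespace": "google.golang.org", "name": "genproto"}

    # ...build to the canonical PURL (the type is lowercased during normalization)...
    built = build_purl(**components)
    assert built == "pkg:golang/google.golang.org/genproto"

    # ...but parsing that canonical PURL yields the normalized type,
    # not the original input value "GOLANG".
    assert parse_purl(built)["type"] == "golang"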

@matt-phylum
Contributor

There is also a problem with this case:

{
"description": "invalid encoded colon : between scheme and type",
"purl": "pkg%3Amaven/org.apache.commons/io",
"canonical_purl": null,
"type": "maven",
"namespace": "org.apache.commons",
"name": "io",
"version": null,
"qualifiers": null,
"subpath": null,
"is_invalid": true
},

The PURL is invalid, and is_invalid is true, but the type, namespace, name, version, qualifiers, and subpath are all valid. This results in every implementation failing. Should type be null?

@dwalluck

dwalluck commented Mar 7, 2025

They do not.

I meant purl as in the non-canonical purl field in the JSON. This pull request changed them so that they all (the valid ones) match, except for the few invalid cases that I mentioned.

That said, even for invalid purls the spec still defines a consistent parse result, so it would be possible for implementations to get the same results (or not), depending on whether they fail fast or not.

@jkowalleck
Member Author

Hm, I see.

Let's iterate and improve one thing at a time. This one is basically about valid test data.

There is an open ticket to discuss how invalid test cases must be structured. As soon as that discussion comes to a conclusion, corresponding action can follow.

matt-phylum added a commit to phylum-dev/purl that referenced this pull request Mar 11, 2025
@matt-phylum matt-phylum mentioned this pull request Mar 11, 2025
@dwalluck

There is an open ticket to discuss how invalid test cases must be structured. As soon as that discussion comes to a conclusion, corresponding action can follow.

Thanks. I am not seeing any issues with the valid test data, except around when something should be encoded or not. For example, ':' appears at times both unencoded and encoded (encoded in the version, but unencoded in the repository_url key).

As a rule, should the standalone test components always appear decoded, as in these valid examples?

  {
    "purl": "pkg:GOLANG/google.golang.org/genproto@abcdedf#/googleapis/%2E%2E/api/annotations/",
    "subpath": "/googleapis/../api/annotations/",
    "is_invalid": false
  }

or

 {
    "purl": "pkg:docker/customer/dockerimage@sha256%3A244fd47e07d1004f0aed9c?repository_url=gcr.io",
    "canonical_purl": "pkg:docker/customer/dockerimage@sha256%3A244fd47e07d1004f0aed9c?repository_url=gcr.io",
    "version": "sha256:244fd47e07d1004f0aed9c",
    "is_invalid": false
  }

@matt-phylum
Contributor

The standalone components are not encoded. If they contain percent signs, those are literal characters that are supposed to be encoded as %25 when building, not treated as percent encoding and decoded.

"namespace": "google.golang.org",
"name": "genproto",
"version": null,
"qualifiers": null,
"subpath": "googleapis/api/annotations",
"subpath": "/googleapis/api/annotations/",
Member

The subpath is spec'd as normalized by stripping slashes:
"Leading and trailing slashes '/' are not significant and SHOULD be stripped in the canonical form." ... We need to clarify what the relationship between the test fields is first.

Member Author

@jkowalleck jkowalleck Mar 12, 2025

We need to clarify what the relationship between the test fields is first.

Is this not clear from the current spec?

To test ``purl`` parsing and building, a tool can use this test suite and for
every listed test object, run these tests:
- parsing the test canonical ``purl`` then re-building a ``purl`` from these parsed
components should return the test canonical ``purl``
- parsing the test ``purl`` should return the components parsed from the test
canonical ``purl``
- parsing the test ``purl`` then re-building a ``purl`` from these parsed components
should return the test canonical ``purl``
- building a ``purl`` from the test components should return the test canonical ``purl``

It is stated that the test fields are used only as an input for crafting a purl, and the result is expected to be the canonical purl (roughly what the harness sketch below does).

building a purl from the test components should return the test canonical purl
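A minimal harness sketch for a valid entry, assuming hypothetical parse_purl/build_purl functions where parse_purl returns a dict keyed like the test fields:

def run_valid_entry(entry, parse_purl, build_purl):
    canonical = entry["canonical_purl"]
    keys = ("type", "namespace", "name", "version", "qualifiers", "subpath")
    components = {k: entry[k] for k in keys}

    # 1. parse the canonical purl, re-build, and get the canonical purl back
    assert build_purl(**parse_purl(canonical)) == canonical
    # 2. parsing the test purl returns the components parsed from the canonical purl
    assert parse_purl(entry["purl"]) == parse_purl(canonical)
    # 3. parse the test purl, re-build, and get the canonical purl
    assert build_purl(**parse_purl(entry["purl"])) == canonical
    # 4. building from the test components returns the canonical purl
    assert build_purl(**components) == canonical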

Contributor

This field should have the unnormalized form for the tests to work properly.

contains an array of objects. Each object represents a test with these key/value
pairs, some of which may not be normalized:

- parsing the test canonical ``purl`` then re-building a ``purl`` from these parsed
components should return the test canonical ``purl``
- parsing the test ``purl`` should return the components parsed from the test
canonical ``purl``
- parsing the test ``purl`` then re-building a ``purl`` from these parsed components
should return the test canonical ``purl``
- building a ``purl`` from the test components should return the test canonical ``purl``

If these fields contain the normalized form, then the tests do not ensure that normalization happens when building a PURL from components. An implementation that normalizes only during parsing would incorrectly pass the current version of the test, because the PURL is built from pre-normalized components that require no further normalization; with the proposed components, that implementation would correctly fail.

Of course, some implementations of the test suite have this backwards: they parse the canonical PURL and assert that its fields match the components here. Those implementations will not work with the new version of the file, but they also do not ensure that normalization happens when building a PURL from components, because they have no way to obtain the unnormalized components of the purl field to use as inputs.
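To make that concrete, here is a hedged sketch of the failure mode with a deliberately broken builder that relies on normalization having happened at parse time:

def build_purl_without_normalizing(type, namespace, name):
    # Hypothetical buggy builder: no lowercasing, no percent-encoding;
    # it assumes its inputs were already normalized by a parser.
    return f"pkg:{type}/{namespace}/{name}"

canonical = "pkg:golang/google.golang.org/genproto"

# With pre-normalized test components the bug is invisible: the test passes.
assert build_purl_without_normalizing("golang", "google.golang.org", "genproto") == canonical

# With unnormalized components, as proposed here, the built string no longer
# matches the canonical purl, so the test fails as it should.
assert build_purl_without_normalizing("GOLANG", "google.golang.org", "genproto") != canonical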

Member

@pombredanne pombredanne left a comment

@jkowalleck Thanks. The tests should/will follow the updates of the core spec, so merging these should come afterwards to make sure we are not merging test updates that may end up being inconsistent with the streamlined spec! :]

Also, we have case normalization rules already in the spec, and this PR seems to revert all of these tests? I am not sure why we would do this, as it goes against the way things are specified and implemented.

Now the "is_invalid" field in tests is likely not enough as this is mixing things up and this should be clarified. This is not binary choice after all.

There are multiple cases to consider:

  1. Strict validation failing if the input is not a canonical PURL
  2. Flexible parsing where the PURL may not be strictly valid, but can be parsed and normalized to recover its components, including case normalization
  3. Handling of individual components as encoded or decoded
  4. Round-trips when we parse/unparse a PURL

And likely a few more. We must clarify this before updating tests.

I am leaving this open for now until we can revisit when the spec is clear.

@pombredanne
Member

@matt-phylum re: #409 (comment)

You wrote:

The standalone components are not encoded.

Agreed, individual components should be decoded.

If they contain percent signs, those are supposed to be encoded as %25, not decoded like percent encoding.

I am not sure what this means. An example would be helpful.

Also, FWIW, we should be clear on double encoding and avoid this if at all possible.

matt-phylum added a commit to phylum-dev/purl that referenced this pull request Mar 12, 2025
* update test suite

* fix warning

* align test generation with spec

see package-url/purl-spec#409
@matt-phylum
Contributor

matt-phylum commented Mar 12, 2025

I am not sure what this means. An example would be helpful.

Example:

{
  "description": "Maven Central is too permissive",
  "purl": "pkg:maven/net.databinder/dispatch-http%[email protected]",
  "canonical_purl": "pkg:maven/net.databinder/dispatch-http%[email protected]",
  "type": "maven",
  "namespace": "net.databinder",
  "name": "dispatch-http%2Bjson_2.7.3",
  "version": "0.6.0",
  "is_invalid": false
}

The name field contains %2B because that's really the name of the package, likely due to some mistake, since the POM doesn't match: https://repo.maven.apache.org/maven2/net/databinder/dispatch-http%252Bjson_2.7.3/0.6.0/ . The %2B is not to be decoded by the test implementation before using the name to construct a PURL. The name used when building the PURL should be exactly as specified in the JSON.
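A hedged sketch of just that encoding step, using urllib.parse.quote for illustration (a real implementation chooses its own per-component safe set):

from urllib.parse import quote

name = "dispatch-http%2Bjson_2.7.3"  # literal value from the JSON; the '%' is part of the name

# When building, the literal '%' must itself be percent-encoded, not interpreted:
encoded = quote(name, safe="")
assert encoded == "dispatch-http%252Bjson_2.7.3"

purl = f"pkg:maven/net.databinder/{encoded}@0.6.0"
assert purl == "pkg:maven/net.databinder/dispatch-http%[email protected]"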

Another example:

{
  "description": "PURLs are ASCII",
  "purl": "pkg:nuget/史密斯图wpf控件@1.0.3",
  "canonical_purl": "pkg:nuget/%E5%8F%B2%E5%AF%86%E6%96%AF%E5%9B%BEwpf%E6%8E%A7%E4%BB%[email protected]",
  "type": "nuget",
  "name": "\u53f2\u5bc6\u65af\u56fewpf\u63a7\u4ef6",
  "version": "1.0.3",
  "is_invalid": false
}

(https://www.nuget.org/packages/史密斯图wpf控件/1.0.3)

In case there is any confusion, this is referring to the value in the JSON, and any JSON escape sequences do need to be processed when parsing the JSON file to obtain that value. Normally this happens automatically because you're using a proper JSON parser to read the JSON file.

These tests, or tests similar to them, should probably be included in the test suite: there is one PURL implementation that does not escape %, there are two PURL implementations that fail to parse Unicode inputs (probably; it could be a problem with my test harness for one of them), and there is one implementation that generates invalid percent encoding for Unicode characters.

@dwalluck

4. Round-trips when we parse/unparse a PURL

What started me on all this is that the Java code was comparing the fields to the canonical purl. This pull request changes that so that the fields now match the original purl.

Ideally, I had wanted to be able to round-trip both the original purl and the canonical purl, but the current Java code doesn't support this, as it normalizes on input rather than later. If the component name "Foo" is valid, but the normalized component name is "foo", I ideally want the API PackageURL::getName to return "Foo", and PackageURL::normalize::getName to return "foo".

If I have an original (non-normalized) purl, I am interested in checking that the parsing, validation, and normalization are all correct. To make a long comment short: there's only one set of fields right now, so only one of the two purls in each test can really be verified by round-trip, I think.

@matt-phylum
Contributor

This is getting off topic, but PackageURL::getName returning an unnormalized name by default is an API footgun. The user normally wants the normalized name. Having the parse and format routines handle some canonicalization (percent decoding and encoding) but not other normalization is going to surprise somebody when they get a PURL that is not normalized. This is especially true if somebody writes something like new PackageURL(input).getQualifiers().get("checksums") and the PURL has "Checksums" so the qualifier isn't found. I'd recommend having the unnormalized form available only by explicit request.
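A hedged sketch of that footgun, with a hypothetical parse_purl that preserves the original qualifier-key case:

def find_checksums(parse_purl, purl):
    # Hypothetical: if the parser does not lowercase qualifier keys,
    # a case-sensitive lookup silently returns nothing for a PURL with "Checksums=...".
    qualifiers = parse_purl(purl).get("qualifiers") or {}
    return qualifiers.get("checksums")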

AFAIK most PURL implementations do not expose the unnormalized form, and I wouldn't expect this test suite to have special tests for implementations that do expose it (many of these tests are about the normalization process anyway). However, since the purpose of these individual parts is to verify that building a PURL from unnormalized parts results in the canonical PURL the same as parsing the non-canonical PURL, I don't see a reason why most tests wouldn't have the unnormalized (but fully percent decoded) value as it appears in the non-canonical PURL, in which case you can (at least for the most part) use this same set of values to write tests for the unnormalized PURL logic.

To make a long comment short: there's only one set of fields right now, so only one of the two purls in each test can really be verified by round-trip, I think.

There are two sets of values, or three for implementations that don't immediately normalize. You obtain the other values by parsing the PURL strings.

@dwalluck

as parsing the non-canonical PURL, I don't see a reason why most tests wouldn't have the unnormalized (but fully percent decoded) value as it appears in the non-canonical PURL, in which case you can (at least for the most part) use this same set of values to write tests for the unnormalized PURL logic.

I think that, as of this pull request, that is the case: the individual components are in the unnormalized form.

The issue I see with normalization is that a lot of it is type-specific. The tests currently contain a lot of type-specific tests where type-specific normalization is required.

This is "strange" in the sense that for an implementation to produce the normal form, it has to be aware of all types, and what about when new types are added that it doesn't know about?

@matt-phylum
Contributor

This is "strange" in the sense that for an implementation to produce the normal form, it has to be aware of all types, and what about when new types are added that it doesn't know about?

That's #38.

@dwalluck

That's #38.

Thanks, I also noticed that you recently posted #408. How does this differ?

@matt-phylum
Contributor

#408 is more to do with percent encoding. The current spec is always talking about what a canonical PURL looks like, but if you format the components

{
  "type": "generic",
  "namespace": "!\"$%&'()*+,:;=?@[\\]^{|}~",
  "name": "!\"$%&'()*+,:;=?@[\\]^{|}~",
  "version":"!\"$%&'()*+,:;=?@[\\]^{|}~",
  "qualifiers": {
    "q": "!\"$%&'()*+,:;=?@[\\]^{|}~"
  },
  "subpath":"!\"$%&'()*+,:;=?@[\\]^{|}~"
}

you get back 10 different answers (and 1 unexpected error). Getting exactly the right behavior requires a test that all characters encode correctly, and passing that test will likely require PURL implementations to ship their own percent encoding routines. It's much easier to generate a PURL that every implementation can understand, because it encodes at least the minimum set of characters, than to generate exactly the canonical PURL.
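For example (a hedged sketch using urllib.parse; the safe sets are arbitrary choices for illustration, not the spec's):

from urllib.parse import quote, unquote

value = "!\"$%&'()*+,:;=?@[\\]^{|}~"

# Two of many defensible choices of which characters to leave literal in a name segment:
strict = quote(value, safe="")              # encode everything outside the unreserved set
loose = quote(value, safe="!$&'()*+,;=@~")  # leave some sub-delimiters literal

# Both decode back to the same component value...
assert unquote(strict) == value and unquote(loose) == value
# ...but they are different strings, so at most one of them can match a
# byte-for-byte "canonical" PURL segment.
assert strict != loose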

@johnmhoran
Member

@jkowalleck In light of @pombredanne's comments, I'll defer my review as well until we've completed the core-spec updating.

Contributor

@jeremylong jeremylong left a comment

LGTM
