-
Notifications
You must be signed in to change notification settings - Fork 176
revisit test-suite #409
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
revisit test-suite #409
Conversation
fixes package-url#407 Signed-off-by: Jan Kowalleck <[email protected]>
@dwalluck, @matt-phylum, et al. |
Yes, thank you. I think that this clears things up, and I hope that you understand that I had a point with #404, but my assumption was the opposite of what was intended, apparently). Feel free to close #404 when this is merged. Thanks. |
@matt-phylum I am going to try to test this with code first. I don't know if you want to as well. It could be that my parsing algorithm is wrong, but it could be the tests. I got three failures. Is {
"description": "a scheme is always required",
"purl": "[email protected]",
"canonical_purl": "[email protected]",
"type": null,
"namespace": null,
"name": "EnterpriseLibrary.Common",
"version": null,
"qualifiers": null,
"subpath": null,
"is_invalid": true
},
{
"description": "a name is required",
"purl": "pkg:maven/@1.3.4",
"canonical_purl": "pkg:maven/@1.3.4",
"type": "maven",
"namespace": null,
"name": null,
"version": null,
"qualifiers": null,
"subpath": null,
"is_invalid": true
}, Also, I am not sure how to parse this one to end up with components given. Maybe {
"description": "invalid encoded colon : between scheme and type",
"purl": "pkg%3Amaven/org.apache.commons/io",
"canonical_purl": null,
"type": "maven",
"namespace": "org.apache.commons",
"name": "io",
"version": null,
"qualifiers": null,
"subpath": null,
"is_invalid": true
} |
Those are all invalid so your parsing algorithm is supposed to fail. In the second case the parser should decode the version before failing, but I don't think it affects the behavior of the test. If the PURL is invalid, you can't parse it, and if the components are invalid you can't format it, so there's nothing left to do. |
In any case, the first two appear to be missing |
Most parsers do not return an invalid PURL so there is no way to assert the version. |
One more, yes, it is a test with {
"description": "check for invalid character in type",
"purl": "pkg:n&g?inx/[email protected]",
"canonical_purl": null,
"type": "n&g?inx",
"namespace": null,
"name": "nginx",
"version": "0.8.9",
"qualifiers": null,
"subpath": null,
"is_invalid": true
} |
This pull request makes the valid purl components consistent with the original purl. Awesome! However, I have now found inconsistencies in the invalid purl components. It should be made consistent with the parsing algorithm, or leave everything |
I can be more clear. The test does not test what it thinks it does. It actually tests an invalid qualifier key, not an invalid type. It could test an invalid type if it uses a character other than |
That's not how the tests work. The type is an input to formatting only. It has nothing to do with parsing. |
Sorry, what does "input to formatting" mean? If a purl is valid, then the fields refer to the Even though that purl is invalid, the fields there don't refer to that {
"description": "check for invalid character in qualifier key",
"purl": "pkg:n&g?inx/[email protected]",
"canonical_purl": null,
"type": "n&g",
"namespace": null,
"name": null,
"version": null,
"qualifiers": {"inx/[email protected]": ""},
"subpath": null,
"is_invalid": true
}, |
The values are input to the formatting algorithm described here: purl-spec/PURL-SPECIFICATION.rst Line 251 in 7f7e82f
They do not. If the PURL is valid, then using those values as input to build a PURL string results in the canonical PURL. Parsing the canonical PURL does not necessarily produce the same values again. |
There is also a problem with this case: purl-spec/test-suite-data.json Lines 662 to 673 in 7f7e82f
The PURL is invalid, and |
I meant This is for invalid purls, the spec still defines a consistent parse result, so it would be possible for the implementations to get the same results (or not) depending on if they fail-fast or not. |
Hm, i see. Let's iterate and improve one thing at a time. This one is basically about valid test data. There is an open ticket to discuss how invalid test cases must be structured. As soon as that discussion can to a conclusion, according action can follow up. |
Thanks. I am not seeing any issues with the valid test data except when something should be encoded or not. For example, As a rule, should the standalone test components always appear decoded, as in these valid examples? {
"purl": "pkg:GOLANG/google.golang.org/genproto@abcdedf#/googleapis/%2E%2E/api/annotations/",
"subpath": "/googleapis/../api/annotations/",
"is_invalid": false
} or {
"purl": "pkg:docker/customer/dockerimage@sha256%3A244fd47e07d1004f0aed9c?repository_url=gcr.io",
"canonical_purl": "pkg:docker/customer/dockerimage@sha256%3A244fd47e07d1004f0aed9c?repository_url=gcr.io",
"version": "sha256:244fd47e07d1004f0aed9c",
"is_invalid": false
} |
The standalone components are not encoded. If they contain percent signs, those are supposed to be encoded as |
"namespace": "google.golang.org", | ||
"name": "genproto", | ||
"version": null, | ||
"qualifiers": null, | ||
"subpath": "googleapis/api/annotations", | ||
"subpath": "/googleapis/api/annotations/", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The subpath is speced as normalized by stripping slashes:
Leading and trailing slashes '/' are not significant and SHOULD be stripped in the canonical form
.... we need to clarify what is the relationship between each test fields first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to clarify what is the relationship between each test fields first.
is this not clear from current spec?
purl-spec/PURL-SPECIFICATION.rst
Lines 473 to 485 in 65d98c4
To test ``purl`` parsing and building, a tool can use this test suite and for | |
every listed test object, run these tests: | |
- parsing the test canonical ``purl`` then re-building a ``purl`` from these parsed | |
components should return the test canonical ``purl`` | |
- parsing the test ``purl`` should return the components parsed from the test | |
canonical ``purl`` | |
- parsing the test ``purl`` then re-building a ``purl`` from these parsed components | |
should return the test canonical ``purl`` | |
- building a ``purl`` from the test components should return the test canonical ``purl`` |
it is stated, that test fields are used only as an input for crafting a purl. and the result is expected to be the canonical purl.
building a
purl
from the test components should return the test canonicalpurl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This field should have the unnormalized form for the tests to work properly.
purl-spec/PURL-SPECIFICATION.rst
Lines 455 to 456 in 7f7e82f
contains an array of objects. Each object represents a test with these key/value | |
pairs some of which may not be normalized: |
purl-spec/PURL-SPECIFICATION.rst
Lines 473 to 482 in 7f7e82f
- parsing the test canonical ``purl`` then re-building a ``purl`` from these parsed | |
components should return the test canonical ``purl`` | |
- parsing the test ``purl`` should return the components parsed from the test | |
canonical ``purl`` | |
- parsing the test ``purl`` then re-building a ``purl`` from these parsed components | |
should return the test canonical ``purl`` | |
- building a ``purl`` from the test components should return the test canonical ``purl`` |
If these fields contain the normalized form, then the tests do not ensure that normalization happens when building a PURL from components. An implementation that normalizes only during parsing would incorrectly pass the current version of the test because the PURL is built from the pre-normalized components and those components do not require further normalization, but that implementation would correctly fail with the proposed components.
Of course some implementations of the test suite have this backwards and parse the canonical PURL and assert that its fields match the components here, and those implementations will not work with the new version of the file, but those implementations of the test suite do not ensure that normalization happens when building a PURL from components because they have no way to obtain the unnormalized components of the purl
field to use as inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkowalleck Thanks. The tests should/will follow the updates of the core spec, so merging these should come afterwards to make sure we are not merging test updates that may end up being inconsistent with the streamlined spec! :]
Also we have case normalization rules already in the spec, and this PR seems to revert all of these tests? I am not sure why we would do this as this goes against the way things are specified and implemented.
Now the "is_invalid" field in tests is likely not enough as this is mixing things up and this should be clarified. This is not binary choice after all.
There are multiple cases to consider:
- strict validation failing if not a canonical PURL
- flexible parsing where the PURL may not be strictly valid, but can be parsed and normalized to recover its components, including case normalization.
- handling of individual components as encoded or decoded.
- Round-trips when we parse/unparse a PURL
And likely a few more. We must clarify this before updating tests.
I am leaving this open for now until we can revisit when the spec is clear.
@matt-phylum re: #409 (comment) You wrote:
Agreed, individual components should be decoded.
I am not sure what this means. An example would be helpful. Also, FWIW, we should be clear on double encoding and avoid this if at all possible. |
* update test suite * fix warning * align test generation with spec see package-url/purl-spec#409
Example: {
"description": "Maven Central is too permissive",
"purl": "pkg:maven/net.databinder/dispatch-http%[email protected]",
"canonical_purl": "pkg:maven/net.databinder/dispatch-http%[email protected]",
"type": "maven",
"namespace": "net.databinder",
"name": "dispatch-http%2Bjson_2.7.3",
"version": "0.6.0",
"is_invalid": false
} The name field contains Another example: {
"description": "PURLs are ASCII",
"purl": "pkg:nuget/史密斯图wpf控件@1.0.3",
"canonical_purl": "pkg:nuget/%E5%8F%B2%E5%AF%86%E6%96%AF%E5%9B%BEwpf%E6%8E%A7%E4%BB%[email protected]",
"type": "nuget",
"name": "\u53f2\u5bc6\u65af\u56fewpf\u63a7\u4ef6",
"version": "1.0.3",
"is_invalid": false
} (https://www.nuget.org/packages/史密斯图wpf控件/1.0.3) In case there is any confusion, this is referring to the value in the JSON, and any JSON escape sequences do need to be processed when parsing the JSON file to obtain that value. Normally this happens automatically because you're using a proper JSON parser to read the JSON file. These tests, or tests similar to them, should probably be included in the test suite because there is one PURL implementation that do not escape % and there are two PURL implementations that fail to parse Unicode inputs (probably. it could be a problem with my test harness for one of them?) and there is one implementation that generates invalid percent encoding for Unicode characters. |
What started me on all this is that the Java code was comparing the fields to the canonical purl. This pull request changes that so that the fields to match the original purl now. Ideally, I had wanted to be able to round-trip both the original purl and the canonical purl, but the current Java code doesn't support this, as it is normalizing on input as opposed to later. If the component name "FOO" is valid, but the normalized component name is "foo", I ideally want the API If I have an original (non-normalized) purl, I am interested in checking that the parsing, validation, and normalization are all correct. To make a long comment short: there's only one set of fields right now, so only one of the two purls in each test can really be verified by round-trip, I think. |
This is getting off topic, but AFAIK most PURL implementations do not expose the unnormalized form, and I wouldn't expect this test suite to have special tests for implementations that do expose it (many of these tests are about the normalization process anyway). However, since the purpose of these individual parts is to verify that building a PURL from unnormalized parts results in the canonical PURL the same as parsing the non-canonical PURL, I don't see a reason why most tests wouldn't have the unnormalized (but fully percent decoded) value as it appears in the non-canonical PURL, in which case you can (at least for the most part) use this same set of values to write tests for the unnormalized PURL logic.
There are two sets of values, or three for implementations that don't immediately normalize. You obtain the other values by parsing the PURL strings. |
I think as of this pull request, that is the case, that the individual components are the unnormalized form. The issue I see with normalization is that a lot of it is type-specific. The tests currently contain a lot of type-specific tests where type-specific normalization is required. This is "strange" in the sense that for an implementation to produce the normal form, it has to be aware of all types, and what about when new types are added that it doesn't know about? |
That's #38. |
#408 is more to do with percent encoding. The current spec is always talking about what a canonical PURL looks like, but if you format the components {
"type": "generic",
"namespace": "!\"$%&'()*+,:;=?@[\\]^{|}~",
"name": "!\"$%&'()*+,:;=?@[\\]^{|}~",
"version":"!\"$%&'()*+,:;=?@[\\]^{|}~",
"qualifiers": {
"q": "!\"$%&'()*+,:;=?@[\\]^{|}~"
},
"subpath":"!\"$%&'()*+,:;=?@[\\]^{|}~"
} you get back 10 different answers (and 1 unexpected error). To get exactly the right behavior requires a test that all characters encode correctly and passing that test will likely require PURL implementations to come with their own percent encoding routines. It's much easier to generate a PURL which can be understood by all implementations because it encodes at least the minimum set of characters than to generate exactly the canonical PURL. |
@jkowalleck In light of @pombredanne 's comments, I'll defer my review as well until we've completed the core-spec updating. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
fixes #407
aspects:
purl-spec/PURL-SPECIFICATION.rst
Lines 455 to 456 in c30f2f6
purl-spec/PURL-SPECIFICATION.rst
Lines 473 to 482 in c30f2f6