Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[exporter/elasticsearch] Handle edge cases in metrics grouping causing TSDB version_conflict_engine_exception #39309

Open
wants to merge 28 commits into
base: main
Choose a base branch
from

Conversation

carsonip
Copy link
Contributor

@carsonip carsonip commented Apr 10, 2025

Description

Fix 2 edge cases where metrics grouping were not done correctly.

  1. In ecs mode, metric grouping now considers data point attributes overwriting resource attributes.
  2. In ecs and otel mode, attributes are now sorted before hashing. Metrics with attributes but in different order are now grouped together.

Implementation:

  • sort attributes
  • split up resource, scope and dp hash computation for otel mode to avoid sorting and rehashing

Link to tracking issue

Fixes #38083

Testing

Added unit tests

@carsonip carsonip changed the title [exporter/elasticsearch] Handle edge cases in metrics grouping to avoid TSDB version_conflict_engine_exception [exporter/elasticsearch] Handle edge cases in metrics grouping causing TSDB version_conflict_engine_exception Apr 11, 2025
// e.g. index is already considered during routing and DS attributes do not need to be considered in hashing
func mapHashExcludeReservedAttrs(hasher hash.Hash, m pcommon.Map, extra ...string) {
func mapHashSortedExcludeReservedAttrs(hasher hash.Hash, m pcommon.Map, extra ...string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we consider using common libraries like:

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/pdatautil/hash.go
https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/internal/exp/metrics/identity

for the hash calculations? I think almost all the functions here could be replaced by existing hash options. For the exclusion of DataStream* attributes we could create a custom hash option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not able to use identity directly, since some fields are used but are not TSDB dimensions, e.g. schema url.

pdatautil hash, on the other hand, looks good and will get rid of a lot of the hashing code we have. Still need to look into how to achieve exclusions. I think we'll be better off creating a map and passing it in, instead of writing an Option which means reimplementing all the funcs in pdatautil hash because they are private.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll be better off creating a map and passing it in

which will involve a fair bit of copying

Copy link
Member

@lahsivjar lahsivjar Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not able to use identity directly, since some fields are used but are not TSDB dimensions, e.g. schema url.

I thought so, the identity is a bit rigid.

Still need to look into how to achieve exclusions

We can implement a new HashOption (in elasticsearch-exporter for now) - something like WithMapExclusions(m pcommon.Map, excludedKeys ...string) HashOption

Copy link
Member

@lahsivjar lahsivjar Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of writing an Option which means reimplementing all the funcs in pdatautil hash because they are private

Ah, right!!! everything is private. Maybe we can add a TODO for exporter and create a PR in pdatautil to have this hash option?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO, shouldn't be a blocker for this PR.

Comment on lines 49 to 57
return 0
}

func (h ecsDataPointHasher) hashScope(_ pcommon.InstrumentationScope) uint32 {
return 0
}

func (h ecsDataPointHasher) hashDataPoint(_ datapoints.DataPoint) uint32 {
return 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This feels a bit misleading for the API. As an API user, calling hashScope, I would expect to get the hash of the instrumentation scope - using the hash or not should be controlling which hash to keep.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is my shot at the following requirements

  • ecs hashes differently from otel as data point attributes and resource attributes are propagated to the root level
  • hashing of resource should be done per-resource, scope done per-scope, to avoid rehashing and sorting resource attributes and scope attributes per-data point.
  • (not a strict requirement) avoid leaking ecs vs otel difference to exporter level

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the key to the discussion here is where to put the line between exporter logic and metrics grouping logic. Welcome any suggestions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the root of the issue is that the mapping mode is leaking in the hashing logic. What do you think about making the dataPointGroupKey aware of the mapping mode and calculate the key based on that. So, for example: for ecs mode, it will populate the combined hash key only, whereas for otel mode it will populate resource and scope keys? This will also mean that we will only need DataPointHasher.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, check https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/39309/files#r2039284735 - maybe we can remove the *DataPointHasher completely but this is up to further discussion.

Copy link
Contributor Author

@carsonip carsonip Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about making the dataPointGroupKey aware of the mapping mode and calculate the key based on that. So, for example: for ecs mode, it will populate the combined hash key only, whereas for otel mode it will populate resource and scope keys? This will also mean that we will only need DataPointHasher.

Let's say there exist a function call getGroupKey(mode, resource, scope, dp) dataPointGroupKey, doesn't it mean that it needs to be called per data point? Won't we lose the optimization to avoid sorting and rehashing resource and scope attributes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't we lose the optimization to avoid sorting and rehashing resource and scope attributes?

We can just have an interface to update the dataPointGroupKey, something like UpdateWithScope, UpdateWithDatapoint...

Copy link
Contributor Author

@carsonip carsonip Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great advice. I've moved hash key to the grouping file to avoid leaking the implementation to exporter. cce4471

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@carsonip carsonip force-pushed the sort-attributes-hash branch from 4b1a72c to 6c9cbc4 Compare April 11, 2025 14:27
@carsonip carsonip requested a review from lahsivjar April 11, 2025 16:12
Copy link
Member

@lahsivjar lahsivjar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM, just some minor comments.

Comment on lines +21 to +22
// TODO(carsonip): use opentelemetry-collector-contrib/pkg/pdatautil/hash.go when it can optionally exclude attributes
// We could have used it now but it'll involve creating a new Map and copying things over.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we create an issue/pr to track this?

Comment on lines +18 to +22
type HashKey struct {
resourceHash uint32
scopeHash uint32
dpHash uint32
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My initial idea was to not have {ecs, otel}DataPointHasher at all and build the hash update logic in the HashKey but I think we can discuss that in the follow-up PR when the pdatautil has the method for hashing map with exclude keys.

Comment on lines +58 to +77
func (h *ECSDataPointHasher) HashKey() HashKey {
merged := pcommon.NewMap()
merged.EnsureCapacity(h.resource.Attributes().Len() + h.dp.Attributes().Len())
h.resource.Attributes().CopyTo(merged)
// scope attributes are ignored in ECS mode
for k, v := range h.dp.Attributes().All() {
v.CopyTo(merged.PutEmpty(k))
}

hasher := fnv.New32a()

timestampBuf := make([]byte, 8)
binary.LittleEndian.PutUint64(timestampBuf, uint64(h.dp.Timestamp()))
hasher.Write(timestampBuf)

mapHashSortedExcludeReservedAttrs(hasher, merged)

return HashKey{
dpHash: hasher.Sum32(),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we unify the behaviour of HashKey for both the datapoint hashers? Currently, the ECS one is doing hash calculation in the HashKey method whereas the OTel one does it on Update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[exporter/elasticsearch] Sort and deduplicate attributes for metrics grouping
3 participants