otel-collector pause breaks the grpc OTLPSpanExporter permanently. #4529

Open
cospectrum opened this issue Apr 4, 2025 · 2 comments
Labels
bug Something isn't working

Comments

cospectrum commented Apr 4, 2025

Describe your environment

OS: Darwin 22.6.0
Python version: 3.13.2
SDK version: v1.31.1
API version: v1.31.1
opentelemetry-exporter-otlp v1.31.1

What happened?

After an otel-collector restart, the app failed to send traces and never recovered from the poisoned state: all subsequent gRPC export requests were broken.

Steps to Reproduce

  1. Start the otel-collector (with a gRPC receiver):
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.123.0
    ports:
      - 1888:1888 # pprof extension
      - 8888:8888 # Prometheus metrics exposed by the Collector
      - 8889:8889 # Prometheus exporter metrics
      - 13133:13133 # health_check extension
      - 4317:4317 # OTLP gRPC receiver
      - 4318:4318 # OTLP http receiver
      - 55679:55679 # zpages extension
  2. Start your instrumented app to send traces via gRPC. For example:
import os
import time

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.id_generator import RandomIdGenerator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator


def main() -> None:
    otel_collector_grpc_endpoint = "localhost:4317"
    ping_interval = 3

    trace_provider = TracerProvider()
    trace_provider.add_span_processor(
        BatchSpanProcessor(
            OTLPSpanExporter(
                endpoint=os.path.join(otel_collector_grpc_endpoint, "v1/traces"),
                insecure=True,
            )
        )
    )
    trace.set_tracer_provider(trace_provider)

    while True:
        traceparent = random_traceparent()
        ping(traceparent)
        time.sleep(ping_interval)


def ping(traceparent: str) -> None:
    tracer = trace.get_tracer(__name__)

    carrier = {"traceparent": traceparent}
    print(carrier)
    ctx = TraceContextTextMapPropagator().extract(carrier)

    with tracer.start_as_current_span("ping", ctx):
        pass


def random_traceparent() -> str:
    gen = RandomIdGenerator()
    # Format as fixed-width lowercase hex, per the W3C traceparent spec.
    trace_id = f"{gen.generate_trace_id():032x}"
    span_id = f"{gen.generate_span_id():016x}"
    return f"00-{trace_id}-{span_id}-01"


if __name__ == "__main__":
    main()
  3. Pause the otel-collector for longer than ping_interval (e.g. docker compose pause otel-collector).
  4. Resume the otel-collector (e.g. docker compose unpause otel-collector).
  5. Check that the app fails to send new traces (see the flush check sketched below).
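For step 5, a minimal way to confirm the stuck state is to force-flush the provider after the collector is back up (a sketch, assuming the trace_provider from the script above is in scope; force_flush returns False when the batch cannot be exported within the timeout):

# Sketch: in the poisoned state this keeps returning False, because the
# batch processor cannot push spans through the broken gRPC channel.
flushed = trace_provider.force_flush(timeout_millis=5_000)
print("flush ok" if flushed else "flush timed out: exporter still stuck")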

Expected Result

The instrumented app should send traces successfully after the otel-collector restarts.

Actual Result

The app fails to send new traces after the otel-collector is resumed.

Additional context

After the otel-collector restart, the Python app logs messages like the following (for every batch):

Transient error StatusCode.UNAVAILABLE encountered while exporting traces to localhost:4317/v1/traces, retrying in 8s.

The only way to restore the app's tracing is to restart the app itself, which is painful.
Also, the HTTP-mode OTLPSpanExporter seems to work correctly, unlike the gRPC one.
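For reference, a minimal sketch of the HTTP variant (assuming the 4318 port exposed in the collector config above; note that OTLP/HTTP takes the full URL, including the /v1/traces path):

from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Same pipeline as the gRPC script above, but exporting over OTLP/HTTP;
# this exporter appears to recover once the collector is unpaused.
trace_provider = TracerProvider()
trace_provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
    )
)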

Would you like to implement a fix?

None

cospectrum added the bug label Apr 4, 2025
cospectrum changed the title from “otel-collector restart breaks the grpc OTLPSpanExporter permanently.” to “otel-collector pause breaks the grpc OTLPSpanExporter permanently.” Apr 4, 2025

xrmx (Contributor) commented Apr 7, 2025

Dup of #4517?

cospectrum (Author) commented

Looks like it, but I encountered the problem with gRPC traces instead of metrics. Maybe there is a problem with _logs too.
