Troubleshoot Splunk Distribution of OpenTelemetry Collector
(1) The K8s distribution is deployed using splunk-otel-collector-chart, which has an official troubleshooting guide.
(2) One level up from splunk-otel-collector-chart is splunk-otel-collector, which also has an official troubleshooting guide.
(3) Beyond the guides on GitHub, there is an official Splunk doc for troubleshooting.
That makes three official troubleshooting guides, as far as I know. So why am I writing another one?
Motivation
Writing helps me structure my thinking so that I can guide prospects and customers through troubleshooting their issues. Furthermore, I found that the official guides seem to assume expert knowledge of the reader. My intention, therefore, is to include the basic steps in the troubleshooting process.
Kubernetes
Traces
One of the most common scenarios is an instrumented application not showing up in APM. What is the issue?
First, identify the Node name of the application Pod.
kubectl get pod/<application pod name> -o yaml | grep nodeName
Second, identify the OpenTelemetry Collector Agent Pod on the Node.
kubectl get pods --field-selector spec.nodeName=<node name>
Third, view the logs of the OpenTelemetry Collector Agent Pod. Look out for errors.
kubectl logs <splunk otel collector agent pod name>
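The agent's logs can be verbose, so it helps to filter for warning and error entries, and to check the previous container instance if the Pod has restarted. A minimal sketch:
# show only warning/error lines from the agent's logs
kubectl logs <splunk otel collector agent pod name> | grep -i -E "warn|error"
# if the agent container has restarted, also check the previous instance's logs
kubectl logs --previous <splunk otel collector agent pod name>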
Fourth, view the zPages, via exec or port-forwarding, to verify whether traces are being received and exported.
# access the container in the pod
kubectl exec -it <splunk otel collector agent pod name> -- curl localhost:55679/debug/tracez | lynx -stdin
# or use port forwarding to view on desktop
kubectl port-forward pod/<splunk otel collector agent pod name> 55679:55679
# after which go to http://localhost:55679/debug/tracez
Fifth, examine the OpenTelemetry Collector config. Look out for misconfiguration.
# access the container in the pod
kubectl exec -it <splunk otel collector agent pod name> -- curl localhost:55554/debug/configz/effective | yq e
# or use port forwarding to view on desktop
kubectl port-forward pod/<splunk otel collector agent pod name> 55554:55554
# after which go to http://localhost:55554/debug/configz/effective
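A common misconfiguration is a receiver or exporter that is defined but not wired into the traces pipeline. A minimal sketch for pulling just the relevant sections out of the effective config, assuming yq v4 as in the step above:
kubectl exec <splunk otel collector agent pod name> -- curl -s localhost:55554/debug/configz/effective | yq e '.receivers, .exporters, .service.pipelines.traces'
Check that the receiver your application actually sends to (e.g. otlp or zipkin) and a traces exporter (e.g. sapm or otlphttp) both appear under service.pipelines.traces.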
Sixth, send a synthetic trace from a temporary Pod to the OpenTelemetry Collector Daemonset Pod. Verify that it is accepted and shows up in Splunk O11y APM.
kubectl run tmp --image=nginx:alpine
kubectl exec -it tmp -- curl -OL https://raw.githubusercontent.com/openzipkin/zipkin/master/zipkin-lens/testdata/yelp.json
kubectl get pod/tmp -o yaml | grep nodeName
kubectl get node <nodeName> -o wide
kubectl exec -it tmp -- curl -vi -X POST http://<node internal ip>:9411/api/v2/spans -H'Content-Type: application/json' -d @yelp.json
kubectl delete pod/tmp
Or send empty JSON data using []:
kubectl exec -it tmp -- curl -vi -X POST http://<node internal ip>:9411/api/v2/spans -H'Content-Type: application/json' -d '[]'
Or send to the OpenTelemetry Collector via its Service (e.g. a gateway Service):
kubectl run tmp --image=nginx:alpine
kubectl exec -it tmp -- curl -OL https://raw.githubusercontent.com/openzipkin/zipkin/master/zipkin-lens/testdata/yelp.json
kubectl exec -it tmp -c tmp -- sh
# inside the tmp pod shell: check the files, then post a trace payload to the collector Service.
# test2.json is presumably an OTLP/JSON trace payload prepared separately, since /v1/traces is the OTLP HTTP endpoint;
# the downloaded yelp.json is Zipkin v2 format and would instead go to the Zipkin endpoint (port 9411, /api/v2/spans) if the Service exposes it.
ls
curl -vi -X POST "http://<collector service address, e.g. traceid-load-balancing-gateway-splunk-otel-collector.splunk-monitoring.svc>:4318/v1/traces" -H "Content-Type: application/json" -d @test2.json
kubectl delete pod/tmp
Finally, send a synthetic trace from the application Pod to the OpenTelemetry Collector Agent Pod.
kubectl exec -it <application pod name> -- curl -OL https://raw.githubusercontent.com/openzipkin/zipkin/master/zipkin-lens/testdata/yelp.json
# Edit the service name in the JSON file so that it differs from the tmp Pod's synthetic trace above, to make the two easier to tell apart in APM.
kubectl get pod/<application name> -o yaml | grep nodeName
kubectl get node <nodeName> -o wide
kubectl exec -it <application pod name> -- curl -v -X POST http://<node internal ip>:9411/api/v2/spans -H'Content-Type: application/json' -d @yelp.json
You can also send a trace from the application container directly to the Splunk Observability APM ingest endpoint.
curl -OL https://raw.githubusercontent.com/openzipkin/zipkin/master/zipkin-lens/testdata/yelp.json
curl -X POST https://ingest.<...replace with your realm e.g. us1...>.signalfx.com/v2/trace/signalfxv1 -H'Content-Type: application/json' -H'X-SF-Token: <...replace with your access token from splunk O11y...>' -d @yelp.json
If all of the above steps work, we know that the OpenTelemetry Collector Agent is working and there is no network policy issue between Pods. The issue is then likely related to
- (1) the instrumentation of the application,
- (2) the application not triggering any operation that is auto- or manually instrumented, or
- (3) a misconfigured URL endpoint from the instrumented application to the OpenTelemetry Collector Agent Pod.
Further steps for (1), (2), and (3):
(1) To isolate whether it is an issue with the instrumentation of the application, focus on sending spans from the application directly to Splunk O11y.
(2) To identify whether the issue is simply a lack of transactions, the prospect/customer must trigger actions that exercise an operation that is auto- or manually instrumented.
(3) To identify whether the issue is the validity of the URL endpoint from the instrumented application to the OpenTelemetry Collector Agent Pod, there are two key steps.
- (3a) How is the application Pod communicating with the daemonset Pod? Refer to How does Splunk OpenTelemetry Collector (Agent) work as Kubernetes Daemonset. Most importantly, test that the application Pod can reach the daemonset Pod using the communication pattern of your choice.
- (3b) Verify that the environment variables are configured correctly and consistently in the Deployment file; see the sketch after this list.
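As a quick check for (3b), read the environment variables straight off the running Pod and compare them with the Deployment spec and the node IP the agent is listening on. A minimal sketch, assuming the common pattern where the endpoint variable (e.g. OTEL_EXPORTER_OTLP_ENDPOINT) is built from the Downward API field status.hostIP; <application deployment name> is a placeholder:
# the OTel/Splunk-related environment variables the application container actually sees
kubectl exec -it <application pod name> -- env | grep -i -E "otel|splunk|signalfx"
# how the Deployment defines them
kubectl get deployment <application deployment name> -o yaml | grep -B 2 -A 8 "env:"
# the node IP the agent daemonset Pod is bound to, for comparison
kubectl get pod/<application pod name> -o jsonpath='{.status.hostIP}'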
Logs
… To be continued…
WIP: Send dummy logs using Go code or a simple script. (Note: the Go snippet below currently emits a dummy OTLP metric counter; adapt as needed.)
package main

import (
    "context"
    "log"
    "time"

    otelattribute "go.opentelemetry.io/otel/attribute"
    otlphttpexporter "go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
    basicmetriccontroller "go.opentelemetry.io/otel/sdk/metric/controller/basic"
    "go.opentelemetry.io/otel/sdk/metric/export"
    basicmetricprocessor "go.opentelemetry.io/otel/sdk/metric/processor/basic"
    simplemetricselector "go.opentelemetry.io/otel/sdk/metric/selector/simple"
    otelresource "go.opentelemetry.io/otel/sdk/resource"
)

func main() {
    // OTLP/HTTP metric exporter; it targets localhost:4318 by default and can be
    // redirected with OTEL_EXPORTER_OTLP_ENDPOINT (e.g. point it at the collector agent).
    exporter, err := otlphttpexporter.New(context.Background())
    if err != nil {
        log.Fatalf("failed to create OTLP exporter: %v", err)
    }
    start(exporter)
}

func start(otlpExporter export.Exporter) {
    // Basic processor: inexpensive-distribution aggregation, with the exporter
    // supplying the aggregation temporality.
    processorFactory := basicmetricprocessor.NewFactory(
        simplemetricselector.NewWithInexpensiveDistribution(),
        otlpExporter,
    )
    // Push controller: periodically collects and exports, tagging the data with a
    // resource attribute R=V so it is easy to find in the backend.
    controller := basicmetriccontroller.New(
        processorFactory,
        basicmetriccontroller.WithExporter(otlpExporter),
        basicmetriccontroller.WithResource(otelresource.NewSchemaless(otelattribute.String("R", "V"))),
    )
    if err := controller.Start(context.Background()); err != nil {
        log.Fatalf("failed to start metric controller: %v", err)
    }

    // Create a counter and increment it once per second, forever.
    meter := controller.Meter("my-meter")
    counter, err := meter.SyncInt64().Counter("my-counter")
    if err != nil {
        log.Fatalf("failed to create counter: %v", err)
    }
    for range time.Tick(time.Second) {
        counter.Add(context.Background(), 1)
    }
}
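To run the snippet above against the agent, point the OTLP/HTTP exporter at the daemonset's host port. A minimal sketch, assuming the exporter honors the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable, the agent exposes OTLP/HTTP on port 4318, and go.mod pins an OpenTelemetry Go version that still ships the controller/processor metric SDK used above:
OTEL_EXPORTER_OTLP_ENDPOINT=http://<node internal ip>:4318 go run main.go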
Or use a simple sh script.
Create a text file to use as our large log message. (fluentd is configured to send a maximum of 2 MB; also note that logger, the tool we will use to send messages, maxes out at around 200 KB, so create a smaller file if needed.)
base64 /dev/urandom | head -c 1900000 | tr -d '\n' > file.txt
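Then push the file's contents to syslog with logger so that the node-level log pipeline (fluentd or the Collector's log collection) picks it up. A minimal sketch, assuming the util-linux logger (which supports -t, --size and -f); the jek-dummy-log tag is just an arbitrary marker to search for in the backend:
# send the file contents to syslog under an easy-to-search tag;
# --size raises logger's per-message limit, which otherwise defaults to a much smaller value
logger -t jek-dummy-log --size 200000 -f file.txt
If a log event tagged jek-dummy-log shows up in the backend, the node-level log pipeline is working end to end.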
WIP: Jek to search the FAQ for more info on sending dummy logs.
Metrics
… To be continued…
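In the meantime, a quick way to exercise the metrics ingest path, analogous to the event test in the next answer, is to post a dummy datapoint straight to the SignalFx ingest API. A minimal sketch, assuming the us1 realm; jek.test.gauge is just an example metric name:
curl -X POST "https://ingest.us1.signalfx.com/v2/datapoint" \
  -H "Content-Type: application/json" \
  -H "X-SF-Token: <replace this with the ingest access token>" \
  -d '{"gauge": [{"metric": "jek.test.gauge", "value": 1, "dimensions": {"service": "API"}}]}'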
Q: How do we test whether the firewall allows sending to Splunk Observability Cloud?
A:
For K8s.
kubectl run tmp --image=nginx:alpine
kubectl exec -it tmp -- curl -v -X POST "https://ingest.us1.signalfx.com/v2/event" \
-H "Content-Type: application/json" \
-H "X-SF-Token: <replace this with the ingest access token>" \
-d '[
{
"category": "USER_DEFINED",
"eventType": "jek-test-event-1-2-3",
"dimensions": {
"environment": "production",
"service": "API"
},
"properties": {
"sha1": "1234567890abc"
},
"timestamp": 1556793030000
}
]'
kubectl delete pod/tmp
For Linux.
curl -X POST "https://ingest.us1.signalfx.com/v2/event" \
-H "Content-Type: application/json" \
-H "X-SF-Token: <replace this with the ingest access token>" \
-d '[
{
"category": "USER_DEFINED",
"eventType": "jek-test-event-4-5-6",
"dimensions": {
"environment": "production",
"service": "API"
},
"properties": {
"sha1": "1234567890abc"
},
"timestamp": 1556793030000
}
]'
Q: How do we view Pods with conflicting host ports?
A:
In K8s use
kubectl get pods --all-namespaces -o=jsonpath='{.items[*].spec.containers[].ports}' | jq 'map(select(.hostPort != null))'
For better formatting use
kubectl get pods --all-namespaces -o=jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[].name}{"\t"}{.spec.containers[].ports}{"\n"}{end}' | jq -R 'select(. != "\t") | select(length > 1) | split("\t") | {namespace: .[0], podname: .[1], containername: .[2], ports: .[3]} | select(.ports != "")' | jq '.ports |= (fromjson | map(.containerPort |= tonumber))' | jq 'select(.ports | any(.hostPort != null)) | {namespace, podname, containername, ports}'
Q: What about other troubleshooting techniques?
A:
(1) Enable debug logs for the agent and see if it has more details on the error. Pass the -Dotel.javaagent.debug=true JVM arg for that.
(2) Verify that the collector is listening on the 4317 port at all. Do a port-forward to your local machine via e.g. kubectl port-forward pods/<collector_pod_name> 4317:4317 and try telnet localhost 4317 -- does it time out? What does it do if you type something and press enter?
(3) Find out if the application (e.g. jekspringwebapp) can connect to the collector at all -- include telnet in the docker image and do telnet <ip_address> 4317.
(4) Try to debug locally -- do the same port-forward as in (2) and run the instrumented application from your IDE. Does it work?
(5) If the error is "Failed to connect to localhost/127.0.0.1:4317" but the collector is not on localhost (it is on a different pod on the same node), is OTEL_EXPORTER_OTLP_ENDPOINT not picked up, or perhaps overridden by another setting?
(6) Find out if there are any proxies between the agent and the collector.
(7) Check the full agent-side configuration, including its version and other env vars / JVM arguments (such as -Dotel.exporter.otlp.endpoint or OTEL_EXPORTER_OTLP_PROTOCOL).
(8) Check the collector-side configuration, including its version and the receivers section of its config.
(9) Enter the application container (e.g. jekspringwebapp) and capture the outputs of ps faux, jps -lv and env -- stripping any access tokens or other secrets, of course. We need to see the full list of JVM args and env vars, plus the list and hierarchy of processes; see the sketch below.
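For item (9), a minimal sketch of collecting those outputs without an interactive shell, assuming the container image ships ps, jps and env; app-diagnostics.txt is just an arbitrary local filename:
# process tree, JVM processes with their full arguments, and environment variables
kubectl exec <application pod name> -- sh -c 'ps faux; echo "---"; jps -lv; echo "---"; env' > app-diagnostics.txt
# crude redaction pass before sharing the file
grep -i -v -E "token|password|secret" app-diagnostics.txt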
Disclaimer
My name is Jek. At the time of writing, I am a Sales Engineer specialising in Splunk Observability Cloud. I wrote this to document my learning.
The postings on this site are my own and do not represent the position or opinions of Splunk Inc., or its affiliates.