Troubleshoot Splunk Distribution of OpenTelemetry Collector
(1) The K8s distribution is deployed using splunk-otel-collector-chart, which has an official troubleshooting guide.
(2) One level up from splunk-otel-collector-chart is splunk-otel-collector, which also has an official troubleshooting guide.
(3) Beyond the guides on GitHub, there is an official Splunk doc for troubleshooting.
That makes three official troubleshooting guides, as far as I know. So why am I writing another one?
Motivation
Writing helps me structure my thinking so that I can guide prospects and customers through troubleshooting their issues. Furthermore, I found that the official guides seem to assume expert knowledge of the reader. My intention, therefore, is to include the basic steps in the troubleshooting process.
Kubernetes
Traces
One of the most common scenarios is an instrumented application not showing up in APM. What is the issue?
First, identify the Node name of the application Pod.
kubectl get pod/<application pod name> -o yaml | grep nodeName
Second, identify the OpenTelemetry Collector Agent Pod on the Node.
kubectl get pods --field-selector spec.nodeName=<node name>
Third, view the logs of the OpenTelemetry Collector Agent Pod. Look out for errors.
kubectl logs <splunk otel collector agent pod name>
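The agent's logs can be verbose, so it helps to filter for warning and error entries, and to check the previous container instance if the Pod has restarted. A minimal sketch:
# show only warning/error lines from the agent's logs
kubectl logs <splunk otel collector agent pod name> | grep -i -E "warn|error"
# if the agent container has restarted, also check the previous instance's logs
kubectl logs --previous <splunk otel collector agent pod name>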
Fourth, view the zPages, via exec or port-forwarding, to verify whether traces are being received and exported.
# access the container in the pod
kubectl exec -it <splunk otel collector agent pod name> -- curl localhost:55679/debug/tracez | lynx -stdin
# or use port forwarding to view on desktop
kubectl port-forward pod/<splunk otel collector agent pod name> 55679:55679
# after which go to http://localhost:55679/debug/tracez
Fifth, examine the OpenTelemetry Collector config. Look out for misconfiguration.
# access the container in the pod
kubectl exec -it <splunk otel collector agent pod name> -- curl localhost:55554/debug/configz/effective | yq e
# or use port forwarding to view on desktop
kubectl port-forward pod/<splunk otel collector agent pod name> 55554:55554
# after which go to http://localhost:55554/debug/configz/effective
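A common misconfiguration is a receiver or exporter that is defined but not wired into the traces pipeline. A minimal sketch for pulling just the relevant sections out of the effective config, assuming yq v4 as in the step above:
kubectl exec <splunk otel collector agent pod name> -- curl -s localhost:55554/debug/configz/effective | yq e '.receivers, .exporters, .service.pipelines.traces'
Check that the receiver your application actually sends to (e.g. otlp or zipkin) and a traces exporter (e.g. sapm or otlphttp) both appear under service.pipelines.traces.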
Sixth, send a synthetic trace from a temporary Pod to the OpenTelemetry Collector Daemonset Pod. Verify that it is accepted and shows up in Splunk O11y APM.
kubectl run tmp --image=nginx:alpine
kubectl exec -it tmp -- curl -OL https://raw.githubusercontent.com/openzipkin/zipkin/master/zipkin-lens/testdata/yelp.json
kubectl get pod/tmp -o yaml | grep nodeName
kubectl get node <nodeName> -o wide
kubectl exec -it tmp -- curl -vi -X POST http://<node internal ip>:9411/api/v2/spans -H'Content-Type: application/json' -d @yelp.json
kubectl delete pod/tmp
Or send empty JSON data using []:
kubectl exec -it tmp -- curl -vi -X POST http://<node internal ip>:9411/api/v2/spans -H'Content-Type: application/json' -d '[]'
Or send to the OpenTelemetry Collector via its Service (e.g. a gateway Service):
kubectl run tmp --image=nginx:alpine
kubectl exec -it tmp -- curl -OL https://raw.githubusercontent.com/openzipkin/zipkin/master/zipkin-lens/testdata/yelp.json
kubectl exec -it tmp -c tmp -- sh
# inside the tmp pod shell: check the files, then post a trace payload to the collector Service.
# test2.json is presumably an OTLP/JSON trace payload prepared separately, since /v1/traces is the OTLP HTTP endpoint;
# the downloaded yelp.json is Zipkin v2 format and would instead go to the Zipkin endpoint (port 9411, /api/v2/spans) if the Service exposes it.
ls
curl -vi -X POST "http://<collector service address, e.g. traceid-load-balancing-gateway-splunk-otel-collector.splunk-monitoring.svc>:4318/v1/traces" -H "Content-Type: application/json" -d @test2.json
kubectl delete pod/tmp
Finally, send a synthetic trace from the application Pod to the OpenTelemetry Collector Agent Pod.
kubectl exec -it <application pod name> -- curl -OL https://raw.githubusercontent.com/openzipkin/zipkin/master/zipkin-lens/testdata/yelp.json
# Edit the service name in the JSON file so that it differs from the tmp Pod's synthetic trace above, to make the two easier to tell apart in APM.
kubectl get pod/<application name> -o yaml | grep nodeName
kubectl get node <nodeName> -o wide
kubectl exec -it <application pod name> -- curl -v -X POST http://<node internal ip>:9411/api/v2/spans -H'Content-Type: application/json' -d @yelp.json
You can also send a trace from the application container directly to the Splunk Observability APM ingest endpoint.
curl -OL https://raw.githubusercontent.com/openzipkin/zipkin/master/zipkin-lens/testdata/yelp.json
curl -X POST https://ingest.<...replace with your realm e.g. us1...>.signalfx.com/v2/trace/signalfxv1 -H'Content-Type: application/json' -H'X-SF-Token: <...replace with your access token from splunk O11y...>' -d @yelp.json
If all of the above steps work, we know that the OpenTelemetry Collector Agent is working and there is no network policy issue between Pods. The issue is then likely related to
- (1) the instrumentation of the application,
- (2) the application not triggering any operation that is auto- or manually instrumented, or
- (3) a misconfigured URL endpoint from the instrumented application to the OpenTelemetry Collector Agent Pod.
Further steps for (1), (2), and (3):
(1) To isolate whether it is an issue with the instrumentation of the application, focus on sending spans from the application directly to Splunk O11y.
(2) To identify whether the issue is simply a lack of transactions, the prospect/customer must trigger actions that exercise an operation that is auto- or manually instrumented.
(3) To identify whether the issue is the validity of the URL endpoint from the instrumented application to the OpenTelemetry Collector Agent Pod, there are two key steps.
- (3a) How is the application Pod communicating with the daemonset Pod? Refer to How does Splunk OpenTelemetry Collector (Agent) work as Kubernetes Daemonset. Most importantly, test that the application Pod can reach the daemonset Pod using the communication pattern of your choice.
- (3b) Verify that the environment variables are configured correctly and consistently in the Deployment file; see the sketch after this list.
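As a quick check for (3b), read the environment variables straight off the running Pod and compare them with the Deployment spec and the node IP the agent is listening on. A minimal sketch, assuming the common pattern where the endpoint variable (e.g. OTEL_EXPORTER_OTLP_ENDPOINT) is built from the Downward API field status.hostIP; <application deployment name> is a placeholder:
# the OTel/Splunk-related environment variables the application container actually sees
kubectl exec -it <application pod name> -- env | grep -i -E "otel|splunk|signalfx"
# how the Deployment defines them
kubectl get deployment <application deployment name> -o yaml | grep -B 2 -A 8 "env:"
# the node IP the agent daemonset Pod is bound to, for comparison
kubectl get pod/<application pod name> -o jsonpath='{.status.hostIP}'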
Logs
… To be continued…
WIP: Send dummy logs using Go code or a simple script. (Note: the Go snippet below currently emits a dummy OTLP metric counter; adapt as needed.)
package main

import (
    "context"
    "log"
    "time"

    otelattribute "go.opentelemetry.io/otel/attribute"
    otlphttpexporter "go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
    basicmetriccontroller "go.opentelemetry.io/otel/sdk/metric/controller/basic"
    "go.opentelemetry.io/otel/sdk/metric/export"
    basicmetricprocessor "go.opentelemetry.io/otel/sdk/metric/processor/basic"
    simplemetricselector "go.opentelemetry.io/otel/sdk/metric/selector/simple"
    otelresource "go.opentelemetry.io/otel/sdk/resource"
)

func main() {
    // OTLP/HTTP metric exporter; it targets localhost:4318 by default and can be
    // redirected with OTEL_EXPORTER_OTLP_ENDPOINT (e.g. point it at the collector agent).
    exporter, err := otlphttpexporter.New(context.Background())
    if err != nil {
        log.Fatalf("failed to create OTLP exporter: %v", err)
    }
    start(exporter)
}

func start(otlpExporter export.Exporter) {
    // Basic processor: inexpensive-distribution aggregation, with the exporter
    // supplying the aggregation temporality.
    processorFactory := basicmetricprocessor.NewFactory(
        simplemetricselector.NewWithInexpensiveDistribution(),
        otlpExporter,
    )
    // Push controller: periodically collects and exports, tagging the data with a
    // resource attribute R=V so it is easy to find in the backend.
    controller := basicmetriccontroller.New(
        processorFactory,
        basicmetriccontroller.WithExporter(otlpExporter),
        basicmetriccontroller.WithResource(otelresource.NewSchemaless(otelattribute.String("R", "V"))),
    )
    if err := controller.Start(context.Background()); err != nil {
        log.Fatalf("failed to start metric controller: %v", err)
    }

    // Create a counter and increment it once per second, forever.
    meter := controller.Meter("my-meter")
    counter, err := meter.SyncInt64().Counter("my-counter")
    if err != nil {
        log.Fatalf("failed to create counter: %v", err)
    }
    for range time.Tick(time.Second) {
        counter.Add(context.Background(), 1)
    }
}
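To run the snippet above against the agent, point the OTLP/HTTP exporter at the daemonset's host port. A minimal sketch, assuming the exporter honors the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable, the agent exposes OTLP/HTTP on port 4318, and go.mod pins an OpenTelemetry Go version that still ships the controller/processor metric SDK used above:
OTEL_EXPORTER_OTLP_ENDPOINT=http://<node internal ip>:4318 go run main.go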
Or use a simple sh script.
Create a text file to use as our large log message. (fluentd is configured to send a maximum of 2 MB; also note that logger, the tool we will use to send messages, maxes out at around 200 KB, so create a smaller file if needed.)
base64 /dev/urandom | head -c 1900000 | tr -d '\n' > file.txt
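Then push the file's contents to syslog with logger so that the node-level log pipeline (fluentd or the Collector's log collection) picks it up. A minimal sketch, assuming the util-linux logger (which supports -t, --size and -f); the jek-dummy-log tag is just an arbitrary marker to search for in the backend:
# send the file contents to syslog under an easy-to-search tag;
# --size raises logger's per-message limit, which otherwise defaults to a much smaller value
logger -t jek-dummy-log --size 200000 -f file.txt
If a log event tagged jek-dummy-log shows up in the backend, the node-level log pipeline is working end to end.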
WIP: Jek to search the FAQ for more info on sending dummy logs.
Metrics
… To be continued…
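In the meantime, a quick way to exercise the metrics ingest path, analogous to the event test in the next answer, is to post a dummy datapoint straight to the SignalFx ingest API. A minimal sketch, assuming the us1 realm; jek.test.gauge is just an example metric name:
curl -X POST "https://ingest.us1.signalfx.com/v2/datapoint" \
  -H "Content-Type: application/json" \
  -H "X-SF-Token: <replace this with the ingest access token>" \
  -d '{"gauge": [{"metric": "jek.test.gauge", "value": 1, "dimensions": {"service": "API"}}]}'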
Q: How do we test whether the firewall allows sending to Splunk Observability Cloud?
A:
For K8s.
kubectl run tmp --image=nginx:alpine
kubectl exec -it tmp -- curl -v -X POST "https://ingest.us1.signalfx.com/v2/event" \
-H "Content-Type: application/json" \
-H "X-SF-Token: <replace this with the ingest access token>" \
-d '[
{
"category": "USER_DEFINED",
"eventType": "jek-test-event-1-2-3",
"dimensions": {
"environment": "production",
"service": "API"
},
"properties": {
"sha1": "1234567890abc"
},
"timestamp": 1556793030000
}
]'
kubectl delete pod/tmp
For Linux.
curl -X POST "https://ingest.us1.signalfx.com/v2/event" \
-H "Content-Type: application/json" \
-H "X-SF-Token: <replace this with the ingest access token>" \
-d '[
{
"category": "USER_DEFINED",
"eventType": "jek-test-event-4-5-6",
"dimensions": {
"environment": "production",
"service": "API"
},
"properties": {
"sha1": "1234567890abc"
},
"timestamp": 1556793030000
}
]'
Q: How do we view Pods with conflicting host ports?
A:
In K8s use
kubectl get pods --all-namespaces -o=jsonpath='{.items[*].spec.containers[].ports}' | jq 'map(select(.hostPort != null))'
For better formatting use
kubectl get pods --all-namespaces -o=jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[].name}{"\t"}{.spec.containers[].ports}{"\n"}{end}' | jq -R 'select(. != "\t") | select(length > 1) | split("\t") | {namespace: .[0], podname: .[1], containername: .[2], ports: .[3]} | select(.ports != "")' | jq '.ports |= (fromjson | map(.containerPort |= tonumber))' | jq 'select(.ports | any(.hostPort != null)) | {namespace, podname, containername, ports}'
Q: What about other troubleshooting techniques?
A:
(1) Enable debug logs for the agent and see if it has more details on the error. Pass the -Dotel.javaagent.debug=true JVM arg for that.
(2) Verify that the collector is listening on the 4317 port at all. Do a port-forward to your local machine via e.g. kubectl port-forward pods/<collector_pod_name> 4317:4317 and try telnet localhost 4317 -- does it time out? What does it do if you type something and press enter?
(3) Find out if the application (e.g. jekspringwebapp) can connect to the collector at all -- include telnet in the docker image and do telnet <ip_address> 4317.
(4) Try to debug locally -- do the same port-forward as in (2) and run the instrumented application from your IDE. Does it work?
(5) If the error is "Failed to connect to localhost/127.0.0.1:4317" but the collector is not on localhost (it is on a different pod on the same node), is OTEL_EXPORTER_OTLP_ENDPOINT not picked up, or perhaps overridden by another setting?
(6) Find out if there are any proxies between the agent and the collector.
(7) Check the full agent-side configuration, including its version and other env vars / JVM arguments (such as -Dotel.exporter.otlp.endpoint or OTEL_EXPORTER_OTLP_PROTOCOL).
(8) Check the collector-side configuration, including its version and the receivers section of its config.
(9) Enter the application container (e.g. jekspringwebapp) and capture the outputs of ps faux, jps -lv and env -- stripping any access tokens or other secrets, of course. We need to see the full list of JVM args and env vars, plus the list and hierarchy of processes; see the sketch below.
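For item (9), a minimal sketch of collecting those outputs without an interactive shell, assuming the container image ships ps, jps and env; app-diagnostics.txt is just an arbitrary local filename:
# process tree, JVM processes with their full arguments, and environment variables
kubectl exec <application pod name> -- sh -c 'ps faux; echo "---"; jps -lv; echo "---"; env' > app-diagnostics.txt
# crude redaction pass before sharing the file
grep -i -v -E "token|password|secret" app-diagnostics.txt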
Disclaimer
My name is Jek. At the time of writing, I am a Sales Engineer specialising in Splunk Observability Cloud. I wrote this to document my learning.
The postings on this site are my own and do not represent the position or opinions of Splunk Inc., or its affiliates.