Troubleshoot Splunk Distribution of OpenTelemetry Collector
(1) The K8s distribution uses splunk-otel-collector-chart, which has an official troubleshooting guide.
(2) One level up from splunk-otel-collector-chart is splunk-otel-collector, which also has an official troubleshooting guide.
(3) Beyond the guides on GitHub, there is an official Splunk doc for troubleshooting.
There are three official troubleshooting guides, as far as I know. So why am I writing another one?
Motivation
Writing helps facilitate my thinking process so that I can guide prospects/customers through troubleshooting their issues. Furthermore, I found that the official guides seem to assume expert knowledge on the part of the reader following them. Therefore, my intention is to include basic steps in the troubleshooting process.
Kubernetes
Traces
One of the most common scenarios is an instrumented application not showing up in APM. What is the issue?
First, identify the Node name of the application Pod.
kubectl get pod/<application pod name> -o yaml | grep nodeName
Second, identify the OpenTelemetry Collector Agent Pod on the Node.
kubectl get pods --field-selector spec.nodeName=<node name>
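If many Pods are running on that Node, a label selector narrows the list. The labels below are an assumption based on the chart's defaults; confirm yours first with kubectl get pods --show-labels.
# filter by the chart's default labels (assumed) in addition to the node name
kubectl get pods -l app=splunk-otel-collector --field-selector spec.nodeName=<node name> -o wide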
Third, view the logs of the OpenTelemetry Collector Agent Pod. Look out for errors.
kubectl logs <splunk otel collector agent pod name>
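To surface problems quickly, filter the log output; this is plain kubectl plus grep, nothing Splunk-specific. If the Pod has more than one container, add -c <container name>.
# show only warnings and errors
kubectl logs <splunk otel collector agent pod name> | grep -iE 'error|warn'
# if the Pod has restarted, check the previous container's logs too
kubectl logs <splunk otel collector agent pod name> --previous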
Fourth, view the zpages (via exec or port-forwarding) to verify whether traces are received and exported.
# access the container in the pod
kubectl exec -it <splunk otel collector agent pod name> -- curl localhost:55679/debug/tracez | lynx -stdin
# or use port forwarding to view on desktop
kubectl port-forward pod/<splunk otel collector agent pod name> 55679:55679
# after which go to http://localhost:55679/debug/tracez
Fifth, examine the OpenTelemetry Collector config. Look out for misconfiguration.
# access the container in the pod
kubectl exec -it <splunk otel collector agent pod name> -- curl localhost:55554/debug/configz/effective | yq e
# or use port forwarding to view on desktop
kubectl port-forward pod/<splunk otel collector agent pod name> 55554:55554
# after which go to http://localhost:55554/debug/configz/effective
Sixth, send a synthetic trace from a temporary Pod to the OpenTelemetry Collector Agent Pod. Verify that it is accepted and shows up in Splunk O11y APM.
kubectl run tmp --image=nginx:alpine
kubectl exec -it tmp -- curl -OL https://raw.githubusercontent.com/openzipkin/zipkin/master/zipkin-lens/testdata/yelp.json
kubectl get pod/tmp -o yaml | grep nodeName
kubectl get node <nodeName> -o wide
kubectl exec -it tmp -- curl -v -X POST http://<node internal ip>:9411/api/v2/spans -H'Content-Type: application/json' -d @yelp.json
kubectl delete pod/tmp
Finally, send a synthetic trace from the application Pod to the OpenTelemetry Collector Agent Pod.
kubectl exec -it <application pod name> -- curl -OL https://raw.githubusercontent.com/openzipkin/zipkin/master/zipkin-lens/testdata/yelp.json
# Edit the service name in the JSON file so it differs from the synthetic trace sent by the tmp Pod in the previous step.
kubectl get pod/<application name> -o yaml | grep nodeName
kubectl get node <nodeName> -o wide
kubectl exec -it <application pod name> -- curl -v -X POST http://<node internal ip>:9411/api/v2/spans -H'Content-Type: application/json' -d @yelp.json
If all of the above steps work, we know that the OpenTelemetry Collector Agent is working and there is no network policy issue between Pods. Next, the issue is likely related to
- the instrumentation of the application,
- the application not triggering any operation that is auto- or manually instrumented, or
- misconfiguration of the URL endpoint from the instrumented application to the OpenTelemetry Collector Agent Pod.
Further steps for (1), (2), or (3):
(1) To isolate whether it is an issue with the instrumentation of the application, focus on sending spans from the application to Splunk O11y directly (see the first sketch after this list).
(2) To identify whether it is an issue of no transactions, the prospects/customers must trigger actions on the right operation that is automatically or manually instrumented.
(3) To identify whether it is an issue with the validity of the URL endpoint from the instrumented application to the OpenTelemetry Collector Agent Pod, there are two key steps.
- (3a) How is the application Pod communicating with the DaemonSet Pod? Refer to How does Splunk OpenTelemetry Collector (Agent) work as Kubernetes Daemonset. Most importantly, test that the application Pod can reach the DaemonSet Pod using the communication pattern of your choice (see the DaemonSet check after this list).
- (3b) Verify that the environment variables in the Deployment manifest are configured correctly and point to the right endpoint (see the environment-variable check after this list).
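For (1), here is a sketch of sending the same Zipkin-format test trace straight to Splunk O11y ingest, bypassing the Collector entirely. It assumes your realm and an access token with ingest permission; adjust as needed.
# POST yelp.json (from the earlier steps) directly to the Splunk O11y trace ingest endpoint
curl -v -X POST "https://ingest.<realm>.signalfx.com/v2/trace" -H 'Content-Type: application/json' -H 'X-SF-Token: <access token>' -d @yelp.json
# change the service name in yelp.json first so the result is easy to spot in APM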
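For (3a), it helps to confirm how the agent is actually exposed on the Node (hostPort vs hostNetwork) before testing the communication pattern. A rough check, assuming the agent runs as a DaemonSet created by the chart:
# find the agent DaemonSet, then inspect how its ports are exposed on the node
kubectl get daemonset -A | grep -i otel
kubectl get daemonset <splunk otel collector agent daemonset name> -o yaml | grep -E 'hostNetwork|hostPort|containerPort'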
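For (3b), a quick way to see what the application container was actually given, rather than what the Deployment manifest was supposed to set. The variable names below are examples of one common pattern (downward API host IP plus OTEL_EXPORTER_OTLP_ENDPOINT), not an exhaustive list.
# dump the endpoint-related environment variables inside the application container
kubectl exec -it <application pod name> -- env | grep -E 'OTEL|SPLUNK|SIGNALFX'
# e.g. a common pattern injects the node IP via the downward API (status.hostIP)
# and references it as OTEL_EXPORTER_OTLP_ENDPOINT=http://$(SPLUNK_OTEL_AGENT):4317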
Logs
… To be continued…
WIP: Send dummy logs using Go code or a simple script. (Note: the Go snippet below currently emits a test metric over OTLP rather than a log.)
package main

import (
	"context"
	"time"

	otelattribute "go.opentelemetry.io/otel/attribute"
	otlphttpexporter "go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	"go.opentelemetry.io/otel/sdk/metric/export"
	basicmetriccontroller "go.opentelemetry.io/otel/sdk/metric/controller/basic"
	basicmetricprocessor "go.opentelemetry.io/otel/sdk/metric/processor/basic"
	simplemetricselector "go.opentelemetry.io/otel/sdk/metric/selector/simple"
	otelresource "go.opentelemetry.io/otel/sdk/resource"
)

func main() {
	// OTLP over HTTP exporter; targets localhost:4318 by default.
	exporter, _ := otlphttpexporter.New(context.Background())
	start(exporter)
}

func start(otlpExporter export.Exporter) {
	// Wire the exporter into the (pre-1.0 metric SDK) basic processor and controller.
	processorFactory := basicmetricprocessor.NewFactory(
		simplemetricselector.NewWithInexpensiveDistribution(),
		otlpExporter,
	)
	controller := basicmetriccontroller.New(
		processorFactory,
		basicmetriccontroller.WithExporter(otlpExporter),
		basicmetriccontroller.WithResource(otelresource.NewSchemaless(otelattribute.String("R", "V"))),
	)
	_ = controller.Start(context.Background())

	// Emit a counter data point every second so something keeps flowing to the Collector.
	meter := controller.Meter("my-meter")
	syncInt64 := meter.SyncInt64()
	counter, _ := syncInt64.Counter("my-counter")
	for range time.Tick(time.Second) {
		counter.Add(context.Background(), 1)
	}
}
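To point the snippet above at the agent instead of its default localhost endpoint, one option (assuming the exporter version in use honors the standard OTLP environment variables, and that the code is saved as main.go) is:
# the http:// scheme implies an insecure connection to the agent's OTLP HTTP port
export OTEL_EXPORTER_OTLP_ENDPOINT=http://<node internal ip>:4318
go run main.go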
Or a simple shell script.
Create a text file that will be used as our large log message (Fluentd is configured to send a maximum of 2 MB; also note that logger, the tool we will use to send messages, maxes out at around 200 KB, so we can create a smaller file if needed).
base64 /dev/urandom | head -c 1900000 | tr -d '\n' > file.txt
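A minimal sketch of actually sending the payload with logger, assuming util-linux's logger (BusyBox's logger has fewer options); the small.txt file name is just an example:
# generate a smaller payload that stays under logger's ~200 KB ceiling, then send it as one message
base64 /dev/urandom | head -c 100000 | tr -d '\n' > small.txt
logger --size 200000 -t dummy-log -f small.txt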
WIP: Jek to search the FAQ for more info on sending dummy logs.
Metrics
… To be continued…
Disclaimer
My name is Jek. At the time of writing, I am a Sales Engineer specialising in Splunk Observability Cloud. I wrote this to document my learning.
The postings on this site are my own and do not represent the position or opinions of Splunk Inc., or its affiliates.