OpenTelemetry is now the standard for telemetry in .NET, and the failure mode is no longer choosing the wrong library. It is instrumenting everything and acting on nothing. A team ships forty panels, ten exporters and zero alerts that anyone trusts, and the on-call still finds out from a customer. Observability is operational work, not visual work, and the bar is whether the next incident gets shorter.
Three signals carry almost all the value. The p95 latency of the request pipeline tells you when the user is suffering. The error rate broken down by route and status class tells you what is breaking. The saturation of the critical resource, almost always the database connection pool or the queue depth, tells you why. Everything else is interesting, not actionable. Start there, get the three right, and add the fourth only when an incident proves you needed it.
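As a rough illustration of what the three signals reduce to, here is a minimal sketch in plain Python rather than any .NET or OpenTelemetry API. The names RequestSample, p95, error_rate_by_route and saturation are hypothetical, and the nearest-rank percentile is one common definition, not necessarily the one your backend uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSample:
    """One finished request (hypothetical shape, for illustration only)."""
    route: str
    status: int
    duration_ms: float


def p95(durations):
    """Nearest-rank 95th percentile of a non-empty list of durations."""
    if not durations:
        raise ValueError("need at least one sample")
    ordered = sorted(durations)
    # Integer form of ceil(0.95 * n), which avoids float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def error_rate_by_route(samples):
    """Map (route, status class) to that class's share of the route's requests.

    Only 4xx and 5xx classes are reported.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for s in samples:
        totals[s.route] += 1
        if s.status >= 400:
            errors[(s.route, f"{s.status // 100}xx")] += 1
    return {key: count / totals[key[0]] for key, count in errors.items()}


def saturation(in_use, capacity):
    """Fraction of the critical resource in use, e.g. pooled DB connections."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return in_use / capacity
```

In practice the backend computes these from histograms and counters. The point of the sketch is that each signal is a single number per route or resource, which is what makes it alertable.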
Correlation is the single feature that turns logs and traces into a debugger. Inject a request ID at the edge, propagate it through HttpClient and the messaging library with the standard W3C Trace Context headers, and write it into every log line through the ASP.NET Core logging scope. When a customer sends a screenshot with a request ID, you should land on the exact trace, the exact SQL, and the exact downstream call in under thirty seconds. If you cannot, the instrumentation has a hole.
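To make the propagation concrete, here is a minimal sketch of the W3C Trace Context traceparent header, written in plain Python so it stays self-contained. It is not the .NET implementation. The helper names (new_traceparent, parse_traceparent, child_traceparent, log_line) are hypothetical, and only version 00 of the header is handled.

```python
import re
import secrets
from dataclasses import dataclass

# Version 00 layout: 00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class TraceContext:
    trace_id: str
    span_id: str
    sampled: bool


def new_traceparent(sampled=True):
    """Start a new trace at the edge."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return a TraceContext, or None when the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid according to the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(incoming):
    """Header for an outgoing call: same trace ID, fresh span ID."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        return new_traceparent()
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{secrets.token_hex(8)}-{flags}"


def log_line(message, header):
    """Prefix a log message with the IDs, as a logging scope would."""
    ctx = parse_traceparent(header)
    if ctx is None:
        return message
    return f"trace_id={ctx.trace_id} span_id={ctx.span_id} {message}"
```

The trace ID is the value that stays constant across every hop, so it is the natural candidate for the request ID a customer quotes back to you.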
Sampling is where cost meets signal. Tail-based sampling that keeps ten percent of normal HTTP traffic and one hundred percent of anything that errors or crosses a latency threshold is the boring rule that survives. Head-based sampling at a fixed rate looks cheaper and is a trap, because the decision is made before the outcome is known, so it drops slow and failed traces at the same rate as healthy ones. The OpenTelemetry Collector handles this cleanly with the tail sampling processor, provided every span of a trace reaches the same Collector instance, and keeping the Collector as a sidecar or DaemonSet keeps exporter changes out of the application.
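The sampling rule can be sketched as a decision function over a finished trace. This is plain Python for illustration, not the Collector's tail sampling processor or its configuration. The CompletedTrace type, the function names and the 1000 ms threshold are assumptions; the ten percent baseline comes from the rule above.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletedTrace:
    """Summary of a trace after all its spans have arrived (hypothetical shape)."""
    trace_id: str
    duration_ms: float
    has_error: bool


def _bucket(trace_id):
    """Map a trace ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def head_keep(trace_id, rate=0.10):
    """Head-based: decided at the start, knowing only the trace ID."""
    return _bucket(trace_id) < rate


def tail_keep(trace, latency_threshold_ms=1000.0, baseline_rate=0.10):
    """Tail-based: decided once the outcome of the trace is known."""
    if trace.has_error:
        return True  # keep every failed trace
    if trace.duration_ms >= latency_threshold_ms:
        return True  # keep every slow trace (threshold is an assumed value)
    # Healthy, fast traffic: keep a deterministic ten percent.
    return _bucket(trace.trace_id) < baseline_rate
```

Hashing the trace ID keeps the baseline decision deterministic, so the same trace gets the same verdict no matter which instance evaluates it. The difference between the two functions is the whole argument: head_keep never sees duration_ms or has_error.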
The exporter is OTLP, period. From there it lands in Azure Monitor through the Azure Monitor Exporter or in Grafana Cloud Tempo and Loki through the Collector. Both work. The choice is a function of where the rest of the platform already lives and who pays the bill. What does not work is exporting to two backends in parallel forever, because the eventual divergence in retention and schema costs more than the migration you were avoiding.
Alerts are the discipline that ties this together. One dashboard per bounded context, not per service. Three or four alerts per dashboard, each with a clear owner, a runbook link and a tested page. If an alert has fired more than twice without action, it is noise and it goes. The goal is a small number of alerts that always mean something, not a wall of yellow that everyone learns to ignore.
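These rules are mechanical enough to check in a script. The following is a minimal sketch in plain Python. The Alert shape, the audit_dashboard function and the example.com runbook URL in the usage note are hypothetical, and the limits (four alerts per dashboard, noise after more than two unactioned firings) are taken from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """One alert rule on a dashboard (hypothetical shape, for illustration)."""
    name: str
    owner: str
    runbook_url: str
    page_tested: bool
    unactioned_firings: int = 0


def audit_dashboard(alerts, max_alerts=4, noise_limit=2):
    """Return a list of findings; an empty list means the dashboard passes."""
    findings = []
    if len(alerts) > max_alerts:
        findings.append(f"dashboard has {len(alerts)} alerts, limit is {max_alerts}")
    for alert in alerts:
        if not alert.owner:
            findings.append(f"{alert.name}: no owner")
        if not alert.runbook_url:
            findings.append(f"{alert.name}: no runbook link")
        if not alert.page_tested:
            findings.append(f"{alert.name}: page never tested")
        # "More than twice without action" means strictly above the limit.
        if alert.unactioned_firings > noise_limit:
            findings.append(
                f"{alert.name}: fired {alert.unactioned_firings} times "
                "without action, remove it"
            )
    return findings
```

An alert such as Alert("p95-latency", "team-orders", "https://example.com/runbooks/p95", True) produces no findings. Run against an exported rule list on a schedule, a check like this turns the pruning rule into a routine rather than a retrospective item.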
Tags
- #dotnet
- #observability
- #azure