SVTA Distributed Tracing: Working Towards Improved Observability and QoS/QoE


As the streaming tech stack has grown in complexity, ensuring a high-quality viewing experience has become increasingly challenging. Part of that is the nature of a distributed architecture, frequently assembled from third-party services and technologies, with ever more moving parts. But part of it is also a lack of well-established standards that meet the specific needs of streaming content providers. The combination of these two issues results in operational “blind spots” when tracing issues through the workflow. Because data is critically important to streaming operations, any blind spot can significantly increase the Mean Time to Diagnose (MTTD) and, more importantly, the Mean Time to Resolve (MTTR). When these operational metrics go up, QoE and viewer satisfaction go down, often resulting in lost revenue and/or increased churn. There is a need, then, for a standardized approach to tracing data across the different vendors in the streaming video technology stack, with an obvious starting point: CDNs.

One Presentation Started A Revolution

In May 2020, at the SVTA Q2 member meeting, Diane Strutner, CEO of Datazoom, made a bold statement: “the streaming industry needs distributed tracing to address huge observability gaps and corresponding impacts on viewer quality of experience.” The challenge, she posited, was that streaming video content providers carry all of the responsibility for providing a great experience without the observability or control to deliver it, because they must rely on third-party providers such as commercial CDNs. Tracing, if embraced broadly by the CDNs, would significantly close those gaps.

The argument was compelling and garnered significant support within the SVTA QoE/Measurement Working Group. Thus the effort to prove out the value of distributed request tracing for streaming media was born. The group took on the project with strong participation from a number of key SVTA members including Amazon Web Services, CBC/Radio-Canada, Datazoom, Fastly, picoNETS, THEO Technologies, and Touchstream. The first phase of the project, focused on VOD delivery with a simple two-tier CDN and a single origin, is near completion. The plan is to publish a detailed whitepaper, described at the end of this blog post, which will share the approach and results with the broader streaming video industry.

Demonstrating The Possibility of A Holistic Approach To Monitoring

When the QoE/Measurement group decided to pursue the project, the goal was simple: demonstrate that distributed tracing through the streaming workflow was possible and valuable for reducing MTTD and MTTR. By rapidly developing a prototype system from available open source and commercial solutions, we would be able to demonstrate how a holistic view of workflow performance and quality could be used to improve QoE on an individual viewer basis and, more specifically, how tracing would enable a rapid root cause analysis and resolution that is not available to streaming operators today. Below is an example dashboard from the tracing project that leverages player-based events, media object requests, and multi-tier CDN log data.

Netflix Has Proven The Benefit Of This Approach

In the streaming industry, Netflix has long been synonymous with quality. Their streaming operations consistently use fine-grained, log-level data to ensure that encoding, delivery, and playback are all performing at the highest level possible. Because they own the full tech stack, from players to core services to CDN, they are able to apply a deeply integrated, common data model (an internal standard, so to speak) not only to view individual sessions but also to gain end-to-end, correlated observability and understand issues holistically. If the rest of the streaming media industry wants to reach the same level of observability and quality while using third-party services, broadly adopted standards must be brought to bear. Netflix frequently publishes technical blogs about their approach to QoE, encoding, and other workflow challenges. You can read more about how they approach distributed tracing, which provides a high-level blueprint for an industry-wide effort, here.

From Prototype to Standard: Moving the SVTA Work to CTA-WAVE

During the SVTA Q3 member meeting in 2022, the group showed off the completed prototype, clearly demonstrating through a demo page and test environment that it was possible to trace an object through the entire workflow and isolate root causes for latency or failures. But the SVTA is not a standards definition organization. We knew that we needed to move this into a forum that supported the complex and time-intensive process involved with developing standards. Given our history with CTA-WAVE (they adopted our QoE metrics definitions document as the basis for their standardization work) and a current joint project around CDN Tokens, we felt that CTA-WAVE would be the perfect fit. The project was adopted by WAVE on March 31, 2022.

This joint project will work on a standardized approach to tracing all aspects of streaming media delivery. This includes four inter-connected standards initiatives driven by a single Streaming Media Tracing working group.

  • Request Tracing. The request tracing specification seeks to address observability gaps in the delivery of real-time streaming media experiences, over HTTP, across content providers and third-party services. Request tracing spans the spectrum of HTTP requests made by client applications and by other services on the client’s behalf. The specification takes into consideration modern streaming media delivery protocols and conventions in addition to complex request flows such as HTTP redirects (for example, 302 responses) and CDN request collapsing. To this end, in contrast to other tracing specifications and conventions, this specification will organize information based on the set of requests required to fulfill a single intent (e.g. retrieval of a single media object from one or more CDNs).
  • Content Creation and Propagation. The Content Creation and Propagation specification will seek to address observability gaps for the online and offline creation processes for both live and VOD scenarios. This includes ingest from live sources or mezzanine source files followed by various workflow stages such as encoding, transcoding to multiple bitrates, dynamic ad stitching (SSAI), encryption, etc. This specification is meant to work in tandem with the Request Tracing specification to provide enhanced observability both for the requesting of media objects and related data from media applications and for the creation, propagation, and staging of media objects in order to fulfill those requests, either offline or in real time.
  • Trace Telemetry Export. While the two specifications above will seek to propagate data in-band, leveraging HTTP headers or query strings, there are scenarios where trace data must be delivered from workflow sources to centralized storage and analysis services. For instance, if a media object request is made by a video player and that request times out, only out-of-band logging to centralized services will support proper root cause analysis. The Trace Telemetry Export specification will seek to close this gap by defining the information useful to the identification of impediments.
  • Cross-cutting Support for Object Transformation Tracing. In order to enable the tracing of media object defects (rather than impediments to the delivery of those objects), the three specifications listed above will seek to include the data elements necessary to determine where, when, and how each object is modified during creation, propagation, or delivery.
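
To make the request tracing idea more concrete, here is a minimal Python sketch of in-band propagation and intent-based grouping. It is an illustration under stated assumptions, not the WAVE specification: the header names `X-Trace-Intent-Id` and `X-Trace-Hop`, the `HopRecord` fields, and the simple linear chain of tiers (player, edge, mid-tier, origin) are all hypothetical. Each tier forwards the same intent ID upstream, the collected records stand in for telemetry exported out-of-band, and a small helper estimates which hop added the most time of its own.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical header names, used only for this illustration.
INTENT_HEADER = "X-Trace-Intent-Id"
HOP_HEADER = "X-Trace-Hop"


@dataclass
class HopRecord:
    intent_id: str      # shared by every request made to fulfill one intent
    hop: int            # 0 = player, 1 = edge, 2 = mid-tier, ...
    component: str
    status: int
    duration_ms: float  # time this hop waited, including everything upstream


def new_intent_headers() -> dict:
    """Headers a player might attach when it first requests a media object."""
    return {INTENT_HEADER: uuid.uuid4().hex, HOP_HEADER: "0"}


def forward_headers(incoming: dict) -> dict:
    """Headers a tier might send upstream: same intent, next hop number."""
    return {
        INTENT_HEADER: incoming[INTENT_HEADER],
        HOP_HEADER: str(int(incoming[HOP_HEADER]) + 1),
    }


def record_hop(headers: dict, component: str, status: int,
               duration_ms: float) -> HopRecord:
    """Build the record a tier would export out-of-band to a collector."""
    return HopRecord(headers[INTENT_HEADER], int(headers[HOP_HEADER]),
                     component, status, duration_ms)


def group_by_intent(records: list) -> dict:
    """Group exported records by intent ID and order each group by hop."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec.intent_id].append(rec)
    for hops in grouped.values():
        hops.sort(key=lambda r: r.hop)
    return dict(grouped)


def slowest_hop(hops: list) -> tuple:
    """Return (component, self_ms) for the hop that added the most time.

    A hop's own time is its duration minus the duration of the next hop
    upstream, which assumes a single linear chain of requests.
    """
    if not hops:
        raise ValueError("no hops recorded for this intent")
    ordered = sorted(hops, key=lambda r: r.hop)
    best, best_self = None, float("-inf")
    for i, rec in enumerate(ordered):
        upstream = ordered[i + 1].duration_ms if i + 1 < len(ordered) else 0.0
        self_ms = rec.duration_ms - upstream
        if self_ms > best_self:
            best, best_self = rec, self_ms
    return best.component, best_self


if __name__ == "__main__":
    collector = []  # stands in for a centralized telemetry store
    headers = new_intent_headers()
    for component, duration in [("player", 480.0), ("edge", 450.0),
                                ("mid-tier", 430.0), ("origin", 60.0)]:
        collector.append(record_hop(headers, component, 200, duration))
        headers = forward_headers(headers)
    for intent_id, hops in group_by_intent(collector).items():
        print(slowest_hop(hops))  # ('mid-tier', 370.0)
```

A real deployment would also have to cover the cases the specification calls out, such as redirects and request collapsing, where one intent fans out into several request chains; the single hop counter used here would not be enough for that.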

Why A Standard Is The End-Goal

Although the project within the SVTA’s QoE/Measurement Working Group could serve as a good blueprint for CDNs to follow, it’s not a plan of action in and of itself. The planned whitepaper will give any CDN excellent information on how a holistic, industry-wide approach to distributed tracing could improve streaming operations (and even make their own operations easier, especially when collaborating with their customers on diagnostics or issue resolution). But it doesn’t necessarily provide a solution. That’s where a standard comes in. By building on the foundation of the SVTA’s work, the CTA-WAVE joint project can formalize the “how”: an industry-wide recommendation for the way tracing data should be formed, articulated, and ultimately made available. The standard will provide a clear way for CDNs to implement distributed tracing, not just see how it might be accomplished. We hope, of course, that the standard can evolve over time to include other streaming video workflow components upstream from the CDN.

Coming Soon: A Whitepaper Explaining The Technology Approach

Even while the CTA-WAVE/SVTA joint project on Distributed Tracing continues to move forward, the QoE/Measurement group will work on developing a whitepaper which more fully explains the technical underpinnings of the project. Once it is published, the industry will be able to better understand the details behind our approach to Distributed Tracing. We hope to have this whitepaper published sometime in Q1 2023.

If you are a current member and interested in participating, you can reach out to the SVTA and ask to be included in the CTA-WAVE project. If you are not yet a member, we invite you to explore membership and join our continuing work on this technical solution.

The Measurement/QoE Working Group addresses technical challenges related to data and measurement within the streaming video workflow.

  • Jason Thibeault
