One of the key technical requirements of an Online Video Platform is the ability to process massive amounts of media data in order to deliver accurate quality video to each platform and device. Decoding source video and then encoding it into several output formats (a process called “transcoding”) is not to be taken lightly, especially at large scale. The transcoding process can easily turn into an avalanche that cripples your entire IT infrastructure. This blog post will delve into some details of the current transcoding infrastructure and will provide you with a sneak peek at the near future.
Why is video transcoding not trivial?
- Complex calculations: video transcoding - especially the encoding part - is very CPU intensive
- Big files: video files are much larger than other media types, making transcoding I/O intensive as well
- Various formats: one source video typically needs to be converted into 4-6 different sizes and bitrates; this is preferably done in parallel to guarantee swift delivery on all platforms.
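To make the “various formats” point concrete, here is a minimal sketch of how a rendition ladder might be expressed and turned into ffmpeg command lines. The sizes, bitrates, and output names are illustrative assumptions, not our actual profiles; in production each command would run in parallel on a worker.

```python
# Illustrative rendition ladder; the actual sizes and bitrates are
# per-platform choices, not the real Blue Billywig profiles.
RENDITIONS = [
    {"name": "240p",  "height": 240,  "bitrate": "400k"},
    {"name": "360p",  "height": 360,  "bitrate": "800k"},
    {"name": "480p",  "height": 480,  "bitrate": "1200k"},
    {"name": "720p",  "height": 720,  "bitrate": "2500k"},
    {"name": "1080p", "height": 1080, "bitrate": "5000k"},
]

def ffmpeg_command(source: str, rendition: dict) -> list:
    """Build an ffmpeg command line for one output rendition."""
    return [
        "ffmpeg", "-i", source,
        # scale=-2:<height> keeps the aspect ratio and forces an even width
        "-vf", f"scale=-2:{rendition['height']}",
        "-b:v", rendition["bitrate"],
        "-c:v", "libx264", "-c:a", "aac",
        f"output_{rendition['name']}.mp4",
    ]

def build_all(source: str) -> list:
    """One command per rendition; in production these run in parallel."""
    return [ffmpeg_command(source, r) for r in RENDITIONS]
```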
Some OVP history
Back in the day when we were outlining our first OVP components (early 2007) there weren’t a lot of examples of efficiently running video transcoding workloads. Some competitors were working with (Windows) desktop based solutions that were “hacked” somewhat to integrate with a typical web hosting/content management environment. Other competitors were using very expensive dedicated hardware appliances. Open source products were available (FFmpeg, MEncoder, AviSynth etc.) but just weren’t “one stop shop” enough to run a high volume transcoding farm on their own. They could do a nice enough job of transcoding one or even a small set of input formats into a video file, but lacked capabilities for management, scaling, probing and size calculations, and integration with content delivery networks. We decided to build our own “glue”: an application capable of receiving, transcoding, and delivering video (and audio) files at large scale.
Deliver video the way you need
And thus the Blue Billywig Format Engine was born. It consists of:
- Media Store: a relational database with media information
- A large shared file system
- Queue Manager: An application that distributes transcoding and video processing jobs to available workers
- Numerous transcoding workers: Virtual servers that are solely used for the actual transcoding and processing
- Content delivery tools: delivering source files and end results to several content delivery networks
The Format Engine was designed to provide a stable transcoding queue management solution while being as flexible as possible about the actual transcoding and processing jobs. Transcoding recipes can easily be added. For instance, after designing the transcoding process for 360 video in both bicubic and spherical projection, it took only a few hours to turn it into a recipe and make it available as one of the encoding profiles.
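A recipe system like the one described above can be sketched as a small registry of named step sequences. This is a hypothetical illustration of the idea, not the actual Format Engine API; all names and the string-pipeline steps are assumptions.

```python
# Hypothetical sketch of a recipe registry; names and step functions are
# illustrative, not the real Format Engine implementation.
RECIPES = {}

def register_recipe(name, steps):
    """Register a named sequence of processing steps as an encoding profile."""
    RECIPES[name] = steps

def run_recipe(name, source):
    """Apply each step of a recipe to the source, in order."""
    result = source
    for step in RECIPES[name]:
        result = step(result)
    return result

# Example: a 360 video recipe composed of two steps, modelled here as
# simple string transformations standing in for real processing stages.
register_recipe("360-spherical", [
    lambda src: f"{src}|remap-spherical",
    lambda src: f"{src}|encode-h264",
])
```

The appeal of this shape is that adding a new encoding profile is just one more `register_recipe` call, which matches the “a few hours to turn it into a recipe” experience described above.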
A lot of media in store
Files are ingested in several ways (FTP, OVP web interface etc.). They are stored on a large distributed file system. This filesystem and its underlying hardware infrastructure have undergone a lot of changes over the years. We started out with a simple NFS share running on a virtual server and eventually migrated to a GlusterFS based distributed file system spanning multiple physical servers. We chose GlusterFS for its (relatively) light footprint, its high performance (especially when writing large files) and its reliable base architecture, which uses proven existing file systems like XFS and EXT4 as building blocks. Unlike some other large file system solutions, GlusterFS performance actually gets better when you add more servers. We’ve never used our shared file systems for content delivery: we realised early on that reliably hosting media content was too important a task for us to deploy effectively and cost-efficiently ourselves.
Scalable...but not endlessly so
Maintaining a large distributed file system for video processing still comes uncomfortably close to rocket science, though. We’ve experienced quite a few file storage related infrastructure issues over the years; fortunately most weren’t directly noticeable from the “outside”.
The use of one central Queue Manager, while easily maintainable, also has some drawbacks: it scales well up to around 25, or at best 40-50, simultaneously running workers, but each additional worker adds a little overhead. While still satisfied with the functional aspects of the Format Engine, we foresee continuous growth and want to be able to scale up our transcoding platform before we run into structural congestion.
The lessons learned about the current infrastructure led to the following design considerations for the evolution into our “next level” OVP transcoding and media management facilities:
- No more classic self-maintained shared file systems: use a cloud data store that's simply available in any size you need
- Decentralised transcoding queue management: Let the workers manage the queue themselves instead of having one manager control each individual step
Amazon Simple Storage Service: Using a large Key/value store instead of a randomly accessible file system
We've been a heavy AWS S3 user since 2008. It has never let us down: no noticeable downtime, ever. It also scales extremely well, far beyond the scale we currently dream of for our OVP. It handles all file sizes without much consideration, and it supports security delegation in an elegant way. Up until now we've used AWS S3 mainly as our content hosting and archiving solution. In the new OVP video processing backend, S3 will also become our default content entry point: content can be uploaded to AWS S3 either via the OVP web interface or via third party S3 compatible tools. For legacy content feeds we'll support FTP through a virtual server on AWS EC2 that buffers files and transfers them directly to S3. Ingested files will be "seen" by the AWS Lambda service, which will gather some basic media information and notify the main OVP backend services of their existence.
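The Lambda ingest trigger described above can be sketched as a handler that parses the standard S3 event notification. The event structure (`Records[].s3.bucket.name`, `Records[].s3.object.key`) is the real S3 notification format; the returned field names and the omitted probing/notification logic are assumptions for illustration.

```python
import urllib.parse

def lambda_handler(event, context):
    """Sketch of the ingest trigger: extract bucket, key and size from an
    S3 event notification. The actual media probing and the call that
    notifies the OVP backend are omitted here."""
    results = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 events are URL-encoded (spaces arrive as '+').
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)
        results.append({"bucket": bucket, "key": key, "size": size})
    return results
```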
Self-supporting teams of workers: Decentralised video transcoding and processing
Upon ingest of a new video file the OVP backend services gather and deliver relevant processing information into a central distributed job queue on AWS. This information can also include special tags that enable specific processing/transcoding workloads or ensure the jobs are processed with extended resources.
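A job message with such tags might look like the sketch below. The field names are illustrative assumptions, not our actual message schema; the commented-out `send_message` call shows where a real distributed queue (such as AWS SQS via boto3) would come in.

```python
import json
import uuid

def build_job_message(bucket, key, tags=None):
    """Sketch of a processing job as it might be placed on the central
    job queue. Field names are illustrative, not the real schema."""
    return json.dumps({
        "job_id": str(uuid.uuid4()),
        "source": {"bucket": bucket, "key": key},
        # Tags can enable special workloads or request extended resources.
        "tags": tags or [],
    })

# In production the message would go to a distributed queue, e.g.:
# boto3.client("sqs").send_message(QueueUrl=..., MessageBody=message)
```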
The job queue will be read by "workers": essentially virtual servers running on AWS EC2 that contain a lightweight "Queue Manager" application. These workers are all completely isolated and replaceable. They simply:
- Read all relevant information from the queue
- Fetch the video file and store it in a local buffer
- Perform the actual processing and transcoding tasks
- Deliver the end results back to AWS S3
- Perform post-processing updates: deliver files to third party CDNs etc.
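The steps above can be sketched as a simple worker loop. This is a minimal illustration using an in-memory queue as a stand-in for the real distributed job queue, with stub functions in place of the actual S3 transfers and transcoding.

```python
# Minimal sketch of the worker loop described above. The step functions
# are stubs standing in for the real S3 download, transcode and delivery.
import queue

def fetch_to_buffer(job):
    """Stub: download the source from S3 into the local buffer."""
    return f"/buffer/{job['key']}"

def transcode(local_path):
    """Stub: produce the output renditions for one source file."""
    return [local_path + ".720p.mp4", local_path + ".1080p.mp4"]

def deliver(outputs):
    """Stub: upload results back to S3 / third party CDNs."""
    return len(outputs)

def worker_loop(jobs, max_jobs):
    """Drain up to max_jobs jobs; return the number of delivered files."""
    delivered = 0
    for _ in range(max_jobs):
        if jobs.empty():
            break
        job = jobs.get()                # 1. read job info from the queue
        local = fetch_to_buffer(job)    # 2. fetch source into local buffer
        outputs = transcode(local)      # 3. perform processing/transcoding
        delivered += deliver(outputs)   # 4. deliver the end results
        jobs.task_done()                # 5. post-processing bookkeeping
    return delivered
```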
Workers send out "heartbeat" signals to indicate they're busy and to report an ETA. The main OVP backend services can still act in case of trouble, for instance by disabling a worker and handing its job(s) back to the queue. Because of this decentralised and isolated nature, the new solution will be (almost) endlessly scalable. We're currently experimenting with auto scaling algorithms that automatically adjust the number of available workers based on demand, using AWS spot fleets. This will greatly reduce the time needed for large content imports, as we can add and remove capacity on demand. Workers are also easily replaced by newer (and better) versions of themselves, ensuring you get the latest in video processing technology.
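The heartbeat mechanism can be illustrated with a small registry that tracks when each worker last reported in. The class, field names and timeout value here are assumptions for illustration, not our actual implementation.

```python
import time

HEARTBEAT_TIMEOUT = 60  # seconds; illustrative value, not the real setting

class WorkerRegistry:
    """Sketch of heartbeat tracking: workers report in periodically, and
    the backend can requeue jobs from any worker that goes silent."""

    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, worker_id, now=None):
        """Record that a worker just reported in."""
        self.last_seen[worker_id] = now if now is not None else time.time()

    def stale_workers(self, now=None):
        """Return workers that have been silent longer than the timeout;
        their jobs would be handed back to the queue."""
        now = now if now is not None else time.time()
        return [w for w, t in self.last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]
```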
The future is near
We're currently in the process of finalising and testing the new Transcoding factory. We expect to launch it for production use in at most a few months. I hope to have given you an interesting peek at our near future. If you have any questions or thoughts on this matter don't hesitate to contact us.