Adventures with gamma transfer functions and video colorspace conversions
October 18, 2020
Recently I wrote a video player from scratch for Subtitle Composer, relying only on FFmpeg for decoding, OpenGL for rendering video and OpenAL for playing audio.
After using several external video players and various libraries over the years, I got fed up with annoying issues: code filling up with nasty workarounds for simple things like seeking, timestamps and volume control, and time wasted debugging player/file issues that shouldn't have been there in the first place.
The idea is simple:
- FFmpeg loads whatever media file and feeds audio/video frames to Subtitle Composer
- SC feeds audio frames to OpenAL
- SC feeds video frames to an OpenGL shader which does colorspace conversion, gamma correction and rendering - all of that on the GPU - and displays them inside a Qt video player widget
- SC renders subtitle text into a texture that the GPU again draws over the video
- SC can control seek precision down to a single frame and do whatever it needs to deliver a great user experience
I started by taking the example ffplay code from FFmpeg, cleaning it up a bit and replacing small bits with C++ and Qt5 classes and APIs. In no time I had a working video player that rendered video nicely into an SDL window and played SDL audio.
After a while the SDL window was replaced with a QOpenGLWidget. The YUV video frame was fed into a texture, a simple GLSL shader did the YUV -> RGB conversion and voilà - the frame was rendered inside a video widget. Everything seemed fine so far, but I knew that video frames don't always come in YUV format and that they can have different colorspaces.
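For the curious, the core of that first shader is just a 3x3 matrix multiply plus offsets. Here is a minimal Python sketch of the same math - assuming limited-range BT.601 YCbCr, the most common case for standard-definition video (the shader does this per fragment in GLSL):

```python
# Limited-range BT.601 YCbCr -> 8-bit RGB, the same math the GLSL
# shader performs per fragment (coefficients from ITU-R BT.601).
def ycbcr_to_rgb(y, cb, cr):
    y, cb, cr = y - 16.0, cb - 128.0, cr - 128.0  # remove range offsets
    r = 1.164 * y + 1.596 * cr
    g = 1.164 * y - 0.392 * cb - 0.813 * cr
    b = 1.164 * y + 2.017 * cb
    clamp = lambda v: max(0, min(255, round(v)))
    return clamp(r), clamp(g), clamp(b)
```

Note how video "white" is Y=235, not 255 - the 1.164 factor stretches the limited 16-235 range back to full 0-255.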
Afterwards SDL audio was replaced with OpenAL - that part went smoothly too, and the audio side was done.
Then I went digging into the various frame formats and colorspaces, looking for information about them - honestly hoping I would find something on Wikipedia or elsewhere online, grab some nice source from an existing video player like MPV or VLC, do some copy/pasta, add some credits and be done with it.
What I ended up figuring out is that there is a whole science of image encoding and perception that needs to be taken into account. Then there is the science and history of transmitting and encoding analog/digital video signals and their related standards (whose specs are nowhere to be found publicly - hard to obtain and hidden behind paywalls), plus some linear algebra. I found out that most open source video players around did colorspace conversions incorrectly and only approximated gamma correction, which resulted in colors that weren't quite right and dark video frames that were sometimes completely unperceivable.
Dark video frames in most software video players were very close to all black, and I had trouble perceiving shapes, objects and people in dark scenes without tweaking brightness/contrast - which would in turn make light scenes too light. Watching the same video on my smart TV was a completely different experience: I could perceive people and objects in dark scenes, and light scenes had normal color and lightness.
This was all happening because of the gamma encoding of video pixels - to put it simply, they weren't being properly decoded into RGB pixels. I won't go into too much detail here and will explain it in somewhat simplified terms, just to give the general idea. I have listed references to books that I've found very informative, useful and accurate at the bottom.
The thing that bothered me the most is that I was only able to find a bunch of incorrect and inaccurate matrices and examples online for colorspace conversion and gamma correction, while the actual math, science and accurate explanations are nowhere to be freely found. That sadly includes Wikipedia, which doesn't explain much - and some of what it does explain is incorrect.
It all goes back to human vision experiments done in 1931 by the CIE. The human eye has three different types of cells that are sensitive to different parts of the light spectrum. Some cells are more sensitive to light than others, and in general we can better perceive small color differences in low light than in bright light conditions.
Light spectrum information is encoded into RGB values (red, green and blue - one for each receptor cell type in the human eye), which can then be mathematically transformed into the XYZ colorspace. More or less, the Y component is our perception of lightness, while the X and Z components carry our perception of color (hue).
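As an illustration, the RGB -> XYZ step is a plain 3x3 linear transform. Here is a sketch assuming linear-light sRGB primaries with a D65 white point (the coefficients are the standard sRGB ones; other RGB spaces use different matrices):

```python
# Linear-light sRGB (D65) -> CIE XYZ: one 3x3 matrix multiply.
# Rows are the standard sRGB (IEC 61966-2-1) coefficients.
SRGB_TO_XYZ = (
    (0.4124, 0.3576, 0.1805),
    (0.2126, 0.7152, 0.0722),
    (0.0193, 0.1192, 0.9505),
)

def rgb_to_xyz(r, g, b):
    return tuple(m0 * r + m1 * g + m2 * b for m0, m1, m2 in SRGB_TO_XYZ)
```

The middle row alone produces Y, the luminance - which is why green contributes the most to perceived lightness.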
Video is usually encoded in some YUV colorspace - the Y component is supposed to represent lightness (though strictly speaking it doesn't) and the U and V components are supposed to represent color. As said before, the human eye is more sensitive to differences in low light than in bright light. For example, if Y were represented with values between 0 and 1000, with 0 being complete darkness, the human eye might notice the difference between 3 and 4 but might not notice the difference between 900 and 960.
Since each component in YUV is usually encoded into one byte (0-255 or -128-127), values are compressed using a power function. That way a Y of 2-9 would end up between 1.4-3.0, while 900-960 would map to 30.0-31.0 (this is just a crude example). The result can easily be stored in one byte, and when restored we will perceive lightness with little distortion. The problem with a pure power function is that it distorts low (dark) values too much: as said before, the eye might notice the difference between a Y of 3 and 4, but the power function (after rounding to an integer) encodes both of those values into 2. So it was decided to keep low values up to some threshold on a linear segment and encode only the higher values with the power function. That way 2-9 would stay 2.0-9.0 and 900-960 would end up in the 40.0-41.0 range. And that, more or less, is what a gamma transfer function does.
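Rec.709, the HD video standard, defines exactly such a piecewise curve: a linear segment below a small threshold and a power curve above it. A small Python sketch of that transfer function and its inverse (values are normalized to 0..1 here, not 0..1000 as in the crude example above):

```python
# Rec.709 transfer function: linear below the 0.018 knee,
# power curve (exponent 0.45) above it.
def bt709_encode(l):   # scene-linear light -> encoded value, both in 0..1
    return 4.5 * l if l < 0.018 else 1.099 * l ** 0.45 - 0.099

def bt709_decode(v):   # encoded value -> linear light
    return v / 4.5 if v < 0.081 else ((v + 0.099) / 1.099) ** (1 / 0.45)
```

The 0.081 threshold in the decoder is simply 4.5 * 0.018, so the two branches meet where the encoder's branches do.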
Computer monitors and video cards also have a gamma function of their own (usually that of a CRT, as standardized in sRGB), which is sometimes very similar to the one used to encode the video.
To accurately display colors, a video renderer should first decode pixels using the inverse of the gamma function and the colorspace that were used on the source, and afterwards apply the gamma and colorspace conversion that match your video card and monitor.
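The two-step correction above can be sketched like this, assuming a BT.709-encoded source and an sRGB display (a real renderer would also convert primaries between the two steps when source and display differ):

```python
# Correct gamma handling: decode with the source transfer function
# (BT.709 here), then re-encode with the display's (sRGB here).
def bt709_to_linear(v):
    return v / 4.5 if v < 0.081 else ((v + 0.099) / 1.099) ** (1 / 0.45)

def linear_to_srgb(l):
    return 12.92 * l if l <= 0.0031308 else 1.055 * l ** (1 / 2.4) - 0.055

def correct_value(v):
    # primaries conversion would go between these two steps
    return linear_to_srgb(bt709_to_linear(v))
```

The two curves differ most away from the endpoints, so skipping this round trip - or approximating it with a single power function - is exactly what distorts dark scenes.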
Since each video frame contains a lot of pixels (480,000 for 800x600), each of which has to be gamma corrected within 40 ms (at 25 fps), what most software video players did was ignore the linear part of the gamma function and approximate the power part, in order to reduce the number of CPU calculations enough to display fluid video in realtime without dropping frames.
Nowadays most video players do GPU accelerated rendering, but a lot of them still do gamma and colorspace correction incorrectly, or simply ignore it.
Here’s the literature I’ve used and recommend reading if you’re into the whole colorspace thing:
- CAN/CSA-ISO/IEC 23001-8:18 - Information technology — MPEG systems technologies — Part 8: Coding-independent code points (ISO/IEC 23001-8:2016, IDT)
- “Digital Video and HDTV - Algorithms and Interfaces” by C. Poynton (2003)
- “A Review of RGB Color Spaces (…from xyY to R’G’B’)” by Danny Pascale (2003)
- “Colour Space Conversions” by Adrian Ford and Alan Roberts (1998)
P.S. Several months after completing the color conversion and rendering work I found out about libplacebo, which does what I did with rendering, and likely better. So be sure to check it out - maybe it’ll be included in SC in the future; for now the current player works well.