Modern Digital Audio: Understanding Dither and Oversampling in the Hi-Res Age

  • #1
I have recently been investigating modern digital audio.

First, we need some background in Digital Signals. This can be mathematically quite advanced, but since I would like this post to be accessible to as wide an audience as possible, here is a link that explains what is needed (not even Calculus is required):

https://brianmcfee.net/dstbook-site/content/intro.html

An important point not emphasised in the above: if we have a signal whose maximum frequency is below f, Shannon guarantees not only that it can be reconstructed when sampled at 2f, but that it can be exactly reconstructed - no phase shift, ringing, blurring, etc. Exact reconstruction.

The most fundamental building block of modern digital audio is quickly becoming outdated: CD audio sampled at 44.1 kHz, with each sample 16 bits, abbreviated as 44.1/16.

We know this has a signal-to-noise ratio (SNR) of 96 dB from the article on digital signals.

But enter dither:

What is dither, and is it still relevant in the Hi-Res audio age?

The effective SNR using triangular probability density function (TPDF) dither is about 112 dB.

44.1/16 allows transmitting frequencies up to 22.05 kHz, per Shannon.

To determine the number of bits needed, the background noise of high-quality recordings must be examined. Even under ideal conditions with the best equipment, recordings achieve an SNR of about 110 dB at most. 130 dB is therefore a very reasonable SNR to aim at - indeed, that is close to the thermal noise limit, and going beyond it is useless.

Using 16 bits with dither gives an SNR of 112 dB. As we will see, we can increase this further, achieving well over 130 dB. Fourteen bits turn out to be enough to reach 130 dB.
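To make the dithering step concrete, here is a minimal sketch in Python with NumPy (the function names, signal level and seed are my own choices, not from any particular product). TPDF dither is the sum of two uniform random values, each half an LSB wide, added before rounding. Note that in plain RMS terms dither slightly *raises* the noise floor - its benefit is that the error becomes signal-independent noise, which oversampling and noise shaping can then trade away:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits):
    """Round a signal in [-1, 1) to the given bit depth."""
    q = 2.0 ** (bits - 1)
    return np.round(x * q) / q

def tpdf_dither(n, bits, rng):
    """Triangular-PDF dither: sum of two uniform sources, each 0.5 LSB wide."""
    lsb = 2.0 ** -(bits - 1)
    return (rng.uniform(-0.5, 0.5, n) + rng.uniform(-0.5, 0.5, n)) * lsb

fs, bits = 44100, 16
t = np.arange(fs) / fs
signal = 0.5 * np.sin(2 * np.pi * 1000 * t)     # a 1 kHz test tone

plain = quantize(signal, bits)
dithered = quantize(signal + tpdf_dither(len(signal), bits, rng), bits)

def snr_db(clean, noisy):
    """Ratio of signal power to error power, in dB."""
    noise = noisy - clean
    return 10 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))

print(f"undithered:    {snr_db(signal, plain):.1f} dB")
print(f"TPDF dithered: {snr_db(signal, dithered):.1f} dB (noisier, but the error is now signal-independent)")
```

Running this shows the dithered version a few dB noisier on a raw RMS basis - the higher effective figures quoted in the post come from how that now-benign noise behaves under oversampling and shaping.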

A fascinating phenomenon happens when you convert it to analogue: imaging (often loosely called aliasing). You get your original audio plus spectral images of it repeating forever above the Nyquist frequency. The output needs to be filtered at about 20 kHz to eliminate them. They are above audibility, so leaving them in has no directly audible consequences, but they can play havoc with amplifiers and other downstream equipment. Some designers don't bother removing them - these are the so-called NOS (non-oversampling) DACs - but most prefer to filter them out.

Combine this with a filter before sampling that limits the signal to 22 kHz, so Shannon's condition holds and the signal can be exactly reproduced without aliasing, and two hard-to-design, steep analogue filters are required. Well, life is not perfect, and the first DACs to appear did just this.

Then engineers started to have bright ideas.

First, is there an easier way to tackle the filter issue in the CD player? While the data arrives sampled at 44.1 kHz, nothing stops designers at the playback end from increasing the sampling frequency, say eight times to 352.8 kHz - this is called oversampling. You take one 44.1 kHz sample, then insert seven zero samples, and continue this way. Designing a 22 kHz digital filter that runs on this upsampled data is straightforward. The spectral images that survive the digital filter now sit near 352.8 kHz instead of just above 22 kHz, so a gentle analogue filter easily removes them. Oversampling was the first idea.
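A sketch of that zero-stuffing step plus the digital low-pass (Python/NumPy; the tap count and Hamming window are illustrative choices of mine, not from the post):

```python
import numpy as np

def zero_stuff(x, factor):
    """Insert factor - 1 zeros after each sample: raises the rate, creates images."""
    y = np.zeros(len(x) * factor)
    y[::factor] = x
    return y

fs, factor = 44100, 8
t = np.arange(1024) / fs
x = np.sin(2 * np.pi * 1000 * t)            # 1 kHz tone at 44.1 kHz

up = zero_stuff(x, factor)                  # now at 352.8 kHz, full of images

# the "22 kHz digital filter": a windowed-sinc FIR with cutoff at the old
# Nyquist frequency, running at the new 352.8 kHz rate
n = np.arange(-64, 65)
h = np.sinc(n / factor) * np.hamming(len(n))
smooth = np.convolve(up, h, mode='same')    # interpolated signal, images removed
```

Because the sinc is zero at every integer, the original samples pass through untouched, and the zeros get filled in with the band-limited interpolation between them.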

This had an important byproduct. The dither noise power in the original samples is fixed, but after zero-stuffing it is concentrated in the non-zero samples; applying the 22 kHz digital filter spreads it evenly across a band eight times wider, so only one eighth of it remains in the audio band. Each halving of the noise power means 3 dB less noise: two times oversampling takes the 112 dB SNR to 115 dB, and eight times oversampling gives 121 dB. The first DAC chips could not handle 16 bits - 14 bits was the maximum. But using four times oversampling and an early form of noise shaping (to be discussed later), they were made equivalent to 16-bit DACs.
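The decibel arithmetic above, as a quick check (the 112 dB base is the dithered 16-bit figure used in this post):

```python
import math

def oversampling_gain_db(factor):
    """Spreading a fixed noise power over `factor` times the bandwidth leaves
    1/factor of it in the audio band: 10*log10(factor) dB, i.e. 3 dB per doubling."""
    return 10 * math.log10(factor)

base = 112  # dithered 16-bit SNR figure used in this post
for factor in (2, 4, 8):
    print(f"{factor}x oversampling: {base + oversampling_gain_db(factor):.0f} dB")
# prints 115, 118 and 121 dB
```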

As always, things have moved on from those early days. Let's look at a modern DAC like the PS Audio DirectStream (DS). I use it as an example because I own one and have investigated how it works. It is nothing special; most other high-quality DACs these days work similarly. It oversamples a whopping 1280 times, or about 56 MHz sampling. Consider what this does to the SNR. Keep dividing 1280 by 2: 640, 320, 160, 80, 40, 20, 10, 5, 2.5, 1.25 - ten doublings. That is an extra 30 dB on the 112 dB we have after dithering, giving an SNR of 142 dB, way over what is required when the thermal limit is considered. Fourteen bits give 130 dB SNR. If an SNR below 130 dB is acceptable, 12 or even 8 bits could be used, giving about 118 dB and 94 dB respectively. Considering the DS has an overall noise floor of 120 dB, 12 bits would be acceptable. An even better strategy would be to locate the noise floor of the recording and transmit only enough bits to reproduce what is above it. FLAC compression does not compress noise well, and doing this will reduce FLAC files considerably.

As an experiment, I took some 44.1/16 audio, converted it to 44.1/8 with dither, and played it on my computer. During quiet passages you could hear a faint hiss. But through my DirectStream DAC it is dead quiet, even with my ear next to the speaker. As I said, 130 dB has a margin of safety over the best recordings, but even 94 dB is good.

This leads us naturally to how modern audio is recorded. The exact implementation varies a bit, but here goes. We feed the output of a microphone into one side of a comparator, which outputs a one if that side is greater than the other, and a zero otherwise. The comparator is sampled at a very high frequency, say 56 MHz, and its output drives an integrator whose voltage slowly rises on a one and falls on a zero; the integrator's output is connected to the comparator's other side. If the input voltage is above the integrator voltage, each sample will be a one and the integrator will slowly rise. Eventually it exceeds the input voltage, a zero is output, and the voltage falls. Thus the integrator tracks the input, and we get a stream of ones and zeroes that is easy to convert back to an analogue signal with a simple low-pass filter - a capacitor, or a high-quality transformer whose response rolls off at, say, about 70 kHz.
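As a sketch of the idea (Python; note this is the textbook first-order sigma-delta loop, a close cousin of the circuit described above rather than a model of any particular converter, and the step sizes and rates are my own illustrative choices):

```python
import numpy as np

def sigma_delta_1bit(x):
    """First-order sigma-delta modulator: integrate the difference between the
    input and the fed-back 1-bit output, then compare against zero.
    The density of ones in the bit stream tracks the input level."""
    integ, fb = 0.0, 0.0
    bits = np.empty(len(x), dtype=float)
    for i, s in enumerate(x):
        integ += s - fb                    # integrator of (input - feedback)
        out = 1.0 if integ >= 0 else 0.0   # comparator
        fb = 1.0 if out else -1.0          # 1-bit feedback, +/- full scale
        bits[i] = out
    return bits

# a 5 Hz sine sampled 50,000 times per second: massively oversampled
t = np.arange(50_000) / 50_000
x = 0.4 * np.sin(2 * np.pi * 5 * t)

bits = sigma_delta_1bit(x)
# a moving average stands in for the analogue low-pass (capacitor/transformer):
# low-pass filtering the +/-1 bit stream recovers the waveform
recovered = np.convolve(2.0 * bits - 1.0, np.ones(501) / 501, mode='same')
```

Away from the edges, the filtered bit stream matches the input to a small fraction of full scale, which is the whole trick: one bit, but at an enormous rate.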

To create the master from which the audio files are distributed, we digitally filter this stream to give eight-times-oversampled PCM called DXD (352.8 kHz). Why DXD? Audio engineers want a format whose sampling frequency is guaranteed to be more than double any maximum possible audio frequency, so Shannon implies exact reconstruction. They decided to make it much higher than necessary. Nearly all recordings have no frequencies over 44 kHz that are not swamped by noise; a few do. It is rare to come across a recording with frequencies above 88 kHz, and none, to my knowledge, goes above 176 kHz. 24-bit resolution is used for the same reason.

After downsampling the one-bit stream to the DXD rate, the resolution obtained naively is more like 8 bits than 24 bits. This is where a trick called noise shaping comes in. It is explained here:

https://www.analog.com/en/technical-articles/behind-the-sigma-delta-adc-topology.html

It covers what I said previously about increasing resolution using TPDF dither, but goes further, discussing another type of dither called noise-shaped dither. This type of dither does not reduce noise equally across all frequencies: compared to TPDF dither, the SNR is much better at low frequencies and much worse at high frequencies. The one-bit stream at, e.g., 56 MHz can represent frequencies up to 28 MHz. That is far too high to be of any concern, so we can tolerate a horrid SNR up there in exchange for an SNR equivalent to 24 bits at the DXD frequencies.
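A minimal sketch of the principle - first-order error-feedback noise shaping with TPDF dither (Python; the bit depth, tone and band split are my own illustration, not taken from the linked article):

```python
import numpy as np

rng = np.random.default_rng(1)

def noise_shaped_quantize(x, bits, rng):
    """First-order error-feedback quantizer with TPDF dither: each sample's
    quantization error is subtracted from the next input, which pushes the
    noise towards high frequencies."""
    q = 2.0 ** (bits - 1)
    err = 0.0
    out = np.empty_like(x)
    for i, s in enumerate(x):
        v = s - err                                                 # feed back previous error
        d = (rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)) / q   # TPDF dither
        out[i] = np.round((v + d) * q) / q
        err = out[i] - v
    return out

fs = 44100
t = np.arange(8192) / fs
x = 0.5 * np.sin(2 * np.pi * 1000 * t)
y = noise_shaped_quantize(x, 8, rng)

# the error spectrum now rises with frequency: compare the noise power in the
# bottom and top quarters of the band
spec = np.abs(np.fft.rfft(y - x)) ** 2
low = spec[: len(spec) // 4].sum()
high = spec[-(len(spec) // 4):].sum()
print(f"high-band / low-band noise power: {high / low:.1f}x")
```

The total noise power is unchanged, but far less of it sits where we listen - exactly the trade the one-bit converters exploit.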

Knowing this, we can complete the picture of how the DS DAC works. Everything is upsampled to 1280 times the CD sampling rate. It is then downsampled by a factor of ten, converted to a 1-bit stream with noise shaping by the same process used in recording, and passed through a transformer to remove the digital high frequencies, giving the audio output. The designer arranged things so that above about 70 kHz the transformer's frequency-response roll-off cancels the rise in noise from the one-bit converter and its noise shaper. The result is an SNR of 120 dB out to very high frequencies.

That's basically how modern audio is recorded and played back. For those who want the ultimate fidelity, you can purchase the DXD master. But in most cases everything is recovered by downsampling it to 176.4k or 88.2k. 44.1k is becoming less popular because its roughly 20 kHz filter removes actual recorded frequencies; how audible this is is a matter of debate. But 88.2k is, for nearly all recordings, enough to preserve all frequencies. Remember Shannon - provided the highest frequency is below half the sampling frequency, you get exact reproduction. Many DAC designers put a 50 kHz filter on the output to reduce noise, because so few recordings have content above 50 kHz that is not masked by recording noise.

Also, now that you understand dithering, we could apply dither to DXD and use only 14, 12, or even 8 bits. Combined with the upsampling, it would likely be indistinguishable from the DXD master. There is a further trick. A program can be written that determines the maximum frequency in a DXD recording that is not masked by noise, and a filter a bit above that frequency can then be applied. This does not affect exact reproduction, but since the content at high frequencies is often just noise, the final file becomes more amenable to compression with something like FLAC, which does not compress noise well. Indeed, the lower bits are mostly noise, so transmitting only 12 bits, for example, also makes a big difference. Some experiments I have done indicate a reduction to about the size of 44.1/16.

IMHO, this may eventually become the standard way audio is distributed.

Another issue is something audio engineers have noticed: as the sampling rate is increased, the audio sounds better, and the effect continues well into MHz sampling rates. We can only hear up to 20 kHz, so it can't be the reconstruction of higher frequencies. I won't go into the hypothesised reasons, except to note it is a phenomenon well known to audio engineers. However, as suggested above, we get exact reconstruction on playback when the audio is produced correctly, and upsampling to a very high rate - such as the 1280-times upsampling in the PS Audio DAC - simulates those high sampling frequencies.

This is important. A system called MQA was devised to reduce time smear, one of the hypothesised reasons high sampling rates sound better. It caused a lot of heated debate in Hi-Fi circles, but IMHO it addresses a non-issue in the system I described: modern DACs give exact reproduction at very high sampling rates, so there is no time smear - simple as that.
 
  • #2
This would make a good Insight article IMO.
 
  • #3
Wrichik Basu said:
This would make a good Insight article IMO.

Thought someone would suggest it.

Give me a bit of time.

Thanks
Bill
 
  • #4
bhobba said:
An important point not emphasised in the above is if we have a signal with maximum frequency f, Shannon guarantees not only can it be reconstructed if sampled at 2f, it can be exactly reconstructed - no phase shift, ringing, blurring, etc, but exact reconstruction.
I question whether this is only true for an infinite sample over infinite time. I seem to remember a more complicated formula for the limit of errors derived from a finite sample at the Nyquist frequency over a finite time. I can not find a reference and I am not able to see with certainty whether an infinite sample is required. If someone knows the answer to this question, I would really appreciate their input.
 
  • #5
FactChecker said:
I question whether this is only true for an infinite sample over infinite time. I seem to remember a more complicated formula for the limit of errors derived from a finite sample at the Nyquist frequency over a finite time. I can not find a reference and I am not able to see with certainty whether an infinite sample is required. If someone knows the answer to this question, I would really appreciate their input.
Fourier transform, perhaps. The signal is assumed to run from t = -∞ to +∞, since cutting it off at the edges can produce distortions - or at least it should span an integral number of sampled cycles, with the signal assumed to be periodic.

https://www.robots.ox.ac.uk/~sjrob/Teaching/SP/l7.pdf
 
  • #7
FactChecker said:
Is this only true for an infinite sample over infinite time? I remember a more complicated formula for the limit of errors derived from a finite sample at the Nyquist frequency over a finite time. I can not find a reference, and I cannot see with certainty whether an infinite sample is required. I would appreciate their input if someone knows the answer to this question.
You are correct. Exact reconstruction is only in the limit of infinitely large upsampling using an infinitely long sinc filter.

The first link in my original post has a section on filters, but I will detail the most important case. The bit depth is assumed to be so large that, for all practical purposes, it is infinite, and all calculations are done at that resolution. To clarify the statement above, we will start with two-times upsampling and how it is done. You put the first sample in position one, a zero in the second position, the second sample in position three, a zero in the fourth, and so on. These zero samples add higher-frequency content to the signal if you convert it to analogue, so you want to get rid of it.

Technically, this is done by convolving it with the sinc function for the frequency you want to filter.

For those more into advanced math, see the Wikipedia article:
https://en.wikipedia.org/wiki/Sinc_function

For the rest of us, here is the intuitive explanation. Suppose we put an impulse of unit height and very short duration through an ideal filter with cutoff frequency f. What comes out is the sinc function for that cutoff frequency. In the linked Wikipedia article you can see its graph and its formula, sin(x)/x, scaled according to the cutoff frequency. For those who know some more advanced math, here is why. The Fourier transform of a Dirac delta function is one; intuitively, the Fourier transform breaks a function down into its frequency spectrum. To band-limit it, we replace that with a box function that is zero outside of -f and f. Taking the inverse Fourier transform back to the time domain gives the sinc function. An impulse is approximately proportional to the delta function.
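In symbols, that inverse transform of the box of width 2f is (using the normalised sinc convention of the Wikipedia article):

\[
\int_{-f}^{f} e^{2\pi i \nu t}\, d\nu \;=\; \frac{\sin(2\pi f t)}{\pi t} \;=\; 2f\,\operatorname{sinc}(2 f t),
\qquad \operatorname{sinc}(x) = \frac{\sin(\pi x)}{\pi x}.
\]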

It is symmetric about time t=0 with values forward and backward in time.

Suppose you have a signal sampled at such a high frequency that each sample, for all practical purposes, is an impulse multiplied by A, the value of the signal at that sample. Pass that through your ideal filter at frequency f, and you get the sum of the sinc functions, each multiplied by its sample value A.

One way to approximate this is to sample the sinc function at the output sample frequency, giving an array of numbers S(n) stored in memory. For the true sinc function that memory would need to be infinitely long, but since we keep numbers only to finite precision, beyond some length L the values are effectively zero.

You also have an output array O(n) of length L. When a sample comes in, you first output O(1) and shift all the values down, so O(n+1) moves to O(n), leaving O(L) as zero. Then, for each n, you multiply the incoming sample by the sinc value S(n) and add it to O(n). You keep repeating as signal samples arrive. This is called convolving with the sinc function.
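Here is that buffer scheme in Python (the tap table and lengths are arbitrary, purely for illustration; I add the incoming sample's contributions before emitting the output so the result lines up exactly with a direct convolution):

```python
import numpy as np

def stream_convolve(samples, S):
    """Streaming convolution with a tap table S(n), as described above:
    accumulate sample * S into the buffer, emit the oldest slot, shift down."""
    L = len(S)
    O = np.zeros(L)
    out = []
    for a in samples:
        O += a * S          # add this sample's contribution under every tap
        out.append(O[0])    # the front of the buffer is now complete
        O[:-1] = O[1:]      # shift everything down one place
        O[-1] = 0.0
    return np.array(out)

# a (one-sided, truncated) sinc tap table for a half-band filter, L = 64
S = np.sinc(np.arange(64) / 2.0) / 2.0

x = np.random.default_rng(3).normal(size=200)
y = stream_convolve(x, S)
```

This matches NumPy's direct convolution sample for sample; a real interpolation filter would use a two-sided, windowed sinc and accept the corresponding delay.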

This filters the digital signal at frequency f (at least approximately). Applied to a signal sampled at 2f that has been upsampled to 4f with zeros in the missing positions, it yields the same signal as the original, but now sampled at 4f instead of 2f.

Of course, you can upsample to any frequency you like, padding the extra positions with zeroes. When the rate is very high, you get a good reconstruction of the original signal that is practically continuous. You can make it analogue by converting it to one bit with noise shaping and passing it through a low-pass filter.

If the limit of this process is taken (it can't be done by an actual digital processor - but mathematically, we can analyse it and see what happens in the limit), then you get exact reproduction in that limit. The limit is called infinitely large upsampling using an infinitely long sinc filter.

Note that I mentioned downsampling a couple of times in my original post. It is easy to modify the above process to downsample instead of upsample, and working out the details is an excellent exercise. Remember that, at the end, you must first filter out everything above half the new, lower sampling frequency before throwing away the unneeded samples.
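A sketch of that downsampling counterpart (Python; the filter length and window are my own choices): filter below the new Nyquist first, then keep every factor-th sample.

```python
import numpy as np

def downsample(x, factor, taps=129):
    """Decimate by `factor`: low-pass below the new Nyquist frequency with a
    windowed-sinc FIR, then discard the samples in between."""
    n = np.arange(taps) - taps // 2
    h = np.sinc(n / factor) * np.hamming(taps)  # cutoff at fs / (2 * factor)
    h /= h.sum()                                # unity gain at DC
    filtered = np.convolve(x, h, mode='same')   # anti-alias filter first
    return filtered[::factor]                   # then throw samples away

fs = 44100
t = np.arange(4096) / fs
x = np.sin(2 * np.pi * 1000 * t)   # a 1 kHz tone, well below the new Nyquist
y = downsample(x, 4)               # now effectively sampled at 11.025 kHz
```

Skipping the filter and keeping every fourth sample directly would alias any content above the new Nyquist back into the audio band, which is exactly what the filter-first order of operations prevents.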

It also gives some idea of the computing power needed for high-quality digital audio.

If you are a real glutton for punishment, know Lebesgue integration and functional analysis, see:

https://d-nb.info/1114893048/34

Have fun!

Thanks
Bill.
 
  • #8
As this is my second post here at PhysicsForums, I'll provide some background on my technical experience. I'm a retired audio DSP engineer with several decades of experience at companies such as Dolby Laboratories and THX. In addition, I've worked extensively on audio signal compression applications using software codecs such as MP3, AAC, AC-3, and MPEG. Here are responses to issues raised in this thread:
bhobba said:
The most fundamental building block of modern digital audio is quickly becoming outdated: CD audio sampled at 44.1 kHz, with each sample 16 bits, abbreviated as 44.1/16.

We know this has a signal-to-noise ratio (SNR) of 96 dB from the article on digital signals.

But enter dither:

The effective SNR using triangular probability density function (TPDF) dither is about 112 dB.

44.1/16 allows transmitting frequencies up to 22 kHz from Shannon.
In audio content distribution, a sample rate of 44.1 kHz at 16 bits is commonly referred to as CD-Audio quality, and has been a professional audio standard since Compact Disc media was introduced in the early 1980s. Synchronized soundtracks integrated with video and film content, however, are sampled at 48 kHz at 16 bits. These standard sample rates are close enough to deliver comparable audio quality, but cannot be intermixed without undergoing sample rate conversion. [1]

Professional audio recordings commonly use increased sample rates and/or bit depths to provide additional digital processing headroom for multi-track mixdown, equalization, and special effects, while avoiding the risk of degrading audio content below CD-Audio quality. Sample rates of 96 kHz or 192 kHz are also frequently used to increase the accuracy of anticipated sample rate conversions to 44.1 kHz. Bit depths of 24 or 32 bits increase the signal-to-noise ratio to more than cover the maximum practical analog dynamic range.

Dithering is routinely used in sample rate and bit-depth conversion algorithms to minimize truncation errors. The dithering process adds low-level noise to the digitized signal, which randomizes the coarse thresholds produced by digital truncation. This statistically smooths out the transitions between digital signal levels, reducing average quantization error while only slightly degrading the signal-to-noise ratio. [2]

bhobba said:
A fascinating phenomenon happens when you convert it to analogue due to aliasing. You get your original audio plus reflections of it that go on forever. It needs to be filtered at about 20 kHz to eliminate those. They are above audibility, so leaving them there has no audible consequences but can play havoc with amplifiers, etc., when listening to audio. Some don't bother when designing a DAC. They are called NOS DACs, but most designers like to remove them.
In practice, aliasing of high-frequency audio content against the digital sample rate is avoided by the use of conservatively tuned analog anti-aliasing filters. Such filters have an upper frequency limit, beyond which higher frequencies are dramatically reduced. However, there is a narrow transition band just above the upper frequency limit, where higher frequencies are only marginally reduced. For this reason, the upper frequency limit is typically placed close to 20 kHz (widely considered the limit of human hearing), leaving a guard band of about 2 kHz before aliasing starts at 22.05 kHz (with a sampling frequency of 44.1 kHz). [3]

[1] https://www.izotope.com/en/learn/digital-audio-basics-sample-rate-and-bit-depth.html
[2] https://www.izotope.com/en/learn/what-is-dithering-in-audio.html
[3] https://www.blackghostaudio.com/blog/the-quick-guide-to-audio-aliasing
 
  • #9
Wrichik Basu said:
This would make a good Insight article IMO.

Now done.

I will start on a new one explaining oversampling, digital filtering, and exact reconstruction.

Thanks
Bill
 
  • #10
I have posted the article Digital Filtering and Exact Reconstruction of Digital Audio.

Thanks
Bill
 

What is dither in digital audio?

Dither is a type of intentional noise added to a digital audio signal during the quantization process to minimize quantization errors and distortion. It works by randomizing the quantization error, spreading it across a broader frequency range as noise, which is less perceptible to the human ear. This technique is especially crucial when reducing the bit depth of a track to maintain audio quality.

How does oversampling benefit digital audio processing?

Oversampling is a technique used in digital audio processing where the sample rate of the audio signal is increased multiple times higher than the Nyquist rate. This process helps in reducing aliasing, a form of distortion that occurs when high-frequency signals are inaccurately represented at lower sampling rates. Oversampling leads to a cleaner, more accurate digital representation of the audio signal and allows for more efficient and effective digital filtering and noise shaping.

What is Hi-Res audio and how does it differ from standard resolution audio?

Hi-Res audio refers to sound recording and playback systems that operate at higher sample rates and bit depths than standard CD quality, which is 44.1 kHz and 16 bits. Hi-Res audio typically starts at 48 kHz/24 bit and can go much higher, often up to 192 kHz/24 bit. The main advantage of Hi-Res audio is its ability to reproduce audio signals more accurately and with greater detail than standard resolution, providing a richer and more immersive listening experience.

Why is dither necessary when converting between different bit depths?

Dither is necessary when converting between different bit depths to prevent quantization noise from becoming structured and audible. Without dither, reducing bit depth directly can result in harsh, unpleasant digital artifacts as the smaller bit depth cannot accurately represent the original signal's subtle details. Dither masks these artifacts by adding low-level noise, making the conversion sonically transparent and preserving the audio quality even at reduced bit depths.

Can oversampling and dithering improve the sound quality of all digital audio files?

Oversampling and dithering can improve the sound quality of digital audio files, particularly during the production and mastering phases. However, their effectiveness can vary depending on the original recording quality, the specific digital audio system, and the playback environment. While they are essential tools in professional audio production for achieving high-quality results, the improvements might not always be perceptible, especially to untrained ears or in non-ideal listening conditions.
