Hey hey,
I watched the 39C3 talk about the PTP protocol for time sync, which you can find here if you're interested: Excuse me, what precise time is It? - media.ccc.de. Since then I've had a tiny idea:
Someone in the audience asked whether syncing with nanosecond precision is actually required, and the speaker claimed it is, because he believes we can hear into the ns range.
Well … do we?
So I thought I'd give it a try and created a DIY test script in strudel.cc to check my own auditory skills. Turns out Strudel has quite severe bottlenecks when it comes to quantization: it produced artifacts that made it seem like I had hit rock bottom of what Firefox/a browser engine/Strudel/JS can offer once I was in the area of 1 ms splits between two distinct sounds.
That's why I reworked everything in Tidal with a slightly different script (why can't both have the same language interpreter…), and I think it does pretty well for v0.1.
What do we got?
This test is about time resolution: the smallest left-right timing difference your auditory system can perceive. Because we localize sound using binaural timing cues, finer time discrimination should correspond to finer angular localization!
If you're interested in trying it yourself, here is the code:
setcps (120/60/4) >> d1 (let
    dt = (1/16) :: Time
    maskOnL  = ("1 1 1 1" :: Pattern Bool)
    maskOnR  = ("1 0 1 0" :: Pattern Bool)
    maskOffL = ("0 0 0 0" :: Pattern Bool)
    maskOffR = ("0 1 0 1" :: Pattern Bool)
    q = fast 4 $ s "bd" # sustain (pure 0.20) # release (pure 0.20) # gain (pure 1)
    shift24 = cat [ pure 0
                  , choose [(-dt), dt]
                  , pure 0
                  , choose [(-dt), dt]
                  ]
    onBeatL  = (mask maskOnL q)               # pan (pure 0) # gain (pure 1)
    onBeatR  = (mask maskOnR q)               # pan (pure 1) # gain (pure 1)
    offBeatL = (shift24 ~> (mask maskOffL q)) # pan (pure 0) # gain (pure 1)
    offBeatR = (shift24 ~> (mask maskOffR q)) # pan (pure 1) # gain (pure 1)
  in stack [onBeatL, onBeatR, offBeatL, offBeatR])

(Tidal code published under CC BY 4.0. Author: be.Motion.)
What's going on there exactly?
Bear with me: I'll copy/paste the description I already wrote for a rendered YouTube video, which I made so my friends can try it too:
Setup / concept
- Important: use stereo headphones! Best results: play from a computer, not a smartphone.
- Tempo is 120 bpm → 120 beats / 60 seconds = 2 beats per second.
- A simple 4/4 rhythm: 1 bar = 4 beats = 2 seconds.
- I call the constant reference beat the “on-time”: played in stereo (left + right), it stays constant on beats 1 and 3 (every second beat is a reference).
- The delayed beat is the “off-beat”: it plays on the right channel only, on beats 2 and 4, with an added delay (or negative delay). To avoid missing beats on the left at 2 and 4, the on-time beat is played there again.
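As a sanity check on the tempo arithmetic above, here it is in plain Haskell (not Tidal; the names `cps` and `barLength` are mine):

```haskell
-- Sanity check of the tempo arithmetic (plain Haskell, not Tidal).
-- In Tidal, setcps (120/60/4) sets cycles per second; one cycle = one bar here.
cps :: Double
cps = 120 / 60 / 4   -- 120 bpm, 4 beats per bar -> 0.5 cycles per second

barLength :: Double
barLength = 1 / cps  -- seconds per bar -> 2.0 s

main :: IO ()
main = putStrLn ("cps = " ++ show cps ++ ", bar length = " ++ show barLength ++ " s")
```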
Key idea: when the "on-time" and the "off-beat" are very close in time, you may perceive a change in character (slightly longer / more “smeared”), which goes "tik tok tik tok" because beats 1 and 3 always play on-beat.
As soon as everything sounds perfectly the same (short, precise, sharp) and you hear a constant "tik tik tik tik", you know there is still some delay but you can't perceive it any longer!
Et voilà: you are at your limit and need to go one step back!
What's the actual delay?
Set it via the dt = 1/x value!
Try 16, 32, 64, 96, 128, 256, 512, 1000 or 1024, 2000, 4k, 8k, 16k, 20k, … up until you reach your sample rate.
Because 1 bar = 2 seconds, the timing step is calculated as:
- 2 s / 16 = 0.125 s = 125 ms
- 2 s / 32 = 0.0625 s = 62.5 ms
- 2 s / 64 = 0.03125 s = 31.25 ms
- 2 s / 96 ≈ 0.02083 s ≈ 20.83 ms
- 2 s / 128 = 0.015625 s ≈ 15.63 ms
- 2 s / 256 = 0.0078125 s ≈ 7.8 ms
- 2 s / 512 = 0.00390625 s ≈ 3.9 ms
- 2 s / 1024 ≈ 0.001953 s ≈ 1.953 ms
- 2 s / 2000 = 0.001 s = 1 ms = 1000 µs
- 2 s / 4000 = 0.0005 s = 0.5 ms = 500 µs
- 2 s / 8000 = 0.00025 s = 0.25 ms = 250 µs
- 2 s / 16000 = 0.000125 s = 0.125 ms = 125 µs
- 2 s / 20000 = 0.0001 s = 0.1 ms = 100 µs
- …
- 2 s / 200000 = 0.00001 s = 0.01 ms = 10 µs
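If you'd rather not redo that arithmetic by hand for every divisor, here is a small plain-Haskell sketch (not part of the Tidal script; `delayMs` is my own helper name, assuming 1 bar = 2 s as above):

```haskell
-- Delay step for a divisor x, where dt = 1/x of a bar and one bar lasts 2 s.
-- (Plain Haskell sketch of the arithmetic from the list above.)
barSeconds :: Double
barSeconds = 2.0

delayMs :: Double -> Double
delayMs x = barSeconds / x * 1000  -- result in milliseconds

main :: IO ()
main = mapM_ (\x -> putStrLn (show x ++ " -> " ++ show (delayMs x) ++ " ms"))
             [16, 32, 64, 96, 128, 256, 512, 1024, 2000, 4000, 8000, 16000, 20000]
```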
Expected outcome
If you are similar to me you might hear a difference down to around 1/2000 = 1 ms, but I think I capped out there. 1/4000 = 500 µs is something I don't think I can actually hear. Beware: I'm not fully confident in my own result yet, because multiple technical factors can unintentionally alter the generated sound, making us “hear” differences that aren't the intended effect.
How this was made
- Linux kernel 6.14.0 with the PREEMPT_DYNAMIC kernel flag, no legacy 1000 Hz tick rates!
- PipeWire 1.0.5 + pipewire-jack: Float32 stereo format, 128-sample quantum, 48 kHz rate
- Helvum patchbay
- SuperCollider + SuperDirt + Tidal via the ansible installer
- Pulsar + TidalCycles plugin
So one question is left: how precisely can you localize sound now?
I have an idea for calculating this using simple trigonometric relations and the speed of sound: translate your “time resolution value” into a 1D distance metric, then calculate the missing side and find how densely distinguishable sources can be packed at a given distance, and therefore your 2D "localization resolution".
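To avoid spoiling the trigonometry, here is only the very first step as a sketch: converting a timing resolution into a 1D path-length difference. The speed-of-sound value (343 m/s, dry air at roughly 20 °C) is an assumption of mine, and `pathDiff` is a made-up helper name:

```haskell
-- First step only: timing resolution (seconds) -> 1D path-length difference (metres).
-- speedOfSound = 343 m/s is an assumed value (dry air, ~20 degrees C);
-- the trigonometry up to an angular resolution is deliberately left out.
speedOfSound :: Double
speedOfSound = 343

pathDiff :: Double -> Double
pathDiff dt = speedOfSound * dt

main :: IO ()
main = mapM_ (\dt -> putStrLn (show dt ++ " s -> " ++ show (pathDiff dt) ++ " m"))
             [1.0e-3, 5.0e-4, 1.0e-4]
```

At 1 ms, for example, that is roughly 34 cm of path difference; from there, the triangle construction described above takes over.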
Feel free to try it yourself before I post my calculations; that was actually my motivation for doing this. It's too easy in this day and age of AI to ask an LLM or Google something, so I was in the mood to make a plan for how to tackle this question myself.
That is why everybody is welcome to share their thoughts and contribute, for the sake of fun and exploration.
I already have other scripts in mind for testing other aspects of acoustic time resolution, as we only tested a simple metronome with always the same sound! So don't think this was the last thing, or that the question is already solved.
Cheers! And a Happy New Year!