Automatic Segmentation of Audio by Speaker

Posted 4 years ago | Originally written on 10 Mar 2020

Problem

I recently attended a meeting in which I had to write notes from the discussions. I was feeling lazy so I decided that I would record the audio on my phone then make notes from that. That part worked really well until I sat down to make the notes.

The meeting was structured such that there were several speakers each of whom would be interrupted with questions or comments. It was the comments that I was after. Overall the meeting lasted just under an hour with interruptions occurring at any point.

I started by playing the audio directly off my phone and taking notes. It didn't take long to realise that I couldn't keep up and that I had very little control over playback. I wanted to be able to move forward and back precisely but that wasn't possible on the phone. Furthermore, I wished I didn't have to listen to the whole thing again. I then began to wonder if there was a simple way to segment the audio and then skip only to the start of each new segment...

Surely someone has integrated this into a popular tool.

Development

I've used Audacity at church to record sermons but I've never really played with it. It has a respectable number of filters and effects and I had a hunch that it would have a way to do some sort of automatic segmentation.

Unfortunately it didn't. But it does have a facility to label points or portions of audio. There was no way I was going to listen to the whole piece just to add labels - that would take an awful chunk of time. What I needed was an automatic way to add labels.

My first search was for 'audio segmentation' which turned up a bunch of research articles on various approaches to doing so. Clearly I was not going to spend hours (far more exciting than rote listening) implementing these. I wanted a complete solution.

I then searched for 'segment audio by speaker'. The first result was the Data Science Stack Exchange question 'python - Audio Analysis : Segment audio based on speaker recognition', which had two links in it. The first one (https://github.com/tyiannak/pyAudioAnalysis/wiki/5.-Segmentation) was more descriptive and less practical, while the second one (https://github.com/aalto-speech/speaker-diarization) was a repository containing a collection of Python scripts that could do various automatic tasks.

I was not yet convinced that I wanted to learn how to use, let alone run, a bunch of unknown scripts. I still hoped to get a quick and dirty solution, especially one which involved click-click-click (yeah, I know!).

I next tried to search for 'audacity audio segmentation' which had results on how to 'split' audio and how to 'label' audio but not how to perform automatic segmentation by speaker.

By this point I figured that I would have to use Audacity anyway so I downloaded and installed it. Since the audio was recorded in M4A (AAC) format, which Audacity cannot import out of the box, I had to download and install the ffmpeg library to add support for it.

After toying around for a bit it became clear that I would have to invest the time in getting the speaker-diarization scripts to work. The fastest way would involve running the scripts from a Docker container (as outlined in the helpful README). So off I went in search of Docker...

Once this was set up I tried out the example.

~$ docker run -it blabbertabber/aalto-speech-diarizer bash
[root@3d30f04ba1d7 /]# cd /speaker-diarization
[root@3d30f04ba1d7 speaker-diarization]# curl -k -OL https://nono.io/meeting.wav
 % Total   % Received % Xferd Average Speed  Time   Time    Time Current
                                Dload Upload  Total  Spent   Left Speed
100 17.2M 100 17.2M   0    0 1661k     0 0:00:10 0:00:10 --:--:-- 1882k
[root@3d30f04ba1d7 speaker-diarization]# ./spk-diarization2.py meeting.wav
Reading file: meeting.wav
Writing output to: stdout
Using feacat from: /speaker-diarization/feacat
Writing temporal files in: /tmp
Writing lna files in: /speaker-diarization/lna
Writing exp files in: /speaker-diarization/exp
Writing features in: /speaker-diarization/fea
Performing exp generation and feacat concurrently
tokenpass: ./VAD/tokenpass/test_token_pass
Reading recipe: /tmp/init91Tj5U.recipe
Using model: ./hmms/mfcc_16g_11.10.2007_10
Writing `.lna` files in: /speaker-diarization/lna
Writing `.exp` files in: /speaker-diarization/exp
Processing file 1/1
Input: meeting.wav
Output: /speaker-diarization/lna/meeting.lna
FAN OUT: 0 nodes, 0 arcs
FAN IN: 0 nodes, 0 arcs
Prefix tree: 3 nodes, 6 arcs
WARNING: No tokens in final nodes. The result will be incomplete. Try increasing beam.
Calling voice-detection2.py
Reading recipe from: /tmp/init91Tj5U.recipe
Reading .exp files from: /speaker-diarization/exp
Writing output to: /tmp/vadkNIKcZ.recipe
Sample rate set to: 125
Minimum speech turn duration: 0.5 seconds
Minimum nonspeech between-turns duration: 1.5 seconds
Segment before expansion set to: 0.0 seconds
Segment end expansion set to: 0.0 seconds
Waiting for feacat to end.
Calling spk-change-detection.py
Reading recipe from: /tmp/vadkNIKcZ.recipe
Reading feature files from: /speaker-diarization/fea
Feature files extension: .fea
Writing output to: /tmp/spkcKQXuDX.recipe
Conversion rate set to frame rate: 125.0
Using a growing window
Deltaws set to: 0.096 seconds
Using BIC as distance measure, lambda = 1.0
Window size set to: 1.0 seconds
Window step set to: 3.0 seconds
Threshold distance: 0.0
Useful metrics for determining the right threshold:
---------------------------------------------------
Average between windows distance: -370.524559348
Maximum between windows distance: 2039.1026531722064
Minimum between windows distance: -1222.9105555792084
Total windows: 346
Total segments: 64
Average between detected segments distance: 327.139634263
Maximum between detected segments distance: 2043.4163627520657
Minimum between detected segments distance: 11.18226920088864
Total detected speaker changes: 41
Calling spk-clustering.py
Reading recipe from: /tmp/spkcKQXuDX.recipe
Reading feature files from: /speaker-diarization/fea
Feature files extension: .fea
Writing output to: stdout
Conversion rate set to frame rate: 125.0
Using hierarchical clustering
Using BIC as distance measure, lambda = 1.3
Threshold distance: 0.0
Maximum speakers: 0
Initial cluster with: 64 speakers
Merging: 38 and 44 distance: -2921.769470541185
Merging: 38 and 40 distance: -2951.2135406549614
Merging: 38 and 43 distance: -2871.713460678535
Merging: 38 and 44 distance: -2917.2587434102315
Merging: 51 and 53 distance: -2871.0543575817246
Merging: 51 and 54 distance: -2940.8351154446605
Merging: 28 and 38 distance: -2852.6957536639875
Merging: 50 and 51 distance: -2850.9433078229094
Merging: 28 and 39 distance: -2759.0140977398596
Merging: 49 and 52 distance: -2695.0033931089265
Merging: 44 and 49 distance: -2756.0663835379573
Merging: 44 and 49 distance: -2710.348483406712
Merging: 28 and 36 distance: -2667.4482367885075
Merging: 28 and 39 distance: -2660.3583427837257
Merging: 20 and 28 distance: -2657.115016038645
Merging: 20 and 35 distance: -2715.1799624734203
Merging: 18 and 20 distance: -2710.654666705846
Merging: 17 and 18 distance: -2684.771867092685
Merging: 17 and 19 distance: -2617.350598439542
Merging: 15 and 17 distance: -2620.851348572456
Merging: 15 and 28 distance: -2607.3123879864024
Merging: 18 and 24 distance: -2471.3269385791946
Merging: 1 and 4 distance: -2433.1455909440465
Merging: 10 and 14 distance: -2358.085004075313
Merging: 16 and 25 distance: -2350.1272441177916
Merging: 16 and 21 distance: -2387.941421214947
Merging: 16 and 19 distance: -2393.6868377148476
Merging: 16 and 18 distance: -2417.338076661149
Merging: 20 and 28 distance: -2339.7809645759626
Merging: 3 and 22 distance: -2321.1849558166023
Merging: 28 and 30 distance: -2288.3740035107658
Merging: 10 and 12 distance: -2284.9976499143118
Merging: 7 and 10 distance: -2266.637970303841
Merging: 18 and 25 distance: -2181.9640388075186
Merging: 7 and 29 distance: -2173.480848412663
Merging: 7 and 9 distance: -2130.329377092388
Merging: 4 and 7 distance: -2130.82077204709
Merging: 4 and 15 distance: -2110.6807496186516
Merging: 4 and 14 distance: -2115.528322770778
Merging: 4 and 16 distance: -2171.816660672457
Merging: 4 and 6 distance: -2114.09627666888
Merging: 3 and 8 distance: -1928.8540169733815
Merging: 3 and 7 distance: -2005.493368380382
Merging: 2 and 4 distance: -1928.8040567328771
Merging: 3 and 15 distance: -1922.8509757421589
Merging: 17 and 19 distance: -1843.9285212133682
Merging: 14 and 15 distance: -1815.3507925949698
Merging: 2 and 9 distance: -1797.8386215631135
Merging: 2 and 4 distance: -1908.8801149227675
Merging: 2 and 5 distance: -1845.968505681797
Merging: 6 and 9 distance: -1655.75556196682
Merging: 2 and 4 distance: -1625.9641887100834
Merging: 2 and 7 distance: -1404.1768198308237
Merging: 2 and 7 distance: -1317.0797306828836
Merging: 3 and 4 distance: -1299.294455103457
Merging: 7 and 9 distance: -1179.8781203555072
Merging: 5 and 8 distance: -1144.8095559456488
Merging: 1 and 3 distance: -741.7540523122689
Merging: 4 and 5 distance: -618.7548791396439
Final speakers: 5
Useful metrics for determining the right threshold:
---------------------------------------------------
Maximum between segments distance: 21370.577321699926
Minimum between segments distance: -2951.2135406549614
Total segments: 64
Total detected speakers: 5

And what does the output look like?

[root@3d30f04ba1d7 speaker-diarization]# cat stdout
audio=meeting.wav lna=a_1 start-time=0.384 end-time=5.82 speaker=speaker_1
audio=meeting.wav lna=a_2 start-time=5.82 end-time=31.648 speaker=speaker_2
audio=meeting.wav lna=a_3 start-time=31.648 end-time=58.272 speaker=speaker_1
audio=meeting.wav lna=a_4 start-time=60.032 end-time=66.536 speaker=speaker_1
audio=meeting.wav lna=a_5 start-time=66.536 end-time=68.748 speaker=speaker_2
audio=meeting.wav lna=a_6 start-time=68.748 end-time=70.576 speaker=speaker_2
audio=meeting.wav lna=a_7 start-time=70.576 end-time=78.264 speaker=speaker_2
audio=meeting.wav lna=a_8 start-time=79.84 end-time=80.248 speaker=speaker_2
audio=meeting.wav lna=a_9 start-time=80.248 end-time=82.792 speaker=speaker_2
audio=meeting.wav lna=a_10 start-time=82.792 end-time=83.372 speaker=speaker_2
audio=meeting.wav lna=a_11 start-time=83.372 end-time=88.96 speaker=speaker_2
audio=meeting.wav lna=a_12 start-time=88.96 end-time=93.288 speaker=speaker_1
audio=meeting.wav lna=a_13 start-time=93.288 end-time=93.9 speaker=speaker_2
audio=meeting.wav lna=a_14 start-time=93.9 end-time=96.436 speaker=speaker_1
audio=meeting.wav lna=a_15 start-time=96.436 end-time=98.436 speaker=speaker_2
audio=meeting.wav lna=a_16 start-time=98.436 end-time=102.736 speaker=speaker_2
audio=meeting.wav lna=a_17 start-time=102.736 end-time=103.284 speaker=speaker_2
audio=meeting.wav lna=a_18 start-time=103.284 end-time=103.888 speaker=speaker_2
audio=meeting.wav lna=a_19 start-time=103.888 end-time=110.156 speaker=speaker_1
audio=meeting.wav lna=a_20 start-time=110.156 end-time=114.2 speaker=speaker_2
audio=meeting.wav lna=a_21 start-time=119.936 end-time=124.256 speaker=speaker_2
audio=meeting.wav lna=a_22 start-time=124.256 end-time=126.512 speaker=speaker_3
audio=meeting.wav lna=a_23 start-time=126.512 end-time=140.956 speaker=speaker_2
audio=meeting.wav lna=a_24 start-time=140.956 end-time=143.256 speaker=speaker_3
audio=meeting.wav lna=a_25 start-time=148.76 end-time=152.472 speaker=speaker_3
audio=meeting.wav lna=a_26 start-time=157.208 end-time=166.98 speaker=speaker_2
audio=meeting.wav lna=a_27 start-time=166.98 end-time=171.5 speaker=speaker_3
audio=meeting.wav lna=a_28 start-time=171.5 end-time=173.588 speaker=speaker_2
audio=meeting.wav lna=a_29 start-time=173.588 end-time=190.016 speaker=speaker_3
audio=meeting.wav lna=a_30 start-time=190.016 end-time=193.208 speaker=speaker_2
audio=meeting.wav lna=a_31 start-time=195.176 end-time=195.88 speaker=speaker_4
audio=meeting.wav lna=a_32 start-time=195.88 end-time=199.672 speaker=speaker_2
audio=meeting.wav lna=a_33 start-time=201.888 end-time=203.436 speaker=speaker_2
audio=meeting.wav lna=a_34 start-time=203.436 end-time=209.304 speaker=speaker_3
audio=meeting.wav lna=a_35 start-time=210.912 end-time=212.88 speaker=speaker_1
audio=meeting.wav lna=a_36 start-time=215.256 end-time=216.708 speaker=speaker_2
audio=meeting.wav lna=a_37 start-time=216.708 end-time=218.912 speaker=speaker_2
audio=meeting.wav lna=a_38 start-time=224.424 end-time=226.968 speaker=speaker_2
audio=meeting.wav lna=a_39 start-time=226.968 end-time=227.448 speaker=speaker_2
audio=meeting.wav lna=a_40 start-time=227.448 end-time=240.544 speaker=speaker_2
audio=meeting.wav lna=a_41 start-time=242.92 end-time=243.628 speaker=speaker_2
audio=meeting.wav lna=a_42 start-time=243.628 end-time=257.08 speaker=speaker_3
audio=meeting.wav lna=a_43 start-time=257.08 end-time=259.384 speaker=speaker_2
audio=meeting.wav lna=a_44 start-time=261.096 end-time=293.136 speaker=speaker_2
audio=meeting.wav lna=a_45 start-time=298.96 end-time=301.064 speaker=speaker_2
audio=meeting.wav lna=a_46 start-time=301.064 end-time=304.952 speaker=speaker_2
audio=meeting.wav lna=a_47 start-time=304.952 end-time=306.896 speaker=speaker_2
audio=meeting.wav lna=a_48 start-time=339.76 end-time=357.404 speaker=speaker_4
audio=meeting.wav lna=a_49 start-time=357.404 end-time=360.664 speaker=speaker_1
audio=meeting.wav lna=a_50 start-time=360.664 end-time=365.416 speaker=speaker_4
audio=meeting.wav lna=a_51 start-time=369.728 end-time=370.428 speaker=speaker_4
audio=meeting.wav lna=a_52 start-time=370.428 end-time=382.376 speaker=speaker_4
audio=meeting.wav lna=a_53 start-time=382.376 end-time=390.176 speaker=speaker_5
audio=meeting.wav lna=a_54 start-time=390.176 end-time=414.136 speaker=speaker_4
audio=meeting.wav lna=a_55 start-time=417.936 end-time=448.504 speaker=speaker_4
audio=meeting.wav lna=a_56 start-time=451.032 end-time=465.808 speaker=speaker_4
audio=meeting.wav lna=a_57 start-time=473.504 end-time=487.584 speaker=speaker_4
audio=meeting.wav lna=a_58 start-time=492.048 end-time=493.64 speaker=speaker_4
audio=meeting.wav lna=a_59 start-time=495.992 end-time=499.336 speaker=speaker_4
audio=meeting.wav lna=a_60 start-time=501.68 end-time=525.328 speaker=speaker_4
audio=meeting.wav lna=a_61 start-time=537.92 end-time=545.268 speaker=speaker_4
audio=meeting.wav lna=a_62 start-time=545.268 end-time=549.18 speaker=speaker_5
audio=meeting.wav lna=a_63 start-time=549.18 end-time=549.768 speaker=speaker_2
audio=meeting.wav lna=a_64 start-time=549.768 end-time=565.584 speaker=speaker_4

Not too bad. But what can I do with that? Hmm....

Perhaps if I can just get the labels then I can figure out what to do with them. Here I go:

First, it seems that the aalto scripts only work with WAV so I have to export my audio from M4A to WAV. This is trivial in Audacity.

Then I need to make sure the Docker container can see the folder containing the new WAV file.

docker run -it --mount type=bind,source=/Users/pkorir/Downloads,destination=/data blabbertabber/aalto-speech-diarizer bash

Now let's try and diarise the audio.

[root@f0c9dbb6dfcf speaker-diarization]# ./spk-diarization2.py sprint_review.wav
Reading file: sprint_review.wav
Writing output to: stdout
Using feacat from: /speaker-diarization/feacat
Writing temporal files in: /tmp
Writing lna files in: /speaker-diarization/lna
Writing exp files in: /speaker-diarization/exp
Writing features in: /speaker-diarization/fea
Performing exp generation and feacat concurrently
tokenpass: ./VAD/tokenpass/test_token_pass
Reading recipe: /tmp/initYH6FfW.recipe
Using model: ./hmms/mfcc_16g_11.10.2007_10
Writing `.lna` files in: /speaker-diarization/lna
Writing `.exp` files in: /speaker-diarization/exp
Processing file 1/1
Input: sprint_review.wav
Output: /speaker-diarization/lna/sprint_review.lna
exception: Audio file sample rate (44100 Hz) and model configuration (16000 Hz) don't agree.
Traceback (most recent call last):
 File "./generate_exp.py", line 264, in <module>
   shift_dec_bord(lnas, arguments['--exppath'])
 File "./generate_exp.py", line 181, in shift_dec_bord
   num_models, l = _read_lna(lna)
 File "./generate_exp.py", line 123, in _read_lna
   with open(lna, 'r') as f:
IOError: [Errno 2] No such file or directory: '/speaker-diarization/lna/sprint_review.lna'
Calling voice-detection2.py
Reading recipe from: /tmp/initYH6FfW.recipe
Reading .exp files from: /speaker-diarization/exp
Writing output to: /tmp/vadK0SiCg.recipe
Sample rate set to: 125
Minimum speech turn duration: 0.5 seconds
Minimum nonspeech between-turns duration: 1.5 seconds
Segment before expansion set to: 0.0 seconds
Segment end expansion set to: 0.0 seconds
Error, /speaker-diarization/exp/sprint_review.exp does not exist
Waiting for feacat to end.
^CTraceback (most recent call last):
 File "./spk-diarization2.py", line 116, in <module>
   child2.wait()
 File "/usr/lib64/python2.7/subprocess.py", line 1099, in wait
   pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
 File "/usr/lib64/python2.7/subprocess.py", line 125, in _eintr_retry_call
   return func(*args)
KeyboardInterrupt

Oops! There's a mismatch in the sampling rate, as indicated by the line

exception: Audio file sample rate (44100 Hz) and model configuration (16000 Hz) don't agree.
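
A quick check of the WAV header before starting the container would have caught this earlier. The following is only a sketch: it assumes the export is a plain PCM WAV file (which is what Python's standard wave module can read), and the function names and the EXPECTED_RATE constant are my own, taken from the 16000 Hz quoted in the error message rather than from the aalto scripts.

```python
import wave

# Rate the model expects, as quoted in the error message (an assumption
# for any other model configuration).
EXPECTED_RATE = 16000


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def needs_resampling(path, expected=EXPECTED_RATE):
    """True if the file's sample rate differs from what the model expects."""
    return wav_sample_rate(path) != expected
```

Calling needs_resampling on the exported WAV file should return True for a 44100 Hz export like the one above.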

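So let's go back to Audacity and fix this. It took me a while to figure this out though all the while it was right under my nose: the track has to be resampled so that its rate changes from 44100 Hz to the 16000 Hz the model expects.

[Screenshots of the Audacity resampling steps are not reproduced here.]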
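
For anyone who prefers to skip the Audacity menus, the export and the resampling could probably be done in one step with the ffmpeg command-line tool. This is a sketch, not what I actually ran: it assumes an ffmpeg binary is on the PATH, the helper names are made up, and the only ffmpeg options used are -i (input file) and -ar (output sample rate).

```python
import subprocess


def build_ffmpeg_cmd(src, dst, rate=16000):
    """Build an ffmpeg command that converts src to dst at the given sample rate."""
    return ["ffmpeg", "-i", src, "-ar", str(rate), dst]


def convert(src, dst, rate=16000):
    """Run the conversion; raises CalledProcessError if ffmpeg reports failure."""
    subprocess.run(build_ffmpeg_cmd(src, dst, rate), check=True)
```

Something like convert("sprint_review.m4a", "sprint_review.wav") would then produce a 16000 Hz WAV file (the M4A file name here is a guess at what the original recording would be called).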

Now we try running the script again.

[root@f0c9dbb6dfcf speaker-diarization]# ./spk-diarization2.py sprint_review.wav
Reading file: sprint_review.wav
Writing output to: stdout
Using feacat from: /speaker-diarization/feacat
Writing temporal files in: /tmp
Writing lna files in: /speaker-diarization/lna
Writing exp files in: /speaker-diarization/exp
Writing features in: /speaker-diarization/fea
Performing exp generation and feacat concurrently
tokenpass: ./VAD/tokenpass/test_token_pass
Reading recipe: /tmp/init6YosNd.recipe
Using model: ./hmms/mfcc_16g_11.10.2007_10
Writing `.lna` files in: /speaker-diarization/lna
Writing `.exp` files in: /speaker-diarization/exp
Processing file 1/1
Input: sprint_review.wav
Output: /speaker-diarization/lna/sprint_review.lna
FAN OUT: 0 nodes, 0 arcs
FAN IN: 0 nodes, 0 arcs
Prefix tree: 3 nodes, 6 arcs
WARNING: No tokens in final nodes. The result will be incomplete. Try increasing beam.
Calling voice-detection2.py
Reading recipe from: /tmp/init6YosNd.recipe
Reading .exp files from: /speaker-diarization/exp
Writing output to: /tmp/vad9FZr1b.recipe
Sample rate set to: 125
Minimum speech turn duration: 0.5 seconds
Minimum nonspeech between-turns duration: 1.5 seconds
Segment before expansion set to: 0.0 seconds
Segment end expansion set to: 0.0 seconds
Waiting for feacat to end.
Calling spk-change-detection.py
Reading recipe from: /tmp/vad9FZr1b.recipe
Reading feature files from: /speaker-diarization/fea
Feature files extension: .fea
Writing output to: /tmp/spkcOhyXK4.recipe
Conversion rate set to frame rate: 125.0
Using a growing window
Deltaws set to: 0.096 seconds
Using BIC as distance measure, lambda = 1.0
Window size set to: 1.0 seconds
Window step set to: 3.0 seconds
Threshold distance: 0.0
Useful metrics for determining the right threshold:
---------------------------------------------------
Average between windows distance: -741.884289478
Maximum between windows distance: 2056.233417322154
Minimum between windows distance: -1634.5738624276773
Total windows: 1693
Total segments: 164
Average between detected segments distance: 295.898429423
Maximum between detected segments distance: 2062.7827747870438
Minimum between detected segments distance: 5.3772136141556075
Total detected speaker changes: 40
Calling spk-clustering.py
Reading recipe from: /tmp/spkcOhyXK4.recipe
Reading feature files from: /speaker-diarization/fea
Feature files extension: .fea
Writing output to: stdout
Conversion rate set to frame rate: 125.0
Using hierarchical clustering
Using BIC as distance measure, lambda = 1.3
Threshold distance: 0.0
Maximum speakers: 0
Initial cluster with: 164 speakers
Merging: 108 and 135 distance: -3887.955395546159
Merging: 1 and 108 distance: -3829.558634475854
Merging: 1 and 133 distance: -3786.060190436657
Merging: 1 and 93 distance: -3770.7263608466424
Merging: 79 and 131 distance: -3749.047306462543
Merging: 1 and 92 distance: -3717.2981892853973
Merging: 1 and 113 distance: -3677.9809939843003
Merging: 1 and 91 distance: -3636.012753072265
Merging: 1 and 105 distance: -3628.2522779525566
Merging: 1 and 129 distance: -3581.0688774678083
Merging: 1 and 107 distance: -3575.992436049587
Merging: 1 and 94 distance: -3568.4366839299973
Merging: 1 and 3 distance: -3527.330699268212
Merging: 1 and 90 distance: -3522.014658249364
Merging: 74 and 116 distance: -3493.7359497637935
Merging: 1 and 102 distance: -3465.499077569202
Merging: 1 and 107 distance: -3445.3836108861806
Merging: 1 and 52 distance: -3451.722730361522
Merging: 1 and 33 distance: -3387.876121993725
Merging: 51 and 63 distance: -3384.4959699685132
Merging: 48 and 75 distance: -3354.5669035578567
Merging: 71 and 79 distance: -3349.328440026835
Merging: 71 and 79 distance: -3314.5059704949717
Merging: 71 and 108 distance: -3253.5329670281417
Merging: 44 and 69 distance: -3237.6886447327024
Merging: 44 and 49 distance: -3386.3979517532134
Merging: 44 and 49 distance: -3371.639373859509
Merging: 44 and 55 distance: -3385.7581267712203
Merging: 44 and 53 distance: -3380.7840683752183
Merging: 44 and 63 distance: -3376.7621000717972
Merging: 24 and 44 distance: -3403.54812645695
Merging: 24 and 40 distance: -3436.4345257878567
Merging: 19 and 24 distance: -3476.843817278601
Merging: 3 and 19 distance: -3475.193588642015
Merging: 3 and 55 distance: -3474.0118563581327
Merging: 3 and 42 distance: -3578.643272853399
Merging: 3 and 52 distance: -3548.303482323008
Merging: 3 and 42 distance: -3577.0468024010906
Merging: 3 and 47 distance: -3598.6588382225755
Merging: 3 and 85 distance: -3608.2916045281318
Merging: 3 and 36 distance: -3603.7144836529587
Merging: 3 and 37 distance: -3617.6925098052097
Merging: 3 and 50 distance: -3582.978842707711
Merging: 3 and 51 distance: -3608.363897509921
Merging: 3 and 14 distance: -3616.959015783171
Merging: 3 and 14 distance: -3673.1394593964087
Merging: 3 and 6 distance: -3645.692757898267
Merging: 3 and 21 distance: -3651.831659533852
Merging: 3 and 45 distance: -3635.4377651669965
Merging: 3 and 42 distance: -3625.768902747991
Merging: 3 and 16 distance: -3637.3145431461453
Merging: 3 and 12 distance: -3656.1644745539015
Merging: 3 and 39 distance: -3619.0243746250035
Merging: 3 and 70 distance: -3617.532832619885
Merging: 3 and 31 distance: -3538.1118926415274
Merging: 3 and 24 distance: -3588.4551846544327
Merging: 3 and 23 distance: -3595.3136668354373
Merging: 3 and 22 distance: -3563.4762067514275
Merging: 3 and 34 distance: -3473.1897660250816
Merging: 3 and 66 distance: -3497.4745288209124
Merging: 3 and 51 distance: -3488.191266448631
Merging: 3 and 5 distance: -3435.2971353689245
Merging: 3 and 5 distance: -3412.6258974336815
Merging: 3 and 33 distance: -3421.702443699097
Merging: 3 and 10 distance: -3347.9573937274645
Merging: 3 and 4 distance: -3275.507261738103
Merging: 3 and 28 distance: -3222.640697116016
Merging: 32 and 63 distance: -3220.758971141154
Merging: 79 and 85 distance: -3209.354815380604
Merging: 79 and 81 distance: -3206.8351002940353
Merging: 3 and 6 distance: -3204.3674454124757
Merging: 3 and 43 distance: -3188.818815699572
Merging: 3 and 18 distance: -3148.030529647065
Merging: 3 and 6 distance: -3205.4135632110892
Merging: 3 and 46 distance: -3231.040880173908
Merging: 3 and 24 distance: -3208.884162691228
Merging: 28 and 31 distance: -3133.263655995979
Merging: 28 and 58 distance: -3115.8873861076936
Merging: 28 and 56 distance: -3164.7155330911028
Merging: 70 and 73 distance: -3100.665370054021
Merging: 3 and 12 distance: -3092.46044930492
Merging: 3 and 40 distance: -3087.035097019593
Merging: 3 and 20 distance: -3081.700056095887
Merging: 3 and 24 distance: -3086.8293437925913
Merging: 3 and 11 distance: -3097.1391768792882
Merging: 73 and 78 distance: -3075.370397383048
Merging: 24 and 25 distance: -3047.826910527925
Merging: 24 and 25 distance: -3175.594527744134
Merging: 24 and 34 distance: -3125.426855246352
Merging: 24 and 31 distance: -3143.175391292376
Merging: 1 and 63 distance: -3031.692234416072
Merging: 3 and 41 distance: -3020.1835212450815
Merging: 3 and 35 distance: -3011.633718292628
Merging: 58 and 59 distance: -3001.264566856854
Merging: 1 and 49 distance: -2978.7819505136677
Merging: 3 and 39 distance: -2965.2039816969127
Merging: 55 and 56 distance: -2939.950569232509
Merging: 24 and 44 distance: -2893.0393418751846
Merging: 20 and 21 distance: -2881.333358393781
Merging: 2 and 20 distance: -2891.8227857940656
Merging: 14 and 18 distance: -2880.5349814279652
Merging: 6 and 14 distance: -2944.2111947274816
Merging: 6 and 24 distance: -2991.7338627010076
Merging: 6 and 24 distance: -3058.018780854746
Merging: 49 and 58 distance: -2876.5058351204634
Merging: 49 and 56 distance: -2915.1050006814294
Merging: 13 and 49 distance: -2921.3550738196755
Merging: 6 and 27 distance: -2876.4117349206253
Merging: 19 and 20 distance: -2845.2434796312136
Merging: 19 and 24 distance: -2937.9409080638825
Merging: 2 and 48 distance: -2822.8020023667686
Merging: 6 and 11 distance: -2822.2471799865234
Merging: 4 and 6 distance: -2834.310032658269
Merging: 17 and 19 distance: -2793.0188621542147
Merging: 16 and 17 distance: -2900.7354074650893
Merging: 16 and 22 distance: -2862.6145736734416
Merging: 16 and 22 distance: -2869.23538299506
Merging: 16 and 18 distance: -2850.6370966794266
Merging: 4 and 22 distance: -2781.876035270495
Merging: 3 and 14 distance: -2741.736072613906
Merging: 3 and 30 distance: -2759.259362664974
Merging: 3 and 8 distance: -2757.8665416641434
Merging: 21 and 24 distance: -2719.112465081248
Merging: 14 and 16 distance: -2707.5240582398374
Merging: 14 and 23 distance: -2660.882175893672
Merging: 3 and 9 distance: -2649.7183508639337
Merging: 3 and 7 distance: -2651.3659778753836
Merging: 12 and 13 distance: -2616.9054823202387
Merging: 4 and 5 distance: -2582.178114937249
Merging: 3 and 13 distance: -2554.2970459399385
Merging: 1 and 13 distance: -2548.118699103049
Merging: 1 and 21 distance: -2546.7834587369316
Merging: 1 and 23 distance: -2521.634724829474
Merging: 1 and 6 distance: -2649.776882152113
Merging: 1 and 19 distance: -2603.222960608062
Merging: 1 and 17 distance: -2495.2817102370227
Merging: 1 and 19 distance: -2602.05798529557
Merging: 19 and 20 distance: -2347.927030688841
Merging: 6 and 23 distance: -2228.488638502052
Merging: 6 and 25 distance: -2431.776312676784
Merging: 19 and 21 distance: -2160.2725334151246
Merging: 19 and 20 distance: -2220.9533237121896
Merging: 1 and 12 distance: -2138.357070794871
Merging: 8 and 16 distance: -2113.677056401817
Merging: 17 and 18 distance: -2041.0840088828236
Merging: 12 and 13 distance: -2038.2327352681687
Merging: 11 and 13 distance: -1944.9097104167731
Merging: 1 and 9 distance: -1921.8353706803282
Merging: 14 and 15 distance: -1872.455349311388
Merging: 3 and 5 distance: -1842.473173390661
Merging: 1 and 12 distance: -1663.8068944748202
Merging: 5 and 6 distance: -1236.5990251376343
Merging: 10 and 12 distance: -247.4109391082884
Merging: 4 and 5 distance: -41.04158619899499
Final speakers: 10
Useful metrics for determining the right threshold:
---------------------------------------------------
Maximum between segments distance: 167571.74992590738
Minimum between segments distance: -3887.955395546159
Total segments: 164
Total detected speakers: 10
[root@f0c9dbb6dfcf speaker-diarization]# less stdout

Which looks like so:

~$ head ~/Downloads/sprint_review_speakers.txt
audio=sprint_review.wav lna=a_1 start-time=5.448 end-time=6.504 speaker=speaker_1
audio=sprint_review.wav lna=a_2 start-time=11.352 end-time=14.504 speaker=speaker_2
audio=sprint_review.wav lna=a_3 start-time=16.408 end-time=17.0 speaker=speaker_1
audio=sprint_review.wav lna=a_4 start-time=24.28 end-time=28.768 speaker=speaker_3
audio=sprint_review.wav lna=a_5 start-time=34.528 end-time=45.712 speaker=speaker_3
audio=sprint_review.wav lna=a_6 start-time=49.608 end-time=55.16 speaker=speaker_3
audio=sprint_review.wav lna=a_7 start-time=57.032 end-time=62.144 speaker=speaker_3
audio=sprint_review.wav lna=a_8 start-time=66.272 end-time=73.08 speaker=speaker_3
audio=sprint_review.wav lna=a_9 start-time=75.04 end-time=79.952 speaker=speaker_4
audio=sprint_review.wav lna=a_10 start-time=83.32 end-time=90.732 speaker=speaker_4

It then hit me that if I am able to create and save labels in Audacity then I should certainly be able to import them. The only question is the form of the labels: hopefully Audacity does not require some complex format.

So I fired up Google with the search 'audacity import labels' which turned up this page: https://ttmanual.audacityteam.org/man/Label_Tracks. Right at the bottom was exactly what I was looking for:

Brilliant! Now to work out how to massage the output of the aalto scripts into Audacity labels. For this I whipped up a dirty Python script.

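#!/usr/bin/env python3
"""Convert aalto speaker-diarization output into Audacity label rows."""
import sys

# Seconds to subtract from every timestamp; a positive value shifts labels left.
diff = 0.0

with open(sys.argv[1]) as f:
    for row in f:
        fields = row.split()
        if not fields:
            continue  # skip blank lines
        # Each field is key=value, e.g. start-time=5.448; keep only the value.
        start = float(fields[2].split('=')[-1])
        stop = float(fields[3].split('=')[-1])
        label = fields[-1].split('=')[-1]
        # Audacity expects one label per line: start<TAB>stop<TAB>label
        print(start - diff, stop - diff, label, sep='\t')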

which I ran as follows:

~$ ./sprint_review_labels.py sprint_review_speakers.txt > sprint_review_speaker_labels.txt

whose output now looks like

~$ head sprint_review_speaker_labels.txt
5.448   6.504   speaker_1
11.352   14.504   speaker_2
16.408   17.0   speaker_1
24.28   28.768   speaker_3
34.528   45.712   speaker_3
49.608   55.16   speaker_3
57.032   62.144   speaker_3
66.272   73.08   speaker_3
75.04   79.952   speaker_4
83.32   90.732   speaker_4

I was now in a position to import the labels into Audacity alongside the audio track. This is what the final output looks like:

The labels now meant that I did not have to listen to the full audio and I could simply skip to the change of speaker to get their reactions and the ensuing discussion.
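
Since the transitions were what I cared about, the same label file could also be boiled down to a list of timestamps at which the speaker changes. Here is a rough sketch of that idea; the function names are hypothetical and it assumes the three-column label rows shown above, with labels that contain no spaces.

```python
def read_labels(lines):
    """Parse 'start stop label' rows into (start, stop, label) tuples."""
    labels = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        labels.append((float(parts[0]), float(parts[1]), parts[2]))
    return labels


def speaker_changes(labels):
    """Return (start, label) for each segment whose speaker differs from the previous one."""
    changes = []
    previous = None
    for start, _stop, label in labels:
        if label != previous:
            changes.append((start, label))
            previous = label
    return changes
```

Run on the ten rows shown earlier, this would report changes at 5.448, 11.352, 16.408, 24.28 and 75.04 seconds.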

Reflection

I think this would be an awesome feature to integrate into Audacity. Doing so would let the segmentation take advantage of features Audacity already has, such as resampling and filters, that can improve detection quality. I can certainly see myself having to do this again and I'm sure there are numerous other scenarios in which others would need to do some quick scanning.

In this application I was not super-interested in accuracy: a quick heuristic was sufficient to save me tonnes of time. Granted, discovering the whole process above took me a couple of hours, but I think it was time well spent if in future it takes me at most 20 minutes to segment the audio. I noticed that whenever several speakers spoke at once the algorithm would get confused. The assignment of labels to speakers was not always correct but what really mattered for me was the transition from speaker to speaker rather than the identity of each speaker.

If you need something more accurate you will need to ensure that the recording is of sufficient quality and perhaps do some post-processing such as minimising echoes. Using better quality mics is also necessary.

When I first imported the labels I suspected that they were offset by some value. This is why the script has a diff variable: by setting a small positive value to diff (<5.448) I was able to shift the labels to the left and see if they matched better with the audio. However, after examining the sync it occurred to me that the beginning might not have been the best quality. Further down the line the labels were very reliable.

One little quirk is that the labels are defined in such a way that the total labeled length is less than the total running length, meaning there are unlabeled intervals where there is silence. This can be annoying at times because it means that consecutive stretches from the same speaker end up split into several labels. It would be nice to merge them to the minimum number of segments (hence the --merge-similar argument below).
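
The merging itself is not hard to sketch. The function below is my own guess at what such a step could look like, not something the aalto scripts provide: it takes (start, stop, label) tuples in time order and joins neighbours that share a label. The optional max_gap parameter is an extra assumption that limits merging to segments separated by at most that many seconds of silence.

```python
def merge_similar(labels, max_gap=None):
    """Merge consecutive (start, stop, label) segments that share a label.

    If max_gap is given, two segments are only merged when the silence
    between them is at most max_gap seconds.
    """
    merged = []
    for start, stop, label in labels:
        same_speaker = bool(merged) and merged[-1][2] == label
        close_enough = same_speaker and (
            max_gap is None or start - merged[-1][1] <= max_gap
        )
        if close_enough:
            prev_start, prev_stop, _ = merged[-1]
            # Extend the previous segment to cover this one.
            merged[-1] = (prev_start, max(prev_stop, stop), label)
        else:
            merged.append((start, stop, label))
    return merged
```

With the default max_gap=None, the five speaker_3 rows in the sample output above would collapse into a single label running from 24.28 to 73.08 seconds.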

Ideal Usage

Here's how I would really like to process the data. Personally, I prefer working on the command line. I envision a single command

~$ some_audio_cmd segment --merge-similar file.m4a -o file_labels.txt

That's it!

Note that the intermediate conversion into WAV would be automatic: the user never even needs to know it happened.
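
To make the wish a little more concrete, here is how the command-line interface alone might be declared with Python's argparse. Everything here is hypothetical: some_audio_cmd does not exist, the --output long form is my addition, and the sketch only parses arguments without doing any segmentation.

```python
import argparse


def build_parser():
    """Declare the imagined 'some_audio_cmd segment' interface."""
    parser = argparse.ArgumentParser(prog="some_audio_cmd")
    subparsers = parser.add_subparsers(dest="command", required=True)

    segment = subparsers.add_parser("segment", help="label an audio file by speaker")
    segment.add_argument("audio", help="input audio file, e.g. file.m4a")
    segment.add_argument(
        "--merge-similar",
        action="store_true",
        help="merge consecutive segments from the same speaker",
    )
    segment.add_argument(
        "-o", "--output",
        default=None,
        help="file to write Audacity labels to (default: standard output)",
    )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```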

If this were to work from Audacity there would be a filter which simply says 'Segment audio...', with a few default options, and which produces a label track.

I hope that someone finds this useful. If I had a lot of time on my hands (which I definitely don't have) I would repackage the aalto scripts into a more user-friendly package as well as make it more robust against audio sampling frequency.