Automatic Segmentation of Audio by Speaker

Posted 4 years, 7 months ago | Originally written on 10 Mar 2020

Problem

I recently attended a meeting in which I had to write notes from the discussions. I was feeling lazy so I decided that I would record the audio on my phone then make notes from that. That part worked really well until I sat down to make the notes.

The meeting was structured such that there were several speakers each of whom would be interrupted with questions or comments. It was the comments that I was after. Overall the meeting lasted just under an hour with interruptions occurring at any point.

I started by playing the audio directly off my phone and taking notes. It didn't take long to realise that I couldn't keep up and that I had very little control over playback. I wanted to be able to move forward and back precisely but that wasn't possible on the phone. Furthermore, I didn't want to listen to the whole thing again. I then began to wonder if there was a simple way to segment the audio and skip straight to the start of each new segment...

Surely someone has integrated this into a popular tool.

Development

I've used Audacity at church to record sermons but I've never really played with it. It has a respectable number of filters and effects and I had a hunch that it would have a way to do some sort of automatic segmentation.

Unfortunately it didn't. It does, however, have a facility to label points or portions of audio. There was no way I was going to listen to the whole piece just to add labels - that would take an awful chunk of time. What I needed was an automatic way to add labels.

My first search was for 'audio segmentation', which turned up a bunch of research articles on various approaches to doing so. Clearly I was not going to spend hours implementing these (far more exciting than rote listening though that would be). I wanted a complete solution.

I then searched for 'segment audio by speaker'. My first result was the Data Science Stack Exchange question 'python - Audio Analysis : Segment audio based on speaker recognition', which had two links in it. The first one (https://github.com/tyiannak/pyAudioAnalysis/wiki/5.-Segmentation) was more descriptive and less practical, while the second one (https://github.com/aalto-speech/speaker-diarization) was a repository containing a collection of Python scripts that could do various automatic tasks.

I was not yet convinced that I wanted to learn how to use, let alone run, a bunch of unknown scripts. I still hoped to get a quick and dirty solution, especially one which involved click-click-click (yeah, I know!).

I next tried to search for 'audacity audio segmentation' which had results on how to 'split' audio and how to 'label' audio but not how to perform automatic segmentation by speaker.

By this point I figured that I would have to use Audacity anyway so I downloaded and installed it. Since the audio was recorded in M4A (AAC) format and Audacity doesn't ship with the codec for it pre-installed, I had to download and install the ffmpeg library, which supports it.

After toying around for a bit it became clear that I would have to invest the time in getting the speaker-diarization scripts to work. The fastest way would involve running the scripts from a Docker container (as outlined in the helpful README). So off I went in search of Docker...

Once this was set up I tried out the example.

~$ docker run -it blabbertabber/aalto-speech-diarizer bash
[root@3d30f04ba1d7 /]# cd /speaker-diarization
[root@3d30f04ba1d7 speaker-diarization]# curl -k -OL https://nono.io/meeting.wav
 % Total   % Received % Xferd Average Speed  Time   Time    Time Current
                                Dload Upload  Total  Spent   Left Speed
100 17.2M 100 17.2M   0    0 1661k     0 0:00:10 0:00:10 --:--:-- 1882k
[root@3d30f04ba1d7 speaker-diarization]# ./spk-diarization2.py meeting.wav
Reading file: meeting.wav
Writing output to: stdout
Using feacat from: /speaker-diarization/feacat
Writing temporal files in: /tmp
Writing lna files in: /speaker-diarization/lna
Writing exp files in: /speaker-diarization/exp
Writing features in: /speaker-diarization/fea
Performing exp generation and feacat concurrently
tokenpass: ./VAD/tokenpass/test_token_pass
Reading recipe: /tmp/init91Tj5U.recipe
Using model: ./hmms/mfcc_16g_11.10.2007_10
Writing `.lna` files in: /speaker-diarization/lna
Writing `.exp` files in: /speaker-diarization/exp
Processing file 1/1
Input: meeting.wav
Output: /speaker-diarization/lna/meeting.lna
FAN OUT: 0 nodes, 0 arcs
FAN IN: 0 nodes, 0 arcs
Prefix tree: 3 nodes, 6 arcs
WARNING: No tokens in final nodes. The result will be incomplete. Try increasing beam.
Calling voice-detection2.py
Reading recipe from: /tmp/init91Tj5U.recipe
Reading .exp files from: /speaker-diarization/exp
Writing output to: /tmp/vadkNIKcZ.recipe
Sample rate set to: 125
Minimum speech turn duration: 0.5 seconds
Minimum nonspeech between-turns duration: 1.5 seconds
Segment before expansion set to: 0.0 seconds
Segment end expansion set to: 0.0 seconds
Waiting for feacat to end.
Calling spk-change-detection.py
Reading recipe from: /tmp/vadkNIKcZ.recipe
Reading feature files from: /speaker-diarization/fea
Feature files extension: .fea
Writing output to: /tmp/spkcKQXuDX.recipe
Conversion rate set to frame rate: 125.0
Using a growing window
Deltaws set to: 0.096 seconds
Using BIC as distance measure, lambda = 1.0
Window size set to: 1.0 seconds
Window step set to: 3.0 seconds
Threshold distance: 0.0
Useful metrics for determining the right threshold:
---------------------------------------------------
Average between windows distance: -370.524559348
Maximum between windows distance: 2039.1026531722064
Minimum between windows distance: -1222.9105555792084
Total windows: 346
Total segments: 64
Average between detected segments distance: 327.139634263
Maximum between detected segments distance: 2043.4163627520657
Minimum between detected segments distance: 11.18226920088864
Total detected speaker changes: 41
Calling spk-clustering.py
Reading recipe from: /tmp/spkcKQXuDX.recipe
Reading feature files from: /speaker-diarization/fea
Feature files extension: .fea
Writing output to: stdout
Conversion rate set to frame rate: 125.0
Using hierarchical clustering
Using BIC as distance measure, lambda = 1.3
Threshold distance: 0.0
Maximum speakers: 0
Initial cluster with: 64 speakers
Merging: 38 and 44 distance: -2921.769470541185
Merging: 38 and 40 distance: -2951.2135406549614
Merging: 38 and 43 distance: -2871.713460678535
Merging: 38 and 44 distance: -2917.2587434102315
Merging: 51 and 53 distance: -2871.0543575817246
Merging: 51 and 54 distance: -2940.8351154446605
Merging: 28 and 38 distance: -2852.6957536639875
Merging: 50 and 51 distance: -2850.9433078229094
Merging: 28 and 39 distance: -2759.0140977398596
Merging: 49 and 52 distance: -2695.0033931089265
Merging: 44 and 49 distance: -2756.0663835379573
Merging: 44 and 49 distance: -2710.348483406712
Merging: 28 and 36 distance: -2667.4482367885075
Merging: 28 and 39 distance: -2660.3583427837257
Merging: 20 and 28 distance: -2657.115016038645
Merging: 20 and 35 distance: -2715.1799624734203
Merging: 18 and 20 distance: -2710.654666705846
Merging: 17 and 18 distance: -2684.771867092685
Merging: 17 and 19 distance: -2617.350598439542
Merging: 15 and 17 distance: -2620.851348572456
Merging: 15 and 28 distance: -2607.3123879864024
Merging: 18 and 24 distance: -2471.3269385791946
Merging: 1 and 4 distance: -2433.1455909440465
Merging: 10 and 14 distance: -2358.085004075313
Merging: 16 and 25 distance: -2350.1272441177916
Merging: 16 and 21 distance: -2387.941421214947
Merging: 16 and 19 distance: -2393.6868377148476
Merging: 16 and 18 distance: -2417.338076661149
Merging: 20 and 28 distance: -2339.7809645759626
Merging: 3 and 22 distance: -2321.1849558166023
Merging: 28 and 30 distance: -2288.3740035107658
Merging: 10 and 12 distance: -2284.9976499143118
Merging: 7 and 10 distance: -2266.637970303841
Merging: 18 and 25 distance: -2181.9640388075186
Merging: 7 and 29 distance: -2173.480848412663
Merging: 7 and 9 distance: -2130.329377092388
Merging: 4 and 7 distance: -2130.82077204709
Merging: 4 and 15 distance: -2110.6807496186516
Merging: 4 and 14 distance: -2115.528322770778
Merging: 4 and 16 distance: -2171.816660672457
Merging: 4 and 6 distance: -2114.09627666888
Merging: 3 and 8 distance: -1928.8540169733815
Merging: 3 and 7 distance: -2005.493368380382
Merging: 2 and 4 distance: -1928.8040567328771
Merging: 3 and 15 distance: -1922.8509757421589
Merging: 17 and 19 distance: -1843.9285212133682
Merging: 14 and 15 distance: -1815.3507925949698
Merging: 2 and 9 distance: -1797.8386215631135
Merging: 2 and 4 distance: -1908.8801149227675
Merging: 2 and 5 distance: -1845.968505681797
Merging: 6 and 9 distance: -1655.75556196682
Merging: 2 and 4 distance: -1625.9641887100834
Merging: 2 and 7 distance: -1404.1768198308237
Merging: 2 and 7 distance: -1317.0797306828836
Merging: 3 and 4 distance: -1299.294455103457
Merging: 7 and 9 distance: -1179.8781203555072
Merging: 5 and 8 distance: -1144.8095559456488
Merging: 1 and 3 distance: -741.7540523122689
Merging: 4 and 5 distance: -618.7548791396439
Final speakers: 5
Useful metrics for determining the right threshold:
---------------------------------------------------
Maximum between segments distance: 21370.577321699926
Minimum between segments distance: -2951.2135406549614
Total segments: 64
Total detected speakers: 5

And what does the output look like?

[root@3d30f04ba1d7 speaker-diarization]# cat stdout
audio=meeting.wav lna=a_1 start-time=0.384 end-time=5.82 speaker=speaker_1
audio=meeting.wav lna=a_2 start-time=5.82 end-time=31.648 speaker=speaker_2
audio=meeting.wav lna=a_3 start-time=31.648 end-time=58.272 speaker=speaker_1
audio=meeting.wav lna=a_4 start-time=60.032 end-time=66.536 speaker=speaker_1
audio=meeting.wav lna=a_5 start-time=66.536 end-time=68.748 speaker=speaker_2
audio=meeting.wav lna=a_6 start-time=68.748 end-time=70.576 speaker=speaker_2
audio=meeting.wav lna=a_7 start-time=70.576 end-time=78.264 speaker=speaker_2
audio=meeting.wav lna=a_8 start-time=79.84 end-time=80.248 speaker=speaker_2
audio=meeting.wav lna=a_9 start-time=80.248 end-time=82.792 speaker=speaker_2
audio=meeting.wav lna=a_10 start-time=82.792 end-time=83.372 speaker=speaker_2
audio=meeting.wav lna=a_11 start-time=83.372 end-time=88.96 speaker=speaker_2
audio=meeting.wav lna=a_12 start-time=88.96 end-time=93.288 speaker=speaker_1
audio=meeting.wav lna=a_13 start-time=93.288 end-time=93.9 speaker=speaker_2
audio=meeting.wav lna=a_14 start-time=93.9 end-time=96.436 speaker=speaker_1
audio=meeting.wav lna=a_15 start-time=96.436 end-time=98.436 speaker=speaker_2
audio=meeting.wav lna=a_16 start-time=98.436 end-time=102.736 speaker=speaker_2
audio=meeting.wav lna=a_17 start-time=102.736 end-time=103.284 speaker=speaker_2
audio=meeting.wav lna=a_18 start-time=103.284 end-time=103.888 speaker=speaker_2
audio=meeting.wav lna=a_19 start-time=103.888 end-time=110.156 speaker=speaker_1
audio=meeting.wav lna=a_20 start-time=110.156 end-time=114.2 speaker=speaker_2
audio=meeting.wav lna=a_21 start-time=119.936 end-time=124.256 speaker=speaker_2
audio=meeting.wav lna=a_22 start-time=124.256 end-time=126.512 speaker=speaker_3
audio=meeting.wav lna=a_23 start-time=126.512 end-time=140.956 speaker=speaker_2
audio=meeting.wav lna=a_24 start-time=140.956 end-time=143.256 speaker=speaker_3
audio=meeting.wav lna=a_25 start-time=148.76 end-time=152.472 speaker=speaker_3
audio=meeting.wav lna=a_26 start-time=157.208 end-time=166.98 speaker=speaker_2
audio=meeting.wav lna=a_27 start-time=166.98 end-time=171.5 speaker=speaker_3
audio=meeting.wav lna=a_28 start-time=171.5 end-time=173.588 speaker=speaker_2
audio=meeting.wav lna=a_29 start-time=173.588 end-time=190.016 speaker=speaker_3
audio=meeting.wav lna=a_30 start-time=190.016 end-time=193.208 speaker=speaker_2
audio=meeting.wav lna=a_31 start-time=195.176 end-time=195.88 speaker=speaker_4
audio=meeting.wav lna=a_32 start-time=195.88 end-time=199.672 speaker=speaker_2
audio=meeting.wav lna=a_33 start-time=201.888 end-time=203.436 speaker=speaker_2
audio=meeting.wav lna=a_34 start-time=203.436 end-time=209.304 speaker=speaker_3
audio=meeting.wav lna=a_35 start-time=210.912 end-time=212.88 speaker=speaker_1
audio=meeting.wav lna=a_36 start-time=215.256 end-time=216.708 speaker=speaker_2
audio=meeting.wav lna=a_37 start-time=216.708 end-time=218.912 speaker=speaker_2
audio=meeting.wav lna=a_38 start-time=224.424 end-time=226.968 speaker=speaker_2
audio=meeting.wav lna=a_39 start-time=226.968 end-time=227.448 speaker=speaker_2
audio=meeting.wav lna=a_40 start-time=227.448 end-time=240.544 speaker=speaker_2
audio=meeting.wav lna=a_41 start-time=242.92 end-time=243.628 speaker=speaker_2
audio=meeting.wav lna=a_42 start-time=243.628 end-time=257.08 speaker=speaker_3
audio=meeting.wav lna=a_43 start-time=257.08 end-time=259.384 speaker=speaker_2
audio=meeting.wav lna=a_44 start-time=261.096 end-time=293.136 speaker=speaker_2
audio=meeting.wav lna=a_45 start-time=298.96 end-time=301.064 speaker=speaker_2
audio=meeting.wav lna=a_46 start-time=301.064 end-time=304.952 speaker=speaker_2
audio=meeting.wav lna=a_47 start-time=304.952 end-time=306.896 speaker=speaker_2
audio=meeting.wav lna=a_48 start-time=339.76 end-time=357.404 speaker=speaker_4
audio=meeting.wav lna=a_49 start-time=357.404 end-time=360.664 speaker=speaker_1
audio=meeting.wav lna=a_50 start-time=360.664 end-time=365.416 speaker=speaker_4
audio=meeting.wav lna=a_51 start-time=369.728 end-time=370.428 speaker=speaker_4
audio=meeting.wav lna=a_52 start-time=370.428 end-time=382.376 speaker=speaker_4
audio=meeting.wav lna=a_53 start-time=382.376 end-time=390.176 speaker=speaker_5
audio=meeting.wav lna=a_54 start-time=390.176 end-time=414.136 speaker=speaker_4
audio=meeting.wav lna=a_55 start-time=417.936 end-time=448.504 speaker=speaker_4
audio=meeting.wav lna=a_56 start-time=451.032 end-time=465.808 speaker=speaker_4
audio=meeting.wav lna=a_57 start-time=473.504 end-time=487.584 speaker=speaker_4
audio=meeting.wav lna=a_58 start-time=492.048 end-time=493.64 speaker=speaker_4
audio=meeting.wav lna=a_59 start-time=495.992 end-time=499.336 speaker=speaker_4
audio=meeting.wav lna=a_60 start-time=501.68 end-time=525.328 speaker=speaker_4
audio=meeting.wav lna=a_61 start-time=537.92 end-time=545.268 speaker=speaker_4
audio=meeting.wav lna=a_62 start-time=545.268 end-time=549.18 speaker=speaker_5
audio=meeting.wav lna=a_63 start-time=549.18 end-time=549.768 speaker=speaker_2
audio=meeting.wav lna=a_64 start-time=549.768 end-time=565.584 speaker=speaker_4

Not too bad. But what can I do with that? Hmm....

Perhaps if I can just get the labels then I can figure out what to do with them. Here I go:
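
A minimal sketch of what 'getting the labels' might look like in Python 3, assuming Audacity's label file format of one tab-separated start time, end time and label per line. The function name to_audacity_labels and the regular expression are my own illustration and are not part of the aalto scripts.

```python
import re

# Matches the fields of a diarizer output line such as:
# audio=meeting.wav lna=a_1 start-time=0.384 end-time=5.82 speaker=speaker_1
SEGMENT_RE = re.compile(
    r"start-time=(?P<start>[\d.]+)\s+"
    r"end-time=(?P<end>[\d.]+)\s+"
    r"speaker=(?P<speaker>\S+)"
)


def to_audacity_labels(diarization_text):
    """Convert diarizer output text to Audacity label-track text.

    Hypothetical helper: each segment becomes one line of
    start<TAB>end<TAB>speaker, with times in seconds.
    """
    rows = []
    for line in diarization_text.splitlines():
        match = SEGMENT_RE.search(line)
        if match is None:
            continue  # skip blank lines and anything that is not a segment
        start = float(match.group("start"))
        end = float(match.group("end"))
        rows.append("{:.3f}\t{:.3f}\t{}".format(start, end, match.group("speaker")))
    return "\n".join(rows) + ("\n" if rows else "")


if __name__ == "__main__":
    sample = (
        "audio=meeting.wav lna=a_1 start-time=0.384 end-time=5.82 speaker=speaker_1\n"
        "audio=meeting.wav lna=a_2 start-time=5.82 end-time=31.648 speaker=speaker_2\n"
    )
    print(to_audacity_labels(sample), end="")
```

The resulting text could be saved to a file and loaded into Audacity through its label import, which would give one label per detected speaker turn.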

First, it seems that the aalto scripts only work with WAV so I have to export my audio from M4A to WAV. This is trivial in Audacity.

Then I need to make sure the Docker container can see the folder containing the new WAV file, which means bind-mounting that folder when starting the container:

docker run -it --mount type=bind,source=/Users/pkorir/Downloads,destination=/data blabbertabber/aalto-speech-diarizer bash

Now let's try and diarise the audio.

[root@f0c9dbb6dfcf speaker-diarization]# ./spk-diarization2.py sprint_review.wav
Reading file: sprint_review.wav
Writing output to: stdout
Using feacat from: /speaker-diarization/feacat
Writing temporal files in: /tmp
Writing lna files in: /speaker-diarization/lna
Writing exp files in: /speaker-diarization/exp
Writing features in: /speaker-diarization/fea
Performing exp generation and feacat concurrently
tokenpass: ./VAD/tokenpass/test_token_pass
Reading recipe: /tmp/initYH6FfW.recipe
Using model: ./hmms/mfcc_16g_11.10.2007_10
Writing `.lna` files in: /speaker-diarization/lna
Writing `.exp` files in: /speaker-diarization/exp
Processing file 1/1
Input: sprint_review.wav
Output: /speaker-diarization/lna/sprint_review.lna
exception: Audio file sample rate (44100 Hz) and model configuration (16000 Hz) don't agree.
Traceback (most recent call last):
 File "./generate_exp.py", line 264, in <module>
   shift_dec_bord(lnas, arguments['--exppath'])
 File "./generate_exp.py", line 181, in shift_dec_bord
   num_models, l = _read_lna(lna)
 File "./generate_exp.py", line 123, in _read_lna
   with open(lna, 'r') as f:
IOError: [Errno 2] No such file or directory: '/speaker-diarization/lna/sprint_review.lna'
Calling voice-detection2.py
Reading recipe from: /tmp/initYH6FfW.recipe
Reading .exp files from: /speaker-diarization/exp
Writing output to: /tmp/vadK0SiCg.recipe
Sample rate set to: 125
Minimum speech turn duration: 0.5 seconds
Minimum nonspeech between-turns duration: 1.5 seconds
Segment before expansion set to: 0.0 seconds
Segment end expansion set to: 0.0 seconds
Error, /speaker-diarization/exp/sprint_review.exp does not exist
Waiting for feacat to end.
^CTraceback (most recent call last):
 File "./spk-diarization2.py", line 116, in <module>
   child2.wait()
 File "/usr/lib64/python2.7/subprocess.py", line 1099, in wait
   pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
 File "/usr/lib64/python2.7/subprocess.py", line 125, in _eintr_retry_call
   return func(*args)
KeyboardInterrupt

Oops! There's a mismatch in the sampling rate, as indicated by the line

exception: Audio file sample rate (44100 Hz) and model configuration (16000 Hz) don't agree.
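
A check like the following could catch the mismatch before starting the diarizer. It is only a sketch using Python 3's standard wave module; the 16000 Hz value is taken from the error message above, and the function names are hypothetical.

```python
import wave

# Sample rate the model expects, according to the error message (assumption:
# the same model configuration is used on every run).
EXPECTED_RATE_HZ = 16000


def wav_sample_rate(path):
    """Return the sample rate, in Hz, stored in the header of a WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def matches_model_rate(path, expected_hz=EXPECTED_RATE_HZ):
    """Return True if the WAV file's sample rate equals the expected rate."""
    return wav_sample_rate(path) == expected_hz
```

For the file in the log above, wav_sample_rate would be expected to report 44100, so matches_model_rate would return False and the audio would need resampling first.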

So let's go back to Audacity and fix this. It took me a while to figure this out though all the while it was right under my nose:

then