I recently attended a meeting in which I had to write notes from the discussions. I was feeling lazy so I decided that I would record the audio on my phone then make notes from that. That part worked really well until I sat down to make the notes.
The meeting was structured such that there were several speakers, each of whom would be interrupted with questions or comments. It was the comments that I was after. Overall the meeting lasted just under an hour, with interruptions occurring at any point.
I started by playing the audio directly off my phone and taking notes. It didn't take long to realise that I couldn't keep up and that I had very little control over playback. I wanted to be able to move forward and back precisely, but that wasn't possible on the phone. Furthermore, I didn't want to listen to the whole thing again. I then began to wonder if there was a simple way to segment the audio and then skip straight to the start of each new segment...
Surely someone has integrated this into a popular tool.
I've used Audacity at church to record sermons but I've never really played with it. It has a respectable number of filters and effects and I had a hunch that it would have a way to do some sort of automatic segmentation.
Unfortunately it didn't. But it does have a facility to label points or portions of audio. There was no way I was going to listen to the whole piece just to add labels - that would take an awful chunk of time. What I needed was an automatic way to add labels.
My first search was for 'audio segmentation' which turned up a bunch of research articles on various approaches to doing so. Clearly I was not going to spend hours (far more exciting than rote listening) implementing these. I wanted a complete solution.
I then searched for 'segment audio by speaker'. The first result was the Data Science Stack Exchange question 'python - Audio Analysis : Segment audio based on speaker recognition', which had two links in it. The first (https://github.com/tyiannak/pyAudioAnalysis/wiki/5.-Segmentation) was more descriptive and less practical, while the second (https://github.com/aalto-speech/speaker-diarization) was a repository containing a collection of Python scripts that could do various automatic tasks.
I was not yet convinced that I wanted to learn how to run, let alone actually run, a bunch of unknown scripts. I still hoped to get a quick and dirty solution, especially one which involved click-click-click (yeah, I know!).
I next tried to search for 'audacity audio segmentation' which had results on how to 'split' audio and how to 'label' audio but not how to perform automatic segmentation by speaker.
By this point I figured that I would have to use Audacity anyway, so I downloaded and installed it. Since the audio was recorded in M4A (AAC) format and Audacity doesn't ship with the codec for it pre-installed, I had to download and install the FFmpeg library, which supports it.
After toying around for a bit it became clear that I would have to invest the time in getting the speaker-diarization scripts to work. The fastest way would involve running the scripts from a Docker container (as outlined in the helpful README). So off I went in search of Docker...
Once this was set up I tried out the example.
~$ docker run -it blabbertabber/aalto-speech-diarizer bash
[[email protected] /]# cd /speaker-diarization
[[email protected] speaker-diarization]# curl -k -OL https://nono.io/meeting.wav
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17.2M  100 17.2M    0     0  1661k      0  0:00:10  0:00:10 --:--:-- 1882k
[[email protected] speaker-diarization]# ./spk-diarization2.py meeting.wav
Reading file: meeting.wav
Writing output to: stdout
Using feacat from: /speaker-diarization/feacat
Writing temporal files in: /tmp
Writing lna files in: /speaker-diarization/lna
Writing exp files in: /speaker-diarization/exp
Writing features in: /speaker-diarization/fea
Performing exp generation and feacat concurrently
tokenpass: ./VAD/tokenpass/test_token_pass
Reading recipe: /tmp/init91Tj5U.recipe
Using model: ./hmms/mfcc_16g_11.10.2007_10
Writing `.lna` files in: /speaker-diarization/lna
Writing `.exp` files in: /speaker-diarization/exp
Processing file 1/1
Input: meeting.wav
Output: /speaker-diarization/lna/meeting.lna
FAN OUT: 0 nodes, 0 arcs
FAN IN: 0 nodes, 0 arcs
Prefix tree: 3 nodes, 6 arcs
WARNING: No tokens in final nodes. The result will be incomplete. Try increasing beam.
Calling voice-detection2.py
Reading recipe from: /tmp/init91Tj5U.recipe
Reading .exp files from: /speaker-diarization/exp
Writing output to: /tmp/vadkNIKcZ.recipe
Sample rate set to: 125
Minimum speech turn duration: 0.5 seconds
Minimum nonspeech between-turns duration: 1.5 seconds
Segment before expansion set to: 0.0 seconds
Segment end expansion set to: 0.0 seconds
Waiting for feacat to end.
Calling spk-change-detection.py
Reading recipe from: /tmp/vadkNIKcZ.recipe
Reading feature files from: /speaker-diarization/fea
Feature files extension: .fea
Writing output to: /tmp/spkcKQXuDX.recipe
Conversion rate set to frame rate: 125.0
Using a growing window
Deltaws set to: 0.096 seconds
Using BIC as distance measure, lambda = 1.0
Window size set to: 1.0 seconds
Window step set to: 3.0 seconds
Threshold distance: 0.0
Useful metrics for determining the right threshold:
---------------------------------------------------
Average between windows distance: -370.524559348
Maximum between windows distance: 2039.1026531722064
Minimum between windows distance: -1222.9105555792084
Total windows: 346
Total segments: 64
Average between detected segments distance: 327.139634263
Maximum between detected segments distance: 2043.4163627520657
Minimum between detected segments distance: 11.18226920088864
Total detected speaker changes: 41
Calling spk-clustering.py
Reading recipe from: /tmp/spkcKQXuDX.recipe
Reading feature files from: /speaker-diarization/fea
Feature files extension: .fea
Writing output to: stdout
Conversion rate set to frame rate: 125.0
Using hierarchical clustering
Using BIC as distance measure, lambda = 1.3
Threshold distance: 0.0
Maximum speakers: 0
Initial cluster with: 64 speakers
Merging: 38 and 44 distance: -2921.769470541185
Merging: 38 and 40 distance: -2951.2135406549614
Merging: 38 and 43 distance: -2871.713460678535
Merging: 38 and 44 distance: -2917.2587434102315
Merging: 51 and 53 distance: -2871.0543575817246
Merging: 51 and 54 distance: -2940.8351154446605
Merging: 28 and 38 distance: -2852.6957536639875
Merging: 50 and 51 distance: -2850.9433078229094
Merging: 28 and 39 distance: -2759.0140977398596
Merging: 49 and 52 distance: -2695.0033931089265
Merging: 44 and 49 distance: -2756.0663835379573
Merging: 44 and 49 distance: -2710.348483406712
Merging: 28 and 36 distance: -2667.4482367885075
Merging: 28 and 39 distance: -2660.3583427837257
Merging: 20 and 28 distance: -2657.115016038645
Merging: 20 and 35 distance: -2715.1799624734203
Merging: 18 and 20 distance: -2710.654666705846
Merging: 17 and 18 distance: -2684.771867092685
Merging: 17 and 19 distance: -2617.350598439542
Merging: 15 and 17 distance: -2620.851348572456
Merging: 15 and 28 distance: -2607.3123879864024
Merging: 18 and 24 distance: -2471.3269385791946
Merging: 1 and 4 distance: -2433.1455909440465
Merging: 10 and 14 distance: -2358.085004075313
Merging: 16 and 25 distance: -2350.1272441177916
Merging: 16 and 21 distance: -2387.941421214947
Merging: 16 and 19 distance: -2393.6868377148476
Merging: 16 and 18 distance: -2417.338076661149
Merging: 20 and 28 distance: -2339.7809645759626
Merging: 3 and 22 distance: -2321.1849558166023
Merging: 28 and 30 distance: -2288.3740035107658
Merging: 10 and 12 distance: -2284.9976499143118
Merging: 7 and 10 distance: -2266.637970303841
Merging: 18 and 25 distance: -2181.9640388075186
Merging: 7 and 29 distance: -2173.480848412663
Merging: 7 and 9 distance: -2130.329377092388
Merging: 4 and 7 distance: -2130.82077204709
Merging: 4 and 15 distance: -2110.6807496186516
Merging: 4 and 14 distance: -2115.528322770778
Merging: 4 and 16 distance: -2171.816660672457
Merging: 4 and 6 distance: -2114.09627666888
Merging: 3 and 8 distance: -1928.8540169733815
Merging: 3 and 7 distance: -2005.493368380382
Merging: 2 and 4 distance: -1928.8040567328771
Merging: 3 and 15 distance: -1922.8509757421589
Merging: 17 and 19 distance: -1843.9285212133682
Merging: 14 and 15 distance: -1815.3507925949698
Merging: 2 and 9 distance: -1797.8386215631135
Merging: 2 and 4 distance: -1908.8801149227675
Merging: 2 and 5 distance: -1845.968505681797
Merging: 6 and 9 distance: -1655.75556196682
Merging: 2 and 4 distance: -1625.9641887100834
Merging: 2 and 7 distance: -1404.1768198308237
Merging: 2 and 7 distance: -1317.0797306828836
Merging: 3 and 4 distance: -1299.294455103457
Merging: 7 and 9 distance: -1179.8781203555072
Merging: 5 and 8 distance: -1144.8095559456488
Merging: 1 and 3 distance: -741.7540523122689
Merging: 4 and 5 distance: -618.7548791396439
Final speakers: 5
Useful metrics for determining the right threshold:
---------------------------------------------------
Maximum between segments distance: 21370.577321699926
Minimum between segments distance: -2951.2135406549614
Total segments: 64
Total detected speakers: 5
And what does the output look like?
[[email protected] speaker-diarization]# cat stdout
audio=meeting.wav lna=a_1 start-time=0.384 end-time=5.82 speaker=speaker_1
audio=meeting.wav lna=a_2 start-time=5.82 end-time=31.648 speaker=speaker_2
audio=meeting.wav lna=a_3 start-time=31.648 end-time=58.272 speaker=speaker_1
audio=meeting.wav lna=a_4 start-time=60.032 end-time=66.536 speaker=speaker_1
audio=meeting.wav lna=a_5 start-time=66.536 end-time=68.748 speaker=speaker_2
audio=meeting.wav lna=a_6 start-time=68.748 end-time=70.576 speaker=speaker_2
audio=meeting.wav lna=a_7 start-time=70.576 end-time=78.264 speaker=speaker_2
audio=meeting.wav lna=a_8 start-time=79.84 end-time=80.248 speaker=speaker_2
audio=meeting.wav lna=a_9 start-time=80.248 end-time=82.792 speaker=speaker_2
audio=meeting.wav lna=a_10 start-time=82.792 end-time=83.372 speaker=speaker_2
audio=meeting.wav lna=a_11 start-time=83.372 end-time=88.96 speaker=speaker_2
audio=meeting.wav lna=a_12 start-time=88.96 end-time=93.288 speaker=speaker_1
audio=meeting.wav lna=a_13 start-time=93.288 end-time=93.9 speaker=speaker_2
audio=meeting.wav lna=a_14 start-time=93.9 end-time=96.436 speaker=speaker_1
audio=meeting.wav lna=a_15 start-time=96.436 end-time=98.436 speaker=speaker_2
audio=meeting.wav lna=a_16 start-time=98.436 end-time=102.736 speaker=speaker_2
audio=meeting.wav lna=a_17 start-time=102.736 end-time=103.284 speaker=speaker_2
audio=meeting.wav lna=a_18 start-time=103.284 end-time=103.888 speaker=speaker_2
audio=meeting.wav lna=a_19 start-time=103.888 end-time=110.156 speaker=speaker_1
audio=meeting.wav lna=a_20 start-time=110.156 end-time=114.2 speaker=speaker_2
audio=meeting.wav lna=a_21 start-time=119.936 end-time=124.256 speaker=speaker_2
audio=meeting.wav lna=a_22 start-time=124.256 end-time=126.512 speaker=speaker_3
audio=meeting.wav lna=a_23 start-time=126.512 end-time=140.956 speaker=speaker_2
audio=meeting.wav lna=a_24 start-time=140.956 end-time=143.256 speaker=speaker_3
audio=meeting.wav lna=a_25 start-time=148.76 end-time=152.472 speaker=speaker_3
audio=meeting.wav lna=a_26 start-time=157.208 end-time=166.98 speaker=speaker_2
audio=meeting.wav lna=a_27 start-time=166.98 end-time=171.5 speaker=speaker_3
audio=meeting.wav lna=a_28 start-time=171.5 end-time=173.588 speaker=speaker_2
audio=meeting.wav lna=a_29 start-time=173.588 end-time=190.016 speaker=speaker_3
audio=meeting.wav lna=a_30 start-time=190.016 end-time=193.208 speaker=speaker_2
audio=meeting.wav lna=a_31 start-time=195.176 end-time=195.88 speaker=speaker_4
audio=meeting.wav lna=a_32 start-time=195.88 end-time=199.672 speaker=speaker_2
audio=meeting.wav lna=a_33 start-time=201.888 end-time=203.436 speaker=speaker_2
audio=meeting.wav lna=a_34 start-time=203.436 end-time=209.304 speaker=speaker_3
audio=meeting.wav lna=a_35 start-time=210.912 end-time=212.88 speaker=speaker_1
audio=meeting.wav lna=a_36 start-time=215.256 end-time=216.708 speaker=speaker_2
audio=meeting.wav lna=a_37 start-time=216.708 end-time=218.912 speaker=speaker_2
audio=meeting.wav lna=a_38 start-time=224.424 end-time=226.968 speaker=speaker_2
audio=meeting.wav lna=a_39 start-time=226.968 end-time=227.448 speaker=speaker_2
audio=meeting.wav lna=a_40 start-time=227.448 end-time=240.544 speaker=speaker_2
audio=meeting.wav lna=a_41 start-time=242.92 end-time=243.628 speaker=speaker_2
audio=meeting.wav lna=a_42 start-time=243.628 end-time=257.08 speaker=speaker_3
audio=meeting.wav lna=a_43 start-time=257.08 end-time=259.384 speaker=speaker_2
audio=meeting.wav lna=a_44 start-time=261.096 end-time=293.136 speaker=speaker_2
audio=meeting.wav lna=a_45 start-time=298.96 end-time=301.064 speaker=speaker_2
audio=meeting.wav lna=a_46 start-time=301.064 end-time=304.952 speaker=speaker_2
audio=meeting.wav lna=a_47 start-time=304.952 end-time=306.896 speaker=speaker_2
audio=meeting.wav lna=a_48 start-time=339.76 end-time=357.404 speaker=speaker_4
audio=meeting.wav lna=a_49 start-time=357.404 end-time=360.664 speaker=speaker_1
audio=meeting.wav lna=a_50 start-time=360.664 end-time=365.416 speaker=speaker_4
audio=meeting.wav lna=a_51 start-time=369.728 end-time=370.428 speaker=speaker_4
audio=meeting.wav lna=a_52 start-time=370.428 end-time=382.376 speaker=speaker_4
audio=meeting.wav lna=a_53 start-time=382.376 end-time=390.176 speaker=speaker_5
audio=meeting.wav lna=a_54 start-time=390.176 end-time=414.136 speaker=speaker_4
audio=meeting.wav lna=a_55 start-time=417.936 end-time=448.504 speaker=speaker_4
audio=meeting.wav lna=a_56 start-time=451.032 end-time=465.808 speaker=speaker_4
audio=meeting.wav lna=a_57 start-time=473.504 end-time=487.584 speaker=speaker_4
audio=meeting.wav lna=a_58 start-time=492.048 end-time=493.64 speaker=speaker_4
audio=meeting.wav lna=a_59 start-time=495.992 end-time=499.336 speaker=speaker_4
audio=meeting.wav lna=a_60 start-time=501.68 end-time=525.328 speaker=speaker_4
audio=meeting.wav lna=a_61 start-time=537.92 end-time=545.268 speaker=speaker_4
audio=meeting.wav lna=a_62 start-time=545.268 end-time=549.18 speaker=speaker_5
audio=meeting.wav lna=a_63 start-time=549.18 end-time=549.768 speaker=speaker_2
audio=meeting.wav lna=a_64 start-time=549.768 end-time=565.584 speaker=speaker_4
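Many consecutive segments in that listing belong to the same speaker (a_5 to a_8 are all speaker_2), which matters if the goal is to jump to the start of each new speaker turn. Here is a minimal sketch, in Python, of collapsing such runs into single turns. The function name `merge_turns` and the `max_gap` parameter are made up for this illustration and are not part of the aalto scripts; the 2.0 second default is an arbitrary assumption.

```python
# Sketch only: merge_turns and max_gap are hypothetical names,
# not something provided by the speaker-diarization scripts.

def merge_turns(segments, max_gap=2.0):
    """Collapse consecutive (start, end, speaker) segments into turns.

    Two neighbouring segments are joined when they share a speaker and
    the silence between them is at most max_gap seconds.
    """
    turns = []
    for start, end, speaker in segments:
        if turns and turns[-1][2] == speaker and start - turns[-1][1] <= max_gap:
            # Same speaker carries on: stretch the previous turn.
            turns[-1] = (turns[-1][0], end, speaker)
        else:
            turns.append((start, end, speaker))
    return turns


# Segments a_4 to a_8 copied from the listing above.
segments = [
    (60.032, 66.536, "speaker_1"),
    (66.536, 68.748, "speaker_2"),
    (68.748, 70.576, "speaker_2"),
    (70.576, 78.264, "speaker_2"),
    (79.84, 80.248, "speaker_2"),
]

turns = merge_turns(segments)
for start, end, speaker in turns:
    print(f"{speaker}: {start} -> {end}")
```

With these five segments the four speaker_2 pieces end up as one turn, so there are only two places to skip to instead of five.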
Not too bad. But what can I do with that? Hmm....
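One possible answer: Audacity can import a label track from a plain text file (File > Import > Labels...) in which each line holds a start time, an end time and a label, separated by tabs. The following is a minimal sketch, assuming the diarizer output keeps the `start-time=... end-time=... speaker=...` layout shown above. The names `parse_segments` and `to_audacity_labels` are hypothetical helpers written for this illustration.

```python
import re

# Sketch only: these helpers are illustrative, not part of the aalto scripts.
# The pattern assumes the key order seen in the diarizer output above.
SEGMENT = re.compile(r"start-time=(\S+)\s+end-time=(\S+)\s+speaker=(\S+)")


def parse_segments(text):
    """Return (start, end, speaker) tuples found in diarizer output."""
    return [
        (float(start), float(end), speaker)
        for start, end, speaker in SEGMENT.findall(text)
    ]


def to_audacity_labels(segments):
    """Format segments as tab-separated lines for an Audacity label track."""
    return "".join(
        f"{start:.6f}\t{end:.6f}\t{speaker}\n" for start, end, speaker in segments
    )


# First two lines of the example output.
sample = (
    "audio=meeting.wav lna=a_1 start-time=0.384 end-time=5.82 speaker=speaker_1\n"
    "audio=meeting.wav lna=a_2 start-time=5.82 end-time=31.648 speaker=speaker_2\n"
)

labels = to_audacity_labels(parse_segments(sample))
print(labels, end="")
```

Saving the resulting text to a file and importing it should give one region label per detected segment, named after the speaker.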
Perhaps if I can just get the labels then I can figure out what to do with them. Here I go:
First, it seems that the aalto scripts only work with WAV so I first have to export my audio from M4A to WAV. This is trivial in Audacity:
Then I need to make sure the Docker container can see the folder containing the new WAV file.
docker run -it --mount type=bind,source=/Users/pkorir/Downloads,destination=/data blabbertabber/aalto-speech-diarizer bash
Now let's try and diarise the audio.
[root@f0c9dbb6dfcf speaker-diarization]# ./spk-diarization2.py sprint_review.wav
Reading file: sprint_review.wav
Writing output to: stdout
Using feacat from: /speaker-diarization/feacat
Writing temporal files in: /tmp
Writing lna files in: /speaker-diarization/lna
Writing exp files in: /speaker-diarization/exp
Writing features in: /speaker-diarization/fea
Performing exp generation and feacat concurrently
tokenpass: ./VAD/tokenpass/test_token_pass
Reading recipe: /tmp/initYH6FfW.recipe
Using model: ./hmms/mfcc_16g_11.10.2007_10
Writing `.lna` files in: /speaker-diarization/lna
Writing `.exp` files in: /speaker-diarization/exp
Processing file 1/1
Input: sprint_review.wav
Output: /speaker-diarization/lna/sprint_review.lna
exception: Audio file sample rate (44100 Hz) and model configuration (16000 Hz) don't agree.
Traceback (most recent call last):
  File "./generate_exp.py", line 264, in <module>
    shift_dec_bord(lnas, arguments['--exppath'])
  File "./generate_exp.py", line 181, in shift_dec_bord
    num_models, l = _read_lna(lna)
  File "./generate_exp.py", line 123, in _read_lna
    with open(lna, 'r') as f:
IOError: [Errno 2] No such file or directory: '/speaker-diarization/lna/sprint_review.lna'
Calling voice-detection2.py
Reading recipe from: /tmp/initYH6FfW.recipe
Reading .exp files from: /speaker-diarization/exp
Writing output to: /tmp/vadK0SiCg.recipe
Sample rate set to: 125
Minimum speech turn duration: 0.5 seconds
Minimum nonspeech between-turns duration: 1.5 seconds
Segment before expansion set to: 0.0 seconds
Segment end expansion set to: 0.0 seconds
Error, /speaker-diarization/exp/sprint_review.exp does not exist
Waiting for feacat to end.
^CTraceback (most recent call last):
  File "./spk-diarization2.py", line 116, in <module>
    child2.wait()
  File "/usr/lib64/python2.7/subprocess.py", line 1099, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib64/python2.7/subprocess.py", line 125, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt
Oops! There's a mismatch in the sampling rate, as indicated by the line
exception: Audio file sample rate (44100 Hz) and model configuration (16000 Hz) don't agree.
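A mismatch like this can be caught before starting the diarizer by reading the sample rate from the WAV header. The sketch below uses only Python's standard `wave` module, which handles plain PCM WAV files. The 16000 Hz figure is taken from the error message above; the helper names and the generated demo file are assumptions made for illustration.

```python
import os
import tempfile
import wave

MODEL_RATE_HZ = 16000  # from the error message above


def wav_sample_rate(path):
    """Read the sample rate (Hz) from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def matches_model(path, model_rate=MODEL_RATE_HZ):
    """Return (ok, rate): whether the file's rate equals model_rate."""
    rate = wav_sample_rate(path)
    return rate == model_rate, rate


def write_silence(path, rate, seconds=0.1):
    """Write a short mono 16-bit silent WAV file, for demonstration only."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))


# Demo with a throwaway 44100 Hz file standing in for the real recording.
with tempfile.TemporaryDirectory() as folder:
    demo = os.path.join(folder, "demo_44100.wav")
    write_silence(demo, 44100)
    ok, rate = matches_model(demo)
    print(f"{rate} Hz, matches model: {ok}")
```

Pointing `matches_model` at the exported WAV instead of the demo file would show whether it needs resampling before the scripts will accept it.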
So let's go back to Audacity and fix this. It took me a while to figure this out though all the while it was right under my nose: