I recently attended a meeting in which I had to write notes from the discussions. I was feeling lazy so I decided that I would record the audio on my phone then make notes from that. That part worked really well until I sat down to make the notes.
The meeting was structured such that there were several speakers each of whom would be interrupted with questions or comments. It was the comments that I was after. Overall the meeting lasted just under an hour with interruptions occurring at any point.
I started by playing the audio directly off my phone and taking notes. It didn't take long to realise that I couldn't keep up and that I had very little control over playback. I wanted to be able to move forward and back precisely but that wasn't possible on the phone. I also wished I didn't have to listen to the whole thing again. I then began to wonder if there was a simple way to segment the audio so that I could skip straight to the start of each new segment...
Surely someone has integrated this into a popular tool.
I've used Audacity at church to record sermons but I've never really played with it. It has a respectable number of filters and effects and I had a hunch that it would have a way to do some sort of automatic segmentation.
Unfortunately it didn't. But it does have a facility to label points or portions of audio. There was no way I was going to listen to the whole piece just to add labels - that would take an awful chunk of time. What I needed was an automatic way to add labels.
My first search was for 'audio segmentation' which turned up a bunch of research articles on various approaches to doing so. Clearly I was not going to spend hours (far more exciting than rote listening) implementing these. I wanted a complete solution.
I then searched for 'segment audio by speaker'. The first result was the Data Science Stack Exchange question 'python - Audio Analysis : Segment audio based on speaker recognition', which had two links in it: the first one (https://github.com/tyiannak/pyAudioAnalysis/wiki/5.-Segmentation) was more descriptive and less practical, while the second one (https://github.com/aalto-speech/speaker-diarization) was a repository containing a collection of Python scripts that could do various automatic tasks.
I was not yet convinced that I wanted to learn about, let alone run, a bunch of unknown scripts. I still hoped to get a quick and dirty solution, especially one which involved click-click-click (yeah, I know!).
I next tried to search for 'audacity audio segmentation' which had results on how to 'split' audio and how to 'label' audio but not how to perform automatic segmentation by speaker.
By this point I figured that I would have to use Audacity anyway so I downloaded and installed it. Since the audio was recorded as M4A (AAC) format and Audacity doesn't ship with the encoder pre-installed I had to download and install the ffmpeg library which would support this.
After toying around for a bit it became clear that I would have to invest the time in getting the speaker-diarization scripts to work. The fastest way would involve running the scripts from a Docker container (as outlined in the helpful README). So off I went in search of Docker...
Once this was set up I tried out the example.
~$ docker run -it blabbertabber/aalto-speech-diarizer bash
[root@3d30f04ba1d7 /]# cd /speaker-diarization
[root@3d30f04ba1d7 speaker-diarization]# curl -k -OL https://nono.io/meeting.wav
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17.2M  100 17.2M    0     0  1661k      0  0:00:10  0:00:10 --:--:-- 1882k
[root@3d30f04ba1d7 speaker-diarization]# ./spk-diarization2.py meeting.wav
Reading file: meeting.wav
Writing output to: stdout
Using feacat from: /speaker-diarization/feacat
Writing temporal files in: /tmp
Writing lna files in: /speaker-diarization/lna
Writing exp files in: /speaker-diarization/exp
Writing features in: /speaker-diarization/fea
Performing exp generation and feacat concurrently
tokenpass: ./VAD/tokenpass/test_token_pass
Reading recipe: /tmp/init91Tj5U.recipe
Using model: ./hmms/mfcc_16g_11.10.2007_10
Writing `.lna` files in: /speaker-diarization/lna
Writing `.exp` files in: /speaker-diarization/exp
Processing file 1/1
Input: meeting.wav
Output: /speaker-diarization/lna/meeting.lna
FAN OUT: 0 nodes, 0 arcs
FAN IN: 0 nodes, 0 arcs
Prefix tree: 3 nodes, 6 arcs
WARNING: No tokens in final nodes. The result will be incomplete. Try increasing beam.
Calling voice-detection2.py
Reading recipe from: /tmp/init91Tj5U.recipe
Reading .exp files from: /speaker-diarization/exp
Writing output to: /tmp/vadkNIKcZ.recipe
Sample rate set to: 125
Minimum speech turn duration: 0.5 seconds
Minimum nonspeech between-turns duration: 1.5 seconds
Segment before expansion set to: 0.0 seconds
Segment end expansion set to: 0.0 seconds
Waiting for feacat to end.
Calling spk-change-detection.py
Reading recipe from: /tmp/vadkNIKcZ.recipe
Reading feature files from: /speaker-diarization/fea
Feature files extension: .fea
Writing output to: /tmp/spkcKQXuDX.recipe
Conversion rate set to frame rate: 125.0
Using a growing window
Deltaws set to: 0.096 seconds
Using BIC as distance measure, lambda = 1.0
Window size set to: 1.0 seconds
Window step set to: 3.0 seconds
Threshold distance: 0.0
Useful metrics for determining the right threshold:
---------------------------------------------------
Average between windows distance: -370.524559348
Maximum between windows distance: 2039.1026531722064
Minimum between windows distance: -1222.9105555792084
Total windows: 346
Total segments: 64
Average between detected segments distance: 327.139634263
Maximum between detected segments distance: 2043.4163627520657
Minimum between detected segments distance: 11.18226920088864
Total detected speaker changes: 41
Calling spk-clustering.py
Reading recipe from: /tmp/spkcKQXuDX.recipe
Reading feature files from: /speaker-diarization/fea
Feature files extension: .fea
Writing output to: stdout
Conversion rate set to frame rate: 125.0
Using hierarchical clustering
Using BIC as distance measure, lambda = 1.3
Threshold distance: 0.0
Maximum speakers: 0
Initial cluster with: 64 speakers
Merging: 38 and 44 distance: -2921.769470541185
Merging: 38 and 40 distance: -2951.2135406549614
Merging: 38 and 43 distance: -2871.713460678535
Merging: 38 and 44 distance: -2917.2587434102315
Merging: 51 and 53 distance: -2871.0543575817246
Merging: 51 and 54 distance: -2940.8351154446605
Merging: 28 and 38 distance: -2852.6957536639875
Merging: 50 and 51 distance: -2850.9433078229094
Merging: 28 and 39 distance: -2759.0140977398596
Merging: 49 and 52 distance: -2695.0033931089265
Merging: 44 and 49 distance: -2756.0663835379573
Merging: 44 and 49 distance: -2710.348483406712
Merging: 28 and 36 distance: -2667.4482367885075
Merging: 28 and 39 distance: -2660.3583427837257
Merging: 20 and 28 distance: -2657.115016038645
Merging: 20 and 35 distance: -2715.1799624734203
Merging: 18 and 20 distance: -2710.654666705846
Merging: 17 and 18 distance: -2684.771867092685
Merging: 17 and 19 distance: -2617.350598439542
Merging: 15 and 17 distance: -2620.851348572456
Merging: 15 and 28 distance: -2607.3123879864024
Merging: 18 and 24 distance: -2471.3269385791946
Merging: 1 and 4 distance: -2433.1455909440465
Merging: 10 and 14 distance: -2358.085004075313
Merging: 16 and 25 distance: -2350.1272441177916
Merging: 16 and 21 distance: -2387.941421214947
Merging: 16 and 19 distance: -2393.6868377148476
Merging: 16 and 18 distance: -2417.338076661149
Merging: 20 and 28 distance: -2339.7809645759626
Merging: 3 and 22 distance: -2321.1849558166023
Merging: 28 and 30 distance: -2288.3740035107658
Merging: 10 and 12 distance: -2284.9976499143118
Merging: 7 and 10 distance: -2266.637970303841
Merging: 18 and 25 distance: -2181.9640388075186
Merging: 7 and 29 distance: -2173.480848412663
Merging: 7 and 9 distance: -2130.329377092388
Merging: 4 and 7 distance: -2130.82077204709
Merging: 4 and 15 distance: -2110.6807496186516
Merging: 4 and 14 distance: -2115.528322770778
Merging: 4 and 16 distance: -2171.816660672457
Merging: 4 and 6 distance: -2114.09627666888
Merging: 3 and 8 distance: -1928.8540169733815
Merging: 3 and 7 distance: -2005.493368380382
Merging: 2 and 4 distance: -1928.8040567328771
Merging: 3 and 15 distance: -1922.8509757421589
Merging: 17 and 19 distance: -1843.9285212133682
Merging: 14 and 15 distance: -1815.3507925949698
Merging: 2 and 9 distance: -1797.8386215631135
Merging: 2 and 4 distance: -1908.8801149227675
Merging: 2 and 5 distance: -1845.968505681797
Merging: 6 and 9 distance: -1655.75556196682
Merging: 2 and 4 distance: -1625.9641887100834
Merging: 2 and 7 distance: -1404.1768198308237
Merging: 2 and 7 distance: -1317.0797306828836
Merging: 3 and 4 distance: -1299.294455103457
Merging: 7 and 9 distance: -1179.8781203555072
Merging: 5 and 8 distance: -1144.8095559456488
Merging: 1 and 3 distance: -741.7540523122689
Merging: 4 and 5 distance: -618.7548791396439
Final speakers: 5
Useful metrics for determining the right threshold:
---------------------------------------------------
Maximum between segments distance: 21370.577321699926
Minimum between segments distance: -2951.2135406549614
Total segments: 64
Total detected speakers: 5
And what does the output look like?
[root@3d30f04ba1d7 speaker-diarization]# cat stdout
audio=meeting.wav lna=a_1 start-time=0.384 end-time=5.82 speaker=speaker_1
audio=meeting.wav lna=a_2 start-time=5.82 end-time=31.648 speaker=speaker_2
audio=meeting.wav lna=a_3 start-time=31.648 end-time=58.272 speaker=speaker_1
audio=meeting.wav lna=a_4 start-time=60.032 end-time=66.536 speaker=speaker_1
audio=meeting.wav lna=a_5 start-time=66.536 end-time=68.748 speaker=speaker_2
audio=meeting.wav lna=a_6 start-time=68.748 end-time=70.576 speaker=speaker_2
audio=meeting.wav lna=a_7 start-time=70.576 end-time=78.264 speaker=speaker_2
audio=meeting.wav lna=a_8 start-time=79.84 end-time=80.248 speaker=speaker_2
audio=meeting.wav lna=a_9 start-time=80.248 end-time=82.792 speaker=speaker_2
audio=meeting.wav lna=a_10 start-time=82.792 end-time=83.372 speaker=speaker_2
audio=meeting.wav lna=a_11 start-time=83.372 end-time=88.96 speaker=speaker_2
audio=meeting.wav lna=a_12 start-time=88.96 end-time=93.288 speaker=speaker_1
audio=meeting.wav lna=a_13 start-time=93.288 end-time=93.9 speaker=speaker_2
audio=meeting.wav lna=a_14 start-time=93.9 end-time=96.436 speaker=speaker_1
audio=meeting.wav lna=a_15 start-time=96.436 end-time=98.436 speaker=speaker_2
audio=meeting.wav lna=a_16 start-time=98.436 end-time=102.736 speaker=speaker_2
audio=meeting.wav lna=a_17 start-time=102.736 end-time=103.284 speaker=speaker_2
audio=meeting.wav lna=a_18 start-time=103.284 end-time=103.888 speaker=speaker_2
audio=meeting.wav lna=a_19 start-time=103.888 end-time=110.156 speaker=speaker_1
audio=meeting.wav lna=a_20 start-time=110.156 end-time=114.2 speaker=speaker_2
audio=meeting.wav lna=a_21 start-time=119.936 end-time=124.256 speaker=speaker_2
audio=meeting.wav lna=a_22 start-time=124.256 end-time=126.512 speaker=speaker_3
audio=meeting.wav lna=a_23 start-time=126.512 end-time=140.956 speaker=speaker_2
audio=meeting.wav lna=a_24 start-time=140.956 end-time=143.256 speaker=speaker_3
audio=meeting.wav lna=a_25 start-time=148.76 end-time=152.472 speaker=speaker_3
audio=meeting.wav lna=a_26 start-time=157.208 end-time=166.98 speaker=speaker_2
audio=meeting.wav lna=a_27 start-time=166.98 end-time=171.5 speaker=speaker_3
audio=meeting.wav lna=a_28 start-time=171.5 end-time=173.588 speaker=speaker_2
audio=meeting.wav lna=a_29 start-time=173.588 end-time=190.016 speaker=speaker_3
audio=meeting.wav lna=a_30 start-time=190.016 end-time=193.208 speaker=speaker_2
audio=meeting.wav lna=a_31 start-time=195.176 end-time=195.88 speaker=speaker_4
audio=meeting.wav lna=a_32 start-time=195.88 end-time=199.672 speaker=speaker_2
audio=meeting.wav lna=a_33 start-time=201.888 end-time=203.436 speaker=speaker_2
audio=meeting.wav lna=a_34 start-time=203.436 end-time=209.304 speaker=speaker_3
audio=meeting.wav lna=a_35 start-time=210.912 end-time=212.88 speaker=speaker_1
audio=meeting.wav lna=a_36 start-time=215.256 end-time=216.708 speaker=speaker_2
audio=meeting.wav lna=a_37 start-time=216.708 end-time=218.912 speaker=speaker_2
audio=meeting.wav lna=a_38 start-time=224.424 end-time=226.968 speaker=speaker_2
audio=meeting.wav lna=a_39 start-time=226.968 end-time=227.448 speaker=speaker_2
audio=meeting.wav lna=a_40 start-time=227.448 end-time=240.544 speaker=speaker_2
audio=meeting.wav lna=a_41 start-time=242.92 end-time=243.628 speaker=speaker_2
audio=meeting.wav lna=a_42 start-time=243.628 end-time=257.08 speaker=speaker_3
audio=meeting.wav lna=a_43 start-time=257.08 end-time=259.384 speaker=speaker_2
audio=meeting.wav lna=a_44 start-time=261.096 end-time=293.136 speaker=speaker_2
audio=meeting.wav lna=a_45 start-time=298.96 end-time=301.064 speaker=speaker_2
audio=meeting.wav lna=a_46 start-time=301.064 end-time=304.952 speaker=speaker_2
audio=meeting.wav lna=a_47 start-time=304.952 end-time=306.896 speaker=speaker_2
audio=meeting.wav lna=a_48 start-time=339.76 end-time=357.404 speaker=speaker_4
audio=meeting.wav lna=a_49 start-time=357.404 end-time=360.664 speaker=speaker_1
audio=meeting.wav lna=a_50 start-time=360.664 end-time=365.416 speaker=speaker_4
audio=meeting.wav lna=a_51 start-time=369.728 end-time=370.428 speaker=speaker_4
audio=meeting.wav lna=a_52 start-time=370.428 end-time=382.376 speaker=speaker_4
audio=meeting.wav lna=a_53 start-time=382.376 end-time=390.176 speaker=speaker_5
audio=meeting.wav lna=a_54 start-time=390.176 end-time=414.136 speaker=speaker_4
audio=meeting.wav lna=a_55 start-time=417.936 end-time=448.504 speaker=speaker_4
audio=meeting.wav lna=a_56 start-time=451.032 end-time=465.808 speaker=speaker_4
audio=meeting.wav lna=a_57 start-time=473.504 end-time=487.584 speaker=speaker_4
audio=meeting.wav lna=a_58 start-time=492.048 end-time=493.64 speaker=speaker_4
audio=meeting.wav lna=a_59 start-time=495.992 end-time=499.336 speaker=speaker_4
audio=meeting.wav lna=a_60 start-time=501.68 end-time=525.328 speaker=speaker_4
audio=meeting.wav lna=a_61 start-time=537.92 end-time=545.268 speaker=speaker_4
audio=meeting.wav lna=a_62 start-time=545.268 end-time=549.18 speaker=speaker_5
audio=meeting.wav lna=a_63 start-time=549.18 end-time=549.768 speaker=speaker_2
audio=meeting.wav lna=a_64 start-time=549.768 end-time=565.584 speaker=speaker_4
Not too bad. But what can I do with that? Hmm....
Perhaps if I can just get the labels then I can figure out what to do with them. Here I go:
First, it seems that the aalto scripts only work with WAV so I first have to export my audio from M4A to WAV. This is trivial in Audacity:
Then I need to make sure the Docker image can see the folder containing the new WAV file.
docker run -it --mount type=bind,source=/Users/pkorir/Downloads,destination=/data blabbertabber/aalto-speech-diarizer bash
Now let's try and diarise the audio.
[root@f0c9dbb6dfcf speaker-diarization]# ./spk-diarization2.py sprint_review.wav
Reading file: sprint_review.wav
Writing output to: stdout
Using feacat from: /speaker-diarization/feacat
Writing temporal files in: /tmp
Writing lna files in: /speaker-diarization/lna
Writing exp files in: /speaker-diarization/exp
Writing features in: /speaker-diarization/fea
Performing exp generation and feacat concurrently
tokenpass: ./VAD/tokenpass/test_token_pass
Reading recipe: /tmp/initYH6FfW.recipe
Using model: ./hmms/mfcc_16g_11.10.2007_10
Writing `.lna` files in: /speaker-diarization/lna
Writing `.exp` files in: /speaker-diarization/exp
Processing file 1/1
Input: sprint_review.wav
Output: /speaker-diarization/lna/sprint_review.lna
exception: Audio file sample rate (44100 Hz) and model configuration (16000 Hz) don't agree.
Traceback (most recent call last):
  File "./generate_exp.py", line 264, in <module>
    shift_dec_bord(lnas, arguments['--exppath'])
  File "./generate_exp.py", line 181, in shift_dec_bord
    num_models, l = _read_lna(lna)
  File "./generate_exp.py", line 123, in _read_lna
    with open(lna, 'r') as f:
IOError: [Errno 2] No such file or directory: '/speaker-diarization/lna/sprint_review.lna'
Calling voice-detection2.py
Reading recipe from: /tmp/initYH6FfW.recipe
Reading .exp files from: /speaker-diarization/exp
Writing output to: /tmp/vadK0SiCg.recipe
Sample rate set to: 125
Minimum speech turn duration: 0.5 seconds
Minimum nonspeech between-turns duration: 1.5 seconds
Segment before expansion set to: 0.0 seconds
Segment end expansion set to: 0.0 seconds
Error, /speaker-diarization/exp/sprint_review.exp does not exist
Waiting for feacat to end.
^CTraceback (most recent call last):
  File "./spk-diarization2.py", line 116, in <module>
    child2.wait()
  File "/usr/lib64/python2.7/subprocess.py", line 1099, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib64/python2.7/subprocess.py", line 125, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt
Oops! There's a mismatch in the sampling rate, as indicated by the line
exception: Audio file sample rate (44100 Hz) and model configuration (16000 Hz) don't agree.
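A quick way to catch this mismatch before starting the container is to read the sample rate from the WAV header. The sketch below uses only Python's standard `wave` module; the function names and the `EXPECTED_RATE_HZ` constant are my own, and the 16000 Hz value is simply taken from the error message above.

```python
import sys
import wave

# Rate the model expects, taken from the error message (an assumption
# that it applies to every model shipped with the scripts).
EXPECTED_RATE_HZ = 16000


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def rate_matches(path, expected=EXPECTED_RATE_HZ):
    """Return True if the file's sample rate equals the expected rate."""
    return wav_sample_rate(path) == expected


if __name__ == "__main__" and len(sys.argv) > 1:
    rate = wav_sample_rate(sys.argv[1])
    status = "OK" if rate == EXPECTED_RATE_HZ else "needs resampling"
    print(f"{sys.argv[1]}: {rate} Hz ({status})")
```

Run against the exported file, this should report 44100 Hz for the recording that triggered the error.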
So let's go back to Audacity and fix this. It took me a while to figure this out though all the while it was right under my nose:
then
which changes
to
Now we try running the script again.
[root@f0c9dbb6dfcf speaker-diarization]# ./spk-diarization2.py sprint_review.wav
Reading file: sprint_review.wav
Writing output to: stdout
Using feacat from: /speaker-diarization/feacat
Writing temporal files in: /tmp
Writing lna files in: /speaker-diarization/lna
Writing exp files in: /speaker-diarization/exp
Writing features in: /speaker-diarization/fea
Performing exp generation and feacat concurrently
tokenpass: ./VAD/tokenpass/test_token_pass
Reading recipe: /tmp/init6YosNd.recipe
Using model: ./hmms/mfcc_16g_11.10.2007_10
Writing `.lna` files in: /speaker-diarization/lna
Writing `.exp` files in: /speaker-diarization/exp
Processing file 1/1
Input: sprint_review.wav
Output: /speaker-diarization/lna/sprint_review.lna
FAN OUT: 0 nodes, 0 arcs
FAN IN: 0 nodes, 0 arcs
Prefix tree: 3 nodes, 6 arcs
WARNING: No tokens in final nodes. The result will be incomplete. Try increasing beam.
Calling voice-detection2.py
Reading recipe from: /tmp/init6YosNd.recipe
Reading .exp files from: /speaker-diarization/exp
Writing output to: /tmp/vad9FZr1b.recipe
Sample rate set to: 125
Minimum speech turn duration: 0.5 seconds
Minimum nonspeech between-turns duration: 1.5 seconds
Segment before expansion set to: 0.0 seconds
Segment end expansion set to: 0.0 seconds
Waiting for feacat to end.
Calling spk-change-detection.py
Reading recipe from: /tmp/vad9FZr1b.recipe
Reading feature files from: /speaker-diarization/fea
Feature files extension: .fea
Writing output to: /tmp/spkcOhyXK4.recipe
Conversion rate set to frame rate: 125.0
Using a growing window
Deltaws set to: 0.096 seconds
Using BIC as distance measure, lambda = 1.0
Window size set to: 1.0 seconds
Window step set to: 3.0 seconds
Threshold distance: 0.0
Useful metrics for determining the right threshold:
---------------------------------------------------
Average between windows distance: -741.884289478
Maximum between windows distance: 2056.233417322154
Minimum between windows distance: -1634.5738624276773
Total windows: 1693
Total segments: 164
Average between detected segments distance: 295.898429423
Maximum between detected segments distance: 2062.7827747870438
Minimum between detected segments distance: 5.3772136141556075
Total detected speaker changes: 40
Calling spk-clustering.py
Reading recipe from: /tmp/spkcOhyXK4.recipe
Reading feature files from: /speaker-diarization/fea
Feature files extension: .fea
Writing output to: stdout
Conversion rate set to frame rate: 125.0
Using hierarchical clustering
Using BIC as distance measure, lambda = 1.3
Threshold distance: 0.0
Maximum speakers: 0
Initial cluster with: 164 speakers
Merging: 108 and 135 distance: -3887.955395546159
Merging: 1 and 108 distance: -3829.558634475854
Merging: 1 and 133 distance: -3786.060190436657
Merging: 1 and 93 distance: -3770.7263608466424
Merging: 79 and 131 distance: -3749.047306462543
Merging: 1 and 92 distance: -3717.2981892853973
Merging: 1 and 113 distance: -3677.9809939843003
Merging: 1 and 91 distance: -3636.012753072265
Merging: 1 and 105 distance: -3628.2522779525566
Merging: 1 and 129 distance: -3581.0688774678083
Merging: 1 and 107 distance: -3575.992436049587
Merging: 1 and 94 distance: -3568.4366839299973
Merging: 1 and 3 distance: -3527.330699268212
Merging: 1 and 90 distance: -3522.014658249364
Merging: 74 and 116 distance: -3493.7359497637935
Merging: 1 and 102 distance: -3465.499077569202
Merging: 1 and 107 distance: -3445.3836108861806
Merging: 1 and 52 distance: -3451.722730361522
Merging: 1 and 33 distance: -3387.876121993725
Merging: 51 and 63 distance: -3384.4959699685132
Merging: 48 and 75 distance: -3354.5669035578567
Merging: 71 and 79 distance: -3349.328440026835
Merging: 71 and 79 distance: -3314.5059704949717
Merging: 71 and 108 distance: -3253.5329670281417
Merging: 44 and 69 distance: -3237.6886447327024
Merging: 44 and 49 distance: -3386.3979517532134
Merging: 44 and 49 distance: -3371.639373859509
Merging: 44 and 55 distance: -3385.7581267712203
Merging: 44 and 53 distance: -3380.7840683752183
Merging: 44 and 63 distance: -3376.7621000717972
Merging: 24 and 44 distance: -3403.54812645695
Merging: 24 and 40 distance: -3436.4345257878567
Merging: 19 and 24 distance: -3476.843817278601
Merging: 3 and 19 distance: -3475.193588642015
Merging: 3 and 55 distance: -3474.0118563581327
Merging: 3 and 42 distance: -3578.643272853399
Merging: 3 and 52 distance: -3548.303482323008
Merging: 3 and 42 distance: -3577.0468024010906
Merging: 3 and 47 distance: -3598.6588382225755
Merging: 3 and 85 distance: -3608.2916045281318
Merging: 3 and 36 distance: -3603.7144836529587
Merging: 3 and 37 distance: -3617.6925098052097
Merging: 3 and 50 distance: -3582.978842707711
Merging: 3 and 51 distance: -3608.363897509921
Merging: 3 and 14 distance: -3616.959015783171
Merging: 3 and 14 distance: -3673.1394593964087
Merging: 3 and 6 distance: -3645.692757898267
Merging: 3 and 21 distance: -3651.831659533852
Merging: 3 and 45 distance: -3635.4377651669965
Merging: 3 and 42 distance: -3625.768902747991
Merging: 3 and 16 distance: -3637.3145431461453
Merging: 3 and 12 distance: -3656.1644745539015
Merging: 3 and 39 distance: -3619.0243746250035
Merging: 3 and 70 distance: -3617.532832619885
Merging: 3 and 31 distance: -3538.1118926415274
Merging: 3 and 24 distance: -3588.4551846544327
Merging: 3 and 23 distance: -3595.3136668354373
Merging: 3 and 22 distance: -3563.4762067514275
Merging: 3 and 34 distance: -3473.1897660250816
Merging: 3 and 66 distance: -3497.4745288209124
Merging: 3 and 51 distance: -3488.191266448631
Merging: 3 and 5 distance: -3435.2971353689245
Merging: 3 and 5 distance: -3412.6258974336815
Merging: 3 and 33 distance: -3421.702443699097
Merging: 3 and 10 distance: -3347.9573937274645
Merging: 3 and 4 distance: -3275.507261738103
Merging: 3 and 28 distance: -3222.640697116016
Merging: 32 and 63 distance: -3220.758971141154
Merging: 79 and 85 distance: -3209.354815380604
Merging: 79 and 81 distance: -3206.8351002940353
Merging: 3 and 6 distance: -3204.3674454124757
Merging: 3 and 43 distance: -3188.818815699572
Merging: 3 and 18 distance: -3148.030529647065
Merging: 3 and 6 distance: -3205.4135632110892
Merging: 3 and 46 distance: -3231.040880173908
Merging: 3 and 24 distance: -3208.884162691228
Merging: 28 and 31 distance: -3133.263655995979
Merging: 28 and 58 distance: -3115.8873861076936
Merging: 28 and 56 distance: -3164.7155330911028
Merging: 70 and 73 distance: -3100.665370054021
Merging: 3 and 12 distance: -3092.46044930492
Merging: 3 and 40 distance: -3087.035097019593
Merging: 3 and 20 distance: -3081.700056095887
Merging: 3 and 24 distance: -3086.8293437925913
Merging: 3 and 11 distance: -3097.1391768792882
Merging: 73 and 78 distance: -3075.370397383048
Merging: 24 and 25 distance: -3047.826910527925
Merging: 24 and 25 distance: -3175.594527744134
Merging: 24 and 34 distance: -3125.426855246352
Merging: 24 and 31 distance: -3143.175391292376
Merging: 1 and 63 distance: -3031.692234416072
Merging: 3 and 41 distance: -3020.1835212450815
Merging: 3 and 35 distance: -3011.633718292628
Merging: 58 and 59 distance: -3001.264566856854
Merging: 1 and 49 distance: -2978.7819505136677
Merging: 3 and 39 distance: -2965.2039816969127
Merging: 55 and 56 distance: -2939.950569232509
Merging: 24 and 44 distance: -2893.0393418751846
Merging: 20 and 21 distance: -2881.333358393781
Merging: 2 and 20 distance: -2891.8227857940656
Merging: 14 and 18 distance: -2880.5349814279652
Merging: 6 and 14 distance: -2944.2111947274816
Merging: 6 and 24 distance: -2991.7338627010076
Merging: 6 and 24 distance: -3058.018780854746
Merging: 49 and 58 distance: -2876.5058351204634
Merging: 49 and 56 distance: -2915.1050006814294
Merging: 13 and 49 distance: -2921.3550738196755
Merging: 6 and 27 distance: -2876.4117349206253
Merging: 19 and 20 distance: -2845.2434796312136
Merging: 19 and 24 distance: -2937.9409080638825
Merging: 2 and 48 distance: -2822.8020023667686
Merging: 6 and 11 distance: -2822.2471799865234
Merging: 4 and 6 distance: -2834.310032658269
Merging: 17 and 19 distance: -2793.0188621542147
Merging: 16 and 17 distance: -2900.7354074650893
Merging: 16 and 22 distance: -2862.6145736734416
Merging: 16 and 22 distance: -2869.23538299506
Merging: 16 and 18 distance: -2850.6370966794266
Merging: 4 and 22 distance: -2781.876035270495
Merging: 3 and 14 distance: -2741.736072613906
Merging: 3 and 30 distance: -2759.259362664974
Merging: 3 and 8 distance: -2757.8665416641434
Merging: 21 and 24 distance: -2719.112465081248
Merging: 14 and 16 distance: -2707.5240582398374
Merging: 14 and 23 distance: -2660.882175893672
Merging: 3 and 9 distance: -2649.7183508639337
Merging: 3 and 7 distance: -2651.3659778753836
Merging: 12 and 13 distance: -2616.9054823202387
Merging: 4 and 5 distance: -2582.178114937249
Merging: 3 and 13 distance: -2554.2970459399385
Merging: 1 and 13 distance: -2548.118699103049
Merging: 1 and 21 distance: -2546.7834587369316
Merging: 1 and 23 distance: -2521.634724829474
Merging: 1 and 6 distance: -2649.776882152113
Merging: 1 and 19 distance: -2603.222960608062
Merging: 1 and 17 distance: -2495.2817102370227
Merging: 1 and 19 distance: -2602.05798529557
Merging: 19 and 20 distance: -2347.927030688841
Merging: 6 and 23 distance: -2228.488638502052
Merging: 6 and 25 distance: -2431.776312676784
Merging: 19 and 21 distance: -2160.2725334151246
Merging: 19 and 20 distance: -2220.9533237121896
Merging: 1 and 12 distance: -2138.357070794871
Merging: 8 and 16 distance: -2113.677056401817
Merging: 17 and 18 distance: -2041.0840088828236
Merging: 12 and 13 distance: -2038.2327352681687
Merging: 11 and 13 distance: -1944.9097104167731
Merging: 1 and 9 distance: -1921.8353706803282
Merging: 14 and 15 distance: -1872.455349311388
Merging: 3 and 5 distance: -1842.473173390661
Merging: 1 and 12 distance: -1663.8068944748202
Merging: 5 and 6 distance: -1236.5990251376343
Merging: 10 and 12 distance: -247.4109391082884
Merging: 4 and 5 distance: -41.04158619899499
Final speakers: 10
Useful metrics for determining the right threshold:
---------------------------------------------------
Maximum between segments distance: 167571.74992590738
Minimum between segments distance: -3887.955395546159
Total segments: 164
Total detected speakers: 10
[root@f0c9dbb6dfcf speaker-diarization]# less stdout
Which looks like so:
~$ head ~/Downloads/sprint_review_speakers.txt
audio=sprint_review.wav lna=a_1 start-time=5.448 end-time=6.504 speaker=speaker_1
audio=sprint_review.wav lna=a_2 start-time=11.352 end-time=14.504 speaker=speaker_2
audio=sprint_review.wav lna=a_3 start-time=16.408 end-time=17.0 speaker=speaker_1
audio=sprint_review.wav lna=a_4 start-time=24.28 end-time=28.768 speaker=speaker_3
audio=sprint_review.wav lna=a_5 start-time=34.528 end-time=45.712 speaker=speaker_3
audio=sprint_review.wav lna=a_6 start-time=49.608 end-time=55.16 speaker=speaker_3
audio=sprint_review.wav lna=a_7 start-time=57.032 end-time=62.144 speaker=speaker_3
audio=sprint_review.wav lna=a_8 start-time=66.272 end-time=73.08 speaker=speaker_3
audio=sprint_review.wav lna=a_9 start-time=75.04 end-time=79.952 speaker=speaker_4
audio=sprint_review.wav lna=a_10 start-time=83.32 end-time=90.732 speaker=speaker_4
It then hit me that if I am able to create and save labels in Audacity then I should certainly be able to import them. The only question is the form of the labels: hopefully Audacity does not require some complex format.
So I fired up Google with the search 'audacity import labels' which turned up this page: https://ttmanual.audacityteam.org/man/Label_Tracks. Right at the bottom was exactly what I was looking for:
Brilliant! Now to work out how to massage the output of the aalto scripts into Audacity labels. For this I whipped up a dirty Python script.
#!/usr/bin/env python3.8
import sys

with open(sys.argv[1]) as f:
    for row in f:
        l = row.strip().split(' ')
        start = float(l[2].split('=')[-1])
        stop = float(l[3].split('=')[-1])
        label = l[-1].split('=')[-1]
        diff = 0.0
        print(start - diff, stop - diff, label, sep='\t')
        #print(l[2].split('=')[-1], l[3].split('=')[-1], l[-1].split('=')[-1], sep='\t')
which I ran as follows:
~$ ./sprint_review_labels.py sprint_review_speakers.txt > sprint_review_speaker_labels.txt
whose output now looks like
~$ head sprint_review_speaker_labels.txt
5.448	6.504	speaker_1
11.352	14.504	speaker_2
16.408	17.0	speaker_1
24.28	28.768	speaker_3
34.528	45.712	speaker_3
49.608	55.16	speaker_3
57.032	62.144	speaker_3
66.272	73.08	speaker_3
75.04	79.952	speaker_4
83.32	90.732	speaker_4
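The script above relies on the position of each field in the line. A slightly more defensive variant, sketched here, looks fields up by name instead, so it would keep working if the order of the `key=value` pairs changed. The function names and the `offset` parameter are my own; the field names (`start-time`, `end-time`, `speaker`) are the ones visible in the diarization output.

```python
def parse_segment(line):
    """Split one 'key=value key=value ...' line into a dict of strings."""
    return dict(field.split("=", 1) for field in line.split())


def to_audacity_label(line, offset=0.0):
    """Convert one diarization line into a tab-separated label line.

    The result has the form: start<TAB>end<TAB>label.
    `offset` (seconds) is subtracted from both times to shift the label left.
    """
    seg = parse_segment(line)
    start = float(seg["start-time"]) - offset
    end = float(seg["end-time"]) - offset
    return f"{start:.3f}\t{end:.3f}\t{seg['speaker']}"


if __name__ == "__main__":
    example = ("audio=sprint_review.wav lna=a_1 "
               "start-time=5.448 end-time=6.504 speaker=speaker_1")
    print(to_audacity_label(example))
```

Times are written with three decimals here, which matches the precision of the diarization output.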
I was now in a position to import the labels into Audacity alongside the audio track. This is what the final output looks like:
The labels now meant that I did not have to listen to the full audio and I could simply skip to the change of speaker to get their reactions and the ensuing discussion.
I think this would be an awesome feature to integrate into Audacity. Doing so would take advantage of Audacity's built-in features, such as resampling and filters, that can improve detection quality. I can certainly see myself having to do this again and I'm sure there are numerous other scenarios in which others would need to do some quick scanning.
In this application I was not super-interested in accuracy: a quick heuristic was sufficient to save me tonnes of time. Granted, discovering the whole process above took me a couple of hours, but I think it was well spent if in future it takes me at most 20 minutes to segment the audio. I noticed that whenever several speakers spoke at once the algorithm would get confused. The assignment of labels to speakers was not always correct but what really mattered for me was the transition from speaker to speaker rather than the identity of each speaker.
If you need something more accurate you will need to ensure that the recording is of sufficient quality and perhaps do some post-processing such as minimising echoes. Using better quality mics is also necessary.
When I first imported the labels I suspected that they were offset by some value. This is why the script has a diff variable: by setting diff to a small positive value (<5.448) I was able to shift the labels to the left and see if they matched better with the audio. However, after examining the sync it occurred to me that the beginning might not have been the best quality. Further down the line the labels were very reliable.
One little quirk is that the labels are defined in such a way that the total labeled length is less than the total running length, meaning there are unlabeled intervals where there is silence. This can be annoying at times because it means that consecutive labels belonging to the same speaker are split up. It would be nice to merge them to the minimum number of segments (hence the --merge-similar argument below).
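As a rough sketch of what such merging could look like, the function below joins consecutive labels that share a speaker. It is not part of the aalto scripts; the name `merge_similar` and the optional `max_gap` parameter are my own, and it assumes the labels are already sorted by start time, as they are in the output above.

```python
def merge_similar(labels, max_gap=None):
    """Merge consecutive labels that share a speaker.

    labels:  list of (start, end, speaker) tuples sorted by start time.
    max_gap: if given, merge only when the silence between two labels
             is at most this many seconds; if None, always merge.
    """
    merged = []
    for start, end, speaker in labels:
        if merged:
            prev_start, prev_end, prev_speaker = merged[-1]
            gap = start - prev_end
            if speaker == prev_speaker and (max_gap is None or gap <= max_gap):
                # Extend the previous label instead of adding a new one.
                merged[-1] = (prev_start, max(prev_end, end), speaker)
                continue
        merged.append((start, end, speaker))
    return merged


if __name__ == "__main__":
    labels = [
        (5.448, 6.504, "speaker_1"),
        (11.352, 14.504, "speaker_2"),
        (16.408, 17.0, "speaker_1"),
        (24.28, 28.768, "speaker_3"),
        (34.528, 45.712, "speaker_3"),
    ]
    for start, end, speaker in merge_similar(labels):
        print(start, end, speaker, sep="\t")
```

Only labels that directly follow each other are merged, so a speaker who returns after someone else has spoken still gets a new label, which keeps the speaker transitions visible.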
Here's how I would really like to process the data. Personally, I prefer working on the command line. I envision a single command
~$ some_audio_cmd segment --merge-similar file.m4a -o file_labels.txt
That's it!
Note that the intermediate conversion into WAV would be automatic: the user never even needs to know it happened.
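One way the automatic conversion might be done is by handing the file to ffmpeg with the target sample rate set. The sketch below only builds the command; the helper name is hypothetical, and the 16000 Hz default is taken from the error message I hit earlier. The `-i`, `-ar` and `-y` options are standard ffmpeg options for input file, audio sample rate and overwrite.

```python
from pathlib import Path


def wav_conversion_command(source, sample_rate=16000):
    """Build an ffmpeg command that converts `source` to WAV at `sample_rate` Hz.

    The output is written next to the input, with a .wav extension.
    """
    source = Path(source)
    target = source.with_suffix(".wav")
    return [
        "ffmpeg",
        "-y",                      # overwrite the target if it already exists
        "-i", str(source),         # input file, e.g. an M4A recording
        "-ar", str(sample_rate),   # resample to the rate the model expects
        str(target),
    ]


if __name__ == "__main__":
    print(" ".join(wav_conversion_command("file.m4a")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed.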
If this were to work from Audacity there would be a filter which simply says 'Segment audio...', with a few default options, and which produces a labels track.
I hope that someone finds this useful. If I had a lot of time on my hands (which I definitely don't have) I would repackage the aalto scripts into a more user-friendly package as well as make it more robust against audio sampling frequency.