Klaske van Sluis

Long-term stability of tracheoesophageal voices

[Figure 5.2, two panels. Left panel: "Intelligibility Rating: T2 vs T1 (pairs)", showing mean SLP ratings and scaled ELIS ratings per speaker. Right panel: "Voice Quality Rating: T2 vs T1 (pairs)", showing mean SLP ratings and scaled AVQI ratings per speaker. Both panels: speakers K9S, KRH, L5Y, M6S, N5H, WWL, Z6J, 23K, B85, FC1, UCX, ZQ1, ZOU and their average; scale from -400 to 400, with directions labelled "T2 better >" and "< T1 better"; *: p ≤ 0.004; conf. int.: p ≤ 0.996; N = 10 listeners; I: 1996/1999-2007, II: 2007-2014, +: 1999-2014. Correlation SLP x ELIS: R = 0.610 (R² = 0.372), p ≤ 0.046. Correlation SLP x AVQI: R = 0.866 (R² = 0.750), p ≤ 0.001.]

Figure 5.2: Pairwise comparisons, experts and ASISTO ratings. Left: Intelligibility and ELIS, Right: Voice quality and AVQI. Statistics based on Student t-test and Pearson's product-moment correlation.

5.3.3 Consistency between recordings

For one speaker, KRH, three evaluated recordings were available over a span of 18 years, one from each recording period. All three recordings were used to obtain a rough (n = 1) estimate of the variability in evaluation outcomes (Table 5.3, Bonferroni correction p ≤ .01). The experts appear to judge the speech samples quite consistently. Only the voice quality results for period I in experiment 1 differed from the other periods (I versus II and III, p < .01). None of the other evaluations differed between periods (p > .01). Pairwise comparisons showed significant differences in experiment 2 (p < 0.004, underlined), except for voice quality between periods II-III. The automatic scores for this speaker, ELIS and AVQI, were rather stable over this time course (Table 5.3, Experiment 1), but the difference scores were variable (Table 5.3, Experiment 2).
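The statistics named in Figure 5.2 (Pearson's product-moment correlation and a Bonferroni-corrected significance threshold) can be illustrated with a minimal sketch. The function names `pearson_r` and `bonferroni_alpha` are hypothetical and are not taken from the study. The number of comparisons behind the reported thresholds is not stated in the text; 13 is assumed here only because 13 speaker codes appear in Figure 5.2. No study data are reproduced: only the reported R values are squared to check them against the reported R² values.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under a Bonferroni correction."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


# ASSUMPTION: 13 comparisons (one per speaker code in Figure 5.2), alpha = 0.05.
print(f"{bonferroni_alpha(0.05, 13):.3f}")  # 0.004

# Consistency check of the reported values: R squared should match R².
print(f"{0.610 ** 2:.3f}")  # 0.372 (SLP x ELIS)
print(f"{0.866 ** 2:.3f}")  # 0.750 (SLP x AVQI)
```

Under these assumptions the corrected threshold rounds to 0.004, which matches the level marked with an asterisk in Figure 5.2, and the squared correlations agree with the reported R² values. Calling `pearson_r` on the per-speaker mean SLP ratings and the scaled automatic scores would give the R values themselves, but those inputs are not listed in this section.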
5.4 Discussion

Long-term stability of voice quality and intelligibility in tracheoesophageal speakers has, to our knowledge, not yet been described. This study presents a unique dataset in which perceptual and automatic voice assessment complement each other. It must be noted, though, that speech samples of only a small group of speakers were available. Differences in surgical techniques and treatment modalities are not included in this study because of the small sample size. The voice recordings were made in three time periods, and different audio recording equipment was used each time (Table 5.2). Our analysis did