Klaske van Sluis

Long-term stability of tracheoesophageal voices

use of different recording equipment and procedures can introduce a bias in the evaluations. A test comparing the T1 results between periods I and II and the T2 results between periods II and III (experiment 1) showed that the averaged ratings did not differ significantly between these periods (Student t-test, p > .05). However, the small number of speakers makes the power of these tests low (cf. Table 5.1). To determine for which of the 13 speakers the evaluations differed, a significance level of p ≤ .004 was used (Bonferroni correction: .05/13 ≈ .004). Statistical tests were performed in R [23].

5.3.1 Experiment 1

The variation in the perceptual scores in experiment 1 was high (see Figure 5.1). Only two speakers had significantly lower perceptual intelligibility scores for T2 than for T1 (p ≤ .004, not shown). The ELIS and ELISALF scores were strongly correlated with the perceptual intelligibility scores, both pooled (R > .80, p < .001, n = 24) and for T1 and T2 separately (R > 0.78, p < .005, n = 12 each).

The perceptual voice quality scores differed for three speakers: two had lower scores for T2 than for T1 and one had higher scores (p ≤ .004). For all other speakers, the differences were not statistically significant (p > .004). The AVQI scores were moderately correlated with the pooled perceptual voice quality scores (|R| > 0.60, p < .002, n = 24) and with the T1 scores separately (|R| = 0.70, p < .02, n = 12), but not with the T2 scores (|R| = 0.45, p > .05, n = 12).

The T2-T1 difference scores for perceptual intelligibility and voice quality were strongly correlated (R = 0.89, p < .001, n = 13). The ELIS and AVQI T2-T1 differences were also correlated (R = 0.75, p < .01, n = 11).

The consistency of the evaluations was estimated by correlating the scores of each individual expert with the average score of all the other experts (n = 9). The correlations were between R = 0.6 and R = 0.9 for both Intelligibility and Voice Quality.
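The consistency estimate described above (each expert against the mean of the remaining experts) can be sketched as follows. This is only an illustration in Python using the standard library, not the R code used for the reported analysis. The function names and the toy ratings are hypothetical, and plain Pearson correlation is assumed because the text does not state which coefficient was used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def leave_one_out_consistency(ratings):
    """Correlate each rater with the mean of all other raters.

    ratings[r][s] is the score rater r gave to stimulus s
    (hypothetical layout; every rater scores the same stimuli).
    Returns one correlation per rater.
    """
    n_raters = len(ratings)
    n_items = len(ratings[0])
    correlations = []
    for r in range(n_raters):
        # Leave rater r out, so their own scores do not inflate the reference.
        others = [ratings[k] for k in range(n_raters) if k != r]
        mean_others = [
            sum(o[s] for o in others) / len(others) for s in range(n_items)
        ]
        correlations.append(pearson(ratings[r], mean_others))
    return correlations


# Toy data: 3 raters, 4 stimuli (made-up numbers, not from the study).
toy_ratings = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [1, 2, 3, 5],
]
consistency = leave_one_out_consistency(toy_ratings)
```

With ten experts, as in this study, each expert would be compared against the mean of the other nine, which keeps the reference score independent of the expert being assessed.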
Automatic scores were correlated with the average of all ten experts. The consistency of the ELIS and ELISALF scores compared favorably with that of the intelligibility scores of individual experts (R ≥ 0.8). The correlation of the automatic AVQI scores was comparable to that of the least consistent expert (R = 0.6).

5.3.2 Experiment 2

Eight speakers showed a statistically significant difference in intelligibility between T1 and T2 in experiment 2, and seven of them also showed a difference in voice quality (see Figure 5.2). The ELIS speaker difference scores were modestly correlated with the average pairwise perceptual ratings for intelligibility (R = 0.61, p ≤ .05). The correlation of the ELISALF difference scores with the perceptual ratings was marginally lower and not significant (R = 0.58, p > .05). The correlation between the ELIS and ELISALF difference scores was not statistically significant (R = 0.56, p > .05). Because of this, we focus on the ELIS scores for the remainder of this paper. Intelligibility and voice quality were strongly correlated (R = 0.99, p < .001). AVQI scores were strongly correlated to voice quality and
