**7. Discussion**

Two general approaches have been used to evaluate expressive oral reading: automated assessment and human judgment. Bolaños et al. conducted a study using both [54]. They examined the oral reading of first through sixth graders, each of whom read one of 20 passages written at their grade level in a one-minute timed reading. The 783 resulting recordings were evaluated for rate, accuracy, and expression using both automated and human scoring. The automated approach was developed by the authors, who also trained teachers to calculate words correct per minute (wcpm) and to rate prosody using the NAEP scale [55].
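For reference, wcpm is simple arithmetic: words read correctly divided by elapsed minutes. The following is a minimal sketch; the function and the example values are illustrative, not taken from the Bolaños et al. system.

```python
def wcpm(words_read: int, errors: int, seconds: float) -> float:
    """Words correct per minute for a timed oral reading.

    words_read: total words the student attempted
    errors:     words misread, omitted, or substituted
    seconds:    elapsed reading time (60 for a one-minute probe)
    """
    words_correct = words_read - errors
    return words_correct / (seconds / 60.0)

# A hypothetical first grader attempting 52 words with 4 errors in one minute:
print(wcpm(52, 4, 60))  # 48.0 wcpm
```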

Results showed high agreement between scores obtained using automated assessments and human ratings for both wcpm and prosody. Agreement between human and automated ratings of expressive oral reading was higher when ratings were collapsed into fluent (NAEP levels 3 and 4) versus non-fluent (levels 1 and 2) categories (90.93%) than when exact agreement across all four NAEP levels was required (76.05%). Nevertheless, the agreement was high enough that the authors concluded that automated ratings could reasonably substitute for human ratings on all three aspects of fluency.
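The gap between the two percentages is easier to see with the computation in view. The sketch below uses made-up ratings, not data from the study, to show why collapsing the four NAEP levels into a fluent/non-fluent binary can only raise, never lower, percent agreement: every exact match is also a binary match.

```python
human     = [4, 3, 2, 3, 1, 4, 2, 3]   # hypothetical NAEP ratings
automated = [3, 3, 2, 4, 1, 4, 1, 3]

# Exact agreement: both sources assign the same 1-4 level.
exact = sum(h == a for h, a in zip(human, automated)) / len(human)

def fluent(rating):  # collapse NAEP levels: 3-4 fluent, 1-2 non-fluent
    return rating >= 3

# Binary agreement: both sources land on the same side of the fluent cut.
binary = sum(fluent(h) == fluent(a) for h, a in zip(human, automated)) / len(human)

print(f"exact: {exact:.2%}, binary: {binary:.2%}")  # exact: 62.50%, binary: 100.00%
```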

Other researchers have raised concerns that digital measures may not accurately assess all dimensions of fluency [56]. For example, Smith and Paige reported higher reliability results for the NAEP scale than those reported in the Bolaños study [44]. The MDFS, not used in the Bolaños study, has also been shown to yield highly reliable scores [43]. Automated tools are limited to measuring pitch and pauses, leaving out other dimensions of prosody, such as smoothness and phrasing, that human raters may be better equipped to evaluate.

Automated assessment can also obscure qualitative relationships between prosody and meaning. Pitch and pausing contours may not always be correctly associated with the linguistic elements in the text. Proficient readers can appropriately read the same passage with multiple prosodic patterns, while struggling readers may exhibit pauses and pitch differences that are not appropriate. Words in a text can sometimes be grouped in more than one acceptable way while still preserving meaning; these groupings may not all be grammatically conventional, and a spectrographic analysis generally would not accept such variations. A human rater can attend to why students pause as they read (e.g., to check what they have read, to decode the word they are currently reading, to anticipate an upcoming word, or to rest because they are tired). Spectrographic analysis can document these pauses, but only a human rater has the potential to understand why they occur.
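To make concrete what "pitch and pauses" means in practice, here is a minimal sketch of the two signals such tools typically extract. librosa is used as a stand-in toolkit and the filename is hypothetical; this is not the pipeline of any system cited above.

```python
import librosa
import numpy as np

# Hypothetical recording of a student's one-minute timed reading.
y, sr = librosa.load("student_reading.wav", sr=None)

# Fundamental-frequency (pitch) contour via probabilistic YIN.
f0, voiced, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

# Pauses: gaps between non-silent intervals longer than 300 ms.
intervals = librosa.effects.split(y, top_db=30)
pauses = [
    (end, start)
    for (_, end), (start, _) in zip(intervals[:-1], intervals[1:])
    if (start - end) / sr > 0.3
]
print(f"{np.nanmean(f0):.1f} Hz mean pitch, {len(pauses)} pauses > 300 ms")
```

Output like this documents where pauses occur and how pitch moves, but nothing in it indicates why a pause occurred, which is precisely the gap described above.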

On the flip side, the use of human rating scales of prosody also has potential drawbacks. Whenever multiple raters are used, reliability and validity issues come into question. Researchers have shown that training raters is essential to ensure inter- and intra-rater reliability, but such training takes time and effort, which may not always be practical in a school setting.
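When rater training is evaluated, inter-rater reliability is commonly quantified with a chance-corrected statistic such as Cohen's kappa. A minimal sketch with hypothetical ratings follows; scikit-learn is an assumption of convenience here, not a tool named in the studies cited.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical NAEP prosody ratings from two trained raters.
rater_a = [3, 4, 2, 3, 1, 4, 2, 3, 3, 2]
rater_b = [3, 3, 2, 3, 2, 4, 2, 4, 3, 2]

# Unweighted kappa treats any disagreement alike; quadratic weights
# penalize a 1-vs-4 disagreement more than a 3-vs-4 one, which better
# suits an ordinal scale like NAEP.
print(cohen_kappa_score(rater_a, rater_b))
print(cohen_kappa_score(rater_a, rater_b, weights="quadratic"))
```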

Different rating scales have included different traits of prosody, reflecting disagreement about what constitutes expressive oral reading. Human raters make subjective judgments. Although such judgments can be tempered with multiple raters, passages, and rating occasions, educators may not have the luxury of going to such lengths. Human raters can make a holistic judgment using NAEP, but when using the MDFS or CORFS they must make multiple decisions at the same time. It is difficult to make separate judgments for each dimension, and time-consuming to listen to a recording multiple times to rate each dimension separately.
