Author: Nancy Manasse
University of Nebraska, Lincoln
Date: May 3, 1999
Introduction
Computer technology continues to impress society with annual gains in capability. Just when the top-of-the-line speech recognition (SR) program seems to have reached the market, another company is already developing a more sophisticated product. In fact, today you can buy a computer with SR software preinstalled on the hard drive. Don't be mistaken, though: SR is not a new entity. It first emerged over 30 years ago (Stevens, 1960). Continued growth in the technology has opened the door to various applications of SR. Not only is SR a popular medium for professionals in the working world, it is also an exceptional tool for people with disabilities.
Four main areas of interest have received attention from professionals working with individuals with disabilities. The first area researched was the use of SR systems by people with physical disabilities but no speech impairment. Investigation in a second area expanded to include the use of SR by individuals with impaired or dysarthric speech; SR was viewed as having the potential to make hard-to-understand speech more easily recognizable. The third area explored was the use of SR for drill practice: could individuals with motor speech disorders, as well as those with hearing impairments, improve their intelligibility using SR programs that provided feedback of some sort? The last and most recent area studied is the use of SR by students with learning disabilities as an aid to improve the effectiveness of written composition. This paper provides preliminary information about the basic components of speech recognition systems; however, the primary content focuses on some of the history behind each of the four areas, recent research findings, and suggestions for future research.
System Features and Capabilities
Features of SR systems have progressively advanced over the last 35 years. Initially, in 1972, dictation and word processing systems were combined to form the first SR system (Lange, 1993; Meisel, 1993). At that point, systems could only handle discrete speech dictation, in which a pause was required after every spoken word for the signal to be processed. Today, it is difficult to find programs that still use discrete speech; VoicePad Platinum (Kurzweil Educational Systems, Inc.) is one of the last programs of this kind. Most programs can handle continuous speech, in which the speaker talks naturally without pausing between words. One disadvantage is that continuous systems do not make ongoing adaptations to the user's speech, whereas discrete systems do (De La Paz, in press). That is, the stored templates of the user's speech are not automatically adjusted in continuous systems; rather, repeated training sessions are a prerequisite for effective use. Vocabulary capacity ranges from 13 to 40,000 words (Lee, Hauptmann, & Rudnicky, 1990). SR programs may be speaker dependent or speaker independent. A speaker-dependent system requires the user to train the system to develop a recognition template of words. This template is user specific and is accessed each time the user activates the system. Speaker-independent systems, on the other hand, use previously stored templates that are provided by the manufacturer.
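The idea of a user-specific recognition template can be illustrated with a toy sketch. It is not the method of any product named above. It assumes, purely for illustration, that each utterance has already been reduced to a short list of numbers, that a template is the average of a word's training utterances, and that recognition picks the nearest template within a cutoff distance. The function names, feature values and threshold are all hypothetical.

```python
from statistics import mean


def train_templates(samples):
    """Average each word's training utterances into one stored template.

    samples maps a word to a list of equal-length feature vectors
    (hypothetical features; real systems derive them from the audio signal).
    """
    return {
        word: [mean(column) for column in zip(*vectors)]
        for word, vectors in samples.items()
    }


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def recognize(features, templates, max_distance=1.0):
    """Return the closest stored word, or None if no template is close enough."""
    word, template = min(
        templates.items(), key=lambda item: distance(features, item[1])
    )
    return word if distance(features, template) <= max_distance else None


# A speaker-dependent system builds these templates from the user's own speech;
# a speaker-independent system would ship with templates already stored.
user_templates = train_templates({
    "open": [[1.0, 0.2], [1.2, 0.0]],
    "close": [[0.1, 0.9], [0.3, 1.1]],
})
```

Under these assumptions, an utterance close to the stored "open" template is recognized as that word, and one far from every template is returned as a non-recognition. Re-running `train_templates` with new samples corresponds to the repeated training sessions described above.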
In addition to the many internal components, external factors also affect the accuracy of SR. A microphone headset is worn by the user and adjusted to minimize background noise and maximize the quality of the speech signal. Some SR systems are packaged with their own headset, while other products require the user to purchase one separately. Several types of headsets are available on the market, ranging in quality as well as price. Unlike face-to-face communication, an SR system depends solely on the speech signal; only limited knowledge about the speaker and the context of the topic can be used to help decipher the message. Each system also has a complex set of commands that must be mastered in order to run the program effectively and correct errors. Learning to use an SR program takes time and patience.
In addition to mastering the components and language of the system, the user must know how to set up and run the computer on which it is based. The process can be overwhelming and frustrating for users without disabilities, let alone for individuals faced with additional challenges. Both advantages and drawbacks have emerged in each of the four areas explored here: use of SR by those with physical disabilities, use by those with dysarthria, use as a tool for drill practice, and use by students with learning disabilities.
Speech Recognition Use by the Physically Disabled
Speech recognition systems were first used by individuals with severe physical disabilities and normal speech. The goal was to promote independence by using SR to convert human speech signals into effective actions. Frequently, speech is the only remaining means of communication for these individuals. The first voice-activated wheelchair with an environmental control unit (ECU) was developed in the late 1970s at Rehabilitation Medicine in New York (Youdin et al., 1980). The user could operate multiple items including the telephone, radio, fans, curtains, intercom, page-turner and more. A group of individuals with cerebral palsy rated the wheelchair as superior to breath control systems because it eliminated the need for scanning, giving the user quicker access by directly selecting the desired function by voice.
Grunza and Cohen (as cited in Noyes & Frankish, 1992) developed another system in 1977, the Voice Activated Control System (VACS). The VACS is composed of a microphone, a hardware preprocessor and feature extractor, a minicomputer, an electronic display, a Teletype and a relay interface. It is a speaker-dependent system, trained by the user, with a capacity of 99 words. The visual display allows the user to confirm all commands before they are executed. The system has three functional modes: control, type, and calculate. Only words valid for the current mode are accepted; all other words are rejected, and an indicator light alerts the user to the invalid function. Control mode activates up to 12 automated devices that are interfaced with the VACS. Type mode allows the user to compose written text using the phonetic alphabet. Up to 20 vocabulary words can be pre-stored and entered with a single utterance; all other words must be spelled letter by letter, a fatiguing and time-consuming process. If the user attempts to speak a word that is stored for use in another mode, it may be rejected by the system. Finally, calculate mode executes addition, subtraction, multiplication and division. Although not perfect, this device, introduced in the 1970s, gave continued promise that technology could enhance the lives of individuals with disabilities.
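The mode-dependent acceptance of words described for the VACS can be pictured with a small sketch. It is only an illustration of the general idea, not a reconstruction of the original device. The three mode names come from the description above, but the word lists, the function name and the return values are hypothetical.

```python
# Hypothetical vocabularies: the source names the three modes
# but does not list the words assigned to each one.
MODE_VOCABULARY = {
    "control": {"lamp", "radio", "telephone"},
    "type": {"alpha", "bravo", "charlie"},
    "calculate": {"add", "subtract", "multiply", "divide"},
}


def accept_word(mode, word):
    """Return True if the recognized word is valid in the current mode.

    A False result stands in for the rejection that the indicator
    light signals to the user.
    """
    return word in MODE_VOCABULARY[mode]


def modes_for_word(word):
    """List the modes in which a word would be accepted."""
    return sorted(mode for mode, words in MODE_VOCABULARY.items() if word in words)
```

With these assumed word lists, `accept_word("control", "lamp")` is accepted, while the same word spoken in type mode is rejected. This mirrors the point that a word stored for one mode may be refused in another, and that the user must switch modes correctly before speaking.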
The application of ECUs in the home was investigated later. It was first considered by Damper (1984) with the development of the Voice Activated Domestic Appliance System (VADAS). The VADAS is a speaker-dependent, isolated-word recognizer with the capacity to control up to 16 household appliances (Noyes & Frankish, 1992). A similar notion was the use of speech to control robotic arms. Research in this area was initiated in 1981 by Engelhardt et al. (1984) and continued more recently at the Palo Alto Veterans Administration Medical Center (Noyes & Frankish, 1992). The goal of the robotic arm was to assist individuals with household tasks including kitchen preparation, recreational tasks and even therapeutic skills. Few complaints were voiced about the actual functioning of the system; the greater concern was frequent misrecognition of speech.
Finally, a more recent consideration of SR use by individuals with severe disabilities occurred in the medical setting. In this environment SR can serve two purposes. First, it is used to activate functional electrical stimulation (FES) systems. This allows the patient to maintain a routine as directed by the physical therapist without the constant involvement of a nurse. It can also be motivational, giving patients a degree of control over their rehabilitation process. Second, the use of computers to help disabled patients complete surveys or questionnaires has been considered (Noyes & Frankish, 1992). Verbal responses would eliminate the need for writing, a difficult and sometimes impossible task for many individuals with physical disabilities.
While systems that combine SR with ECUs have alleviated many frustrations for users, several issues still require resolution. Frequent misrecognitions and non-recognitions remain an unresolved problem. A temporary solution is for the user to note repeated failures and change the stored template as necessary. The failures may or may not be related to fatigue; as fatigue increases, accuracy may decrease. Perhaps the systems should initially be used for limited periods, to maximize accuracy, rather than for lengthy periods with multiple errors. As endurance improves, the length of use could be gradually increased. This may positively influence the user's view of the quality and accuracy of a system. Background noise is another factor contributing to misrecognitions. While placing the microphone close to the mouth helps to some degree, it does not eliminate the problem entirely (Noyes & Frankish, 1992). In addition, running an SR system requires mastery of a complex set of rules and sequences, especially to switch between modes as in the VACS. Maintenance and upkeep of expensive devices or programs is also a consideration.
Even if the flaws could be corrected, would ECUs continue to be used? Unfortunately, according to respondents of a survey conducted in 1997, occupational therapists recommended ECUs for fewer than 25% of their clients. The primary reasons cited for lack of referral were high cost and lack of third-party payer reimbursement (Holmes, Kanny, Gutherie, & Johnson, 1997). While this is an alarming statistic, the use of SR, with or without an interfaced ECU, should not be ruled out as a way for individuals with physical disabilities to reestablish and maintain their independence.
Speech Recognition Use by People with Dysarthria
A second goal is to use SR as an interface to type or to send signals to a speech synthesizer that would translate difficult-to-understand dysarthric speech into a more recognizable form. Using SR in this manner could benefit individuals with cerebral palsy, survivors of stroke and traumatic brain injury, and those with degenerative neurologic diseases such as Parkinson's and ALS. SR may also eliminate or minimize the challenges that persons with motor disorders face when attempting to manipulate the controls of augmentative communication devices. Additionally, SR could improve interactions by increasing the rate and efficiency of responses.
A barrier to successful use of SR by individuals with dysarthria is inconsistency. Severity of dysarthria not only varies across individuals; it can vary for a single speaker depending on the time of day, fatigue, stress or other personal and environmental factors. Thus, the effectiveness of SR may vary at any time or place. Research has demonstrated some success of SR with speakers with dysarthria, but reports indicate a rapid decrease in performance for vocabulary sizes exceeding 30 words (Schmitt & Tobias, 1986). Several more recent articles investigating efforts to assist speakers with dysarthria are discussed in detail here.
Coleman and Meyers (1991) used a structured set of stimuli to compare computer recognition capabilities for dysarthric and nondisabled speakers. A total of 23 Australian subjects participated: 10 dysarthric speakers with cerebral palsy (CP) and 13 nondisabled speakers ranging in age from 20 to 53 years. Severity of dysarthria for subjects with CP was established using the Assessment of Intelligibility of Dysarthric Speech. Average intelligibility was 63.6% for words and 52.0% for sentences. The Shadow VET/2 speech recognition system was installed on an Apple IIe computer. Stimuli included 12 consonants paired with a neutral vowel, all 12 Australian vowels in an h-d environment, and 12 hard words and 12 easy words from Tikofsky's list. For example, one of the hard words was 'platform' and one of the easy words was 'chant'. Each list was randomized and made up one set of stimuli. During training, participants repeated each item five times. In recognition testing, each item was spoken three times in random order: the 'tester' read the stimulus items in random order and the subject repeated each item. The tester documented whether the system recognized the intended item correctly. If another item was recognized, a note was made. Training and testing of each stimulus set occurred during the same session.
The researchers concluded that while the total number of correct recognitions was fewer for the dysarthric speakers, a similar pattern of recognition was present for both groups. Both had significantly fewer correct recognitions for consonants than for vowels; in addition, errors of place were greater for both groups as compared to recognition errors associated with manner and voicing. These similar patterns give hope that general improvement in speech recognition systems to help nondisabled speakers will also improve recognition for dysarthric speakers. At present, for speech recognition to be successful with dysarthric speakers, Coleman and Meyers indicate a need for adjustments to be made to the input signal or the instrumentation of the recognition system. Future research needs to address the specifics of each and determine what adjustments may facilitate improvement.
Ferrier, Jarrell, Carpenter, and Shane (1992) conducted a case study of a dysarthric speaker using the DragonDictate Voice Recognition System (Dragon Systems, Inc.). They hypothesized that there is a range of speech intelligibility that is most recognizable by the DragonDictate Voice Recognition System. Specifically, their research questions were as follows. What is the potential recognition level that can be achieved by a dysarthric user with cerebral palsy compared to normal speakers? How much time is involved in reaching the maximal level of recognition? What are the speech and voice features associated with lower levels of recognition? What phonetic characteristics predict a lower recognition level? Does word length affect recognition? Finally, is there a difference in a subject's rate and accuracy when using DragonDictate compared to a manual access computer? A longitudinal study of recognition levels in five subjects (one male with a diagnosis of cerebral palsy (CP) and mild dysarthria, and four normal speakers, two male and two female) was done first. This was followed by analysis of the speech and voice features of the unrecognizable words spoken by the dysarthric speaker.
Results for the research questions were as follows. It took six sessions for the user with CP to reach 100% intelligibility, whereas it took the normal speakers two to four sessions. The overall learning pattern for speech recognition of the dysarthric speaker was similar to that of the normal speakers. After the first session, intelligibility was between 85% and 95% for the dysarthric speaker and between 90% and 100% for the normal speakers. Baseline intelligibility, using the Computerized Assessment of Intelligibility of Dysarthric Speakers (CAIDS) by Beukelman, Yorkston, & Traynor (as cited in Ferrier, Jarrell, Carpenter, & Shane, 1992), was 84%, slightly below the performance obtained at the second session using DragonDictate. This is a positive indication that the CAIDS may be an accurate predictor of how successful a user will be with DragonDictate (Dragon Systems, Inc.). Imprecise consonants, low loudness, hypernasality, insufficient prosody, slow rate, equal stress and final consonant deletion were all characteristics associated with lower recognition scores.
Doyle et al. (1997) compared the recognition of dysarthric speech by a computerized voice recognition system and by adult human listeners without hearing impairment. Intelligibility ratings were obtained for six dysarthric speakers and six matched controls. The researchers were interested in patterns of recognition rather than accuracy alone. The IBM VoiceType recognized non-dysarthric speakers with greater accuracy than the age- and gender-matched dysarthric speakers; however, the learning curves of the two groups were not significantly different. Gradual improvements were made at each of the five sessions across both groups of speakers. Had training continued beyond five sessions, the pattern of increasing recognition accuracy would likely have continued. The human adult listeners were 100% accurate for stimuli produced by control speakers. Intelligibility scores for dysarthric speakers were 94-96% for mildly dysarthric, 90-94% for moderately dysarthric and 18-85% for severely dysarthric speakers. Of interest is that the human listeners judged the mildly and moderately dysarthric speakers to be quite similar. This contradicts the results obtained on the CAIDS used to classify the subjects, and the discrepancy needs to be investigated further. Human listeners showed no trend of improvement over sessions like that seen with the VoiceType recognition system. Overall results indicate that the IBM VoiceType system gradually improves in recognition accuracy across sessions while human adult listeners' judgments of intelligibility remain stable.
A different perspective on using SR with people with severe dysarthria is the use of a small set of utterances to elicit reliable recognition. The long-term goal is to improve the individual's ease of computer access and execution (Goodenough-Trepagnier & Rosen, 1991). The user does not have to speak recognizable words. Rather, a set of vocalizations is recognized by the system to improve the performance of job-related tasks. Goodenough-Trepagnier and Rosen hoped to distinguish classes of participants for which particular strategies might be successful. For example, individuals with pitch and volume difficulties may achieve higher recognition with a specified set of vocalizations than speakers whose primary deficit is extended duration of sounds. Each speaker would require individualized assessment; however, if 'classes' could be established, the time required during evaluation to identify optimal sets of "speech acts" that a speaker may use effectively could be reduced.
Finally, a similar study by Treviranus, Shein, Haataja, Parnes & Milner (1991) examined the use of speech recognition in combination with scanning to increase the rate of computer input by individuals who are functionally non-speaking. As previously mentioned, not all SR systems require that intelligible words be used as stimuli. Specific vocalizations can be assigned to perform direct selection tasks. The goal was to increase the rate of access while minimizing any additional cognitive processing demands. Two methods of access, scanning alone and scanning combined with SR, were compared for six participants ranging in age from 5 to 21 years. Results for a 12-year-old boy are described. He could consistently produce three discrete, repeatable vocalizations. The productions "ma", "hey", and "heya" were assigned to functions such as deleting the last selection, skipping to the second half of the rows or the second half of the columns, and selecting the row of verbs. Results indicated that the participant made 3.6 errors when scanning alone and 3.9 errors when using scanning combined with SR. Vocalizations were repeated on average 1.6 times before they were recognized. After five sessions, the participant was making only 2.5 selections per minute on average using scanning alone, as opposed to 3.4 selections per minute with scanning and voice combined. Thus far, it seems there are gains to be made by combining scanning and SR, even with limited vocalizations. Clearly, further research with a variety of individuals is required before any conclusions can be drawn about the significance of such a program.
The use of SR with speakers with dysarthria has much potential for growth. Continued research is needed to assess the effectiveness of various SR systems with speakers whose speech varies in severity and intelligibility. In addition, all of the studies discussed must be replicated across larger numbers of individuals before any firm conclusions can be drawn about the benefits of using SR with speakers with dysarthria.
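As a rough worked example, the selection rates reported for this participant (2.5 versus 3.4 selections per minute) can be turned into a relative gain. Only the two rates come from the study as summarized here. The variable names and the projection to an hour of use are illustrative assumptions, and a single participant's figures say nothing about statistical significance.

```python
# Selection rates reported for one participant after five sessions.
scanning_only = 2.5        # selections per minute, scanning alone
scanning_with_voice = 3.4  # selections per minute, scanning plus SR

# Relative improvement of the combined method over scanning alone.
relative_gain = (scanning_with_voice - scanning_only) / scanning_only
percent_gain = round(relative_gain * 100)

# Illustrative projection: extra selections over an hour of continuous use,
# assuming (hypothetically) that the per-minute rates stay constant.
extra_selections_per_hour = round((scanning_with_voice - scanning_only) * 60)

print(f"Relative gain: {percent_gain}%")
print(f"Extra selections per hour: {extra_selections_per_hour}")
```

Read this way, the combined method was roughly a third faster for this participant, while the error counts for the two methods (3.6 and 3.9) were close.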
Speech Recognition for Drill Practice
Another area of interest is computer-based training for impaired speech. The goal is implementation of a low-cost speech training aid using commercial technology. Research was initiated over 25 years ago with a focus on improving the speech of hearing-impaired individuals (Watson, Reed, Kewley-Port, & Maki, 1989). Bernstein, Goldstein and Mahshie (1988) cited linguistic, cognitive and attention components as the sources of initial failures. With those issues in mind, several researchers have developed taxonomies for classifying computer-based speech training systems over the last 10 years (Mahshie, Vari-Alquist, Waddy-Smith, & Bernstein, 1988; Bernstein, 1989; Watson & Kewley-Port, 1989). This section discusses these classification systems and considers the assessment and intervention uses of various training systems.
Bernstein (1989) and Watson and Kewley-Port (1989) each suggested a taxonomy for speech training systems. Bernstein classified speech training systems into three categories based on the kinds of knowledge about speech that they incorporate. Class A systems utilize calibration of physical measures, acoustic or physiologic, as well as knowledge of production and perception. That is, a text display can be used to demonstrate a relationship between acoustic or physiologic measures and recognition accuracy. These systems also rely on knowledge or perceptions obtained from listeners or 'judges'. Class B systems incorporate calibrated analytical displays, such as spectrograms, of the kind more typically used by engineers. Physical signal attributes can be obtained, but these systems provide no information about how accurately the signal is perceived. They also do not compensate for the attention, cognitive and perceptual limitations of children. Finally, Class C systems consist of speech signal-to-visual transformations but lack explanations of speech production accuracy. A Class C system may be used to teach volume control, with the display showing the user the level of intensity as defined by color. The Visi-Pitch by Kay Elemetrics is one of many Class C systems on the market. Bernstein views Class A systems as the "goal for the future".
Watson and Kewley-Port have developed a much more complex taxonomy for classification of computer-based speech training systems. They classified 48 systems according to the physical source of feedback, the standard of evaluation against which new productions are judged, and the amount and type of detail provided. The physical source was further divided into electrophysiological, articulatory and acoustic information. The standard of evaluation was either speech produced by someone other than the trainee, often the speech professional, or selected exemplars from the trainee's 'best speech'. In fact, the ISTRA (Watson, Reed, Kewley-Port, & Maki, 1989) uses the trainee's speech as the standard of evaluation. Finally, detail may be limited to pitch or amplitude comparisons across time, or may be more complex, as in a spectrogram. Readers interested in more information about either taxonomy are referred to the sources cited.
Mahshie, Vari-Alquist, Waddy-Smith, and Bernstein (1988) developed two interrelated computer-based speech training aids: the Speech Training Station (STS) for assessment and intervention in the clinic, and the Speech Practice Station (SPS) for independent practice in the home, incorporating a game format. These would be Class A systems according to Bernstein's taxonomy. The goals during development of these systems were twofold: 1) assessment of skills through an objective measure and 2) practice through drill in a game format. The STS could provide the therapist with feedback on physiological parameters, a capability not incorporated into the SPS. Six games were generated to teach vocalization, production of repeated syllables and control of voice intensity and fundamental frequency (F0).
A limited clinical evaluation of the STS and SPS was completed over fifteen months. Fifteen subjects participated in the evaluation. All children routinely wore hearing aids and had no other known handicaps. Individual or small group treatment was provided twice a week for 20 minutes by one of two clinicians.
Subjective and objective observations of the STS were as follows. Both clinicians reported that the system was easy to learn and viewed its capacity to be individualized to the needs of the child as favorable. Several factors contributed to inconsistent reliability ratings between the computer system and the clinicians. First, the sensitivity of the computer program enabled it to detect continuous voicing that the clinicians could not hear. Second, the computer focused on a single attribute at a time, while the clinician could provide feedback about other speech characteristics regardless of whether they were the target of the game. In addition, because a fixed microphone placement was not used initially, loudness levels varied and were mixed with ambient noise, and this inconsistency also caused misreadings.
Mahshie, Vari-Alquist, Waddy-Smith, and Bernstein felt that observations of the children's behavior indicated a positive response to the system. Children practiced independently and used the computer even when supervision was not provided. It was determined that the children spent more time practicing their speech than they might have had games not been incorporated. The capacity to individualize the system to each child's needs minimized frustration yet presented a challenge.
Several clinical benefits of the aid were noted. It could be used with a wide age range. Time devoted to developing 'fun' therapy activities was no longer required. The aid could be used alone or combined with other, more traditional therapy techniques. If two or more children worked together, pragmatics such as turn taking were indirectly enhanced. The objective measures of F0, duration and intensity levels provided feedback to both child and clinician. Because visual displays were based on a single speech parameter, the clinician learned to focus on the target. Both child and clinician seemed to benefit from the format of the training aid.
Supplemental use of the SPS in the home was found to produce gains in speech production that otherwise would not have occurred. Use of the practice station ranged from 82 minutes to 185 minutes. The ability to adjust the parameters of the system allowed for consistency between training and home practice activities. The entire family frequently participated in practice sessions. According to Ling, this is significant because family interest is critical for facilitation of spoken language (as cited in Mahshie, Vari-Alquist, Waddy-Smith, & Bernstein, 1988). Although the clinicians observed the children to be thoroughly engaged in the activities during training, parents felt that a greater variety of games should be provided. When interpreting the results, it should be kept in mind that parents were not trained in what behaviors and responses to observe. The existence of any parental concerns warrants further investigation of the effectiveness of speech training aids if they are to be used on a regular basis for home-based practice.
Watson and Kewley-Port developed the Indiana Speech Training Aid I (ISTRA) (1989) to compare computer-based ratings of words with average ratings from a jury of human listeners. Specifically, they wanted to determine whether the correlation between ratings of speech quality by the speech recognizer and by humans was high enough for the recognizer's goodness-of-fit metric to be considered a reasonable alternative to human feedback in certain types of drill sessions.
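The kind of comparison described here can be sketched as a correlation between a recognizer's scores and averaged human ratings for the same words. The sketch assumes, for illustration only, that agreement is summarized with a Pearson correlation coefficient. The data values, scales and names are hypothetical and are not taken from the ISTRA study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical data: one entry per spoken word.
recognizer_scores = [0.42, 0.55, 0.61, 0.70, 0.88]  # goodness-of-fit, assumed 0-1 scale
human_ratings = [2.0, 3.0, 3.5, 3.0, 5.0]           # jury average, assumed 1-5 scale

r = pearson_r(recognizer_scores, human_ratings)
print(f"Correlation between recognizer and human ratings: r = {r:.2f}")
```

A coefficient near 1 would suggest that the recognizer ranks productions much as the human jury does. How high the value must be before automated feedback can stand in for a clinician is a clinical judgment that the calculation itself cannot settle.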
Results demonstrated that the CSRB was more consistent than human listeners. However, the overall average correlation showed that humans and the CSRB agreed equally well about the quality of productions. Although a moderate to strong correlation was obtained, the authors recommended limited clinical use of the CSRB for practice/drill under carefully controlled conditions until further research is completed.
Most research related to computer-based speech drills has focused on individuals with hearing impairments. However, Jones (1998, manuscript in preparation) is one of the first to assess the use of similar drill practice with traumatic brain injury survivors with persistent dysarthria. The purpose of the study was twofold. First, the researcher wanted to determine whether a relationship exists among evaluations of phonetic intelligibility by trained human listeners, untrained human listeners, and a computer. Second, the researcher wanted to analyze recognition patterns in the articulatory performance of individuals with dysarthria.
Distinguishable, although not always significant, differences were observed. For all four speakers, no significant differences were noted between the observations of trained and untrained human listeners. The SR system, however, was significantly (p<.05) lower than human listeners in accuracy of recognition for two of the four speakers. The speech of the fourth participant was perceived at a higher level of accuracy by the SR system than by human listeners. Overall patterns of correct and incorrect recognitions were similar across trained and untrained human listeners but differed from those of the SR system. It seems that the human listeners, whether or not they were trained, were using different criteria than the SR system to ascertain intelligibility. Identifying these differences may help improve the accuracy with which SR systems recognize dysarthric speech in the future.
While the use of SR to enhance drill practice is not new, it is not yet perfected either. Using SR for drill practice does not have to be limited to individuals with hearing impairment or dysarthria. Perhaps adults with varying degrees of aphasia could benefit from using SR in a similar fashion. Considerations to minimize the cognitive load, as suggested by Bernstein, Goldstein, and Mahshie (1988), are still necessary to improve the success of drill practice across multiple etiologies.
Speech Recognition and Learning Disabled Students
The use of speech recognition as a compensatory strategy for individuals with learning disabilities did not receive much attention until the early to mid 1980s, making it a relatively new area. SR has the capacity to enhance the written composition of learning disabled (LD) students. According to MacArthur, Graham and Schwartz (as cited in De La Paz, in press), LD students are, by the nature of LD alone, prone to making more spelling, punctuation, and capitalization errors. Continual pausing to correct these frequent errors while composing interrupts the train of thought. The student may forget the message he or she initially wanted to express and become frustrated. In addition, it takes the LD student longer to generate written work than peers who are not LD. According to De La Paz and Graham (as cited in De La Paz, in press), LD students use a simplified vocabulary when writing to avoid spelling errors even though they may want to use a more difficult word. Finally, these students tend to have a negative attitude about writing in general (De La Paz, in press). Using SR may allow the student to focus on planning and generating content rather than on the mechanics of writing. In addition, SR has the potential to increase the rate of production and to improve the overall writing experience, which LD students traditionally avoid.
The most recent research is by De La Paz (in press). The purpose of her study was to provide a rationale for using dictation with LD students as a means to enhance their written composition skills. De La Paz included several suggestions for improving the quality of writing via dictation. She emphasized the need for advance planning. Ideas or key words should be generated in an outline format or as notes to refer to when dictating (De La Paz & Graham as cited in De La Paz, in press; Reece as cited in De La Paz, in press; Wetzel, 1996). Students will initially need support from teachers or special educators as they learn pre-planning techniques and learn to set up and operate the SR program appropriately.
Combining SR with speech synthesis is another way to help LD students learn how to correct their own errors. Programs such as the Kurzweil 3000 (Kurzweil Educational Systems, Inc.) can read text that has been scanned into its system. The auditory component may improve the student's awareness of grammatical and spelling errors that would otherwise be overlooked when silently reading.
Most importantly, De La Paz emphasizes that SR is not a substitute for learning the rules of written grammar. These skills need to be mastered; SR is simply a supplementary device to make the writing process less daunting and more appealing and motivating to the LD student.
De La Paz also reminds the reader of the cognitive component that must be considered when determining whether SR is an appropriate alternative for individuals with learning disabilities. Careful, precise speech must be used to obtain the greatest accuracy in recognition. It may be difficult for students to remember that even the smallest cough or throat clear will be picked up by the microphone and interpreted, resulting in words that were not intended to be part of the manuscript in progress. SR systems are complex, with a large number of commands that must be learned to achieve efficient and accurate results. The process varies depending on whether the user is dictating or editing text. Initial training can be long, frustrating, and even tedious. However, with a positive attitude and support from educators, gradual mastery of the process is attainable.
Several researchers have compared the effectiveness of SR systems to other modes of composition across an array of age levels. Higgins and Raskind (1995) assessed the performance of LD college students. Writing without assistance, writing with the assistance of a transcriber, and writing using SR were compared. Findings showed that the use of SR was significantly more effective than writing without assistance. Students used more words containing seven or more letters during dictation than in writing without assistance. However, writing with the assistance of a human transcriber was just as effective as SR. Although there were no significant differences between the transcriber and SR, SR may foster greater independence and eliminate the additional cost of paying a transcriber.
Wetzel (1996) described a case study of one sixth-grade student learning to use the VoiceType (IBM) system. After only four sessions, recognition accuracy was 74%. It is plausible that accuracy would have increased with continued sessions. Dictation rate was 5.5 words per minute at the final session. An average rate of written composition was not provided, so the two rates could not be compared and no conclusion about improved rate with SR was stated.
Learning correction procedures produced mixed results. If the desired word was present on the selection menu, the student learned how to correctly select it. However, if the match for the desired word was not found, the student had to attempt to spell the target word. Recall that spelling is usually a challenge for students with LD. VoiceType (IBM) used word prediction as a way to alleviate the problems with spelling. That is, as the student began spelling the intended word, a correction menu appeared on the screen. This menu displayed a list of guesses from which the user could select the correct word if present. The student occasionally showed frustration during this process by sighing. The sighs and other extraneous sounds such as sniffling or throat clearing were then misinterpreted by the system as words. This could lead to a cycle of increasing frustration. The entire correction procedure required great concentration and ongoing monitoring. For this particular student, SR did not appear to be superior to traditional writing after the limited number of training sessions.
Because of the cognitive demands of using the system efficiently and accurately, Wetzel recommended that recognition accuracy be improved before middle school aged students with learning disabilities use dictation as an alternative. Other suggestions from Wetzel can be applied to all students with LD when incorporating SR as a means to facilitate composition. He recommends more frequent training sessions over a longer period of time. Providing the student with strategies to enhance self-monitoring of extraneous noises, such as turning off the microphone temporarily, may also help. Ultimately, Wetzel recommended that SR should be an alternative only when the system can recognize the user's speech at a level of 90% accuracy or better. Fewer errors mean fewer corrections, and the correction procedure in and of itself often causes students to lose concentration. Finally, the rate of transcription should exceed that of writing with a pencil. These guidelines should be modified as appropriate depending on the user's abilities. All students are individuals. What works best for one student may not work for the next.
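The two quantitative guidelines attributed to Wetzel above (recognition accuracy of at least 90% and a dictation rate that exceeds the handwriting rate) can be expressed as a simple check. The following Python sketch is illustrative only: the function name, its parameters, and the handwriting rate used in the example are assumptions introduced here and do not come from Wetzel (1996).

```python
def meets_dictation_guidelines(recognition_accuracy, dictation_wpm,
                               handwriting_wpm, min_accuracy=0.90):
    """Hypothetical helper: check the two guidelines described in the text.

    recognition_accuracy: proportion of words recognized correctly (0.0 to 1.0)
    dictation_wpm: words per minute produced with the SR system
    handwriting_wpm: words per minute produced with pencil and paper
    min_accuracy: threshold from the text (90% accuracy or better)
    """
    accurate_enough = recognition_accuracy >= min_accuracy
    faster_than_pencil = dictation_wpm > handwriting_wpm
    return accurate_enough and faster_than_pencil


# Values reported for the case study: 74% accuracy and 5.5 words per minute.
# The handwriting rate was not reported, so 10.0 is a made-up placeholder.
print(meets_dictation_guidelines(0.74, 5.5, 10.0))
```

Because the reported accuracy of 74% falls below the 90% threshold, the check fails for the case study student whatever handwriting rate is assumed. As the text notes, such thresholds are guidelines to be adjusted to the individual user rather than fixed rules.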
Future Research
Overall improvements in speech recognition have the capacity to positively affect the lives of all individuals who use it. Regardless of the nature of the diagnosis, improved technology, engineering, and design offer the potential for improved accuracy and rate of dictation. Present fluctuations in performance are not strictly a result of the engineering of the system itself, but may be due to inconsistencies in the user's speech. Therefore, the user's preferences and expectations seem vital in the development and research of new tools. Suggestions and future research related to the technology cannot be tied to a single group of users. Haigh and Clarke (1988) recommended that ongoing research assess the following issues: ways to reduce the effects of voice drift, development of simplified instructions, and improved hardware design.
If use of SR systems by people with cognitive deficits and learning disabilities is to prove beneficial, additional research must be done to determine if and how the cognitive load can be minimized. Assessment of effectiveness, efficiency and independent execution across diagnoses, age ranges, and multiple systems is needed as well. This issue also applies to use of SR by individuals who are functionally non-speaking. The amount of training required, frequency of training sessions and methods to promote generalization of SR to activities of daily living pertinent to each user are also in need of investigation.
References
Bernstein, L., Goldstein, M., & Mahshie, J. (1988). Speech training aids for hearing-impaired individuals. Journal of Rehabilitation Research and Development, 25, 53-62.
Coleman, C., & Meyers, L. (1991). Computer recognition of the speech of adults with cerebral palsy and dysarthria. Augmentative and Alternative Communication, 7, 34-42.
Damper, R. (1984). Voice-input aids for the physically disabled. International Journal of Man-Machine Studies, 21, 541-553.
De La Paz, (in press). Composing via dictation and speech recognition systems: compensatory technology for students with learning disabilities. Learning Disabilities Quarterly.
Doyle, P., Leeper, H., Kotler, A., Thomas-Stonell, N., O'Neill, C., Dylke, M., & Rolls, K. (1997). Dysarthric speech: A comparison of computerized speech recognition and listener intelligibility. Journal of Rehabilitation Research and Development, 34(3), 309-316.
DragonDictate [computer software]. (1990). Newton, MA: Dragon Systems.
Ferrier, L.J., Jarrell, N., Carpenter, T., & Shane, H.C. (1992). A case study of a dysarthric speaker using the DragonDictate Voice Recognition System. CUSH, **.
Trapagnier, C. & Rosen, M.J. (1991). Towards a method for computer interface design using speech recognition. Paper presented at RESNA 14th Annual Conference, Kansas City, MO.
Holme, S., Kanny, E., Gutherie, M., & Johnson, K. (1997). The use of environmental control units by occupational therapists in spinal cord injury and disease services. American Journal of Occupational Therapy, 51, 42-48.
Jones, B. (1998). Acoustic variability and computer recognition in dysarthria. Manuscript in preparation, University of Nebraska at Lincoln.
Kurzweil 3000 [computer software]. (1998). Waltham, MA: Kurzweil Educational Systems, Inc.
Lange, H. (1993). Speech synthesis and speech recognition: Tomorrow's human-computer interfaces? Annual Review of Information Science and Technology (ARIST), 28, 153-185.
Lee, K., Hauptmann, A., & Rudinicky, A. (1990). The spoken word: Replace the "look and feel" of GUIs with the "ask and tell" of voice interfaces. Byte, July.
Ling, D. (1976). Speech and the Hearing Impaired Child: Theory and Practice. Washington DC: The Alexander Graham Bell Association of the Deaf, Inc.
Mahshie, J., Vari-Alquist, D., Waddy-Smith, B., & Bernstein, L. (1988). Speech training aids for hearing-impaired individuals: III. Preliminary observations in the clinic and children's homes. Journal of Rehabilitation Research and Development, 25, 69-82.
Meisel, W. (1993). Talk to your computer: Voice technology lets you verbally command your computer or convert speech to text. Byte, 18, 113-120.
Noyes, J., & Frankish, C. (1992). Speech recognition technology for individuals with disabilities. AAC Augmentative and Alternative Communication, 8, 297-303.
Schmitt, D., & Tobias, J. (1986). Enhanced communication for a severely disabled dysarthric individual using voice recognition and speech synthesis. Proceedings of the 9th Annual RESNA Conference, Minneapolis, MN. Washington, DC: RESNA Press, 304-306.
Stevens, K. (1960). Toward a model for speech recognition. The Journal of Acoustical Society of America, 32, 47-55.
Treviranus, J., Shein, F., Haataja, S., Parnes, P., & Milner, M. (1991, month unknown). Speech recognition to enhance computer access for children and young adults who are functionally non-speaking. RESNA 14th Annual Conference, Kansas City, MO.
VoicePad Platinum [computer software]. (1997). Waltham, MA: Kurzweil Educational Systems, Inc.
VoiceType [computer software]. (1992). Dayton, NJ: IBM Direct.
Watson, C., & Kewley-Port, D. (1989). Advances in computer-based speech training: Aids for the profoundly hearing impaired. Volta Review, 91, 29-45.
Watson, C., Reed, J., Kewley-Port, D., & Maki, D. (1989). The Indiana Speech Training Aid (ISTRA) I: Comparisons between human and computer-based evaluation of speech quality. Journal of Speech and Hearing Research, 32, 245-251.
Wetzel, K. (1996). Speech-recognizing computers: A written-communication tool for students with learning disabilities? Journal of Learning Disabilities, 29, 371-380.
Youdin, M., Sell, G., Reich, T., Clagnaz, M., Louie, H., & Kolwicz, R. (1980). A voice controlled powered wheelchair and environmental control system for the severely disabled. Medical Progress Through Technology, 7, 139-143.