Introduction
Technology continually amazes
society with the annual improvements in the capabilities of computer
systems. Just when it seems that the top-of-the-line speech recognition (SR)
program has been put on the market, another
company is in the process of developing a more sophisticated product. In
fact, today, you can buy a computer with SR capabilities built into the
hard drive. Don’t be mistaken though, SR is not a new entity. It first
evolved over 30 years ago (Stevens, 1960). This continued growth
in technology opened the doors to various applications of SR. Not only
is SR a popular medium for use by professionals in the working world, it
is also an exceptional tool for people with disabilities.
Four main areas of interest
have received attention from professionals working with individuals with
disabilities. The first area researched was use of SR systems by people
having physical disabilities but no speech impairment. Investigation in
a second area expanded to include use of SR by individuals with impaired
or dysarthric speech. SR was viewed as having the potential to make
hard-to-understand speech more easily recognizable. The third area explored
was use of SR for drill practice. Could individuals with motor speech disorders,
as well as those with hearing impairments, improve their intelligibility
using SR programs that provided feedback of some sort? The last and most
recent area studied by researchers is use of SR by students with learning
disabilities as an aid to improve the effectiveness of written comprehension.
This paper will provide preliminary information about the basic components
of speech recognition systems; however, the primary content will focus
on a bit of the history behind each of the four areas, recent research
findings, and suggestions for future research.
System Features and Capabilities
Features of SR systems have
progressively advanced over the last 35 years. Initially, in 1972, dictation
and word processing systems were combined to formulate the first SR system
(Lange, 1993; Meisel, 1993). At this point, systems could only handle discrete
speech dictation where pauses between every word spoken were required for
the signal to be processed. Today, it is difficult to find any programs
that still use discrete speech. VoicePad Platinum (Kurzweil Educational
Systems, Inc.) is one of the last programs of this kind. Most programs
have the capacity to handle continuous speech where the speaker talks naturally,
without the need to pause between every word. One disadvantage is
that continuous systems do not make ongoing adaptations to the user’s speech
whereas discrete systems do (De La Paz, in press). That is, the stored
templates of the user’s speech will not be automatically adjusted in continuous
systems. Rather, repeated training sessions are a prerequisite for effective
use. The capacity for vocabulary size ranges anywhere from 13 to 40,000
words (Lee, Hauptmann, & Rudnicky, 1990). SR programs may be speaker
dependent or speaker independent. The former requires the user to train
the system to develop a recognition template of words. This template is
user specific and is accessed each time the user activates the system.
Speaker independent systems, on the other hand, use previously stored templates
that are provided by the manufacturer.
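
The difference between speaker-dependent and speaker-independent recognition can be illustrated with a minimal sketch. It assumes a deliberately simplified model in which each utterance is reduced to a short feature vector and a word is recognized by finding the nearest stored template. The feature values, the function names, and the nearest-template rule are all assumptions made for illustration; they are not taken from any of the systems discussed in this paper.

```python
import math

def mean_vector(vectors):
    """Average several equal-length feature vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def train_templates(samples):
    """Build one template per word from a user's repeated utterances.

    samples maps each word to a list of feature vectors (hypothetical
    acoustic features; real systems use far richer representations).
    """
    return {word: mean_vector(vectors) for word, vectors in samples.items()}

def recognize(features, templates):
    """Return the word whose stored template is closest to the input."""
    return min(templates, key=lambda word: math.dist(features, templates[word]))

# Speaker dependent: templates are built from the user's own training utterances.
user_samples = {
    "yes": [[1.0, 0.2], [0.9, 0.3], [1.1, 0.1]],
    "no": [[0.1, 1.0], [0.2, 0.9], [0.0, 1.1]],
}
user_templates = train_templates(user_samples)

# Speaker independent: templates ship with the product (values invented here).
factory_templates = {"yes": [0.8, 0.4], "no": [0.3, 0.8]}

print(recognize([0.95, 0.25], user_templates))
print(recognize([0.95, 0.25], factory_templates))
```

In this sketch the only difference between the two kinds of system is where the templates come from, which is why a speaker-dependent system cannot be used until the training step has been completed.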
In addition to the many internal
components, there are external factors that will affect the accuracy of
SR as well. A microphone headset is worn by the user and adjusted properly
to minimize background noise and maximize the quality of the speech signal.
Some SR systems are packaged with their own headset while other products
require the user to purchase the headset separately. Several types of headsets
are available on the market. They range in quality as well as price. Unlike
face-to-face speech communication, an SR system depends solely on the speech
signal. Limited knowledge about the speaker and the context of the topic can be
used to assist in deciphering the message. Each system has a complex set
of commands that must be mastered in order to effectively execute the program
and correct errors. Learning to use an SR program takes time and patience.
In addition to mastering
the components and language of the system, the user must also know how
to set up and run the computer on which it is based. This can be an overwhelming
and frustrating process for users without disabilities, let alone for
individuals faced with additional challenges. Both advantages and drawbacks
have emerged from exploring the use of SR systems by those with physical
disabilities, by those with dysarthria, as a tool for drill practice, and
by students with learning disabilities.
Speech recognition use by the physically disabled
Speech recognition systems
were first used by severely disabled individuals with normal speech.
The goal was to promote independence whereby SR was used to convert human
speech signals into effective actions. Frequently, speech is the only remaining
means of communication left for these individuals. The first voice activated
wheelchair with an environmental control unit (ECU) was developed in the
late 1970s at Rehabilitation Medicine in New York (Youdin, et al., 1980).
The user could operate multiple items including the telephone, radio, fans,
curtains, intercom, page-turner and more. A group of individuals
with cerebral palsy rated the wheelchair as superior to breath control
systems because it eliminated the need for scanning, allowing the user
quicker access by directly selecting the desired function with voice.
Grunza and Cohen (as cited
in Noyes & Frankish, 1992) developed another system in 1977, the Voice
Activated Control System (VACS). The VACS is composed of a microphone,
hardware preprocessor and feature extractor, minicomputer, electronic display,
a Teletype and relay interface. It is a speaker-dependent system, trained by
the user, with a capacity for 99 words. The visual display allows
the user to confirm all commands before they are executed. It is composed
of three functional modes: control, type, and calculate. Only those words
valid for the current mode will be accepted by the system; all other words
will be rejected and an indication light will alert the user of an invalid
function. Control mode activates up to 12 automated devices that are interfaced
with the VACS. The type mode allows the user to compose written text using
the phonetic alphabet. There is a capacity for up to 20 vocabulary words
to be pre-stored and entered by a single utterance. All other words must
be spelled letter by letter, a fatiguing and time consuming process. If
the user attempts to speak a word that is stored for use in another mode,
it may be rejected by the system. Finally, calculate mode will execute
addition, subtraction, multiplication and division. Although not perfect,
the introduction of this device in the 1970s gave continued promise that
technology could enhance the lives of individuals with disabilities.
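
The mode restriction described above for the VACS, in which only words valid for the current mode are accepted, can be sketched as follows. This is an illustrative sketch only: the word lists are invented, because the actual VACS vocabularies are not given here, and the function name is hypothetical.

```python
# Hypothetical vocabularies; the real VACS word lists are not given in the text.
VOCABULARY = {
    "control": {"telephone", "radio", "fan"},
    "type": {"alpha", "bravo", "charlie"},
    "calculate": {"add", "subtract", "multiply", "divide"},
}

def process(word, mode):
    """Accept a recognized word only if it is valid in the current mode."""
    if word in VOCABULARY[mode]:
        return ("accepted", word)
    # In the VACS, an indicator light alerts the user to an invalid function.
    return ("rejected", word)

print(process("radio", "control"))
print(process("radio", "calculate"))
```

Restricting the active vocabulary in this way keeps the number of candidate words small in each mode, at the cost of requiring the user to remember which words belong to which mode.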
Application of ECUs in the
home was later investigated. This was first considered by Damper in 1984 with
the development of the Voice Activated Domestic Appliance System (VADAS).
The VADAS is a speaker dependent, isolated word recognizer with the capacity
to control up to 16 household appliances (Noyes & Frankish, 1992).
A similar notion was the use of speech to control robotic arms. Research
was initiated in this area in 1981 by Engelhardt, et al. (1984) and more
recently by the Palo Alto Veterans Administration Medical Center (Noyes
& Frankish, 1992). The goal of the robotic arm was to assist individuals
with household tasks including kitchen preparation, recreational tasks
and even therapeutic skills. Few complaints were voiced about the actual
functioning of the system; rather, the greater concern was frequent misrecognitions
of speech.
Finally, a more recent consideration
of SR use by individuals with severe disabilities occurred in the medical
setting. In this environment SR can serve two purposes. First, it is used
to activate functional electrical stimulation (FES) systems. This allows
the patient to maintain a routine as directed by the physical therapist
without the consistent involvement of a nurse. This can be motivational
to the patient, providing a degree of control in their rehabilitation recovery
process. Secondly, the use of computers to assist disabled patients to
complete surveys or questionnaires has also been considered (Noyes &
Frankish, 1992). Verbal responses would eliminate the need for writing,
a difficult and sometimes impossible task for many individuals with physical
disabilities.
While systems that combine
SR with ECUs have alleviated many frustrations for users, several issues
still require resolution. Frequent misrecognitions and non-recognitions
are an unresolved problem. A temporary solution is for the user to make
note of repeated failures and make changes to the stored template as necessary.
This may or may not be related to fatigue. As fatigue increases accuracy
may decrease. Perhaps the systems should be used for limited time periods
initially, to maximize accuracy, rather than using the systems for lengthy
periods but with multiple errors. As endurance improves, length of use
could be gradually increased. This may positively influence the user’s
perspective in relation to the quality and accuracy of a system. Another
factor contributing to misrecognitions is background noise. While
placement of the microphone close to the mouth can help to alleviate this
to some degree, it is not avoided entirely (Noyes & Frankish, 1992).
In addition, running SR systems requires mastery of a complex set of rules
and sequences, especially to switch between modes as in the VACS. Maintenance
and upkeep of expensive devices or programs is also a consideration.
Even if the flaws could be
corrected, would ECUs continue to be used? Unfortunately, according to
respondents of a survey conducted in 1997, occupational therapists recommended
ECUs for fewer than 25% of their clients. The primary reasons cited for
lack of referral were high cost and lack of third-party payer reimbursement
(Holmes, Kanny, Gutherie, & Johnson, 1997). While this is an alarming
statistic, the use of SR, with or without interfacing an ECU, should not
be ruled out as an option for individuals with physical disabilities as
a way to reestablish and maintain their independence.
Speech recognition use by people with dysarthria
A second goal is to use SR
as an interface to type or send signals to a speech synthesizer that would
translate difficult-to-understand dysarthric speech into a more recognizable
form. Using SR in this manner could benefit individuals with cerebral palsy,
survivors of stroke and traumatic brain injury, and those with degenerative
neurologic diseases such as Parkinson’s and ALS. SR may also eliminate
or minimize the challenges that persons with motor disorders face when
attempting to manipulate controls of augmentative communication devices.
Additionally, SR could allow improved interactions by improving rate efficiency
of responses.
A barrier to successful use
of SR by individuals with dysarthria is inconsistency. Severity of dysarthria
not only varies across individuals; it can vary for a single speaker depending
on the time of day, fatigue, stress or other personal and environmental
factors. Thus, the effectiveness of SR may vary at any time or any place.
Research has demonstrated some success of SR with speakers with dysarthria,
but reports indicate a rapid decrease in performance for vocabulary sizes
exceeding 30 words (Schmitt & Tobias, 1986). Several more recent
articles investigating efforts to assist speakers with dysarthria are discussed
in detail below.
Coleman and Meyers (1991)
used a structured set of stimuli to compare computer recognition capabilities
for dysarthric and nondisabled speakers. A total of 23 speakers of Australian
English participated: 10 dysarthric speakers with cerebral palsy (CP) and
13 nondisabled speakers ranging in age from 20-53 years. Severity of dysarthria
for subjects with CP was established using the Assessment of Intelligibility
of Dysarthric Speech. Average intelligibility for words was 63.6% and 52.0%
for sentences. The Shadow VET/2 speech recognition system used was installed
on an Apple IIe computer. Stimuli included 12 consonants paired with a
neutral vowel, all 12 Australian vowels in an h-d environment, 12 hard
words and 12 easy words from Tikofsky’s list. For example, one of the hard
words was ‘platform’ and an easy word was ‘chant’. Each list was randomized
and made up a set of stimuli. During training, participants repeated each
item five times. In recognition testing, each item was verbalized
three times in random order. The ‘tester’ read stimulus items in random
order and each subject repeated the item. The tester documented whether
the system recognized the intended item correctly. If another item
was recognized, a note was made. Training and testing of each stimulus set
occurred during the same session.
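
The kind of record keeping described in this procedure, noting for each test utterance whether the intended item was recognized and, if not, which item was recognized instead, can be sketched as a simple tally. The trial data below are invented for illustration and are not results from Coleman and Meyers; the function name and the use of `None` for a non-recognition are assumptions.

```python
from collections import Counter

def score_trials(trials):
    """Summarize (intended, recognized) pairs from a recognition test.

    recognized is None when the system returned no item at all.
    """
    correct = sum(1 for intended, got in trials if got == intended)
    confusions = Counter(
        (intended, got)
        for intended, got in trials
        if got is not None and got != intended
    )
    accuracy = correct / len(trials)
    return accuracy, confusions

# Invented example: three test repetitions each of two stimulus items.
trials = [
    ("platform", "platform"), ("platform", "chant"), ("platform", None),
    ("chant", "chant"), ("chant", "chant"), ("chant", "chant"),
]
accuracy, confusions = score_trials(trials)
print(f"accuracy = {accuracy:.2f}")
print(confusions)
```

Keeping the confusions separate from the overall accuracy is what makes it possible to compare error patterns, and not only totals, between groups of speakers.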
The researchers concluded
that while the total number of correct recognitions was fewer for the dysarthric
speakers, a similar pattern of recognition was present for both groups.
Both had significantly fewer correct recognitions for consonants than for
vowels; in addition, errors of place were greater for both groups as compared
to recognition errors associated with manner and voicing. These similar
patterns give hope that general improvement in speech recognition systems
to help nondisabled speakers will also improve recognition for dysarthric
speakers. At present, for speech recognition to be successful with dysarthric
speakers, Coleman and Meyers indicate a need for adjustments to be made
to the input signal or the instrumentation of the recognition system.
Future research needs to address the specifics of each and determine what
adjustments may facilitate improvement.
Ferrier, Jarrell, Carpenter, and Shane (1992) conducted a case
study of a dysarthric speaker using the DragonDictate Voice Recognition
System (Dragon Systems, Inc.). They hypothesized that there is a range
of speech intelligibility that is most recognizable by the DragonDictate
Voice Recognition System. Specifically, their research questions were as
follows. What is the potential recognition level that can be achieved by
a dysarthric user with cerebral palsy compared to normal speakers? How
much time is involved in reaching the maximal level of recognition? What
are the speech and voice features associated with lower levels of recognition?
What phonetic characteristics predict lower recognition level? Does word
length affect recognition? Finally, is there a difference in a subject’s
rate and accuracy when using DragonDictate compared to a manual access
computer? A longitudinal study of recognition levels in five subjects,
one male with a diagnosis of cerebral palsy (CP) and mild dysarthria, and
four normals, two male, two female, was done first. This was followed by
analysis of the speech and voice features of the unrecognizable words spoken
by the dysarthric speaker.
Results for the research questions
were as follows. It took six sessions for the user with CP to reach 100%
intelligibility, whereas it took normals two to four sessions. The overall learning
pattern for speech recognition of the dysarthric speaker was similar to
that of the normal speakers. After the first session intelligibility was
between 85-95% for the dysarthric speaker, where it reached between 90-100%
for normals. Baseline intelligibility, using the Computerized Assessment
of Intelligibility of Dysarthric Speakers (CAIDS) by Beukelman, Yorkston,
& Traynor (as cited in Ferrier, Jarrell, Carpenter, & Shane,
1992), was 84%, slightly below the performance obtained at the second session
using DragonDictate. This is a positive indication that the CAIDS may be
an accurate predictor of how successful a user will be with the DragonDictate
(Dragon Systems, Inc.). Imprecise consonants, low loudness, hypernasality,
insufficient prosody, slow rate, equal stress and final consonant deletion
were all characteristics associated with lower recognition scores.
Doyle, et al. (1997) compared
the recognition of dysarthric speech by a computerized voice recognition
system and non-hearing impaired human adult listeners. Intelligibility
ratings were obtained for six dysarthric speakers and six matched controls.
The researchers were interested in patterns of recognition rather than
accuracy alone. The IBM VoiceType recognized non-dysarthric speakers with
greater accuracy than the age and gender matched dysarthric speakers; however,
the learning curves between both groups were not significantly different.
Gradual improvements were made at each of the five sessions across both
groups of speakers. Had training continued beyond five sessions, the pattern
of increasing recognition accuracy would likely have continued. The human
adult listeners were 100% accurate for stimuli produced by control speakers.
Intelligibility scores for dysarthric speakers were 94-96% for mildly dysarthric,
90-94% for moderate and 18-85% for severely dysarthric speakers. Of interest
is that the human listeners judged the mild and moderately dysarthric speakers
to be quite similar. This is contradictory to the results obtained on the
CAIDS used to classify the subjects. The presence of this discrepancy needs
to be further investigated. Unlike the VoiceType recognition system, the
human listeners showed no trend of improvement over sessions. Overall
results indicate that the IBM VoiceType system gradually
improves in recognition accuracy across sessions while human adult listeners’
judgments of intelligibility remain stable.
A different perspective using
SR with people with severe dysarthria is the use of a small set of utterances
to elicit reliable recognition. The long-term goal is to improve the individual’s
ease of computer access and execution (Goodenough-Trepagnier & Rosen,
1991). The user does not have to speak recognizable words. Rather, a set
of vocalizations is recognized by the system to improve the performance
of job-related tasks. Goodenough-Trepagnier and Rosen hoped to distinguish
classes of participants for which particular strategies might be successful.
For example, individuals with pitch and volume difficulties may achieve
higher recognition with a specified set of vocalizations than speakers
whose primary deficit is extended duration of sounds. Each speaker would
require individualized assessment; however, if ‘classes’ could be established,
it may decrease the time required during evaluation to identify optimal
sets of "speech acts" that a speaker may effectively use.
Finally, a similar study
by Treviranus, Shein, Haataja, Parnes & Milner (1991) examined the
use of speech recognition in combination with scanning to increase the
rate of computer input by individuals who are functionally non-speaking.
As previously mentioned, not all SR systems require intelligible words
be used as stimuli. Specific vocalizations can be assigned to perform direct
selection tasks. The goal was to increase the rate of access while minimizing
any additional cognitive processing demands. Two methods of access, scanning
alone and scanning combined with SR were compared for six participants
ranging in age from 5-21 years old. Results for a 12-year-old boy are described.
He could consistently produce 3 discrete, repeatable vocalizations. The
productions "ma", "hey", and "heya" were assigned to commands such as deleting the
last selection, skipping to the second half of the rows or the second half of the
columns, and selecting the row of verbs. Results indicated that the participant
made 3.6 errors when scanning alone and 3.9 errors using scanning combined
with SR. Vocalizations were repeated on average 1.6 times before they were
recognized. After five sessions, the participant was making only 2.5 selections
on average per minute using scanning only as opposed to 3.4 selections
with scanning and voice combined. Thus far, it seems there are gains to
be made when using a combination of scanning and SR even with limited vocalizations.
Clearly, further research is required with a variety of individuals to
draw any conclusions about the significance of such a program.
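
To put the reported selection rates in perspective, the short calculation below expresses the difference between 2.5 and 3.4 selections per minute as a relative gain and as the time needed for a fixed number of selections. The 100-selection message length is an arbitrary assumption chosen for illustration, and the function names are hypothetical.

```python
def relative_gain(baseline_rate, new_rate):
    """Fractional increase of new_rate over baseline_rate."""
    return (new_rate - baseline_rate) / baseline_rate

def minutes_needed(selections, rate_per_minute):
    """Time in minutes to make a given number of selections."""
    return selections / rate_per_minute

scan_only = 2.5        # selections per minute, scanning alone (from the text)
scan_plus_voice = 3.4  # selections per minute, scanning combined with SR

gain = relative_gain(scan_only, scan_plus_voice)
print(f"relative gain: {gain:.0%}")

# Assumed message length of 100 selections, for illustration only.
print(f"scanning alone: {minutes_needed(100, scan_only):.1f} min")
print(f"scanning + SR:  {minutes_needed(100, scan_plus_voice):.1f} min")
```

Under these assumptions the combined method is roughly a third faster, although a result from a single participant cannot show whether the gain would hold for others.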
The use of SR with speakers
with dysarthria has much potential for growth. Continued research
assessing the effectiveness of various SR systems with speakers of varying
severity and intelligibility of speech is needed. In addition, replication
of all studies discussed across larger numbers of individuals is necessary
before any concrete conclusions can be drawn regarding the benefits of
using SR with speakers with dysarthria of varying severity.
Speech Recognition for Drill Practice
Another area of interest
is computer-based training for impaired speech. The goal is implementation
of a low cost speech training aid using commercial technology. Research
was initiated over 25 years ago with a focus on improving the speech
of hearing-impaired individuals (Watson, Reed, Kewley-Port, & Maki,
1989). Bernstein, Goldstein and Mahshie (1988) cited linguistic, cognitive
and attention components as the sources of initial failures. With those
issues in mind, several researchers have developed a taxonomy for classification
of computer based speech training systems over the last 10 years (Mahshie,
Vari-Alquist, Waddy-Smith, & Bernstein, 1988; Bernstein, 1989; Watson
& Kewley-Port, 1989).
Classification systems and considerations for assessment and intervention
with various training systems are discussed below.
Bernstein (1989) and Watson and Kewley-Port
(1989) each suggested a taxonomy for speech training systems. Bernstein
classified speech training systems into three categories based on the kinds
of knowledge about speech that they incorporate. Class A systems utilize
calibration of physical measures, acoustic or physiologic, as well as knowledge
of production and perception. That is, text display can be used to demonstrate
a relationship between acoustic or physiologic measures and recognition
accuracy. These systems also rely on knowledge or perceptions obtained
from listeners or ‘judges’. Class B systems incorporate calibrated analytical
displays, such as spectrograms, that would be more typical of engineers.
Physical signal attributes can be obtained but no information related to
accuracy of the perception of the signal. These systems do not compensate
for attention, cognitive and perceptual limitations of children. Finally,
Class C systems consist of speech signal-to-visual transformations but
lack explanations for the speech production accuracy. A Class C system
may be used to teach volume control where the display shows the user level
of intensity as defined by color. The Visi-Pitch by Kay Elemetrics is one
of many Class C systems on the market. Bernstein views Class A systems
as the "goal for the future".
Watson and Kewley-Port have developed
a much more complex taxonomy for classification of computer-based speech
training systems. They used categorizations based on physical source of
feedback, standards of evaluation against which new productions are judged
and the amount and type of detail to classify 48 systems. The physical
source was further divided into electrophysiological, articulatory and
acoustic information. The standard of evaluation was either speech produced
by someone other than the trainee, often the speech professional, or
selected exemplars from the trainee’s ‘best speech’. In fact,
the ISTRA (Watson, Reed, Kewley-Port, & Maki, 1989) uses the trainee’s
speech as the standard of evaluation. Finally, detail may be limited to
pitch or amplitude comparisons across time, or may be more complex as in
a spectrogram. Readers interested in obtaining more information about either
taxonomy are referred to the sources as referenced.
Mahshie, Vari-Alquist, Waddy-Smith,
and Bernstein (1988) developed two interrelated computer-based speech training
aids: the Speech Training Station (STS) for assessment and intervention
in the clinic, and the Speech Practice Station (SPS) for independent practice
in the home incorporating a game format. These would be Class A systems
according to Bernstein’s taxonomy. The goals in mind during development
of these systems were two-fold: 1) assessment of skills through an objective
measure and 2) practice through drill in a game format. The STS had the capability
to provide the therapist with feedback of physiological parameters that
was not incorporated into the SPS. Six games were generated to teach vocalization,
production of repeated syllables and control of voice intensity and fundamental
frequency (Fo).
A limited clinical evaluation of STS and
SPS was completed over fifteen months. Fifteen subjects participated in
the evaluation. All children routinely wore hearing aids and had no other
known handicaps. Either individual or small group treatment occurred twice
a week for 20 minutes by one of two clinicians.
Subjective and objective
observations of the STS were as follows. Both clinicians reported the system
as easy to learn and found its capacity to individualize to the needs of
the child as favorable. Two factors contributed to inconsistent reliability
ratings by the computer system and clinicians. First, the sensitivity of
the computer program enabled it to detect continuous voicing that the clinicians
could not hear. Secondly, the computer focused on a single attribute at
one time while the clinician could provide feedback about other speech
characteristics regardless of whether they were the target of the game. Because a
fixed placement of the microphone was not used initially, this resulted
in variations of loudness levels mixed with ambient noise. Misreadings
occurred because of this inconsistency as well.
Mahshie, Vari-Alquist, Waddy-Smith,
and Bernstein felt the observations of the children’s behavior indicated
a positive response to the system. Children practiced independently and
used the computer even when supervision was not provided. It was determined
that the children spent a greater length of time practicing their speech
than they might have otherwise had games not been incorporated. The capacity
to individualize to each child’s needs minimized frustration yet presented
a challenge.
Several clinical benefits
of the aid were noted. It could be used with a wide age range. Time devoted
to developing ‘fun’ therapy activities was no longer required. The aid
could be used alone or combined with other, more traditional therapy techniques.
If two or more children worked together, pragmatics, such as turn taking,
were indirectly enhanced. The objective measures of Fo, duration, and intensity
levels provided feedback to both child and clinician. Because visual displays
were based on a single speech parameter, the clinician learned to focus
on the target. Both child and clinician seemed to benefit from the format
of the training aid.
Supplemental use of the SPS
in the home was found to enhance results of speech production that otherwise
would not have occurred. Use of the practice station ranged from 82 minutes
to 185 minutes. The ability to adjust the parameters of the system allowed
for consistency between training and home practice activities. The entire
family frequently participated in practice sessions. According to Ling,
this is of significance because family interest is critical for facilitation
of spoken language (as cited in Mahshie, Vari-Alquist, Waddy-Smith, &
Bernstein, 1988). Although the clinicians observed the children to be thoroughly
engaged in the activities during training, parents, on the other hand,
felt a greater variety of games should be provided. When interpreting the
results, it should be kept in mind that parents were not trained on what
behaviors and responses to observe. The existence of any parental concerns
warrants further investigation of the effectiveness of speech
training aids if they are to be used on a regular basis for home based practice.
Watson and Kewley-Port developed
The Indiana Speech Training Aid I (ISTRA) (1989) to compare computer-based
ratings of words with average ratings from a jury of human listeners. Specifically,
they were interested in determining whether or not the correlation between
the ratings of speech quality by the speech recognizer and by humans was
high enough for the goodness-of-fit metric to be considered a reasonable
alternative to human feedback in certain types of drill sessions.
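
The comparison described here, relating a recognizer's goodness-of-fit scores to the average ratings of a listener jury, can be sketched with a Pearson correlation. The scores and ratings below are invented for illustration and are not data from the ISTRA studies; the choice of Pearson's r and the function names are assumptions, since the statistic used is not specified here.

```python
import math

def mean_jury_rating(ratings_per_listener):
    """Average each utterance's rating across all listeners."""
    return [sum(column) / len(column) for column in zip(*ratings_per_listener)]

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Invented data: five utterances rated by three listeners on a 1-7 scale.
jury = [
    [2, 4, 5, 6, 7],
    [1, 4, 4, 6, 6],
    [3, 3, 5, 5, 7],
]
# Invented recognizer goodness-of-fit scores for the same five utterances.
recognizer_scores = [0.21, 0.48, 0.55, 0.70, 0.93]

human_mean = mean_jury_rating(jury)
print(f"r = {pearson_r(recognizer_scores, human_mean):.2f}")
```

Averaging across the jury before correlating reduces the influence of any single listener, which is relevant when individual human ratings are themselves inconsistent.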
Results demonstrated that
the CSRB was more consistent than human listeners. However, the overall
average correlation showed that humans and the CSRB agreed equally well
about quality of productions. Although a moderate to strong correlation
was obtained, the authors recommended limited clinical use of the CSRB
for practice/drill under carefully controlled conditions until further
research is completed.
Most research related to computer based
speech drills was focused toward individuals with hearing impairments.
However, Jones (1998, manuscript in preparation) is one of the first to
assess the use of similar drill practice with traumatic brain injury survivors
with persistent dysarthria. The purpose of the study was twofold. First
the researcher wanted to determine if a relationship exists between trained
human listeners, untrained human listeners, and computer-based evaluations
of phonetic intelligibility. Secondly, the researcher wanted to analyze
recognition patterns of the articulatory performance of individuals with
dysarthria.
Distinguishable, although
not always significant, differences were observed. For all four speakers,
there were no significant differences noted among observations of trained
and untrained human listeners. The SR system, however, was significantly
(p<.05) lower than human listeners in accuracy of recognition for two
of four speakers. The speech of the fourth participant was perceived at
a higher level of accuracy by the SR system than it was by human listeners.
Overall patterns for correct and incorrect recognitions were similar across
trained and untrained human listeners but differed from the SR system.
It seems that the human listeners, whether or not they were trained, were
using different criteria than the SR system to ascertain intelligibility.
Determining these differences may positively influence improvements to
be made with recognition accuracy of dysarthric speech by SR systems in
the future.
While the area of using SR to enhance
drill practice is not new, it is not yet perfected either. Using SR for
drill practice does not have to be limited to individuals with hearing
impairment or dysarthria. Perhaps adults with varying degrees of aphasia
could benefit from using SR in a similar fashion. Considerations to minimize
the cognitive load, as suggested by Bernstein, Goldstein, and Mahshie (1988),
are still necessary to improve the success of drill practice across multiple
etiologies.
Speech Recognition and Learning Disabled Students
The use of speech recognition as
a compensatory strategy for individuals with learning disabilities did
not receive much attention until the early to mid 1980s, making it a relatively
new area. Implementation of SR has the capacity to enhance the learning
disabled (LD) student’s composition. According to MacArthur, Graham and
Schwartz (as cited in De La Paz, in press), by the nature of LD alone,
these students are prone to making more spelling, punctuation, and capitalization
errors. Continual pausing to correct these frequent errors during content
formation interrupts the train of thought. This may result in forgetting
the initial message the student wanted to express, leaving him/her frustrated.
Additionally, it then takes the LD student a longer time to generate written
work as opposed to peers who are not LD. According to De La Paz and Graham
(as cited in De La Paz, in press), LD students use a simplified vocabulary
when writing to avoid spelling errors even though they may want to use
a more difficult word. Finally, these students tend to have a negative
attitude about writing in general (De La Paz, in press). Utilization of
SR may allow the student to focus on the planning and content generation
of text rather than the mechanics of writing. In addition, SR has the potential
to increase the rate of production and positively enhance the overall writing
experiences that LD students traditionally avoid.
The most recent research
is by De La Paz (in press). The purpose of her study was to provide a rationale
for using dictation with LD students as a means to enhance their written
composition skills. De La Paz included several suggestions for improving
the quality of writing via dictation. She emphasized the need for advance
planning. Ideas or key words should be generated in an outline format or
as notes to refer to when dictating (De La Paz & Graham as cited in
De La Paz, in press; Reece as cited in De La Paz, in press; Wetzel, 1996).
Initially, students will need support from teachers or special educators
to learn pre-planning techniques and to set up and run
the SR program appropriately.
Combining SR with speech synthesis is
another way to help LD students learn how to correct their own errors.
Programs such as the Kurzweil 3000 (Kurzweil Educational Systems, Inc.)
can read text that has been scanned into its system. The auditory component
may improve the student’s awareness of grammatical and spelling errors
that would otherwise be overlooked when silently reading.
Most importantly, De La Paz
emphasizes that SR is not a substitute for learning the rules of written
grammar. These skills need to be mastered; SR is simply a supplementary
device to make the writing process less daunting, more appealing, and more motivating
to the LD student.
De La Paz also reminds the
reader of the cognitive component that must be considered when determining
if SR is an appropriate alternative for individuals with learning disabilities.
Careful, precise speech must be used to obtain the greatest accuracy in
recognition. It may be difficult for students to remember that even the
smallest cough or throat clear will be picked up by the microphone and
interpreted, resulting in words that were not intended to be part of the
manuscript in progress. SR systems are complex, with an abundant number
of commands that must be learned to achieve efficient and accurate results.
The process varies depending on whether the user is dictating or editing text.
Initial training processes can become long, frustrating, and even tedious.
However, with a positive attitude and support from educators, gradual mastery
of the process is attainable.
Several researchers have
compared the effectiveness of SR systems to other modes of composition
across an array of age levels. Higgins and Raskind (1995) assessed the
performance of LD college students. Writing without assistance, writing
with assistance of a transcriber and writing using SR were compared.
Findings showed that the use of SR was significantly more effective than
writing without assistance. Students used more words containing seven or
more letters during dictation than in writing without assistance. However,
writing with the assistance of a human transcriber was just as effective
as SR. Although there were no significant differences between the transcriber
and SR, SR may foster greater independence and eliminate the additional cost
of paying a transcriber.
Wetzel (1996) described a
case study of one sixth-grade student learning to use the VoiceType (IBM)
system. After only four sessions, recognition accuracy was 74%. It is plausible
that with continued sessions this would have increased to a higher level.
Dictation rate was 5.5 words per minute at the final session. An average
rate of written composition was not provided; thus, a comparison could
not be made, and no conclusion about an improved rate with SR was stated.
Learning correction procedures produced
mixed results. If the desired word was present on the selection menu, the
student learned how to correctly select it. However, if the match for the
desired word was not found, the student had to attempt to spell the target
word. Recall that spelling is usually a challenge for students with LD.
VoiceType (IBM) used word prediction as a way to alleviate the problems
with spelling. That is, as the student began spelling the intended word,
a correction menu appeared on the screen. This menu displayed a list
of guesses from which the user could select the correct word if present.
The student occasionally showed frustration during this process by sighing. The
sighs and other extraneous sounds such as sniffling or throat clearing
were then misread by the system as unwanted words. This could lead
to a cycle of increasing frustration. The entire correction procedure required
great concentration and ongoing monitoring. For this particular student,
SR did not seem to be superior to traditional
writing after the limited number of training sessions.
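The correction menu described above can be pictured as a simple prefix lookup. The following is a minimal sketch only: the source does not describe how VoiceType actually generated its guesses, so the function name `correction_menu`, the sample vocabulary, and the menu size are all assumptions made for illustration.

```python
def correction_menu(letters_spelled, vocabulary, size=5):
    """Return up to `size` words that begin with the letters spelled so far.

    Hypothetical illustration of prefix-based word prediction; it is not
    the algorithm used by any particular SR product.
    """
    prefix = letters_spelled.lower()
    matches = [word for word in vocabulary if word.lower().startswith(prefix)]
    # Alphabetical order keeps the menu predictable for the user.
    return sorted(matches)[:size]


# Assumed sample vocabulary, for demonstration only.
vocabulary = ["because", "become", "before", "begin", "believe", "between"]

# The student has spelled "b", "e", "c" so far.
menu = correction_menu("bec", vocabulary)
print(menu)
```

If the intended word is on the menu, the student selects it; if the list comes back empty, the student must keep spelling, which is the step that proved difficult in the case study.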
Because of the cognitive
demands of using the system efficiently and accurately, Wetzel recommended
that recognition accuracy be improved before middle school aged students
with learning disabilities use dictation as an alternative. Other suggestions
from Wetzel can be applied to all students with LD when incorporating SR
as a means to facilitate composition. He recommended more frequent training
sessions over a longer period of time. Providing the student with strategies
to enhance self-monitoring of extraneous noises, such as turning off the
microphone temporarily, may also help. Ultimately, Wetzel recommended that
SR be an alternative only when the system can recognize the user's
speech at a level of 90% accuracy or better. Fewer errors mean fewer corrections,
a procedure that in and of itself often causes students to lose concentration.
Finally, the rate of transcription should exceed that of writing by pencil. These
guidelines should be modified as appropriate depending on the user’s abilities.
All students are individuals. What works best for one student may not work
for the next.
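Wetzel's two quantitative guidelines (recognition accuracy of at least 90% and a dictation rate faster than writing by pencil) can be expressed as a simple check. This is a rough sketch under stated assumptions: the function names are hypothetical, accuracy is assumed to be the proportion of dictated words recognized correctly, and the handwriting rate in the example is invented because the case study did not report one.

```python
def recognition_accuracy(words_correct, words_dictated):
    """Proportion of dictated words that the system recognized correctly."""
    if words_dictated <= 0:
        raise ValueError("words_dictated must be positive")
    return words_correct / words_dictated


def meets_guidelines(accuracy, dictation_wpm, handwriting_wpm, min_accuracy=0.90):
    """Apply both guidelines: accuracy of 90% or better, and dictation faster than pencil."""
    return accuracy >= min_accuracy and dictation_wpm > handwriting_wpm


# Figures reported for the sixth-grade student: 74% accuracy, 5.5 words per minute.
# The handwriting rate of 10.0 words per minute is a made-up placeholder.
suitable = meets_guidelines(accuracy=0.74, dictation_wpm=5.5, handwriting_wpm=10.0)
print(suitable)
```

As the surrounding text notes, thresholds like these are starting points and should be adjusted to the individual student's abilities.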
Future Research
Overall improvements in speech
recognition have the capacity to positively affect the lives of all individuals
who use it. Regardless of the nature of the diagnosis, improved technology,
engineering, and design offer the potential for improved accuracy and rate of
dictation. Present fluctuations are not strictly a result of the engineering
of the system itself, but may be due to inconsistencies in the user’s
speech. Therefore, the user’s preferences and expectations seem vital in
the development and research of new tools. Suggestions and future research
related to the technology cannot be limited to one group of users.
Haigh and Clarke (1988) recommended that ongoing research assess the following
issues: ways to reduce the effects of voice drift, development of simplified
instructions, and improved hardware design.
If use of SR systems by people
with cognitive deficits and learning disabilities is to prove beneficial,
additional research must be done to determine if and how the cognitive
load can be minimized. Assessment of effectiveness, efficiency and independent
execution across diagnoses, age ranges, and multiple systems is needed
as well. This issue also applies to use of SR by individuals who are functionally
non-speaking. The amount of training required, frequency of training sessions
and methods to promote generalization of SR to activities of daily living
pertinent to each user are also in need of investigation.
References
Bernstein, L. (1989). Computer-based speech
training for the profoundly hearing impaired: Some design considerations.
Volta-Review, 91, 19-28.
Bernstein, L., Goldstein, M., & Mahshie,
J. (1988). Speech training aids for hearing-impaired individuals. Journal
of Rehabilitation Research and Development, 25, 53-62.
Coleman, C. & Meyers, L. (1991). Computer
recognition of the speech of adults with cerebral palsy and dysarthria.
Augmentative and Alternative Communication, 7, 34-42.
Damper, R. (1984). Voice-input aids for
the physically disabled. International Journal of Man-Machine Studies,
21, 541-553.
De La Paz, S. (in press). Composing via
dictation and speech recognition systems: compensatory technology for students
with learning disabilities. Learning Disabilities Quarterly.
Doyle, P., Leeper, H., Kotler, A., Thomas-Stonell,
N., O’Neill, C., Dylke, M., & Rolls, K. (1997). Dysarthric speech:
A comparison of computerized speech recognition and listener intelligibility.
Journal of Rehabilitation Research and Development, 34(3), 309-316.
DragonDictate [computer software]. (1990).
Newton, MA: Dragon Systems.
Ferrier, L.J., Jarrell, N., Carpenter,
T., & Shane, H.C. (1992). A case study of a dysarthric speaker
using the DragonDictate Voice Recognition System. CUSH, **.
Goodenough-Trepagnier, C. & Rosen,
M.J. (1991). Towards a method for computer interface design using speech
recognition. Paper presented at RESNA 14th Annual Conference, Kansas City,
MO.
Holme, S., Kanny, E., Gutherie, M., &
Johnson, K. (1997). The use of environmental control units by occupational
therapists in spinal cord injury and disease services. American Journal
of Occupational Therapy, 51, 42-48.
Jones, B. (1998). Acoustic variability
and computer recognition in dysarthria. Manuscript in preparation, University
of Nebraska at Lincoln.
Kurzweil 3000 [computer software]. (1998).
Waltham, MA: Kurzweil Educational Systems, Inc.
Lange, H. (1993). Speech synthesis and
speech recognition: Tomorrow’s human-computer interfaces? Annual Review
of Information Science and Technology (ARIST), 28, 153-185.
Lee, K., Hauptmann, A., & Rudnicky,
A. (1990). The spoken word: Replace the "look and feel" of GUIs with the
"ask and tell" of voice interfaces. Byte, July.
Ling, D. (1976). Speech and the Hearing
Impaired Child: Theory and Practice. Washington, DC: The Alexander Graham
Bell Association for the Deaf, Inc.
Mahshie, J., Vari-Alquist, D., Waddy-Smith,
B., & Bernstein, L. (1988). Speech training aids for hearing-impaired
individuals: III. Preliminary observations in the clinic and children’s
homes. Journal of Rehabilitation Research and Development, 25, 69-82.
Meisel, W. (1993). Talk to your computer:
voice technology lets you verbally command your computer or convert speech
to text. Byte, 18, 113-120.
Noyes, J. & Frankish, C. (1992). Speech
recognition technology for individuals with disabilities. AAC Augmentative
and Alternative Communication, 8, 297-303.
Schmitt, D., & Tobias, J. (1986). Enhanced
communication for a severely disabled dysarthric individual using voice
recognition and speech synthesis. Proceedings of the 9th Annual RESNA Conference,
Minneapolis, MN. Washington, DC: RESNA Press, 1986, 304-306.
Stevens, K. (1960). Toward a model for
speech recognition. The Journal of the Acoustical Society of America, 32, 47-55.
Treviranus, J., Shein, F., Haataja, S.,
Parnes, P., & Milner, M. (1991, month unknown). Speech recognition
to enhance computer access for children and young adults who are functionally
non-speaking. RESNA 14th Annual Conference, Kansas City, MO.
VoicePad Platinum [computer software].
(1997). Waltham, MA: Kurzweil Educational Systems, Inc.
VoiceType [computer software]. (1992).
Dayton, NJ: IBM Direct.
Watson, C. & Kewley-Port, D. (1989).
Advances in computer-based speech training: Aids for the profoundly hearing
impaired. Volta-Review, 91, 29-45.
Watson, C., Reed, J., Kewley-Port, D.,
& Maki, D. (1989). The Indiana Speech Training Aid (ISTRA) I: Comparisons
between human and computer-based evaluation of speech quality. Journal
of Speech and Hearing Research, 32, 245-251.
Wetzel, K. (1996). Speech-recognizing computers:
A written-communication tool for students with learning disabilities? Journal
of Learning Disabilities, 29, 371-380.
Youdin, M., Sell, G., Reich, T., Clagnaz,
M., Louie, H., & Kolwicz, R. (1980). A voice controlled powered wheelchair
and environmental control system for the severely disabled. Medical Progress
Through Technology, 7, 139-143.