Reprinted with permission from NETAC Networks, a quarterly publication of the Northeast Technical Assistance Center. This article originally appeared in the Winter 2001 issue of that publication.
There's a lot of anticipation regarding applications for automatic speech recognition (ASR). As a member of a team at the National Technical Institute for the Deaf (NTID) that is conducting research on using ASR as a classroom access alternative, I am very mindful of that anticipation. The prospect of having spoken words or phrases automatically converted into text is extremely exciting! Accessibility, which for many has been a challenge, could be much more attainable. ASR could provide more options to schools that are struggling to find adequate accommodations for their students.
But before we put all of our eggs in one basket, it's important to learn more about this emerging technology. This article covers some basic information about ASR technology, including a definition, how it works and some logistical considerations. It also addresses some advantages and drawbacks of using ASR for communication access. I am a proponent of using ASR for access, and I hope this introduction will help you weigh the pros and cons and decide how best to use this technology.
The Basics
I struggled to find a generic, lucid definition of ASR that was suitable for every technical level. Much of the published information assumes that the reader has some technical knowledge, which is not always the case. For now, let's define ASR as the technology that uses a computer to recognize spoken words and convert them into text.
How does ASR work? On the surface it seems relatively uncomplicated. Presuming you have a suitable computer (one that meets the software's hardware requirements) and a high-quality microphone (some ASR programs include one), you would install the application, set up your microphone and create a voice model. Most promotional literature would have you believe it's that simple. However, we've found it to be a little more involved.
ASR systems are now able to handle continuous speech rather than discrete speech, allowing you to speak naturally. That's great news because you don't have to train yourself to speak in a different way, like - pausing - between - words. You can speak at a conversational pace, but you still need to speak clearly and evenly.
Some ASR software programs advertise speaking rates of up to 160 words per minute (wpm) with accuracy levels of 95 percent or above. That's quite impressive! Does that mean you can buy ASR software, install it, start dictating at a normal pace and get high accuracy? Not quite. The level of accuracy depends on how much training and dictionary building you do.
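To put the advertised figures in perspective, it helps to turn an accuracy percentage into errors per minute. The short Python sketch below does that arithmetic. It assumes that "accuracy" means the fraction of words recognized correctly, so the error rate is simply 1 minus the accuracy; the function names are illustrative only and do not come from any ASR product.

```python
def errors_per_minute(words_per_minute: float, accuracy: float) -> float:
    """Expected misrecognized words per minute at a given speaking rate.

    Assumption: accuracy is the fraction of words recognized correctly,
    so the error rate is 1 - accuracy.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    return words_per_minute * (1.0 - accuracy)


def words_per_error(accuracy: float) -> float:
    """Average number of words between errors (the reciprocal of the error rate)."""
    error_rate = 1.0 - accuracy
    if error_rate <= 0.0:
        return float("inf")  # perfect accuracy: no errors expected
    return 1.0 / error_rate


if __name__ == "__main__":
    # The advertised figures: 160 wpm at 95 percent accuracy.
    print(round(errors_per_minute(160, 0.95), 2))  # about 8 errors per minute
    print(round(words_per_error(0.95), 2))         # about one error every 20 words
```

Under the same assumption, a speaker who talked continuously at 160 wpm for a 50-minute class would produce about 8,000 words, roughly 400 of them misrecognized, even at the advertised 95 percent accuracy.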
The software programs also maintain that 95 percent accuracy can be achieved with 15 minutes of training or less. That may be true in situations where ASR is being used for personal dictation or PC navigation, and there are a number of testimonials to corroborate these claims. However, if you're reading this article, I'm assuming that you're interested in ASR for communication access. In that case, our research indicates that 15 minutes of training is, at best, passable, because classroom discourse is highly variable.
This is just a small piece of what you'll need to address when considering ASR for classroom access. What about logistical considerations such as multiple speakers or technical failure? ASR systems are still largely speaker dependent, which means that each speaker must train the computer to recognize his or her voice. This is not much of an issue unless there are multiple speakers in one session, because only one voice file can run at a time. As for technical failure, if your voice file is on one computer and that computer crashes, odds are you'll have to train a new voice file. And if your voice is altered in any way (e.g., you have a cold), your recognition rate decreases and you could corrupt your good voice file.
About Access
Having heard many stories about classroom access problems from students, parents and service providers, I understand how eager people are to see ASR become an alternative for access. As with any support service option in an educational environment, there are appropriate situations for ASR use. Remember that it's best to tailor communication access to the individual student's needs.
With that in mind, let's look at some of the advantages of using ASR for access for students who are deaf and hard of hearing. As with some speech-to-text systems (e.g., C-Print™), students can benefit from the real-time display and from extensive notes after class. ASR requires no special skills such as typing or signing, and a person can use it for longer periods than those skill-based services typically allow.
Different approaches are being taken to applying ASR in the classroom. The two most familiar are having the instructor wear a microphone, or using an intermediary to shadow the spoken information, that is, to repeat what is said into the ASR system. There are distinct advantages and drawbacks to both. You'll need to determine which approach is appropriate for your situation.
Having the instructor wear a microphone for direct input into the ASR system would seem to be most desirable. It's definitely an advantage to be able to provide a verbatim text display of what is being said without having to employ an additional person. There are no "distractions" and you don't have to pay additional wages.
But some drawbacks may prevent it from being a viable option in lecture situations. ASR systems currently do not automatically insert punctuation (e.g., periods, commas) or paragraph breaks. Some systems insert a line break when the speaker pauses; however, for the most part, the text runs on without any visual indicators of where a thought starts and stops. This can require considerable effort from the student to decipher the meaning of the text.
As I mentioned previously, ASR systems are still speaker dependent, which would make it necessary for each speaker to train a voice file. And what about group discussions or comments from students? One way to handle comments is to have the instructor repeat them.
NTID's research study is evaluating ASR use with an intermediary shadowing the spoken information. When using shadowing, there's the option to provide summary or verbatim text. Summarizing requires more cognitive effort from the intermediary, but the result may be easier for some students to follow, and it also reduces the length of the notes. The intermediary can include all classroom discourse and identify multiple speakers as necessary.
We've found that a significant advantage of using an intermediary is the addition of punctuation and paragraph breaks, which is essential to the readability of the real-time text display. In addition, while the instructor must focus on presenting information and managing the classroom, the intermediary can concentrate on the text display and therefore provide greater accuracy.
Again, there are drawbacks to this approach. There is the issue of having another person in the classroom and securing funds to pay that person. The intermediary would need to use a special microphone to muffle the sound of her/his voice. These microphones are called dictation masks, and it's recommended to have one mask per person for hygiene.
Regardless of whether you use ASR with or without an intermediary, training and dictionary building are key. Most ASR programs now have a large basic vocabulary, but it does not cover all of the subject-specific vocabulary that crops up in the classroom. The initial voice model training is only the beginning of the preparation and maintenance necessary for a high level of accuracy.
A research group evaluating ASR use with a microphone on the instructor recently reported a 90 percent accuracy rate. While 90 percent accuracy may sound good, it still represents one error in every 10 words. Would a 10 percent error rate be acceptable? Would this article be as readable if the editor allowed one word in ten to be wrong?
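The report doesn't say how its 90 percent figure was measured. One common convention for scoring ASR output (not necessarily the one that group used) is word error rate: the minimum number of word substitutions, deletions and insertions needed to turn the ASR output into a correct reference transcript, divided by the number of words in the reference. Accuracy is then 1 minus that rate. The Python sketch below is a minimal illustration under simplifying assumptions: words are split on whitespace, and comparison is case-sensitive with punctuation left in place.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # At the start of row i, prev[j] is the edit distance between the first
    # i-1 reference words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as 1 - word error rate (can drop below 0 with many insertions)."""
    return 1.0 - word_error_rate(reference, hypothesis)


if __name__ == "__main__":
    said = "today we will review the structure of the plant cell"
    shown = "today we will review the structure of the planned cell"
    # One misrecognized word out of ten spoken words.
    print(word_accuracy(said, shown))
```

In this example one word in a ten-word sentence is wrong, which is exactly what 90 percent accuracy looks like on screen.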
Conclusion
As you can see, there are a number of issues to address if you are considering using ASR for communication access. There's a lot of information available about the technology, especially on the Web. You'll need to sort through it to determine what pertains to you. ASR is no longer a remote possibility for access, but a real prospect. It will offer new alternatives for providing communication access in the future.