Friday, April 10, 2009

Voice Recognition Study Report

1 System Definition

1.1 Problem Definition

Voice recognition (VR) systems have been developed by different vendors over the last 10 years. Although the systems currently available on the market have not reached maturity, they are quite usable. This study is an attempt to tap the potential of such systems to make the transcriber's job much easier and thereby improve productivity in a mutually beneficial way.

1.3 Goals of the Initial Study and the Proposed System

1.3.1 To use VR systems effectively in medical transcription.

1.3.2 Most VR systems available on the market depend on an initial enrollment. This enrollment process is lengthy, and it is not practical to insist that doctors go through it. We therefore want to find a method to eliminate this initial enrollment process.

1.3.3 The main advantage of using VR in medical transcription is the availability of archives. In our case we have many old transcripts and dictations available to train the system. We need to exploit this advantage for system training and improvement.

1.3.4 In most cases the vocabulary used in medical transcription is limited compared to general day-to-day conversation. This vocabulary can be extracted from the transcript archives.

1.3.5 Integrate the system with VoiceSys. VoiceSys identifies the files from those doctors who have a speech profile stored in it, then transcribes those files automatically.

1.3.6 After automatic transcription, the voice files will be sent to transcribers for proofing. The corrections made during proofing will then be used for further training. We want to find a method to collect this data without affecting the productivity of the document proofing process.
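The vocabulary-extraction goal above can be sketched in a few lines. This is a minimal illustration, assuming transcripts are available as plain text; the naive whitespace tokenizer and the function name are our own, not part of any SDK:

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch: build a specialty vocabulary from archived transcripts.
// Tokenization is deliberately naive (whitespace split, surrounding
// punctuation stripped, lower-cased).
std::set<std::string> buildVocabulary(const std::vector<std::string>& transcripts)
{
    std::set<std::string> vocab;
    for (const std::string& text : transcripts) {
        std::istringstream tokens(text);
        std::string word;
        while (tokens >> word) {
            // Strip leading/trailing punctuation and normalize case.
            while (!word.empty() && std::ispunct(static_cast<unsigned char>(word.front())))
                word.erase(word.begin());
            while (!word.empty() && std::ispunct(static_cast<unsigned char>(word.back())))
                word.pop_back();
            std::transform(word.begin(), word.end(), word.begin(),
                           [](unsigned char c) { return std::tolower(c); });
            if (!word.empty())
                vocab.insert(word);
        }
    }
    return vocab;
}
```

A production version would need medical-aware tokenization (hyphenated drug names, abbreviations, numbers), but even this rough cut is enough to measure vocabulary growth across an archive.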

1.4 Constraints on Proposed System

1.4.1 Usual direct enrollment is not possible in most cases.

1.4.2 The recordings of medical dictations are usually made in ordinary, noisy environments, which introduces a lot of background noise.

1.4.3 Most doctors are not willing to follow best practices in recording.

1.4.4 Recording volume can vary in some instances.

1.4.5 Doctors can have difficult accents depending on their origin and speaking habits.

1.5 Suggested Environment for the system

1.5.1 Hardware

1.5.2 Software

Windows 2000 or XP/VoiceSys

IBM ViaVoice Engine

IBM ViaVoice SDK

1.5.3 Manpower and Skill Report

It requires a VC++ programmer with deep knowledge of the IBM ViaVoice SDK.

1.6 User requirements

1.6.1 Automatic doctor enrollment using archived transcripts and audio files.

1.6.2 A huge number of transcripts can be used to teach the dictation context of the speaker.

1.6.3 Facility to build a specialized vocabulary (a "Topic" in IBM SDK terminology).

1.6.4 Automatic transcription as files arrive in the system.

1.6.5 Automatic template (work type) selection for formatting the transcribed documents.

1.6.6 Automatic formatting of the document to populate fields from the dictation to corresponding fields in the template.

1.6.7 Automatic collection of correction information while the transcriber proofreads the ASR-generated transcripts.

1.6.8 Automatic continuous training to improve performance, based on the correction data collected during proofreading.

1.7 Solution Strategy

Each requirement stated above can be considered a main point to be addressed by this project. On that basis the project can be modularized as discussed below.

1.7.1 Enrollment: This will be a semi-automatic process. It may not be possible to apply VR to all doctors. After selecting the doctors to whom VR can be applied, some of the process must still be done manually.

1.7.2 Transcription and Formatting: This is a very important and tedious phase. Here we need a generic design that can match the dictation habits of different doctors.

1.7.3 Correction: The design goal of this module is to collect correction information from the proofreader without adding the extra burden of marking the changes. Once the proofreader sends back the corrected document, this module collects that information and saves it for future training of the engine.

1.7.4 Easing Proofreading: The goal of this module is to improve the proofreader's productivity by providing convenient features such as moving the cursor as playback advances.

1.7.5 System Architecture: VR is a processor-intensive task, so it is recommended to run it on a separate system.

1.8 Priorities for System Features

Since all the features highlighted above are mandatory, they have equal priority. Developers can nevertheless concentrate more on the OSS and foot-switch side, since it needs some expertise.

1.9 Sources of Information

IBM ViaVoice documentation and Wizzards Software

2 Standards and Procedures

We will follow the standards published in 'Synergy'.

3 Risk Management

IBM ViaVoice is poorly documented, and IBM has stopped development and support for this product. Wizzards Software now supports ViaVoice.

4 Acceptance Criteria

Accuracy of 60% or better for selected doctors. Total integration with VoiceSys.

5 Requirement Changes during Development

Please refer to section 3.

6 Project Deliverables

VoiceSys VR Server


7 Project Estimation

2 programmers: experienced in C++, with knowledge of GUI programming, the IBM ViaVoice API, and the VoiceSys architecture.

8 Glossary

Appendix A: Initial Experiment Results

Appendix B: Architectural Design Guidelines

9 Appendix A: Vocabulary Experiments

As part of our initial study of voice recognition, I ran some experiments to assess the possibility of using VR effectively.

Vocabulary Size:

The first experiment measured the size of a typical doctor's dictation vocabulary. Unfortunately, all the doctors selected were from the same specialty.

The table below shows the results of the experiment. It shows that a normal dictation vocabulary is somewhere near 20,000 words, only about one third of a general vocabulary (64,000 words). This reduced vocabulary size can improve the accuracy of voice recognition.


Table: vocabulary size and number of dictations used for Doc 1 (ortho), Doc 2 (ortho), and Doc 3 (ortho). (Cell values not preserved in this copy.)

Another important finding is that the number of new words per dictation drops drastically after the first 100 dictations have been processed. In the first file the program collected 80 new words out of 266 words; this is not a mistake, since words can repeat within a single file. As it iterates through the files, the program's vocabulary grows, and the number of new words found gradually decreases. The table below shows the first 10 rows of the experiment result; the full data is attached in the appendix.
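The counting loop described above can be reconstructed as follows. This is an illustrative sketch, not the actual experiment program; files are assumed to be pre-tokenized into word lists:

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch of the new-word counting experiment: walk the dictation
// files in order and, for each file, count how many words had never appeared
// in any earlier file.
struct FileStats {
    std::size_t newWords;    // words seen for the first time in this file
    std::size_t totalWords;  // all word occurrences in this file
};

std::vector<FileStats> countNewWords(
    const std::vector<std::vector<std::string>>& files)
{
    std::set<std::string> vocab;  // accumulated vocabulary, grows per file
    std::vector<FileStats> stats;
    for (const auto& words : files) {
        FileStats s{0, words.size()};
        for (const std::string& w : words) {
            if (vocab.insert(w).second)  // insert() reports whether w was new
                ++s.newWords;
        }
        stats.push_back(s);
    }
    return stats;
}
```

The new-words percentage in the table is then simply newWords / totalWords for each file.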

Table columns: File Number, New Word Count, Total Words in the File, New Words Percentage. (Row data not preserved in this copy; see the appendix for the full data.)

This table shows the size of the vocabulary acquired by the program and the average number of new words found in each region. It shows that after a vocabulary of 3,000 words has been acquired, the chance of finding a new word is limited to about 1.5%. In other words, a file can be transcribed with 98.5% vocabulary coverage using 3,000 words. It is also noticeable that with a vocabulary of 5,000 words we can attain 99.4% coverage (see the full table in the index).


Chart: new-words percentage versus acquired vocabulary size. (Chart not preserved in this copy.)
** The above results were obtained with the help of an untested program. Any mistaken assumption or logical error could invalidate them.

Dictation style

When we studied different dictations from the same doctor, we found that a doctor normally follows the same style in all of his dictations. By tuning the program to the dictation behavior of the doctor, it is possible to improve accuracy. For example, some doctors always start by dictating their name, then the medical record number and other details in a specific order. We could probably switch the engine to command-recognition mode to recognize these items and fill them into the templates. These points should be taken into consideration while designing the back-end recognition engine. The following behaviors were noticed while examining the dictations.

1. Specific patterns to start a dictation.

2. Some doctors say "new paragraph" and "period" while others ignore that.

3. Doctors divide dictations into sections with different utterances: "objective analysis", "objective", etc.
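For instance, a fixed start pattern like (1) above could be detected with a simple matcher before switching back to general dictation. The phrasing, field order, and function name in this sketch are hypothetical; in practice each doctor would need a tuned pattern:

```cpp
#include <regex>
#include <string>

// Hypothetical sketch: detect a dictation header of the form
// "This is Dr. <name>, medical record number <digits>" and pull out the
// fields for template filling.
bool parseHeader(const std::string& line, std::string& name, std::string& mrn)
{
    static const std::regex header(
        R"(This is Dr\.?\s+([A-Za-z ]+?),\s*medical record number\s*(\d+))",
        std::regex::icase);
    std::smatch m;
    if (!std::regex_search(line, m, header))
        return false;
    name = m[1].str();  // doctor's name, for the template header
    mrn = m[2].str();   // medical record number field
    return true;
}
```

A table of such per-doctor patterns would let the system populate template fields automatically before general dictation recognition begins.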

Appendix B: Training the Engine from Archives

Unattended Enrollment Experiments:

One of our main targets in this VR attempt is to eliminate the enrollment process required of the dictator. Transcription is done as a service, and it is not possible to force the user to undergo the time-consuming and tedious enrollment process required by most of the available engines. This process has to be done at the transcription service without disturbing the client. We conducted a lot of research in this area with the existing engines.

We did our experiments mostly with IBM Voice Center. A user profile is created for the intended speaker. The following steps were executed to prove that we can bypass user enrollment.

  1. A loopback wire is used to connect the headphone jack to the microphone input. A potentiometer is used to adjust the volume level.
  2. Use old archived transcripts of the same doctor to teach the context. This is done in two steps:
    1. Set the speech-engine user created for the doctor as the default user. This can be done by editing the C:\Program Files\ViaVoice\users\client.dfl file. There is only one entry in the file, which can be set to the needed user.
    2. Use vocabexp.exe to teach the context. Start this program and select the text form of the transcript files for training.
  3. Create a Topic for the user. The Topic can be created from the same set of files. After creating the Topic, use vati.exe to activate it.
    1. Now select this newly created Topic as the default topic of the user. This can be done with the control panel applet provided by IBM; the last combo box in this control shows the topic.
  4. Now transcribe the file with Dictation Pad. The wave file can be played into Dictation Pad through the loopback.
  5. After the transcription completes, correct it in Dictation Pad itself. Continue this step with 50 files or so; this will give a good level of accuracy.
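The default-user step above could be automated along these lines. The sketch assumes, as the step description states, that client.dfl holds a single entry naming the default user; the real file format should be verified on the target installation before relying on this:

```cpp
#include <fstream>
#include <string>

// Hypothetical sketch: point the engine at a doctor's profile by rewriting
// client.dfl. ASSUMPTION: the file contains a single line naming the default
// user, per the observation in the enrollment steps; verify before automating.
bool setDefaultUser(const std::string& dflPath, const std::string& userId)
{
    std::ofstream dfl(dflPath, std::ios::trunc);  // replace the single entry
    if (!dfl)
        return false;
    dfl << userId << '\n';
    return static_cast<bool>(dfl);
}
```

With this, VoiceSys could switch profiles per incoming file instead of requiring a manual edit for each doctor.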

The above experiment proved that it is possible to avoid enrollment, since after 20-50 files Dictation Pad starts giving a good level of accuracy. Encouraged by this result, we started looking into the IBM API for features to automate the manual correction.

Automating the correction process:

The above experiment proved that it is possible to use a manual correction process to avoid pre-enrollment, but this is a time-consuming and indirect method. Our next attempt was to use the old archived audio files and their corresponding transcripts for automated training. We examined the IBM API for correction functions. We also used the IBM API log to analyze how Dictation Pad handled the correction process; the API log can be enabled by setting api_log_level = 2 in the engine config file.

One of the main tasks in automated correction is determining exactly which words were misrecognized, and their corresponding correct words, from the original and recognized texts. This is tedious when only a limited number of words are recognized correctly. We used a word-sequence matching method to locate the correctly recognized words, then corrected those words where we were sure a correction could be made; otherwise the words were discarded. A sample program developed with this algorithm was used for training and gave us good results. See the appendix for details of the algorithm and the training results.


In the light of the experiment results, we suggest that it is worth developing a comprehensive transcription application on the IBM engine. The IBM engine is comparatively low cost compared with other engines. The next step is the detailed design of the system to integrate with VoiceSys.

New words found in the first 100 files (X axis: file number; Y axis: new word count)

New words found in the first 1000 files (X axis: file number; Y axis: new word count)

New words found in the first 10000 files (X axis: file number; Y axis: new word count)

New words found in the last 100 files (X axis: file number; Y axis: new word count)

Appendix D: Unattended Training Progress in Accuracy Results

This graph is plotted from the results obtained when training with batches of 100 files and their corresponding wave files. Before starting the experiment, a Topic was created for the doctor using 1000+ archived dictation transcripts. Context creation with VocabExpert was also done with the same files.

The X axis shows training counts. Training is done whenever 1000 phrases have been collected from the training files.

The Y axis shows the average percentage accuracy of the batch of files used in that iteration.

The graphs below show the accuracy changes in the same file after training iterations.

X Axis : Iterations

Y Axis : Accuracy

Appendix D: Observations on the IBM Engine

These tips will be useful to developers working with the IBM engine; they are the points where we got stuck during our experiments.

  1. Training fails at step 2 due to the unavailability of .bsf files in the corresponding folder (uns\[documentid]\bsf).

The training log shows failure at step 2. This problem was solved when we changed the SmSet call to use SM_SAVE_AUDIO without SM_SAVE_AUDIO_DEFAULT.


  2. Training fails if SmDiscardData is called.

This was solved by using the save-all-speech-data argument in SmSaveSpeechData. It saves all audio data, but the discarded data is reflected in the tags file.

nRc = SmSaveSpeechDataEx(NULL, iUniqueDocumentID, SM_SAVE_FOR_ADAPTATION | SM_SAVE_ALL_TAGS, 0, (unsigned long *)m_Tags, &msg);