Automatic Detection of Mispronounced Phonemes for Language Learning Tools

AUTOMATIC DETECTION OF MISPRONOUNCED PHONEMES FOR LANGUAGE LEARNING TOOLS

Olivier Deroo¹, Christophe Ris¹, Sofie Gielen², Johan Vanparys²

¹ Faculté Polytechnique de Mons, 31, bld Dolez - B-7000 Mons - Belgique
² Facultés Universitaires Notre-Dame de la Paix, 61, rue de Bruxelles - B-5000 Namur - Belgique
e-mail: {deroo,ris}, {sofie.gielen,johan.vanparys}

ABSTRACT

Automatic Speech Recognition (ASR) can be very useful in language learning tools in order to correct mistakes in the pronunciation of foreign words by non-native speakers. Most of the ASR-based systems proposed on the market merely accept or reject whole words or whole sentences. In this paper, we propose a method to identify pronunciation errors at the phoneme level. Indeed, mistakes are often predictable and concern a particular subset of phonemes not present in the mother language of the speaker. We describe two different approaches based on hybrid HMM/ANN technology. The methodology for training the recognizer is discussed, and we describe a new approach where a mixed database is used to train a speech recognition system able to detect pronunciation errors at the phoneme level. Preliminary but promising results have been obtained on the DEMOSTHENES database.

1. INTRODUCTION

Acquiring a good pronunciation of spoken sentences in any language is a non-trivial task for most non-native speakers. Traditional audio-visual aids – in classrooms or language laboratories – have shown their limitations in correcting pronunciation (lack of systematic feedback in a non-individualized environment). On the other hand, recent developments in continuous speech recognition make it possible to provide multimedia tools that analyze and correct the pronunciation of non-native speakers in a consistent and individualized approach.
Unfortunately, most of the systems proposed so far on the market take binary decisions on whole words or even whole sentences, which gives little help on how to improve one's pronunciation. The system proposed in this paper is able to localize pronunciation errors at the phoneme level. Such an approach distinguishes the application from commercial products currently available on the market, which provide feedback in a graphic, non-linguistic format. In section 2, we describe the DEMOSTHENES database collected for this particular task. In section 3, we present and discuss the two different approaches we developed. These methods have been tested on this database and are currently being extended to other European languages through the L-KIT¹ project.

2. THE DEMOSTHENES DATABASE

This database has been collected in order to train our system and test it over a wide range of speakers in the framework of the DEMOSTHENES project. The goal of this project is to build language-learning tools integrating automatic speech recognition for French-speaking people learning Dutch [2, 9]. The database, recorded on microphone, consists of Dutch sentences that are representative of the typical pronunciation errors encountered by the learners (e.g. language-specific phonemes without equivalent in French, assimilations, confusion between long and short vowels, etc.). About 22,000 items have been recorded by 135 native and non-native speakers. Those items have been carefully selected from the basic vocabulary of the Dutch language (covering the 2,000 most frequent words) in order to illustrate the most frequent pronunciation difficulties encountered by French-speaking students. Basic phonetic units have been labelled in the specific context of DEMOSTHENES; that is, an extended phonetic alphabet has been defined for the coding of the speech database, including erroneously pronounced phonemes, so that pronunciation mistakes are labelled as well.
¹ DEMOSTHENES and L-KIT are research projects sponsored by the Walloon Region of Belgium.

The processing of Dutch pronunciation features depends on the mother language of the learners: only the most probable mistakes committed by French speakers learning Dutch are considered relevant. Other mistakes are not labelled as such, and are thus irrelevant in this context (given their low expected frequency). The DEMOSTHENES database has been used to test the two methods described in the following sections. The test set consists of 12 native (6 male / 6 female) and 12 non-native (6 male / 6 female) speakers (different from those selected for the training of the speech recognition system) uttering approximately 2,750 sentences. A linguistic expert has manually identified and labelled about 2,000 pronunciation errors covering the 11 most important difficulties in Dutch.

3. ASR-BASED APPROACH

The basic pronunciation analysis algorithms proposed in the literature [3, 6, 8] are based on phonetic segmentations of the speech signal automatically generated by forced Viterbi alignment through Hidden Markov Models (HMM) [5]. Given these segmentations, scores are obtained from HMM likelihoods, phone durations or a combination of both. These scores can then be used to decide whether the pronunciation is acceptable or not. In this paper, we propose to use a hybrid system combining Hidden Markov Models (HMM) and Artificial Neural Networks (ANN), trained in a specific classification mode, in order to evaluate the quality of pronunciation and precisely identify the pronunciation problems. In the two approaches introduced in this paper (and as in [8]), speech is modelled by phoneme-based HMMs modelling both the correct and incorrect pronunciations. To detect mispronounced phonemes, we assume that we know the correct orthographic transcription of the pronounced sentence (as is the case in most language-learning exercises, where sentences are prompted).
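As a rough illustration of the forced-alignment step (a minimal sketch, not the authors' implementation), the code below aligns a matrix of per-frame log-probabilities against a known phone sequence with a small Viterbi dynamic program; the function name `forced_align` and the toy inputs are hypothetical.

```python
import numpy as np

def forced_align(log_probs, phone_seq):
    """Forced Viterbi alignment of a (frames x classes) matrix of frame
    log-probabilities against a known phone sequence: each frame is
    assigned to one phone, in order, and the highest-scoring
    segmentation is returned as per-frame phone labels."""
    T, N = log_probs.shape[0], len(phone_seq)
    delta = np.full((T, N), -np.inf)    # best score ending in phone i at frame t
    back = np.zeros((T, N), dtype=int)  # predecessor phone index
    delta[0, 0] = log_probs[0, phone_seq[0]]
    for t in range(1, T):
        for i in range(N):
            stay = delta[t - 1, i]                       # remain in phone i
            move = delta[t - 1, i - 1] if i > 0 else -np.inf  # enter from phone i-1
            back[t, i] = i if stay >= move else i - 1
            delta[t, i] = max(stay, move) + log_probs[t, phone_seq[i]]
    # backtrack from the last phone at the last frame
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    path.reverse()
    return [phone_seq[i] for i in path]

# toy check: 4 frames, 2 classes; frames 0-1 favour class 0, frames 2-3 class 1
frame_post = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
labels = forced_align(np.log(frame_post), [0, 1])  # → [0, 0, 1, 1]
```

The segmentation (start frame and duration of each phone) follows directly from where the returned label sequence changes.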
A phoneme graph for the sentence is built, taking into account both correct and incorrect phoneme models, and the most probable sequence of phonemes with their respective durations can be produced by a forced Viterbi alignment [5]. The phoneme graph models in parallel the right pronunciation of each phoneme and the corresponding pronunciation errors. Two different forms of such graphs are used in the methods proposed in this paper.

Figure 1: The phoneme graph where each phoneme has two possible pronunciations: one for native and another one for non-native speakers.

Moreover, a confidence score can be computed for each phoneme in order to help taking decisions on the presence of a mispronounced phoneme. Indeed, if the confidence score is too low for a particular phoneme, this can be interpreted as a pronunciation error that has not been properly modelled. This score can be computed from the log-posterior probabilities provided by the ANN (cf. equation 1). This measure has already been shown [3] to outperform other measures based on HMM log-likelihoods or segment durations.

  C(q_i) = (1/d_i) * Σ_{t=t_i}^{t_i+d_i-1} log p(q_i | x_t)        (1)

where p(q_i | x_t) is the posterior probability of state q_i at time t, t_i is the first frame of phone q_i, and d_i is its duration in frames. Therefore, if a non-native phoneme is found in the phoneme sequence, we are able to determine precisely the place and the type of the error that has been made.

3.1. Competing models

In the first approach, two hybrid HMM/ANN systems are trained independently on Dutch speech data recorded by native and non-native speakers. Each phoneme of the sentence can be modelled either by a native HMM or by a non-native HMM.
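As an illustrative sketch of this competing-model idea (assuming per-frame posterior matrices from the two ANNs and a fixed segmentation; the function names and toy data are hypothetical, and the real system decides through a forced Viterbi alignment over the parallel graph rather than segment by segment):

```python
import numpy as np

def segment_confidence(log_post, phone_id, start, dur):
    """Average frame log-posterior of a phone over its segment,
    i.e. the confidence measure of equation (1)."""
    return float(log_post[start:start + dur, phone_id].mean())

def label_segments(segmentation, native_logp, nonnative_logp):
    """For each (phone_id, start, dur) segment, keep whichever of the
    native / non-native models scores higher; a 'non-native' label
    flags a suspected mispronunciation of that phone."""
    labels = []
    for phone_id, start, dur in segmentation:
        native = segment_confidence(native_logp, phone_id, start, dur)
        foreign = segment_confidence(nonnative_logp, phone_id, start, dur)
        labels.append('native' if native >= foreign else 'non-native')
    return labels

# toy data: 4 frames, 1 phoneme class, two segments of 2 frames each
native_lp = np.log(np.array([[0.9], [0.9], [0.2], [0.2]]))
foreign_lp = np.log(np.array([[0.3], [0.3], [0.8], [0.8]]))
segs = [(0, 0, 2), (0, 2, 2)]
print(label_segments(segs, native_lp, foreign_lp))  # → ['native', 'non-native']
```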
The phoneme graph is then composed of a sequence of competing models, as shown in figure 1. All the ANNs used in the experiments reported in this paper have an input layer of 234 units spanning a window of 9 frames, where each frame consists of 12 cepstral parameters (log-RASTA-PLP [4]), their first derivatives, and the first and second derivatives of the energy. The log-RASTA-PLP parameters have been chosen because of their robustness against changes in the recording conditions (typically the use of different microphones). As we are working at the phoneme level, we define an output layer of 42 units, corresponding to one unit per Dutch phoneme. The classification accuracy (on both the training set and a cross-validation set) obtained at the frame level can be seen in Table 1.

                    Train    Cross
  Native ANN        80.2%    76.7%
  Non-native ANN    76.7%    76.5%

Table 1: Phone classification rate at the frame level with the ANNs trained on the native and non-native databases, using log-RASTA-PLP parameters.

Unfortunately, detection of mispronounced phonemes using this approach was not efficient (only about 35% of the labelled pronunciation errors were correctly detected). By analysing the behaviour of the system, we noticed that most of the phoneme models trained on the native and non-native databases were very close to each other. Indeed, most of the phonemes of the foreign language (such as plosives, nasals, etc.) are correctly pronounced by the non-native speakers, so that the system is not able to discriminate between wrong and right pronunciations, leading to many false mispronunciation detections.

3.2. Mixed model

The second approach makes the hypothesis that foreign speakers always make the same kinds of pronunciation errors and that, when they mispronounce a phoneme in a language, they usually use a sound that is commonly used in their native language (or similar to one). Therefore, the detection of pronunciation errors can be handled the following way.
For each sentence prompted to the speaker, the phonetic transcription corresponding to the correct pronunciation is known. Based on linguistic knowledge, it is possible to identify in those sentences the most likely errors at the phoneme level. For instance, in the case of a tool teaching Dutch to French-speaking people, the phoneme 'G' (like in gaan, dag) is often mispronounced as 'g', 'x' or 'k'. We therefore build a phoneme graph (see figure 2) taking all these wrong pronunciations into account. Each phoneme is modelled by an HMM whose emission probabilities are estimated by a neural network trained on both Dutch speech data, for estimating the posterior probabilities of the phonemes of the target language, and French speech data, for estimating the posterior probabilities of the listed potential mistakes. Based on these probabilities, it is possible to find, by Viterbi alignment, the most probable phoneme sequence corresponding to the recorded speech and to localize what has been mispronounced.

Figure 2: The phoneme graph where the sentence 'ga zitten' is modelled as a sequence of phonemes pronounced in the right way ('G') and possible pronunciation errors ('g', 'k', ...).

This method requires knowing in advance all the mistakes that could be uttered by the non-native speakers. Practically, only the most probable errors can be taken into account. So, to ensure that the system is able to detect unpredicted pronunciation errors, we add a garbage model in parallel as an alternative to the listed errors, assuming that if a pronunciation is too far from the correct one, it will be detected by this garbage model as an undefined error.
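A minimal sketch of how such a graph could be assembled (the confusion map, function names and '<garbage>' token are illustrative assumptions, not the paper's implementation; the garbage emission follows the averaged N-best rule described in the next paragraph):

```python
import numpy as np

# hypothetical confusion map: for French learners of Dutch, 'G' is
# often realized as 'g', 'x' or 'k' (the example from the text)
CONFUSIONS = {'G': ['g', 'x', 'k']}
GARBAGE = '<garbage>'

def build_phoneme_graph(transcription, confusions):
    """For each phoneme of the correct transcription, list the competing
    arcs: the correct phoneme, its predictable mispronunciations, and a
    garbage arc that catches unpredicted errors."""
    return [[ph] + confusions.get(ph, []) + [GARBAGE]
            for ph in transcription]

def garbage_emission(frame_posteriors, n_best=5):
    """Emission probability of the garbage state for one frame:
    the average of the N best class posteriors from the ANN."""
    top = np.sort(np.asarray(frame_posteriors))[-n_best:]
    return float(top.mean())

graph = build_phoneme_graph(['G', 'a'], CONFUSIONS)
# graph[0] → ['G', 'g', 'x', 'k', '<garbage>']
```

Each inner list would then become a set of parallel HMM arcs in the Viterbi alignment, so the decoder can prefer the garbage arc whenever no listed pronunciation fits well.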
The garbage model is a particular HMM state whose emission probability is computed as the average of the N best probabilities provided by the neural network [1]. As we are still working at the phoneme level, we define an output layer of 62 units corresponding to one unit per phonetic class: the 42 Dutch phonemes already defined in the previous section and 20 phonemes from the French language covering the 11 most frequent pronunciation errors². We used the BREF database [7] in order to train those 20 phonemes extracted from the French language. This model will be called the mixed model in the rest of this paper.

² In SAMPA format: e, a, o, y, u, H, S, Z, z, E, e, O, o, 2, 9, @, g, k, s, N

As the ANNs are locally discriminant, we are now able to discriminate between the phonemes that are correctly pronounced by the speaker and those that are not. The classification rates at the frame level for this ANN can be seen in Table 2.

               Train    Cross
  Mixed ANN    77.1%    76.2%

Table 2: Phoneme classification rate at the frame level with the ANN trained on the mixed database, using log-RASTA-PLP parameters.

We are thus able to evaluate, at each frame, the probability of being in one particular phoneme with an accuracy of about 76%. This information is used directly by our system in order to evaluate the pronunciation. The system has been evaluated on native and non-native (French) speakers uttering Dutch sentences. The speakers (12 natives, 12 non-natives) were asked to pronounce several sentences (a total of 2,749 sentences) for which pronunciation errors were manually identified by linguistic experts. Eleven different potential difficulties for French speakers have been selected and around 2,000 pronunciation errors have been marked in the test database. The system was able to automatically detect and identify 70% of the manually labelled mistakes.

4. EXTENSION TO OTHER LANGUAGES

The aim of the L-KIT project is to build a toolbox that will allow anyone to train specific speech recognition systems in order to integrate pronunciation error detection in language learning tools, in as many languages as possible (we are currently working on French-English and French-German).
The approach proposed in this paper of course requires strong linguistic knowledge in order to identify as completely as possible the potential pronunciation errors encountered by the speakers. Moreover, the system depends not only on the target language (the teacher's) but also on the source language (the student's). Nevertheless, this method offers an efficient way to introduce ASR in language learning tools. In addition, we plan to incorporate gender-dependent models to improve the speech recognition system, as well as additional tools able to detect stress in a sentence (using pitch, duration and energy), since stress is also a source of many pronunciation errors in languages such as Italian or Spanish.

5. CONCLUSION

This paper discusses an original approach for the automatic detection and correction of pronunciation errors made by foreign language learners. Particular attention has been dedicated to the creation and labelling of a speech database in Dutch, pronounced by native and non-native speakers. The final application is able to identify errors at the phoneme level, with an accuracy of 70%. This result has been achieved by using a hybrid HMM/ANN speech recognition system, which combines Hidden Markov Models and Artificial Neural Networks, and by training the system on a mixed database containing the phonemes of the target language and possible sounds from the language of the learner. Exercises can be prepared as series of sentences for which the most probable mistakes have been identified. The system can also automatically generate competing phonetic transcriptions of the words in the sentence from a list of predefined pronunciation difficulties. The system is then able to detect the mispronounced phonemes and give much more accurate advice on how to improve the pronunciation.

6. REFERENCES

[1] J.-M. Boite, H. Bourlard, B. D'hoore, S. Accaino and J. Vantiegem, "Task Independent and Dependent Training: Performance and Comparison of HMM and Hybrid HMM/MLP Approaches", Proc. ICASSP'94, Adelaide, Australia, Vol. 1, pp.
617-620.

[2] O. Deroo, G. Deville, H. Leich, S. Gielen and J. Vanparys, "Automatic Detection and Correction of Pronunciation Errors for Foreign Language Learners: the Demosthenes Application", Proc. Eurospeech'99, Budapest, Hungary, Vol. 2, pp. 843-846.

[3] H. Franco, L. Neumeyer, Y. Kim and O. Ronen, "Automatic Pronunciation Scoring for Language Instruction", Proc. ICASSP'97, Munich, Germany, pp. 1470-1474.

[4] H. Hermansky and N. Morgan, "RASTA Processing of Speech", IEEE Trans. Speech and Audio Processing, Vol. 2, No. 4, pp. 578-589, Oct. 1994.

[5] F. Jelinek, "Statistical Methods for Speech Recognition", The MIT Press.

[6] Y. Kim, H. Franco and L. Neumeyer, "Automatic Pronunciation Scoring of Specific Phone Segments for Language Instruction", Proc. Eurospeech'97, Rhodes, Greece, pp. 645-649.

[7] L.F. Lamel, J.-L. Gauvain and M. Eskenazi, "BREF, a Large Vocabulary Spoken Corpus for French", Proc. of the European Conference on Speech Communication and Technology, 1991, Vol. 2, pp. 505-508.

[8] O. Ronen, L. Neumeyer and H. Franco, "Automatic Detection of Mispronunciation for Language Instruction", Proc. Eurospeech'97, Rhodes, Greece, pp. 649-652.

[9] J. Vanparys, G. Deville and S. Gielen, "Demosthenes: naar uitspraakremediëring met de computer" [Demosthenes: towards pronunciation remediation with the computer], ANBF-nieuwsbrief, November 1998, pp. 89-102.