


Perception II


Feb 7, 2008




Outline of Topics
  Concepts & terms  
  Machine hearing  
  Speech research  
  Levels of abstraction in dialogue  
  Speech generation  
  Speech recognition  
    Early recognition  
    Connected word recognition  
    Continuous speech recognition  
  Hidden Markov Models  
  HMM computations  
  Errors in speech recognition  











Concepts & Terms  
  Speech Sounds that a human makes with their throat and mouth that convey information in the form of symbols to other humans
  Monolog One person speaking on their own
  Dialog Two people talking to each other; often also used for any number of people talking to each other
  Conversation Another word for dialog
  Representation The convoluted issue of how data and information are stored — not the medium, but the form (example: representing the relative position of the moon and Earth with absolute numbers or as differences)
  Phoneme "The smallest meaningful sound snippet in speech"
  Corpus A body of information that can be used for automatic and manual analysis, e.g. to extract probabilities of events in the real world













Machine hearing  
  Goal Get machines to hear sounds in a way that allows them to act intelligently
  Scene analysis Try to figure out a high-level understanding of where you are from the type of sounds heard all around, e.g. "restaurant", "ocean front", "indoors", etc.
  Musical genre classification Identifying what type of music is playing, e.g. "disco", "rap", etc., directly by listening to the audio file.
  Timbral signature classification "Recognizing objects from their sound"
  Main focus in machine hearing Speech recognition
  Main focus in speech recognition phonemes, words and sentences
  Problem Even if you have figured out what words are being said, in what order, you do not necessarily understand what was meant








Speech research  
  Why study human speech production & perception? The human mechanisms used for speech production have still not been matched by artificial means;
the human mechanisms for speech recognition are even less understood

We must study the phenomenon we are trying to mimic

This study can take the form of inspiration or faithful reproduction — both have been tried.
Speech is not useful for machine-machine interaction; it is useful for human-machine interaction.
  We must study the two together They evolved together
In all known practical applications they are used together.











Levels of abstraction in dialog  
  Acoustics Speech is just soundwaves
  Articulation Neural impulses control muscles to produce the sounds
  Phonemic Phonemes are the smallest units of sound that distinguish meaning in a language.
The phonemes symbolized by the letters "h" and "y" separate "hello" and "yellow" semantically (i.e. change their meaning). Thus the sounds that make "hello" and "yellow" sound different are classified as phonemes.
  Lexical The dictionary
  Syntactic Syntax dictates legal ways to combine words
Noun phrase + verb phrase: [The door] [opened]
  Semantic "It is hot in here" = the temperature is such that the speaker believes it can be qualified as "hot"
  Pragmatic "It is hot in here" = open the window, please!
  Discourse Everything above + turn-taking and context













Speech Generation
  Must be understood for recognition  
  Voiced and unvoiced sounds Speech is a sequence of pitched sounds, or "tones", and noise bursts
As a rough simplification, the former are typically called vowels, the latter consonants (some consonants, such as the nasals n and m, are voiced)
  Paraverbals Sounds made by throat and mouth; what separates paraverbals from speech is their inability to be used in the same way as words or phonemes
What gives them meaning is their use in the semantic and especially pragmatic layer of discourse
  Voiced Voiced sounds typically carry the most energy of all speech sounds
Steady state: you can "say it forever"
Diphthongs: two vowels strung together, e.g. hi (ha-ee)
Nasals: n, m
  Unvoiced Noise burst of some sort
Transient state
Plosives: p, t, k
Fricatives: s, th, f
  Coarticulation Effects of upcoming sounds on the production of the sounds being made at the present — e.g. "show" vs. "shine": "sh" sound is different in each of these words
Complicates speech synthesis
  Humans generate sound by modifying physical structures Very different from how current speech synthesis technology works.
  Vocal tract is different in everyone Timbral quality of everyone's voice is different.
Timbre: the quality given to a sound by its (unique set of) overtones.
The attribute of auditory sensation that enables a listener to distinguish two similar sounds with the same pitch and loudness.
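
The contrast between pitched (voiced) sounds and noise bursts (unvoiced sounds) described above can be illustrated with a small Python sketch. It is not part of the lecture material: it assumes a pure 120 Hz sine wave as a stand-in for a voiced sound, uniform random noise as a stand-in for an unvoiced burst, and uses the zero-crossing rate, a simple measure chosen here only for illustration. The function names and the 8 kHz sample rate are hypothetical.

```python
import math
import random

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)

def tone(freq_hz, n_samples):
    """Pitched stand-in for a voiced sound: a pure sine wave."""
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t in range(n_samples)]

def noise_burst(n_samples, seed=0):
    """Stand-in for an unvoiced sound: uniform random noise."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)

voiced = tone(120, 800)
unvoiced = noise_burst(800)
print(zero_crossing_rate(voiced))    # low: the waveform changes sign slowly
print(zero_crossing_rate(unvoiced))  # high: the sign flips roughly half the time
```

Real speech is far less clean than these two synthetic signals, so this only shows why the two classes look different to a machine, not how a recognizer actually separates them.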











Early Recognition  
  Method: Template matching Early speech recognition was mostly this
  Typically used for single words, spoken in isolation "Isolated word recognition"
  How it works

Templates are created by transforming each item in a training set into a feature vector

Templates {t0, ... tn} represented as points in feature space

Incoming sound x is transformed into the same feature space; the distance from the position of x to each template point in the space is measured

The closest point is the best match

A threshold on distance rejects a match that is too far away; a second threshold on the distance between the closest and second-closest templates may reject all candidates when the match is ambiguous
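
The steps above can be sketched in Python as follows. This is a minimal illustration, not the method of any particular system: it assumes Euclidean distance, two-dimensional feature vectors with made-up values, and hypothetical threshold parameters `max_distance` and `min_margin`.

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_template(features, templates, max_distance=1.0, min_margin=0.2):
    """Return the label of the closest template, or None if rejected.

    templates: dict mapping a word label to its feature vector.
    max_distance and min_margin are hypothetical thresholds.
    """
    ranked = sorted((euclidean(features, vec), label)
                    for label, vec in templates.items())
    best_dist, best_label = ranked[0]
    if best_dist > max_distance:
        return None  # too far from every template
    if len(ranked) > 1 and ranked[1][0] - best_dist < min_margin:
        return None  # closest and second closest are too similar
    return best_label

# Invented templates for three isolated words
templates = {"yes": [0.9, 0.1], "no": [0.1, 0.8], "stop": [0.5, 0.5]}

print(match_template([0.85, 0.15], templates))  # clear match: yes
print(match_template([0.7, 0.3], templates))    # ambiguous: None
print(match_template([3.0, 3.0], templates))    # too far away: None
```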

  This method is still used, for example in some personalized voice dialing services  










Connected Word Recognition  
  What it is the ... kind ... of ... speech ... recognition ... that ... requires ... pauses ... between ... each ... word
  When was it popular 70s to mid-80s
  Advances in computing power increased the speed of isolated word recognition This led to improvements over the template matching systems people had built before
  Main problem Very unnatural; breaks the flow of speech; introduces hesitations and artifacts -- in short, people are not good at doing this in real situations












Continuous Speech Recognition  
  What is it You speak fluently (although: with silence before and after!)
  When was it popular Early 90s to present day
  Example Sphinx IV -- Java-based open-source speech recognizer
Download: http://cmusphinx.sourceforge.net/sphinx4/
  Uses HMMs extensively Hidden Markov Models











Hidden Markov Models (HMMs)
  Good for analyzing temporal patterns Solid statistical foundation.
HMMs can be trained on sequences; training is done using corpora.
  Represented by triple (π, A, B) π = Initial state probabilities vector
A = {aij} State transition matrix
B = {bij} Output (emission) matrix
  π = Initial state probabilities vector Some states may be more probable as start states than others
  A = {aij} State transition matrix State i has probability aij of transitioning to state j
  B = {bij} Output (emission) matrix State i has probability bij of outputting symbol j
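
As a minimal sketch of the triple (π, A, B), the Python below stores the three parts as plain lists, checks that each row is a probability distribution, and generates a short sequence from the model. The two-state, two-symbol numbers are invented for illustration, and the function names are hypothetical.

```python
import math
import random

def is_distribution(probs):
    """True if the values are non-negative and sum to 1."""
    return all(p >= 0 for p in probs) and math.isclose(sum(probs), 1.0)

def validate_hmm(pi, A, B):
    """Check that pi, every row of A and every row of B are distributions."""
    n = len(pi)
    return (is_distribution(pi)
            and len(A) == n
            and all(len(row) == n and is_distribution(row) for row in A)
            and len(B) == n
            and all(is_distribution(row) for row in B))

def sample(pi, A, B, length, seed=0):
    """Generate (states, symbols) of the given length from the model."""
    rng = random.Random(seed)
    state_ids = list(range(len(pi)))
    symbol_ids = list(range(len(B[0])))
    state = rng.choices(state_ids, weights=pi)[0]      # start state drawn from pi
    states, symbols = [], []
    for _ in range(length):
        states.append(state)
        symbols.append(rng.choices(symbol_ids, weights=B[state])[0])  # output via B
        state = rng.choices(state_ids, weights=A[state])[0]           # move via A
    return states, symbols

# Toy model (numbers invented for illustration)
pi = [0.8, 0.2]
A = [[0.6, 0.4],
     [0.1, 0.9]]
B = [[0.7, 0.3],
     [0.2, 0.8]]

print(validate_hmm(pi, A, B))
print(sample(pi, A, B, length=5))
```

An observer sees only the symbols, never the states, which is what makes the model "hidden".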













Bigram and Trigram Models for Words
  Speech recognition: goal Identify a sequence of words, (w1, w2, w3, ..., wn)
  Ultimate goal of speech recognition P(words | signal)
  Probability of a particular word following another word - Bigram model P(word_i | word_{i-1})
  Where the knowledge comes from Calculated using large corpora of words, e.g. newspapers
  Given word_{i-1} we calculate the probability of word_i
  Probability of a particular word given the two preceding words - Trigram model P(word_i | word_{i-1}, word_{i-2})
  Bigram model for phones P(phone | signal)
  Where the knowledge comes from Calculated using large corpora of recorded speech, hand-coded for which phonemes appear where in the corpus
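
To make the word bigram model concrete, here is a small Python sketch that estimates P(word_i | word_{i-1}) by counting word pairs in a corpus. The three-sentence corpus and the function names are invented for illustration, and the estimate is a plain relative frequency with no smoothing, which a real system would need for unseen word pairs.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count how often each word occurs as a context and each word pair occurs."""
    contexts, bigrams = Counter(), Counter()
    for words in sentences:
        contexts.update(words[:-1])            # words that have a successor
        bigrams.update(zip(words, words[1:]))  # (previous word, word) pairs
    return contexts, bigrams

def bigram_probability(prev_word, word, contexts, bigrams):
    """Relative-frequency estimate of P(word | prev_word)."""
    if contexts[prev_word] == 0:
        return 0.0  # previous word never seen with a successor
    return bigrams[(prev_word, word)] / contexts[prev_word]

# Tiny invented corpus, already split into words
corpus = [
    "the door opened".split(),
    "the door closed".split(),
    "the window opened".split(),
]
contexts, bigrams = train_bigrams(corpus)

print(bigram_probability("the", "door", contexts, bigrams))     # 2 of 3 cases
print(bigram_probability("door", "opened", contexts, bigrams))  # 1 of 2 cases
```

A trigram model follows the same pattern with triples of words and pairs as the context.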













HMM computations (refer to figure 24.36 page 764 in your textbook for [m])
  States: O, M, E  
  Transition probabilities
  • START -> O = 1.0
  • O -> O = 0.3
  • O -> M = 0.7
  • M -> M = 0.9
  • M -> E = 0.1
  • E -> E = 0.4
  • E -> TERMINATE = 0.6
  Signal quantization (C1, C4, C6)
  Output probabilities (likelihood of observation)
  • P(C1 | O) = 0.5
  • P(C4 | M) = 0.7
  • P(C6 | E) = 0.5
  Formula P([C1,C4,C6] | [m]) = (P(START->O) * P(O->M) * P(M->E) * P(E->TERMINATE)) * (P(C1 | O) * P(C4 | M) * P(C6 | E))
  Computations P([C1,C4,C6] | [m]) = (1.0 * 0.7 * 0.1 * 0.6) * (0.5 * 0.7 * 0.5)
= 0.042 * 0.175
= 0.00735
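
The computation above can be checked with a short Python sketch. It uses only the transition and output probabilities listed in these notes; the dictionary layout and the function name are assumptions made for illustration. With three observations the only path that reaches TERMINATE is O, M, E, so the probability of that single path is the whole answer.

```python
# Transition probabilities from the notes
transitions = {
    ("START", "O"): 1.0,
    ("O", "O"): 0.3, ("O", "M"): 0.7,
    ("M", "M"): 0.9, ("M", "E"): 0.1,
    ("E", "E"): 0.4, ("E", "TERMINATE"): 0.6,
}

# Output probabilities from the notes (only the three listed values)
outputs = {("O", "C1"): 0.5, ("M", "C4"): 0.7, ("E", "C6"): 0.5}

def path_probability(states, observations):
    """Probability of the observations along START -> states -> TERMINATE.

    Raises KeyError for any probability the notes do not list.
    """
    path = ["START"] + list(states) + ["TERMINATE"]
    p = 1.0
    for prev, cur in zip(path, path[1:]):
        p *= transitions[(prev, cur)]   # transition part
    for state, obs in zip(states, observations):
        p *= outputs[(state, obs)]      # output part
    return p

p = path_probability(["O", "M", "E"], ["C1", "C4", "C6"])
print(round(p, 5))  # 0.00735
```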












Errors in speech recognition  
  Main problems

- Environmental noise
- Heavy grammar restrictions force people to speak unnaturally
- Turn-taking model is not natural

  Noise - Included in this category are noises which actually count as communicative, for example tongue clicks and other meaningful non-speech mouth sounds
- Constant background noise deteriorates recognition rates; a threshold is quickly reached where recognition quality becomes too low to be useful
- Non-uniform background noises are even worse, as they send the HMMs down the wrong paths
  Error types - Rejection (false negative)
- Insertion (false positive)
- Substitution
  Interaction error types - Mistaken give-turn signal (initiated by silence before user is done speaking)
- Mistaken take-turn/interrupt signal (initiated by external noise, e.g. tongue click)
- Missed take-turn/interrupt signal
- Misinterpreted take-turn/interrupt signal (e.g. as valid response to ongoing talk)
  The main source of misrecognition Lack of understanding
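
The three recognition error types listed above (rejection, insertion, substitution) can be counted by aligning what was said with what the recognizer produced. The Python sketch below does this with a standard minimum-edit alignment. It is an illustration under assumptions: a rejection is treated as a spoken word that is missing from the output, the example sentences are invented, and the function name is hypothetical.

```python
def count_errors(reference, hypothesis):
    """Return (substitutions, insertions, rejections) for a minimum-error alignment.

    reference: list of words actually spoken.
    hypothesis: list of words the recognizer produced.
    """
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = (total, subs, ins, rej) for reference[:i] vs hypothesis[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, 0, i)   # every spoken word was missed
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, j, 0)   # every output word was inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, a, r = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (t, s, a, r)          # correct word
            else:
                diagonal = (t + 1, s + 1, a, r)  # substitution
            t, s, a, r = cost[i][j - 1]
            insertion = (t + 1, s, a + 1, r)
            t, s, a, r = cost[i - 1][j]
            rejection = (t + 1, s, a, r + 1)
            cost[i][j] = min(diagonal, insertion, rejection)
    return cost[n][m][1:]

spoken = "open the window please".split()
heard = "open a window please now".split()
print(count_errors(spoken, heard))  # (1, 1, 0)
```

Counting errors this way says nothing about the interaction errors or about whether the system understood what was meant, which the notes identify as the main source of misrecognition.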