Getting word timestamps in audio
Speechace segments and aligns user audio at the word, syllable, and phoneme levels. The Speechace API returns detailed extent information at the syllable and phoneme levels:
Syllable Level: Data is returned in the syllable_score_list[] array.
Phoneme Level: Data is returned in the phone_score_list[] array.
The extent[] field contains the begin and end timestamps for that syllable or phoneme, in units of 10 msec.
For example, a phoneme /sh/ with an extent of [25, 35] spans 250 to 350 msec in the user audio file.
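As a rough illustration of the unit conversion, the sketch below turns an extent into milliseconds. The helper name extent_to_ms and the sample entry are made up for illustration; only the extent field and its 10 msec unit come from the description above.

```python
# Each extent value counts 10 msec units, as described above.
UNIT_MS = 10


def extent_to_ms(extent):
    """Convert an extent [begin, end] in 10 msec units to (begin_ms, end_ms)."""
    begin, end = extent
    return begin * UNIT_MS, end * UNIT_MS


# Hypothetical phoneme entry, reduced to the fields needed here.
sh_phone = {"phone": "sh", "extent": [25, 35]}

print(extent_to_ms(sh_phone["extent"]))  # (250, 350)
```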

Timestamp extent information can be used to zoom in on and play back specific words, for example to demonstrate a test-taker's mistakes or the correct pronunciation of a word from a reference recording.
To do so you need to iterate through the Speechace API JSON result as follows:
for each word in text_score.word_score_list[]:
    get the first and last elements of phone_score_list[] for that word
    start_timestamp = extent[0] of the first element
    end_timestamp = extent[1] of the last element
    # timestamps are in units of 10 msec; multiply by 10 to get msec
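The steps above could be sketched in Python roughly as follows. This is a minimal sketch, not an official client: it assumes the API response has already been parsed into a dict, that each word entry carries its text under a word key, and the sample data is invented purely to exercise the function.

```python
def word_timestamps(result):
    """Return (word, start_ms, end_ms) tuples from a parsed Speechace-style result."""
    timestamps = []
    for word in result["text_score"]["word_score_list"]:
        phones = word.get("phone_score_list", [])
        if not phones:
            continue  # no aligned phonemes, so no timestamps for this word
        # Word start is the begin of its first phoneme; word end is the end of its last.
        start_ms = phones[0]["extent"][0] * 10  # extents are in 10 msec units
        end_ms = phones[-1]["extent"][1] * 10
        timestamps.append((word.get("word"), start_ms, end_ms))  # "word" key is assumed
    return timestamps


# Invented sample data, reduced to the fields used above.
sample = {
    "text_score": {
        "word_score_list": [
            {"word": "she", "phone_score_list": [
                {"phone": "sh", "extent": [25, 35]},
                {"phone": "iy", "extent": [35, 42]},
            ]},
            {"word": "sells", "phone_score_list": [
                {"phone": "s", "extent": [45, 55]},
                {"phone": "eh", "extent": [55, 63]},
                {"phone": "l", "extent": [63, 70]},
                {"phone": "z", "extent": [70, 80]},
            ]},
        ]
    }
}

for text, start_ms, end_ms in word_timestamps(sample):
    print(f"{text}: {start_ms} to {end_ms} msec")
```

The resulting millisecond values can then be passed to whatever audio player or editing tool is used to cut or play back the matching segment.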