2.1. Lexical Stress Features in the Croatian Language
In Croatian, there are four types of lexical stress: short falling (SF), with a high tone on the initial short stressed syllable and low tones elsewhere; short rising (SR), with a high tone on the short stressed syllable and the following syllable; long falling (LF), with a falling tone on the long stressed syllable; and long rising (LR), with a high tone on the long stressed syllable that continues high into the following syllable [12]. The position of lexical stress in a word is also not fixed; it can fall on any syllable except the last.
In Croatian, a distinction is also made between prosodic and orthographic words. An orthographic word is a part of the text that remains an inseparable whole even under syntagmatic modifications (insertion or change). The parts of an orthographic word are morphemes: the semantic core, prefixes, infixes, and suffixes. Orthographic words are written separately in a text. A prosodic word consists of a sequence of syllables that syntagmatically relate to a single stressed syllable. It usually comprises the semantic nucleus and one or more morphemes that denote the linguistic, modal, and logical relations of the nucleus. These morphemes are either grammatical morphemes that are part of an orthographic word or separate grammatical words, known as clitics or atonic words, which do not have their own lexical stress and cannot occur as separate spoken words.
Clitics that depend on the following word for lexical stress are called proclitics, and those that depend on the preceding word are called enclitics (e.g., auxiliary verbs, pronouns, and the particle “li”). All other words that are not clitics are called tonic words.
Table 1 lists Croatian proclitics and enclitics according to [13]. Enclitics do not have independent stress, e.g., vȉdīmga (Eng. I see him) or prȅdālisunamse (Eng. they surrendered to us). On the other hand, proclitics lack lexical stress before words with rising stress, e.g., u vòdi (Eng. in the water) or po ljepòti (Eng. by the beauty), but possess one before words with falling accents, as the stress is shifted from the stressed word to the proclitic, e.g., ȕzoru (Eng. at dawn) and pȍvodu (Eng. for water).
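The grouping of orthographic words into prosodic words described above can be sketched in a few lines of Python. This is only a minimal illustration: the two clitic sets are small hand-picked subsets chosen for the example (they do not reproduce Table 1), the function name `group_prosodic_words` is hypothetical, and ambiguous forms that can be either clitic or tonic are not handled.

```python
# Illustrative subsets only (assumption); Table 1 is not reproduced here.
PROCLITICS = {"u", "na", "po", "za", "i", "ne"}
ENCLITICS = {"sam", "si", "je", "smo", "ste", "su", "ga", "mu", "se", "li", "nam"}


def group_prosodic_words(tokens):
    """Group orthographic words into prosodic words.

    Proclitics attach to the following word, enclitics to the preceding one.
    """
    groups = []
    pending = []  # proclitics waiting for the next word
    for tok in tokens:
        low = tok.lower()
        if low in PROCLITICS:
            pending.append(tok)
        elif low in ENCLITICS and groups and not pending:
            # Enclitic: lean on the preceding prosodic word.
            groups[-1].append(tok)
        else:
            # Tonic word (or an enclitic with nothing to lean on):
            # it opens a new prosodic word and absorbs pending proclitics.
            groups.append(pending + [tok])
            pending = []
    if pending:
        # Trailing proclitics without a following word are kept as their own group.
        groups.append(pending)
    return groups


print(group_prosodic_words(["vidim", "ga", "u", "vodi"]))
```

Each inner list corresponds to one prosodic word, so the number of syllables per prosodic word can be counted over the joined group rather than over individual orthographic words.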
The prosody of a prosodic word consists of the number of syllables, the prosodic features of the syllables, and their mutual prosodic relations. A prosodic word is a whole regardless of the number of orthographic words that make it up, as confirmed by the shift of lexical stress. The rule states that falling stress can occur only on the first syllable; when a prefix or proclitic is added to the semantic nucleus, the stress shifts from the first syllable of the nucleus to the prefix or proclitic, e.g., znȁti (Eng. to know), pòznati (Eng. to recognize), nė znati (Eng. not to know). Exceptions are compounds, where falling stress is also found in the middle of the word, and sometimes the link with the proclitic is on the same level of the compound, e.g., poljoprìvreda (Eng. agriculture). In trisyllabic and multisyllabic words, the lexical stress does not shift, e.g., poȍpomenama (Eng. after warnings).
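A simplified sketch of the shift onto a proclitic is given below. It covers only the pattern seen in pȍvodu, where the proclitic receives a short falling accent; it does not model prefixes, compounds, vowel length, or syllabic r. The function name `attach_proclitic` and the `max_syllables` parameter are assumptions introduced for illustration, with the default of 2 encoding the statement that stress does not shift in trisyllabic and longer words.

```python
import unicodedata

SHORT_FALLING = "\u030f"  # combining double grave accent, as in ȍ
LONG_FALLING = "\u0311"   # combining inverted breve, as in ȏ
FALLING = {SHORT_FALLING, LONG_FALLING}
VOWELS = set("aeiouAEIOU")  # simplification: syllabic r is ignored


def first_vowel_index(text):
    """Index of the first vowel letter in an NFD string, or None."""
    return next((i for i, ch in enumerate(text) if ch in VOWELS), None)


def attach_proclitic(proclitic, host, max_syllables=2):
    """Join proclitic and host, moving an initial falling accent to the proclitic."""
    p = unicodedata.normalize("NFD", proclitic)
    h = unicodedata.normalize("NFD", host)
    pv = first_vowel_index(p)
    hv = first_vowel_index(h)
    syllables = sum(ch in VOWELS for ch in h)
    has_initial_falling = (
        hv is not None and hv + 1 < len(h) and h[hv + 1] in FALLING
    )
    if has_initial_falling and pv is not None and syllables <= max_syllables:
        h = h[:hv + 1] + h[hv + 2:]                   # drop the accent from the host
        p = p[:pv + 1] + SHORT_FALLING + p[pv + 1:]   # mark the proclitic instead
    return unicodedata.normalize("NFC", p + h)


print(attach_proclitic("po", "vȍdu"))       # falling accent on a disyllabic host
print(attach_proclitic("u", "vòdi"))        # rising accent: nothing moves
print(attach_proclitic("po", "ȍpomenama"))  # longer host: nothing moves
```

Working on the NFD form keeps the accent as a separate combining character, so moving it is a plain string operation; the result is recomposed to NFC before it is returned.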
Since a prosodic word consists of one or more orthographic words, it is on average about 40% longer than the orthographic word. In the standard Croatian language, the average prosodic word has 3.12 syllables, and the average orthographic word has 2.25 syllables. Orthographic words are most often monosyllabic (43.42%), then two-syllabic (27.3%), three-syllabic (21.6%), four-syllabic (12.5%), five-syllabic (3%), and six-syllabic (0.85%). Prosodic words, according to [9], are most frequently trisyllabic (31.5%), then two-syllabic (28.7%), followed by four-syllabic (22.4%), five-syllabic (8.6%), monosyllabic (4.9%), and six-syllabic (2.9%).
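The figures above can be checked with a short calculation. The sketch below derives the length ratio from the two reported averages and estimates the mean prosodic word length from the listed shares; since only words of up to six syllables are listed, the estimate is approximate. All variable names are illustrative.

```python
# Reported averages (syllables per word).
avg_prosodic = 3.12
avg_orthographic = 2.25
ratio = avg_prosodic / avg_orthographic

# Reported shares of prosodic words by syllable count (percent).
prosodic_share = {1: 4.9, 2: 28.7, 3: 31.5, 4: 22.4, 5: 8.6, 6: 2.9}
covered = sum(prosodic_share.values())
mean_listed = sum(n * share for n, share in prosodic_share.items()) / covered

print(f"ratio of averages: {ratio:.3f}")
print(f"share covered by listed lengths: {covered:.1f}%")
print(f"mean length over listed lengths: {mean_listed:.2f} syllables")
```

The ratio of roughly 1.39 corresponds to the stated difference of about 40%, and the mean computed from the listed prosodic shares is close to the reported average of 3.12 syllables.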
2.2. Related Work
In addition to text-to-speech systems, lexical stress assignment can be useful in a variety of domains. For example, in [14], lexical stress classification was used in medicine to assess dysprosody in childhood apraxia of speech. The research showed promising results in automatically classifying lexical stress to detect errors in children’s speech during diagnosis or treatment-related changes, but the authors concluded that further training of the algorithms on larger datasets is needed. In [15], lexical stress in language learning was used to detect errors in the speech of non-native speakers of English as a second language. The authors reported 94.8% precision and 49.2% recall for detecting incorrectly stressed words in the English L2 speech of Baltic and Slavic speakers. In [16], automatic accent classification was performed and its use in forensic applications was described.
Although there is a large body of literature on the automatic assignment of lexical stress for well-resourced languages, research on Slavic languages (a family to which the Croatian language belongs) is sparse. The main reason for this is that these languages are not only under-resourced (i.e., lack speech or pronunciation corpora and language models) but also morphologically rich.
Although there are efforts to train multilingual models and build resources to support languages with insufficient resources [17,18,19], the languages in the model must be related in some way to produce high-quality results. Since Slavic languages are similar but generally under-resourced and morphologically rich, it is difficult to create an environment in which they can be trained together with well-resourced languages.
One of the main problems for languages with insufficient resources and rich morphology is how to deal with out-of-vocabulary (OOV) words.
Therefore, the basic goal of developing rule-based text-to-phoneme mapping systems is to assign stress to OOV words. A lexicon in which the lexical stress of basic and derived inflectional word forms is marked cannot cover all the words that may occur in texts. In such situations, a statistical method can be applied to find the most probable position of lexical stress and type of accent in a word. For this purpose, classification and regression decision trees, the support vector machine (SVM) method, hidden Markov models (HMMs), and naive Bayes classifiers have been used for low-resource languages.
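The lexicon-plus-fallback idea can be illustrated with a deliberately simple stand-in for the statistical methods listed above: a majority vote over lexicon entries that share the longest word-final suffix with the OOV word. This is not one of the cited classifiers. The toy lexicon contains only forms mentioned earlier in this section, and the function name `predict_stress`, the `(syllable, accent)` label format, and the `max_suffix` parameter are assumptions made for the example.

```python
from collections import Counter

# Toy lexicon (assumption): word -> (stressed syllable, accent type).
LEXICON = {
    "vodi": (1, "SR"),
    "ljepoti": (2, "SR"),
    "znati": (1, "SF"),
    "poznati": (1, "SR"),
}


def predict_stress(word, lexicon=LEXICON, max_suffix=4):
    """Look the word up; for OOV words, back off to the longest shared suffix."""
    if word in lexicon:
        return lexicon[word]
    for k in range(min(max_suffix, len(word)), 0, -1):
        suffix = word[-k:]
        votes = Counter(
            label for entry, label in lexicon.items() if entry.endswith(suffix)
        )
        if votes:
            # Most frequent label among entries with the same suffix.
            return votes.most_common(1)[0][0]
    # No suffix matched: fall back to the most frequent label overall.
    return Counter(lexicon.values()).most_common(1)[0][0]


print(predict_stress("znati"))    # in the lexicon
print(predict_stress("dobroti"))  # OOV, resolved through the suffix "oti"
```

A trained classifier would replace the suffix vote with a model over richer features, but the control flow (lexicon first, statistical prediction only for OOV words) stays the same.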
In terms of the related work in the field of automatic lexical stress detection and mapping for low-resource languages, Ni, Liu, and Xu [20] developed a hierarchical model-based boosting classification and regression tree (CART) for Mandarin stress detection using acoustic evidence and textual information. Gharavian, Sheikhan, and Ghasemi [21] developed a combined classification model for lexical stress detection in Farsi (HMM was used to segment stressed sentences, additional features were extracted from pitch and formant frequencies, and six feature sets were selected using fast correlation-based filter feature selection). For Hindi, a hybrid model (rule-based and statistical learning) was used [22]. James et al. [23] also used HMMs for the under-resourced language of Māori (New Zealand).
Ciobanu, Dinu, and Dinu [24] used SVM to find the boundary between syllables and predict stressed syllables for Romanian. Lorincz et al. [25] describe “RoLEX”, a dataset for the Romanian language containing over 330,000 entries with information on the lemma, morphosyntactic description, syllabification, lexical stress, and phonemic transcription. Moreover, Marinčič, Tušar, Gams, and Šef [26] used classification trees to determine the lexical stress position and type of accent in Slovenian. First, based on the context of each vowel, a model predicting whether it is stressed was created (a new model was created for each vowel), followed by a model predicting the type of accent. Unlike Croatian, where there are four types of accents that can be placed on any of the five vowels or “syllabic r”, in Slovenian, the vowel e can have three types of accents, the vowel o can have only two, and the other vowels are either stressed or unstressed.
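To make the idea of classifying each vowel from its context more concrete, the following sketch builds one feature row per vowel in a word. The feature set (neighbouring letters, syllable index, syllable count) and the function name `vowel_contexts` are assumptions chosen for illustration and are not the features used in [26]; syllabic r is ignored for simplicity.

```python
VOWELS = set("aeiou")


def vowel_contexts(word, window=2):
    """Return one feature dictionary per vowel, using '#' as word-boundary padding."""
    lowered = word.lower()
    padded = "#" * window + lowered + "#" * window
    nuclei = [i for i, ch in enumerate(lowered) if ch in VOWELS]
    rows = []
    for order, i in enumerate(nuclei, start=1):
        j = i + window  # position of the vowel in the padded string
        rows.append({
            "vowel": padded[j],
            "left": padded[j - window:j],
            "right": padded[j + 1:j + 1 + window],
            "syllable": order,
            "n_syllables": len(nuclei),
        })
    return rows


for row in vowel_contexts("voda"):
    print(row)
```

Rows of this kind could be grouped by the `vowel` field to train a separate stressed/unstressed model per vowel, with a second model predicting the accent type for vowels labelled as stressed.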
For Croatian, there are works proposing acoustic modelling for speech recognition and speech synthesis [27], intonation modelling [28], and prototype systems for Croatian speech synthesis [29], although in the mentioned studies, lexical stress was not considered.
There are also lexical resources for Croatian that are very useful for various language processing tasks, such as the Croatian Morphological Lexicon [30] and hrLex v1.3 [31], which are inflectional lexicons of Croatian. In recent years, resources for processing Croatian at the level of derivational morphology have been published, such as the Croatian Derivative Lexicon (CroDeriv) [32] and DerivBase HR [33]. None of these lexicons, however, marks lexical stress on words, which is a very important feature for the naturalness of synthesized speech and the performance of speech-to-text systems.
To the best of our knowledge, no comparable research on automatic stress recognition and assignment for the Croatian language has been published so far.