A Rule-Based Model for Stemming Hausa Words
Abstract
1. Introduction
2. Literature Review
2.1. Natural Language Processing (NLP)
2.2. Morphology
2.3. Stemmer
2.4. Application of Stemming Algorithm
2.5. Stemming Errors
2.6. Stemming Algorithms
- N-Gram method;
- HMM method;
- YASS method.
- Lovins stemmer;
- Porter’s stemmer;
- Paice/Husk stemmer;
- Dawson stemmer.
3. Related Work
4. Method
4.1. Data Collection
4.1.1. Text Preprocessing
- Remove special characters: In this step, all punctuation is removed from the text. The string library of Python version 2.0 provides a predefined list of punctuation characters, such as ! " $ % & ' ( ) * + , - . / : ; [ ].
- Lowercasing: This is one of the most common preprocessing steps; all text is converted to a single case, preferably lower case.
- Tokenization: This step, used in both phases, splits a sequence of text or a document into sentences and then splits the sentences into words [18].
- Remove stop words: Stop word removal is another preprocessing step; it removes frequently occurring words that are not relevant to, or have no impact on, classifying sentiment [18]. Examples of stop words: “a”, “abin”, “akan”, “ake”, “amma”, “an”, “ana”, “ba”, “babu”, “bai”, “ban”, “baya”, “bayan”, “bisa”, “can”, “ce”, “cewa”, “ci”, “cikin”, “da”, “dab”, “daga”, “dai”, “dama”, “dan”, “daya”, “din”, “domin”, “don”, “duba”, “duk”, “dukkan”, “fa”, “fi”, “fiye”, “ga”, “gaba”, “gaban”, “haka”, “hakan”.
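The four preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `preprocess`, whitespace tokenization, and the shortened stop-word set are assumptions made for the example. Note that `string.punctuation` also contains the apostrophe, so a real Hausa pipeline may need to protect glottalized spellings before stripping.

```python
import string

# Illustrative subset of the stop words listed above (not the full list).
STOP_WORDS = {"a", "abin", "akan", "ake", "amma", "an", "ana", "ba", "babu",
              "cikin", "da", "daga", "don", "duk", "ga", "haka"}


def preprocess(text):
    """Strip punctuation, lowercase, tokenize on whitespace, drop stop words."""
    # A translation table with a deletion set removes all ASCII punctuation.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    lowered = no_punct.lower()
    # Assumption: whitespace splitting stands in for a full tokenizer.
    tokens = lowered.split()
    return [t for t in tokens if t not in STOP_WORDS]


print(preprocess("Yara suna cikin makaranta, amma babu malami!"))
# ['yara', 'suna', 'makaranta', 'malami']
```

Using a set for the stop words keeps each membership test constant-time, which matters once the full list is loaded.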
4.1.2. Checking for Exceptional Cases
4.1.3. Root Words
4.1.4. Affix-Stripping Rules
4.2. Proposed Model
4.3. Algorithm
Algorithm 1. Processing Sequence of the Stemmer
Input: Hausa raw text (α)
Output: Stemmed word (W)
Variables:
- α: Input raw text
- W: Stemmed word
- tokens: List of words produced by preprocessing
- N: Number of words in tokens
- i: Current word index
- word: Current word being processed
- lookup_table: Table of known word stems
- exception_table: Table of exceptional cases
- min_word_length (2): Minimum word length for processing
1. Start
2. Read α
3. Preprocessing stage
   α ← StripIrrelevantCharacters(α)
   α ← ConvertToLowercase(α)
   tokens ← TokenizeWords(α)
   tokens ← RemoveStopWords(tokens)
   N ← Count(tokens)
   i ← 1
4. While i ≤ N, do steps 5-10
   a. word ← tokens[i]
5. If Length(word) ≥ min_word_length, then
   Proceed to step 6
   Else
   Increment i by 1
   Continue to step 4
6. If word is in lookup_table, then
   W ← LookupStem(lookup_table, word)
   Increment i by 1
   Continue to step 4
   Else
   Proceed to step 7
7. If word is in exception_table, then
   W ← LookupStem(exception_table, word)
   Increment i by 1
   Continue to step 4
   Else
   Proceed to step 8
8. Apply affix stemming rules to word
   If word is stemmed (W found), then
   Increment i by 1
   Continue to step 4
   Else
   Proceed to step 9
9. If i = N, then
   Proceed to step 10
   Else
   Increment i by 1
   Continue to step 4
10. End
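The control flow of Algorithm 1 (length check, then lookup table, then exception table, then affix rules) can be sketched as a single loop in Python. Everything except the order of checks and the minimum length of 2 is an assumption made for illustration: the table contents, the placeholder `strip_affixes` rule, and the choice to report unstemmed words with `None`.

```python
MIN_WORD_LENGTH = 2  # value given in Algorithm 1

# Hypothetical table contents, for illustration only.
LOOKUP_TABLE = {"karatu": "karatu"}
EXCEPTION_TABLE = {"mahauci": "mahauci"}


def strip_affixes(word):
    """Placeholder rule: drop prefix 'ma' and suffix 'ci'.

    Returns None when the rule does not apply.
    """
    if word.startswith("ma") and word.endswith("ci") and len(word) > 4:
        return word[2:-2]
    return None


def stem_tokens(tokens):
    """Return (word, stem) pairs; stem is None when no step produced one."""
    results = []
    for word in tokens:
        if len(word) < MIN_WORD_LENGTH:
            continue  # step 5: too short, skip the word
        if word in LOOKUP_TABLE:  # step 6
            results.append((word, LOOKUP_TABLE[word]))
        elif word in EXCEPTION_TABLE:  # step 7
            results.append((word, EXCEPTION_TABLE[word]))
        else:  # step 8: fall back to affix stripping
            results.append((word, strip_affixes(word)))
    return results


print(stem_tokens(["mahauci", "a", "karatu", "mabuqaci", "gida"]))
```

A `for` loop replaces the explicit index `i` and the "increment and continue" branches, since each of those branches simply moves on to the next word.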
5. Results and Discussion
6. Conclusions
7. Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
API | Application Programming Interface |
ASR | Automatic Speech Recognition |
AWCF | Average Words Conflation Factor |
CSWF | Correctly Stemmed Words Factor |
HMM | Hidden Markov Model |
HNLP | Hausa Natural Language Processing |
IE | Information Extraction |
IND | Indexing |
IR | Information Retrieval |
MSE | Mis Stemming Errors |
MT | Machine Translation |
NER | Named Entity Recognition |
NLP | Natural Language Processing |
NLTK | Natural Language Toolkit |
OCR | Optical Character Recognition |
OSE | Over Stemming Errors |
POS | Part of Speech |
QA | Question Answering |
TC | Text Classification |
TClu | Text Clustering |
TS | Text Segmentation |
TS | Text Summarization |
USE | Under Stemming Errors |
WSF | Word Stemmed Factor |
YASS | Yet Another Suffix Stripper |
References
1. Alshalabi, H.; Tiun, S.; Omar, N.; AL-Aswadi, F.N.; Ali Alezabi, K. Arabic Light-Based Stemmer Using New Rules. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 6635–6642.
2. Zakari, R.Y.; Lawal, Z.K.; Abdulmumin, I. A Systematic Literature Review of Hausa Natural Language Processing. Int. J. Comput. Inf. Technol. 2021, 10, 173–179.
3. Xu, J.; Croft, W.B. Corpus-Based Stemming Using Cooccurrence of Word Variants. ACM Trans. Inf. Syst. 1998, 16, 61–81.
4. Khurana, D.; Koli, A.; Khatter, K.; Singh, S. Natural Language Processing: State of the Art, Current Trends and Challenges. Multimed. Tools Appl. 2023, 82, 3713–3744.
5. Inuwa-Dutse, I. The First Large Scale Collection of Diverse Hausa Language Datasets. arXiv 2021, arXiv:2102.06991.
6. Rakhmanov, O.; Schlippe, T. Sentiment Analysis for Hausa: Classifying Students’ Comments. In Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, Marseille, France, 20–25 June 2022; pp. 98–105.
7. Bashir, M.; Rozaimee, A.B.; Malini, W.; Isa, B.W. A Word Stemming Algorithm for Hausa Language. IOSR J. Comput. Eng. 2015, 17, 2278–2661.
8. Jabbar, A.; Iqbal, S.; Tamimy, M.I.; Hussain, S.; Akhunzada, A. Empirical Evaluation and Study of Text Stemming Algorithms. Artif. Intell. Rev. 2020, 53, 5559–5588.
9. Jabbar, A.; Iqbal, S.; Ilahi, M. High Performance Stemming Algorithm to Handle Multi-Level Inflections in Urdu Language. Research Square 2022.
10. Bimba, A.; Idris, N.; Khamis, N.; Noor, N.F.M. Stemming Hausa Text: Using Affix-Stripping Rules and Reference Look-Up. Lang. Resour. Eval. 2016, 50, 687–703.
11. Yalçin, O.G. Natural Language Processing. In Applied Neural Networks with TensorFlow 2: API Oriented Deep Learning with Python; Apress: Berkeley, CA, USA, 2021; pp. 187–213. ISBN 978-1-4842-6513-0.
12. Kaur, P. Review on Stemming Techniques. Int. J. Adv. Res. Comput. Sci. 2018, 9, 64–68.
13. Jabbar, A.; Iqbal, S.; Akhunzada, A.; Abbas, Q. An Improved Urdu Stemming Algorithm for Text Mining Based on Multi-Step Hybrid Approach. J. Exp. Theor. Artif. Intell. 2018, 30, 703–723.
14. Bichi, A.A.; Samsudin, R.; Hassan, R. Automatic Construction of Generic Stop Words List for Hausa Text. Indones. J. Electr. Eng. Comput. Sci. 2022, 25, 1501–1507.
15. Memon, S.; Mallah, G.A.; Memon, K.N.; Shaikh, A.; Aasoori, S.K.; Dehraj, F.U.H. Comparative Study of Truncating and Statistical Stemming Algorithms. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 563–568.
16. Jivani, A.G. A Comparative Study of Stemming Algorithms. Int. J. Circuit Theory Appl. 2011, 2, 1930–1938.
17. Musa, S.; Obunadike, G.N.; Yakubu, M.M. An Improved Hausa Word Stemming Algorithm. Fudma J. Sci. 2022, 6, 291–295.
18. Wayessa, N.; Abas, S. Multi-Class Sentiment Analysis from Afaan Oromo Text Based on Supervised Machine Learning Approaches. Int. J. Res. Stud. Sci. Eng. Technol. 2020, 7, 10–18.
19. Newman, P.; Newman, R.M. Hausa-English/English-Hausa, Ƙamusun Hausa: Hausa-Ingilishi/Ingilishi-Hausa; Bayero University Press: Kano, Nigeria, 2020; 627p, ISBN 978-978-98446-6-1.
20. Sharipov, M.; Salaev, U. Uzbek Affix Finite State Machine for Stemming. arXiv 2022, arXiv:2205.10078.
Word | Rule | Root word |
---|---|---|
Mahauci | Return as “Mahauci” (no change) | Mahauci |
Mabuqaci | Remove “ma” and “ci” and add “ta” at the end | Buqata |
Matsoraci | Remove “ma” and “ci” and replace the preceding vowel “a” with “o” | Tsoro |
Makaranci | Remove “ma” and “nci” and add “tu” at the end | Karatu |
Structure | Rules |
---|---|
Words starting with prefix “ma” and ending with suffix “ci” or “nci” | Remove prefix “ma” and suffix “ci” or “nci” |
Words starting with prefix “ma” and ending with suffix “awa” | Remove the suffix “wa” |
Words starting with prefix “ma” and ending with suffix “nci” or “uci” | Remove prefix “ma” and suffixes “uci”, “nci”, and “tu” |
Words starting with prefix “ma” and ending with suffix “aci” | Remove prefix “ma” and replace suffix “ci” with “ta” |
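One way to encode the rule table above is as an ordered list of suffix rules that is scanned until the first match. The sketch below is an interpretation, not the authors' code: the table does not state which rule wins when several suffixes match, so the ordering (three-letter suffixes before “ci”) is an assumption, as are the names `RULES` and `apply_affix_rules`. Words such as Mahauci and Matsoraci would still need the exception table, because the generic rules do not yield their listed roots.

```python
PREFIX = "ma"

# (suffix to match, strip the prefix?, characters cut from the end, text appended)
# Assumed ordering: longer suffixes are tried before the bare "ci".
RULES = [
    ("aci", True, "ci", "ta"),   # remove "ma", replace "ci" with "ta"
    ("awa", False, "wa", ""),    # table only says to remove "wa"
    ("nci", True, "nci", ""),
    ("uci", True, "uci", ""),
    ("ci", True, "ci", ""),
]


def apply_affix_rules(word):
    """Return the stem from the first matching rule, or None if none applies."""
    if not word.startswith(PREFIX):
        return None
    for match, strip_prefix, cut, append in RULES:
        if word.endswith(match):
            body = word[len(PREFIX):] if strip_prefix else word
            if len(body) <= len(cut):
                return None  # nothing would remain after stripping
            return body[:-len(cut)] + append
    return None


print(apply_affix_rules("mabuqaci"))   # buqata
print(apply_affix_rules("makaranci"))  # kara
```

Keeping the rules as data rather than nested conditionals makes it straightforward to reorder them or add new rows as the rule table grows.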
Factor | Bashir | Bimba | Siraj | Improvement |
---|---|---|---|---|
Total Words (TW) | 2573 | 1786 | 1786 | 5077 |
Number of Words Stemmed (SW) | 1607 | 1213 | 1258 | 3817 |
Words Stemmed Factor (WSF) | 62.52% | 67.92% | 70.44% | 75.18% |
Number of Distinct Words (S) | 741 | 1046 | 1022 | 2630 |
Correctly Stemmed Words (CSW) | 547 | 1146 | 1219 | 3766 |
Incorrectly Stemmed Words (ISW) | 194 | 67 | 39 | 51 |
Correctly Stemmed Words Factor (CSWF) | 73.81% | 94.56% | 96.90% | 98.66% |
Correct Words Not Stemmed (CW) | 966 | 573 | 528 | 1260 |
Average Words Conflation Factor (AWCF) | 36.27% | 50.04% | 59.47% | 63.62% |
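The percentage rows in the table can be related to the raw counts with a short calculation. The definitions used below (WSF = SW / TW, CSWF = CSW / SW, ISW = SW − CSW, CW = TW − SW) are not stated in this section; they are inferred because they reproduce the “Improvement” column, and other columns may follow slightly different conventions. AWCF is left out because its formula cannot be recovered from the table alone. The function name `stemming_factors` is hypothetical.

```python
def stemming_factors(total_words, stemmed_words, correctly_stemmed):
    """Derive table factors from raw counts (definitions are assumptions)."""
    wsf = 100 * stemmed_words / total_words          # Words Stemmed Factor
    cswf = 100 * correctly_stemmed / stemmed_words   # Correctly Stemmed Words Factor
    return {
        "WSF": round(wsf, 2),
        "CSWF": round(cswf, 2),
        "ISW": stemmed_words - correctly_stemmed,    # incorrectly stemmed words
        "CW": total_words - stemmed_words,           # correct words not stemmed
    }


# Counts from the "Improvement" column: TW = 5077, SW = 3817, CSW = 3766.
print(stemming_factors(5077, 3817, 3766))
# {'WSF': 75.18, 'CSWF': 98.66, 'ISW': 51, 'CW': 1260}
```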
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Bari, M.A.; Umar, H.A.; Bello, B.S.; Ahmed, I.S. A Rule-Based Model for Stemming Hausa Words. Eng. Proc. 2025, 87, 51. https://doi.org/10.3390/engproc2025087051