Search Results (9)

Search Parameters:
Keywords = Urdu digit dataset

25 pages, 1380 KB  
Review
A Systematic Review and Experimental Evaluation of Classical and Transformer-Based Models for Urdu Abstractive Text Summarization
by Muhammad Azhar, Adeen Amjad, Deshinta Arrova Dewi and Shahreen Kasim
Information 2025, 16(9), 784; https://doi.org/10.3390/info16090784 - 9 Sep 2025
Viewed by 406
Abstract
The rapid growth of digital content in Urdu has created an urgent need for effective automatic text summarization (ATS) systems. While extractive methods have been widely studied, abstractive summarization for Urdu remains largely unexplored due to the language’s complex morphology and rich literary tradition. This paper systematically evaluates four transformer-based language models (BERT-Urdu, BART, mT5, and GPT-2) for Urdu abstractive summarization, comparing their performance against conventional machine learning and deep learning approaches. Using multiple Urdu datasets—including the Urdu Summarization Corpus, Fake News Dataset, and Urdu-Instruct-News—we show that fine-tuned Transformer Language Models (TLMs) consistently outperform traditional methods, with the multilingual mT5 model achieving a 0.42 absolute improvement in F1-score over the best baseline. Our analysis reveals that mT5’s architecture is particularly effective at handling Urdu-specific challenges such as right-to-left script processing, diacritic interpretation, and complex verb–noun compounding. Furthermore, we present empirically validated hyperparameter configurations and training strategies for Urdu ATS, establishing transformer-based approaches as the new state-of-the-art for Urdu summarization. Notably, mT5 outperforms Seq2Seq baselines by up to 20% in ROUGE-L, underscoring the efficacy of Transformer-based models for low-resource languages. This work contributes both a systematic review of prior research and a novel empirical benchmark for advancing Urdu abstractive summarization.
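As a rough illustration of how a fine-tuned mT5 summarizer of this kind might be used at inference time with the Hugging Face transformers library, consider the minimal sketch below; the checkpoint name, task prefix, and generation settings are assumptions for illustration, not the configurations reported in the paper.

```python
# Hypothetical sketch: summarizing an Urdu article with an mT5 checkpoint.
# "google/mt5-small" is a public base model; a checkpoint actually
# fine-tuned on Urdu summaries, and the "summarize:" prefix, are assumed.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-small"  # placeholder for a fine-tuned Urdu model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

urdu_text = "..."  # an Urdu news article goes here
inputs = tokenizer("summarize: " + urdu_text, return_tensors="pt",
                   truncation=True, max_length=512)
summary_ids = model.generate(**inputs, max_length=128, num_beams=4,
                             early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```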

27 pages, 1817 KB  
Article
A Large Language Model-Based Approach for Multilingual Hate Speech Detection on Social Media
by Muhammad Usman, Muhammad Ahmad, Grigori Sidorov, Irina Gelbukh and Rolando Quintero Tellez
Computers 2025, 14(7), 279; https://doi.org/10.3390/computers14070279 - 15 Jul 2025
Cited by 1 | Viewed by 2047
Abstract
The proliferation of hate speech on social media platforms poses significant threats to digital safety, social cohesion, and freedom of expression. Detecting such content—especially across diverse languages—remains a challenging task due to linguistic complexity, cultural context, and resource limitations. To address these challenges, this study introduces a comprehensive approach for multilingual hate speech detection. To facilitate robust hate speech detection across diverse languages, this study makes several key contributions. First, we created a novel trilingual hate speech dataset consisting of 10,193 manually annotated tweets in English, Spanish, and Urdu. Second, we applied two innovative techniques—joint multilingual and translation-based approaches—for cross-lingual hate speech detection that have not been previously explored for these languages. Third, we developed detailed hate speech annotation guidelines tailored specifically to all three languages to ensure consistent and high-quality labeling. Finally, we conducted 41 experiments employing machine learning models with TF–IDF features, deep learning models utilizing FastText and GloVe embeddings, and transformer-based models leveraging advanced contextual embeddings to comprehensively evaluate our approach. Additionally, we employed a large language model with advanced contextual embeddings to identify the best solution for the hate speech detection task. The experimental results showed that our GPT-3.5-turbo model significantly outperforms strong baselines, achieving up to an 8% improvement over XLM-R in Urdu hate speech detection and an average gain of 4% across all three languages. This research not only contributes a high-quality multilingual dataset but also offers a scalable and inclusive framework for hate speech detection in underrepresented languages.
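In a minimal form, the GPT-3.5-turbo classification step could look like the sketch below, written against the OpenAI Python SDK; the prompt wording and label set are illustrative assumptions rather than the paper's actual prompting strategy.

```python
# Hypothetical sketch: zero-shot hate speech labeling via the OpenAI API.
# The system prompt and labels are assumptions, not the study's prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_tweet(tweet: str) -> str:
    """Return a single 'hate' or 'not_hate' label for one tweet."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Classify the tweet as 'hate' or 'not_hate'. "
                        "The tweet may be in English, Spanish, or Urdu. "
                        "Answer with the label only."},
            {"role": "user", "content": tweet},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(label_tweet("example tweet text"))
```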
(This article belongs to the Special Issue Recent Advances in Social Networks and Social Media)

18 pages, 373 KB  
Article
Machine Learning- and Deep Learning-Based Multi-Model System for Hate Speech Detection on Facebook
by Amna Naseeb, Muhammad Zain, Nisar Hussain, Amna Qasim, Fiaz Ahmad, Grigori Sidorov and Alexander Gelbukh
Algorithms 2025, 18(6), 331; https://doi.org/10.3390/a18060331 - 1 Jun 2025
Cited by 2 | Viewed by 1223
Abstract
Hate speech is a complex topic that transcends language, culture, and even social spheres. Recently, the spread of hate speech on social media sites like Facebook has added a new layer of complexity to the issue of online safety and content moderation. This study seeks to minimize this problem by developing an Arabic script-based tool for automatically detecting hate speech in Roman Urdu, an informal script used most commonly for South Asian digital communications. Roman Urdu is relatively complex as it has no standardized spellings, leading to orthographic variations that increase the difficulty of hate speech detection. To tackle this problem, we adopt a holistic strategy combining six machine learning (ML) and four deep learning (DL) models on a dataset of Facebook comments that was preprocessed (tokenization, stopword removal, etc.) and vectorized (TF-IDF, word embeddings). The ML algorithms used in this study are LR, SVM, RF, NB, KNN, and GBM. We also use deep learning architectures like CNN, RNN, LSTM, and GRU to further increase classification accuracy. The experimental results show that the deep learning models outperform the traditional ML approaches by a significant margin, with CNN and LSTM achieving accuracies of 95.1% and 96.2%, respectively. As far as we are aware, this is the first work that investigates QLoRA for fine-tuning large models for the task of offensive language detection in Roman Urdu.
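One of the classical ML baselines (TF-IDF features with logistic regression) could be sketched as follows; the sample comments, labels, and n-gram settings are placeholder assumptions, not the study's data or configuration.

```python
# Hypothetical sketch: TF-IDF + logistic regression for Roman Urdu hate
# speech detection. The comments and labels below are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

comments = [
    "tum bohat ache ho",             # neutral (placeholder example)
    "in logon ko yahan se nikalo",   # hateful (placeholder example)
    "bahut khoobsurat tasveer hai",  # neutral (placeholder example)
    "ye log kitne bure hain",        # hateful (placeholder example)
]
labels = [0, 1, 0, 1]  # 1 = hate speech, 0 = neutral

# Character n-grams help absorb the non-standardized Roman Urdu spellings
# (e.g., "bohat" vs. "bahut").
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(comments, labels)
print(clf.predict(["kitna bura insan hai"]))
```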
(This article belongs to the Special Issue Linguistic and Cognitive Approaches to Dialog Agents)

17 pages, 557 KB  
Article
The Role of Transliterated Words in Linking Bilingual News Articles in an Archive
by Muzammil Khan, Sarwar Shah Khan, Yasser Alharbi, Ali Alferaidi, Talal Saad Alharbi and Kusum Yadav
Appl. Sci. 2023, 13(7), 4435; https://doi.org/10.3390/app13074435 - 31 Mar 2023
Cited by 4 | Viewed by 1861
Abstract
Retrieving a specific digital information object that matches a user query from a huge, evolving, multilingual news archive is challenging and complicated. The processing becomes more difficult to understand and analyze when low-resourced and morphologically complex languages like Urdu and Arabic scripts are included in the archive. Computing similarity against a query and among news articles in huge and evolving collections may be inaccurate and time-consuming at run time. This paper introduces a Similarity Measure based on Transliterated Words (SMTW), i.e., English words transliterated into Urdu script, for linking news articles extracted from multiple online sources during the preservation process. SMTW links Urdu-to-English news articles using an upgraded Urdu-to-English lexicon that includes transliterated words. SMTW was exhaustively evaluated to assess its effectiveness using datasets of different sizes, and the results were compared with the Common Ratio Measure for Dual Language (CRMDL). The experimental results show that SMTW was more effective than CRMDL for linking Urdu-to-English news articles: precision improved from 50% to 60%, recall improved from 67% to 82%, and the impact of common terms also improved.
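Since the abstract does not spell out the SMTW formula, the following is only a loose illustration of the underlying idea: map Urdu tokens to English through a transliteration-augmented lexicon and score their overlap with an English article. The toy lexicon and scoring function are assumptions, not the paper's measure.

```python
# Loose illustration (not the paper's SMTW): score how strongly an Urdu
# article links to an English one via a transliteration-augmented lexicon.
urdu_to_english = {
    "کرکٹ": "cricket",      # transliterated loanword
    "پاکستان": "pakistan",
    "حکومت": "government",
}

def link_score(urdu_tokens, english_tokens):
    """Fraction of translatable Urdu tokens found in the English article."""
    mapped = {urdu_to_english[t] for t in urdu_tokens if t in urdu_to_english}
    if not mapped:
        return 0.0
    return len(mapped & set(english_tokens)) / len(mapped)

urdu_article = ["پاکستان", "کرکٹ", "ٹیم"]
english_article = ["Pakistan", "cricket", "team", "wins"]
print(link_score(urdu_article, [w.lower() for w in english_article]))  # 1.0
```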
(This article belongs to the Special Issue Advanced Computational and Linguistic Analytics)

15 pages, 9394 KB  
Article
Introducing Urdu Digits Dataset with Demonstration of an Efficient and Robust Noisy Decoder-Based Pseudo Example Generator
by Wisal Khan, Kislay Raj, Teerath Kumar, Arunabha M. Roy and Bin Luo
Symmetry 2022, 14(10), 1976; https://doi.org/10.3390/sym14101976 - 21 Sep 2022
Cited by 47 | Viewed by 4438
Abstract
In the present work, we propose a novel method utilizing only a decoder for the generation of pseudo-examples, which has shown great success in image classification tasks. The proposed method is particularly constructive when only a limited quantity of data is available for semi-supervised learning (SSL) or few-shot learning (FSL). While most previous works have used an autoencoder to improve classification performance for SSL, a single autoencoder may generate confusing pseudo-examples that degrade the classifier’s performance. On the other hand, models that utilize an encoder–decoder architecture for sample generation can significantly increase computational overhead. To address the issues mentioned above, we propose an efficient means of generating pseudo-examples by using only the generator (decoder) network separately for each class, which has proven effective for both SSL and FSL. In our approach, a decoder is trained for each class using random noise, and multiple samples are then generated with the trained decoder. Our generator-based approach outperforms previous state-of-the-art SSL and FSL approaches. In addition, we released the Urdu digits dataset, consisting of 10,000 images (8000 training and 2000 test) collected through three different methods for the sake of diversity. Furthermore, we explored the effectiveness of our proposed method on the Urdu digits dataset using both SSL and FSL, demonstrating improvements of 3.04% and 1.50% in average accuracy, respectively, and illustrating the superiority of the proposed method over current state-of-the-art models.
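A minimal PyTorch sketch of the per-class decoder idea follows; the architecture, noise dimension, and training loop are illustrative assumptions, not the paper's reported configuration.

```python
# Hypothetical sketch: one small decoder per digit class maps random noise
# to 28x28 images and is fit against real samples of that class; fresh noise
# through the trained decoder then yields pseudo-examples.
import torch
import torch.nn as nn

class ClassDecoder(nn.Module):
    def __init__(self, noise_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 256), nn.ReLU(),
            nn.Linear(256, 28 * 28), nn.Sigmoid(),  # pixels in [0, 1]
        )

    def forward(self, z):
        return self.net(z).view(-1, 1, 28, 28)

def train_decoder(real_images, noise_dim=64, steps=500):
    """Fit one decoder to one class; real_images: (N, 1, 28, 28) in [0, 1]."""
    decoder = ClassDecoder(noise_dim)
    opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        z = torch.randn(real_images.size(0), noise_dim)
        loss = loss_fn(decoder(z), real_images)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return decoder

# Usage (placeholder data): pseudo-examples for one class from fresh noise.
decoder = train_decoder(torch.rand(32, 1, 28, 28))
pseudo_examples = decoder(torch.randn(100, 64)).detach()
```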
(This article belongs to the Special Issue Computer Vision, Pattern Recognition, Machine Learning, and Symmetry)

21 pages, 4249 KB  
Article
Risk and Pattern Analysis of Pakistani Crime Data Using Unsupervised Learning Techniques
by Faria Ferooz, Malik Tahir Hassan, Sajid Mahmood, Hira Asim, Muhammad Idrees, Muhammad Assam, Abdullah Mohamed and El-Awady Attia
Appl. Sci. 2022, 12(7), 3675; https://doi.org/10.3390/app12073675 - 6 Apr 2022
Cited by 4 | Viewed by 10231
Abstract
To reduce crime rates, there is a need to understand and analyse emerging patterns of criminal activities. This study examines the occurrence patterns of crimes using the crime dataset of Lahore, a metropolitan city in Pakistan. The main aim is to facilitate crime investigation and future risk analysis using visualization and unsupervised data mining techniques, including clustering and association rule mining. The visualization of data helps to uncover trends present in the crime dataset. The K-modes clustering algorithm is used to perform exploratory analysis and risk identification of similar criminal activities that can happen in a particular location. The Apriori algorithm is applied to mine frequent patterns of criminal activities that can happen on a particular day, time, and location in the future. The data were acquired from paper-based records of three police stations in the Urdu language, then translated into English and digitized for automatic analysis. The results helped identify similar crime-related activities that can happen in a particular location, the risk of potential criminal activities occurring on a specific day, time, and place in the future, and frequent crime patterns of different crime types. The proposed work can help the police department detect crime events and situations and reduce crime incidents at an early stage by providing insights into criminal activity patterns.
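The frequent-pattern step could be sketched with the Apriori implementation in mlxtend as follows; the record encoding and thresholds are assumptions, not values from the study.

```python
# Hypothetical sketch: mining frequent crime patterns with Apriori.
# The records and thresholds below are invented placeholders.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Each "transaction" is one crime record: (crime type, day, time slot, area).
records = [
    ["theft", "friday", "night", "model_town"],
    ["theft", "friday", "night", "model_town"],
    ["robbery", "monday", "evening", "gulberg"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(records).transform(records), columns=te.columns_)

frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```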
(This article belongs to the Special Issue Computational Sensing and Imaging)

15 pages, 3780 KB  
Article
Recognition of Cursive Pashto Optical Digits and Characters with Trio Deep Learning Neural Network Models
by Muhammad Zubair Rehman, Nazri Mohd. Nawi, Mohammad Arshad and Abdullah Khan
Electronics 2021, 10(20), 2508; https://doi.org/10.3390/electronics10202508 - 15 Oct 2021
Cited by 6 | Viewed by 3924
Abstract
Pashto is one of the most ancient and historical languages in the world and is spoken in Pakistan and Afghanistan. Various languages like Urdu, English, Chinese, and Japanese have OCR applications, but very little work has been conducted on the Pashto language in this regard. It is more difficult for OCR applications to recognize handwritten characters and digits, because handwriting is influenced by the writer’s hand dynamics. Moreover, there was no publicly available dataset for handwritten Pashto digits before this study; consequently, no work had been performed on the recognition of Pashto handwritten digits and characters combined. To achieve this objective, a dataset of Pashto handwritten digits consisting of 60,000 images was created. A trio of deep learning Convolutional Neural Network models, i.e., CNN, LeNet, and Deep CNN, were trained and tested on both the Pashto handwritten characters and digits datasets. In the simulations, the Deep CNN achieved 99.42 percent accuracy for Pashto handwritten digits, 99.17 percent for handwritten characters, and 70.65 percent for combined digits and characters. Similarly, the LeNet and CNN models achieved slightly lower accuracies (LeNet: 98.82, 99.15, and 69.82 percent; CNN: 98.30, 98.74, and 66.53 percent) for the Pashto handwritten digits, Pashto characters, and combined digits and characters recognition datasets, respectively. Based on these results, the Deep CNN model is the best of the three in terms of accuracy and loss.
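A LeNet-style CNN for 28 × 28 grayscale digit images with ten classes might look like the Keras sketch below; the input shape, layer sizes, and training settings are illustrative assumptions, not the paper's exact architectures.

```python
# Hypothetical sketch: a LeNet-style CNN for handwritten digit recognition.
# Input shape and layer widths are assumptions, not the paper's models.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(6, kernel_size=5, activation="relu", padding="same"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(16, kernel_size=5, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(120, activation="relu"),
    layers.Dense(84, activation="relu"),
    layers.Dense(10, activation="softmax"),  # one output per digit class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=10, validation_split=0.1)
```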
(This article belongs to the Special Issue Machine Learning Technologies for Big Data Analytics)

13 pages, 2447 KB  
Article
AUDD: Audio Urdu Digits Dataset for Automatic Audio Urdu Digit Recognition
by Aisha Chandio, Yao Shen, Malika Bendechache, Irum Inayat and Teerath Kumar
Appl. Sci. 2021, 11(19), 8842; https://doi.org/10.3390/app11198842 - 23 Sep 2021
Cited by 32 | Viewed by 4951
Abstract
The ongoing development of audio datasets for numerous languages has spurred research activities towards designing smart speech recognition systems. A typical speech recognition system can be applied in many emerging applications, such as smartphone dialing, airline reservations, and automatic wheelchairs, among others. Urdu is the national language of Pakistan and is also widely spoken in many other South Asian countries (e.g., India, Afghanistan). Therefore, we present a comprehensive dataset of spoken Urdu digits ranging from 0 to 9. Our dataset contains 25,518 sound samples collected from 740 participants. To test the proposed dataset, we apply different existing classification algorithms, including Support Vector Machine (SVM), Multilayer Perceptron (MLP), and flavors of EfficientNet; these algorithms serve as baselines. Furthermore, we propose a convolutional neural network (CNN) for audio digit classification. We conduct experiments using these networks, and the results show that the proposed CNN is efficient and outperforms the baseline algorithms in terms of classification accuracy.
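An SVM baseline over averaged MFCC features, of the kind used here, could be sketched as follows; the file names, feature settings, and classifier parameters are assumptions for illustration, not the paper's configuration.

```python
# Hypothetical sketch: SVM on averaged MFCCs for spoken Urdu digit clips.
# The file list and labels are invented placeholders.
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(path: str) -> np.ndarray:
    """Load one clip and average its MFCCs over time into a fixed vector."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)

paths = ["urdu_digit_0_a.wav", "urdu_digit_0_b.wav",
         "urdu_digit_1_a.wav", "urdu_digit_1_b.wav"]
digits = [0, 0, 1, 1]  # the spoken digit for each clip

X = np.stack([mfcc_features(p) for p in paths])
clf = SVC(kernel="rbf").fit(X, digits)
print(clf.predict(X[:1]))  # sanity check on a training clip
```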
(This article belongs to the Topic Machine and Deep Learning)

16 pages, 1697 KB  
Article
Visualization of High-Dimensional Data by Pairwise Fusion Matrices Using t-SNE
by Mujtaba Husnain, Malik Muhammad Saad Missen, Shahzad Mumtaz, Muhammad Muzzamil Luqman, Mickaël Coustaty and Jean-Marc Ogier
Symmetry 2019, 11(1), 107; https://doi.org/10.3390/sym11010107 - 17 Jan 2019
Cited by 30 | Viewed by 7610
Abstract
We applied t-distributed stochastic neighbor embedding (t-SNE) to visualize Urdu handwritten numerals (or digits). The dataset used consists of 28 × 28 images of handwritten Urdu numerals and was created by inviting writers from different categories of native Urdu speakers. One of the challenging and critical issues for the correct visualization of Urdu numerals is the shape similarity between some of the digits. This issue was resolved using t-SNE by exploiting the local and global structures of the large dataset at different scales: the global structure consists of geometrical features, while the local structure is the pixel-based information for each class of Urdu digits. We introduce a novel approach that allows the fusion of these two independent spaces using Euclidean pairwise distances in a highly organized and principled way. The fusion matrix embedded with t-SNE helps to locate each data point in a two- (or three-) dimensional map in a very different way. Furthermore, our proposed approach focuses on preserving the local structure of the high-dimensional data while mapping to a low-dimensional plane. The visualizations produced by t-SNE outperformed other classical techniques like principal component analysis (PCA) and auto-encoders (AE) on our handwritten Urdu numeral dataset.
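A loose sketch of the pairwise-fusion idea follows, assuming a simple convex combination of two normalized Euclidean distance matrices fed to t-SNE as a precomputed metric; the paper's exact fusion scheme is not reproduced here.

```python
# Loose illustration: fuse distances from two feature spaces (geometrical
# and pixel-based) and embed with t-SNE. The fusion weight is an assumption.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
n = 200
pixel_feats = rng.random((n, 784))  # flattened 28x28 images (placeholder)
geom_feats = rng.random((n, 10))    # geometrical features (placeholder)

d_pixel = pairwise_distances(pixel_feats, metric="euclidean")
d_geom = pairwise_distances(geom_feats, metric="euclidean")

# Convex combination of the two normalized distance matrices.
alpha = 0.5
fused = alpha * d_pixel / d_pixel.max() + (1 - alpha) * d_geom / d_geom.max()

embedding = TSNE(n_components=2, metric="precomputed",
                 init="random", random_state=0).fit_transform(fused)
print(embedding.shape)  # (200, 2)
```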
