**Featured Application: Potential applications of scene text reading include organizing large image and video databases by their textual content, as in Bing Maps, Apple Maps, and Google Street View, and assisting visually impaired people.**

**Abstract:** Reading text in natural images, which unifies text detection and recognition, is among the most challenging tasks in computer vision and document analysis. Previously proposed end-to-end scene text reading methods do not account for the spatial frequency content of the input image during feature extraction, which makes them slower, more memory-intensive, and less accurate at recognizing text. In this paper, we propose an octave convolution (OctConv) feature extractor and a time-restricted attention encoder-decoder module for end-to-end scene text reading. OctConv extracts features by factorizing the input image according to its frequency content. It is a direct replacement for standard convolution, orthogonal and complementary to other methods for reducing redundancy, and it speeds up text reading while lowering memory requirements. In the text reading process, features are first extracted from the input image using a Feature Pyramid Network (FPN) with an OctConv-based Residual Network of depth 50 (ResNet50). Then, a Region Proposal Network (RPN) predicts the locations of text regions from the extracted features. Finally, a time-restricted attention encoder-decoder module is applied after Region of Interest (RoI) pooling. A bilingual scene text dataset of real and synthetic images is prepared for training and testing the proposed model. Additionally, well-known datasets including ICDAR2013, ICDAR2015, and Total Text are used for fine-tuning the model and for comparing its performance with previously proposed state-of-the-art methods. The proposed model shows promising results on both regular and irregular or curved text detection and reading tasks.

**Keywords:** octave convolution; bilingual scene text reading; Ethiopic script; attention
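
To make the memory argument for frequency factorization concrete, the following is a minimal sketch and not the model described in this paper. It assumes the standard octave convolution formulation, in which a ratio `alpha` of the channels is treated as low-frequency and stored at half the spatial resolution in each dimension. The ratio `alpha` and the function names `octave_channel_split` and `relative_feature_memory` are illustrative assumptions; the abstract does not state the ratio used here.

```python
from typing import Tuple


def octave_channel_split(channels: int, alpha: float) -> Tuple[int, int]:
    """Return (high_freq_channels, low_freq_channels) for a ratio alpha.

    alpha is the assumed fraction of channels assigned to the
    low-frequency group; it is a hypothetical parameter for this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    low = int(alpha * channels)
    return channels - low, low


def relative_feature_memory(alpha: float) -> float:
    """Feature-map storage relative to a plain convolution (which is 1.0).

    High-frequency maps keep full resolution. Low-frequency maps are
    assumed to be stored at half resolution per spatial dimension,
    so each low-frequency channel holds 1/4 of the values.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return (1.0 - alpha) + alpha / 4.0


if __name__ == "__main__":
    high, low = octave_channel_split(256, 0.5)
    print("high-frequency channels:", high)
    print("low-frequency channels:", low)
    print("relative feature memory:", relative_feature_memory(0.5))
```

Under these assumptions, the stored feature volume shrinks linearly as `alpha` grows, which is the intuition behind the lower memory requirement claimed for OctConv.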
