1. Introduction
According to Mark Weiser, “…ubiquitous computing makes technology help people to live in the real world…” [
1]. However, most people who work closely with information and communications technologies (ICTs) use visual interfaces that force them to pay attention to the input and output devices used and that, in some cases, require certain skills to perform a specific task, e.g., web browsing.
In web browsing, the visual mode continues to predominate for the delivery and selection of content. This activity requires hypertext documents and a browser or viewer on the user’s device. In the browser, the user must select the links they wish to follow and, likewise, must read and judge which pages or websites are most useful, discarding those that are not of interest to them [
2,
3].
In recent years, the use of virtual assistants such as Amazon Alexa, Google Assistant, or Siri has become popular. These assistants receive user requests through voice commands and allow actions similar to web browsing such as playing music, performing searches, creating shopping lists, controlling smart electronic devices, and setting alarms. However, the availability of these actions depends on the applications installed in the assistants [
4,
5,
6,
7].
Assistive technology, such as JAWS [
8], VoiceOver [
9], WebReader [
10], SRAA [
11], and Hearsay [
12], has been developed for visually impaired people to navigate the Web. These proposals work in environments composed of desktop computers or laptops and input peripherals, such as braille keyboards, microphones, or computer mice. They receive instructions using combinations of keys, screen gestures, or specific voice commands and deliver content to users using audio or braille. There are even browser extensions, such as iTOC [
13] and SaIL [
14], for the JAWS screen reader that help find information that is considered relevant through statistical or semantic analysis of web pages, with the goal of reducing the time spent browsing the Web.
Despite the increase in the use of ICTs, there are individuals who are excluded from the benefits and opportunities that these technologies offer. The existence of this digital divide not only reflects a disparity in access to information but also contributes to marked social inequality [
15]. One of the reasons the digital divide occurs is the lack of the skills, motivations, or attitudes users need to operate ICTs. This problem may stem from illiteracy or from the inability to carry out an action at a given time. Hence the need to address the digital divide.
The development of voice interfaces in virtual assistants, screen readers, and chatbots plays an important role in simplifying interaction and helping to combat the digital divide. However, these solutions have significant limitations in web browsing. For example, some chatbots are designed exclusively for a specific website or domain, and virtual assistants cannot navigate web pages, restricting their usefulness in this specific task. In addition, those solutions that can browse the Web do not guarantee the effective retrieval of information relevant to the user’s intentions and interests. To obtain the most suitable results, users must apply search strategies [
16], e.g., defining filters, configuring the web browser, and using Boolean operators and special characters. These strategies can be complicated for a user without the right skills.
For these reasons, a proposal that transforms the digital experience for those who face significant barriers in the information age becomes relevant. In this paper, we propose a model to facilitate web browsing by calculating the coincidences between the user’s context and that of the web pages to deliver content oriented to the user’s interests. According to Dey and Abowd [
17], context is “any information that can be used to characterize the situation of an entity” where “an entity may be a person, place, or object that is considered relevant to the interaction between a user and an application, including the user and the application itself”. In this way, the proposed model leverages the data of the environment and communication with the user, allowing it to offer a better service since it has a higher level of understanding of the user’s requests and can respond according to the current situation.
Our solution also provides a more natural way for users to interact with ICTs by allowing them to browse the Web using voice and to receive content in audible or graphical format. From the proposed model, we design, implement, and assess a prototype whose user interface merges the instruction interpretation of popular virtual assistants (e.g., Amazon Alexa and Google Assistant) with the content delivery of screen readers and similar technologies. In this way, it is possible to contribute to reducing the digital divide.
This paper is organized as follows. After analyzing related work in
Section 2, we describe in
Section 3 the methodological approach followed to develop our proposal, a context-based model for web browsing via voice, which is intended for people affected by the inequality generated by the digital divide. Then, the acquisition of requirements and the architecture of our model and its components are detailed in
Section 4. Afterwards, in
Section 5, we explain the functionalities of the model to achieve context-based web browsing through voice. In
Section 6, we describe the tests of our prototype that were carried out with the SUS and QUIS instruments to assess the level of user satisfaction with respect to the results obtained in a search and the comfort perceived by users when using the prototype. Then, in
Section 7, we analyze the results obtained from these tests and discuss the evaluation of the results obtained by the participants when performing web searches and the impact of context on such web search results. In
Section 8, we provide a conclusion of the work achieved, and finally, in
Section 9, we identify some limitations of our proposal and we outline a few ideas for future work.
2. Related Work
Different works related to our proposal were analyzed and organized according to the following categories: first, we present works that have to do with virtual or voice assistants (see
Section 2.1), then those solutions concerning context-based web browsing (see
Section 2.2), and finally, proposals associated with voice-based web browsing (see
Section 2.3). A comparative analysis of related works is also presented (see
Section 2.4).
2.1. Virtual or Voice Assistants
A virtual or voice assistant is a computer program capable of interacting with the user through natural language. Furthermore, it is an intermediary between the user and smart devices to provide services and information through a verbal dialogue at runtime [
18].
Currently, a variety of commercial virtual assistants have been developed, such as Amazon Alexa [
4], Google Assistant [
5], Apple Siri [
7], and Microsoft Cortana [
6]. The first three have physical devices, such as Echo Dot, Google Home, and HomePod, respectively.
Table 1 lists the characteristics of these virtual assistants, whose functionalities are:
Web access to find answers to everyday questions, e.g., calculations, translations, unit conversions, nutrition, and dictionaries;
Multimedia content playback, e.g., music, video, and audiobook;
Control of compatible devices, e.g., lighting devices, television, and speakers;
Access to personal organizers and management of alarms and timers;
Access to supported application features, e.g., voice commerce, games, and teaching materials;
Other tasks, e.g., making calls and sending messages.
On the other hand, virtual assistants have been developed with specific approaches in different areas, such as education, health, and economy. For example, in the educational area, we can mention Ubot [
19], an assistant focused on facilitating administrative and procedural activities at the Bernardo O’Higgins University of Chile. In the health area, SaminBot [
20] had the objective of collecting data and providing information during the COVID-19 pandemic in Peru.
In the scientific literature, no proposals that are directly related to context-based web browsing have been found in the domain of virtual assistants. However, some specific purpose assistants discussed below are related to our proposal.
MyrrorBot [
21] is a digital assistant based on holistic user models for personalized access to online services, which are invoked through their respective applications, such as YouTube, GoogleNews, and Spotify. MyrrorBot processes two kinds of requests expressed in natural language: user-related and access to online services. The user profile is self-managed, and the user can consult and edit the holistic model.
Shekhar et al. [
22] propose a virtual assistant that aims to help in the student’s learning process, taking into account (1) the context of each student detected through general questions to identify additional needs and (2) the instructor’s resources to generate leading questions. This assistant also provides personalized answers to student questions and reminders about assignment deadlines.
Ivanov and Orozova [
23] create a virtual assistant that facilitates the field work of bat researchers. Its architecture is based on the Beliefs, Desires, and Intentions (BDI) model [
24], allowing customizable field data collection, data processing, validation and analysis, generation of reports, and proactivity based on the adjustment of objectives.
Bui et al. [
25] develop a proactive assistant with the goal of managing project execution and helping users with their activities, such as information organization and meeting summary generation. This assistant also relies on the BDI model, using SPARK2, and works based on knowledge of users’ workflows, progress, preferences, and objectives.
2.2. Context-Based Web Browsing
In web browsing, there are at least three types of entities: user, web page, and browser. In context-based web browsing, information about the entities involved can be used to refine the search and thus present the best information for the request according to the user’s context. Below, we analyze works related to context-based web browsing.
Ho et al. [
26] develop a proposal to automatically extract quantitative data from tables and contextualize the content through the recognition of structural markup and informative clues within the web page. This proposal uses a real-time data classification mechanism based on similarity of context and coherence between facts. However, it relies solely on the HTML table format and analyzes only the context of the table of interest, and thus it does not facilitate browsing of other web elements.
Ustinovskiy and Serdyukov [
27] propose a method for customizing web search preferences based on the user’s short-term context, which is generated from the user’s recent queries and activities. New user searches are customized based on other users’ experiences. The context is related to the statistically most visited web content and the type of content preferred by the user. This method uses machine learning techniques [
28] to filter and classify queries into two categories: useful and not useful for web search customization. However, it does not analyze the context of web pages and only addresses the problem of customizing the first queries in a search session without context.
Bian [
29] implements algorithms and techniques for the acquisition of knowledge in social networks, such as Yahoo Answers, Flickr, and YouTube. Structured semantic extraction from content is based on dynamic context information that considers both the user’s reputation and the quality of the content they published on the Web. However, this work does not take into account the context of the user performing the search.
Zhang [
30] presents a conceptual framework for the classification of various types of context in a web search environment. The objective is to reduce the ambiguity of queries by analyzing the context of the request and the semantic relationship between the concepts being consulted. Query processing is performed in natural language using text format. However, this work does not consider the context of the user, and the prototype of this proposal is aimed only at the medical area.
2.3. Voice-Based Web Browsing
According to Munteanu and Penn [
31], speech is the most natural form of communication for humans, and, at the same time, it is one of the most difficult modalities for machines to understand. But, thanks to advances in voice interfaces, this mode of interaction has become popular, and, under this approach, multiple tools have been developed, such as those described in
Section 2.1. Below, we analyze works related to web browsing which use voice-based interfaces and deliver content in audible format with the aim of improving user interaction or facilitating accessibility for people at risk of technological exclusion.
Ferdous et al. [
32] present an extension for the Google Chrome web browser as a complement to the JAWS screen reader. Its objective is to facilitate the interaction of people suffering from partial or total vision loss with data records in modern web applications. This proposal identifies and manages interest segments on web pages to display filtered information in an alternate web interface and deliver information by voice using JAWS. However, the prototype only works in a predetermined test environment.
SaIL [
14] is an automatic system that detects important information from websites using a machine learning algorithm and generating reference points in the ARIA standard that determine the location and sequence of the identified information. It is trained with information obtained from the search context and the most visited information on a web page. SaIL is compatible with the Google Chrome browser and facilitates navigation through JAWS screen readers.
iTOC [
13] is a browser extension that automatically identifies and extracts hyperlinks from tables of contents in web documents to allow the JAWS screen reader quick, on-demand access to each item within the table. Its goal is to improve JAWS users’ interaction with long web documents. It is compatible with the Google Chrome web browser.
Ashok et al. [
11] propose a semantic abstraction-based system, whose input is voice commands in natural language with the objective of quickly locating and accessing specific parts of websites related to purchases, reservations, and registrations. The study is based on the general characteristics of 100 popular websites. The system segments and filters useful web page elements based on the interpretation of predefined commands. To achieve this, it uses a library of search terms, so there is no semantic analysis of user instructions, and it is oriented to actions in which it is necessary to interact with buttons, cursor, and registers.
Bukhari et al. [
33] develop Ho2iev, a mechanism for high-precision information extraction using heavyweight ontologies. It also offers an integrated voice command recognition system for visually impaired Internet users. However, Ho2iev does not perform natural language processing or allow users to interact with the content of web pages. In addition, communication with the web server is done via short message service (SMS).
Hearsay [
12] is a non-visual and multimodal web browser whose goal is to solve some web accessibility problems. It is a desktop application that can also be used remotely via landline phones and is compatible with IBM’s social accessibility network. It performs content analysis and context management of web pages. Recognition of voice commands is based on a statistical model. Hearsay supports voice, text, and touch screen gestures as input modes and audio, visual, and Braille output modes. In addition, it converts web page content into interactive dialog for the user based on the VoiceXML standard. However, Hearsay does not semantically analyze user requests or process natural language, but rather handles specific, predetermined voice commands.
2.4. Comparative Analysis of Related Works
The study of related works made it possible to identify functionalities that can improve web browsing for users with disabilities or limited skills in the use of the web. Among them, those presented in
Table 2 stand out as fundamental, since they have been evaluated as useful by users and have shown positive results in previous studies.
Table 2 also shows a comparison of related works based on these functionalities. As a result of this analysis, it can be observed that no published work simultaneously integrates the following capabilities: (1) manage the user context and monitor user navigation; (2) generate custom queries based on the user’s request and context and retrieve web content according to those queries; (3) facilitate web browsing by taking into account both the user and web page contexts; (4) obtain results based on matching the web page context, query, and user context to present web content of real interest to the user; and (5) provide a user interface that actively listens to user input, allowing for continuous interaction, and presents results graphically and audibly.
The lack of a solution that combines these five functionalities highlights the opportunity for improvement in the design of accessible web navigation systems. Therefore, the proposal of a context-based model for browsing the Web through voice is relevant as it seeks to address these limitations and integrate a more accessible and intuitive approach for users.
3. Materials and Methods
In this paper, we propose a context-based model for web browsing using voice, which is intended for people who are affected by the inequality generated by the digital divide, specifically people with some degree of illiteracy or digital illiteracy, older adults, and people with visual or motor disabilities. The methodology used in this proposal is a hybrid one, combining the traditional waterfall methodology with an incremental development approach, as shown in
Figure 1. Several authors, such as Sommerville [
34] and Molina et al. [
35], have highlighted the effectiveness of hybrid methodologies since they allow integrating the best of both approaches and adapting better to the project’s needs.
The use of the waterfall methodology is justified by the need to generate exhaustive and sequential documentation to facilitate future activities such as maintenance, evolution, and regulatory compliance. The classic phases of this methodology were followed: analysis and definition of requirements, model design, prototype implementation, unit testing, integration, general testing, and evaluation. However, since there was no active user participation in the initial phases (analysis and definition of requirements), it was decided to define the requirements in detail based on the related work, as well as the ISO/IEC/IEEE 29148:2018 [
36] and ISO 9241-11 [
37] standards. These standards were used because the former sets out the principles and recommendations for the specification of software requirements, and the latter provides a framework for understanding and applying the concept of usability to various situations and objects of interest.
The approach was not completely rigid. The incremental component allowed the model to be continuously adjusted and improved through iterations in the unit testing, general testing, and design phases, incorporating comments and feedback from the participants. This flexible approach facilitated the implementation of a prototype by versions, allowing us to evaluate, validate, and identify points of improvement of the model according to emerging needs.
4. Model Requirements and Architecture
The specification of requirements was carried out based on the related work, as well as on the ISO/IEC/IEEE 29148:2018 and ISO 9241-11 standards mentioned above. As a result, six main functions to be covered by the model were defined:
Continuously listening to and understanding the user voice requests and responses;
Finding relevant web content based on the user’s query and context;
Enabling web browsing through voice commands, including interacting with objects such as buttons and checkboxes, selecting text or multimedia, and scrolling between sections;
Presenting relevant web content or information audibly and visually;
Managing the user’s context and logging the browsing history;
Facilitating interaction primarily through voice, with additional support for on-screen gestures and a virtual keyboard.
Therefore, ten functional requirements and three non-functional requirements were defined, highlighted for their significance, and listed in
Table 3 and
Table 4, respectively. Non-functional requirements concern the usability criteria [
38] defined as the degree to which a product can be used by specific users to achieve specific objectives with effectiveness, efficiency, and satisfaction in a specific context of use.
Considering these requirements, the structural model was established. As shown in
Figure 2, the context-based model for web browsing via voice consists of ten components, which are described below:
Multi-modal User Interface: it is responsible for presenting notifications, results, and available actions in audible or graphical form. In addition, it allows user interaction primarily through voice with standby and active listening. It also supports interaction via touch gestures on the screen, and text input through a virtual keyboard;
Natural Language Translator: it analyzes the validity of user requests, detects the underlying intent, and extracts relevant data from the lexical units in order to establish parameters associated with that intent. Depending on the user’s activity (profiling or browsing), the results generated by this component (intent and parameters) are sent to the User Context Manager, Query Generator or Operations component;
Operations: this component is responsible for managing the execution of events in other components of the model based on the intent and parameters provided by the Natural Language Translator. Its main functions include managing voice-driven browsing, coordinating the presentation of results through the Multi-modal User Interface, and maintaining data communication with the Response Generator;
User Context Manager: it creates the user profile by means of a questionnaire. From the answers provided, it extracts context descriptor terms and stores them in the User Context table. In addition, it generates and performs updates to the General Set of URLs table based on the navigation history;
Query Generator: it builds a query to search for content on the Web based on the user’s request and context. Each query is composed of a set of words from the user’s request, context descriptor terms, search operators, and filters. This component ensures that there are no duplicate words in the query and selects the appropriate filters based on the type of content requested, such as videos, images, or web pages;
Response Generator: it synthesizes natural language responses based on the results of (a) intent detection, (b) extraction of relevant content, and (c) generation of application notifications, including error alerts;
Web Interface: it manages communication between the Search component and the Web;
Search: this component looks for and retrieves a set of URLs in response to a user query. To this end, it performs a match analysis between the query terms (including search terms and user context descriptors) and the context descriptors of web pages. It then selects the web pages with the highest percentage of matches to present to the user;
Generator of Navigation History: it creates and updates the browsing history based on the frequency of accessing web pages and searching for specific topics. It receives the URLs of the pages visited by the user from the Operations component and, in addition, updates the User Context table with the corresponding information;
Web Page Context Analyzer: it analyzes the elements of a web page to identify its subject, considering attributes such as title, content, author, publication date, keywords, syntactic elements, and tags defined in HTML. Based on these elements, it generates terms describing the context of the web page. This component receives from the Search component a list of URLs obtained after a web page crawl and updates the Web Page Context table accordingly. In addition, it sends the Search component a flag indicating whether it is possible to consult the data in the Web Page Context table.
The following is a description of the tables mentioned above, which are shown in
Figure 2 in light blue:
General Set of URLs: it stores a compendium of URLs, covering websites from various domains and providing information on a wide variety of topics. It includes, for example, cultural dissemination sites, online encyclopedias, and repositories;
User Context: it saves demographic data, personal affinities, preferred websites, and descriptor terms related to this information. It also includes data derived from the user’s browsing history;
Web Page Context: it stores descriptor terms that represent the context of a set of web pages obtained as a result of content search. These descriptor terms are generated from key elements of the web pages, such as title, content, keywords, relevant HTML tags, and other metadata, allowing the theme of each page to be identified and categorized.
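As a reference for the rest of this section, the following minimal sketch (in Python, for illustration only) represents these tables as simple records; the type and field names are assumptions derived from the descriptions above, not the prototype’s actual schema.

from dataclasses import dataclass, field

@dataclass
class UserContextRow:
    category: str                                         # e.g., "topics of interest", "website preferences"
    keywords: list[str] = field(default_factory=list)     # descriptor terms for that category

@dataclass
class WebPageContextRow:
    url: str
    keywords: list[str] = field(default_factory=list)     # terms derived from title, content, keywords, and tags

general_set_of_urls: list[str] = []                       # e.g., encyclopedias, cultural sites, repositories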
5. Model Functionalities for Voice-Based Web Browsing
In this section, we describe the functionalities of the proposed context-based model to facilitate web browsing using voice.
Figure 3 depicts the five functionalities that aim at improving web browsing experience for users with disabilities or limited skills. They are represented as diagram blocks associated by means of use relationships. In addition, such blocks served as a guide to understand the structure of this section.
As shown in
Figure 3, the user context is generated from the user profile and the user navigation history (see
Section 5.1); in turn, the user profile is built from the responses of a survey applied to the user. Our model creates customized queries based on the user’s request and context, and it retrieves web content according to queries (see
Section 5.2). In addition, to facilitate web browsing, the model not only takes into account the user context but also that of web pages (see
Section 5.3). To present web content of real interest to the user, the results are obtained by matching the web page context, query, and user context (see
Section 5.4). Additionally, the model provides a user interface that actively listens to user input, allowing for continuous interaction, and presents results graphically and audibly (see
Section 5.5).
5.1. User Context Generation
The user context is composed of two main elements: the user’s profile and the user’s browsing history. The data associated with the user’s profile are static as they do not usually change over time, while the data related to the browsing history are dynamic. The data that make up both the user profile and browsing history, as well as how they are obtained, are detailed below.
5.1.1. User Profile
Currently, there is no consensus on how to model the context of an entity, as context models are often specific to the problem domain being addressed. This is because the context varies significantly depending on the application domain, such as health, education, or transportation. In this model, the data that make up the user profile represent their characteristics and their relationship with the environment, specifically in the domain of web browsing. Since it is possible to include a large amount of data in this domain, those that we consider most significant were selected. Therefore, the user profile is structured from the data included in the following four categories:
Demographic data: they provide important information about the characteristics of the population to which the user belongs. The demographics used in this model are:
Age: it allows understanding the different stages of life the user is going through and how these may influence their needs, behaviors, and preferences;
Gender: it facilitates understanding of the user’s perception of the world, their roles, and consumption patterns;
Address: it provides information about the environment where the user lives, such as geographic and cultural characteristics that may influence their behavior and specific needs;
Educational level: it provides insight into the user’s level of knowledge, skills, and access to opportunities, which may have an impact on their behavior, decision-making, and perspectives.
Topics of interest: on the Web, the user can find a wide variety of content on various topics such as politics, science, technology, art, and history. Knowing the topics of interest to the user can improve the browsing experience by making it easier to find and access meaningful information.
Personal affinities: it allows knowing the user’s interests, likings, or preferences. In this category, the following data are considered:
Activities and hobbies: it allows knowing the activities that the user enjoys doing in their free time, such as sports, music, reading, art, movies, travel, or cooking;
Favorite music and movies: it allows knowing the musical and cinematographic genres preferred by the user, as well as the movies and music that they consult the most. It also provides the opportunity to know the artists, bands, or directors that the user appreciates;
Books and readings: it allows knowing the literary genres that the user prefers, if they have favorite authors or books, or if they like to read novels, essays, biographies, or other genres;
Sports and physical activities: it allows knowing if the user is interested in any particular sport, if they like to exercise, participate in outdoor activities or follow any sports team or competition;
Food and gastronomy: it allows knowing the user’s culinary preferences, if they enjoy any particular type of cuisine or have favorite restaurants or dishes.
Website preferences: it allows the user’s preferences to be known when choosing to access certain websites to search for information.
These data are collected through a voice survey conducted during the application setup phase.
Figure 4 shows the sequence diagram corresponding to the collection of user profile data. This diagram illustrates the interaction between the components of the model to collect and process information through a voice survey, as described below.
- 1. & 2.
The User Context Manager starts the process by invoking the LoadSurvey- WelcomeScreen() and ReproduceWelcome() methods from the Multi-modal User Interface, which display a welcome message and play the audio instructions;
- 3.
The User Context Manager sets the variable i = 1, which indicates the current question number;
A loop then controls the repetition of the steps required for the 12 questions that make up the survey:
- 4.
The User Context Manager invokes the Load&PlayQuestion() method from the Multi-modal User Interface to present the current question and the related images, and to play the associated instruction;
- 5.
The user responds to the question by voice;
- 6.
The User Context Manager invokes the Speech2Text() method from the Natural Language Translator that converts voice response to text;
- 7.
The Natural Language Translator returns the user’s response converted to text;
- 8.
The User Context Manager validates the answer according to the question, which results in the alternative block (ALT) with the following options:
- –
If the response is invalid, a loop is started and terminated upon receipt of a valid response. At each iteration, the User Context Manager prompts the Multi-modal User Interface to display a notification explaining the reason for the invalidity (9a). The Multi-modal User Interface then presents the current question again and plays its instruction (10a). The user provides their answer (11a), then the Natural Language Translator converts it from speech to text (12a and 13a), and the User Context Manager validates the answer again (14a);
- –
If the answer is valid, the User Context Manager processes the corresponding text to extract the data (9b), updates the User Context table with the obtained data (10b) and increments the question counter (i) by one to advance to the next question (11b).
When i > 12, the main loop is terminated, indicating that the survey is complete.
Processing Survey Responses
As mentioned above, user responses are processed (see step 9b in
Figure 4) according to the specific characteristics of each question. Two types of responses are considered:
Answers to closed-ended questions: these responses are usually limited to a set of options, and they are short or have a specific data type. Therefore, response processing and data extraction take into account the particular situations of each question. For example, in the question “How old are you?”, it is desired to extract an integer numeric value that falls within a range representing a valid age. The value is extracted from the text string, considering that age can be represented by numeric symbols, such as “8”, or through words, such as “eight”;
Answers to open-ended questions: these responses are not limited to a fixed set of options and may vary in length and content. Therefore, they are analyzed using natural language processing techniques such as text preprocessing, syntactic analysis, and dependency parsing. To perform these tasks, we use Stanford CoreNLP [
39] server, a widely adopted natural language processing toolkit that provides robust linguistic analysis, including the aforementioned techniques. The steps to process these responses and generate terms that are descriptive of the user’s context are detailed below:
Preprocessing: the user’s response in text format is converted to lowercase. The text is then split into words, and irrelevant words, such as articles, fillers, and conjunctions, are removed;
Syntactic analysis: the text resulting from the preprocessing step is used as input. The words are labeled according to their grammatical category, such as nouns, verbs, adjectives, or adverbs. In addition, an analysis of the dependency relationships between words is performed. As a result, a dataset is generated containing the words with their corresponding grammatical category and the labels of the dependency relations between words;
Generation of context descriptor terms: the input is the dataset generated in the syntactic analysis step. The nouns, together with the words that form the following dependency relations, are selected:
- –
Adjectival modifier (amod): it indicates the dependency relationship in which the adjective qualifies the noun. For example, in the sentence “I like to listen to classical music”, the adjectival modifier “classical” modifies the noun “music”, describing the type of music the user listens to;
- –
Nominal modifier (nmod): it indicates the dependency relation in which the nominal modifier adds specific information characteristic of the noun to which it refers. For example, in the phrase “I read history books”, the relation indicates that “history” describes the type of books that the user reads.
As shown in steps 9b and 10b of
Figure 4, the words that make up the dependency relationship are selected as descriptor terms and stored in the User Context table. This table is structured as follows: each row represents a data category with the first column indicating the category and the remaining columns listing its corresponding keywords (see
Table 5).
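A minimal sketch of this extraction step is shown below (Python, assuming a Stanford CoreNLP server running locally on port 9000, as in our test environment); it keeps nouns and the words participating in amod/nmod dependency relations as descriptor terms. It illustrates the technique and is not the prototype’s code.

import requests

CORENLP_URL = "http://localhost:9000"   # assumed address of the local CoreNLP server

def extract_descriptor_terms(answer: str) -> set[str]:
    # Annotators required for part-of-speech tagging and dependency parsing
    props = '{"annotators": "tokenize,ssplit,pos,depparse", "outputFormat": "json"}'
    resp = requests.post(CORENLP_URL, params={"properties": props},
                         data=answer.lower().encode("utf-8"))
    resp.raise_for_status()
    terms = set()
    for sentence in resp.json()["sentences"]:
        # Keep every noun as a candidate descriptor term
        for token in sentence["tokens"]:
            if token["pos"].startswith("NN"):
                terms.add(token["word"])
        # Keep both words of adjectival (amod) and nominal (nmod) modifier relations
        for dep in sentence["basicDependencies"]:
            if dep["dep"] == "amod" or dep["dep"].startswith("nmod"):
                terms.add(dep["governorGloss"])
                terms.add(dep["dependentGloss"])
    return terms

# extract_descriptor_terms("I like to listen to classical music")
# should yield terms such as {"music", "classical"}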
5.1.2. User Navigation History
Another element of the user’s context is their preferred websites. During the setup phase, these are collected using the voice survey mentioned previously. Subsequently, the user’s context is updated dynamically, adding or removing sites based on the user’s frequent visits. This process is managed by the Generator of Navigation History, which continuously monitors the web pages visited by the user, using the data provided by the Operations component. Each time the user accesses a website or web page, the Operations component sends the corresponding URL to the Generator of Navigation History, which stores this information in a table with entries for the URL, the date of the last visit, and the number of visits (see
Table 6). When receiving a URL, the Generator of Navigation History checks whether it is already registered. If not, it adds the URL and initializes the “number of visits” field to one; if the URL already exists, the Generator of Navigation History increments that field by one. In both cases, the date of the visit is updated.
To facilitate the management of the navigation history and prevent it from growing excessively, every ten days the Generator of Navigation History analyzes the logged pages, transferring to the User Context table, under the category of “website preferences”, those that have been visited five or more times in that period. Although five visits in ten days might seem insufficient to define a site as “preferred”, it is crucial to consider that the users targeted by this model do not navigate the web in the same way or with the same frequency as the average user. For them, visiting a page several times in a short period may indicate genuine interest.
The Generator of Navigation History maintains in history the information of web pages that meet the following criteria:
In the last ten days, they have received at least five visits;
The last visit must have occurred within the last month.
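The following sketch summarizes this behavior (Python; the in-memory dictionary and function names are illustrative assumptions, not the prototype’s implementation).

from datetime import datetime, timedelta

history = {}   # url -> {"last_visit": datetime, "visits": int}

def register_visit(url: str) -> None:
    entry = history.setdefault(url, {"last_visit": None, "visits": 0})
    entry["visits"] += 1
    entry["last_visit"] = datetime.now()

def consolidate(website_preferences: set[str]) -> None:
    # Executed every ten days by the Generator of Navigation History
    now = datetime.now()
    for url, entry in list(history.items()):
        frequent = entry["visits"] >= 5                            # at least five visits in the window
        recent = now - entry["last_visit"] <= timedelta(days=30)   # last visit within the last month
        if frequent:
            website_preferences.add(url)   # promoted to the "website preferences" category
        if frequent and recent:
            entry["visits"] = 0            # kept in history; the counter restarts for the next window
        else:
            del history[url]               # pruned: fails at least one of the two criteria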
5.2. Query Formulation and Search
Once the user’s context is generated, the user can perform search requests.
Figure 5 shows the process that results in generating a search query. This process includes the stages in which the user makes the search request by voice, the transformation from voice to text by the Multi-modal User Interface, and the detection of the request intent by the Natural Language Translator. Once the search intent is detected, the Query Generator constructs a search query to locate web content. This aids users who are less experienced with web browsing or struggle to form queries.
A query is composed of keywords, search operators, and filters:
Keywords. They are derived from both the user’s request and context. On the one hand, when the Natural Language Translator identifies that the user’s intention is to perform a search, it transfers to the Query Generator the relevant words extracted from the request. These words include named entities, terms that form syntactic relationships, nouns, and those defined as key parameters for the search, such as content type (e.g., image, video, or web page) and quantities. Then, the Query Generator creates a set of “search keywords”, omitting any repetition. On the other hand, the Query Generator examines the first column of the User Context table looking for terms that match the words extracted from the user’s request. If it finds matches, it adds the keywords associated with the row of the corresponding element(s) to the set of “search keywords”;
Search operators. Three search operators are employed:
- –
Double quotation marks (“”): they are used to search for named entities and nouns in order to obtain results containing exactly those words, e.g., “Marie Curie”;
- –
AND operator: it is used to add to the query the keywords derived from the user’s request, e.g., movie AND current;
- –
OR operator: it is used to add to the query the keywords derived from the user context, e.g., drama OR terror.
Filters. Two parameters are defined to refine the results:
- –
searchType:[type]: it is used to specify the desired content type. Some values for the [type] parameter include image to search for images, video to look for videos, and news to search for news;
- –
site:[site]: it is used to limit the search results to a specific website. The parameter [site] must be replaced by the domain of the desired site, e.g., site:youtube.com.
The query is built in two parts. First, the keywords extracted from the user’s request are connected by the AND operator, which ensures that they are all present in the results. Then, these keywords are combined with the context keywords, which are grouped into a set linked by the OR operator. This set, in turn, is connected to the main query by AND. For example, if the request keywords are “movie” and “current”, and the context keywords are “drama”, “horror”, and “comedy”, the resulting query would be: movie AND current AND (drama OR horror OR comedy). Subsequently, the query and the filter parameters are sent to the Search component, which is responsible for performing the search and returning a set of URLs as a result. Its functions are:
Connect to Internet via web interface;
Receive the user’s query and search filters provided by the Query Generator;
Perform the web search and generate a list of ten URLs that match the query, which is then sent to the Web Page Context Analyzer;
Receive a notification from the Web Page Context Analyzer, indicating whether the Search component can access the Web Page Context table;
Select the web pages whose context is most similar to the user’s query and context, and then send them to the Response Generator.
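To illustrate the query construction described above, the following sketch combines request keywords, context keywords, and filters (Python; parameter names and defaults are assumptions, not the prototype’s code).

def build_query(request_keywords: list[str], context_keywords: list[str],
                content_type: str | None = None, site: str | None = None) -> str:
    # Request keywords are mandatory, so they are joined with AND;
    # named entities would additionally be wrapped in double quotation marks.
    query = " AND ".join(dict.fromkeys(request_keywords))   # de-duplicate while keeping order
    # Context keywords are optional alternatives, so they are grouped with OR
    context = [kw for kw in dict.fromkeys(context_keywords) if kw not in request_keywords]
    if context:
        query += " AND (" + " OR ".join(context) + ")"
    # Filters refine the results by content type and website
    if content_type:
        query += f" searchType:{content_type}"
    if site:
        query += f" site:{site}"
    return query

# build_query(["movie", "current"], ["drama", "horror", "comedy"])
# -> 'movie AND current AND (drama OR horror OR comedy)'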
5.3. Web Page Context Processing
Web page context is composed of data extracted from the content of the web page to determine its usefulness and relevance, which makes it possible to decide what content to present to the user according to their own user context. The availability of data varies between web pages due to the decisions and approaches taken by developers when designing them. Each developer is free to determine what information to include and how to organize it, resulting in differences in the data available for contextual analysis. For example, some pages may include the author’s name and publication date, while others may not.
The model focuses on retrieving the following data items: title, author, publication date, subtitles, content keywords, and multimedia indicator. These data items give an overview of the topic, focus, and key aspects of the page while also indicating whether the web page is updated and whether it contains multimedia elements.
As mentioned above, a list of ten URLs generated from the query-based search is sent to the Web Page Context Analyzer to extract the context of each web page. This process is outlined in the following three steps:
Content collection: an HTTP request is sent to the web server to obtain the HTML of the web page, while possible errors are handled;
Metadata analysis: the objective of this process is to identify and search the content of HTML tags, such as titles, descriptions, introductions, keywords defined by the page developer, author’s name, and publication date;
Data extraction: natural language processing techniques such as sentence segmentation, tokenization, part-of-speech tagging, dependency parsing, and named entity recognition are applied to understand the content, identify relevant entities, and finally extract the keywords that describe the context of the web page. To achieve the above, the following steps are performed:
Separate the text into sentences;
Remove unwanted elements such as symbols, tags, and punctuation marks;
Separate each sentence into individual words (tokenizing);
Label words according to their grammatical category;
Obtain dependency relations between words;
Select the words that have the dependency relationships shown in
Table 7;
Detect named entities and select those belonging to the following groups:
- –
Persons: names of individuals, such as “Citlalli Avalos”;
- –
Geographic locations: place names, such as “Mexico”;
- –
Organizations: names of companies, institutions or organizations, such as “Google” or “Cinvestav”;
- –
Dates: expressions of time or dates, such as “today” or “12 July 2023”.
A set of selected words is obtained from which repeated words are removed, thus ensuring that each word is represented only once in the final set of descriptor terms describing the context of the web page content. Finally, a row is added to the Web Page Context table with the URL and keywords of the analyzed page.
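The content collection and metadata analysis steps can be sketched as follows (Python, assuming the requests and BeautifulSoup libraries; the meta tag names shown, e.g., "date", are only examples, since the available metadata varies between pages). The textual fields gathered this way would then go through the NLP steps above to produce the descriptor terms stored in the Web Page Context table.

import requests
from bs4 import BeautifulSoup

def collect_page_metadata(url: str) -> dict:
    resp = requests.get(url, timeout=10)   # content collection; network errors raise exceptions
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    def meta(name: str) -> str:
        tag = soup.find("meta", attrs={"name": name})
        return tag.get("content", "") if tag else ""

    return {
        "url": url,
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "subtitles": [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])],
        "keywords": meta("keywords"),
        "author": meta("author"),
        "publication_date": meta("date"),                       # assumed tag name; pages differ
        "has_multimedia": bool(soup.find(["img", "video", "audio"])),
    }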
5.4. Contexts and Query Matching
As mentioned in previous sections, the context of the user, the context of web pages, and the user request are represented by descriptor terms that are basically keywords. Based on the comparison and analysis of these keyword sets, the most appropriate content for the user is sought. The objective of the comparison is to determine which set of keywords from the context of web pages has the highest similarity to the keywords from the user’s context and request.
Measures of similarity and distance are used in various fields to quantify and compare the similarity or difference between sets or elements. Since, in this problem, we wish to compare sets of words with each other, Jaccard’s coefficient [
40] was chosen for the following reasons:
It is suitable for comparing sets;
It considers only the presence or absence of words in the sets, regardless of the order or frequency of occurrence;
It does not take into account duplicates within sets;
It provides intuitive interpretation.
As illustrated in
Figure 6, the context matching process consists of two phases:
Phase 1: the matches between the keywords extracted from the User Request Keywords and the User Context Keywords are identified. The values of the first column of the User Context table (see
Table 5) are taken and compared with the User Request Keywords. When a match is found, the User Context Keywords of the corresponding row are added to the User Request Keywords;
Phase 2: the Jaccard coefficient is calculated between the keywords resulting from the identification of matches and the keywords of the context of each web page (see Resulting Keywords and Web Page Contextual Keywords in
Figure 6). As mentioned above, the Jaccard coefficient is a numerical measure of the degree to which two sets of data are similar. The values of this coefficient are between zero, when there is no similarity, and one, when there is total similarity. The Jaccard coefficient is defined as the size of the intersection of the sets divided by the size of their union [
40]. Mathematically, it is expressed as follows:
J(A, B) = |A ∩ B| / |A ∪ B|
where:
- –
A and B are sets of data;
- –
|A ∩ B| represents the cardinality of the intersection of A and B, i.e., the number of elements present in both A and B;
- –
|A ∪ B| denotes the cardinality of the union of A and B, i.e., the number of unique elements in both sets.
Finally, the five web pages whose sets of Web Page Contextual Keywords (WPCK) obtain the highest similarity with the set of Resulting Keywords (RK) are selected, i.e., those with a Jaccard coefficient J(WPCK, RK) closest to 1. The selected pages are listed as results to be presented to the user. The remaining five pages are sorted by similarity, from highest to lowest, and temporarily stored in case the user wants to explore or navigate them without making a new search request.
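Both phases can be sketched as follows (Python; the User Context table is represented here as a dictionary from category to keywords, which is an illustrative simplification).

def expand_with_context(request_kw: set[str], user_context: dict[str, set[str]]) -> set[str]:
    # Phase 1: add the keywords of every User Context row whose category matches the request
    resulting = set(request_kw)
    for category, keywords in user_context.items():
        if category in request_kw:
            resulting |= keywords
    return resulting

def jaccard(a: set[str], b: set[str]) -> float:
    # Phase 2: |A ∩ B| / |A ∪ B|, between 0 (no similarity) and 1 (identical sets)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def select_results(resulting_kw: set[str], page_contexts: dict[str, set[str]]) -> tuple[list[str], list[str]]:
    # Rank the ten candidate pages by similarity; present five, keep five in reserve
    ranked = sorted(page_contexts, key=lambda url: jaccard(page_contexts[url], resulting_kw), reverse=True)
    return ranked[:5], ranked[5:]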
5.5. Result Delivery and Voice Navigation
Figure 7 illustrates the process in which the user makes a voice request to the model. The Multi-modal User Interface converts voice to text, and the Natural Language Translator parses the request to identify the user’s intent. For intent detection, an agent was configured in Dialogflow [
41], a Google Cloud platform designed to create conversational interfaces in mobile applications. Dialogflow was chosen due to its built-in natural language understanding (NLU) capabilities, which enable accurate intent recognition and entity extraction. Additionally, its integration with various platforms and support for multiple languages make it a robust solution for developing interactive and scalable conversational agents. The agent configuration includes the definition of entities, intents, parameters, and preset responses. The intentions addressed by the proposed model fall into three categories:
User interface modification intent, which enables interaction with images, videos, terms, or web pages (see
Table 8);
Search intent that includes searching for images, videos, terms, or web pages;
Navigation intent that enables movement between web pages by following links, and interaction with menus and visual elements.
The Natural Language Translator may also assign the classification “unknown” when the intent of the user’s request does not fit into the above three categories. Therefore, it is necessary to manage this possibility and notify the user.
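For illustration, the following sketch sends a transcribed request to a Dialogflow agent and retrieves the detected intent and its parameters using the google-cloud-dialogflow client (Python; the project identifier, session handling, and language code are placeholders, not the prototype’s actual configuration).

from google.cloud import dialogflow

def detect_intent(project_id: str, session_id: str, text: str, language_code: str = "es"):
    client = dialogflow.SessionsClient()
    session = client.session_path(project_id, session_id)
    query_input = dialogflow.QueryInput(
        text=dialogflow.TextInput(text=text, language_code=language_code)
    )
    response = client.detect_intent(request={"session": session, "query_input": query_input})
    result = response.query_result
    # display_name is the intent configured in the agent; parameters is a protobuf Struct
    return result.intent.display_name, result.parameters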
As shown in
Figure 7, when the Natural Language Translator detects that the user’s intent is “search”, the Search component processes the query and generates a list of ten content options sorted by relevance to the user. On the other hand, if the detected intent is “navigation or user interface modification”, the request and the intent parameters generated by the Natural Language Translator are processed by the Operations component.
The intent parameters represent essential data to execute specific actions. For example, if the intent is “move the image”, a parameter might indicate the direction of movement or a relative amount, such as “a little” or “more”. Once this information is received, the Operations component performs the following tasks:
Validation: it checks if the requested action can be performed in the current user interface or window;
Parameter analysis: it interprets the intent parameters to determine the type of operation to be performed. For example, if the intent is “increase volume” with no associated parameters, the volume is increased by a default value. However, if there are parameters, such as “two levels”, this amount is added to the current level. If the parameter specifies an end level, the volume is simply set to that value;
Execution: it performs the corresponding operations, such as invoking methods or generating events in the Multi-modal User Interface to reflect the requested changes or navigation;
Error handling: it handles problems related to validation, parameter analysis, or operation execution.
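As an example, the parameter analysis and validation for the volume case above could look like the following sketch (Python; the parameter names target_level and amount are hypothetical and serve only to show the decision logic).

def handle_volume_intent(parameters: dict, current_level: int, max_level: int = 15) -> int:
    if "target_level" in parameters:          # e.g., "set the volume to eight"
        new_level = int(parameters["target_level"])
    elif "amount" in parameters:              # e.g., "raise the volume two levels"
        new_level = current_level + int(parameters["amount"])
    else:                                     # no parameters: apply a default step
        new_level = current_level + 1
    return max(0, min(new_level, max_level))  # validation against the allowed range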
After processing the user’s request according to the detected intent, whether it is “search”, “navigation”, “user interface modification”, or an “unknown” intent, the results of each process are sent to the Response Generator, as shown in
Figure 7. This component creates responses based on the web content and the user’s request, in addition to generating notifications about available navigation options and application messages in natural language. These responses are synthesized and sent to the Multi-modal User Interface for presentation to the user in graphical and audible form.
Regarding the graphical and audible responses provided through the Multi-modal User Interface, some examples are presented below. In each case, the graphical user interfaces (GUIs) are shown along with a description of the audible responses generated by the application:
The process described in
Section 5.1, which is related to the generation of the user context, starts with a welcome screen as shown in
Figure 8a. At this stage, an audio file provides instructions to the user. It asks the user to answer a questionnaire and tells the user that they can skip any question by saying the word “omit”. If the user decides not to continue, the application closes. Otherwise, the questionnaire starts.
Figure 8b–d show examples of some screens from the questionnaire. On each screen, there is a specific audio instruction for that question;
Once the user’s context is generated, they can make requests to search for content on the web and explore the results. The contents found by the application are presented according to their type, as illustrated in
Figure 9a,b. For example, if the user searches for images or videos, the application informs by audio that the result corresponds to an image or video related to their request, together with the available navigation actions (see
Table 8);
Similarly, other types of results are presented in the corresponding screens.
Figure 10a shows an example of the search results for terms and web pages, accompanied by an audio playback that allows the user to select among the available options. Once the user chooses an option, it is displayed on its corresponding screen, as illustrated in
Figure 10b,c. In the case of term results, the application plays in audio the relevant information found about the searched term. For web pages, the available navigation options are reproduced, allowing the user to explore the content.
6. Testing the Prototype with End Users
This section focuses on describing the tests performed to evaluate our prototype, which was developed from the proposed model as a mobile application for the Android 12S operating system [
42]. The decision to use Android was based on its dominant presence in the mobile market. According to data collected by StatCounter [
43] from June 2022 to June 2023, Android accounted for 77.35% of mobile devices in Mexico, followed by iOS with 22.34%, while other operating systems collectively represented only 0.31%. Additionally, Android continuously evolves with performance improvements, new features, API updates for developers, and security enhancements. Android 12S (API 31) was chosen for its availability, accessibility for research, and compatibility with approximately 33% of current devices, ensuring long-term applicability as older versions become obsolete. Next, we explain the evaluation metrics used in the tests (see
Section 6.1), the scenario conditions (see
Section 6.2), the tasks and instructions given to users to evaluate the prototype (see
Section 6.3), and the characteristics of the participants in the tests (see
Section 6.4).
6.1. Evaluation Metrics
The purpose of these tests is to determine the level of user satisfaction with respect to the results obtained in a search and the comfort perceived by users when using the prototype. To carry out these evaluations, two instruments designed to measure usability were employed: the System Usability Scale (SUS) [
44] and the Questionnaire for User Interaction Satisfaction (QUIS) [
45].
SUS offers a general perspective of subjective usability evaluations, and it can be applied in various contexts. This scale consists of ten standardized questions, which allow obtaining a quantitative measure of the ease of use perceived by the user.
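For reference, the standard SUS scoring maps the ten 1–5 Likert responses to a 0–100 scale, as in the following sketch (this is the conventional computation for the instrument, not a detail specific to our prototype).

def sus_score(responses: list[int]) -> float:
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)   # odd items scored positively, even items negatively
    return total * 2.5                                # final score on a 0-100 scale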
On the other hand, the purpose of QUIS is to measure the user’s subjective satisfaction in relation to the user interface. This questionnaire seeks to obtain opinions on key areas, such as ease of use, consistency, and system capacity.
We used both SUS and QUIS in the usability evaluation of the prototype due to their specific characteristics. On the one hand, SUS provides a quick and generalized assessment of perceived ease of use and overall user satisfaction. On the other hand, QUIS offers a more extensive approach, which allows a variety of system qualities to be evaluated in detail. This combination of tools provided a broad view of the usability of our prototype.
We chose the full version of SUS and the latest version of QUIS (i.e., version 7). To adapt the QUIS questionnaire to the particularities of our study context, we selected the sections and questions aligned with the characteristics of the prototype. The chosen sections and questions cover dimensions such as previous experience, graphical interfaces, the application’s capabilities, and overall user impression.
6.2. Test Scenario
To evaluate this first prototype, we selected a target population composed of individuals with experience in web browsing and without particular accessibility needs. The purpose of this selection is to capture the perceptions of users who have prior knowledge of conventional web browsing practices. This approach serves as a starting point for future versions of the prototype focused on other types of users.
Convenience sampling [
46] was used in the selection of test participants. It is important to mention that this technique, which depends on the accessibility of the available research subjects, can introduce biases and does not guarantee that the sample is representative of the entire target population. However, although the sample is small, its size is within the average for this type of test [
47].
Moreover, an environment free of acoustic distractions was configured for the evaluation tests, with access to internet services via wireless connection. Only the test user and the person responsible for applying the tests participated in this space. Essential devices for testing included a laptop to run the Stanford CoreNLP server and a tablet connected to the internet service, with the prototype installed. Optionally, the use of wireless headphones was suggested to improve the user’s listening experience during interaction with the prototype. The method for conducting our tests has been widely used by various authors in similar contexts (Aula et al., 2010 [
48]; Chin and Fu, 2010 [
49]; Bevan et al., 2016 [
50]; Merz et al., 2016 [
51]; Sánchez-Adame et al., 2018 [
52]; and Monroy-Rodríguez et al., 2024 [
53]).
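To illustrate the client–server setup just described, the following is a minimal sketch of how an application might query a locally running Stanford CoreNLP server over its HTTP interface; the address, port, and chosen annotators are assumptions and do not necessarily match the prototype’s actual configuration.

```python
import json
import requests

# Hypothetical example: a client on the tablet sends an utterance to the
# CoreNLP server running on the laptop (the server's default port is 9000)
# and reads back tokens, part-of-speech tags, and lemmas.
CORENLP_URL = "http://192.168.0.10:9000/"  # assumed address of the laptop
properties = {"annotators": "tokenize,ssplit,pos,lemma", "outputFormat": "json"}

response = requests.post(
    CORENLP_URL,
    params={"properties": json.dumps(properties)},
    data="Search for current music, please".encode("utf-8"),
)
annotations = response.json()
for token in annotations["sentences"][0]["tokens"]:
    print(token["word"], token["pos"], token["lemma"])
```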
6.3. Tasks and Instructions
To carry out the evaluation tests, various tasks were assigned to the users:
Profile configuration. To streamline the evaluation process, seven profiles were predefined, each with distinctive characteristics (a sketch of a possible profile structure is given after this list). This measure allowed users, if they preferred, to select one of these profiles instead of spending time completing their own. Users were asked to act as if they had the likes and characteristics associated with the chosen profile, so that they interacted with the prototype based on the preferences set for that profile. However, the option of filling out their own profile was allowed and encouraged for those users who wished to do so;
Search for an image and interact with it, e.g., resizing it, moving it in the plane, and rotating it;
Search for a video and interact with it, e.g., pausing, adjusting the volume, repeating, fast-forwarding, or rewinding;
Search for a web page and navigate it, e.g., scrolling up and down and interacting with menus;
Search for a term or definition;
Select an option from the suggestions provided;
Adjust the prototype volume;
Request the repetition of the phrases pronounced by the prototype;
Request the presentation of the following option in any search carried out;
Close the prototype.
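As noted in the profile configuration task above, seven predefined profiles were used. The sketch below shows one purely hypothetical way such a profile could be represented and compared against a request; the field names and values are assumptions and do not correspond to the actual profiles used in the tests.

```python
# Hypothetical representation of a predefined user profile; the fields and
# values are illustrative only.
profile_a = {
    "name": "Profile A",
    "age_range": "18-25",
    "education_level": "high school",
    "interests": ["current music", "sports", "video games"],
    "preferred_content": ["videos", "images"],
}

def matches_interest(profile, request):
    """Rough check of whether a spoken request overlaps with the profile's interests."""
    request_terms = set(request.lower().replace(",", "").split())
    return any(request_terms & set(interest.split()) for interest in profile["interests"])

print(matches_interest(profile_a, "Search for current music, please"))  # -> True
```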
Likewise, the following instructions were given to the test participants:
Before starting the evaluation tests, each user received a brief presentation of the prototype, highlighting the objectives and aspects related to the evaluation process;
Also, the user received a brief demonstration of how the prototype works, to give them an understanding of how the prototype would respond to their interactions;
Subsequently, the user was given a piece of paper containing the specific tasks to be performed during the evaluation. Each task was accompanied by example sentences that served as a guide for interaction with the prototype;
At the end of each search or interaction, the user was asked about the suitability of the answer obtained, in order to gather immediate feedback;
The user was asked to complete the questionnaires at the end of the assigned tasks.
6.4. Characteristics of the Participants
The test set consisted of 20 participants, 9 women and 11 men. The average age of the participants was 26 years, with a maximum age of 52 and a minimum age of 16. Regarding the educational level of the participants, the following was observed (see
Figure 11):
15% of the participants indicated that they had completed or were attending the elementary or middle school level;
40% stated that they were currently at the high school level or had completed it;
10% of the participants indicated that they were studying or had completed a bachelor’s degree;
The rest of the participants (35%) reported being enrolled in a master’s or doctoral program or having obtained a postgraduate degree.
As can be seen, users with different levels of experience participated in the evaluation, and future tests are planned with more diverse profiles, including inexperienced people with different skills. The premise is that, if the results are consistent across these profiles, this would indicate acceptable usability and performance for a wide variety of users.
7. Results and Discussion
Next, we analyze and discuss the results obtained from the tests described in
Section 6. First, we present the results of the prototype evaluation using the SUS and QUIS 7 instruments to measure usability (see
Section 7.1 and
Section 7.2). Then, we show the evaluation of the results obtained by the participants when performing web searches (see
Section 7.3). Finally, we analyze the impact of context on such web search results (see
Section 7.4).
7.1. Prototype Evaluation Using SUS
According to Brooke [
38], the usability scale generates a single number (from 0 to 100) that represents a composite measure of the overall usability of the prototype.
Figure 12 presents the usability score per user. When calculating the average of these scores, a value of 76.87 points is obtained, indicating an acceptable level of usability, but with some areas for improvement.
From this score, it can be concluded that the prototype reached an acceptable level of usability, suggesting that it is easy to learn, easy to remember how to use, consistent in its operation, and that it provides a generally pleasant experience.
7.2. Prototype Evaluation Using QUIS 7
The four dimensions assessed using the QUIS 7 instrument are (1) the previous experience of the participants, (2) the graphical interfaces of the prototype, (3) the application’s capabilities, and (4) the overall user impression. In the following subsections, we describe the results of the evaluation corresponding to each dimension.
7.2.1. Previous Experience
To validate the users’ previous experience with voice-based virtual assistants, they were asked about their history of interaction with these technologies. As seen in
Figure 13, the following answers were obtained: two users indicated not having any previous experience with virtual assistants, while four others confirmed having used a single virtual assistant. In addition, ten participants stated that they had interacted with two different types of virtual assistants and, finally, four participants stated that they had used three to four different virtual assistants.
In addition to inquiring about their experience with virtual assistants, participants were asked about the technologies with which they feel familiar and use regularly. The results indicate that 100% of participants reported using touch screen devices, accessing the Internet, and managing emails. Regarding other technologies, 90% of the participants expressed being familiar with the use of color monitors, 85% indicated knowing how to use a keyboard, 80% stated that they knew how to use or had used a mouse, 75% stated that they had skills in using a personal computer, and 60% expressed having used voice recognition systems. The sample showed diverse educational levels, intermediate on average. This educational profile, together with the fact that 90% of users have experience using at least one virtual assistant and that 100% access the Internet, shows that the perceptions and opinions collected come from people with solid knowledge and experience in this field.
7.2.2. Graphical Interfaces
Eleven questions were asked to assess the usability of the graphical interfaces. The average results of these evaluations are presented in
Figure 14. In all evaluations carried out, the percentage is interpreted on a scale ranging from 0% to 100%. In this context, a score of 100% represents all or the highest level of the quality evaluated, such as totally good, completely sharp, or absolutely clear. On the other hand, 0% indicates the opposite extreme, i.e., the absence or the lowest level of the quality evaluated. In this way, the intermediate scores reflect proportional degrees of quality depending on the established scale.
An average percentage score of 93.3% was obtained from the evaluation of the 11 aspects. Additionally, some users submitted comments regarding the following: (1) include the dark mode version; (2) maintain a uniform font size throughout the prototype, e.g., on the visited web pages; (3) improve content distribution because “sometimes words pile up”; (4) include a tutorial in the prototype; (5) allow the user to change the font size; and (6) allow the use of touch screens.
The evaluation of the graphical interfaces revealed, in general, a good acceptance by the users. Highlights include ease of reading, clarity in the sequence of browsing activities, the usefulness of the screen format, and the ability to predict which screen is next in a sequence. In addition, these results suggest areas for improvement, such as adapting the font size according to the user’s preferences, and customizing the user interface based on the amount of content. In general, user interfaces properly fulfill their function of presenting content in graphic format.
However, it was observed that the participants, being accustomed to interacting with applications that require hardware (such as touch devices), made some recommendations for improvements that resemble the conventional navigation model. It is important to note that this work seeks to implement voice navigation to achieve natural interaction, thus differentiating itself from conventional hardware-based interaction modes.
7.2.3. Application’s Capability
To measure the prototype’s capability, three characteristics were evaluated by the users. The average results of these evaluations are presented in
Figure 15. Similar to the previous sections, 100% represents the entirety of the evaluated aspect, e.g., fully reliable. On the contrary, 0% denotes the minimum score, e.g., not at all reliable.
Users perceived the prototype as quite reliable for browsing the web, as they observed that it executed their requests correctly. They also highlighted the ease of use of the prototype, even with minimal usage instructions. These findings suggest that participants can interact with the prototype simply and naturally. However, there is still room to improve ease of use, so a more detailed evaluation of this feature is required.
7.2.4. Overall User Impression
The participants’ overall impression was assessed using three questions, and the average of the grades obtained for each one is as follows:
The overall impression of the prototype was:
8.45, where 1 is very bad and 9 is very good;
8.4, where 1 is frustrating and 9 is enjoyable.
The use of the prototype was:
8.55, where 1 is boring and 9 is stimulating;
8.6, where 1 is difficult and 9 is easy.
The interaction with the prototype was:
7.95, where 1 is rigid and 9 is flexible;
8.15, where 1 is inadequate and 9 is adequate.
An average score of 8.35 was obtained across the six ratings ((8.45 + 8.40 + 8.55 + 8.60 + 7.95 + 8.15)/6 = 8.35), reflecting a mostly positive assessment of the impression the prototype made on users. The best-rated qualities indicate that the prototype is easy to use, stimulating, highly competent in the tasks it can perform, and enjoyable.
7.3. Web Search Results
Participants were asked, after each of the four searches carried out, whether the results obtained coincided with their user profile and requests. The results were as follows:
Searching for an image: as evidenced in
Figure 16a, 60% stated that the results were in agreement, 20% indicated that they did not coincide, while the remaining 20% expressed that they corresponded slightly;
Searching for a video: as shown in
Figure 16b, 41% said the results were consistent, 36% commented that they did not correspond, and the remaining 23% expressed that they slightly coincided;
Searching for a web page: as seen in
Figure 16c, 75% stated that the results were consistent, 5% commented that they did not correspond, while the remaining 20% expressed that they slightly coincided;
Searching for a term: as depicted in
Figure 16d, 70% said the results were consistent, and 30% said they slightly corresponded.
Considering the four searches carried out, on average 61.5% of users stated that the results coincided with their profiles and requests, 23.25% expressed that the results corresponded slightly, while 15.25% indicated that the results did not fit their profiles and requests. It can thus be concluded that the matching of contexts and requests allowed 84.75% of users to obtain at least partially relevant content in their first interactions. However, there is room for improvement through the implementation of a dynamic method for the analysis and matching of contexts. This will enable the expansion of the context, allowing a better understanding of the user’s needs and the characteristics of the web resources. Likewise, the integration of a dedicated search engine that operates considering the user’s context can be contemplated.
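The aggregate percentages reported above follow directly from averaging the per-search figures; the short sketch below reproduces that calculation from the values reported in this section.

```python
# Per-search percentages reported above: (match, slight match, no match)
results = {
    "image":    (60, 20, 20),
    "video":    (41, 23, 36),
    "web page": (75, 20, 5),
    "term":     (70, 30, 0),
}

n = len(results)
match  = sum(r[0] for r in results.values()) / n  # 61.5
slight = sum(r[1] for r in results.values()) / n  # 23.25
none   = sum(r[2] for r in results.values()) / n  # 15.25
print(match, slight, none, match + slight)        # 61.5 23.25 15.25 84.75
```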
7.4. Impact of Context on Web Search Results
This study compares the search results obtained using the prototype with the search results in a popular web browser. The objective is to evaluate how the user’s context and browsing history influence the results obtained. The following methodology was applied:
Definition of search requests: a set of four representative requests covering different types of content, such as images, web pages, videos, and specific terms, was selected;
User context consideration: to assess the impact of context on the results, each query was carried out under a different user profile. The profiles used are defined in
Table 9;
Participants: a total of four participants were part of this evaluation, each corresponding to one of the defined user profiles;
Searching: each request was executed twice:
– first using the prototype, which takes the user context into account to filter and sort the results;
– then through Google in incognito mode, as a standard search reference without explicit customization.
Users reviewed the first five results from each set and provided a subjective rating. They were asked to rate the match of the results with their profile using a scale of 1 to 3, where 1 indicated “no match”, 2 “slightly match”, and 3 “match”.
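As an illustration of how a per-request match rate can be derived from these ratings, the sketch below averages the 1–3 scores assigned to the first five results of a request for each tool; the rating values are hypothetical and are not the ones reported in Table 10 or Figure 17.

```python
# Ratings on the 1-3 scale (1 = no match, 2 = slightly match, 3 = match)
# given to the first five results of one request; values are illustrative.
prototype_ratings = [3, 3, 2, 3, 2]
google_ratings    = [2, 3, 2, 1, 2]

def average_match(ratings):
    """Average match rate over the first five results of one request."""
    return sum(ratings) / len(ratings)

print(average_match(prototype_ratings))  # -> 2.6
print(average_match(google_ratings))     # -> 2.0
```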
Table 10 shows the first five results obtained for the query “search for current music, please”, made by a user with Profile A, and their respective ratings.
Each user made a different query to rate the results. In particular, they requested:
Profile A: “Search for current music, please”;
Profile B: “What is chemistry?”;
Profile C: “I want to see the CINVESTAV website”;
Profile D: “Show me images of the Raiders”.
These requests allowed us to analyze the ability of each tool to respond to different types of searches.
Figure 17 shows the comparison of the average match rate of the five results per request obtained with Google and with our prototype, according to the evaluation of users with different profiles. It can be observed that our prototype outperforms Google for requests from users with profiles A and D, who searched for videos and images, respectively. In contrast, for the queries from profiles B and C, related to a specific term and a web page, respectively, both tools obtained similar ratings. These results suggest that our prototype offers competitive results, which are even better suited to the user profile.
The findings of this study come from a limited set of queries and user profiles, without applying formal metrics to evaluate the impact of context on web searches, so they constitute a first approximation to the behavior of our solution. It should be noted that this study was performed with an initial prototype. However, testing is currently underway with more queries, profiles, and quantitative and qualitative methodologies to further evaluate the impact of context on the search experience and the effectiveness in retrieving relevant content for the user.
8. Conclusions
This article proposed a context-based model for web browsing through voice, intended for people affected by the inequality caused by the digital divide, specifically people with some degree of illiteracy or digital illiteracy, older adults, and people with visual or motor disabilities. Unlike other solutions, such as virtual assistants and even common search tools, this proposal takes into account the user’s context, the web page’s context, and the search string to obtain the results that best fit the user’s interests.
The methodology used in this proposal is a hybrid one, combining the traditional waterfall methodology with an incremental development approach. Since our proposal seeks to facilitate web browsing, we relied on the analysis of related works and on the ISO/IEC/IEEE 29148:2018 and ISO 9241-11 standards to identify the functions and requirements of virtual assistants and web browsing tools. As a result, six main functions, ten functional requirements, and three usability requirements were identified.
To validate the proposed model, a prototype was developed in the form of an application for a mobile device (tablet) running the Android operating system. The prototype satisfied most of the requirements, fully offering five of the six basic browsing functionalities proposed. The only functionality that was not fully fulfilled was navigation within a web page using voice commands, as the prototype cannot interact with some objects that generate events, such as checkboxes or embedded media. However, it was possible to implement browsing between sections of a web page and vertical scrolling on the page.
The prototype was tested by users with different educational levels and different degrees of experience in using virtual assistants, in order to obtain a wider range of answers. The evaluation was carried out using two instruments to measure the level of user satisfaction with the results obtained in a search and the comfort perceived by users when using the prototype. In both cases, the results were quite positive. Also, a comparison of our prototype with the most popular web browser was carried out to analyze the impact of context on web search results. The results obtained suggest that our prototype offers competitive results that are even better suited to the user profile.
Also, the tests carried out revealed that users maintain a strong attachment to conventional browsing methods that involve the use of hardware, such as a computer mouse, keyboard, or touchscreen, to interact with devices. This bias was clearly reflected in their comments and suggestions during the prototype evaluation. Despite this deep-rooted preference, it is noteworthy that our prototype was well received by users, as suggested by the graphs of usability and satisfaction results.
It is important to mention that our proposal differs significantly from other tools that use voice interfaces, such as virtual assistants. While virtual assistants focus on providing a wide range of services and functions, from performing everyday tasks, such as setting reminders and sending messages, to interacting with smart devices in the home using preset commands, the proposal described in this article has a different approach. It seeks to provide content of genuine interest to the user, in audible and graphic format, by actively analyzing web pages and matching them with the user’s context and request. Additionally, the proposal includes the ability to browse results and web pages using flexible voice commands. This feature facilitates accessibility for those who are not familiar with current information technologies, which contributes to reducing the digital divide. The first prototype validated the requirements and functionalities for voice navigation, laying the groundwork for future prototypes that aspire to make a significant contribution to bridging the digital divide.