3. Text-Conditioned Tactile Graphics Generative Model
The transformation of semantic text descriptions into tactile representations builds upon theories of latent space partitioning and generative modeling. Specifically, methods such as BART and VQ-VAE have proven effective for mapping high-dimensional data into interpretable outputs suitable for tactile exploration. The developed model describes the process of converting text information into graphic information. To accomplish this, the embedded space of the transformer, which was formed during language modeling on the pretraining task, was divided into two independent embedded spaces, one for text and one for graphics, instead of a single shared space.
In this study, the developed model was trained using the parameters presented in Table 1 and Table 2 for the VQ-VAE and BART models, respectively. At the same time, the parameters of BART's text-embedded space remained the same as during language modeling.
Simultaneously, the parameters of BART's graphic embedded space were adjusted so that the dimension of the embedded space was equal to the size of the VQ-VAE's "codebook" and the dimensionality of the vectors of the graphic embedded space was equal to the dimensionality of the latent 2D vectors of the VQ-VAE model, calculated with the following equation:
$$ \dim(z) = \frac{W}{2^{n}} \times \frac{H}{2^{n}} \tag{1} $$
where $z$ is a latent 2D vector of the VQ-VAE; $W$ and $H$ represent the width and height of the image, respectively; and $n$ is the number of hidden layers.
Therefore, the latent vector of the VQ-VAE has dimensions of 8 × 8 (64 elements), and the sequence size of the BART decoder must match it.
Also, it is important to note that the size of the decoder’s dictionary and the size of the sequence are each increased by one unit compared to the original values. This adjustment is necessary to introduce an additional service token (i.e., BOS token), which is added at the beginning of the sequence to facilitate autoregressive image generation.
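The size bookkeeping described above can be illustrated with a short sketch. It assumes that each hidden layer halves the spatial resolution; the image size, number of layers, and codebook size used below are hypothetical placeholders, not the values from Table 1 or Table 2, and the function names are illustrative only.

```python
def latent_grid(width: int, height: int, n_layers: int) -> tuple[int, int]:
    """Side lengths of the latent grid, assuming each layer halves the resolution."""
    factor = 2 ** n_layers
    if width % factor or height % factor:
        raise ValueError("image size must be divisible by 2**n_layers")
    return width // factor, height // factor


def decoder_sizes(grid: tuple[int, int], codebook_size: int) -> tuple[int, int]:
    """Decoder sequence length and dictionary size, each enlarged by one for BOS."""
    seq_len = grid[0] * grid[1] + 1      # one slot per latent element, plus BOS
    vocab_size = codebook_size + 1       # one entry per codebook vector, plus BOS
    return seq_len, vocab_size


# Hypothetical values: a 64 x 64 image, 3 downsampling layers, 512 codebook entries.
grid = latent_grid(64, 64, 3)
seq_len, vocab_size = decoder_sizes(grid, 512)
print(grid, seq_len, vocab_size)  # (8, 8) 65 513
```

With an 8 × 8 latent grid the decoder therefore works with 64 graphic positions plus the BOS position.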
Before generating text tokens using the BPE [20] tokenization model, the original text components are normalized to achieve a uniform format, including converting all uppercase letters to lowercase.
Formally, the process of converting text tokens into graphic tokens using a text-conditional tactile graphics generation model is described in successive stages. The structural and functional diagram that depicts these stages is shown in Figure 1.
The first step is to generate a bounded sequence of text tokens based on a text prompt: $T = (t_1, \dots, t_{N_T})$, $t_i \in V_T$, where $T$ is a sequence of text tokens of size $N_T$, and $V_T$ is a dictionary of text tokens. If the generated sequence of text tokens is longer than $N_T$, it is truncated to this maximum size and the excess tokens are discarded. If it is shorter than $N_T$, it is extended to the maximum size by adding service 〈PAD〉 tokens, which do not affect the modeling result.
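The normalization and length-fitting steps can be sketched as follows. This is a minimal illustration operating on already tokenized ids, not the authors' preprocessing code: the 〈PAD〉 id, padding on the right, and the whitespace collapsing are assumptions, and the BPE tokenizer itself is left out.

```python
PAD_ID = 0  # hypothetical id of the <PAD> service token


def normalize(text: str) -> str:
    """Lowercase the prompt and collapse runs of whitespace (the latter is an assumption)."""
    return " ".join(text.lower().split())


def fit_length(token_ids: list[int], max_len: int, pad_id: int = PAD_ID) -> list[int]:
    """Truncate or right-pad a token sequence to exactly max_len entries."""
    clipped = token_ids[:max_len]                 # discard excess tokens
    return clipped + [pad_id] * (max_len - len(clipped))


print(normalize("A  Spotted BALL"))        # a spotted ball
print(fit_length([5, 6, 7], 5))            # [5, 6, 7, 0, 0]
print(fit_length([5, 6, 7, 8, 9, 10], 4))  # [5, 6, 7, 8]
```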
In the next step, the text tokens that form the sequence $T$ are mapped to vectors of the text-embedded space $E_T$, forming a subset of it:
$$ E'_T = (e_{t_1}, \dots, e_{t_{|T|}}), \quad e_{t_i} \in E_T \tag{2} $$
where $T$ is the sequence of text tokens; $E_T$ is the text-embedded space; and $e_{t_i}$ are elements of the text-embedded space. The elements $e_{t_i}$ reflect the semantic meaning of the text tokens in the embedded space.
Next, the vectors of the text-embedded space $E'_T$ are transformed by the transformer's bidirectional encoder, which consists of several layers, into the hidden states $H_E$. The bidirectionality of the encoder means that it analyzes the full context of each vector of the embedded space, considering both the preceding and the following elements of the sequence:
$$ H_E = \mathrm{Enc}(E'_T) \tag{3} $$
where $E'_T$ is the sequence of text-embedded vectors; $H_E$ is the hidden state of the encoder; and $\mathrm{Enc}$ is the transformer's encoding operation, defined in [17].
The hidden state of the encoder $H_E$ is then converted by linear layers and a nonlinear activation function to the hidden state of the decoder (i.e., graphic information), forming a subset of the graphic embedded space $E_G$:
$$ H_D = \sigma(\mathrm{Linear}(H_E)), \quad H_D \subset E_G \tag{4} $$
where $H_E$ is the hidden state of the encoder; $H_D$ is the hidden state of the decoder; $\mathrm{Linear}$ is a linear layer; and $\sigma$ is a nonlinear activation function.
At the next stage, the transformer's autoregressive [18] decoder is used. This means that the decoder generates one graphics token per iteration, taking into account the previously generated graphics tokens. Thus, during the decoding process, the model performs calculations based on the hidden state $H_D$ and the previously generated elements of the vector sequence of the graphic embedded space, $g_1, \dots, g_{i-1}$ or $g_{<i}$:
$$ g_i = \mathrm{Dec}(H_D, g_{<i}), \quad i = 1, \dots, N \tag{5} $$
where $g_i$ is the $i$-th element of the vector sequence of the graphic embedded space $G = (g_1, \dots, g_N)$; $g_{<i}$ are the previously generated vectors of the graphic embedded space; $H_D$ is the hidden state of the decoder; $N$ is the size of the final sequence $G$; and $\mathrm{Dec}$ is the transformer's decoding operation, defined in [17].
Decoding proceeds iteratively until the size of the sequence $G$ is equal to $N$ (i.e., the size of the latent space vector of the VQ-VAE model). Once the decoding is complete, the resulting sequence of vectors of the graphics-embedded space $G$ is converted by a linear layer and the $\mathrm{softmax}$ function into a sequence of probability distributions, and from each distribution the element with the highest probability is selected as the graphics token:
$$ y_i = \arg\max \, \mathrm{softmax}(\mathrm{Linear}(g_i)), \quad Y = (y_1, \dots, y_N) \tag{6} $$
where $Y$ is the generated sequence of graphic tokens of size $N$ and $g_i$ is an element of the vector sequence of the graphics-embedded space $G$.
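The autoregressive loop and the highest-probability selection can be sketched in a few lines. This is a toy illustration under stated assumptions, not the trained model: `step` is a hypothetical stand-in for the BART decoder together with the final linear layer (the conditioning on the decoder hidden state is hidden inside it), the dictionary has only four graphic tokens, and the token is selected inside the loop, as is usual at inference time.

```python
import math
from typing import Callable, Sequence

NUM_CODES = 4    # toy number of graphic tokens (the real value is the codebook size)
BOS = NUM_CODES  # BOS extends the dictionary by one entry


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(step: Callable[[list[int]], Sequence[float]], n_tokens: int) -> list[int]:
    """Generate n_tokens graphic tokens, one per iteration.

    step(prefix) returns one logit per graphic token, given the tokens
    produced so far (starting from BOS).
    """
    prefix = [BOS]
    for _ in range(n_tokens):
        probs = softmax(step(prefix))
        # Greedy choice: index of the largest probability.
        prefix.append(max(range(len(probs)), key=probs.__getitem__))
    return prefix[1:]  # drop the BOS token


def toy_step(prefix: list[int]) -> list[float]:
    """Toy scorer: favours (last token + 1) mod NUM_CODES."""
    target = (prefix[-1] + 1) % NUM_CODES
    return [2.0 if k == target else 0.0 for k in range(NUM_CODES)]


print(greedy_decode(toy_step, 6))  # [1, 2, 3, 0, 1, 2]
```

For the model described here, `n_tokens` would be 64, one token per element of the 8 × 8 latent grid.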
In the next step, a sequence of latent quantized vectors $Z_q$, defined by Formula (7), is formed from the graphic tokens (6). Each graphic token $y_i$ is the positional number of a quantized vector in the "codebook" of the VQ-VAE model:
$$ Z_q = (c_{y_1}, \dots, c_{y_N}), \quad c_{y_i} \in C \tag{7} $$
where $C$ is the set of latent quantized vectors, or "codebook"; $Z_q$ is a sequence of latent quantized vectors; $y_i$ is a graphics token; and $N$ is the size of the sequence of latent quantized vectors.
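The codebook lookup is a plain index operation, which the following sketch illustrates. The codebook contents, the vector dimensionality, and the row-major arrangement of the sequence into a square grid are assumptions made for the example; the function name is hypothetical.

```python
def lookup(tokens: list[int], codebook: list[list[float]], side: int) -> list[list[list[float]]]:
    """Replace each graphic token by its codebook vector and arrange the result
    as a side x side grid (row-major order is an assumption)."""
    if len(tokens) != side * side:
        raise ValueError("expected side * side tokens")
    vectors = [codebook[t] for t in tokens]   # token = position in the codebook
    return [vectors[r * side:(r + 1) * side] for r in range(side)]


# Toy codebook with three 2-dimensional vectors and a 2 x 2 latent grid.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
grid = lookup([2, 0, 1, 1], codebook, side=2)
print(grid)  # [[[2.0, 2.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
```

In the setting described here, `side` would be 8 and the resulting grid would be passed to the VQ-VAE decoder.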
The final step is the generation of tactile graphics from the sequence (7) with the VQ-VAE decoder:
$$ \hat{x} = \mathrm{Dec}_{VQ}(Z_q) \tag{8} $$
where $Z_q$ is the sequence of latent quantized vectors; $\mathrm{Dec}_{VQ}$ is the image-decoding operation based on the latent representation, defined in [19]; and $\hat{x}$ is the generated tactile image.
5. Results
5.1. Experiment
This section presents the results of the experiment through examples of tactile images generated by the developed model. The generation process was performed using the "greedy search" method, which selects the most probable graphic token at each decoding step to synthesize the corresponding image. These examples demonstrate the model's ability to produce varied tactile images from text prompts of different kinds. This text-prompt-based generation approach significantly enhanced the controllability of the process, making it both convenient and accessible, even for individuals without expertise in tactile graphics production.
The results of the experiment include samples of generated tactile graphics images based on various types of text prompts, such as monosyllabic prompts, prompts with numerals, and prompts with epithets. These samples are presented in Figure 2.
5.2. Comparison with Baseline
Table 4 presents a comparative evaluation of the developed text-conditioned tactile graphics generative model against other state-of-the-art models, including MidJourney, Stable Diffusion, DALL-E, and DALL-E 2. Each model was assessed based on its text interpretation and image generation mechanisms, as well as the resulting output quality and suitability for tactile graphics.
During the experiment, a consistent text prompt, “coloring page for kids, black outline, white background, tactile graphics, a tree, cartoon style, very low detail, no shading”, was used for MidJourney V6.1, Stable Diffusion 3.5, DALL-E, and DALL-E 2. For our model, a simplified prompt, “a tree”, was used, as it relies on direct semantic mapping from text to tactile graphics. The following generation parameters were applied: temperature = 0.75 and top-k = 3.
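For readers unfamiliar with the two generation parameters, the following sketch shows how temperature scaling and top-k filtering are commonly combined when sampling a single token. It is a generic illustration, not the sampling code of any of the compared systems; the text does not specify which of the models these parameters were applied to, and the function name is hypothetical.

```python
import math
import random


def sample_top_k(logits, temperature=0.75, k=3, rng=random):
    """Sample one token index from the k highest-scoring logits after temperature scaling."""
    if temperature <= 0 or k < 1:
        raise ValueError("temperature must be positive and k at least 1")
    scaled = [v / temperature for v in logits]   # temperature < 1 sharpens the distribution
    # Indices of the k largest scaled logits.
    top = sorted(range(len(scaled)), key=scaled.__getitem__, reverse=True)[:k]
    m = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - m) for i in top]   # unnormalized softmax over the top k
    return rng.choices(top, weights=weights, k=1)[0]


random.seed(0)
print(sample_top_k([1.0, 2.0, 3.0, -5.0, 0.0]))  # one of the indices 0, 1, 2
```

Setting `k=1` reduces this procedure to the greedy choice of the most probable token.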
Our proposed model, which integrates a VQ-VAE with a BART Transformer, demonstrated clear advantages in generating simplified and accessible tactile graphics. Unlike the outputs from other models, which prioritize photorealism or complex visual textures, our approach produces clean, interpretable shapes optimized for tactile exploration. Notably, models like MidJourney and Stable Diffusion rely on CLIP-based image generation, which, while effective for general visual synthesis, lacks the precision needed for tactile requirements.
The outputs generated by DALL-E and DALL-E 2, which incorporate VQ-VAE and diffusion techniques, respectively, often include extraneous details that complicate tactile interpretation. In contrast, our method prioritizes simplicity and clarity, ensuring that the generated outputs align with accessibility standards for visually impaired users.
5.3. Expert Assessment
The developed software was tested at two enterprises operating in different fields: the Levenya Educational and Rehabilitation Center (ERC “Levenya”) and the Publishing House of Lviv Polytechnic. During the tests, a series of text prompts describing the desired graphical output were provided as input to the program. Upon completion, expert evaluations were conducted by specialists at the respective organizations to assess the quality of the synthesized tactile graphics. The experimental results demonstrated that the synthesized images accurately matched the desired content and exhibited appropriate quality. Based on expert assessments, it was concluded that the software could be recommended for producing tactile illustrations for educational materials and other publications aimed at people with visual impairments.
The process of producing tactile images synthesized by the program on specialized heat-sensitive capsule paper was also evaluated separately. A "PIAF" tactile graphics printer, commonly used by individuals with visual impairments in educational, professional, and personal contexts, was employed to reproduce the synthesized images. According to the printing results (presented in Figure 3), specialists at the Levenya Educational and Rehabilitation Center determined that the synthesized tactile graphics met the required qualitative and technical standards, making the software suitable for further use in educational materials and similar publications for people with visual impairments.
During testing, experts also noted the significantly faster production speed of tactile graphics when using the developed software compared to traditional manual methods. Moreover, the text-prompt-driven synthesis process was praised for its user-friendly approach, as it does not require operators to possess advanced expertise in tactile graphics production. This feature enables the creation of images with diverse content while maintaining ease of use.
Furthermore, experts highlighted the software’s advantages over other automated systems for synthesizing tactile graphics. Unlike systems that rely on additional input data (e.g., programs that convert photos into tactile images), the developed software requires only a textual description, significantly simplifying the preparation of materials for production. This convenience, combined with its high efficiency, positions the software as a valuable tool for creating tactile graphics in various domains.
5.4. Quantitative Metrics
The evaluation results are shown in Table 5. The CLIP score, calculated according to Formula (14), was 23.7 for the obtained model. At first glance, this result may seem low; however, it has an explanation. The CLIP model used to calculate the CLIP score was designed for general-purpose applications and was trained on a diverse range of image samples, including realistic and graphically complex images. As a result, the CLIP model, at least in its publicly available versions, is not well suited for calculating CLIP scores on simple graphic images such as tactile graphics.
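To make the scale of this value easier to interpret, the sketch below computes a CLIP score from two embedding vectors. It assumes the commonly used definition (cosine similarity between the image and text embeddings, clipped at zero and multiplied by a scaling factor); Formula (14) is not reproduced in this section, so the exact form and the factor of 100 are assumptions, chosen because the reported value of 23.7 is on that scale. The embeddings are toy vectors, not outputs of a CLIP model.

```python
import math


def clip_score(image_emb, text_emb, scale=100.0):
    """Scaled cosine similarity between two embeddings, clipped at zero."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm = math.sqrt(sum(a * a for a in image_emb)) * math.sqrt(sum(b * b for b in text_emb))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return scale * max(dot / norm, 0.0)


print(clip_score([1.0, 0.0], [1.0, 0.0]))   # 100.0 (identical directions)
print(clip_score([1.0, 0.0], [-1.0, 0.0]))  # 0.0 (negative similarity is clipped)
```

Under this assumed definition, a score of 23.7 corresponds to a cosine similarity of about 0.24 between the image and text embeddings.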
6. Discussion
The current training dataset consists of relatively simple images, such as animals, plants, and basic objects. However, one limitation of the model lies in its potential difficulty in scaling to more complex images, such as those with intricate details (e.g., architectural blueprints or detailed scientific diagrams). The model’s ability to capture fine-grained details is constrained by the size of its latent space and the number of hidden layers in the VQ-VAE architecture. Generating complex tactile graphics may require a more detailed representation, which could introduce inefficiencies or inaccuracies without modifications to the model architecture.
Furthermore, while the model performs well with simpler prompts (e.g., “a cat” or “a tree”), it faces challenges when handling more complex and nuanced prompts (e.g., “a group of children playing soccer with a spotted ball”). As the semantic complexity and length of the prompt increase, the transformer’s encoding of textual information becomes more demanding. This can result in difficulties disentangling and representing all elements of a complex scene in tactile graphics form, potentially leading to information loss or oversimplification.
In terms of computational requirements, training the proposed model—which combines the BART Transformer and the VQ-VAE—requires substantial resources. The autoregressive nature of the model and the dual processing of textual and graphical latent spaces make training computationally expensive. This process demands powerful GPUs or TPUs, significant memory capacity, and extended training times, particularly as the dataset size increases. Scaling the model to handle larger datasets or higher-dimensional image outputs poses additional challenges without access to advanced computing infrastructure.
The hyperparameters listed in Table 3 were chosen based on a combination of empirical tuning and best practices from prior studies in related fields. The batch size, in particular, was set to 2 due to the high dimensionality of the inputs and the computational constraints of the hardware used during training. While larger batch sizes are often preferred for their ability to improve gradient estimation stability, our experiments showed that the chosen configuration achieved acceptable performance without overloading available resources.
We acknowledge the potential of more advanced hyperparameter optimization techniques, such as Bayesian optimization, to explore and identify more effective configurations. These methods could enable the model to converge faster or achieve improved performance by systematically searching a broader parameter space. While the current study focused on manual tuning to balance resources and performance, integrating automated optimization tools in future work represents a promising direction for further improving the model’s effectiveness.
Nevertheless, while the training process of the proposed model is computationally intensive and time-consuming, due to the high dimensionality of the inputs and the complexity of the machine learning process, the inference phase remains efficient. This is achieved through the compact latent space of the VQ-VAE (8 × 8 dimension), which minimizes the computational load during decoding. Additionally, the autoregressive BART decoder generates graphical tokens iteratively, but the manageable size of the latent space (64 elements) ensures that the process remains lightweight and fast. The model’s focus on producing simplified and clear tactile graphics, rather than photorealistic images, further optimizes inference time, making it practical for real-world applications on resource-constrained devices.
Ethical considerations are paramount in the development of tactile graphics to ensure that the generated images accurately represent the intended information. For visually impaired users, tactile graphics serve as a primary means of interpreting visual content. Any distortion or inaccuracy in the generated graphics could lead to misunderstandings. For instance, oversimplification or omission of critical details in a tactile graphic could provide an incomplete or misleading representation. To address this concern, it is essential to rigorously validate model outputs against established standards for tactile graphics. Additionally, soliciting feedback from visually impaired users is crucial to ensure that the tactile representations are both accurate and comprehensible.
Currently, effective methods for assessing the quality of tactile graphics include expert evaluations, focus groups, and user testing. However, we believe it is crucial to explore the development of an automated tool for assessing the quality of tactile graphics. Such a tool could, for instance, function as a discriminator within a GAN architecture, providing an objective and scalable approach to evaluation. Developing such a tool is challenging, as it must address both the technical specifications and the quality standards, according to which tactile graphics should have the following properties:
Legible: The convexity of forms, protrusions of points, signs, lines, and textures defining highlighted surfaces must be easily recognizable by touch;
Understood: The reader must clearly grasp the idea conveyed by the author through the illustration;
Substantial: The illustration must correspond to and enhance the text it accompanies;
Attractive: The illustration should feel pleasant to the touch and evoke interest. Such images encourage visually impaired readers to engage with them, even if it requires effort;
Continuous: The reader should seamlessly follow the graphic element without losing their place or searching for the next tactile component;
Useful: Illustrations must convey meaningful information to the user. Decorative elements that merely enhance aesthetic value should be avoided, as they can interfere with readability.
7. Conclusions
Our proposed method introduces a novel approach to generating tactile graphics directly from textual descriptions, leveraging a Bidirectional and Auto-Regressive Transformer (BART) and a Vector Quantized Variational Auto-Encoder (VQ-VAE). By redefining the latent space architecture to separate textual and graphical embeddings, the model ensures precise semantic mapping from text to tactile graphics. This innovation eliminates the reliance on visual input, distinguishing our work from existing methods like Pic2Tac, which depend on photographs or predefined tactile libraries.
Additionally, our adaptation of VQ-VAE optimizes the generated outputs for tactile accessibility by prioritizing legibility and simplicity over photorealistic detail. The end-to-end text-to-tactile pipeline significantly reduces production time and expands accessibility use cases, enabling the automated generation of educational materials for visually impaired users. These contributions establish our approach as a transformative solution in tactile graphics generation, addressing key limitations of prior methods while demonstrating real-world applicability.
This technology has the potential to bridge the accessibility gap in educational materials, enabling visually impaired individuals to better engage with subjects that rely heavily on visual content, such as science, mathematics, and geography. By providing automated tactile graphics, the technology promotes greater independence in learning and supports participation in inclusive classrooms and professional settings.
A key direction for future research involves expanding the size and diversity of the training dataset. This will enhance the model’s generalization capabilities and ensure its stable performance across a broader range of scenarios.