1. Introduction
Still images have long served as a primary medium for communicating visual ideas and emotions, from the earliest cave paintings to contemporary digital artwork. Even without actual motion, certain images possess a powerful sense of “movement” or dynamism—an impression conveyed purely through compositional elements such as line, color, shape, and light, which we call compositional dynamism and study in more detail below. While art historians have long debated the roots and mechanisms behind this sensation of movement, the recent emergence of generative text-to-image models introduces a new perspective: machines that can produce novel, often highly expressive visuals from text prompts. As these models become increasingly prevalent in both commercial and creative spheres, it is essential to understand whether and how they encode aesthetic constructs that humans have developed over centuries.
From a visual semiotic and art historical perspective, whether static images can truly convey motion has always been a question of interest. Traditionally, painting and sculpture have been cast as “arts of space”, seemingly unable to capture the successive events central to “arts of time” like poetry or music [1,2]. However, scholars such as Groupe µ and Petitot argue that still images simulate temporality by relying on compositional strategies—diagonals, overlapping forms, or blurred outlines—that imply movement [3,4,5]. These insights underscore the tension between the material fixity of a two-dimensional plane and the viewer’s perception of energy or duration, suggesting that even static artworks can gesture toward narrative or process-based dynamism.
One of the foundational theories of visual style and composition is derived from Heinrich Wölfflin’s Principles of Art History [6,7], which outlines five pairs of opposing categories to distinguish the so-called classical mode of depiction from the Baroque. Wölfflin’s Baroque category, in particular, is often associated with diagonal lines, dramatic lighting, and intertwined shapes that create a heightened sense of movement or energy. His framework has proven to be enduring in art historical discussions, influencing how scholars interpret painting, sculpture, and architecture. However, Wölfflin’s categories originated in early twentieth-century Europe, well before the advent of digital art and artificial intelligence, prompting a debate about the universality and adaptability of his insights when facing new media and new cultural contexts.
Generative text-to-image models, such as those that leverage diffusion or generative adversarial network architectures, draw from large-scale image databases that span multiple epochs, styles, and cultural sources. This situation raises the following key questions: To what extent do these AI models internalize compositional strategies that we might label as “classical” or “baroque”? Can they replicate the sense of dynamism that art historians identify in a Bernini sculpture or a Caravaggio painting? And do they inadvertently introduce new forms of dynamism beyond what Wölfflin could have foreseen? While early demonstrations of generative art have been both celebrated and critiqued, systematic research on how these models portray concepts like movement and compositional tension remains limited.
Contributions. This study, therefore, aims to bridge Classical art historical theory with contemporary computational tools. First, we propose an operational definition of “dynamism” in still images, rooted in Wölfflin’s five oppositions. Next, we explore how text-to-image models respond to prompts invoking either “classical” or “baroque” motifs through expert evaluations performed blindly and then holistically. Finally, we examine how well these generative systems capture Wölfflin’s nuanced compositional differences when faced with varying levels of stylistic specificity in the prompt (including the absence of style labels or the introduction of certain genres). By combining historical, theoretical, semiotic, and computational approaches, we hope to offer fresh insights into the longevity of Wölfflin’s principles, as well as to shed light on the evolving landscape of AI-driven art.
4. Results and Discussion
We present and discuss the results of the five experiments described in Section 3. Then, we report the outcome of the general discussion between the experts about the generated images, as described in Section 3.3.
4.1. Analysis of Experiment 1 Results
Figure 3 and Figure 4 show the images produced for this experiment, and Table 1 shows the average Likert scores from our two expert evaluators (E1 and E2) for these images, as detailed in Section 3.2.1. Each score ranges from 1 (strongly Classical) to 5 (strongly Baroque) and covers a general impression plus the five Wölfflin dimensions: (1) General Feeling; (2) Linear vs. Painterly; (3) Planar vs. Recessional; (4) Closed vs. Open Form; (5) Multiplicity vs. Unity; and (6) Clearness vs. Unclearness.
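To make this aggregation concrete, the minimal Python sketch below shows one way the per-dimension averages reported in Table 1 could be tabulated; the dictionary layout, the helper function average_scores, and the rating values are illustrative assumptions rather than the actual data or code behind our tables.

```python
from statistics import mean

# Hypothetical storage: (expert, model, prompt condition, dimension) -> the
# 1-5 Likert scores given to the individual images of that condition.
# The values below are illustrative, not the actual expert ratings.
ratings = {
    ("E1", "DALL-E", "baroque", "General Feeling"): [4, 5, 4, 4],
    ("E1", "DALL-E", "classical", "General Feeling"): [2, 1, 2, 2],
    # ... remaining expert/model/condition/dimension combinations ...
}

def average_scores(ratings):
    """Average the per-image Likert scores (1 = strongly Classical, 5 = strongly Baroque)."""
    return {key: round(mean(scores), 1) for key, scores in ratings.items()}

for key, avg in average_scores(ratings).items():
    print(key, avg)
```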
A first observation is that E1 perceives a marked contrast between Baroque and Classical prompts for both models, with DALL•E Baroque consistently scoring above 3.3 on all dimensions (notably 4.4 for clearness), while DALL•E Classical remains below 2.7 (notably 1.7 for linear vs. painterly). A similar gap appears in Midjourney for E1, where Baroque prompts score above 4.0 in 4 out of 6 cases, indicating a strong sense of dynamism or drama, while Classical images tend to hover near 2.0. This suggests that, from E1’s perspective, both models can generate images that align with a Baroque or Classical aesthetic when prompted, with a relatively clear distinction between the two.
E2’s ratings paint a more complex picture. For DALL•E, E2 likewise distinguishes Baroque (3.0–3.6) from Classical (2.0–2.3), though the numerical gap is narrower than with E1. Meanwhile, Midjourney prompts show a surprising overlap. Classical outputs are often rated as high as 3.2–3.8, on par with or exceeding Baroque prompts. This outcome indicates greater ambiguity in how E2 interprets the compositional cues of Midjourney images, implying that the model’s “Classical” or “Baroque” results may incorporate traits (such as high contrast or merged forms) that this expert finds harder to categorize.
Comparing the Neutral condition across both experts reveals additional insights. For E1, DALL•E’s neutral images generally cluster around 2.3, barely differing from the Classical ratings, whereas Midjourney’s neutral outputs revolve around 3.2, with ratings that always fall between the Classical and Baroque ratings, indicating a clear distinction between the three regimes. E2 sees DALL•E’s neutral images as even more Classical (scores as low as 1.4 for Planar vs. Recessional) than the Classical images, yet finds Midjourney’s neutral prompts frequently match or exceed Baroque-level ratings (e.g., around 3.8 in general feeling). These findings suggest that Midjourney’s default visual style, at least as perceived by E2, tends to be significantly more dramatic or dynamic than DALL•E’s, potentially overshadowing the explicit style instructions of “baroque” or “classical”.
In summary, both experts perceive DALL•E’s “classical” and “baroque” conditions as relatively well separated, reinforcing the notion that DALL•E applies different compositional cues for these two styles. At the same time, for DALL•E, both E1 and E2 note that “classical” and “neutral” prompts score quite similarly, suggesting that DALL•E’s default (neutral) output does not diverge much from a Classical baseline. In contrast, Midjourney presents a more varied picture. E1 sees distinct groupings among the Classical, Baroque, and neutral categories, indicating three recognizable stylistic modes. However, E2’s ratings reveal extensive overlap across these prompts, often assigning higher (i.e., more Baroque-like) scores overall. This discrepancy implies that Midjourney’s underlying aesthetic biases, combined with E2’s interpretative frame, blur the lines between Classical and Baroque or even neutral. As such, Midjourney images can appear significantly more dramatic or “baroque” by default, but the distinction from intentionally prompted Baroque outputs is less pronounced for E2. Taken together, these observations underscore the interplay between a model’s default stylistic tendencies and an evaluator’s experience, ultimately shaping how faithfully generative outputs reflect canonical notions of Classical and Baroque compositions.
4.2. Analysis of Experiment 2 Results
Figure 5 and Figure 6 show the images produced for this experiment, and Table 2 reports the mean ratings from our two expert evaluators (E1 and E2) for these images, with the same conventions as previously.
Focusing on E1’s results, we observe that the difference between DALL•E’s Baroque and Classical portrait prompts is far less pronounced than in Experiment 1. In most Wölfflin categories, their scores cluster around 2.0–3.0, with the neutral condition again aligning closely with Classical rather than standing apart on its own. This indicates that once the prompt specifies a portrait, DALL•E no longer clearly separates Baroque from Classical as it did before. For Midjourney, E1’s ratings remain uniformly low across all three style conditions, with only one dimension (Clearness vs. Unclearness for Baroque) slightly exceeding 3.0. While Baroque scores tend to be marginally higher than Classical in most categories, the general feeling dimension is actually lower for Baroque than Classical, and neutral emerges with the highest overall impression. These unexpected outcomes highlight how constraining the model to a portrait subject may dilute or even invert the stylistic cues observed in Experiment 1, blurring Baroque and Classical distinctions for E1 in Midjourney’s outputs.
E2’s evaluations reinforce the tendency for DALL•E’s Baroque portraits to score marginally higher on Baroque cues than its Classical or neutral counterparts, but the numerical gaps are small. All scores but two lie within the 2.6–3.2 range, implying that DALL•E’s portrait-based outputs do not strongly diverge across Classical or Baroque conditions for this evaluator. Midjourney’s results for E2 reveal more striking anomalies: “Classical”, with all scores above 3.0, is consistently rated higher than “Baroque”, suggesting that the “classical image” prompt may introduce dramatic forms that E2 finds quite Baroque in spirit. Moreover, the “Neutral” prompt occasionally yields equally elevated scores (3.8 in general feeling), underscoring how Midjourney’s aesthetic biases—and this evaluator’s interpretive framework—can blur the lines between what is nominally Classical, Baroque, or neither. The score difference between DALL•E and Midjourney is also reduced for E2 here, adding to the overall uncertainty surrounding this experiment.
Taken together, these observations suggest that imposing the portrait genre on prompts further diminishes the style contrasts in certain cases, particularly for Midjourney. Although DALL•E continues to show a moderate Baroque–Classical gap for E1, both experts find that Classical and neutral conditions are often interchangeable for DALL•E, and E2 sees only subtle differences overall. Meanwhile, Midjourney exhibits strong variability: E1 perceives a relatively even “flattening” of style across all three prompts, whereas E2 frequently considers Midjourney’s Classical and neutral outputs more dramatic than the Baroque ones. The net effect is that, once constrained by a portrait subject, neither model consistently translates “baroque” or “classical” into clearly distinct compositional treatments, reinforcing the idea that recognizable portrait conventions may overshadow the intended historical style cues.
In order to generalize these observations beyond the “portrait” genre alone, we conducted the same experiment with “landscapes” and “still lifes”, which are two other main topics in art history. The images are provided in Appendix A. Our experts did not rate them because, after a quick review, they concluded that the same observations as for portraits applied; for both DALL•E and Midjourney, landscapes and still lifes are centered and classically composed, with regular closed forms contained within the frame, except for the lighting, which is often Baroque-like. We therefore focused our efforts on the next set of experiments.
4.3. Analysis of Experiment 3 Results
Figure 7 and Figure 8 show the images produced for this experiment, and Table 3 reports the mean ratings from our two expert evaluators (E1 and E2) for these images, with the same conventions as previously.
Focusing first on E1, DALL•E Baroque tends to exceed DALL•E Classical by roughly one to two points in all categories, clearly separating the two styles. This pattern is also observed with Midjourney, whose Baroque ratings approach or exceed 3.0 in all columns, implying that Midjourney amplifies the compositional cues E1 associates with Baroque. Both models yield relatively low or moderate values under the “classical” label, with most dimensions falling below 2.5 for E1.
E2’s ratings draw attention to different elements. DALL•E Baroque (3.6–3.9 across most columns) remains slightly higher than DALL•E Classical (2.3–3.5), but the gap narrows in dimensions like Planar vs. Recessional or Closed vs. Open Form. Meanwhile, Midjourney Baroque still draws the highest scores (3.7–4.0), highlighting its robust Baroque impression. Curiously, though, “MidJ Classical” reaches 3.0 or more in several columns (Linear vs. Painterly, Clearness vs. Unclearness), indicating that E2 finds elements of strong contrast or merged forms even under the “classical-oriented” label.
Overall, the data suggest that both experts consistently rate “baroque-oriented” prompts higher than “classical-oriented” prompts in each model, with Midjourney often producing especially dramatic outputs that rely more strongly on Baroque traits. However, E2’s ratings reveal smaller Baroque–Classical gaps in DALL•E and relatively elevated scores for “Midjourney Classical”, hinting at underlying differences in how each expert interprets compositional cues. These observations reinforce previous findings that while generative models do respond to Baroque vs. Classical prompts, the resulting stylistic separation remains imperfect and often depends on the evaluator’s experience. They also show that, despite the more “abstract” nature of the generated images, Wölfflin’s characteristics, when explicitly prompted, can still be transposed by the generative models in a distinctive way, although they might be less relevant in particular cases.
4.4. Analysis of Experiment 4 Results
Figure 9 and Figure 10 show the images produced for this experiment, and Table 4 reports the mean ratings from our two expert evaluators (E1 and E2) for these images, with the same conventions as previously.
For E1, DALL•E shows a noticeable gap between dynamic (ratings of 2.7–3.4) and static (1.7–2.4), implying that “dynamic” produces images perceived as moderately more Baroque in the Wölfflin dimensions, except for clearness. However, Midjourney’s dynamic and static prompts do not differ significantly for E1, with nearly identical ratings in all categories. Importantly, this suggests that Midjourney’s outputs remain relatively similar regardless of whether the prompt requests a dynamic or static painting. To some extent, this aligns with the observations of D’Armenio et al. [37], who note that Midjourney consistently aims to beautify its images based on an internalized and stereotyped notion of what constitutes a painting with an aestheticizing ambition.
E2 perceives an even stronger distinction between dynamic and static in DALL•E, assigning scores above 3.0 for most categories of DALL•E dynamic, whereas DALL•E static hovers largely below 2.5, except again for clearness. This pattern indicates that the “dynamic” instruction leads DALL•E to generate paintings that E2 interprets as distinctly more Baroque. By contrast, Midjourney again appears less sensitive to these instructions for E2, with both dynamic and static prompts rated at or near 4.0 on several dimensions, indicating a consistently high Baroque-like feel, no matter whether the prompt specifies motion or stillness.
A notable point for both experts is that DALL•E’s dynamic prompts yield comparatively low clearness scores. Meanwhile, Midjourney’s clearness remains nearly the same for dynamic and static in both experts’ ratings, suggesting that Midjourney’s “dynamic” approach does not necessarily sacrifice clearness. Overall, these observations imply that DALL•E responds to “dynamic” vs. “static” with a stronger compositional shift than Midjourney does. E2, in particular, sees DALL•E dynamic as distinctly more Baroque-leaning than DALL•E static, whereas Midjourney’s outputs remain highly Baroque-like in both conditions, limiting the impact of these prompts.
Importantly, the experts also noted that most images were largely “abstract”, and found it challenging to apply Wölfflin categories in this setting. Notably, they reported that this demanded strong focus because it is unusual to rate the Baroqueness of abstract images, although such images benefit from this analysis as well. They also underlined the idea that Wölfflin categories, while relevant for Classical vs. Baroque art and a large portion of art history, require considerable mental effort to transpose to more recent (especially abstract) compositions. This observation applies not only to images generated by AI models but also to several artistic movements of the last century. They suggested that novel categories might need to be proposed to refine the task of analyzing paintings or images in modern contexts. This need is reinforced by experiments such as the one presented above, as generative AI models can easily produce images that defy existing categories while drawing inspiration from existing human creations. The ease with which many such images can be generated makes this important yet possibly overlooked point apparent.
In addition, when examining Table 1 alongside Table 4, the magnitude and consistency of stylistic shifts differ across models and experts. In Experiment 1, Baroque vs. Classical prompts can yield relatively large gaps (e.g., DALL•E for E1 moves from 1.8 to 4.1 in General Feeling), whereas in Experiment 4, dynamic vs. static prompts generally show milder differences for the same expert–model pairing. For instance, DALL•E with E1 ranges only from 1.9 to 2.7 under static and dynamic prompts, a narrower spread than between Classical and Baroque. However, for E2, the situation can be the reverse; sometimes “dynamic vs. static” produces gaps as large as or larger than “baroque vs. classical”. Midjourney also exhibits peculiar behavior; in Experiment 4 it may conflate Baroque and Classical, yet in Experiment 1 some conditions appear strongly “baroque” or “classical” by default. Overall, the pattern suggests that specifying “baroque” or “classical” often triggers greater compositional separation for certain model–expert pairs, but “dynamic vs. static” can at times equal or exceed that effect, depending on each evaluator’s interpretive framework and each model’s baseline generative style.
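To illustrate this kind of gap comparison, the short Python sketch below computes the signed difference between two prompt conditions on a given dimension; the dictionary layout and the gap helper are hypothetical conveniences, and only the General Feeling values quoted above (1.8/4.1 and 1.9/2.7) are taken from the text.

```python
# Hypothetical tabulation of the averaged scores discussed above; only the
# General Feeling values quoted in the text (1.8/4.1 and 1.9/2.7) are real.
table1 = {("E1", "DALL-E", "classical", "General Feeling"): 1.8,
          ("E1", "DALL-E", "baroque", "General Feeling"): 4.1}
table4 = {("E1", "DALL-E", "static", "General Feeling"): 1.9,
          ("E1", "DALL-E", "dynamic", "General Feeling"): 2.7}

def gap(table, expert, model, cond_a, cond_b, dimension):
    """Signed rating gap between two prompt conditions on one Wölfflin dimension."""
    return round(table[(expert, model, cond_a, dimension)]
                 - table[(expert, model, cond_b, dimension)], 1)

# Baroque-Classical gap (Experiment 1) vs. dynamic-static gap (Experiment 4).
print(gap(table1, "E1", "DALL-E", "baroque", "classical", "General Feeling"))  # 2.3
print(gap(table4, "E1", "DALL-E", "dynamic", "static", "General Feeling"))     # 0.8
```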
4.5. Analysis of Experiment 5 Results
Figure 11 and Figure 12 show the images produced for this experiment, and Table 5 reports the mean ratings from our two expert evaluators (E1 and E2) for these images, with the same conventions as previously.
Focusing on E1, DALL•E’s static portrait lands mostly in the 1.6–2.7 range, suggesting a Classical feel, while its dynamic portrait rises to 2.4–3.5 (especially for Closed vs. Open Form and Clearness vs. Unclearness), indicating a modest but consistent shift toward more Baroque-like traits. Midjourney exhibits smaller gaps between static and dynamic prompts, as in the previous experiment; for instance, the static portrait hovers around 2.1–2.9, whereas dynamic remains just slightly higher in General Feeling (3.1 vs. 2.4) but otherwise shows minimal differences across dimensions. This indicates that Midjourney again does not make a significant difference between the prompts, even when the portrait genre is specified.
E2 sees a slightly sharper distinction for DALL•E, rating its static portrait below 2.0 in most columns (strongly Classical) but pushing the dynamic version nearer 3.0 for several dimensions. By contrast, Midjourney’s static and dynamic prompts yield similarly elevated scores (roughly 3.1–4.0), implying a generally “baroque” bent for both conditions—even “static” is perceived as relatively dramatic or merged. This high baseline for Midjourney’s portraits dilutes any compositional shift from “static” to “dynamic”, echoing prior observations that Midjourney’s default aesthetic often appears dynamic or high-contrast regardless of the prompt.
Overall, specifying “dynamic” versus “static” in a portrait context drives a clearer stylistic separation in DALL•E, whereas Midjourney remains fairly consistent—and fairly Baroque—across both prompts, particularly according to E2’s evaluations.
A direct comparison of Table 2 and Table 5 reveals that the relative impact of “dynamic vs. static” versus “classical vs. baroque” depends markedly on both the model and the expert, as already observed in the comparison of Experiments 1 and 4. For DALL•E, E2 observes a larger stylistic gap in Experiment 5 (dynamic–static) than in Experiment 2 (Classical–Baroque), while E1 perceives similar-sized distinctions in both experiments. In Midjourney, E1 sees a more pronounced difference for dynamic–static prompts than for Classical–Baroque, but E2 experiences the opposite, finding only minimal variation between Midjourney’s dynamic and static prompts, yet a stronger (albeit partially reversed) gap for Classical–Baroque. These contrasting outcomes suggest that neither static/dynamic nor Classical/Baroque categorization consistently dominates; rather, each model’s inherent style biases, combined with individual evaluators’ interpretive frames, shape how effectively a given prompt drives compositional change in portrait scenarios.
4.6. Expert Discussions After Holistic Viewing
Once the experts had completed their blind, randomized ratings for each experiment, we invited them to examine all images from a given prompt condition en bloc, with full knowledge of which prompt generated which image, as described in Section 3.3. This open review prompted a series of observations and reflections, which are summarized below.
Genre constraints vs. Baroque/Classical cues. When a specific genre (portrait, still life, or landscape) is requested, many outputs lean towards Classical, irrespective of nominally “open” framing. In Midjourney, for instance, figures or objects remain rigidly centered, impeding a genuinely open-form impression. Specifying the genre also imposes a characteristic compactness—a constant figure–ground relationship typical of portraiture, but transposed to other genres. Even when the prompt states “baroque”, only one Wölfflin trait (Unclearness) consistently appears; faces or vases in the center retain closed lines and shapes. By contrast, if we remove explicit mention of the genre and simply say “baroque”, the outputs become visibly more “baroque”, often with interlaced bodies—perhaps a stereotypical association for the AI. Conversely, if we omit the term “baroque” yet retain the reworked Wölfflin descriptors, the images skew toward abstract compositions. This suggests the model ties “baroque” strongly to figurative entanglements (as in historical seventeenth-century Baroque).
Thematic vs. compositional tensions. Multiple prompts show an intriguing divergence between theme and layout: Baroque “themes” (swirling figures and rich ornamentation) might still adhere to a strictly Classical structure (e.g., a perfectly centered circle). Sometimes, images deemed “classical” in a broader sense earn Baroque-like ratings on clearness, perhaps due to dramatic lighting or high contrast. When we prompt for the Classical style without using that word explicitly, certain architectural references reminiscent of Brunelleschi emerge, capturing geometric Renaissance perspectives. However, labeling a prompt “Renaissance painting” (not shown but performed in other side experiments) often triggers feminine portraits that do not obviously match canonical Renaissance aesthetics (except perhaps Botticelli’s); Midjourney’s “classical paintings” also privilege idealized female beauty. Thus, the AI seems to map Renaissance or Classical onto specific themes rather than purely compositional elements.
Viewing all images together. After reviewing an entire batch of images side by side, the experts noted that their initial blind impressions occasionally shifted. For instance, aspects like Clearness might have been rated higher (e.g., near 4) had they seen the collective corpus from the start. In “A painting of a portrait” prompts, Midjourney often supplies female subjects regardless of “classical”, “neutral”, or “baroque”. If the label says “baroque”, the costuming or decorative details may indeed appear very Baroque (e.g., with typical and historically important ruffs or millstone collars), yet the composition stays rigidly centered—Classical in structure. Meanwhile, “classical” prompts display simpler, timeless clothing, and these subtleties are more evident when the images are grouped rather than intermixed randomly.
Long descriptions and dynamic/static prompts. When we used extended textual descriptions for the Classical style, DALL•E frequently produced coherent Renaissance-like architecture, channeling Brunelleschi’s experimentation and “windows onto the landscape”. This is interesting because the window was a central device in theoretical reflections on painting during the Renaissance; it was the instrument that enabled the development of perspective, a foundational shift in artistic representation. The Baroque prompts (in both DALL•E and Midjourney) leaned partly abstract yet still hinted at Baroque motifs. Interestingly, Midjourney “long description Baroque” often yielded purely abstract forms, whereas simply saying “baroque” triggered more historically recognizable details. In the “dynamic vs. static” scenarios, DALL•E sometimes resembled Baroque in lighting and clutter yet remained Classical in center-weighted composition; Midjourney produced similar visuals across both prompts. Importantly, prompting “dynamic painting” seems to produce images that are more chaotic, but a close inspection shows that this apparent chaos is most often organized around the center of the image, which gives it a somewhat Classical touch. Notably, portraits labeled “dynamic” vs. “static” in DALL•E had more pronounced differences (e.g., color palette, angles of the head), while Midjourney again displayed minimal variation across the two conditions. This holistic review thus reinforced the interplay of prompt wording, model biases, and the tension between compositional form (often Classical) and thematic or stylistic details (sometimes read as Baroque or dynamic).
As a final word, let us add that, as observed to some extent in [33], this study highlights how Wölfflin’s stylistic concepts can arise in AI-generated outputs, yet remain uneven or incomplete relative to canonical art historical norms. It also underlines the fact that Wölfflin’s categories may have become outdated for analyzing recent artistic productions, and that renewed, finer classes might need to be elaborated to provide art historians with more modern analytical keys.
4.7. Limitations and Future Works
Reproducibility of the results. By nature, the subjective expert-based evaluation stage cannot guarantee the full reproducibility of our results. To mitigate that issue, we used not one but two experts in visual semiotics and art history (who are not easy to find), and asked them to review a large number of images in each experiment. This reinforces the significance of our conclusions, which should be seen as general trends and observations rather than isolated scores. Regarding the generative stage alone, to the best of our knowledge, unfortunately neither DALL•E nor Midjourney allows fully reproducible results. This is due to a random pixel initialization at the start of the generation and to added randomness in the revision of the prompt and during the generative diffusion process itself. Just as for the evaluation, this also explains why we repeated each prompt multiple times. Therefore, even though the individual results (i.e., the generated images and their evaluations) are not fully reproducible, we designed our experimental protocol to ensure as much as possible that our conclusions are significant, and hence that the conclusions themselves are the parts that benefit from increased reproducibility.
Choice of models. While we focused on two models, DALL•E and Midjourney, our methodology can be used as is to assess any other text-to-image generative model, e.g., Stable Diffusion (https://stability.ai/stable-image (accessed on 18 February 2025)), Imagen (https://deepmind.google/technologies/imagen-3/ (accessed on 18 February 2025)), Flux (https://flux1.ai/ (accessed on 18 February 2025)), etc. Our early tests with other models did not hint at drastically different behaviors from those observed with DALL•E and Midjourney. Hence, to keep the time-consuming and difficult expert-based evaluation manageable, we opted for those two popular and easy-to-use models. Also, we preferred not to generate all the images for all the models and then leave most of them unrated, as the expert evaluation is really what gives value to our experiments. Thus, as for the evaluation stage, the generation stage also ultimately focused on our two selected models. Comparing a wide range of models across a possibly restricted number of prompts would be an interesting future work.
Extra prompting strategies. We designed our prompting strategy around mostly short prompts in order to assess as much as possible how the models captured and materialized the essence of the concepts that we tested, avoiding any distraction. However, many other prompting strategies could be tried. Experiment 3 is an example of a different kind of strategy, as it uses a highly structured and detailed prompt, yielding results on par with shorter prompts. We did not investigate other long prompts because their diversity explodes exponentially with the prompt length, making it more difficult to isolate influential factors and study specific aspects, such as each Wölfflin category. Yet, this is certainly an interesting idea for a follow-up work. Another prompting strategy to try could be few-shot prompting, where several iterations and refinements are allowed in order to maximize the chances of achieving a desired result. In our case, we used only one-shot prompting, as we believe this better reflects the core capabilities of the models, but it might well be the case that a small number of iterations yields different outcomes. Lastly, one could test image-conditioned prompting, where a reference image is passed along with the prompt to the model, as done in [38] to mix various styles. This could serve to guide the model toward a more meaningful result than simply letting it generate images from scratch, which in turn might not allow for an insightful analysis. Again, in this paper, we chose to focus on the intrinsic properties of the models, but we believe that many prompting strategies could serve as a basis for interesting future work.
Multivariate analysis. Although a multivariate approach could reveal deeper structures in how experts rate Wölfflin’s dimensions, we ultimately chose not to pursue this route. The core reason is that it would shift the focus of the analysis too much toward the agreement between the experts or toward very particular patterns, and many more experts would thus be needed to deliver a fully insightful message. Moreover, layering additional complexity onto our already multifaceted experiments might risk obscuring the interpretive, art historical insights that are central to this study. A more in-depth, data science-based analysis would benefit from a more restricted and data-rich experimental context, which could ultimately constitute a paper of its own, with its own research questions. Instead, descriptive analyses and expert commentaries allow us to preserve clarity and accessibility for our intended audience, ensuring that the theoretical narrative around Wölfflin’s framework remains at the forefront.
5. Conclusions
In this study, we examined how two popular text-to-image models (DALL•E and Midjourney) interpret and generate imagery under Heinrich Wölfflin’s Classical/Baroque framework. Through both explicit style prompts (e.g., “baroque”, “classical”) and more nuanced compositional instructions (“dynamic”, “static”, or reworked Wölfflin descriptors), we collected expert ratings under blind conditions, then followed up with a holistic review of the resulting images. Our results reveal a complex interplay between prompt wording, model biases, and expert interpretation. While “baroque” prompts often elicit dramatic contrasts or mixed figures, “classical” tends to be less distinctly defined, especially when the genre (e.g., portrait, still life) imposes a centered composition. In some cases, “dynamic” versus “static” triggers equally large shifts in stylistic ratings, showing how neither label is guaranteed to produce the expected historical cues. The “dynamic image” prompt even yields seemingly chaotic images which, in fact, display a Classical spatial organization around their center. At times, Baroque themes appear within fundamentally Classical compositions, indicating a tension between the two styles. Meanwhile, removing explicit style labels but retaining Wölfflin-based descriptions yields more abstract forms, detached from historic Baroque figurations. The broader, side-by-side review helped the experts identify subtleties—like the partial overlap in clearness or lighting—that were not immediately apparent when images were rated in isolation. Taken together, these findings illustrate that while Wölfflin’s categories can influence AI outputs, the models’ contemporary training data and generative preferences complicate any straightforward mapping of Classical versus Baroque. Our work thus suggests that Wölfflin’s theory, once deemed atemporal, merits thoughtful revision to address the novel ways generative systems blend compositional form, thematic cues, and modern biases.