On the perception of graph layouts

In the field of software engineering, graph‐based models are used for a variety of applications. Usually, the layout of those graphs is determined at the discretion of the user. This article empirically investigates whether different layouts affect the comprehensibility or popularity of a graph and whether one can predict the perception of certain aspects in the graph using basic graphical laws from psychology (i.e., Gestalt principles). Data on three distinct layouts of one causal graph is collected from 29 subjects using eye tracking and a print questionnaire. The evaluation of the collected data suggests that the layout of a graph does matter and that the Gestalt principles are a valuable tool for assessing partial aspects of a layout.


| INTRODUCTION
In software engineering (SE) there are various applications for graphical models, such as: 1
• during development or testing: fault trees for assessing reliability in functional security 2,3 or attack trees for detecting threats to IT security. 4
• for specification or documentation: class or sequence diagrams for communicating the structure or behavior of code. 5
• for reengineering: program or data flow diagrams for understanding barely documented code. 6
Most graphical models are graph-based; they consist of two general types of elements, nodes and edges (see Figure 1A for an illustration and references 7,8 for a more formal introduction). Information is conveyed through the selection and combination of model elements. Take again the model in Figure 1A as an example: it consists of three nodes (named X, Y, and Z) as well as two edges. The edges connect the nodes X and Z with Y, in each case toward Y. The alignment of the node elements relative to each other (i.e., the layout of the graph) does not carry any information: the graph in Figure 1B encodes exactly the same information as the one in Figure 1A. In practice, the layout of the graph is determined at the user's discretion, possibly by using some sort of graph sorting algorithm.
Our research starts at this very point, combating arbitrariness in the selection of graph layouts: with an empirical study, we investigate how, if at all, different alignments of model elements influence the comprehensibility and popularity of a graph and whether we can predict the perception of certain aspects in the graph with basic graphical laws from psychology (i.e., Gestalt principles). As a research method, we rely on the triangulation of eye tracking and questioning; this allows us to access both subconscious cognitive processes and conscious attitudes of our subjects. Causal graphs are chosen as objects of investigation because they seem to be the most intuitive.
Statements: This article is an extension of work originally presented in the "1st Workshop on Advances in Human-Centric Experiments in Software Engineering" (HUMAN 2022) [1]. We adhere to the Wiley standards for ethics and integrity. For our study, we followed the procedure recommended by the 'Joint Ethics Committee of the Universities of Bavaria' (GEHBa) responsible for our university and performed a self-assessment; we also had our subjects sign an informed consent form. The data collected in our study and the corresponding study materials are available at www.doi.org/10.5281/zenodo.7241097. Our work is funded by the 'Bavarian State Ministry of Economic Affairs, Regional Development and Energy' (STMWI) within the funding project HolmeS 3 (FKZ: DIK0173/03) and by the 'German Federal Ministry of Education and Research' (BMBF) within the funding projects HASKI (FKZ: 16DHBKI035) and FH-Invest (FKZ:
The following pages present the conducted research starting with some background information: an explanation of the underlying theoretical constructs (i.e., causal graphs and Gestalt principles), an introduction to the research method of eye tracking, and an overview of related work.
Afterwards, the study itself is described: its design, implementation, and analysis. To bring this article to a close, the results of our research are presented and discussed. Note that this article is an extension of work originally presented in a conference paper. 1

| BACKGROUND
This section provides some background information on the study content to facilitate its understanding. First, causal graphs are introduced. This is followed by an explanation of selected Gestalt principles. Then, the research method of eye tracking is presented. Finally, related approaches are put into context.

| Causal graphs
Causal graphs are graph-based models used to visually represent cause-and-effect relations between random variables by directed edges. 9 If the graphs in Figure 1A and 1B were interpreted as causal, they would state that the random variables X and Z cause the random variable Y directly (i.e., X and Z are direct causes of Y). 8 Causality prohibits undirected relations, bidirectional relations, and cycles: an effect cannot influence itself or its cause. 10
For the empirical investigation, a concrete graph is needed. Based on previous work by the authors, a highly simplified model of how a car accident occurs is chosen, outlined in Figure 1C. The model states that an accident (random variable A) may happen because the driver (random variable D) is distracted or tired, the current speed of the car (random variable S) makes it impossible to brake in time, or the environmental conditions (random variable E) such as visibility or road condition are poor. It is assumed that the speed of the car is mainly influenced by its driver and the environmental conditions: the speed tends to increase once the driver is stressed and to decrease when the car is stuck in a traffic jam. Here, environmental conditions include both weather (random variable W) and traffic (random variable T).
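The constraints on causal graphs described above can be illustrated with a short Python sketch. The edge list is an assumption: it is transcribed from the prose description of the accident model and may differ from the edges actually drawn in Figure 1C. The helper names `direct_causes` and `is_acyclic` are hypothetical.

```python
# Assumption: edges read off the prose description, not taken from Figure 1C.
EDGES = [
    ("D", "A"), ("S", "A"), ("E", "A"),  # driver, speed, environment -> accident
    ("D", "S"), ("E", "S"),              # driver, environment -> speed
    ("W", "E"), ("T", "E"),              # weather, traffic -> environment
]


def direct_causes(edges, node):
    """Return the nodes with a directed edge pointing at `node`."""
    return sorted(src for src, dst in edges if dst == node)


def is_acyclic(edges):
    """Kahn's algorithm: repeatedly remove nodes without incoming edges."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    for _, dst in edges:
        indegree[dst] += 1
    queue = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while queue:
        current = queue.pop()
        removed += 1
        for src, dst in edges:
            if src == current:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    queue.append(dst)
    # Every node can be removed only if the graph contains no cycle.
    return removed == len(nodes)
```

Under this reading, `direct_causes(EDGES, "A")` lists D, E, and S, and adding a reverse edge such as A → D makes `is_acyclic` fail, which mirrors the rule that an effect cannot influence its own cause.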

| Gestalt principles
Gestalt principles are law-like observations about human perception of graphical objects. Their beginnings trace back to the early 20th century: with empirical research, the psychologist Wertheimer identified multiple factors that determine the perception of grouping. 11 For the purposes of this article, we will focus on three principles known as proximity, similarity, and closure. They state that elements are perceived to belong together if they:
• proximity: … are close to each other,
• similarity: … are similar to each other, or
• closure: … in combination suggest a closed form. 12
FIGURE 1 Terminology (A) and alignment variety (B) of graph-based models and an exemplary causal graph (C).
See Figure 2A-2C for common examples: a collection of dots is perceived column-wise when properly arranged (Figure 2A) or colored (Figure 2B); an implied circle is perceived as full (Figure 2C). 12 The Gestalt principles can also be applied to graphs; Figure 2D-2F shows the assumed effect of the three Gestalt principles on the causal graph from Figure 1C. Proximity suggests that closely located nodes are perceived as a unit, regardless of whether they are connected by edges (Figure 2D). Similarity applies when several edges resemble each other in that they point in a similar direction (Figure 2E). The principle of closure should show if the arrangement of model elements resembles a known form (Figure 2F).

| Eye tracking
Eye tracking is the recording of eye movement data; in current systems, this works non-invasively with nothing more than a small camera (and sometimes a light source) pointed at the subject. 13 The raw eye tracking data is simply a time series of two-dimensional coordinates, 13 that is, the points in space and time the subject looked at. This time series data is usually transformed into more complex metrics mostly based on:
• fixations: … moments when the subject's gaze is roughly focused on some region while processing information within that region, 14
• saccades: … moments when no information is taken in by the subject because the subject's gaze alternates between two areas, 14 or
• areas of interest (AOI): … sub-areas of the presented stimulus. 13
The acquired data is valuable for a variety of use cases: from support systems (e.g., in the form of driver assistance systems 15 or interactive learning systems 16 ) and medical applications (e.g., for gaze interaction 17 or visual acuity measurement 18 ) to applied research (e.g., of usability 19 or behavioral patterns 20 ).

| Related work
There is already some eye tracking-based research on our first starting point of research, the comprehensibility of graph layouts. For example, Körner 21 used eye tracking to set up a heuristic model of graph comprehension for so-called hierarchical graphs (i.e., graph-based models with undirected edges where the relative vertical position of nodes indicates the direction of their relationship); the subjects' task was to determine whether a certain directed relationship exists between two nodes of a graph. The results suggest that the subjects' behavior can be divided into three stages: searching for the first node, searching for the second node, and reasoning about their relation.
Huang 22 validated the proposed three-stage model for undirected graphs by investigating the effects of individual aspects of graph layout (i.e., edge crossings and geodesic path tendency) on graph comprehensibility. The subjects' task was to look for a given number of edges, either between two nodes or toward a node. For edge crossings, Huang found a difference in performance for general edge crossings, but in the eye tracking data only for small crossing angles. However, the eye tracking data suggests a tendency toward geodesic paths.
FIGURE 2 Exemplary effects of the Gestalt principles proximity (A, D), similarity (B, E), and closure (C, F). Based on Gerrig 12 (A-C).
There are also eye tracking studies that compare specific layouts: for example, Pohl, Schmitt, and Diehl 23 compared three graph layouts (i.e., orthogonal, force-directed, and hierarchical) using some random undirected graphs. Again, the subjects encounter tasks based on graph theory. The results of the study indicate that the force-directed layout is the best in terms of response time and accuracy.
In addition, some empirical studies have been conducted on diagrams entirely without eye tracking: Sharif and Maletic 24 studied different layouts (i.e., multi-cluster and orthogonal) of class diagrams (i.e., partially directed graphs) using an online questionnaire.Here, subjects completed tasks more closely related to the SE domain, such as bug fixing.The results are in favor of the multi-cluster layout.
The list of empirical work on graphs could be continued, but to our knowledge, all of them differ from the work presented in this paper in at least one of the following aspects, as the approaches presented 21-24 exemplarily show:
• goal: As in the latter two approaches, 23,24 the goal of this paper is to compare specific layouts. In contrast, the first approach 21 works on building a heuristic model for graph understanding, while the second one 22 evaluates the impact of certain layout aspects.
• object of investigation: The first approach 21 studies hierarchical graphs (i.e., graphs that encode directed relationships between node elements by their layout), while the middle two 22,23 use undirected graphs (i.e., graphs that encode only undirected relationships between node elements). The present study considers directed graphs (i.e., graphs that encode directed relationships between node elements by directed edges), a concept that is far more natural for the SE domain than hierarchical or undirected graphs.
• tasks: In the first three approaches, 21-23 the subjects' tasks are built upon graph theory (e.g., dealing with paths, cliques, or degrees). The present study uses tasks that are more natural to the discipline of SE (i.e., memorizing and debugging or reproducing, respectively), in line with the work done in the last approach. 24
• analyses: In the first three approaches 21-23 as well as in the present study, the different graphs are evaluated with eye tracking (e.g., heat maps or fixation counts) or without (e.g., answer times, error rates, or questioning); in the last one 24 no eye tracking is employed. However, all of them only use quantitative data to rate the entire graph and qualitative data, if at all, to focus on distinct parts of the graph (e.g., with heat maps). In contrast, we relate quantitative data to individual model elements.
Moreover, we extend the work of all previous approaches presented 21-24 by not only collecting and evaluating empirical data but also using psychology (i.e., Gestalt principles) to make predictions in advance.
The second starting point of our research, the applicability of those Gestalt principles, has been investigated several times over the last 100 years. Hu and Bačić 25 even used eye tracking as a research method. However, with or without eye tracking, up to now, simple geometric arrangements (e.g., Figure 2A-2C) were chosen as objects of investigation instead of more complex graphical objects such as graphs (e.g., Figure 2D-2F).

| METHODS
This section takes a closer look at the study conducted; its structure is adapted to the specifics of this study. Before the concrete hypotheses are formulated, the materials, procedure, and assumptions are discussed. Then, the collection and analysis of the study data are presented.

| Material
Since our main goal is to investigate the influence of the layout of a graphic on its perception, the independent variable (IV) is layout. Here, we consider three different values:
• IV1 (Top-Down): This layout corresponds to the way a graph is usually presented in causal literature: 9 all edges point either downwards or horizontally.
• IV2 (Bottom-Up): This layout is oriented opposite to IV1, with all edges pointing upwards or horizontally. This type of representation is most common in technical domains (e.g., SE), for example in so-called fault trees. 2,3
• IV3 (Random): The aim of the final layout is to create a counterpart to the two tree-like structures. Here, the graph is not aligned in one direction but rather resembles a hexagonal or circular object.
Figure 3A-3C shows the individual layouts for the exemplary graph of the study presented earlier in Figure 1C. The remaining panels in Figure 3 show other graphs that the subject encounters in the course of the study: the alienated (Figure 3D-3F) and manipulated versions (Figure 3G-3I) of the respective layouts or original graphs. These are required by some tasks of the study: each subject has to memorize and reproduce one graph from all three layouts.
If the graphs from Figure 3A-3C were chosen for memorization and the subject were supposed to reproduce the contents of one graph, the subject could fall back on their knowledge of the other layouts for this purpose. Therefore, the graphs are alienated: the actual random variables are cyclically replaced by letters of the alphabet. To create as equal conditions as possible, exactly one vowel is assigned to each IV (which results in the vowel "o" not being assigned at all). The distribution of the assigned letters in the graphs does not follow a fixed scheme. Rather, the arrangements are chosen to minimize memory aids; for example, the spatial proximity of letters that produce common abbreviations is avoided. The resulting alienated graphs are shown in Figure 3D-3F.
The reproduction is realized partly as "debugging" in a manipulated version of the alienated graph. To ensure comparability between layouts, the respective manipulations must be of the same kind: we opted for the inversion of two edges and the rotation of three nodes for each layout.
The resulting manipulated graphs are compiled in Figure 3G-3I with the altered model elements highlighted in red and the edges numbered consecutively. The former is only done for better readability of this very figure and not in the study itself; the latter makes it easier to address particular edges in the present article as well as in the empirical study.

| Procedure
The study follows a within-subject design; each subject is exposed to each task or variation of the IV. 14 In the beginning, the subjects complete some paperwork. They sign a consent form and fill in the first part of the questionnaire with their demographic data (i.e., age, gender, occupation, and area of expertise) as well as their previous experience with fault trees or causal graphs, respectively.
To increase the accuracy of eye tracking data, a 9-point calibration is performed for each subject prior to the actual recording: 9 dots are presented on the eye tracker's monitor one after the other, with the subjects instructed to focus on each dot with their gaze. 14
The first stimulus of the eye tracking study includes a written introduction to causal graphs, similar in content to the first paragraph in section 2.1. While viewing this stimulus, the subjects have the opportunity to ask questions about the topic. This ensures that all subjects have the necessary prior knowledge and that we do not have to restrict participation in any way. After that, the actual eye tracking data collection follows, divided into two experiments regarding comprehensibility and popularity, respectively.

FIGURE 3 Original (A-C), alienated (D-F), and manipulated graphs (G-I) for IV1 (A, D, G), IV2 (B, E, H), and IV3 (C, F, I).
In the first experiment, the subjects are presented with one of the alienated graphs and instructed to memorize it for a self-selected period of time; they are then asked to name the direct causes of one node (i.e., the reproducing task) and to identify the changes in the corresponding manipulated graph (i.e., the debugging task), all from memory and verbally, with answers noted in the questionnaire by the study leader. The experiment is repeated for all three layouts or IVs, with a trial run at the beginning. The trial run is based on an arbitrary graph constructed from six edges and five nodes with names for the random variables that do not appear in the alienated graphs (i.e., the training graph). The order of the three real runs is varied between the subjects as first (i.e., IV1 → IV2 → IV3), second (i.e., IV2 → IV3 → IV1), and third timeline (i.e., IV3 → IV1 → IV2). The different timelines counteract possible learning effects of the subjects: even if each subject achieves the best personal results in their last run, this effect balances out across all subjects. This ensures an objective evaluation of the comprehensibility of the individual graphs.
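The three timelines are the cyclic rotations of the layout order, so every layout appears once in each position. A minimal Python sketch of this counterbalancing follows; the function names are hypothetical, and the round-robin assignment of subjects to timelines is an assumption, since the text does not state how subjects were distributed.

```python
LAYOUTS = ["IV1", "IV2", "IV3"]


def timelines(layouts):
    """Return all cyclic rotations of the layout order (a 3x3 Latin square)."""
    return [layouts[i:] + layouts[:i] for i in range(len(layouts))]


def timeline_for(subject_index, layouts=LAYOUTS):
    """Pick a timeline for a subject.

    Assumption: subjects are assigned round-robin; the study may have
    used a different assignment.
    """
    rotations = timelines(layouts)
    return rotations[subject_index % len(rotations)]
```

Because each layout occupies the first, second, and third position exactly once across the rotations, a learning effect tied to the position of a run is spread evenly over the layouts.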
In the second experiment, some preferential looking tasks (PLTs) are performed: the subjects are presented with stimuli showing two elements of one kind (i.e., two causal graphs) side by side and instructed to view them at will. 26 Each PLT is preceded by a stimulus showing a small cross for a few seconds; the subjects are asked to focus on this cross. This ensures that the subjects' gaze on the PLT stimulus starts at a predefined position. In total, each subject encounters 18 PLTs or three runs of the experiment, respectively: the trial run consisting of six PLTs between one of the alienated graphs and the training graph, the first run with two of the alienated graphs each, and the second run with two of the original graphs each. The individual stimuli of the first and second run each follow the scheme: IV1 against IV2, IV2 against IV3, IV3 against IV1, IV2 against IV1, IV3 against IV2, and IV1 against IV3. Before the second run, the causal story behind the original graphs is explained to the subjects in written form, similar in content to the second paragraph in section 2.1 of the present article.
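The stimulus scheme of a PLT run can be reproduced by pairing each layout with its cyclic successor and then mirroring every pair, so that each layout appears equally often on the left and on the right. The following Python sketch shows this; the function name `plt_pairs` is hypothetical, and reading the first element of each pair as the left-hand graph is an assumption.

```python
def plt_pairs(layouts):
    """Return the six ordered layout pairs of one PLT run."""
    n = len(layouts)
    # Each layout against its cyclic successor: IV1-IV2, IV2-IV3, IV3-IV1.
    forward = [(layouts[i], layouts[(i + 1) % n]) for i in range(n)]
    # The same pairs with sides swapped: IV2-IV1, IV3-IV2, IV1-IV3.
    mirrored = [(second, first) for first, second in forward]
    return forward + mirrored
```

Three runs of six pairs each give the 18 PLTs mentioned above.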
The study ends with a short retrospective interview. There, subjects are asked about their memorizing strategy and their preference between the three layouts, both before knowing the causal context (i.e., with the alienated graphs) and after learning it (i.e., with the original graphs). This choice is again noted in the questionnaire by the study leader. Also, during the interview, the subjects have the opportunity to view their gaze recording and make further comments on the study.
To give a better idea of the individual eye tracking stimuli, the sequence of the two experiments is shown in Figure 4. There, it is also noted whether the change between stimuli is triggered by a timer or by an action (i.e., mouse click) of the subject.

| Assumptions
The study design carries two main assumptions.First, with the Gestalt principles we can predict the subjects' behavior while memorizing the graph and facing the reproducing or debugging tasks.Second, with PLTs we can expose the subjects' subconscious decisions between two elements of one kind.This section explains how we reached our testable implications based on these two main assumptions and some other minor ones.
In memorization, we generally assume that a subject's gaze is directed along the edges of the graphs, regardless of their direction. However, Gestalt principles suggest certain deviations from the edges, in particular, that the subject's gaze:
• proximity: … switches between nodes that are close to each other, even if there is no direct connecting edge between them (e.g., the subject's gaze should wander between the nodes D and G in the alienated graph of IV1),
• similarity: … preferably follows the shorter one in case of nearly parallel edges (e.g., the edge between nodes B and K should be preferred over the edge between the nodes B and H in the alienated graph of IV2), and
• closure: … follows the boundary lines of a shape indicated by the positioning of the nodes, even if this shape is not completely described by edges (e.g., the subject's gaze should wander between the nodes C and S in the alienated graph of IV3).
FIGURE 4 Procedures of the first (A) and second (B) experiment.
Table 1 summarizes those implications for the three layouts.Thereby, it introduces the term transition.For the purposes of this article, a transition will refer to the direct change of a subject's gaze between two nodes regardless of direction.When a subject's gaze wanders from node D to node G without passing through any other node, this is interpreted as a transition between D and G, but so is a change of gaze from node G to node D without passing through any other node.
The reproducing tasks are chosen such that each has two correct random variables, one from each of the two mental groupings according to proximity. We assume that it is easier for the subject to name the cause that is assigned to the same grouping as the node of the question. Consequently, the relative frequency of this answer should be greater than that of the other answer across all subjects. Figure 5A-5C outlines these considerations using IV3 as an example; once again, all edges are numbered for easier referencing. In Figure 5A, the mental groupings of the graph relevant to the task are colored. In Figure 5B the correct answers (i.e., the direct causes of the node I) are highlighted in red. Figure 5C combines the two previous illustrations.
Likewise, in the debugging tasks, some manipulations should be easier to detect than others: we assume that switching random variables between mental groupings according to proximity should be more noticeable to subjects than switching within a grouping. Also, reversed edges should attract more attention if they break a grouping based on similarity. Thus, again, the associated answers should occur more frequently across all subjects. Figure 5D-5F outlines this intuition for IV3. Figure 5D presents the mental groupings of the memorized graph. In Figure 5E the correct answers (i.e., the manipulations) are highlighted in color. Figure 5F once again combines the previous illustrations. The color highlighting makes the manipulation of nodes L and F stand out more than that of node C; recalling the memorized graph should produce the same effect.
The same applies to the edges: by reversing edge 7, the latter grouping is missing from the manipulated graph. This gives the subjects a "clue" for detecting the manipulation of edge 7. These considerations lead to the following inequalities: P("F") ≥ P("C"), P("L") ≥ P("C"), and P("7") ≥ P("6").
Table 2 summarizes the previous explanations for the three layouts. The contents for IV1 and IV2 come without further derivation; the underlying considerations follow the same scheme as for IV3.
In the course of PLTs we rely on two aspects: which of the elements was looked at first and which was looked at longer by the subject. In accordance with Behe, Campbell, and Khachatryan, 26 we assume that the element that was viewed first caught the subject's attention, while the one that was viewed more extensively fascinates the subject more. This means that a PLT tells which of the given items is (subconsciously) chosen by the subject, while being more robust than questioning. 26
In comparing the three layouts or IVs, we believe that the tree-like arrangements (i.e., IV1 or IV2) will prove advantageous in terms of comprehensibility and popularity, simply because they appear more orderly. We also assume that IV1 is more appropriate than IV2, as this is consistent with the natural way of causal thinking that leads from causes to effects, rather than the other way around. Regardless of whether PLTs or surveys are used, we firmly believe that the classification of popularity will become clearer once the causal context is known.

| Hypotheses
In accordance with section 1, we define our research questions to be:
1) Do the Gestalt principles hold true for (causal) graphs?
2) Do different alignments influence the comprehensibility of a (causal) graph?
3) Do different alignments influence the popularity of a (causal) graph?
For a concise evaluation, we convert those research questions into a set of nine hypotheses, with three hypotheses each grouped by an overarching formulation:
H1. The Gestalt principles of proximity, similarity, and closure hold true for causal graphs.
H1a. The transitions during memorization comply with edges or the predictions in Table 1.
H1b. The relative frequencies of correct answers to the reproducing tasks satisfy the predictions in Table 2.
H1c. The relative frequencies of correct answers to the debugging tasks satisfy the predictions in Table 2.
TABLE 2 Predictions for reproducing and debugging tasks. Columns: Layout, Task, Solution, Expected ratio of relative answer frequencies, Underlying Gestalt principle.
H2a. The duration of memorization increases from IV1 to IV2 to IV3.
H2b. The score on the reproducing tasks decreases from IV1 to IV2 to IV3.
H2c. The score on the debugging tasks decreases from IV1 to IV2 to IV3.
H3. The popularity of the causal graphs decreases from IV1 to IV2 to IV3.
H3b. With direct questioning, one chooses IV1 over IV2 over IV3.
H3c. With the knowledge of the causal context, the effects from H3a and H3b strengthen.
Note that the quantity score mentioned in the hypotheses H2b and H2c is measured as a percentage, but is not the same as the percentage of correct answers; rather, the correct and incorrect answers are offset. A score of 100% can only be achieved with exactly the correct answers (two for each reproducing task and five for each debugging task, as listed in Table 2); each wrong or missing answer leads to a point deduction of 50% for the reproducing tasks and 20% for the debugging tasks.
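The scoring rule can be sketched in Python as follows. The function name `task_score` is hypothetical, and clamping the score at 0% is an assumption: the text does not say how more errors than the deductions allow for are handled. The IV3 debugging solution used in the usage note is inferred from the inequalities given earlier (nodes L, F, C and edges 6, 7).

```python
def task_score(given, correct, penalty):
    """Score in percent: start at 100 and deduct `penalty` points for
    every wrong answer and for every missing answer.

    Assumption: the score is clamped at 0 rather than going negative.
    """
    given, correct = set(given), set(correct)
    wrong = len(given - correct)    # named, but not part of the solution
    missing = len(correct - given)  # part of the solution, but not named
    return max(0, 100 - penalty * (wrong + missing))
```

For instance, a subject who names the nodes L and F and the edges 7 and 3 in the IV3 debugging task has one wrong and two missing answers, so `task_score({"L", "F", "7", "3"}, {"L", "F", "C", "6", "7"}, penalty=20)` yields 40, although three of the four given answers are correct.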
Figure 6 breaks down the data sources used to evaluate the hypotheses: for the memorization (Figure 6A), the reproducing or debugging tasks (Figure 6B), and the PLTs (Figure 6C). Figure 6A and 6C also show the empirically chosen AOIs, as circles over the nodes of the graphs to be memorized or as rectangles over the elements of the PLTs. The transitions between the nodes during graph memorization (Figure 6A) cannot be exported directly from the eye tracker's analysis tool, but must be reconstructed from the exportable activations of the set AOIs; this procedure is further elaborated in section 3.6. In the present case, the memorization duration (Figure 6A) can be equated with the viewing duration of the corresponding stimulus, a quantity that is not actually an eye tracking metric in the strict sense, but is recorded by the eye tracker. The answers to the reproducing or debugging tasks as well as the deliberate decisions of the subjects (Figure 6B) are not assigned a metric; they are taken from the questionnaire instead of from the eye tracker. From the PLT stimuli (Figure 6C) we export two eye tracking metrics: the time span until the first gaze into each AOI and the total duration of their viewing.
The utilized eye tracking metrics are proven to be valid and reliable. 14 Their naming complies with the utilized analysis tool of the eye tracker.

| Realization
We promoted the study at the Technical University of Applied Sciences Regensburg and did not restrict participation by prior knowledge, area of expertise, or any other factor. Thus, we recruited 29 subjects aged 21 to 64 years (median = 25). Among them, 11 were women (38%). At the time of the study, most subjects were enrolled in a university (24 subjects, 83%) and worked in STEM fields (22 subjects, 76%).

| Analysis
The data collected was analyzed with the programming language R version 4.2.1 using the libraries base 27 and rstatix. 28 Data analysis is quantitative throughout; however, the first hypothesis (i.e., H1) is examined using descriptive statistics only, while the other hypotheses (i.e., H2 and H3) are examined with inductive statistics in terms of statistical tests.
For hypothesis H1a we start by exporting the metric AOI hit from the analysis software as one data set per subject, each of which contains one column per AOI or node. These columns are filled with one value per measuring point:
• "NA" if the stimulus was not presented at the corresponding measuring point,
• "1" if the AOI was activated (i.e., viewed) at the corresponding measuring point, and
• "0" if the AOI was not activated (i.e., not viewed) at the corresponding measuring point.
To get from those data sets to our quantity of interest (i.e., the number of transitions between every two nodes during memorization) we repeat a 5-step process for every subject and layout, visualized in Figure 7 for a highly simplified exemplary gaze on IV3. As a first step, we restrict the dataset to the data of one layout by deleting the columns that do not belong to the AOIs of the current layout and then removing the rows that exclusively contain the value NA (i.e., the rows that belong to measuring points when another layout was presented). In this cleaned data set, we determine the rising edges in the binary coding of each column (i.e., the entries of gaze into the corresponding AOI). This is done by subtracting the predecessor from each value; a difference of 1 marks a rising edge. In step 2, we fill an auxiliary vector with the column names of the cleaned dataset (i.e., the names of the AOIs) at the indices that were found to be rising edges in the respective column. Removing the rows without rising edges from the auxiliary vector (i.e., step 3) gives an ordered listing of visited nodes. The absolute frequencies of transitions can be taken directly from this listing (i.e., step 4). For example, whenever C is followed by F, the absolute frequency of the transition C ↔ F is increased by 1. In a final step, we convert those absolute numbers to relative ones; this is necessary because the number of transitions varies greatly between subjects, ranging from a total of 22 transitions on the entire graph (i.e., subject P18) to 44 transitions only between two particular nodes (i.e., subject P29).
FIGURE 7 Process for determining the relative frequencies of transitions for a subject.
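The five steps can be sketched as follows. The study's analysis was carried out in R; this Python version is only an illustration, and its function names and data layout (one dictionary per measuring point, with `None` standing in for NA) are assumptions. Two further assumptions are marked in the comments: the gaze is taken to start outside every AOI, and only changes between two different nodes are counted as transitions.

```python
from collections import Counter


def visited_nodes(rows, aois):
    """Steps 1 to 3: ordered list of AOIs the gaze entered."""
    # Step 1: keep this layout's AOI columns, drop rows that are all NA.
    cleaned = [[row.get(name) for name in aois] for row in rows]
    cleaned = [row for row in cleaned if any(v is not None for v in row)]
    visited = []
    previous = [0] * len(aois)  # assumption: gaze starts outside every AOI
    for current in cleaned:
        for name, before, now in zip(aois, previous, current):
            # Rising edge: value minus its predecessor equals 1.
            if (now or 0) - (before or 0) == 1:
                visited.append(name)  # steps 2 and 3 in one pass
        previous = current
    return visited


def relative_transitions(visited):
    """Steps 4 and 5: undirected transition counts as relative frequencies."""
    counts = Counter()
    for first, second in zip(visited, visited[1:]):
        if first != second:  # assumption: re-entering the same node is no transition
            counts[tuple(sorted((first, second)))] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}
```

Sorting each node pair makes C → F and F → C count toward the same transition, in line with the definition of a transition as direction-independent.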
The evaluation of hypotheses H1b and H1c is more straightforward: we simply determine the relative frequencies with which each model element is named in the course of the reproducing or debugging tasks.
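A minimal Python sketch of this step is given below. The function name `relative_frequencies` is hypothetical, and the answers in the usage note are invented purely for illustration; they are not study data.

```python
from collections import Counter


def relative_frequencies(answers_per_subject):
    """Share of subjects who named each model element at least once."""
    counts = Counter(
        element
        for answers in answers_per_subject
        for element in set(answers)
    )
    n_subjects = len(answers_per_subject)
    return {element: count / n_subjects for element, count in counts.items()}
```

With four invented answer sets such as `[{"F", "L", "7"}, {"F", "C", "7", "6"}, {"L", "7"}, {"F", "L", "C"}]`, a prediction of the form P("F") ≥ P("C") is then checked by comparing the corresponding dictionary entries.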
To assess the hypotheses H2a, H2b, H2c, H3a, and H3b, we need to find out whether the means or medians of the respective variables differ significantly for the distinct layouts (i.e., three groups). Due to the within-subjects design of our study, the data collected is dependent between the three groups. Following Field and Hole 29 we use either a one-way repeated measures ANOVA for a joint comparison, followed by dependent t-tests for pairwise comparisons, or Friedman's ANOVA for a joint comparison, followed by Wilcoxon signed-rank tests for pairwise comparisons.
The choice between these two alternatives is based on the nature of our data: if the respective variable is not normally distributed for at least one group or if sphericity between groups is not given, we choose the latter non-parametric alternative. To check for normal distribution within groups and for sphericity between groups, we rely on Shapiro-Wilk tests (significant ⇒ no normal distribution) and Mauchly's tests (significant ⇒ no sphericity), respectively.
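The decision rule described above can be written down as a short Python sketch. It takes the p-values of the assumption checks as input instead of computing them; the function name and the example p-values are hypothetical.

```python
def choose_tests(shapiro_p_values, mauchly_p, alpha=0.05):
    """Select the joint and pairwise tests for a three-group comparison.

    shapiro_p_values: one Shapiro-Wilk p-value per group (layout).
    mauchly_p: p-value of Mauchly's test across the groups.
    A significant result (p < alpha) means the assumption is violated.
    """
    normal = all(p >= alpha for p in shapiro_p_values)
    spherical = mauchly_p >= alpha
    if normal and spherical:
        return ("one-way repeated measures ANOVA", "dependent t-tests")
    return ("Friedman's ANOVA", "Wilcoxon signed-rank tests")


# Hypothetical p-values: the second group deviates from normality.
joint_test, pairwise_tests = choose_tests([0.21, 0.01, 0.34], mauchly_p=0.40)
```

With these example values, the significant Shapiro-Wilk result for one group is enough to select the non-parametric alternative.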
For hypothesis H3c, we need to split the data into two groups by time: before the subjects knew the causal story and after they learned it. Once again, our data is dependent, as it was collected from the same subjects for both groups. As a result, we again follow Field and Hole29 and use dependent t-tests or Wilcoxon signed-rank tests, preceded by Shapiro-Wilk tests for deciding between the former parametric and the latter non-parametric alternative.
In the course of the present article, the p-values of one-way repeated measures ANOVAs are adjusted using the Greenhouse-Geisser or the Huynh-Feldt correction method when sphericity is not given; the choice between those two corrections is based on the computed ε (i.e., Greenhouse-Geisser for ε < 0.75, Huynh-Feldt for ε > 0.75).29 Meanwhile, Bonferroni correction is applied to all pairwise tests.29 All (adjusted) p-values are evaluated against a significance level of α = 0.05. In line with Field and Hole,29 for each test, we report the p-value, the test statistic and, when possible, the effect size; we do not report effect size measures for Shapiro-Wilk tests, Mauchly's tests, or pairwise comparisons.29 For the choice of a proper effect size measure, we align with Albers and Lakens30 for parametric and Tomczak and Tomczak31 for non-parametric tests (i.e., generalized ε² for one-way repeated measures ANOVAs, Hedges' g for dependent t-tests, Kendall's W for Friedman's ANOVA, and the correlation coefficient r for Wilcoxon signed-rank tests).
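The two adjustment rules can be sketched in a few lines of Python. The sketch assumes the common rule of thumb of Greenhouse-Geisser below ε = 0.75 and Huynh-Feldt otherwise; how the boundary value itself is handled is an assumption, as are the function names and the example p-values.

```python
def sphericity_correction(epsilon, threshold=0.75):
    """Pick the correction method for a violated sphericity assumption."""
    # Assumption: a value exactly at the threshold falls to Huynh-Feldt.
    return "Greenhouse-Geisser" if epsilon < threshold else "Huynh-Feldt"


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical p-values of the three pairwise layout comparisons.
adjusted = bonferroni([0.01, 0.03, 0.40])
significant = [p < 0.05 for p in adjusted]
```

In this example only the first comparison remains significant at α = 0.05 after the adjustment, and the third adjusted value is capped at 1.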

| RESULTS
This section presents the results of the empirical study, organized according to the three research questions or main hypotheses: 1) the applicability of Gestalt principles, 2) the comprehensibility of distinct layouts, and 3) their popularity.

| Research question 1: Gestalt principles in causal graphs
Hypothesis H1a formulates the intuition that when memorizing a causal graph, the subjects' gaze tends to move along edges or deviate according to the Gestalt principles. This is assessed via the frequencies of transitions between node elements: Table 3 lists the medians of the relative frequencies of transitions across all subjects, for each layout and transition. The color highlighting of cells indicates whether the transitions are based on an edge (i.e., black cell highlighting), are made according to the Gestalt principles despite the absence of an edge (i.e., red cell highlighting), or are avoided according to the Gestalt principles despite the presence of an edge (i.e., blue cell highlighting). The individual rows in this table do not necessarily sum to 1; this is because we calculate the relative frequencies of the transitions for each subject, as explained in section 3.6, and then take the median over all these values. This procedure ensures that the distribution of each subject's transitions is weighted equally (i.e., that data from subjects such as P18 with very few transitions are not drowned out by data from subjects such as P29 with many transitions). To emphasize this fact, we have decided to present the median values as decimal numbers instead of percentages.
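A short Python sketch with invented numbers shows why the medians of a row need not sum to 1, even though each subject's relative frequencies do. The subject labels, transition names, and values are hypothetical.

```python
from statistics import median

# Hypothetical relative transition frequencies of three subjects;
# each subject's values sum to 1.
per_subject = {
    "S1": {"A-B": 0.5, "B-C": 0.5, "A-C": 0.0},
    "S2": {"A-B": 0.2, "B-C": 0.2, "A-C": 0.6},
    "S3": {"A-B": 0.1, "B-C": 0.8, "A-C": 0.1},
}

transitions = ["A-B", "B-C", "A-C"]
# Median across subjects, computed separately for every transition.
medians = {
    t: median(subject[t] for subject in per_subject.values())
    for t in transitions
}
row_sum = sum(medians.values())
```

Here the medians are 0.2, 0.5, and 0.1, which add up to 0.8 instead of 1, because the median is taken per transition and not per subject.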
In the table, three transitions are highlighted in red: M ↔ P (IV1), D ↔ J (IV1), and B ↔ H (IV2). For these transitions, the data do not really agree with our predictions based on the Gestalt principles; the remaining transitions either have a rather small median (i.e., less than 0.02) where they are considered absent, or a rather large median (i.e., more than 0.06) where they are considered present. This means that Table 3 partly supports hypothesis H1a.
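The comparison of predictions and medians can be expressed as a small Python sketch using the two thresholds named above. The function names, the label "unclear" for medians between the thresholds, and the example values are assumptions for illustration and are not taken from Table 3.

```python
def observed_status(median_freq, low=0.02, high=0.06):
    """Classify a median relative frequency using the two thresholds."""
    if median_freq < low:
        return "absent"
    if median_freq > high:
        return "present"
    return "unclear"  # assumption: values in between are left undecided


def agrees_with_prediction(predicted_present, median_freq):
    """True if the observed status matches the predicted one."""
    expected = "present" if predicted_present else "absent"
    return observed_status(median_freq) == expected


# Hypothetical examples: (transition predicted as present?, observed median).
examples = [(True, 0.11), (False, 0.00), (True, 0.03)]
agreement = [agrees_with_prediction(p, m) for p, m in examples]
```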
Hypotheses H1b and H1c assume that answers to the reproducing or debugging tasks can be predicted through the Gestalt principles. This is evaluated with Table 4; this table enhances Table 2 with the actually observed relative answer frequencies on tasks. Here, the data strongly support all of our predictions and thus also the hypotheses.

| Research question 2: comprehensibility of layouts
Hypotheses H2a, H2b, and H2c state that the memorizing duration as well as the scores on the reproducing or debugging tasks change between the distinct layouts: the former should increase from IV1 to IV2 to IV3, while the latter two should decrease that way. To verify this, we rely on some statistical tests, the results of which are presented in Table ; there, color highlighting of cells indicates selection of the appropriate test, as explained in section 3.6. From those, we see that:
• the memorization duration is significantly lower for IV1 than for IV2 or IV3,
• the score on the reproducing tasks is significantly higher for IV1 than for IV3, and
• the score on the debugging tasks is significantly higher for IV1 or IV2 than for IV3.
We found no significant differences in the memorization duration of IV2 and IV3, in the scores on the reproducing tasks of IV2 and the other two layouts, or in the scores on the debugging tasks of IV1 and IV2. In other words: we did not find any significant result contradicting the hypotheses H2a, H2b, or H2c, but we could not completely confirm them either.
TABLE 4 Observations for reproducing and debugging tasks. Columns: IV, task, and ratio of relative answer frequencies (expected and observed).

| Research question 3: popularity of layouts
The hypotheses H3a and H3b are similar to the previously discussed hypotheses. They state that three further quantities change between the distinct layouts: the percentage with which each layout is viewed the longest or first in the course of PLTs* and the percentage with which it is chosen deliberately in direct questioning†. All three quantities should decrease from IV1 to IV2 to IV3. Again, we resort to a series of statistical tests, the results of which are shown in Table 5. However, we find only one significant difference here: the number of deliberate choices is significantly higher for IV1 than for IV3. This observation is consistent with hypothesis H3b, yet we cannot fully confirm the hypotheses H3a and H3b.
Hypothesis H3c examines the same three quantities, but this time the comparison is not between layouts but between subjects' states of knowledge (i.e., whether or not they knew about the causal storyline). It is hypothesized that the popularity of IV1 over IV2 over IV3 should strengthen with causal context in both PLTs and surveys. Again, we performed statistical tests (see Table 6 for results) and found a few significant differences, all of which support the hypothesis:
• the number of first views of IV1 is significantly higher with causal context,
• the number of deliberate choices of IV1 is significantly higher with causal context, and
• the number of deliberate choices of IV3 is significantly lower with causal context.

| CONCLUSION
This section concludes the present article by discussing findings, limitations, and implications of the presented work.

| Findings
The results presented in section 4 suggest that the Gestalt principles, or at least the principles of proximity, similarity, and closure, hold for (causal) graphs. Moreover, the results showed that the layout of a causal graph affects its comprehensibility and popularity. In the actual use case, IV1 (the downwards-oriented graph) proved to be the most popular and the most suitable for memorizing, reproducing, and debugging.
In addition to this, the results of the study suggest that PLTs should be used with caution for low-emotion or recurrent elements. First, a PLT provides information about which of the presented elements is found more interesting by the subject; however, it remains unclear why the subject decides this way. For emotional elements, such as photos of facial expressions or vacation spots, interestingness and preference coincide. Causal graphs, on the other hand, convey little emotion; it is possible that subjects look at a causal graph longer if it seems more complex, and thus more interesting, to them. Second, with recurring elements, subjects had the opportunity to develop not only likes but also dislikes for certain layouts; the retrospective interviews showed that when one of the tree-like alignments was preferred (e.g., IV1), the other was strongly disliked (e.g., IV2). In a PLT, a neutrally rated element (e.g., IV3) is preferred to the disliked element (e.g., IV2), provoking bias.

| Limitations
The validity of the results presented is subject to some limitations, including:
• the subjects: We cannot tell whether the subjects really did their best when working on the tasks. In addition, we cannot ensure that the sample was heterogeneous enough; as described in section 3.5, most subjects were of similar age and background.
• the independent variable: The particular arrangement of the model elements was chosen by the authors at their discretion. We cannot assert that there is no other layout that might prove advantageous for causal graphs.
• the material: The stimuli present graphs with six nodes and seven edges. This is due to the study design: the subjects are asked to memorize and reproduce a given graph; a more complex graph would not allow for this type of task. However, the graphs used in everyday software engineering are much more extensive; we cannot assure that the results presented are valid for other causal graphs, let alone for other types or extensions of graph-based models.
The study materials (i.e., stimuli and questionnaire) as well as our collected data are available at www.doi.org/10.5281/zenodo.7241097 to facilitate a replication with a different sample or design.

| Implications
If one had to sum up the message of this article in one phrase, it would be "graph layout matters": the alignment of model elements in a graph influences their perception. This is already clear from the work of Pohl, Schmitt, and Diehl23 or Sharif and Maletic,24 respectively; the contribution of our work lies in the validation of this statement for directed graphs of SE, in particular for causal graphs, but also in the use of a quantitative analysis at the level of model elements. With the present article, we want to encourage
• … practitioners to pay attention to the design of their graphs, as the perception of a graph can be improved by simply adjusting its layout. We suggest Gestalt principles as a tool to predict the understanding of certain aspects of graphs.
• … researchers to further investigate which layout proves beneficial for which graph type or use case.
In future work, our findings can be used to develop style guides for graph-based modeling techniques. This way, they can be integrated into the software and system development process and support the entire SE community in their work.

5
combines the two previous illustrations. It can be seen that the node of the question (I) and one of the causes (F) belong to the red grouping, while the second cause (R) belongs to the blue grouping. The connection of nodes F and I (edge 5) falls into one grouping; the connection of nodes R and I (edge 6) is in between groupings. The connection to node F should mentally be more closely grouped with the node I and thus be more easily reproducible than the connection to node R. The relative answer frequencies should thus satisfy the inequality P("F") ≥ P("R").
TABLE 1 Predictions for memorization.
Figure caption: Grouping according to the Gestalt principles (A, D), solution (B, E), and explanation (C, F) for reproducing (A-C) and debugging tasks (D-F) of IV3.
utilized the monitor-based eye tracker Tobii Pro Spectrum (monitor: 23.8 in.; 16:9), its associated analysis tool Tobii Pro Lab version 1.145.28180 (x64), and a print questionnaire. The eye tracking data was collected contact-free at a frequency of 300 Hz; the print questionnaire was filled in partly by the subject and partly by the study leader. The entire process took about 20 to 40 minutes per subject.
FIGURE 6 Data for the evaluation of hypotheses during memorization (A), reproducing or debugging tasks (B), and PLTs (C).