Our visual system is collecting summary information from our environment all the time, whether it be scanning an arrangement of boxed strawberries to select the “best” box to purchase, scanning a crowd of people exiting an auditorium to decide which exit is the least crowded, or scanning an audience to assess how your performance was received. Our ability to extract summary statistics of visual features, like mean color of a box of strawberries, mean direction of a crowd, or mean facial expression from an audience, helps us navigate our environment and make decisions that best serve us. This ability has been studied by researchers for decades under the term “ensemble perception.” While summary statistics can be extracted via other modalities, like audition (e.g., E. A. Piazza, Sweeny, Wessel, Silver, & Whitney, 2013), this tutorial review will only discuss ensemble perception in the visual modality.

While most of the field has historically concentrated on our ability to estimate the mean (see Whitney & Yamanashi Leib, 2018, for a recent review), more recently, the study of variance perception and the use of ensemble coding in interpreting data visualizations have been on the rise. Both areas of research have implications for statistics education. For one, variance is one of the most difficult statistical concepts for students to understand and reason accurately about (Reid & Reading, 2008), yet it is a summary statistic that our visual system is sensitive to (Bronfman, Brezis, Jacobson, & Usher, 2014; Haberman, Lee, & Whitney, 2015; Morgan, Chubb, & Solomon, 2008; Solomon, 2010; Tong, Ji, Chen, & Fu, 2015). Moreover, while much of the graphical perception literature (e.g., Shah, Freedman, & Vekiri, 2005; Shah & Hoeffner, 2002) focuses on simpler graphs, such as bar graphs and line graphs for interaction plots (e.g., Shah & Freedman, 2011), the data visualizations and graphs used to teach core concepts in statistics, such as correlation with scatter plots and distributions with histograms, use many more graphical marks (e.g., bars, dots, lines).

In the first section of this tutorial review, we will provide an overview of the research conducted in ensemble perception on mean and variance perception. We will summarize the factors that influence our capacity for mean and variance perception and discuss the relationship between the two capacities. The list of factors discussed is not meant to be exhaustive. While researchers study many factors, like spatial grouping/layout (e.g., Yildirim, Öğreden, & Boduroglu, 2018) and viewing distance (e.g., Tiurina & Utochkin, 2019), this tutorial review will focus on the factors related to statistical distributions that are most relevant to research in data visualization and statistics education. Next, we will provide an overview of the research conducted on data visualizations, focusing on studies that relate to ensemble perception or statistics education. Finally, we will provide an overview of the research conducted in statistics education on students’ reasoning about mean and variance, as well as their misinterpretations of histograms.

The goal of this tutorial review is to draw connections between the three research areas in order to inspire research investigating ways to optimize graphs for statistics education. The scope of this tutorial review has been narrowed based on this goal, and our review of each research area is not meant to be exhaustive. Our intended audience is any researcher in ensemble perception, data visualization, or statistics education who may be interested in contributing to or applying research from the other research areas.

Ensemble perception

There is bountiful sensory information around us—too much for our limited cognitive resources to encode everything. How do our brains cope? Researchers have proposed that our brains cope by condensing the information into smaller pieces of summary information. This ability to extract summary information (i.e., summary statistics such as the mean) from a scene, or array of items, is referred to as ensemble perception. An ensemble (referred to as an array in experiments) is a set of objects, features, or configurations that have some meaningful or consistent relationship among their members and an underlying statistical distribution of some visual feature (Whitney & Yamanashi Leib, 2018). These ensembles can be composed of countable objects or refer to uncountable features of an object or scene (e.g., the texture of a grass field). Some examples of ensembles include berries on a bush, animals in a herd, and people in a crowd. Ensemble perception, operationally defined, is our ability to reproduce statistical moments from ensembles or to discriminate between the statistical moments of multiple ensembles (Whitney & Yamanashi Leib, 2018).

Statistical moments

Statistical moments measure characteristics of the shape of a distribution. There are different orders of statistical moments, each abstracting information about the distribution one order higher than the previous one. In the field of statistics, four statistical moments are commonly used, as well as higher-order statistics (e.g., the 5th and 6th; Hsu & Lawley, 1940) that are rarely used due to difficulty in interpretation. The four statistical moments discussed below are statistical definitions. Whether humans are sensitive to all four, and to what extent, remains an open question.

The first statistical moment is the one we are most familiar with—the mean, or average. This is also the moment that most ensemble perception researchers study. The second statistical moment is variance, an aggregate measure of how much individuals deviate from the mean of a group—the study of which was popularized more recently by, for example, Morgan et al. (2008). The third statistical moment is skewness, a measure of the degree and direction of asymmetry in a distribution. Few studies on the third statistical moment were found within the visual ensemble perception literature (Chetverikov, Campana, & Kristjánsson, 2016; Oriet & Hozempa, 2016). However, research on people’s sensitivity to skewness is more prevalent in the risk-taking and financial decision-making literature (e.g., Åstebro, Mata, & Santos-Pinto, 2015; Kraus & Litzenberger, 1976). Finally, the fourth statistical moment is kurtosis—the measure of the flatness, peakedness, or tailedness of a distribution (Westfall, 2014). Research on the visual ensemble perception of the third and fourth statistical moments is scarce. Due to this scarcity, this tutorial review will only discuss mean and variance.
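For readers who prefer a computational reference, the sketch below (a minimal illustration, assuming NumPy and SciPy; the sample values are invented) shows how the four moments are computed for a set of values, such as the sizes of items in an array.

```python
# A minimal sketch of the four statistical moments for a sample of values.
# The array contents are arbitrary illustrative numbers, not data from any study.
import numpy as np
from scipy import stats

sizes = np.array([3.1, 3.5, 2.8, 4.0, 3.3, 5.2, 2.9, 3.7])

mean = sizes.mean()            # 1st moment: central tendency
variance = sizes.var(ddof=0)   # 2nd moment: spread around the mean
skewness = stats.skew(sizes)   # 3rd moment: asymmetry of the distribution
kurt = stats.kurtosis(sizes)   # 4th moment: tailedness (reported as excess
                               # kurtosis, i.e., 0 for a normal distribution)
print(mean, variance, skewness, kurt)
```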

Individual representations

First and foremost, all statistical moments are calculated using individual data points from a distribution and represent different relationships between those individual data points. Before discussing people’s perceptions of the first two statistical moments, we should discuss the relationship that ensemble perception has with individual items in an array. Ensemble perception is thought to economize the limited capacity of our visual system by not requiring full encoding of each individual item in an array (Chong & Treisman, 2003).

Other important components of the operational definition of ensemble perception juxtapose ensemble representations with representations of individual objects. First, single-item recognition is not required for ensemble coding, and it is often the case that individuals in an array cannot be reported (e.g., Ward, Bear, & Scholl, 2016). What is more, ensemble representations can be more vivid and accurate than those of individuals. For example, size discrimination is more accurate for arrays of same-sized circles than for individual circles (Chong & Treisman, 2003). This effect might be appreciated as follows. Estimating the size of a circle necessarily entails measurement error. Having available multiple samples of the same-sized circle at multiple spatial locations provides an opportunity to measure the same circle repeatedly. The uncertainty of the mean of such measurements (i.e., the standard error) is by definition smaller than that of any single measurement. Hence, discrimination sensitivity is higher with multiple circles than with a single circle on each side. Second, ensemble representations can be extracted as fast as or faster than individual objects can be recognized or searched for (e.g., Haberman & Whitney, 2009). For example, people perceive characteristics of crowd behavior (i.e., average speed and average heading direction) better than those of individuals (i.e., the speed and direction of a particular individual in a crowd; Sweeny, Haroz, & Whitney, 2013). One way to summarize an ensemble of individuals is numerosity.
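The standard-error argument above can be made concrete with a small simulation; the sketch below is illustrative only, with arbitrary noise levels and sample sizes standing in for perceptual measurements.

```python
# Sketch: repeated noisy "measurements" of the same circle size.
# The standard deviation of the mean shrinks as 1/sqrt(n), so an array of
# n identical circles supports finer size discrimination than one circle.
import numpy as np

rng = np.random.default_rng(0)
true_size, noise_sd = 10.0, 1.0  # arbitrary units

for n in [1, 4, 16]:
    # simulate many trials, each averaging n noisy measurements
    means = rng.normal(true_size, noise_sd, size=(100_000, n)).mean(axis=1)
    print(n, round(means.std(), 3))  # approx. noise_sd / sqrt(n): 1.0, 0.5, 0.25
```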

Numerosity, or sample size

Numerosity, or quantity, provides summary information about our environment. Being able to approximate number or count objects has undeniable ecological value, and our ability to do so has evolutionary roots (e.g., Brannon & Merritt, 2011). However, researchers of ensemble perception do not usually list numerosity as a summary statistic (cf. Khvostov & Utochkin, 2019; Solomon & Morgan, 2018; Valsecchi, Stucchi, & Scocchia, 2018; “On the other hand, estimation involves an approximation of the number of different items or a sense of ensemble statistics”: p. 1, Chong & Evans, 2011). Our ability to extract information about numerosity from our environment is studied mostly by researchers of numerical cognition. In that field, the approximate number system (ANS) refers to the cognitive system that supports the approximation of quantities (see Odic & Starr, 2018, for a review). The acuity of this system, measured through nonsymbolic numerical comparison tasks (i.e., which array has more items), develops throughout childhood (M. Piazza & Izard, 2009) and can be improved through training even in adulthood (Cochrane, Cui, Hubbard, & Green, 2019). While we are able to estimate quantities of fewer than four very accurately without counting, an ability referred to as subitizing (see Katzin, Cohen, & Henik, 2019, for a review), our enumeration of quantities above 10 follows Weber’s law (Revkin, Piazza, Izard, Cohen, & Dehaene, 2008). Mathematically, numerosity is involved in the calculation of the mean, the perception of which is popular in ensemble perception research.
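One common way to formalize why enumeration follows Weber’s law is a scalar-variability model, in which the noise of the internal estimate grows in proportion to the quantity itself, so that discriminability depends on the ratio of the two quantities. The sketch below is a minimal illustration of that idea, with an invented Weber fraction, and is not a model taken from any of the cited studies.

```python
# Sketch of a scalar-variability model of approximate number comparison.
# Internal estimates are Gaussian with SD proportional to the true number,
# so 10 vs. 12 (ratio 1.2) is about as hard as 50 vs. 60 (also 1.2).
import numpy as np

rng = np.random.default_rng(1)
w = 0.15  # Weber fraction (illustrative value)

def prop_correct(n1, n2, trials=100_000):
    """Proportion of trials on which the larger array is correctly chosen."""
    e1 = rng.normal(n1, w * n1, trials)
    e2 = rng.normal(n2, w * n2, trials)
    return np.mean((e1 > e2) == (n1 > n2))

print(prop_correct(12, 10))  # same ratio as below...
print(prop_correct(60, 50))  # ...so roughly the same accuracy
```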

Mean perception

The ability to extract the mean value of a visual feature from a set of items spans from low-level features, such as size (e.g., Allik, Toom, Raidvee, Averin, & Kreegipuu, 2013; Ariely, 2001; Corbett & Melcher, 2014; Corbett, Wurnitsch, Schwartz, & Whitney, 2012; Im & Halberda, 2013; Luo & Zhao, 2018; Myczek & Simons, 2008; Tiurina & Utochkin, 2019; Tokita, Ueda, & Ishiguchi, 2016), line length (Bauer, 2017; Utochkin, Khvostov, & Stakina, 2018), orientation (Solomon, 2010; Utochkin et al., 2018; Witt, 2019), and hue (Maule & Franklin, 2015, 2016; Michael, de Gardelle, & Summerfield, 2014; Tong et al., 2015), to high-level features, such as emotion and gender (e.g., Haberman & Whitney, 2007, 2009), facial expressions (e.g., Griffiths, Rhodes, Jeffery, Palermo, & Neumann, 2018; Li et al., 2016; Wolfe, Kosovicheva, Leib, Wood, & Whitney, 2015), and lifelikeness (e.g., Yamanashi Leib, Kosovicheva, & Whitney, 2016). Size is perhaps the most studied feature and has been tested with various stimuli: typically dots/circles (e.g., Ariely, 2001; Chong & Treisman, 2003), but also concrete illustrations of strawberries and lollipops (Yang, Tokita, & Ishiguchi, 2018).

Out of all four statistical moments, one may argue that the ability to extract the mean of a distribution from a visual scene is the most relevant for everyday life, and most ecologically grounded. For example, the ability to represent size and color distributions of berries would have been very useful when our hunter-gatherer ancestors were scanning for bushes that had the biggest, ripest, and most berries. When encountering other tribes and assessing whether they posed a threat, the ability to quickly assess the distribution of emotions, such as friendly or angry expressions, would support a tribe’s survival and facilitate befriending other tribes.

Factors that influence mean perception

Adaptation to an array biases participants’ mean perception of a subsequent array (e.g., Corbett et al., 2012). For example, Corbett and Melcher (2014) had participants adapt to patches of small-sized and large-sized dots in opposite regions of the screen (left/right or top/bottom) and asked participants to judge which of two test dots (one per tested region) was bigger. Participants tended to perceive the subsequently viewed test arrays as having an average that was biased towards the opposite end of the feature spectrum from the adapted array. Ying, Burns, Lin, and Xu (2019) found that adapting to a group of unattractive faces made participants judge subsequent faces to be more attractive than they would have without the adaptation.

Other factors that influence mean perception can be divided into those concerning the items within an ensemble and how they compare with other items in the ensemble, and those describing the ensemble’s composition or feature distribution. As for the items within ensembles, set size has been shown to influence perceptual averaging performance. For small set sizes of two to three items, viewers show a lot of uncertainty, taking longer to estimate the mean of the ensemble and often being very inaccurate—their estimates deviating more from the true mean value than estimates for larger set sizes. Typically, viewers perform better with more items in the ensemble (Allik et al., 2013; Baek & Chong, 2020; Haberman & Whitney, 2010). However, beyond a certain set size, accuracy of mean estimation no longer improves (Dakin, 2001; Robitaille & Harris, 2011), leading some researchers to believe that only a subset of items is actually used for estimating the mean, with the number of incorporated items being constant regardless of set size (Simons & Myczek, 2008) or exponentially decreasing as the set size increases (Dakin, 2001). However, some studies have found relatively constant sensitivity to the mean with increasing set size (Alvarez, 2011; Ariely, 2001; Chong & Treisman, 2005), with a few cases where averaging ability actually decreases with increasing set size (Ji & Pourtois, 2018, when large variance exists in the ensemble).

As for the factors relating to an ensemble’s composition, measures related to the ensemble’s extreme values seem to influence mean estimation. For example, an ensemble’s range changes the difficulty of a mean estimation task. Specifically, Maule and Franklin (2015) found that increasing the range of colors used, measured in units of just-noticeable differences (JNDs; Witzel & Gegenfurtner, 2013), decreased the accuracy and speed of averaging hue (e.g., Experiment 1: “green” to “blue”; Experiment 2: at the most extreme, the mean of equally spaced [in JNDs] color hues between “pink” and “light blue” is “yellowish green”). A post hoc analysis of whether the number of color category boundaries explains the decrease in performance with increased range revealed that it was the perceptual differences between items that affected performance.

While an ensemble can have a huge range, extreme values may be discounted if there are too few of them. Those items are regarded as outliers. Instead of being incorporated into the mean estimation, outliers seem to be discarded or weighted less when estimating the mean, as in the case of emotional outliers (Haberman & Whitney, 2010). The perceived means, in turn, better reflect the mean of the majority of the array’s items and are not pulled towards the outliers as one would expect from exact computation. This stands in stark contrast to what students typically do when asked to estimate the average of numbers, which is to overweight the outliers (Anderson, 1968). This tendency to underweight outliers in our visual environment may make sense from an ecological perspective. One or two angry people in a crowd ought not to change one’s decision on how to deal with a foreign crowd that is otherwise consistently friendly. Instead, one has many opportunities to build good relations with the new group and needs only to avoid the select few with whom conflicts may arise.

People may have a preference for the mean, as suggested by the literature on perceived attractiveness, in which people with average facial features are judged more attractive than people with more unique/idiosyncratic facial features (e.g., Langlois & Roggman, 1990). For example, de Fockert and Wolfenstein (2009) investigated the extraction of mean identity by showing participants a set of four faces and then asking whether a test face was present in the previously seen set. Participants responded yes to faces that were the morphed mean (of all four faces and not previously presented) more often than to faces that were actually presented in the previously seen set. This result highlights how important the mean is to the brain, so much so that the brain created a false memory of the mean being in the set. It also suggests that our memory for the mean is clearer and more accessible than our memory for any one of the individuals (Posner & Keele, 1970). In another study, when shown a sequence of items and then asked which of two stimuli was a member of the set, participants preferentially chose the members that were the mean or close to it, and this was the case for various types of stimuli: circle size, line orientation, and circle brightness (Khayat & Hochstein, 2018). Finally, not only does the mean get classified as being part of a previously seen set, but the mean also seems to influence the memory of items in that set. Utochkin and Brady (2020) found that participants’ memory for individual orientations was biased towards the mean. The false memory that the mean was previously seen in the set may relate to why students believe that the average of a data set exists within that set (Batanero, Cobo Merino, & Diaz, 2003). Students may bring their previous experiences with them into a statistics course and use those experiences to relate to the curriculum. Therefore, the misconception that the mean exists in a data set may reflect general beliefs or assumptions about the world—that the prototype of a category exists as a real and previously seen object.

Finally, mean perception is affected by the variance of an ensemble. We will discuss this relationship in more detail later, but first, we will discuss variance perception and factors that influence it.

Variance perception

The perception of the mean has been shown to be influenced by factors that have some relationship with variance, such as range, existence of outliers, and heterogeneity (e.g., Chong & Treisman, 2003; Marchant, Simons, & de Fockert, 2013), leading researchers to manipulate variance directly and study variance perception within ensemble perception. There is evidence that our visual system can extract the variance of an array. This has been seen in the reproduction of the variance of a previously seen array, such as with facial emotion (Haberman et al., 2015), and discrimination of the variances in two arrays, such as with orientation (Morgan et al., 2008; Solomon, 2010; Yang et al., 2018), gray hues (Tong et al., 2015), color hues (Bronfman et al., 2014), and motion direction (Suárez-Pinilla, Seth, & Roseboom, 2018).

Our sensitivity to variance may have been an evolutionary advantage. Just as there is a preference for the mean, such as for the mean face (de Fockert & Wolfenstein, 2009), people also have preferences for low variance. People are more likely to categorize themselves as part of a low-variance group if their reported emotion is close to the group average and as part of a high-variance group if their reported emotion is much different from the group average (Goldenberg, Sweeny, Shpigel, & Gross, 2020). In each case, the chosen group optimizes one’s chance for acceptance within the group (by fitting in and by not standing out, respectively).

Unfortunately, there is limited literature on variance perception. This could be due in part to the mean being the more evolutionarily useful summary statistic to extract in the visual modality. In the auditory modality, by contrast, it may have been more evolutionarily advantageous to develop an ability to extract other summary statistics, like variance and skewness, for situations like assessing the number of approaching herds and inferring the level of threat when it is hard to see.

Factors that influence variance perception

Given the paucity of literature on variance perception, there is limited knowledge about the factors that influence it. As with mean perception, adapting to variance biases subsequent variance perception away from the adapted stimulus (as with orientation and color hue: Maule & Franklin, 2019). Spatial locations and spatial relationships between individual items also seem to affect variance perception. For example, participants have more difficulty discriminating between small variance differences when the items are crowded in the display (Solomon, 2010). Which individual items are available near the fixation point also seems to matter. For example, in Lau and Brady (2018), items with extreme values, such as minimums and maximums that give a sense of the range of features, were displayed either close to or far from fixation. When these extreme values were close to fixation, participants seemed to rely on the range to estimate variance, because they performed worse when range and variance were incongruent. In contrast, when the extreme values were far from fixation, participants’ variability discrimination was not affected by range, possibly because they relied on the range of values near fixation and did not gather complete information about the full range.

Relationship between mean and variance

The relationship between the perception of mean and the perception of variance is still relatively unclear. How do the two relate to each other? Do they draw on similar mechanisms? Are they dependent on each other? Theoretically, we know that neither summary statistic—mean or variance—gives us information about the other without knowing the individuals that make up the set, so assessing one will not replace assessing the other. More importantly, we cannot infer whether two distributions are likely to be significantly different from each other without both pieces of information. While students’ visual systems may incorporate both mean and variance when determining whether two groups are the same, the need for both summary statistics (i.e., the statistical terms) is not so clear to students, who often think that the mean alone is enough to determine whether two data sets came from separate populations (Castro Sotos, Vanhoof, Van den Noortgate, & Onghena, 2007). Just as these two pieces of information work together to give us information about distributions, it also appears that mean perception and variance perception support each other.
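The statistical side of this point can be made explicit with a two-sample t test, in which both the difference in means and the variability of the groups feed into the verdict; the sketch below uses invented samples to show that the same mean difference can be detectable or ambiguous depending on the variance.

```python
# Sketch: deciding whether two groups differ requires means AND variances.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a_low = rng.normal(0.0, 1.0, 200)
b_low = rng.normal(0.5, 1.0, 200)    # 0.5 mean gap, low variance
a_high = rng.normal(0.0, 8.0, 200)
b_high = rng.normal(0.5, 8.0, 200)   # same 0.5 mean gap, high variance

print(stats.ttest_ind(a_low, b_low))    # p-value typically tiny: gap detectable
print(stats.ttest_ind(a_high, b_high))  # p-value typically large: same gap, ambiguous
```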

Keeping one of the two parameters constant seems to benefit the perception of the other. This is particularly the case for keeping the variance constant when having participants perceive the average visual feature of an array. For example, participants respond to average discrimination tasks faster when they are first exposed to an array with the same variance as the subsequent array from which they need to extract the average (Michael et al., 2014). Constant variance also serves as a buffer, as sensitivity to the mean is then less influenced by the shape of the statistical distribution (e.g., uniform, normal, bimodal; Allik et al., 2013; Chong & Treisman, 2003; Haberman & Whitney, 2009). It is interesting that our visual system is robust to the shape of distributions, compared with how much the shape of a distribution affects students’ assessment of variance in histograms (Cooper & Shore, 2008; Kaplan, Gabrosek, Curtiss, & Malone, 2014; Meletiou-Mavrotheris & Lee, 2002). This difference suggests that the representation of data in histograms is not intuitive to our visual system.

Varying one of the parameters seems to reduce performance in perceiving the other. For example, increasing the variance of an array reduces sensitivity to the mean of the array (Dakin, 2001; Fouriezos, Rubenfeld, & Capstick, 2008; Haberman et al., 2015; Im & Halberda, 2013; Morgan et al., 2008; Solomon, Morgan, & Chubb, 2011). Likewise, varying the mean of arrays was found to reduce participants’ sensitivity to variance and increase perceived variance (Tong et al., 2015). In this particular study, participants compared two 9 × 9 grids of gray tones and were asked to choose the grid with more grayscale variability. Participants performed better when the mean gray tone matched between the two grids than when they were different.

While it appears that mean perception and variance perception support one another, it would be inaccurate to claim that the two perceptions depend on each other. First, perception of the mean heading of a crowd is tolerant to variability in the crowd’s direction and speed (Sweeny et al., 2013), which may be a specific characteristic of higher-level visual features. Second, some effects that provide evidence that people do process variance, such as the aftereffect from adapting to variance, are robust to changes in mean (Norman, Heywood, & Kentridge, 2015).

Research directly investigating the relationship between mean perception and variance perception is scarce. Recently, a study found an interaction between mean and variance representations, whereby adapting to high variance made discrimination of mean orientation more precise, and adapting to a mean orientation of 15°, which resulted in a tilt aftereffect, led to overestimation of orientation variance (Jeong & Chong, 2020). The authors concluded that mean and variance representations are interrelated because perceived variance changed with perceived mean orientation.

Data visualization

Bar graphs, line graphs, and pie graphs are some of the oldest and most commonly used graphs (see Friendly, 2008, for a history of data visualizations). However, modern technology has enabled us to collect more data and has thus necessitated innovations in graphs. In order to better convey trends and patterns in large data sets, researchers have been developing and testing new visualizations. Graphical perception refers to our ability to accurately interpret data values and different statistics (e.g., mean, variance, range) from visualizations (e.g., Cleveland & McGill, 1984; Saket, Srinivasan, Ragan, & Endert, 2018). Historically, research on graphical perception focused on people’s abilities to accurately read data values from different graphs and to compare data values within a graph type. Data values are represented in graphs through marks (e.g., dots, lines, bars), and the decoding of these marks is discussed in terms of visual features (e.g., position, length). Though the reading of some marks may be intuitively attributed to certain visual features (e.g., dots = position, line = length), we later discuss that this is not necessarily the case for all tasks.

In early work, dots were used to study position, lines for length, and typically circles or rectangles for area. Cleveland and McGill (1986) used these marks in their pioneering work on graphical perception. In one of their studies, participants compared positions, lengths, angles, slopes, and areas by estimating the percentage of a target relative to a standard (Cleveland & McGill, 1986). Ranked from most to least accurate, participants’ performance was: position, then length, then angle and slope, and finally area, with significant differences distinguishing the rank levels. Heer and Bostock (2010) were able to replicate these rankings in a crowdsourced experiment. Subsequently, research on graphical perception looked to vision science for inspiration on frameworks and methodology to better quantify people’s abilities, explain results, and give design recommendations.

Applications of vision science

Researchers often decompose graph reading into “elementary perceptual processes,” such as detecting the orientation of a line for line graphs, the height of a bar in bar graphs, or the size of an angle in pie graphs (Larkin & Simon, 1987; Simkin & Hastie, 1987). It is upon these elementary perceptual processes that researchers try to explain the advantages and disadvantages of using different graphs and data visualizations, and to make design suggestions.

Researchers incorporated psychophysics to create models that explain human performance and to develop metrics that predict the effectiveness of a generated visualization. Experiments utilized more trials and adaptive staircase methods (e.g., Rensink & Baldridge, 2010), as well as different task types for measuring precision: discrimination trials (e.g., Rensink & Baldridge, 2010) and magnitude (re)production trials (e.g., Saket et al., 2018). Saket et al. (2018) found that participants’ accuracy at adjusting interactive marks to match a reference followed the same ranking as in Cleveland and McGill (1986), except for a significant difference between length and angle: position, length, distance, angle, curve, area, and then shading.

The different marks used in graphs, such as dots, lines, and rectangles, have different levels of precision in their representations (e.g., Bair, 2005; Palmer, 2002). Starting from the higher end of precision, we have dots that represent spatial locations, then lines for lengths, and then rectangles for areas (e.g., Bair, 2005; Palmer, 2002). This order matches the accuracy levels found in comparing data values using the respective graph marks (Yuan, Haroz, & Franconeri, 2019). People are also better at estimating the average diameter (a length) of circles than their average area, and better at judging average diameters and areas than cumulative diameters and areas (Raidvee, Toom, Averin, & Allik, 2020). Because discriminability of total area worsens as the number of items increases, these authors suggest that bubble maps make circle diameter, rather than area, the dimension proportional to data values, since values encoded in area may be perceived less accurately. Indeed, it is difficult to compare circles by area in bubble charts (Heer & Bostock, 2010), and bubble charts can be used to misrepresent data (Kong, Heer, & Agrawala, 2010).
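The bubble-map recommendation follows from simple geometry: if a circle’s area is made proportional to the data value, its diameter grows only with the square root of the value, compressing visible differences. The sketch below (illustrative function names, arbitrary scale) contrasts the two encodings.

```python
# Sketch: encoding a data value in a bubble's diameter vs. in its area.
import math

def diameter_from_area_encoding(value, scale=1.0):
    # area = scale * value  =>  diameter = 2 * sqrt(area / pi)
    return 2 * math.sqrt(scale * value / math.pi)

def diameter_from_diameter_encoding(value, scale=1.0):
    return scale * value

for v in [1, 2, 4]:
    print(v,
          diameter_from_diameter_encoding(v),
          round(diameter_from_area_encoding(v), 3))
# Doubling the value doubles the diameter under diameter encoding,
# but multiplies it by only sqrt(2) (about 1.41) under area encoding.
```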

Later work focused more on measuring people’s sensitivity to differences in the visual features of marks. This focus allowed researchers to develop just-noticeable-difference formulas (e.g., Rensink & Baldridge, 2010; Szafir, 2018) and quantitative metrics to advise on using visual features more effectively (e.g., as with color in Szafir, 2018, and parameters of scatter plots in Micallef, Palmas, Oulasvirta, & Weinkauf, 2017). Modeling techniques are used to deduce the visual features used to accurately read visualizations (e.g., Jardine, Ondov, Elmqvist, & Franconeri, 2020) and to isolate the contributions of different visual features (e.g., scale, density, curvature) to graphical perception (e.g., scatter plots: Sedlmair, Tatu, Munzner, & Tory, 2012; high-dimensional data visualizations: Bertini, Tatu, & Keim, 2011). These models then inform design decisions and recommendations.

Data value perception

The precision of reading values from a bar graph also depends on context. It has been well documented that participants form biased impressions when judging the heights of bars in bar graphs (Jarvenpaa & Dickson, 1988; Kosslyn, 2006; Peebles, 2008; Zacks, Levy, Tversky, & Schiano, 1998). Compared with line graphs and Kiviat charts (also called radar charts or spider charts; see Fig. 1 for an example), values in bar graphs are systematically underestimated (Peebles, 2008). However, bar graphs do prevail over line graphs in a particular context: when relationships between three variables need to be conveyed, bar graphs are preferable to line graphs, since participants make fewer errors when interpreting them (e.g., Ali & Peebles, 2013; Peebles & Ali, 2009).

Fig. 1 Simplified examples of common data visualizations

Researchers have tried to infer which visual features of marks people use to compare data values in different graphs based on their experimental results. For bar graphs, participants could be comparing the spatial locations of the tops of bars, the heights of bars, or the areas of bars (assuming equal width). Yuan et al. (2019) concluded that people use a spatial position encoding rather than a length encoding because their participants performed as well at comparing pairs of bars in bar graphs as at comparing pairs of data points in dot plots. Compared with pie graphs, bar graphs can be read faster and more accurately (Simkin & Hastie, 1987). This difference in performance may be due to the fact that any of the possible encodings of bars (spatial location, height, or area) can be compared directly, whereas pie angles are more difficult to compare with each other because the angles do not share a common starting orientation comparable to the common starting level of bar heights.

Results from Huestegge and Pötzsch (2018) support this explanation, since participants performed better (shorter response times and lower error rates) on proportion judgments with rectangles in tree maps than with slices in pie graphs. Bar graphs and tree maps provide some common point of reference for comparing their components: bars in bar graphs start from the same baseline, and rectangles in tree maps typically share some dimension with adjacent rectangles. The structure and organization of rectangles in tree maps may be essential to their readability. When squares or rectangles of extreme aspect ratios were used in tree maps, area estimation became less accurate (Kong et al., 2010). Eye-tracking data suggest another explanation for the performance difference between tree maps and pie graphs: greater ease of scanning in tree maps than in pie graphs (Huestegge & Pötzsch, 2018).

As with tree maps, there are several visual cues people could use to make proportion judgments in pie graphs: the area, arc length, or angle of pie slices. Skau and Kosara (2016) tested this possibility by designing graphs that separated the encodings of area, arc length, and angle. Performance was negatively impacted by providing only angle information, but not by providing only arc length or area. Even when area was hard to decode, as with thin donut graphs, performance remained good. These results led the authors to conclude that arc length is the most important encoding and angle the least important. This makes sense from an evolutionary standpoint, since circumference and arc length may be more meaningful and commonly assessed (e.g., tracking the phases of the moon), whereas objects that resemble pie graphs may not occur naturally. Further investigation into the essential components of pie graphs revealed the importance of their off-centered representation of proportions (Kosara, 2019). Variations of pie graphs that used centered shapes to represent proportions produced more error and consistent overestimation, which Kosara (2019) believed was due to increased difficulty in making area comparisons.
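The three candidate encodings that Skau and Kosara (2016) separated are tied together by simple geometry; in a standard pie graph they covary perfectly, which is why dedicated stimuli were needed to decouple them. The sketch below is an illustrative summary of that geometry, not code from the study.

```python
# Sketch: the three redundant encodings of a pie slice showing proportion p.
import math

def pie_slice_encodings(p, r=1.0):
    angle = 2 * math.pi * p           # central angle, in radians
    arc_length = 2 * math.pi * r * p  # share of the circumference
    area = math.pi * r ** 2 * p       # share of the disk's area
    return angle, arc_length, area

print(pie_slice_encodings(0.25))  # a quarter slice
# Variants such as thin donut graphs keep arc length while removing most
# of the area cue, allowing tests of which encoding viewers rely on.
```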

Incorporation of ensemble perception

Research on ensemble perception can be applied to human–computer interaction to convey information more effectively. Data visualizations encompass many of the visual features studied in ensemble perception, such as dot size in bubble maps and orientation in weather maps (see Fig. 1 for examples). There is an undeniable parallel between perceiving summary statistics from visual arrays typical of ensemble perception, such as line length, orientation, and (circle) size, and perceiving summary statistics from objects typical of graphs, such as points, lines, circles, and rectangles. In fact, we observe findings with graphs similar to those with comparable features in ensemble perception studies. In particular, the factors that influence performance when comparing mean values of two visual arrays also influence performance when comparing mean values between two graphs. For example, in bar graphs, the number of bars and the amount of variation in their heights influence participants’ ability to discriminate between means (Fouriezos et al., 2008). This is similar to the differences in performance observed with different set sizes for visual arrays (i.e., greater uncertainty when there are only a few items in an array vs. many; Dakin, 2001). It is also similar to the decreased performance observed when there is a lot of variation amongst items in an array (e.g., Chong & Treisman, 2003; Marchant et al., 2013). Ensemble perception plays a role in making insightful observations when reading graphs and in making effective design decisions for presenting information. For example, one may quickly estimate the mean of a graph’s objects (e.g., dots, circles, bars) in order to create a reference point for making comparisons across other objects, which would aid in identifying key points of interest. Along the same lines, presenters can use this knowledge to present key information in an intuitive way, thereby guiding the viewer to the relevant observations.

Ensemble coding tasks

Researchers in data visualization have drawn upon ensemble perception research to categorize and understand the processes involved in interpreting data visualizations. Szafir, Haroz, Gleicher, and Franconeri (2016) proposed four different types of ensemble coding in data visualizations, which describe the different visual aggregations that a viewer may perform with visualized data: summary, identification, pattern recognition, and segmentation. Summary tasks include assessing numerosity in tagged text or assessing the mean in bubble maps or weather maps. Identification tasks include locating outliers and extrema and assessing range. Pattern recognition refers to trend assessment, such as correlation in scatter plots and silhouettes in weather maps. Segmentation refers to organizing data into subsets, such as clustering or categorization. Viewers’ accuracy when performing these tasks may depend on the visual features used in the data interpretation.

Decomposing graph reading into elementary perceptual processes and attributing the advantages and disadvantages of different graphs to different visual features (e.g., spatial location, length, area) is not as straightforward when the task requires integrating across or summarizing trends from the data (Carswell, 1992; Ratwani, Trafton, & Boehm-Davis, 2008). Likewise, recommending a visualization for highlighting trends is not straightforward: the effectiveness of a visualization depends on the task (e.g., Albers, Correll, & Gleicher, 2014; Saket, Endert, & Demiralp, 2018). Saket et al. (2018) compared people’s speed and accuracy at tasks such as finding outliers, clusters, and maxima/minima and estimating means and correlations across tables, line graphs, bar graphs, scatter plots, and pie graphs. They made separate recommendations depending on which statistical property one wants to highlight. For example, they recommended bar graphs for clusters, line graphs for correlations, and scatter plots for outliers.

Just as mean perception is a popular research topic in ensemble perception, people’s ability to judge mean values from graphs is a popular research topic in data visualization. Researchers have assessed mean perception in commonly used graphs—namely, scatter plots, bar graphs, and line graphs. When asked to judge the mean values in bar graphs and scatter plots, participants’ estimates were systematically lower for bar graphs, but not for scatter plots (Godau, Vogelgesang, & Gaschler, 2016). The authors attributed this finding to the fact that scatter plots represent data values at different spatial locations, while bar graphs represent the same data values in a less precise dimension: area. It is this imprecision, the authors believe, that made participants systematically underestimate mean values from bar graphs. People are exceptionally good at judging the average value in scatter plots, even with increasing set size, redundant and conflicting encodings, and additional sets (Gleicher, Correll, Nothelfer, & Franconeri, 2013). This ability could be explained by the fact that centroid localization—the ability to accurately locate the center of gravity of a dot cluster (Badcock & Westheimer, 1985; D. Whitaker & Walker, 1988; Wright, Morris, & Krekelberg, 2011)—is a visual hyperacuity, in which performance has substantially lower thresholds than the grain of the receiving layer (Westheimer, 2010). When asked to judge whether a value was larger or smaller than the average of the graph, participants systematically underestimated values only for bar graphs, showing no systematic bias with Kiviat charts or line graphs (Peebles, 2008).

The visual features (e.g., spatial position, length, area) that people use to encode marks for summary tasks may differ from those used for comparing data values. Yuan et al. (2019) concluded that participants use a spatial position encoding when comparing data values in a bar graph. However, their results from a summary task suggest that participants use a different encoding there. For the summary task, participants were asked to compare the average values of two groups of bars, where the two groups had either the same number (i.e., six vs. six) or different numbers (i.e., six vs. 10) of bars. The authors concluded that participants were using length encodings for this task. For one, participants’ performance was worse than what would be expected if they had still used spatial locations to estimate the averages. Second, the comparison condition with unequal numbers of bars helped discount area encoding as a possible interpretation: because an unequal number of bars makes greater cumulative area a poor proxy for greater average, we can be more certain that participants did average. However, averaging here may be a less natural task than summation, because the adjacency of the bars may encourage the perception of one object instead of an ensemble of objects.

Given the ubiquitous use of scatter plots and histograms for teaching statistics, the findings discussed above should be taken into consideration when planning lessons and kept in mind when addressing students’ misconceptions. Beyond being able to extract summary statistics and identify trends and patterns in graphs, statistics students also need to learn how to interact with inferential uncertainty.

Insights from uncertainty perception

Uncertainty perception refers to people’s perception of variance in data visualizations and their judgment of the (un)certainty of trends or predictions displayed in data visualizations. Many real-world problems require uncertainty to be represented in data visualizations. For example, viewers need to be reminded that weather and storm forecasts are predictions and carry some level of uncertainty. Conveying uncertainty in data visualizations is a difficult problem for visualization designers, who have used very diverse approaches. For example, in the context of meteorology, contour lines have been used to convey atmospheric pressure predictions (Stephenson & Doblas-Reyes, 2000) and arranged to showcase the median, spread, and outliers of temperature fields (R. T. Whitaker, Mirzargar, & Kirby, 2013). An ensemble of spatial positions with concentric confidence intervals has been used for hurricane prediction plots (L. Liu, Mirzargar, Kirby, Whitaker, & House, 2015). However, the use of spatial spread can mislead viewers. For example, when an uncertainty cone was used for hurricane forecasts, viewers misperceived storm size as increasing with time and overestimated the likelihood that a storm would occur within the cone (Broad, Leiserowitz, Weinkle, & Steketee, 2007). L. Liu et al. (2017) used confidence interval bands (colored superimposed circles), splats, and icons to display storm predictions. They found that icon visualizations were better than the others for time-specific visualizations. Participants also estimated risk (an integration of likelihood and intensity) more accurately with icon visualizations.

These studies suggest that ensembles of discrete marks convey information about variance and uncertainty better than a (continuous) mark that summarizes the data set. The benefit of using ensembles of marks may derive from our abilities for ensemble perception: each mark maps to a data point, the possible outcomes are visible, and the variance can be extracted via ensemble perception. For summary displays, on the other hand, a more abstract notion of variance or uncertainty is used, and the viewer needs to make inferences about the originating distribution of data points using the representation of variance or uncertainty (e.g., error bars in bar graphs). These displays may also require more basic knowledge about statistics. The advantage of discrete marks over continuous marks that represent whole data sets relates to the common finding that people reason about probabilities better when probabilities are presented as discrete events (see McDowell & Jacobs, 2017, for a meta-analysis).

The literature on uncertainty perception in data visualizations could be a valuable contribution to resolving some of the problems in statistics education. Results from research on uncertainty perception may highlight the undeniable visual aspect of learning statistics and the predispositions people have when reading data visualizations. Inferential statistics is commonly taught in introductory statistics courses (Castro Sotos et al., 2007; Lavigne, Salkind, & Yan, 2008). Results of hypothesis testing are often reported using confidence intervals or shown in a bar graph where error bars represent standard errors. However, research shows that error bars and confidence intervals are often misinterpreted. People tend to find it more likely for data points to fall within the portion of the error bar inside the bars of a bar graph than within the portion outside the bars (Newman & Scholl, 2012)—an effect the authors attribute to object perception. People also assume that overlapping error bars mean a nonsignificant difference (Cumming, 2009) or that all values within the error bars are equally likely (Ibrekk & Morgan, 1987). Though many of the participants in these studies were crowdsourced or had little or no statistics background, experts can also have problems with error bars. For example, Belia, Fidler, Williams, and Cumming (2005) recruited authors from leading journals in psychology, behavioral neuroscience, and medicine to adjust means and error bars in a graph to represent statistical significance. The researchers found that the experts had misconceptions about how error bars relate to statistical significance and how confidence intervals and error bars differ.

Error bars and confidence intervals may be difficult to interpret because they represent uncertainty about a population parameter and, as such, appear narrower than other representations of variation. Prediction intervals, on the other hand, represent a range of predicted values for new observations and relate more directly to individuals in the population. When confidence intervals (CIs) were used in a graph to show two treatment options, participants tended to overstate the superiority of one treatment option, be more willing to pay for that option, and understate the variability of outcomes (Hofman, Goldstein, & Hullman, 2020). However, when the axes of the CI graph were rescaled to match those of the prediction interval (PI) alternative, in which variation in individual outcomes is represented, these biases were reduced (Hofman et al., 2020). This rescaling made the perceived difference between treatment options smaller, which translated to less certainty that the two options were significantly different. In that study, the visualizations that encoded variation in individual outcomes, namely prediction intervals and hypothetical outcome plots (HOPs; animations of discrete outcomes), yielded the most accurate judgments from participants.
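The difference between the two intervals is visible in their formulas: under a simple normal model, a confidence interval for the mean narrows as the sample grows, while a prediction interval for a new observation stays roughly as wide as the data themselves. The sketch below, with invented data, computes both.

```python
# Sketch: 95% confidence interval (for the mean) vs. 95% prediction
# interval (for a single new observation) under a normal model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(100.0, 15.0, 30)  # illustrative sample of 30 outcomes
n, m, s = len(x), x.mean(), x.std(ddof=1)
t = stats.t.ppf(0.975, df=n - 1)

ci = (m - t * s / np.sqrt(n),         m + t * s / np.sqrt(n))
pi = (m - t * s * np.sqrt(1 + 1/n),   m + t * s * np.sqrt(1 + 1/n))

print(ci)  # narrow: uncertainty about where the mean is
print(pi)  # wide: spread of individual outcomes
# Plotting only the narrow CI invites viewers to read it as the
# spread of individual outcomes, overstating the precision of effects.
```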

Error bars do not represent the underlying distribution. Because of how error bars look, many people think that they represent an interval of uniform probability (Ibrekk & Morgan, 1987). Commonly recommended alternatives to error bars include the violin plot and the gradient plot (e.g., Correll & Gleicher, 2014). Violin plots are believed to be easier to interpret correctly because they encode probability density as mark width (Barrowman & Myers, 2003; Kampstra, 2008). Gradient plots encode uncertainty using degree of transparency/opacity (Correll & Gleicher, 2014). These visualizations capitalize on our capacity to average line length and hue, respectively. Correll and Gleicher (2014) compared visualizations that encoded margin of error either binarily (error bars, box plots) or continuously (violin and gradient plots) and found that people were overconfident in their judgments of effects when using error bars compared with the other plots. Hullman, Resnick, and Adar (2015) compared people’s abilities to estimate properties of distributions (mean, probability of an outcome above a threshold, probability of an outcome between two thresholds) from error bars, violin plots, and HOPs. They found that people made more accurate estimates for multivariate distributions of two and three variables (either independent or correlated) when using HOPs. The authors believed HOPs were helpful because they allowed the use of counting and probabilities to infer properties of distributions. However, one drawback of HOPs was that participants made worse estimates of the mean when the distribution had high variance. Since HOPs display discrete outcomes over time, and we have the capacity for ensemble perception across time (e.g., Hubert-Wallander & Boynton, 2015), this drawback is not surprising: in the ensemble perception literature, estimation of the mean has also been found to be worse when arrays have higher variance (Dakin, 2001; Fouriezos et al., 2008; Haberman et al., 2015; Im & Halberda, 2013; Morgan et al., 2008; Solomon et al., 2011).

The proposed alternatives to error bars incorporate information about point estimates and probabilities. When uncertainty is displayed in visualizations, people make more optimal decisions when they have both point estimates and probabilities (Joslyn & LeClerc, 2013). Both pieces of information are important. Hullman et al. (2015) found that separating the visual marks that encode the underlying data from those that encode uncertainty may encourage participants to favor information about central tendency and consider probabilistic information less. Thus, it is recommended that visualizations of uncertainty be intrinsic to the representation instead of an extrinsic annotation of a distribution’s property (e.g., error bars or marks that locate the mean, median, or mode; Kay, Kola, Hullman, & Munson, 2016). For example, in a density plot, mode and probability density are intrinsic to each other because the mode is represented by the peak of the density. Kay et al. (2016) compared people’s probabilistic estimates using density plots, stripe plots, and dot plots (20 or 100 dots). They found that people’s probability estimates were less variable when dot plots used easily countable quantities (20 dots) and that performance with 100-dot plots was similar to that with density plots. The authors believed the countability of the dots made it easier to reason about hypothetical outcomes. These dot plots may be beneficial because they tap into our capacity for numerosity, which is less precise at larger numbers like 100 (Landy, Silbert, & Goldin, 2013), and into our capacity for reasoning about probabilities in terms of frequencies.
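A dot plot of the kind Kay et al. (2016) tested (often called a quantile dotplot) can be built by taking evenly spaced quantiles of the predictive distribution, so that each dot stands for an equal slice of probability. The sketch below, assuming a normal predictive distribution with invented parameters, generates the 20 dot values.

```python
# Sketch: dot values for a 20-dot quantile dotplot of a normal distribution.
# Each dot represents a 1-in-20 chance, so probabilities become countable.
import numpy as np
from scipy import stats

mean, sd, n_dots = 10.0, 2.0, 20             # illustrative parameters
probs = (np.arange(n_dots) + 0.5) / n_dots   # midpoints of 20 equal bins
dots = stats.norm.ppf(probs, loc=mean, scale=sd)

print(np.round(dots, 2))
# "What is the chance the outcome exceeds 12?" becomes a counting task:
print(np.sum(dots > 12) / n_dots)  # 3 of 20 dots -> 0.15 (true value ~0.16)
```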

Our survey of the data visualization literature aims to highlight the importance of considering visual influences on reading graphs and on identifying trends and patterns within them. One main takeaway is the limitations of graphical perception of bar graphs, which have direct implications for understanding students’ misconceptions about histograms, discussed in the next section. Other takeaways include the misperceptions of the graphical representations of inferential uncertainty typically used in statistics classrooms (i.e., error bars, confidence intervals) and the assistance that ensemble representations provide for accurate assessment of variance and uncertainty.

Statistics education

In the age of big data, it is more important than ever to have statistical literacy when entering the job market. Colleges prepare the next generation of workers by requiring one or more statistics courses in such majors as psychology, economics, and biology. However, many undergraduates still struggle to apply what they learned in their introductory statistics courses to real-life problem solving (e.g., Castro Sotos et al., 2007; delMas, Garfield, Ooms, & Chance, 2007; Lavigne et al., 2008). One contributing factor to students’ inability to apply these statistical concepts may be a shallow understanding or misunderstanding of statistical concepts that arises from not reading graphs properly. A lot of statistics instruction depends on the use of graphs, such as scatter plots, bar graphs, and, most importantly, histograms. While most students likely encountered scatter plots and bar graphs in their K–12 education, most are likely encountering histograms for the first time in an introductory statistics course.

In our discussion of the literature on statistics education, we will focus on histograms because they are a fundamental component of statistics education and a graph commonly studied by researchers in statistics education. Histograms were also selected because they are not as intuitive to our visual system as scatter plots and are, therefore, a research topic that could benefit from insights from visual perception. While researchers in statistics education often take a cognitive perspective on the difficulties students have with histograms, we want to start a discussion of possible perceptual difficulties. While many graphs may be difficult for students to interpret (e.g., Ali & Peebles, 2013), our discussion of histograms serves as an example of the types of conceptual problems that may arise from perceptual difficulty.

An example of using graphs: Histograms

The field of statistics aims to transform concrete or qualitative characteristics of the world into quantifiable measures and to convert these raw data into numerical or visual summaries that help us make inferences and draw conclusions about the world. This transformation of the concrete and observable into the abstract and intangible is perhaps the most challenging thing for students to grasp fully. Transnumeration, or transformations that change data representation, has been deemed a fundamental framework for statistical reasoning and for better understanding of variation, distribution, and other important statistical concepts (Wild & Pfannkuch, 1999). One example of these transformations is the histogram, which converts individual data points into a graphical representation.

Histograms are used as a graphical representation of the shape and variability of a distribution and as a visual tool for explaining many statistical concepts, such as spread and skewness. The understanding of histograms has also been regarded as necessary for developing an understanding of density curves of distributions (delMas, Garfield, & Ooms, 2004). In fact, histograms, as a graphical representation, provide perhaps the most natural transition to teaching the concept of density curves, as they already use area to represent proportions (delMas et al., 2004). The following are common misconceptions about histograms:

1. Histograms are (like) bar graphs (Biehler, 1997; Cooper & Shore, 2010; Meletiou-Mavrotheris & Lee, 2010).

2. Histograms display raw data, with each bar representing an individual observation rather than a grouped set of data (Lee & Meletiou-Mavrotheris, 2003).

3. Histograms display two variables, like scatter plots or time sequence plots (Lee & Meletiou-Mavrotheris, 2003).

4. The y-axis (where frequency is displayed) is used when finding the mean and mode of the distribution (i.e., the mode is the y-value of the highest bar in the histogram; Kaplan et al., 2014).

5. Differences in the heights of bars (y-axis) can be used to compare the variation of two histograms (Lee & Meletiou-Mavrotheris, 2003).

6. Flatter histograms display less variable data (Cooper & Shore, 2008; Kaplan et al., 2014; Meletiou-Mavrotheris & Lee, 2002).

These misconceptions about how to read histograms are especially alarming because being able to correctly interpret histograms is integral to understanding key concepts in statistics, such as central tendency. Misconceptions about histograms may contribute to misconceptions about summary statistics.

Misconceptions about summary statistics

Central tendency

Students have misconceptions about measures of central tendency: mode (Huck, 2009), median (Cooper & Shore, 2008), and mean (Olani, Hoekstra, Harskamp, & Van der Werf, 2010). When asked to identify the three measures of central tendency from a histogram, where frequency information is displayed on the y-axis, students make some common errors that stem from misinterpreting the information portrayed by the axes (reading the y-axis as data values and not frequency) and by the bars.

Some students misinterpret the mode, or the data value that appears most often in a data set, as the height of the highest bar, giving the y-value as the mode (Huck, 2009; Ismail & Chan, 2015). Others give the number that appears the most in the axis labels (Ismail & Chan, 2015).

For the median, students know that it refers to the midpoint of some numbers, but some give the median of the frequencies (or the middle bar), while others give the median of the numbers that appear on the x-axis (Cooper & Shore, 2008; Ismail & Chan, 2015). Some make procedural errors, such as knowing to divide some number by two to find a midpoint but choosing an irrelevant set of numbers to use and/or forgetting a step or two. For example, they may count the number of data values (i.e., an irrelevant set of numbers) on the x-axis and divide that by two, or add the frequencies and divide that by two but forget to use that answer to find the corresponding data value (Ismail & Chan, 2015). These kinds of errors showcase a complete misinterpretation of the histogram. Those who correctly add up all the frequencies and find the data value corresponding to the midpoint of the sorted data values are the minority: less than 10% in Ismail and Chan (2015).

Out of the three measures of central tendency, the mean is the most computationally intensive and perhaps the most dependent on understanding what exactly is portrayed in the histogram. It requires a series of multiplications (each data value by its frequency), then additions, and finally a division (by the number of data points). Ismail and Chan (2015) observed eleven different computational errors, including not using multiplication, using multiplication but with irrelevant numbers, and using irrelevant numbers as divisors. These errors all stem from misinterpreting the axes and misunderstanding what is portrayed by the bars.
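To make the chain of operations concrete, the following minimal sketch (with made-up values and frequencies, not data from Ismail and Chan, 2015) shows how the mode, median, and mean are correctly recovered from a histogram's frequency table; the comments flag the steps where the documented errors occur.

```python
# Hypothetical frequency table read off a histogram: data values on the x-axis,
# frequencies (bar heights) on the y-axis.
values = [1, 2, 3, 4, 5]
freqs = [2, 5, 9, 5, 2]

n = sum(freqs)  # total number of observations (23), not the number of bars

# Mode: the x-value under the tallest bar, not the bar's height (a common error).
mode = values[freqs.index(max(freqs))]  # 3

# Median: expand the table back into sorted raw data, then take the middle value;
# forgetting this final lookup step is one of the documented procedural errors.
raw = [v for v, f in zip(values, freqs) for _ in range(f)]
middle = n // 2
median = raw[middle] if n % 2 else (raw[middle - 1] + raw[middle]) / 2  # 3

# Mean: multiply each data value by its frequency, add, and divide by n,
# not by the number of bars, and not using the frequencies alone.
mean = sum(v * f for v, f in zip(values, freqs)) / n  # 69 / 23 = 3.0
```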

Mean

Precollege and college students have many misconceptions about the concept of average (e.g., Cai & Moyer, 1995; Mevarech, 1983; Pollatsek, Lima, & Well, 1981; Strauss & Bichler, 1988). In particular, it appears that while students know how to compute the average when given a set of numbers, they do not necessarily know how to apply the concept of average in word problems. For example, in Cai and Moyer (1995), middle school students were given a word problem: five people have blocks; given the average number of blocks across the five people and the number of blocks each of four of them has, how many blocks does the fifth person have? Some students incorrectly add up the number of blocks from the four people plus the average number and divide by five (Cai & Moyer, 1995). Misconceptions about the concept of average can be remedied through instruction. After instruction, Batanero, Cobo Merino, and Diaz (2003) found significant improvement in the understanding that the mean value may not exist as a data point and in the use of the mean algorithm to solve word problems like the one described above.
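For illustration, here is a small sketch with hypothetical numbers (not the actual problem from Cai & Moyer, 1995) contrasting the correct solution with the common incorrect strategy.

```python
# Illustration of the blocks problem with made-up numbers.
average = 6                      # given: mean number of blocks across all five people
four_people = [4, 5, 7, 10]      # given: blocks held by four of the five people

# Correct reasoning: the mean implies a known total across all five people.
total = average * 5              # 30 blocks in all
fifth = total - sum(four_people) # 30 - 26 = 4 blocks for the fifth person

# The common incorrect strategy adds the average in as if it were a data value:
wrong = (sum(four_people) + average) / 5  # (26 + 6) / 5 = 6.4, not a block count at all
```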

Variance

Variability refers to the variable attribute of an entity (e.g., that a population has variability in feature x), whereas variation, or variance, assesses or quantifies that attribute (e.g., as a summary statistic; Reading & Shaughnessy, 2004). The understanding of variability and variance is very important for statistical reasoning, as it lays the foundation to understanding distributions and statistical inference, such as hypothesis testing. However, students often have misconceptions about variability and variance, and reason about them incorrectly.

One common mistake involves the misinterpretation of histograms: looking at the y-axis for information about variability instead of the x-axis (Lee & Meletiou-Mavrotheris, 2003). As one might expect, misinterpreting the information portrayed in a histogram makes it very difficult for students to reason about variability correctly. For example, some students (~27%) may use the fact that the bar heights of one bell-shaped histogram are more variable than those of another to conclude that the former has greater variability (Cooper & Shore, 2008). Some students mistakenly connect the concept of variability with a concept more familiar to them: range. For example, some students (~20%) misunderstand "spread" (commonly referring to variance) to be the range of data values on the x-axis and conclude that any two histograms with the same range also have the same amount of variability, regardless of other features of the distribution (Cooper & Shore, 2008).
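A hypothetical pair of data sets makes the second misconception concrete: identical ranges do not imply identical variability.

```python
import statistics

# Two made-up data sets with the same range but different variability.
flat   = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # values spread across the whole range
peaked = [1, 5, 5, 5, 5, 5, 5, 5, 9]   # same range (1 to 9), mass bunched at the center

print(max(flat) - min(flat), max(peaked) - min(peaked))  # 8 8: identical ranges
print(statistics.stdev(flat), statistics.stdev(peaked))  # ~2.74 vs 2.0: different spread
```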

Another common mistake is using irrelevant features of a distribution to draw conclusions about its variability. For example, some students focus on the differences between the heights of bars in histograms, believing that variability depends on the number of data points or on the irregularity of the shape of the distribution (Inzunza, 2006). One reason students use irrelevant features to reason about distributions may be that novices have a tendency to be drawn to irrelevant features of a graph, not only for histograms but also for bar graphs (e.g., Ali & Peebles, 2013; Okan, Galesic, & Garcia-Retamero, 2016).

Just as students have difficulty using relevant features to reason about a distribution's variability, they also struggle with reasoning about a distribution's standard deviation. In particular, they have a difficult time identifying what information from a histogram they can use to assess standard deviation and how standard deviation relates to other measures. For example, when high school students were shown two histograms, one uniformly distributed across x-values and the other bell-shaped, some students used total frequency or the mean to assess whether the two graphs have the same standard deviation (Chan & Ismail, 2013). Using such features of a distribution to reason about its standard deviation suggests a deep misunderstanding of what standard deviation represents. Students' lack of conceptual understanding of standard deviation is well documented (e.g., Garfield & Ben-Zvi, 2008). It is commonly the case that students know how to calculate the standard deviation of a data set but do not know how to interpret the resulting number (Garfield & Ben-Zvi, 2008). These misconceptions about variance are concerning because many researchers agree that the concepts of variability and distribution support each other in students' learning of statistics (Bakker, 2004; Ben-Zvi, 2003; delMas & Liu, 2005; Makar & Confrey, 2003, 2005; Pfannkuch & Reading, 2006; Wild, 2006).

Perceptual barriers to reading histograms

Though we do not deny the possibility that students' difficulties with histograms may have predominantly cognitive roots, we think it is important to consider perceptual explanations because this perspective is typically overlooked amongst researchers in statistics education. While many of the discussed misconceptions about histograms seem to stem from not knowing which axis to use to find certain information, it is possible that the confusion is not that students do not remember, but that the dimensions are so counterintuitive that students have trouble parsing the information that is there, falling back on what feels more intuitive: reading it like a bar graph.

Perceptual difficulties should not be overlooked because they can affect even statistics experts. Statistics experts (e.g., graduate students, professors, researchers) seem to be susceptible to influences from the vertical dimension as well. This was suggested by worse performance (i.e., more errors, slower response times) consistent with use of the height heuristic (i.e., histograms with taller bars have greater means) when comparing histograms with the same mean and distribution shape but with scaled-up bar heights, though not with other types of histogram pairs (Lem, Onghena, Verschaffel, & Van Dooren, 2014). Even with experts, then, we see some level of perceptual interference with statistical reasoning. Lem et al. (2014) found that experts had slower response times when height information could bias one toward an incorrect response (incongruent trials) than when height information was sufficient for the correct response (congruent trials). This pattern of slower response times on incongruent trials is characteristic of the Stroop effect, which has also been found for incongruence between the magnitude of numbers and their font size in numerical magnitude comparison tasks (e.g., when comparing 3 with 5, people are slower to select 3 as the smaller number if it is printed in a larger font; Henik & Tzelgov, 1982). In Lem et al. (2014), choosing the correct answer in the histogram comparisons may have required suppressing irrelevant or distracting information and reasoning beyond the perceptually intuitive answer. However, that study also highlighted the importance of conceptual understanding: the undergraduates did not perform any better when they were given unlimited time to make the comparison and were warned about potential fallacies in graphs.

The literature on ensemble perception and data visualization could provide a basis for understanding what makes learning about distributions from histograms so difficult. Research with bar graphs suggests that people treat bars like objects and judge data values within the bars as more likely to be in the data set than values outside of the bars (Newman & Scholl, 2012). Likewise, people may be treating the bars within histograms like objects. When looking at a collection of rectangular bars in a histogram, we may automatically perceive the average height of the bars, which provides no meaningful information about the distribution and is an inappropriate interpretation of the histogram. The perceptual tasks involved in correctly interpreting histograms may instead require a series of explicit, nonautomatic steps. For example, in order to locate the mean and median within a histogram, we may need to reconceptualize the bars, perceive cumulative area, and perform a mental division of that space. Unfortunately, the perception of cumulative area is less accurate than the perception of the mean (Raidvee et al., 2020) and perhaps less intuitive for histograms. Some researchers (e.g., Kahneman, 2011) have postulated that the imprecision of perceiving cumulative area arises from taking the perceived average, a measurement that is more accurate, and using multiplication to reconstruct cumulative area. This explanation presumes that the observer can multiply or divide sensory magnitudes, which has been suggested by some experimental results (Torgerson, 1961). Regardless of the explanation, histograms may be difficult to interpret correctly for many reasons: a bias toward the vertical dimension, the perception of bars as separate objects each representing one data point, and, most importantly, the difficulty of perceiving cumulative area, the dimension relevant for correct histogram interpretation.
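A small hypothetical example illustrates why the automatically perceived average bar height is uninformative: it is a property of the bars as objects, not of the distribution they encode.

```python
import statistics

# Made-up frequency table for a histogram.
values = [10, 20, 30, 40]   # data values (x-axis bin centers)
freqs  = [1, 6, 2, 1]       # frequencies (bar heights)

# What ensemble perception may deliver automatically: the mean of the bar heights.
mean_bar_height = statistics.mean(freqs)  # 2.5, a property of the bars as objects

# What the histogram actually encodes: the mean of the underlying distribution.
distribution_mean = sum(v * f for v, f in zip(values, freqs)) / sum(freqs)  # 23.0
```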

Inspiration for solutions

One of the greatest mysteries regarding students' difficulties with learning statistics is how foreign they find the statistical concepts being taught. It cannot be because they have never interacted with summary statistics, like mean and variance, or never experienced distributions of examples in their everyday encounters with the world. Their visual systems automatically extract information regarding mean and variance to represent their environment. But somehow, when asked to reason explicitly about means, variances, and distributions, they lose all intuition. Perhaps some of this disconnect can be explained by the level of abstraction used in the teaching of statistics, which deals with numbers and formulas rather than relatable experiences.

One way to approach this problem is through a grounded cognition perspective on education: that environment and bodily experiences are important for developing cognitive processes (e.g., Barsalou, 1999). By creating a learning experience that is close to natural everyday experience, we may engage the intuition our visual system has about summary statistics and support the learning of statistics. Simulations and animations have been shown to be helpful for new statistics students (Neumann, Neumann, & Hood, 2011; Wang, Vaughn, & Liu, 2011). The usefulness of animations for learning statistics is unsurprising given that animation was helpful for conveying information about uncertainty to people without instruction (Hofman et al., 2020).

Simulations have also been effective at correcting misconceptions and improving understanding of correlation (T.-C. Liu, Lin, & Kinshuk, 2010). While computer simulations of sampling distributions have been shown to be effective for teaching the concept (e.g., Jamie, 2002; Lane, 2015), the use of hands-on activities prior to the use of computer simulations yielded better learning (Hancock & Rummerfield, 2020). There is also work leveraging our capacity for ensemble perception to improve statistical reasoning. For example, Yu, Goldstone, and Landy (2018) improved students' statistical reasoning about sample size, variance, and the difference between means by using an experientially grounded learning intervention that represented data points as balls of different colors and used animation to illustrate the factory production of these balls.

Using these previous successes as inspiration and insights from ensemble perception and data visualization, we propose potential solutions for students’ difficulties with histograms in the corresponding future directions subsection.

Future directions

This tutorial review highlighted the gaps in the ensemble perception literature, the recent usage of ensemble coding principles within the data visualization literature, and the hindrance histograms may pose to students' learning and reasoning within the statistics education literature. Within the ensemble perception literature, investigations of mean perception are abundant, but those of variance perception are scarce. While the first study on variance perception was published more than a decade ago, the topic has not gained the traction it deserves amongst researchers. Insights drawn from studying variance perception could provide context for the existing literature on mean perception and, in turn, help us understand mean perception better. Some visual features may also deserve higher priority in future studies because of their ecological relevance (e.g., color hue for food, facial expressions for social interactions) and, therefore, their potential for insights into human vision and behavior.

While researchers of data visualization have long manipulated graphical features to study their effects (e.g., 2D vs. 3D: Zacks et al., 1998; vertical vs. horizontal bars: Fischer, Dewulf, & Hill, 2005), the ensemble perception literature can provide more theoretical grounding for manipulating visual features in future studies and extend the literature to include overlooked or understudied visual features (e.g., dot size, color saturation) within graphs. Researchers can draw many similarities between the tasks used in ensemble perception research and those performed when interpreting trends from data visualizations (e.g., locating the centroid, or average spatial location, of a dot cloud: Gleicher et al., 2013), and use those similarities to generate and test hypotheses about the effectiveness of different data visualizations. They could also directly test people's performance on these visual tasks and how perceptual fluency relates to trend detection and accurate data interpretation. Given the listed opportunities to gain new insight, we hope that researchers continue to apply the ensemble perception literature to research in data visualization and that this approach becomes more popular.

More importantly, knowledge from ensemble perception can be used to propose new visual representations for conveying statistical properties of data and for teaching statistical concepts in the classroom. A major consideration for design decisions lies in our natural perceptual abilities: aligning the types of visual aids used and the types of visual tasks involved with our perceptual strengths, and avoiding our perceptual weaknesses and imprecisions. These perceptual abilities should be taken into consideration when creating graphs for scientific communication and education, but not without consideration of the perceptual task as well (e.g., comparing values, finding averages, identifying outliers). Visual aids should be designed so that performing the task of interest is as natural and straightforward for our visual system as possible, leaving the difficulty of interpretation to conceptualization and reasoning rather than to perception.

The literature on ensemble perception and data visualization also provides context for understanding the difficulties students face when learning statistical concepts, especially since statistics education depends on the use of graphs. Some barriers to learning and reasoning about statistics may stem from automatic perceptual processes, such as being drawn to extreme values or representing sets of visual stimuli with summary statistics (e.g., perceiving the average height of bars within a histogram is more automatic than estimating and manipulating cumulative area).

Research on ensemble perception has significant theoretical and practical implications for more applied areas of study, such as data visualization and statistics education. Researchers pursuing more applied research questions can harness knowledge from ensemble perception to inspire new lines of research and new approaches to solving existing problems. Some integration of knowledge within these three areas of study has the potential to reveal new and impactful insights that would be of interest to other areas. The following are some specific suggestions for future research in each of the three areas, drawing connections to the other areas where appropriate.

In ensemble perception

The discovery that people are sensitive to variance in visual displays is surprising. Outside of a statistics classroom or context, people do not usually comment on the variance between things, even though they often use "average" as a descriptive adjective. Inside a classroom, students struggle to fully grasp the concept of variance and are often unable to reason about it accurately. Yet our visual systems seem to be able to extract, reproduce, and discriminate between variances. This is puzzling as well as amazing. Therefore, research that can demystify our ability to perceive variance deserves much attention. But first, is the information being processed really variance, or do we rely upon proxies that our visual systems have learned to use because of their reliability in natural environments (i.e., the high correlation between variance and the proxies)? Recent work suggests that people rely upon range to make judgments about variance: performance worsened when range and variance were incongruent (Lau & Brady, 2018). Future research should continue to test other explanations for accurate performance. If we conclude that our visual system is in fact computing variance, are all items within an ensemble equally likely to be incorporated into the variance estimate? Or do we use smart subsampling strategies (as with mean: e.g., Marchant et al., 2013)? What assumptions is our visual system making? Most visual features in the world are normally distributed; is this a prerequisite for being able to perform variance discrimination tasks, or would we perform equally well with unnatural or asymmetric distributions (e.g., gamma distributions)? After all, two arrays can have the same variance but wildly different compositions of values and shapes of distributions. Finally, there needs to be more work on how mean perception and variance perception relate to each other, if at all. Do they rely upon similar mechanisms or share connections in their representations? If cued for a variance discrimination task, would we be able to access representations of the mean after the stimulus is no longer in view, or would performance significantly suffer?
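To make the range-versus-variance question concrete, here is a hypothetical pair of stimulus sets of the kind such experiments could use: their variances are identical while their ranges and distribution shapes differ.

```python
import statistics

# Hypothetical stimulus sets (e.g., circle sizes): identical variance, but
# different ranges and very different distribution shapes. Sets like these could
# help dissociate true variance from the range proxy proposed by Lau and Brady (2018).
spread_out = [2, 4, 4, 6]                      # deviations shared across most items
extreme_pair = [1, 4, 4, 4, 4, 4, 4, 4, 7]     # all variance carried by two outliers

print(statistics.pvariance(spread_out))    # 2.0
print(statistics.pvariance(extreme_pair))  # 2.0: same variance...
print(max(spread_out) - min(spread_out),
      max(extreme_pair) - min(extreme_pair))  # 4 vs 6: ...different ranges
```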

In data visualization

Results from ensemble perception research can inspire research questions and practical solutions for data visualization and statistics education. Given our efficiency at extracting summary statistics from different visual features, would certain data visualizations be more effective at conveying certain messages (e.g., highlighting variation in a data set) because of the visual features used? Would the optimal data visualization depend on whether the message is about relations/concentration (e.g., via color saturation) or quantity (e.g., via numerosity or size)? These design considerations are worth exploring because low-level visual features, such as color, can make a difference in the speed of interpreting trends in heat maps (e.g., dark is more: Silverman, Gramazio, & Schloss, 2016). However, performing visual tasks in ensemble perception experiments is not exactly the same as performing visual tasks with data visualizations and educational graphs like histograms. For one, objects in ensembles are typically separable (e.g., shapes are not touching each other or overlapping; cf. Tong et al., 2015), while the objects used in data visualizations and statistics education (e.g., circles in bubble maps, rectangles in heat maps and histograms) are connected, adjacent, or overlapping, which may make ensemble perception more difficult. Second, other graphical elements (e.g., axis labels, legends, text within the graph) may be distracting when performing summary tasks (e.g., perceiving mean or variance), so perceptual fluency may not always translate into accurate data interpretations. While research on ensemble perception can suggest visual features to use in data visualizations, it is still necessary to test their effectiveness empirically because of the inherent differences between the tasks and the involvement of higher-level cognition in interpreting the data.

One data visualization that deserves far more attention from researchers in data visualization is the histogram. While the statistics education literature has highlighted the common misconceptions students have regarding histograms, we do not know how accurately histograms can be perceived when read properly. Can people accurately estimate the mean and median from histograms, assuming the distributions are unimodal? Does the skewness of the distribution affect people's ability to make these estimations? Perhaps estimating and segmenting cumulative area across bars of different heights is difficult. Would people do better if they were provided a density plot instead? These questions are important because, if we are to use these graphs to teach statistical concepts, we need to know what students are capable of seeing in them.

In statistics education

Histograms are essential to teaching statistical concepts related to distribution, but they are commonly misinterpreted by students. While histograms are intended to be a visual aid for statistical reasoning, they are more often a barrier to it. Histograms could be redesigned to play to our perceptual strengths, like perceiving numerosity and averages, and to prevent bars from being misperceived as individual objects. This is not to say that the current version of the histogram needs to be replaced completely; rather, the problem could be addressed at different stages to identify the minimal intervention necessary to make judgments from histograms more intuitive and less error prone. We can teach statistical concepts related to distributions using dot frequency graphs (see Fig. 2a, and the plotting sketch following the figure caption below) and use this simpler visual representation for statistical reasoning exercises. Since prior research has shown that simulations and animations are helpful for learning statistics (Jamie, 2002) and estimating variance (e.g., Hofman et al., 2020), we could also try an animated version of this graph. Once extracting the relevant information from these graphs becomes more natural, we can introduce the bars as binning the previously presented individual data points into adjacent groups (see Fig. 2b). Although the use of dot frequency graphs may seem elementary, not all students have seen or used them by the time they enter college. Dot frequency graphs also make good entry-level graphs and provide a natural transition into using bars in histograms. The most obvious limitation of these graphs is their smaller capacity for large data sets, as our representation of numbers becomes less precise as we jump orders of magnitude (e.g., from representing 10s to 100s or 1,000s of data points; Landy et al., 2013). Some structured perceptual training may speed up the transitions between visual aids, the adjustment against perceptual biases, and the maintenance of perceptual fluency with the graphs.

Fig. 2

Example of a suggested alternative to histograms, utilizing our ability to perceive numerosity and average color hue. Variations of (a) can be tested to optimize effectiveness: having the columns of dots be closer together or touching, having space between the dots in each column so they are not touching, or having subtle hue differences within columns for a more fine-grained representation of data values and a better transition to the concept of binning data values. A hybrid between dot frequency plots and histograms, such as in (b), can be used when transitioning from dot frequency plots to histograms. (Color figure online)
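For researchers who wish to pilot such displays, the following is a minimal plotting sketch (ours, not the figure's actual source code) of a dot frequency plot like Fig. 2a and a dot/bar hybrid like Fig. 2b, using made-up data.

```python
import matplotlib.pyplot as plt

# Made-up data values and their frequencies.
values = [1, 2, 3, 4, 5]
freqs = [2, 5, 9, 5, 2]

fig, (ax_dots, ax_hybrid) = plt.subplots(1, 2, sharey=True)

# (a) Dot frequency plot: each data point is one dot, stacked in its column,
# so numerosity and column height map directly onto frequency.
for v, f in zip(values, freqs):
    ax_dots.scatter([v] * f, range(1, f + 1), color="steelblue")
ax_dots.set(title="(a) Dot frequency plot", xlabel="Data value", ylabel="Count")

# (b) Hybrid: the same dots with bar outlines added, making explicit that each
# bar is a bin of the individual data points shown inside it.
for v, f in zip(values, freqs):
    ax_hybrid.scatter([v] * f, range(1, f + 1), color="steelblue")
ax_hybrid.bar(values, freqs, width=0.8, fill=False, edgecolor="gray")
ax_hybrid.set(title="(b) Dot/bar hybrid", xlabel="Data value")

plt.show()
```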

It is important that perceptual training be accompanied by explanations of the corresponding concepts. The lack of explicit instruction and self-reflection can have consequences. While extensive training in estimating correlation from scatter plots was successful at curbing perceptual biases, students were unable to convert their perceptual learning into explicit knowledge: they remained unaware of the contributions of the x- and y-value variances to correlation and thus could not identify the formula for correlation (Cui, Massey, & Kellman, 2018). In that study, despite being shown an example of a scatter plot with a negative correlation prior to training, students failed to transfer their improved estimation ability from positive correlation plots to negative correlation plots.

There are many exciting directions for research in the three reviewed research areas: ensemble perception, data visualization, and statistics education. While research in these three areas can certainly progress without the others, each area stands to benefit from being aware of the others, for inspiration, direction, and/or application.

Conclusion

The field of ensemble perception has significant implications and practical applications for research on data visualization and statistics education. The visual features studied in ensemble perception (e.g., size, hue) are used in common data visualizations (e.g., heat maps, tree maps, bubble maps; see Fig. 1) and in common graphs used to teach statistics (e.g., dots in scatter plots, rectangles in histograms). Researchers have recently started to leverage the knowledge generated by research on ensemble perception to assess the effectiveness of different data visualizations at delivering key information and relevant trends to viewers, and to give design recommendations backed by research (e.g., Gleicher et al., 2013; Szafir et al., 2016). The ensemble perception literature is scarcely applied in the domain of statistics education, but it deserves more attention there, as it has been shown to be successful in improving statistical reasoning (Yu et al., 2018).

By comparing findings from ensemble perception and from statistics education, we can appreciate the similarities and differences in the factors that influence the perception and the conceptualization of mean and variance. Some of the misconceptions about mean and variance can be better understood when we look at how we naturally interact with that summary information. The mean of an ensemble is particularly meaningful for category learning, where we summarize a group of instances and store that summary to use as a reference for later categorization. What is stored is a physical object (or objects) representing that category. This may be why we have a tendency to report an object with average visual features as having been part of a previously encountered set of objects even though it did not exist (de Fockert & Wolfenstein, 2009), and why students believe that the average of a data set exists within that set (Batanero et al., 2003).

Unlike the mean, the variance of a set cannot be an object. Instead, what people may be doing is storing a set of objects that represent the variance in some way. Lau and Brady (2018) propose that people may be using range as a proxy for variance. Unlike variance, range can be stored as physical objects: minimum and maximum. Range also seems to be meaningful and intuitive to students. Many ways in which students misinterpret histograms involve range in some fashion, such as using the range of x and y values to find the median of a distribution (Ismail & Chan, 2015). We can also anticipate there being differences because ensemble perception involves physical objects whereas the formal statistics used in classrooms and research involve mostly numbers. For example, when estimating the average of a set, people have a tendency to overweight outliers with a set of numbers, but underweight outliers with a set of dots (Anderson, 1968). While our visual system may have a tendency to disregard outliers (e.g., Haberman & Whitney, 2010), the explicit assessment of numbers in a set may make outliers more salient, pulling one’s estimate of the mean towards the outlier.

Relating how people perceive summary statistics from a scene to how students think about summary statistics should not be overlooked, because statistics education relies on graphs to demonstrate key statistical concepts, and many of these graphs represent data via ensembles of objects (e.g., bars, dots). It is important to take (ensemble) perception into consideration when designing visual aids and curricula for statistics classrooms because people are susceptible to perceptual biases and some features are easier or more intuitive to process than others (e.g., bars in bar graphs are perceived as objects, and therefore their size, or area and length, is easier to encode than spatial position, though not accurately; Yuan et al., 2019). Identifying perceptual biases matters because perceptual biases interfere with the development of conceptual understanding. Beyond identifying biases, we should think about how we can use perceptual features to support the learning of statistical concepts. That is where knowledge from ensemble perception comes in. A better understanding of the ensemble coding of visual features commonly used in graphs can help us identify why certain graph types are hard to interpret or easy to misinterpret, as well as how to avoid such misinterpretations. This knowledge could also be used to design visual aids that yield better learning outcomes than preexisting ones, potentially changing the way statistics is taught.