Eye-tracking data quality as affected by ethnicity and experimental design

Blignaut, Pieter; Wium, Daniël

doi:10.3758/s13428-013-0343-0

Brief Communication
Published: 23 April 2013

Volume 46, pages 67–80, (2014)

Behavior Research Methods

Abstract

Lack of accuracy in eye-tracking data can be critical. If the point of gaze is not recorded accurately and reliably, the information obtained or action executed might be different from what the user intended. This study reports trackability, accuracy, and precision as indicators of eye-tracking data quality as measured at various head positions and light conditions for a sample of participants from three different ethnic groups. It was found that accuracy and precision for Asian participants were worse than those for African and Caucasian participants. No significant differences were found between the latter two ethnic groups. Operating distance had the largest effect on data quality, since it affected all indicators for all ethnic groups. Illumination had no significant effect on accuracy or precision, but the accuracy achieved by African and Caucasian participants was better when the stimulus was presented on a dark background. Large gaze angles proved to be detrimental for trackability for African participants, while accuracy and precision were also affected adversely by larger gaze angles for two of the ethnicities.

Introduction

Eye tracking can be used to obtain information about how people acquire and process information while they read (Rayner, Pollatsek, Drieghe, Slattery & Reichle, 2007), browse a Web site (Goldberg, Stimson, Lewenstein, Scott & Wichansky, 2002), shop (Vikström, Wallin & Holmqvist, 2009), drive a motor car (Crundall, Chapman, Phelps & Underwood, 2003), interpret medical images (Donovan, Manning & Crawford, 2008), and perform other tasks where the ability to decipher the visual world around them is critical. Eye tracking can also be used as an input modality during computer interaction, such as eye typing (Abe, Ohi & Ohyama, 2007) or other gaze-contingent systems. No matter what the application area, it is important that the quality of data is sufficient to support the investigation or action at hand.

Data quality may be affected by one or more of three primary sources of error, namely participant characteristics, equipment features, or the experimental setup. First, the output from eye-tracking devices may vary with individual differences in the shape or size of the eyes, such as the corneal bulge and the relationship between the eye features (pupil and corneal reflections) and the foveal region on the retina. Ethnicity, viewing angle, head pose, eye color, cleft and texture, position of the iris within the eye socket, and the state of the eye (open or closed) all influence the appearance of the eye (Hansen & Ji, 2010) and, therefore, the quality of eye-tracking data (Holmqvist, Nyström & Mulvey, 2012). Second, equipment-related problems might occur because of unstable or unsuitable sampling frequency, low camera resolution, incorrect or unstable identification of the pupil and glint centers, the fixation identification algorithm and associated parameter settings, the calibration procedure, and other hardware- and software-related issues. Third, experimental conditions, such as light conditions, head position, and stimulus, can also strongly affect experimental outcomes, along with incorrect analysis of data. It is also not always obvious what experimental setup would be the best to obtain optimum eye-tracking results. Holmqvist, Nyström, Andersson, Dewhurst, Jarodzka and Van de Weijer (2011) list a wide variety of possible sources of error, with illustrative values.

Several separate constructs may be used to quantify data quality (or the lack thereof) (Holmqvist et al., 2011). Spatial accuracy refers to the distance between the actual and reported gaze positions, while temporal accuracy (a.k.a. latency) refers to the time difference between the actual and reported gaze events. Spatial precision (sometimes referred to as noise) and temporal precision refer to the variance in position and latency measures, respectively. Trackability (or robustness) refers to the proportion of raw data samples that are lost during a recording; the more data are lost, the worse the trackability is. Spatial resolution refers to the smallest eye movement that can be detected reliably in the data. Spatial resolution is highly dependent on spatial precision, since it is impossible to detect eye movements that are smaller than the sample-to-sample variability in the data.

This article focuses on trackability, spatial accuracy (or just accuracy), and spatial precision (or just precision) as indicators of the variance in data quality between participants from three different ethnic groups—namely, Africans, Asians, and Caucasians—under varying but controlled experimental conditions.

Measures of data quality

ISO standard 5725–1 defines precision as the “closeness of agreement between independent test results obtained under stipulated conditions” (ISO 5725–1, 1994). Precision should not be confused with accuracy, which is defined as the “closeness of agreement between a test result and the accepted reference value” (ISO 5725–1, 1994). Informally, precision refers to the spread (or dispersion) of the recorded raw gaze data samples, while accuracy refers to the offset between the observed and actual fixation positions (Hornof & Halverson, 2002). Figure 1 (left) shows imprecision around the stimulus (crosshair), while in Fig. 1 (right), the precision is better but accuracy is poor.

Both accuracy and precision can be computed separately for the two eyes and separately in the horizontal and vertical directions. Trackability, accuracy, and precision values are calculated from samples that are recorded when a participant (or pair of artificial eyes) is assumed to fixate a stationary target.

Trackability

Trackability (or robustness or versatility) refers to how well an eyetracker works for a variety of participants (Holmqvist et al., 2011). Trackability can be quantified by dividing the number of valid recorded raw data samples by the total number of samples that were supposed to be captured in the tracking period.
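The ratio described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `trackability`, the representation of a lost sample as `None` or as a tuple containing NaN, and the example numbers are all assumptions made for the sketch.

```python
import math


def trackability(samples, expected_count):
    """Return the fraction of expected samples that were validly recorded.

    samples: list of (x, y) gaze tuples; a lost sample is assumed to be
             stored as None or as a tuple containing NaN (hypothetical
             convention, not taken from the article).
    expected_count: number of samples that should have been captured.
    """
    if expected_count <= 0:
        raise ValueError("expected_count must be positive")
    valid = sum(
        1
        for s in samples
        if s is not None and not any(math.isnan(v) for v in s)
    )
    return valid / expected_count


# Illustrative recording: 600 expected samples, 60 of them lost.
recording = [(100.0, 200.0)] * 540 + [None] * 60
print(trackability(recording, expected_count=600))
```

Dividing by the expected count rather than by `len(samples)` matters when the tracker silently drops samples instead of flagging them as invalid.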

Loss of data typically occurs when some of the critical features of the eye image—for example, the pupil and/or corneal reflection—cannot be reliably detected (Holmqvist et al., 2011). Sometimes, it might be possible to do a regression to fill the gaps, but long periods of data loss cannot be addressed in this way. Holmqvist shows how data loss can reduce the number of processed fixations, while increasing the fixation duration.

Typically, glasses and contact lenses may cause reflections that can either obscure the corneal reflection or be regarded by the eyetracker as being actual corneal reflections. We have found that tilting the glasses somewhat might have a huge impact on data quality (Fig. 2).

Participant-related eye physiology—for example, droopy eyelids or narrow eyes—may also obscure the glint or pupil or part thereof with subsequent loss of data (Fig. 3). If the eyetracker allows, such problems can be addressed by adjusting the eye camera in order to manipulate the gaze angle (angle at the eye between eye camera and gaze target).

Other aspects that could make a difference to trackability are adjustment of resolution, number of eye cameras, quality of camera sensors, and number of and positioning of infrared illuminators, as well as parameters to the image-processing algorithms. Unfortunately, researchers very seldom have access to these variables, since manufacturers mostly hide these from the user in favor of ease of use.

Accuracy

Explanation and significance

Lack of accuracy, also known as systematic error, may not be a problem when the areas of interest are large (e.g., 8° of visual angle) and are separated by large distances (Zhang & Hornof, 2011), but in studies where the stimuli are closely spaced, as in reading, or irregularly distributed as on a Web page, uncertainty of as little as 0.5°–1° can be critical in the correct analysis of eye-tracking data. Accuracy is also of great importance for gaze-input systems (Abe et al., 2007). With regard to reading research, for example, Rayner et al. (2007) state that “there can be a discrepancy between the word that is attended to even at the beginning of a fixation and the word that is recorded as the fixated word. Such discrepancies can occur for two reasons: (a) inaccuracy in the eye-tracker and (b) inaccuracy in the eye-movement system” (p. 522). Lack of accuracy may also result from bad calibrations, head movements, astigmatism, eyelid closure, or other sources that are strongly dependent on the particular characteristics of the individual participant (Hornof & Halverson, 2002).

In principle, the accuracy of eye tracking refers to the difference between the measured gaze direction and the actual gaze direction for a person positioned at the center of the head box (Borah, 1998; Tobii, 2010). It is measured as the distance, in degrees, between the position of a known target point and the average position of a set of raw data samples, collected from a participant looking at the point (Hornof & Halverson, 2002; Holmqvist et al., 2011). This error may be averaged over a set of target points that are distributed across the display.
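As a rough illustration of this measure, the sketch below averages a set of raw samples and converts the on-screen offset from the target into degrees of visual angle. It is a simplified sketch, not the procedure used in the study: the function name `accuracy_deg` is hypothetical, the conversion assumes the target lies close to the point where the line of sight meets the screen at a right angle, and the pixel size and viewing distance in the example are illustrative inputs.

```python
import math


def accuracy_deg(target_px, samples_px, pixel_size_mm, distance_mm):
    """Angular offset between a target and the mean of the gaze samples.

    target_px: (x, y) position of the known target, in pixels.
    samples_px: non-empty list of (x, y) raw gaze samples, in pixels.
    pixel_size_mm: physical size of one pixel (assumed square).
    distance_mm: eye-to-screen distance (assumed perpendicular to the
                 screen near the target, which is a simplification).
    """
    if not samples_px:
        raise ValueError("at least one gaze sample is required")
    n = len(samples_px)
    mean_x = sum(x for x, _ in samples_px) / n
    mean_y = sum(y for _, y in samples_px) / n
    offset_px = math.hypot(mean_x - target_px[0], mean_y - target_px[1])
    offset_mm = offset_px * pixel_size_mm
    return math.degrees(math.atan(offset_mm / distance_mm))


# Illustrative case: samples scattered around a point 40 px right of the target.
samples = [(998.0, 539.0), (1002.0, 541.0), (1000.0, 540.0)]
error = accuracy_deg((960.0, 540.0), samples, pixel_size_mm=0.26, distance_mm=650.0)
print(f"{error:.2f} degrees")
```

Because the samples are averaged before the offset is taken, sample-to-sample noise largely cancels out and the result reflects the systematic component of the error.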

In order to compare and replicate research results, it is essential that researchers report the accuracy of eye tracking. Stating the manufacturers’ specifications could be misleading, since it is known that accuracy can vary considerably across participants and experimental conditions (Blignaut & Beelders, 2009; Hornof & Halverson, 2002). Some researchers—for instance, Tatler (2007) and Foulsham and Underwood (2008)—make a point of recalibrating until the measured accuracy is below 0.5° and only then start recording.

Reported accuracy of video-based eyetrackers

Video-based eye tracking is the most widely practiced eye-tracking technique (Hua, Krishnaswamy & Rolland, 2006). This technique is based on the principle that when near infrared light is shone onto the eyes, it is reflected off the different structures in the eye to create four Purkinje reflections (Crane & Steele, 1985). The vector difference between the pupil center and the first Purkinje image (also known as the glint or corneal reflection) is tracked. Corneal reflection/pupil devices are largely unobtrusive and easy to operate.

While an accuracy of 0.3° has been reported for tower-mounted high-end systems operated by skilled operators (Holmqvist et al., 2011; Jarodzka et al., 2010), remote systems are usually less accurate. Komogortsev and Khan (2008), for example, used an eyetracker with an accuracy specification of 0.5° but found that after removing all invalid recordings, the mean accuracy over participants was 1°. They regarded systematic errors of less than 1.7° as being acceptable. Johnson, Liu, Thomas and Spencer (2007), using an alternative calibration procedure for another commercially available eyetracker with stated accuracy of 0.5°, found an azimuth error of 0.93° and an elevation error of 1.65°. Zhang and Hornof (2011) also found an accuracy of 1.1°, which is well above the 0.5° as stated by the manufacturer. Using a simulator eyeball, Imai et al. (2005) found that a video-based eyetracker has an x-error of 0.52° and a y-error of 0.62°. Van der Geest and Frens (2002) reported an x-error of 0.63° and a y-error of 0.72°. Chen, Tong, Gray and Ji (2008) proposed a “robust” video-based eye-tracking system that has an accuracy of 0.77° in the horizontal direction and 0.95° in the vertical direction, which they deemed acceptable for many HCI applications that allow natural head movements. Brolly and Mulligan (2004) also proposed a video-based eyetracker with accuracy in the order of 0.8°. Hansen and Ji (2010) provided an overview of remote eyetrackers and reported the accuracy of most model-based gaze estimation systems to be in the order of 1°–2°.

Hornof and Halverson (2002) thoroughly studied the nature of systematic errors in a set of eye-tracking data collected from a visual search experiment. They found that the systematic error tends to be constant within a region of the display for each participant. Specifically, the magnitudes of the disparities between the target visual stimuli and the corresponding fixations were “somewhat evenly distributed around 40 pixels (about 1° of visual angle) and most were between 15 and 65 pixels” (Hornof & Halverson, 2002, p. 599). Horizontal and vertical disparities remained constant to a certain degree for each participant. Thus, the error was not randomly distributed across all directions or sizes but was, as the name implies, systematic.

The role of calibration

Calibration refers to a procedure to gather data so that the coordinates of the pupil and one or more corneal reflections in the coordinate system of the eye video can be converted to x- and y-coordinates that represent the participant’s point of regard in the stimulus space. The procedure usually consists of asking the participant to look at a number of predefined points at known angular positions while storing samples of the measured quantity (Abe et al., 2007; Kliegl & Olson, 1981; Tobii, 2010). Theoretically, this transformation should remove any systematic error, but eyetrackers often maintain some systematic error even directly after careful calibration (Hornof & Halverson, 2002).

Precision

The precision of an eyetracker is defined as the ability to reliably reproduce a measurement (Holmqvist et al., 2011). Precision should be calculated from data samples that are recorded when the eye is fixating on a stationary target (Holmqvist et al., 2011). The use of artificial eyes presents a good way to ensure that there is no movement of the eyes and that the precision measured is a property of variation in the eyetracker only.

High precision is needed when measuring small fixational eye movements such as tremor, drift, and microsaccades. Poor precision can also be detrimental to fixation and saccade detection algorithms. Holmqvist et al. (2012) found that the number of fixations that are identified by an adaptive velocity threshold algorithm increases as precision improves, while the duration of fixations decreases. They surmise that higher noise levels prevent the algorithm from identifying shorter saccades, with the result that nearby but separate fixations are merged into longer fixations.

Poor precision can be caused by a multitude of technical and participant-specific factors like hardware limitations and eye color (Holmqvist et al., 2011). Our own analysis of eye videos and comparisons between human and artificial eyes have led us to conclude that poor precision is usually caused by unstable pupil and glint detection by the imaging software and is not necessarily participant related. This is especially the case when the images taken by the eye camera are of lower quality.

Participant-related causes of imprecision can be addressed by using a chinrest or bite-board, while hardware-related causes can be addressed by using different eye-tracking hardware—for example, by using an eyetracker with a high-quality eye camera and/or using a tower- or head-mounted unit instead of a remote unit. If none of these are possible, the variability can be hidden through averaging the gaze points by means of a fixation identification algorithm (Hornof & Halverson, 2002).

The standard deviation (σ) of a set of data samples is a measure of the spread around the mean or central value. It is defined as \( \sigma =\sqrt{\frac{1}{N}\sum\nolimits_{i=1}^N d_i^2} \), where N is the number of samples and \( d_i \) is some distance measure between an individual sample, i, and the central value. In two dimensions, the standard deviation would be \( \sqrt{\left( \sigma_x^2+\sigma_y^2 \right)/2} \), where \( \sigma_x^2=\frac{1}{N}\sum\nolimits_{i=1}^N \left( x_i-\overline{x} \right)^2 \) and \( \sigma_y^2=\frac{1}{N}\sum\nolimits_{i=1}^N \left( y_i-\overline{y} \right)^2 \).
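The two-dimensional formula can be written out directly, as in the following minimal sketch. The function name `precision_sd` and the sample values are illustrative assumptions; the result is in whatever unit the samples use (pixels here), so a conversion to degrees of visual angle would still be needed for reporting.

```python
import math


def precision_sd(samples):
    """Two-dimensional standard deviation of (x, y) gaze samples.

    Computes sqrt((var_x + var_y) / 2), where each variance uses the
    1/N form given in the text (not the 1/(N-1) sample estimator).
    """
    n = len(samples)
    if n == 0:
        raise ValueError("at least one gaze sample is required")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples) / n
    var_y = sum((y - mean_y) ** 2 for _, y in samples) / n
    return math.sqrt((var_x + var_y) / 2)


# Four samples on the corners of a 2 x 2 square: each variance is 1.
print(precision_sd([(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]))  # 1.0
```

A smaller value means tighter clustering of the samples, that is, better precision, regardless of how far the cluster lies from the target.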

Methodology

The trackability, accuracy, and precision of eye-tracking data under various conditions were determined by asking participants from three ethnic groups to fixate a series of dots displayed on the screen.

Hardware and software

A Tobii TX300 eyetracker with a frame rate of 300 Hz, screen resolution of 1,920 × 1,080 and a pixel size of 0.26 mm was used. According to the product description (Tobii, 2010), the operating distance is specified as 50–80 cm, and the freedom of head movement at 65 cm is 37 cm horizontally and 17 cm vertically. The firmware version for the TX300 at the time of testing was 3.0.1. The Metrics software (version 2.1.7.65534) developed by Tobii was used to capture the raw data, but it was not used to analyze the data.

It is important to note that this study is about the differences in data quality between three different ethnic groups, and not about the specific instrument that was used to capture the data. Another eyetracker, of the same model, might produce different results, and the absolute values for data quality should not be used to evaluate or compare the quality of this model of eyetracker with that of another.

Participants

Participants were recruited so as to provide a balance over ethnic group as far as possible. Eventually, 71 participants were tested, of whom 26 were African, 22 Asian, and 23 Caucasian.

In order to focus on the effects of ethnicity and avoid possible influence of sight correction on data quality, participants who could not see clearly on the screen without glasses were not recruited. Although no formal screening was done with regard to other demographic or participant-related factors, some figures are provided in Table 1. It might seem that eye color is quite unbalanced, but it must be kept in mind that all Africans and most Asians have brown eyes.

Table 1 Number of participants per demographic factor

Laboratory setup

The eyetracker was placed on a special table that is adjustable in three dimensions (Fig. 4) to control the head position relative to the eyetracker. A chinrest was used to ensure that the head position stayed constant for a specific configuration. The center position for the eyetracker was defined as the position where the cameras were 650 mm from the participant's eyes and the eyes were in the horizontal and vertical center of the head box. The screen was perpendicular to the table at all times.

Five adjustable lights were used to adjust the lighting conditions inside the laboratory. One light, attached to the ceiling, was directly above the eyetracker, while the other four lights were placed in the corners of the room, just below the ceiling. The room was darkened so that illumination of approximately 0 lux could be achieved when all the lights were switched off. Two light meters, one facing the participant and one facing the eye camera, were used to measure lux levels. The light conditions were adjusted until similar readings were recorded on both light meters.

Experimental procedure

A participant session consisted of a calibration routine followed by a series of tests where a set of dots were displayed, and participants had to fixate each one of the dots. A participant session lasted about 30–45 min, on average.

Calibration was done once per participant at the start of a session by asking participants to fixate nine white dots on a black background at random positions. Calibration was done in the center of the head box at 300 lux. A calibration was regarded as valid if data could be captured for each dot. If necessary, the calibration routine was repeated for one or more dots. A session was terminated if the eyetracker was unable to track a participant's eyes.

The trackability, accuracy, and precision of the eye-tracking data were determined for each of the following conditions:

Head position with respect to horizontal and vertical position relative to the eyetracker, as well as the operating distance (distance from the eye camera to the eyes)
Light conditions
Gaze angle
Stimulus background

A participant session consisted of 21 tests, each with the eyetracker in a different position or with different light conditions, dot positions, or background color. Table 2 summarizes the various combinations. From the center position, the eyetracker was moved along the x-, y- and z-axes in 5-cm intervals, in both the positive and negative directions, and a test was done at each position (see Fig. 5 for the definition of the axes). Note that the z-axis is not perpendicular to the x–y plane but, rather, gives an indication of the operating distance to the eye camera.

Table 2 Experimental configurations

Note also that −10 cm means that the table was lowered (or moved to the left) 10 cm and that the head is thus 10 cm above (or to the right of) the center position. Similarly, +10 cm means that the table was raised (or moved 10 cm to the right), with the result that the head was 10 cm below (or to the left of) the center position.

Stimuli

Unless stated otherwise (Table 2), the stimuli for a single test consisted of nine dots, each within a larger disk (diameter 36 pixels) (Fig. 6). The dots were arranged in a 3 × 3 grid (Fig. 7) but appeared one at a time in random order for 2 s each, with 1 s in between. For all tests but that for set F, the dots were displayed on a black background, while for set F, the dots were displayed in white within a black disk on a white background. For all sets but set G, the dots were displayed at positions such that the two upper corner dots appeared at 20° (see angle α in Fig. 8). For set G, two dots at 25° and two dots at 30° were displayed in each corner (Fig. 9). The test in darkness was always performed last, and 3 min was allowed for the participant's pupil size to adjust before the test commenced.

Analysis

In order to maximize the probability that a participant’s eyes were fixating on a specific dot, only data collected between 1,000 and 1,500 ms since the dot’s appearance were used during analysis. The raw data for each participant were exported to a database, and no filtering, event detection, or gap filling of any kind was done.
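The windowing step can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' code: the function name `samples_in_window` and the `(timestamp_ms, x, y)` sample layout are hypothetical, and the window is treated as half-open (1,000 ms inclusive, 1,500 ms exclusive) because the text does not say how the boundaries were handled.

```python
def samples_in_window(samples, onset_ms, start_ms=1000.0, end_ms=1500.0):
    """Keep samples recorded start_ms..end_ms after the dot appeared.

    samples: iterable of (timestamp_ms, x, y) tuples (assumed layout).
    onset_ms: timestamp at which the dot appeared.
    The interval is half-open, [start_ms, end_ms); this boundary
    handling is an assumption made for the sketch.
    """
    return [s for s in samples if start_ms <= s[0] - onset_ms < end_ms]


# Synthetic 2-s recording at 300 Hz with the dot appearing at t = 0.
recording = [(i * 1000.0 / 300.0, 960.0, 540.0) for i in range(600)]
window = samples_in_window(recording, onset_ms=0.0)
print(len(window))  # 150 samples, i.e. 0.5 s at 300 Hz
```

No filtering or gap filling is applied here, in line with the description above, so any lost samples inside the window would still have to be counted when trackability is computed.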

SQL queries were used to aggregate the results over the various combinations, and a statistical package was used to conduct an analysis of variance (ANOVA) test for each one of the factors, while controlling for the others. In other words, the Metrics software was used only to present the stimuli and capture the raw data, while the analysis was performed independently. That is, the data quality results as provided by Metrics were not used.

The accuracy and precision were averaged over all dots for a specific participant–test combination. All results were compared with the results for set A, which was regarded as the benchmark configuration.
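The aggregation over dots can be pictured with a small in-memory database. The sketch below uses Python's built-in `sqlite3` module; the table name `dot_quality`, its columns, the helper `mean_quality_per_test`, and all values are hypothetical and chosen only to show the grouping, since the article does not describe the actual schema, queries, or database engine.

```python
import sqlite3


def mean_quality_per_test(rows):
    """Average per-dot accuracy and precision per participant and test set.

    rows: tuples of (participant, test_set, dot, accuracy_deg, precision_deg).
    The schema is an assumption made for this illustration.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE dot_quality ("
            "participant TEXT, test_set TEXT, dot INTEGER, "
            "accuracy_deg REAL, precision_deg REAL)"
        )
        conn.executemany("INSERT INTO dot_quality VALUES (?, ?, ?, ?, ?)", rows)
        return conn.execute(
            "SELECT participant, test_set, "
            "AVG(accuracy_deg), AVG(precision_deg) "
            "FROM dot_quality "
            "GROUP BY participant, test_set "
            "ORDER BY participant, test_set"
        ).fetchall()
    finally:
        conn.close()


# Made-up values for one participant, two dots each in sets A and F.
example_rows = [
    ("P01", "A", 1, 0.5, 0.25),
    ("P01", "A", 2, 1.0, 0.75),
    ("P01", "F", 1, 1.5, 0.5),
    ("P01", "F", 2, 2.5, 1.0),
]
for row in mean_quality_per_test(example_rows):
    print(row)
```

Each resulting row corresponds to one participant–test combination, which is the unit that would then be compared with the benchmark set A.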