Diagnostic spatial frequencies and human efficiency for discriminating actions

Thurman, Steven M.; Grossman, Emily D.

doi:10.3758/s13414-010-0028-z

Open access
Published: 16 November 2010

Volume 73, pages 572–580, (2011)

Attention, Perception, & Psychophysics

Abstract

Humans extract visual information from the world through spatial frequency (SF) channels that are sensitive to different scales of light-dark fluctuations across visual space. Using two methods, we measured human SF tuning for discriminating videos of human actions (walking, running, skipping and jumping). The first, more traditional, approach measured signal-to-noise ratio (s/n) thresholds for videos filtered by one of six Gaussian band-pass filters ranging from 4 to 128 cycles/image. The second approach used SF “bubbles” (Willenbockel et al., Journal of Experimental Psychology: Human Perception and Performance, 36(1), 122–135, 2010), which randomly filters the entire SF domain on each trial and uses reverse correlation to estimate SF tuning. Results from both methods were consistent and revealed a diagnostic SF band centered between 12 and 16 cycles/image (about 1–1.25 cycles/body width). Efficiency on this task was estimated by comparing s/n thresholds for humans to those of an ideal observer, and was found to be quite low (>.04%) for both experiments.

Introduction

The human visual system is organized to process visual information through spatial frequency (SF) channels that are each sensitive to a particular range of frequencies of repeating light-dark patterns across the visual field (see De Valois & De Valois, 1980). Since the discovery of the contrast sensitivity function for sinusoidal gratings (Campbell & Robson, 1968), one tradition in vision science has been to determine what SF information is critical for recognizing objects. SF tuning has been measured for various stimuli such as faces (Costen, Parker, & Craw, 1996; Fiorentini, Maffei, & Sandini, 1983; Gold, Bennett, & Sekuler, 1999), letters (Chung, Legge, & Tjan, 2002; Parish & Sperling, 1991), and objects (Norman & Ehrlich, 1987). These studies illustrate that diagnostic stimulus information is available in specific SF bands for different objects, and that observers readily extract this information for visual categorization and recognition (Sowden & Schyns, 2006).

Measuring SF tuning for objects gives vision researchers information about the scale of the diagnostic features for recognizing a given object or for discriminating different exemplars within an object class. For example, low spatial frequencies carry primarily information about global object shape and large-scale relations among features, while higher spatial frequencies carry information about object shape as well as fine-grained features of the object. Importantly, local and global stimulus features can be extracted to different degrees from a very large range of spatial frequencies because the coarse-to-fine processing mode is orthogonal to the local-to-global mode (e.g. Oliva & Schyns, 1997). Nonetheless, determining SF tuning for a visual stimulus is one method for estimating the relative importance of large and small-scale stimulus features for recognition. Since there is currently debate about the role of such features for perception of biological motion, the following question arises naturally: What are the diagnostic spatial frequency bands for recognizing dynamic human actions?

Although this question has not been addressed directly, a study by Kuhlmann and Lappe (2006) investigated perception of human actions from natural scenes that were systematically blurred with a Gaussian filter, essentially removing high SF content (equivalent to a low-pass filter). The results from this study demonstrated that human actions could be reliably discriminated on the basis of very crude low SF information, even if only a few frames of the action sequence were shown. However, if only a single frame was shown, performance dropped precipitously as blur level increased, compared with the full action sequence. Results from Kuhlmann and Lappe (2006) highlight a few important points about action perception. First, action recognition is robust even when videos are filtered to contain only low spatial frequencies, although there appears to be a critical point at which performance deteriorates when the image is too blurry. Second, observer performance improves when motion information is present in the action video by displaying three consecutive frames, though Kuhlmann and Lappe (2006) argue that motion aids primarily in segmenting the actor from the background, and not in the recognition process. However, it is still unclear exactly what role high SF information plays in action recognition and how this compares to intermediate and lower spatial frequencies.

The goal of the current study was to determine which spatial frequencies carry the most diagnostic information for discriminating actions. Researchers have used a variety of methods for measuring SF tuning of object recognition. For instance, one common method is to apply a series of band-pass SF filters to a stimulus, ranging from low to high spatial frequencies, and then measure signal-to-noise ratio (s/n) thresholds for each band-pass level (e.g. Gold et al., 1999; Parish & Sperling, 1991). Other methods for identifying critical SF bands for object recognition include low-pass filtering (Rubin & Siegel, 1984), combining low-pass and high-pass filtering (Fiorentini et al., 1983; Solomon & Pelli, 1994), and critical-band masking (Majaj, Pelli, Kurshan, & Palomares, 2002). Gold et al. (1999) present a useful table summarizing many previous studies including methods and results.

Willenbockel and colleagues (2010) recently introduced a novel method for measuring SF tuning inspired by the “bubbles” technique using reverse correlation (Gosselin & Schyns, 2001). The SF bubbles method involves filtering the SF domain of a stimulus on each trial with a random sampling vector, which can be envisioned as applying multiple, random band-pass filters of varying amplitude and bandwidth. Performing a multiple linear regression of observer accuracy with the sampling vectors across trials results in a classification vector revealing the SF bands that tend to lead to accurate discriminations. One benefit of the SF bubbles method is that it does not allow observers to adapt to a particular SF range during the experiment. Another is that it effectively combines all possible SF filtering experiments, since it is equivalent to either low-pass, high-pass, or multiple band-pass filtering on each trial (Willenbockel et al., 2010).
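The reverse-correlation logic behind SF bubbles can be illustrated with a toy simulation. The sketch below is not the procedure of Willenbockel et al. (2010): the number of SF bins, the uniform random gains, the simulated observer and the name `simulate_bubbles` are all assumptions made for illustration, and a per-bin covariance stands in for the multiple linear regression.

```python
import random

def simulate_bubbles(n_trials=4000, n_bins=16, diagnostic_bin=5, seed=0):
    """Toy reverse correlation: recover which SF bin drives accuracy."""
    rng = random.Random(seed)
    samples, correct = [], []
    for _ in range(n_trials):
        # Random sampling vector: one gain in [0, 1] per SF bin (assumed form).
        vec = [rng.random() for _ in range(n_bins)]
        # Hypothetical observer: accuracy rises with the gain of one bin only.
        p_correct = 0.25 + 0.7 * vec[diagnostic_bin]
        samples.append(vec)
        correct.append(1.0 if rng.random() < p_correct else 0.0)
    mean_acc = sum(correct) / n_trials
    # Classification vector: covariance between accuracy and each bin's gain.
    # The bins are independent with equal variance here, so in expectation
    # this is proportional to the per-bin regression slope.
    return [
        sum((c - mean_acc) * v[b] for c, v in zip(correct, samples)) / n_trials
        for b in range(n_bins)
    ]

classification_vector = simulate_bubbles()
peak_bin = max(range(len(classification_vector)),
               key=classification_vector.__getitem__)
```

In this toy case the classification vector peaks at the bin that the simulated observer relies on, which is the sense in which the method reveals diagnostic SF bands.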

In the current experiment we measured SF tuning for discriminating human actions using two different methods. We used a more traditional band-pass SF filtering method similar to Parish and Sperling (1991), and then we used the SF bubbles method described above (Willenbockel et al., 2010). The purpose of the current experiment was threefold. First, we wanted to determine if particular spatial frequency bands are more diagnostic than others for action discrimination. Second, we sought to compare the patterns of SF tuning derived from both techniques. Last, we performed ideal observer analysis in order to estimate how efficient observers are at extracting information at various spatial scales while discriminating human actions. The results of this experiment shed light on the spatial scale of diagnostic features for biological motion perception and allow us to quantitatively compare human efficiency for discriminating actions to previous estimates of efficiency for discriminating other types of objects.

Experiment 1

Participants

Seven participants were recruited at the University of California, Irvine, and were offered course credit for participation in the experiment. Author S.T. was one of the participants in this experiment. All participants had normal or corrected-to-normal vision.

Stimuli and apparatus

Stimuli included videos of nine human actors performing four different actions: walking, running, skipping and jumping. Videos were selected from an online database of freely available human actions (see Gorelick, Shechtman, Irani, & Basri, 2007). The actions chosen for this experiment represented four different types of ambulation; hence the actions differed primarily in terms of limb articulation, speed and body posture. We chose not to include non-ambulating actions, such as jumping jacks or hand waving, as it would be trivial to discriminate between ambulating figures and non-translating figures. The videos were recorded at a resolution of 180 by 144 pixels at 50 frames per second, and each of the videos had the same wall as a common background. In order to avoid edge artifacts that result from aliasing when Fourier analysis is performed with image dimensions that are not a power of 2, the videos were cropped and resized using bi-cubic interpolation to be 256 by 256 pixels. Each video was converted to grayscale and edited to consist of 25 frames. All image processing was performed using MATLAB.

The action stimuli were filtered with one of six Gaussian band-pass filters, each separated by one octave with a standard deviation of 0.5 octaves. The transfer functions of the filters are displayed in Fig. 1a. The centers of the filters were 4, 8, 16, 32, 64, and 128 (high-pass) cycles/image, which corresponded to center frequencies of 0.28, 0.57, 1.13, 2.27, 4.54, and 9.08 cycles/degree of visual angle. We created Gaussian noise fields by drawing independent samples from a Gaussian distribution (mean = 0, SD = 1) for each pixel in a 256 × 256 array. The noise fields were then filtered with one of the six band-pass filters, creating six sets of filtered Gaussian noise. Each set contained 100 unique filtered noise fields, and dynamic noise was created by randomly choosing 25 frames from the set of 100 on each trial.
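A filter that is Gaussian on an octave (log2) frequency axis can be written down directly from the stated centers and standard deviation. The sketch below is one plausible reading of that description, not the authors' MATLAB code: the unit peak gain, the zero gain at DC and the name `bandpass_gain` are assumptions, and the high-pass behaviour of the 128 cycles/image filter is not modelled.

```python
import math

CENTERS = [4, 8, 16, 32, 64, 128]  # cycles/image, from the text
SIGMA_OCT = 0.5                    # standard deviation in octaves, from the text

def bandpass_gain(freq, center, sigma_oct=SIGMA_OCT):
    """Gaussian gain on a log2 frequency axis (peak gain of 1 is assumed)."""
    if freq <= 0:
        return 0.0  # assumption: the DC component is removed
    octaves_from_center = math.log2(freq / center)
    return math.exp(-octaves_from_center ** 2 / (2 * sigma_oct ** 2))

# Gain of each filter at the center frequency of its upper neighbour.
overlap = [bandpass_gain(2 * c, c) for c in CENTERS]
```

Under these assumptions each filter passes its own center frequency unattenuated and attenuates the neighbouring center, one octave (two standard deviations) away, to exp(-2) of the peak gain.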

The stimuli were presented on a 16 × 12 inch ViewSonic CRT monitor with a resolution of 1024 × 768 pixels and a refresh rate of 100 Hz. Participants were seated in a dark room 38 cm from the screen with a chinrest to help maintain a constant viewing distance. At this distance the stimuli subtended 14.08 × 14.08 deg and were presented at a rate of 50 Hz. The experiment was programmed in MATLAB using the Psychophysics Toolbox (Brainard, 1997; Pelli, 1997) and run on a 2 GHz Intel Core Apple Mac Mini.
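The conversion from cycles/image to cycles/degree follows from the stimulus size, since one image width spans 14.08 deg. The short sketch below works through that arithmetic; the function name `cycles_per_degree` is illustrative only.

```python
STIMULUS_DEG = 14.08  # stimulus width in degrees of visual angle, from the text

def cycles_per_degree(cycles_per_image, stimulus_deg=STIMULUS_DEG):
    """Convert an image-based spatial frequency to a retinal one."""
    return cycles_per_image / stimulus_deg

# Filter centers in cycles/image mapped to cycles/degree.
converted = {c: cycles_per_degree(c) for c in [4, 8, 16, 32, 64, 128]}
```

The converted values agree with the center frequencies reported for the six filters to within rounding in the second decimal place.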

Procedure

Observers first performed a training block to ensure that the actions could be reliably discriminated and to familiarize the observers with SF filtered stimuli. The training block consisted of 140 trials in which seven types of action videos (6 filtered and 1 unfiltered original) were displayed 20 times each, in random order. The actor and action in the video were chosen randomly on each trial. Since this was training, no noise was added to the stimuli. After checking to ensure that observers had less than a 5% error rate on the unfiltered original action videos, the experiment commenced.

Signal-to-noise ratio thresholds were estimated using the method of constant stimuli with a four-choice discrimination task. Observers performed two blocks of 360 trials. In each block, each of the six band-pass filtered action stimuli was displayed a total of 60 times. Those 60 trials of action stimuli consisted of five different s/n levels shown 12 times each. The five s/n levels were 0.005, 0.01, 0.05, 0.1, and 0.5, and were chosen based on pilot data to sample the psychometric function from about chance performance to above-threshold performance across SF bands. Signal-to-noise ratio was measured by computing signal power, s, from the contrast variance of the signal (filtered action video) and noise power, n, from the contrast variance of the noise (filtered Gaussian fields), and dividing s by n. On each trial the contrast variance of the noise field was multiplied by a scaling factor in order to achieve the target s/n level. This method for computing and adjusting s/n is described in detail by Parish and Sperling (1991).
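The s/n adjustment described above can be sketched as follows. This is a minimal illustration, assuming that s and n are population variances of the pixel contrasts and that the variance scaling is applied by multiplying noise amplitudes by the square root of the factor; the one-dimensional lists are stand-ins for filtered videos and noise fields, and the function name is hypothetical.

```python
import random
import statistics

def scale_noise_to_snr(signal, noise, target_snr):
    """Scale noise so that var(signal) / var(scaled noise) equals target_snr."""
    s = statistics.pvariance(signal)    # signal power (contrast variance)
    n = statistics.pvariance(noise)     # noise power before scaling
    power_scale = s / (target_snr * n)  # factor applied to the noise variance
    amp_scale = power_scale ** 0.5      # equivalent factor on noise amplitude
    return [amp_scale * x for x in noise]

rng = random.Random(1)
signal = [rng.uniform(-1, 1) for _ in range(1000)]  # stand-in for video contrast
noise = [rng.gauss(0, 1) for _ in range(1000)]      # stand-in for filtered noise
scaled = scale_noise_to_snr(signal, noise, target_snr=0.05)
achieved = statistics.pvariance(signal) / statistics.pvariance(scaled)
```

Scaling the noise rather than the signal leaves the action video itself unchanged across s/n levels, which matches the description of adjusting the contrast variance of the noise field.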

On each trial a stimulus consisted of a randomly chosen filtered action video plus an identically filtered noise field with a target s/n randomly chosen from the list of five s/n levels (see Fig. 1b). The stimulus was presented for 500 ms, followed by an answer screen displaying the mapping between keyboard responses and the four possible actions. The response screen remained until the observer responded by using the numbers 1-4 on a keyboard. No feedback was given. The next trial commenced after a pause of 3 seconds. The entire experiment lasted about one hour.

Data analysis

For each observer, accuracy was computed for each of the five s/n levels in the six SF bands. In total, accuracy was computed from 24 data samples per condition (30 total conditions). Data from the seven observers were combined, and a best-fitting (least-squares) Weibull psychometric function was fitted to the mean data in order to estimate s/n thresholds at 80% performance for each SF band.
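A least-squares Weibull fit and the 80% threshold read-out can be sketched as below. The parameterization (a guess rate of 0.25 for the four-choice task, no lapse rate), the coarse grid search and the synthetic accuracies are all assumptions made for illustration; the text does not state which form or optimizer was used, and no data from the study appear here.

```python
import math

CHANCE = 0.25  # guess rate for a four-choice task (assumed parameterization)

def weibull(x, alpha, beta, chance=CHANCE):
    """Proportion correct at s/n level x."""
    return chance + (1 - chance) * (1 - math.exp(-((x / alpha) ** beta)))

def fit_weibull(levels, accuracy):
    """Coarse least-squares grid search over alpha (log-spaced) and beta."""
    best = None
    for i in range(301):
        alpha = 10 ** (-3 + 3 * i / 300)  # 0.001 to 1
        for j in range(1, 61):
            beta = 0.1 * j                # 0.1 to 6
            sse = sum((weibull(x, alpha, beta) - p) ** 2
                      for x, p in zip(levels, accuracy))
            if best is None or sse < best[0]:
                best = (sse, alpha, beta)
    return best[1], best[2]

def threshold(alpha, beta, criterion=0.80, chance=CHANCE):
    """Invert the Weibull to find the s/n level giving `criterion` accuracy."""
    q = (criterion - chance) / (1 - chance)
    return alpha * (-math.log(1 - q)) ** (1 / beta)

levels = [0.005, 0.01, 0.05, 0.1, 0.5]  # s/n levels from the text
# Synthetic accuracies generated from a known curve, not data from the study.
accuracy = [weibull(x, 0.06, 1.5) for x in levels]
alpha_hat, beta_hat = fit_weibull(levels, accuracy)
snr_80 = threshold(alpha_hat, beta_hat)
```

A band in which accuracy never reaches 80% at the tested s/n levels would have no defined threshold within the tested range under this read-out.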

Ideal observer

As argued by Gold et al. (1999), optimal performance for this task can be computed by maximizing the cross-correlation between the filtered, noise-masked stimulus and each of the templates (unfiltered action videos). This is analogous to the spatial correlator ideal discriminator described by Parish and Sperling (1991). We used this method to estimate ideal observer thresholds for the current experiment. We performed Monte Carlo simulations testing the accuracy of the ideal observer discriminating action videos that were filtered in each of the same six SF bands and masked with identically filtered Gaussian noise at a variety of s/n levels. The response of the ideal observer on each simulated trial was chosen by computing the correlation between the test stimulus and each of the 36 action templates (4 actions by 9 actors), and then choosing the template with the maximum correlation. We tested performance of the ideal observer on 80 trials for each condition (6 SF bands by 5 s/n levels) and estimated s/n thresholds with a best-fitting Weibull function.
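The template-matching decision rule and its Monte Carlo evaluation can be sketched on toy data. In the sketch below the templates are random ±1 vectors (equal energy, so the largest raw cross-correlation identifies the best match), no SF filtering is applied, and a response is scored as correct when the chosen template shares the action label of the presented one; these choices, and all names, are assumptions for illustration rather than details taken from the study.

```python
import random

def correlate(a, b):
    """Unnormalised cross-correlation at zero lag (dot product)."""
    return sum(x * y for x, y in zip(a, b))

def ideal_observer_accuracy(templates, labels, noise_sd, n_trials, rng):
    """Fraction of trials where the best-matching template has the right label."""
    hits = 0
    for _ in range(n_trials):
        k = rng.randrange(len(templates))
        # Noise-masked stimulus: the chosen template plus Gaussian noise.
        stimulus = [v + rng.gauss(0, noise_sd) for v in templates[k]]
        scores = [correlate(stimulus, t) for t in templates]
        choice = max(range(len(templates)), key=scores.__getitem__)
        hits += labels[choice] == labels[k]
    return hits / n_trials

rng = random.Random(0)
# 36 toy templates standing in for 4 actions by 9 actors.
templates = [[rng.choice((-1.0, 1.0)) for _ in range(64)] for _ in range(36)]
labels = [i // 9 for i in range(36)]  # action index of each template
acc_low_noise = ideal_observer_accuracy(templates, labels, 0.1, 200, rng)
acc_high_noise = ideal_observer_accuracy(templates, labels, 20.0, 200, rng)
```

Repeating the simulation across noise levels yields a psychometric function for the simulated observer, to which a threshold can then be fitted in the same way as for the human data.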

Results

Figure 2 shows human observer and ideal observer performance as a function of s/n level. The ideal observer was able to discriminate actions at all band-pass filter levels, so information was clearly available in all SF bands to perform the task. Human observers reached the 80% threshold in all SF bands except the first band. In fact, observer performance discriminating actions in the first SF band during the practice block, which contained no noise, was only 77% on average (SD = 12%), so it was not surprising that observers failed to reach threshold with the s/n levels used.

The mean observer data and ideal observer data were fit with psychometric functions, and 80% performance thresholds were estimated and plotted in Fig. 3a. Human observer tolerance to noise peaked in SF band 3, which corresponded to 16 cycles/image. Since the average body torso in the videos was about 20 pixels wide and the average body height was about 120 pixels tall, this was equivalent to 1.25 cycles/body width and 7.5 cycles/body height.
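The object-based frequencies follow from scaling the image-based frequency by the fraction of the 256-pixel image that the body occupies. A minimal worked example, with an illustrative function name:

```python
IMAGE_PX = 256  # image width and height in pixels, from the text

def cycles_per_extent(cycles_per_image, extent_px, image_px=IMAGE_PX):
    """Cycles falling within an object that spans extent_px pixels."""
    return cycles_per_image * extent_px / image_px

per_body_width = cycles_per_extent(16, 20)    # torso about 20 pixels wide
per_body_height = cycles_per_extent(16, 120)  # body about 120 pixels tall
```

Here `per_body_width` evaluates to 1.25 and `per_body_height` to 7.5, matching the values given above.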

Figure 3b shows human efficiency for each SF band measured. Efficiency was estimated using the same formula as Parish and Sperling (1991), as the ratio of the ideal observer's s/n threshold to the human observers' s/n threshold at the 80% criterion. Efficiency in SF band 1 was set to zero since observers failed to reach threshold in this condition. Efficiency had a sharp peak at 8 cycles/image, but overall was low across all SF bands, especially in comparison to the efficiency measured for other stimuli such as letters (Parish & Sperling, 1991; but see Gold et al., 1999).
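The efficiency calculation can be sketched as a simple ratio. The sketch assumes that efficiency is the ideal observer's s/n threshold divided by the human threshold, so that values fall between 0 and 1, and it follows the convention above of assigning zero when no human threshold was reached. The threshold values are hypothetical placeholders, not results from the study.

```python
def efficiency(ideal_threshold, human_threshold):
    """Ideal-to-human s/n threshold ratio; 0 if the human never reached threshold."""
    if human_threshold is None:
        return 0.0
    return ideal_threshold / human_threshold

# Hypothetical thresholds for illustration only.
example = efficiency(ideal_threshold=0.0001, human_threshold=0.05)
unreached = efficiency(ideal_threshold=0.0001, human_threshold=None)
```

Because the ideal observer uses all of the available stimulus information, a small ratio indicates that human observers needed a far higher s/n than the ideal observer to reach the same accuracy.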