In [1]:
from IPython import display

eyeCode: An Eye-Tracking Experimental Framework for Program Comprehension

Michael Hansen, Rob Goldstone, Andrew Lumsdaine

Percepts and Concepts Lab, Spring 2014

Motivation and Goals

Motivation

  • Program comprehension is poorly understood (Détienne, 2002)
  • Language design is not human-centered (Stefik, 2013)
  • Eye movements can be mapped to cognitive processes (Salvucci, 1999)

Goals

  • Learn about program comprehension via eye movements
  • Compare strategies/errors between versions of a program
  • Develop a quantitative cognitive model of simple program comprehension

The Experiment

  • eyeCode - programmers read and evaluate small Python programs
  • Different versions randomly assigned
  • All keystrokes recorded (eye movements recorded for local participants)

Demographics (Bloomington Only)

Experiment Interface

  • 1 experiment = 10 programs
  • 2-3 versions of each program

Anatomy of a Trial

  • 1 trial = 1 program
  • All answers final
  • Intermediary keystrokes recorded

Programs

10 categories, 2-3 versions each (25 total)


  • between - filter two lists, intersection
  • counting - for loop with bug
  • funcall - function call with different spacing
  • initvar - summation and factorial
  • order - 3 functions called (in order/shuffled)

  • rectangle - compute area of 2 rectangles
  • overload - overloaded + operator (numbers/strings)
  • partition - partition list of numbers
  • scope - function calls with no effect
  • whitespace - linear equations


Legend: [ most errors ] • [ fewer errors ] • [ no errors ]

Response Correctness

  • Participant response vs. true program output
  • Normalized Levenshtein distance (0-1)
  • Perfect (exact match) vs. correct (formatting differences ignored)

True Output

[1, 2]
[3, 4]

Perfect (distance 0.0)

[1, 2]
[3, 4]

Correct (distance 0.46)

1 2
3,4

Incorrect (distance 0.08)

[1, 3]
[3, 4]
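The grading metric is a normalized Levenshtein distance. A minimal sketch is below; it assumes the distance is normalized by the length of the longer string, which reproduces the distances shown above (the exact normalization used in the experiment may differ).

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_distance(response, true_output):
    # Scale the edit distance into [0, 1] by the longer string's length
    longest = max(len(response), len(true_output))
    return levenshtein(response, true_output) / float(longest) if longest else 0.0

true = "[1, 2]\n[3, 4]"
print(normalized_distance("[1, 2]\n[3, 4]", true))            # 0.0  -> perfect
print(round(normalized_distance("1 2\n3,4", true), 2))        # 0.46 -> correct (formatting only)
print(round(normalized_distance("[1, 3]\n[3, 4]", true), 2))  # 0.08 -> incorrect despite small distance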

The eyeCode Library


  • Python library for analyzing eye movements
  • Built on top of pandas (R-like data frames) and scipy/numpy (stats/numerics)
  • Automatic areas of interest (text, code)
  • Hit testing with layers
  • Code-specific plots and metrics


[http://github.com/synesthesiam/eyecode]

Pandas and Scipy

In [10]:
import pandas, scipy.stats, numpy

# Numpy and pandas work together to create R-like data frames
data = numpy.random.uniform(size=25).reshape((5, 5))
frame = pandas.DataFrame(data, columns=["A", "B", "C", "D", "E"])

# Non-numeric columns are supported
frame["F"] = ["x", "x", "o", "o", "o"]
frame
Out[10]:
A B C D E F
0 0.988141 0.601001 0.313569 0.493887 0.702926 x
1 0.256549 0.802963 0.632752 0.250193 0.152852 x
2 0.875510 0.354493 0.665779 0.787402 0.279158 o
3 0.473622 0.000874 0.153416 0.064031 0.127221 o
4 0.255761 0.794496 0.568220 0.868812 0.488199 o

5 rows × 6 columns

In [29]:
# Statistics are available in scipy.stats
r, p = scipy.stats.spearmanr(frame.A, frame.B)
print "Spearman: r={0}, p={1}".format(r, p)

# Pandas lets you sort, group, and slice data frames
for f_val, sub_frame in frame.groupby("F"):
    # Underneath, they are accessible as standard numpy arrays
    s_vals = sub_frame[["A", "B", "C", "D", "E"]].values.flatten()
    print "When F={0}, sum is {1}, std is {2}".format(f_val, s_vals.sum(), s_vals.std())
Spearman: r=-0.5, p=0.391002218956
When F=o, sum is 6.75699318412, std is 0.291021433612
When F=x, sum is 5.19483392306, std is 0.259278513253

Areas of Interest

  • Automatic with Pygments
  • Works for any supported language!

Block

Line

Syntax
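The Block, Line, and Syntax panels above show three AOI granularities. The eyeCode library builds these automatically; as a rough sketch of the underlying idea (not the library's API), Pygments alone can tokenize a program and yield (line, column) positions for syntax-level AOIs. All names below are illustrative, and pixel boxes would still require font metrics.

from pygments.lexers import PythonLexer

code = "for i in range(3):\n    print(i)\n"

# One AOI per non-whitespace token: (line, column, text, token kind)
aois = []
for offset, tok_type, text in PythonLexer().get_tokens_unprocessed(code):
    if not text.strip():
        continue  # skip pure whitespace tokens
    line = code.count("\n", 0, offset) + 1
    col = offset - (code.rfind("\n", 0, offset) + 1)
    aois.append({"line": line, "col": col, "text": text, "kind": str(tok_type)})

for aoi in aois:
    print(aoi)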


Hit Testing

  • Fixations mapped to areas of interest
  • Highest area of overlap (or point)
  • Overlap area ∝ probability?
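A sketch of the overlap-based hit test described above, assuming rectangular AOIs and a square window around each fixation; the window size and the point-in-box fallback are illustrative choices, not the library's actual defaults.

def overlap_area(box_a, box_b):
    # Area of intersection between two (x1, y1, x2, y2) rectangles
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    return max(0, x2 - x1) * max(0, y2 - y1)

def hit_test(fix_x, fix_y, aois, radius=20):
    # Assign the fixation to the AOI with the largest overlap against a
    # square window centered on the fixation; fall back to point-in-box
    window = (fix_x - radius, fix_y - radius, fix_x + radius, fix_y + radius)
    best, best_area = None, 0
    for name, box in aois.items():
        area = overlap_area(window, box)
        if area > best_area:
            best, best_area = name, area
    if best is None:
        for name, (x1, y1, x2, y2) in aois.items():
            if x1 <= fix_x <= x2 and y1 <= fix_y <= y2:
                return name
    return best

aois = {"line 1": (0, 0, 400, 20), "line 2": (0, 20, 400, 40)}
print(hit_test(120, 18, aois))   # overlaps both lines; "line 1" wins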


"Super" Code Image

Saccade Angles

  • Saccade = quick jump between fixations
  • Plot angles and lengths
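Saccade lengths and angles follow directly from consecutive fixation positions; the coordinates below are made up for illustration.

import math

fixations = [(100, 200), (250, 205), (90, 240)]   # hypothetical (x, y) gaze positions

for (x1, y1), (x2, y2) in zip(fixations, fixations[1:]):
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(-dy, dx)) % 360   # flip y: screen y grows downward
    print("length=%.1f px, angle=%.1f deg" % (length, angle))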


Scanpaths

  • Fixations → AOIs → Scanpath
  • Repeats removed, duration lost
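Collapsing the fixation-AOI sequence into a scanpath might look like the sketch below (AOI labels are hypothetical).

def scanpath(fixation_aois):
    # Drop consecutive repeats; fixation durations are discarded entirely
    path = []
    for aoi in fixation_aois:
        if not path or path[-1] != aoi:
            path.append(aoi)
    return path

print(scanpath(["L1", "L1", "L2", "L2", "output", "L2", "L2"]))
# ['L1', 'L2', 'output', 'L2']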

Transition Matrices

  • First-order transition probabilities
  • Steady state = AOI importance?
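A sketch of both ideas, assuming the "steady state" is the stationary distribution of the first-order transition matrix; the AOI names and scanpath are made up.

import numpy

def transition_matrix(path, aois):
    # Count AOI-to-AOI transitions along the scanpath, then normalize each row
    index = dict((a, i) for i, a in enumerate(aois))
    counts = numpy.zeros((len(aois), len(aois)))
    for src, dst in zip(path, path[1:]):
        counts[index[src], index[dst]] += 1
    probs = counts.copy()
    for i, total in enumerate(counts.sum(axis=1)):
        if total > 0:
            probs[i] /= total
    return probs

def steady_state(matrix):
    # Stationary distribution: eigenvector of P^T for its largest eigenvalue (1)
    vals, vecs = numpy.linalg.eig(matrix.T)
    v = numpy.real(vecs[:, numpy.argmax(numpy.real(vals))])
    return v / v.sum()

aois = ["L1", "L2", "output"]
path = ["L1", "L2", "L1", "L2", "output", "L2", "output", "L1"]
m = transition_matrix(path, aois)
print(m)
print(steady_state(m))   # approx. [0.29, 0.43, 0.29]: L2 "most important"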


Fixation Timelines

  • Example: words and fixations in LTR reading order
  • All fixations are the same duration

Fixation Timelines

  • Time on each line is proportional to line length


Rolling Metrics

  • Computed over a rolling time window
  • Window size and step can vary
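A rolling metric can be sketched directly; the window and step sizes below are arbitrary, and the fixation log is made up.

def rolling_metric(fixations, metric, window_ms=2000, step_ms=500):
    # Apply `metric` to the durations of fixations starting inside each window
    end = max(start for start, _ in fixations)
    results, t = [], 0
    while t <= end:
        in_window = [dur for start, dur in fixations if t <= start < t + window_ms]
        results.append((t, metric(in_window) if in_window else None))
        t += step_ms
    return results

fixes = [(0, 230), (250, 310), (600, 260), (900, 450),
         (1400, 280), (1800, 520), (2500, 300), (3100, 240)]
mean = lambda xs: sum(xs) / float(len(xs))
for t, value in rolling_metric(fixes, mean):
    print((t, value))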

Experiment Results

  • Reading behavior (descriptive, predicting)
  • Saccade angles and scanpaths (string-edit distance)
  • Important lines (fixation duration, steady state)
  • Strategies and calculations (rolling metrics)

Reading Behavior

  • Fixation Duration - 273 ms avg (expected 300-400 ms)
  • Fixation Count - 172 avg (0.97 corr w/ duration)
  • Fixation Rate - 2.73 per sec avg
  • Code/Output Transitions - 6 avg per trial


Reading Behavior Prediction

  • Simple OLS models, fit with statsmodels package
  • Keywords least fixated
  • Line length and line number useful for predicting duration

Type    | Dep. Var          | Useful                           | Not Useful
Line    | First fixation    | Line #                           | Other text/content
Line    | Fixation duration | Line length, whitespace prop.    | Experience level
Element | Fixation duration | Category, line #                 |
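A minimal sketch of fitting such a line-level OLS model with statsmodels; the data frame and column names below are hypothetical stand-ins for the experiment's per-line measures.

import pandas
import statsmodels.formula.api as smf

lines = pandas.DataFrame({
    "fix_duration_ms": [850, 1200, 400, 1600, 700, 1100],
    "line_length":     [22, 35, 10, 41, 18, 30],
    "whitespace_prop": [0.0, 0.1, 0.4, 0.0, 0.2, 0.1],
})

# Fixation duration on a line as a function of its length and whitespace proportion
model = smf.ols("fix_duration_ms ~ line_length + whitespace_prop", data=lines).fit()
print(model.summary())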

Saccade Angles

  • Left - eyeCode experiment (all programs)
  • Center - eyeCode experiment (counting programs only)
  • Right - Natural language experiment


Scanpath Comparisons

  • For each pair of trials (same program)
  • Scanpath over block- or line-level AOIs plus the output box (repeats removed)
  • Normalized Levenshtein distance (string edit distance)
  • Consider ScanMatch (Needleman-Wunsch algorithm)
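ScanMatch scores aligned scanpaths with the Needleman-Wunsch algorithm. The sketch below shows only the core recurrence with a flat match/mismatch/gap scheme; ScanMatch proper uses a substitution matrix derived from AOI geometry and optional duration binning.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    # Global alignment score between two scanpath sequences
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        score[i][0] = i * gap
    for j in range(cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    return score[-1][-1]

print(needleman_wunsch("ABCAD", "ABDAD"))   # 3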


Important Lines/Elements

  • Fixation duration
  • Steady state


Calculations and Rolling Metrics

  • Spikes in average fixation duration correlated with mental calculations

Responses

  1. 0
  2. 1
  3. 1
  4. 6
  5. 2
  6. 11


 1 intercept = 1
 2 slope = 5
 3 
 4 x_base = 0
 5 x_other = x_base + 1
 6 x_end = x_base + x_other + 1
 7 
 8 y_base = slope * x_base + intercept
 9 y_other = slope * x_other + intercept
10 y_end = slope * x_end + intercept
11 
12 print x_base, y_base
13 print x_other, y_other
14 print x_end, y_end

Strategies


  • Batch (top)
    • Read through entire program, then respond
  • Incremental (bottom)
    • Respond to each print statement

rectangle - tuples

 1 def area(xy_1, xy_2):
 2     width = xy_2[0] - xy_1[0]
 3     height = xy_2[1] - xy_1[1]
 4     return width * height
 5 
 6 r1_xy_1 = (0, 0)
 7 r1_xy_2 = (10, 10)
 8 r1_area = area(r1_xy_1, r1_xy_2)
 9 print r1_area
10 
11 r2_xy_1 = (5, 5)
12 r2_xy_2 = (10, 10)
13 r2_area = area(r2_xy_1, r2_xy_2)
14 print r2_area


Program-Specific Results

between

  • Focus on lists and between() comparison
  • Less time on second list
  • Common error: assuming the final list is common(x_between, y_between), yielding [8]

True Output

[8, 7, 9]
[1, 0, 8, 1]
[8, 9, 0]

Common Error

[8, 7, 9]
[1, 0, 8, 1]
[8]

Program-Specific Results

counting

  • nospace - 80% correct (top), twospaces - 36% correct (bottom)
  • Line 2 → 3 transition twice as likely for:
    • nospace trials
    • Correct twospaces trials

Program-Specific Results

overload

  • More (relative) time spent on string literals in plusmixed
  • Priming from mathematical use of +?
  • More likely transitions from output box to "5" + "3"


  • Double-checking in plusmixed? (at 35 seconds)

Program-Specific Results

rectangle

  • Focus on area() arguments
  • Width/height calculation verification in tuples?

basic

tuples


Results Summary

Reading Behavior

  • Average fixation duration differs from other experiments (273 ms vs. 300-400 ms)
  • Experience level not correlated with avg. fixation duration
  • Keywords received the least amount of time

Linear Models

  • Time on line - line length and whitespace proportion
  • First fix on line - line number
  • Time on code element - category and line number

Version Differences

  • Only counting, initvar, overload, and rectangle
  • Other program versions not distinguishable with eye/performance metrics
  • Programs may be too simple

Process/Cognitive Models (Future Work)

Goal

  • Develop a quantitative cognitive model of program comprehension
  • Compare model to human eye movements and errors

Models

  • Line reader model
  • Basic ACT-R model
  • Advanced ACT-R model


Cognitive Complexity Model (Cant)


  • Process model for program comprehension
  • Chunking
    • Reading and understanding a coupled block of code
    • Faster when chunks are smaller, less complex
  • Tracing
    • Visual jump to required sub-chunk
    • Faster when chunk is closer, more salient
  • Inspiration for quantitative models


Parameters

  • Chunking (7)
  • Tracing (5)

Limitations

  • Definition of a chunk
  • Not a cognitive model
  • All parameters necessary?
  • Does not make errors

Chunking


Tracing

Line Reader Model


  • Reads and evaluates lines like a debugger
  • Chunk = 1 line
  • Chunk complexity ∝ f(line length)
  • Tracing complexity ∝ g(line count)
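As a toy illustration only (not the actual model), a line reader's timing could be parameterized roughly as below; every constant is a placeholder, and no recall, tracing, or errors are modeled.

def simulate_line_reader(lines, chunk_delay_s=0.3, per_char_s=0.05):
    # Visit each line in order; reading time grows with line length
    t, events = 0.0, []
    for i, line in enumerate(lines, 1):
        t += chunk_delay_s + per_char_s * len(line.strip())
        events.append((round(t, 2), "read line %d" % i))
    return events

program = ["intercept = 1", "slope = 5", "print x_base, y_base"]
for time_s, event in simulate_line_reader(program):
    print((time_s, event))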


Parameters

  • Chunking delay
  • Tracing delay
  • Chunk/tracing recall factors
  • Typing speed

Limitations

  • Single line
  • Not a cognitive model
  • Does not make errors
  • Unrealistic typing

ACT-R


  • ACT-R is a cognitive architecture implemented in Common Lisp
    • Adaptive Control of Thought - Rational
    • Computational theory of cognition
  • Cognition is transforming and moving chunks between buffers
    • Chunks are a collection of key/value pairs
    • Buffers hold one chunk at a time
    • Productions are micro-steps in a task
  • Modules for
    • Declarative memory
    • Task state (imaginal), goals
    • Vision, hearing (aural), speaking (vocal), typing (manual)
  • Bottleneck in the procedural module
  • Sub-symbolic effects
    • Memory retrieval delay
    • Manual jamming
    • Eye movements

(p encode-letter
   ;; Conditions: the goal says we are attending, the visual buffer holds
   ;; a text chunk, and the imaginal buffer is currently empty
   =goal>
      isa         read-letters
      state       attend
   =visual>
      isa         text
      value       =letter1
   ?imaginal>
      buffer      empty
==>
   ;; Actions: move the goal to a waiting state and request a new imaginal
   ;; chunk that stores the attended letter
   =goal>
      state       wait
   +imaginal>
      isa         array
      letter1     =letter1
)

Basic ACT-R Model


  • Declarative memory
    • Chunk/tracing recall
  • Visual module
    • Chunk size, tracing distance
  • Manual module
    • Typing response


Parameters

  • Initial DM activations
  • Standard ACT-R parameters

Limitations

  • Single line
  • Cannot recognize algorithms
  • Beginner typing skill

Advanced ACT-R Model


  • Basic model + ontology + spatial
  • Map visual locations/areas to program semantics (loops, functions)
  • Use qualitative spatial reasoning
  • Variable name similarity (prefix), semantic priming


Parameters

  • Initial DM activations
  • DM weights (similarity)
  • Standard ACT-R parameters

Limitations

  • No IDE/tool interaction
  • Only single file, simple programs
  • High-level rules of discourse?

Thank you!

References

  • Françoise Détienne and Frank Bott. Software design–cognitive aspects. Springer Verlag, 2002.
  • Dario D. Salvucci. Mapping eye movements to cognitive processes. PhD thesis, Carnegie Mellon University, 1999.
  • Andreas Stefik and Susanna Siebert. An empirical investigation into programming language syntax. ACM Transactions on Computing Education (TOCE), 13(4):19, 2013.