In [1]:
from IPython import display

eyeCode: An Eye-Tracking Experimental Framework for Program Comprehension

Michael Hansen, Rob Goldstone, Andrew Lumsdaine

Percepts and Concepts Lab, Spring 2014

Motivation and Goals

Motivation

  • Program comprehension is poorly understood (Détienne, 2002)
  • Language design is not human-centered (Stefik, 2013)
  • Eye movements can be mapped to cognitive processes (Salvucci, 1999)

Goals

  • Learn about program comprehension via eye movements
  • Compare strategies/errors between versions of a program
  • Develop a quantitative cognitive model of simple program comprehension

The Experiment

  • eyeCode - programmers read and evaluate small Python programs
  • Different versions randomly assigned
  • All keystrokes recorded (eye movements recorded for local participants)

Demographics (Bloomington Only)

Experiment Interface

  • 1 experiment = 10 programs
  • 2-3 versions of each program

Anatomy of a Trial

  • 1 trial = 1 program
  • All answers final
  • Intermediary keystrokes recorded

Programs

10 categories, 2-3 versions each (25 total)


  • between - filter two lists, intersection
  • counting - for loop with bug
  • funcall - function call with different spacing
  • initvar - summation and factorial
  • order - 3 functions called (in order/shuffled)

  • rectangle - compute area of 2 rectangles
  • overload - overloaded + operator (numbers/strings)
  • partition - partition list of numbers
  • scope - function calls with no effect
  • whitespace - linear equations


Legend: [ most errors ] • [ fewer errors ] • [ no errors ]

Response Correctness

  • Participant response vs. true program output
  • Normalized Levenshtein distance (0-1)
  • Perfect (exact match) vs. correct (formatting differences ignored)

True Output

[1, 2]
[3, 4]

Perfect (distance 0.0)

[1, 2]
[3, 4]

Correct (distance 0.46)

1 2
3,4

Incorrect (distance 0.08)

[1, 3]
[3, 4]
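The grading metric is a normalized Levenshtein distance. A minimal sketch is below; it assumes the distance is normalized by the length of the longer string, which reproduces the distances shown above (the exact normalization used in the experiment may differ).

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_distance(response, true_output):
    # Scale the edit distance into [0, 1] by the longer string's length
    longest = max(len(response), len(true_output))
    return levenshtein(response, true_output) / float(longest) if longest else 0.0

true = "[1, 2]\n[3, 4]"
print(normalized_distance("[1, 2]\n[3, 4]", true))            # 0.0  -> perfect
print(round(normalized_distance("1 2\n3,4", true), 2))        # 0.46 -> correct (formatting only)
print(round(normalized_distance("[1, 3]\n[3, 4]", true), 2))  # 0.08 -> incorrect despite small distance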

The eyeCode Library


  • Python library for analyzing eye movements
  • Built on top of pandas (R-like data frames) and scipy/numpy (stats/numerics)
  • Automatic areas of interest (text, code)
  • Hit testing with layers
  • Code-specific plots and metrics


[http://github.com/synesthesiam/eyecode]

Pandas and Scipy

In [10]:
import pandas, scipy.stats, numpy

# Numpy and pandas work together to create R-like data frames
data = numpy.random.uniform(size=25).reshape((5, 5))
frame = pandas.DataFrame(data, columns=["A", "B", "C", "D", "E"])

# Non-numeric columns are supported
frame["F"] = ["x", "x", "o", "o", "o"]
frame
Out[10]:
A B C D E F
0 0.988141 0.601001 0.313569 0.493887 0.702926 x
1 0.256549 0.802963 0.632752 0.250193 0.152852 x
2 0.875510 0.354493 0.665779 0.787402 0.279158 o
3 0.473622 0.000874 0.153416 0.064031 0.127221 o
4 0.255761 0.794496 0.568220 0.868812 0.488199 o

5 rows × 6 columns

In [29]:
# Statistics are available in scipy.stats
r, p = scipy.stats.spearmanr(frame.A, frame.B)
print "Spearman: r={0}, p={1}".format(r, p)

# Pandas lets you sort, group, and slice data frames
for f_val, sub_frame in frame.groupby("F"):
    # Underneath, they are accessible as standard numpy arrays
    s_vals = sub_frame[["A", "B", "C", "D", "E"]].values.flatten()
    print "When F={0}, sum is {1}, std is {2}".format(f_val, s_vals.sum(), s_vals.std())
Spearman: r=-0.5, p=0.391002218956
When F=o, sum is 6.75699318412, std is 0.291021433612
When F=x, sum is 5.19483392306, std is 0.259278513253

Areas of Interest

  • Automatic with Pygments
  • Works for any supported language!

Block

Line

Syntax
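The Block, Line, and Syntax panels above show three AOI granularities. The eyeCode library builds these automatically; as a rough sketch of the underlying idea (not the library's API), Pygments alone can tokenize a program and yield (line, column) positions for syntax-level AOIs. All names below are illustrative, and pixel boxes would still require font metrics.

from pygments.lexers import PythonLexer

code = "for i in range(3):\n    print(i)\n"

# One AOI per non-whitespace token: (line, column, text, token kind)
aois = []
for offset, tok_type, text in PythonLexer().get_tokens_unprocessed(code):
    if not text.strip():
        continue  # skip pure whitespace tokens
    line = code.count("\n", 0, offset) + 1
    col = offset - (code.rfind("\n", 0, offset) + 1)
    aois.append({"line": line, "col": col, "text": text, "kind": str(tok_type)})

for aoi in aois:
    print(aoi)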


Hit Testing

  • Fixations mapped to areas of interest
  • Highest area of overlap (or point)
  • Overlap area ∝ probability?
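A sketch of the overlap-based hit test described above, assuming rectangular AOIs and a square window around each fixation; the window size and the point-in-box fallback are illustrative choices, not the library's actual defaults.

def overlap_area(box_a, box_b):
    # Area of intersection between two (x1, y1, x2, y2) rectangles
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    return max(0, x2 - x1) * max(0, y2 - y1)

def hit_test(fix_x, fix_y, aois, radius=20):
    # Assign the fixation to the AOI with the largest overlap against a
    # square window centered on the fixation; fall back to point-in-box
    window = (fix_x - radius, fix_y - radius, fix_x + radius, fix_y + radius)
    best, best_area = None, 0
    for name, box in aois.items():
        area = overlap_area(window, box)
        if area > best_area:
            best, best_area = name, area
    if best is None:
        for name, (x1, y1, x2, y2) in aois.items():
            if x1 <= fix_x <= x2 and y1 <= fix_y <= y2:
                return name
    return best

aois = {"line 1": (0, 0, 400, 20), "line 2": (0, 20, 400, 40)}
print(hit_test(120, 18, aois))   # overlaps both lines; "line 1" wins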


"Super" Code Image

Saccade Angles

  • Saccade = quick jump between fixations
  • Plot angles and lengths
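Saccade lengths and angles follow directly from consecutive fixation positions; the coordinates below are made up for illustration.

import math

fixations = [(100, 200), (250, 205), (90, 240)]   # hypothetical (x, y) gaze positions

for (x1, y1), (x2, y2) in zip(fixations, fixations[1:]):
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(-dy, dx)) % 360   # flip y: screen y grows downward
    print("length=%.1f px, angle=%.1f deg" % (length, angle))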


Scanpaths

  • Fixations → AOIs → Scanpath
  • Repeats removed, duration lost
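Collapsing the fixation-AOI sequence into a scanpath might look like the sketch below (AOI labels are hypothetical).

def scanpath(fixation_aois):
    # Drop consecutive repeats; fixation durations are discarded entirely
    path = []
    for aoi in fixation_aois:
        if not path or path[-1] != aoi:
            path.append(aoi)
    return path

print(scanpath(["L1", "L1", "L2", "L2", "output", "L2", "L2"]))
# ['L1', 'L2', 'output', 'L2']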

Transition Matrices

  • First-order transition probabilities
  • Steady state = AOI importance?
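A sketch of both ideas, assuming the "steady state" is the stationary distribution of the first-order transition matrix; the AOI names and scanpath are made up.

import numpy

def transition_matrix(path, aois):
    # Count AOI-to-AOI transitions along the scanpath, then normalize each row
    index = dict((a, i) for i, a in enumerate(aois))
    counts = numpy.zeros((len(aois), len(aois)))
    for src, dst in zip(path, path[1:]):
        counts[index[src], index[dst]] += 1
    probs = counts.copy()
    for i, total in enumerate(counts.sum(axis=1)):
        if total > 0:
            probs[i] /= total
    return probs

def steady_state(matrix):
    # Stationary distribution: eigenvector of P^T for its largest eigenvalue (1)
    vals, vecs = numpy.linalg.eig(matrix.T)
    v = numpy.real(vecs[:, numpy.argmax(numpy.real(vals))])
    return v / v.sum()

aois = ["L1", "L2", "output"]
path = ["L1", "L2", "L1", "L2", "output", "L2", "output", "L1"]
m = transition_matrix(path, aois)
print(m)
print(steady_state(m))   # approx. [0.29, 0.43, 0.29]: L2 "most important"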


Fixation Timelines

  • Example: words and fixations in LTR reading order
  • All fixations are the same duration

Fixation Timelines

  • Time on each line is proportional to line length


Rolling Metrics

  • Computed over a rolling time window
  • Window size and step can vary
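A rolling metric can be sketched directly; the window and step sizes below are arbitrary, and the fixation log is made up.

def rolling_metric(fixations, metric, window_ms=2000, step_ms=500):
    # Apply `metric` to the durations of fixations starting inside each window
    end = max(start for start, _ in fixations)
    results, t = [], 0
    while t <= end:
        in_window = [dur for start, dur in fixations if t <= start < t + window_ms]
        results.append((t, metric(in_window) if in_window else None))
        t += step_ms
    return results

fixes = [(0, 230), (250, 310), (600, 260), (900, 450),
         (1400, 280), (1800, 520), (2500, 300), (3100, 240)]
mean = lambda xs: sum(xs) / float(len(xs))
for t, value in rolling_metric(fixes, mean):
    print((t, value))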

Experiment Results

  • Reading behavior (descriptive, predicting)
  • Saccade angles and scanpaths (string-edit distance)
  • Important lines (fixation duration, steady state)
  • Strategies and calculations (rolling metrics)

Reading Behavior

  • Fixation Duration - 273 ms avg (expected 300-400 ms)
  • Fixation Count - 172 avg (0.97 corr w/ duration)
  • Fixation Rate - 2.73 per sec avg
  • Code/Output Transitions - 6 avg per trial


Reading Behavior Prediction

  • Simple OLS models, fit with statsmodels package
  • Keywords least fixated
  • Line length and line number useful for predicting duration

Type    | Dep. Var          | Useful                           | Not Useful
Line    | First fixation    | Line #                           | Other text/content
Line    | Fixation duration | Line length, whitespace prop.    | Experience level
Element | Fixation duration | Category, line #                 |
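A minimal sketch of fitting such a line-level OLS model with statsmodels; the data frame and column names below are hypothetical stand-ins for the experiment's per-line measures.

import pandas
import statsmodels.formula.api as smf

lines = pandas.DataFrame({
    "fix_duration_ms": [850, 1200, 400, 1600, 700, 1100],
    "line_length":     [22, 35, 10, 41, 18, 30],
    "whitespace_prop": [0.0, 0.1, 0.4, 0.0, 0.2, 0.1],
})

# Fixation duration on a line as a function of its length and whitespace proportion
model = smf.ols("fix_duration_ms ~ line_length + whitespace_prop", data=lines).fit()
print(model.summary())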

Saccade Angles

  • Left - eyeCode experiment (all programs)
  • Center - eyeCode experiment (counting programs only)
  • Right - Natural language experiment


Scanpath Comparisons

  • For each pair of trials (same program)
  • Scanpath over block- or line-level AOIs plus the output box (repeats removed)
  • Normalized Levenshtein distance (string edit distance)
  • Consider ScanMatch (Needleman-Wunsch algorithm)
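ScanMatch scores aligned scanpaths with the Needleman-Wunsch algorithm. The sketch below shows only the core recurrence with a flat match/mismatch/gap scheme; ScanMatch proper uses a substitution matrix derived from AOI geometry and optional duration binning.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    # Global alignment score between two scanpath sequences
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        score[i][0] = i * gap
    for j in range(cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    return score[-1][-1]

print(needleman_wunsch("ABCAD", "ABDAD"))   # 3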


Important Lines/Elements

  • Fixation duration
  • Steady state


Calculations and Rolling Metrics

  • Spikes in average fixation duration correlated with mental calculations

Responses

  1. 0
  2. 1
  3. 1
  4. 6
  5. 2
  6. 11


 1 intercept = 1
 2 slope = 5
 3 
 4 x_base = 0
 5 x_other = x_base + 1
 6 x_end = x_base + x_other + 1
 7 
 8 y_base = slope * x_base + intercept
 9 y_other = slope * x_other + intercept
10 y_end = slope * x_end + intercept
11 
12 print x_base, y_base
13 print x_other, y_other
14 print x_end, y_end

Strategies


  • Batch (top)
    • Read through entire program, then respond
  • Incremental (bottom)
    • Respond to each print statement

rectangle - tuples

 1 def area(xy_1, xy_2):
 2     width = xy_2[0] - xy_1[0]
 3     height = xy_2[1] - xy_1[1]
 4     return width * height
 5 
 6 r1_xy_1 = (0, 0)
 7 r1_xy_2 = (10, 10)
 8 r1_area = area(r1_xy_1, r1_xy_2)
 9 print r1_area
10 
11 r2_xy_1 = (5, 5)
12 r2_xy_2 = (10, 10)
13 r2_area = area(r2_xy_1, r2_xy_2)
14 print r2_area


Program-Specific Results

between

  • Focus on lists and between() comparison
  • Less time on second list
  • Common error: assuming the final list is common(x_between, y_between), yielding [8]

True Output

[8, 7, 9]
[1, 0, 8, 1]
[8, 9, 0]

Common Error

[8, 7, 9]
[1, 0, 8, 1]
[8]

Program-Specific Results

counting

  • nospace - 80% correct (top), twospaces - 36% correct (bottom)
  • Line 2 → 3 transition twice as likely for:
    • nospace trials
    • Correct twospaces trials

Program-Specific Results

overload

  • More (relative) time spent on string literals in plusmixed
  • Priming from mathematical use of +?
  • More likely transitions from output box to "5" + "3"


  • Double-checking in plusmixed? (at 35 seconds)

Program-Specific Results

rectangle

  • Focus on area() arguments
  • Width/height calculation verification in tuples?

basic

tuples


Results Summary

Reading Behavior

  • Average fixation duration differs from other experiments (273 ms vs. 300-400 ms)
  • Experience level not correlated with avg. fixation duration
  • Keywords received the least amount of time

Linear Models

  • Time on line - line length and whitespace proportion
  • First fix on line - line number
  • Time on code element - category and line number

Version Differences

  • Only counting, initvar, overload, and rectangle
  • Other program versions not distinguishable with eye/performance metrics
  • Programs may be too simple

Process/Cognitive Models (Future Work)

Goal

  • Develop a quantitative cognitive model of program comprehension
  • Compare model to human eye movements and errors

Models

  • Line reader model
  • Basic ACT-R model
  • Advanced ACT-R model


Cognitive Complexity Model (Cant)


  • Process model for program comprehension
  • Chunking
    • Reading and understanding a coupled block of code
    • Faster when chunks are smaller, less complex
  • Tracing
    • Visual jump to required sub-chunk
    • Faster when chunk is closer, more salient
  • Inspiration for quantitative models


Parameters

  • Chunking (7)
  • Tracing (5)

Limitations

  • Definition of a chunk
  • Not a cognitive model
  • All parameters necessary?
  • Does not make errors

Chunking


Tracing

Line Reader Model


  • Reads and evaluates lines like a debugger
  • Chunk = 1 line
  • Chunk complexity ∝ f(line length)
  • Tracing complexity ∝ g(line count)
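As a toy illustration only (not the actual model), a line reader's timing could be parameterized roughly as below; every constant is a placeholder, and no recall, tracing, or errors are modeled.

def simulate_line_reader(lines, chunk_delay_s=0.3, per_char_s=0.05):
    # Visit each line in order; reading time grows with line length
    t, events = 0.0, []
    for i, line in enumerate(lines, 1):
        t += chunk_delay_s + per_char_s * len(line.strip())
        events.append((round(t, 2), "read line %d" % i))
    return events

program = ["intercept = 1", "slope = 5", "print x_base, y_base"]
for time_s, event in simulate_line_reader(program):
    print((time_s, event))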


Parameters

  • Chunking delay
  • Tracing delay
  • Chunk/tracing recall factors
  • Typing speed

Limitations

  • Single line
  • Not a cognitive model
  • Does not make errors
  • Unrealistic typing

ACT-R


  • ACT-R is a cognitive architecture implemented in Common Lisp
    • Adaptive Control of Thought - Rational
    • Computational theory of cognition
  • Cognition is transforming and moving chunks between buffers
    • Chunks are a collection of key/value pairs
    • Buffers hold one chunk at a time
    • Productions are micro-steps in a task
  • Modules for
    • Declarative memory
    • Task state (imaginal), goals
    • Vision, hearing (aural), speaking (vocal), typing (manual)
  • Bottleneck in the procedural module
  • Sub-symbolic effects
    • Memory retrieval delay
    • Manual jamming
    • Eye movements

(p encode-letter
   ;; Conditions: the goal says we are attending, the visual buffer holds
   ;; a text chunk, and the imaginal buffer is currently empty
   =goal>
      isa         read-letters
      state       attend
   =visual>
      isa         text
      value       =letter1
   ?imaginal>
      buffer      empty
==>
   ;; Actions: move the goal to a waiting state and request a new imaginal
   ;; chunk that stores the attended letter
   =goal>
      state       wait
   +imaginal>
      isa         array
      letter1     =letter1
)

Basic ACT-R Model


  • Declarative memory
    • Chunk/tracing recall
  • Visual module
    • Chunk size, tracing distance
  • Manual module
    • Typing response


Parameters

  • Initial DM activations
  • Standard ACT-R parameters

Limitations

  • Single line
  • Cannot recognize algorithms
  • Beginner typing skill

Advanced ACT-R Model


  • Basic model + ontology + spatial
  • Map visual locations/areas to program semantics (loops, functions)
  • Use qualitative spatial reasoning
  • Variable name similarity (prefix), semantic priming


Parameters

  • Initial DM activations
  • DM weights (similarity)
  • Standard ACT-R parameters

Limitations

  • No IDE/tool interaction
  • Only single file, simple programs
  • High-level rules of discourse?

Thank you!

References

  • Françoise Détienne and Frank Bott. Software design–cognitive aspects. Springer Verlag, 2002.
  • Dario D. Salvucci. Mapping eye movements to cognitive processes. PhD thesis, Carnegie Mellon University, 1999.
  • Andreas Stefik and Susanna Siebert. An empirical investigation into programming language syntax. ACM Transactions on Computing Education (TOCE), 13(4):19, 2013.