Quantifying Program Comprehension

Michael Hansen, Rob Goldstone, Andrew Lumsdaine

Percepts and Concepts Lab, Spring 2013

Outline

eyeCode Experiment
Participants and Response Data
Eye-tracking Analysis
Future Work

The eyeCode Experiment

Task

Predict printed output of 10 short Python programs
2-3 versions of 10 programs, randomly assigned
Pre/post surveys

Goals

Small code changes = large effects?
Complexity is more than metrics
Eye-tracking data for modeling program comprehension

Home Screen

Trial Screen

Anatomy of a Trial

Response proportion $\approx 0.5$
Keystroke coefficient = 4/2 = 2
- Keystroke count = 4
- True output characters = 2
Grade = 10 (perfect)

Tobii TX300 Eye-Tracker

Free-standing (no head mount, chin rest)
300 Hz (fixations $\ge 100$ ms)

from IPython import display
display.YouTubeVideo("gwAT6mvlR3Q", width=800, height=450)

Programs (1/2)

10 categories, 2-3 versions each (25 total)

between - filter two lists, intersection
- functions - between/common in functions (24 lines)
- inline - no functions (19 lines)
counting - simple for loop with bug
- nospace - no blank lines in loop body (3 lines)
- twospaces - 2 blank lines in loop body (5 lines)
funcall - simple function call with different values
- nospace - calls on 1 line, no spaces (4 lines)
- space - calls on 1 line, spaced out (4 lines)
- vars - calls on 3 lines, different vars (7 lines)

overload - overloaded + operator (number strings)
- multmixed - numeric *, string + (11 lines)
- plusmixed - numeric +, string + (11 lines)
- strings - string + (11 lines)
partition - partition list of numbers
- balanced - odd number of items (5 lines)
- unbalanced - even number of items (5 lines)
- unbalanced_pivot - even number of items, pivot var (6 lines)

Programs (2/2)

initvar - summation and factorial
- bothbad - bug in both (9 lines)
- good - no bugs (9 lines)
- onebad - bug in summation (9 lines)
order - 3 simple functions called
- inorder - call order = definition order (14 lines)
- shuffled - call order $\ne$ definition order (14 lines)
rectangle - compute area of 2 rectangles
- basic - x,y,w,h in separate vars, area() in function (18 lines)
- class - x,y,w,h,area() in class (21 lines)
- tuples - x,y,w,h in tuples, area() in function (14 lines)

scope - function calls with no effect
- diffname - local/global var have different names (12 lines)
- samename - local/global var have same name (12 lines)
whitespace - simple linear equations
- linedup - code is aligned on operators (14 lines)
- zigzag - code is not aligned (14 lines)
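
The random assignment of versions can be sketched as below. This is only an illustration: the slides do not say how the randomization was done, so uniform choice per category, the `assign_versions` helper, and the fixed seed are all assumptions. The version names are copied from the program list.

```python
import random

# Version names copied from the program list in the slides.
VERSIONS = {
    "between": ["functions", "inline"],
    "counting": ["nospace", "twospaces"],
    "funcall": ["nospace", "space", "vars"],
    "initvar": ["bothbad", "good", "onebad"],
    "order": ["inorder", "shuffled"],
    "overload": ["multmixed", "plusmixed", "strings"],
    "partition": ["balanced", "unbalanced", "unbalanced_pivot"],
    "rectangle": ["basic", "class", "tuples"],
    "scope": ["diffname", "samename"],
    "whitespace": ["linedup", "zigzag"],
}


def assign_versions(rng):
    """Pick one version per category, uniformly at random (assumed scheme)."""
    return {category: rng.choice(names) for category, names in VERSIONS.items()}


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
assignment = assign_versions(rng)
total_versions = sum(len(names) for names in VERSIONS.values())
print(total_versions)  # 25
```

Each participant then sees 10 programs, one per category, out of the 25 versions.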

Participants and Response Data

162 total participants
- 29 Bloomington ($10)
- 130 Mechanical Turk ($0.75)
- 3 E-mail
1602 trials kept
- 18 trials discarded (of 162 × 10 = 1620)

Demographics

Grades

0 to 10 (perfect)
Grade $\ge 7$: correct modulo formatting

print "1" + "2"
print 4 * 3

True Output

12
12

Correct (7)

"12",12

Common Error (4)

3
12

Incorrect (0)

barney
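
One possible reading of "correct modulo formatting" is sketched below: compare responses after dropping everything except letters and digits. The slides do not give the grading rubric, so `normalize` and `same_modulo_formatting` are hypothetical helpers, and the sketch does not attempt to assign 0 to 10 grades.

```python
import re


def normalize(text):
    """Keep only letters and digits, so quotes, commas, and line breaks are ignored."""
    return re.sub(r"[^0-9A-Za-z]", "", text)


def same_modulo_formatting(response, true_output):
    """True if the response matches the true output once formatting is stripped."""
    return normalize(response) == normalize(true_output)


true_output = "12\n12"
print(same_modulo_formatting('"12",12', true_output))  # True
print(same_modulo_formatting("3\n12", true_output))    # False
print(same_modulo_formatting("barney", true_output))   # False
```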

Grades

Median trial grade = 10
Median experiment grade = 81 (out of 100)

Grade Distributions by Program

scope - samename

def add_1(added):
    added = added + 1

def twice(added):
    added = added * 2

added = 4
add_1(added)
twice(added)
add_1(added)
twice(added)
print added

scope - diffname

def add_1(num):
    num = num + 1

def twice(num):
    num = num * 2

added = 4
add_1(added)
twice(added)
add_1(added)
twice(added)
print added
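
For reference, here is a Python 3 port of the diffname version (the slides use Python 2 `print` statements). Assigning to a parameter only rebinds the local name, so neither function changes the caller's `added`, and the program prints 4. The samename version behaves the same way, because its parameter `added` is still a local name.

```python
def add_1(num):
    num = num + 1  # rebinds the local name only; nothing is returned


def twice(num):
    num = num * 2  # rebinds the local name only; nothing is returned


added = 4
add_1(added)
twice(added)
add_1(added)
twice(added)
print(added)  # 4
```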

Trial Duration

45 minutes for entire experiment
No time limit on individual trials

Median trial duration: 55 sec
Median experiment duration: 773 sec (12.9 min)

Response Proportions by Program

Time spent responding / trial time
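
A minimal sketch of the ratio, assuming both times are measured in seconds. The function name and the example numbers are hypothetical, not taken from the data.

```python
def response_proportion(response_seconds, trial_seconds):
    """Fraction of the trial spent responding (time responding / trial time)."""
    if trial_seconds <= 0:
        raise ValueError("trial_seconds must be positive")
    return response_seconds / trial_seconds


# Hypothetical trial: 20 s spent responding out of a 40 s trial.
print(response_proportion(20.0, 40.0))  # 0.5
```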

between - functions

def between(numbers, low, high):
    winners = []
    for num in numbers:
        if (low < num) and (num < high):
            winners.append(num)
    return winners

def common(list1, list2):
    winners = []
    for item1 in list1:
        if item1 in list2:
            winners.append(item1)
    return winners

x = [2, 8, 7, 9, -5, 0, 2]
x_btwn = between(x, 2, 10)
print x_btwn 

y = [1, -3, 10, 0, 8, 9, 1]
y_btwn = between(y, -2, 9)
print y_btwn 

xy_common = common(x, y)
print xy_common

between - inline

x = [2, 8, 7, 9, -5, 0, 2]
x_between = []
for x_i in x:
    if (2 < x_i) and (x_i < 10):
        x_between.append(x_i)
print x_between

y = [1, -3, 10, 0, 8, 9, 1]
y_between = []
for y_i in y:
    if (-2 < y_i) and (y_i < 9):
        y_between.append(y_i)
print y_between

xy_common = []
for x_i in x:
    if x_i in y:
        xy_common.append(x_i)
print xy_common
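
As a check that the two versions are equivalent, the sketch below ports the functions version to Python 3 and compares it with the inline logic. The inline loops are condensed into list comprehensions here for brevity, which is a rewrite and not the code participants saw.

```python
def between(numbers, low, high):
    winners = []
    for num in numbers:
        if (low < num) and (num < high):
            winners.append(num)
    return winners


def common(list1, list2):
    winners = []
    for item1 in list1:
        if item1 in list2:
            winners.append(item1)
    return winners


x = [2, 8, 7, 9, -5, 0, 2]
y = [1, -3, 10, 0, 8, 9, 1]

# "functions" version: three calls, three printed lists.
functions_output = [between(x, 2, 10), between(y, -2, 9), common(x, y)]

# "inline" version: same bounds and lists, no function calls.
inline_output = [
    [x_i for x_i in x if (2 < x_i) and (x_i < 10)],
    [y_i for y_i in y if (-2 < y_i) and (y_i < 9)],
    [x_i for x_i in x if x_i in y],
]

print(functions_output)                   # [[8, 7, 9], [1, 0, 8, 1], [8, 9, 0]]
print(functions_output == inline_output)  # True
```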

Keystroke Coefficient

Number of keystrokes / characters in true output
$> 1$: more keystrokes than output characters (less efficient)
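
A minimal sketch of the coefficient, using the numbers from the Anatomy of a Trial slide (4 keystrokes, 2 true output characters). Counting characters with `len` on the output string is an assumption; the slides do not say whether whitespace or line breaks are counted.

```python
def keystroke_coefficient(keystroke_count, true_output):
    """Number of keystrokes divided by the number of characters in the true output."""
    n_chars = len(true_output)  # assumption: every character counts
    if n_chars == 0:
        raise ValueError("true output must not be empty")
    return keystroke_count / n_chars


print(keystroke_coefficient(4, "12"))  # 2.0
```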

counting - nospace

for i in [1, 2, 3, 4]:
    print "The count is", i
    print "Done counting"

counting - twospaces

for i in [1, 2, 3, 4]:
    print "The count is", i 


    print "Done counting"

Eye-Tracking Analysis

29 participants, 290 trials
About $5 \frac{1}{2}$ hours of video
Fixations + saccades, corrected manually by the experimenter

Uncorrected

Corrected

Fixations and Areas of Interest

Need to quantize fixation positions

Line-based AOIs

Indentation is part of line AOI

Syntax-based AOIs

Current data is too noisy to use syntax AOIs

Fixations and Areas of Interest

By line and output box

Hit Testing

AOI with largest area overlap wins
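
The rule can be sketched as follows, assuming each fixation is treated as a small square around its center and each AOI is an axis-aligned rectangle in screen pixels. The `Rect` type, the 10-pixel half size, and the AOI coordinates are all hypothetical.

```python
from collections import namedtuple

Rect = namedtuple("Rect", "left top right bottom")


def overlap_area(a, b):
    """Area of the intersection of two rectangles (0 if they do not overlap)."""
    width = min(a.right, b.right) - max(a.left, b.left)
    height = min(a.bottom, b.bottom) - max(a.top, b.top)
    return max(width, 0) * max(height, 0)


def fixation_box(x, y, half_size=10):
    """Square region around a fixation center (half_size is an assumed value)."""
    return Rect(x - half_size, y - half_size, x + half_size, y + half_size)


def hit_test(box, aois):
    """Return the name of the AOI with the largest overlap, or None if none overlap."""
    best_name, best_area = None, 0
    for name, rect in aois.items():
        area = overlap_area(box, rect)
        if area > best_area:
            best_name, best_area = name, area
    return best_name


# Hypothetical layout: two code lines and an output box.
aois = {
    "line 1": Rect(0, 0, 400, 20),
    "line 2": Rect(0, 20, 400, 40),
    "output box": Rect(450, 0, 700, 200),
}

print(hit_test(fixation_box(100, 24), aois))   # line 2
print(hit_test(fixation_box(800, 300), aois))  # None
```

The first fixation straddles both lines but overlaps line 2 more (280 vs. 120 square pixels), so line 2 wins.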

Fixation Times by Line

Proportions of total fixation times (all participants)
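
A sketch of how such proportions could be computed, assuming each fixation has already been assigned to an AOI and has a duration in milliseconds. The function name and the fixation list are hypothetical.

```python
from collections import defaultdict


def fixation_time_proportions(fixations):
    """Map each AOI to its share of the total fixation time."""
    totals = defaultdict(float)
    for aoi, duration_ms in fixations:
        totals[aoi] += duration_ms
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {aoi: total / grand_total for aoi, total in totals.items()}


# Hypothetical fixations as (AOI, duration in ms) pairs.
fixations = [("line 1", 200), ("line 2", 300), ("line 1", 100), ("output box", 400)]
proportions = fixation_time_proportions(fixations)
print(proportions["output box"])  # 0.4
```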

Median grade = 10

Median grade = 4

Scanpath Comparisons

Levenshtein distance (string edit distance)
Needleman-Wunsch (global sequence alignment, as used for DNA)
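
Both measures can be applied to scanpaths written as sequences of AOI labels (here, line numbers). The sketch below uses the textbook dynamic-programming forms. The Needleman-Wunsch scores (match = 1, mismatch = -1, gap = -1) and the two scanpaths are assumptions for illustration, not the values used in the analysis.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of a and b (higher is more similar)."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i, item_a in enumerate(a, start=1):
        current = [i * gap]
        for j, item_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if item_a == item_b else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


# Hypothetical scanpaths: the line number fixated at each step.
path_a = [1, 2, 2, 3, 1]
path_b = [1, 2, 3, 3, 1]

print(levenshtein(path_a, path_b))       # 1
print(needleman_wunsch(path_a, path_b))  # 3
```

Levenshtein is a distance (lower means more similar), while the Needleman-Wunsch score rises with similarity and lets matches, mismatches, and gaps be weighted separately.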

AOI Transitions

for i in [1, 2, 3, 4]:
    print "The count is", i 


    print "Done counting"
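
Transitions can be tallied from a scanpath by counting moves between consecutive AOIs, as sketched below. Ignoring repeated fixations on the same AOI is an assumption, and the scanpath is hypothetical (line numbers refer to the program above, where lines 3 and 4 are blank).

```python
from collections import Counter


def transition_counts(scanpath):
    """Count moves between consecutive, different AOIs in a scanpath."""
    pairs = zip(scanpath, scanpath[1:])
    return Counter((src, dst) for src, dst in pairs if src != dst)


# Hypothetical scanpath over line AOIs and the output box.
scanpath = ["line 1", "line 2", "line 2", "line 1", "line 2", "line 5", "output box"]
counts = transition_counts(scanpath)

print(counts[("line 1", "line 2")])  # 2
print(counts[("line 2", "line 5")])  # 1
```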

Correct Trials

Incorrect Trials

Future Work

Collect more data
- New programs
- Chin rest for eye-tracker
Codify eye movements $\rightarrow$ participant strategies
- Differences between experts and novices
- Implications for programming education
Model comprehension process
- Qualitative theories to computational model
- Active vision model with procedural/declarative/spatial memory

Thank you!
