In [1]:
%run prelude.ipy
In [3]:
%%html
<link rel="stylesheet" href="files/custom_devel.css" />

Quantifying Program Complexity and Comprehension

Michael Hansen, Andrew Lumsdaine, Rob Goldstone, Raquel Hill, Chen Yu

Dissertation Proposal

Indiana University, Fall 2013

Big Question

  • How should we quantify the psychological or cognitive complexity of a program?

Medium Questions

  • What is cognitive complexity in the context of programming?
  • Which aspects of a program/programmer affect this cognitive complexity?
  • Can we quantify a program's cognitive complexity?
  • Knowing a program's cognitive complexity, what can we predict?

Cognitive Complexity in the Context of Programming

Software complexity is "a measure of resources expended by a system [human or other] while interacting with a piece of software to perform a given task."

Basili, 1980
  • Human system
  • Resource expenditure
  • Task

Presentation Overview

  1. Measuring Software Complexity
  2. Psychology of Programming
  3. Experiments
  4. Modeling: Mr. Bits

1. Measuring Complexity

  • Completed work
    • Literature review
    • Complexity vs. reuse experiment
  • Proposed work
    • Cohesive write-up, complexity vs. readability

Kinds of Complexity

  • Problem/computational complexity
    • Complexity of underlying problem or domain
    • Usually considered fixed
  • Representational complexity
  • Cognitive/psychological complexity

Computational Complexity

  • Bounds on computing resources as a function of input size
  • \(\mathcal{O}(c) \lt \mathcal{O}(n) \lt \mathcal{O}(n^c) \lt \mathcal{O}(c^n)\)

Kolmogorov Complexity

  • Computational resources needed to specify an object
  • Size of smallest program in language \(\mathcal{L}\)
  • Not computable
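Because Kolmogorov complexity is uncomputable, practical work falls back on computable upper bounds. A minimal sketch using zlib-compressed size as a crude proxy (the choice of compressor is an illustrative assumption, not part of the definition):

```python
import random
import string
import zlib

def compressed_size(text):
    """Computable stand-in for Kolmogorov complexity: the length of
    the zlib-compressed byte string (an upper bound, never exact)."""
    return len(zlib.compress(text.encode("utf-8"), 9))

# A highly regular string has a short description ("'ab' * 500"),
# while a pseudo-random string of the same length does not.
repetitive = "ab" * 500
random.seed(0)
random_ish = "".join(random.choice(string.ascii_lowercase)
                     for _ in range(1000))

assert compressed_size(repetitive) < compressed_size(random_ish)
```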

Kinds of Complexity

  • Problem/computational complexity
  • Representational complexity
    • Physical form of the program
    • Language, formatting, naming, etc.
  • Cognitive/psychological complexity

Source Code Metrics

  • Syntactic - Size/Spatial/Graph/Counter-Factual
    • Lines of Code
    • Function Complexity
    • Inheritance Depth
    • Minimum Description Length
  • Readability
    • Line length
    • Number of identifiers/identifier length
    • Indentation/blank lines
  • Concepts
    • Formal Concept Analysis
    • Concept Identification

Weyuker's Properties

  • Proposed properties of syntactic software complexity measures
Property Description
\((\exists P, Q)(|P| \ne |Q|)\) Not all programs should have the same complexity
\((\forall c)( \{ P \; | \; |P| = c \} \textrm{ is finite})\) The set of programs whose complexity is \(c\) is finite
\((\exists P, Q)(|P| = |Q| \textrm{ and } P \ne Q)\) Some programs share the same complexity
\((\exists P, Q)(P \equiv Q \textrm{ and } |P| \ne |Q|)\) Functional equivalence does not imply complexity equivalence
\((\forall P, Q)(|P| \le |P; Q| \textrm{ and } |Q| \le |P; Q|)\) Concatenation cannot decrease complexity
\((\exists P, Q, R)(|P| = |Q| \textrm{ and } |P; R| \ne |Q; R|)\) Context matters for complexity after concatenation
\((\exists P)(|P| \ne |\textrm{permute}(P)|)\) The order of statements matters
\((\forall P)(|P| = |\textrm{rename}(P)|)\) Identifier and operator names do not matter
\((\exists P, Q)(|P| + |Q| < |P; Q|)\) Concatenated programs may be more complex than the sum of their parts
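These properties can be checked mechanically for a concrete metric. A small sketch, taking lines of code as the metric and textual concatenation as \(P; Q\) (both choices are illustrative assumptions): LOC is additive under concatenation, so the strict inequality of the last property can never hold for it.

```python
def loc(program):
    # Toy syntactic metric: number of non-blank lines.
    return sum(1 for line in program.splitlines() if line.strip())

def concat(p, q):
    # Interpret Weyuker's P; Q as simple textual concatenation.
    return p + "\n" + q

P = "x = 1\ny = x + 1"
Q = "print(y)"

# LOC is additive, so |P| + |Q| == |P; Q| always; the property
# requiring |P| + |Q| < |P; Q| for some P, Q fails for LOC.
assert loc(concat(P, Q)) == loc(P) + loc(Q)
```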

Kinds of Complexity

  • Problem/computational complexity
  • Representational complexity
  • Cognitive/psychological complexity
    • Influenced by problem, representational complexity
    • Function of programmer experience, mental resource constraints
    • Task dependent: reuse vs. debugging vs. modification

Qualitative Models

  • Integrated Metamodel (von Mayrhauser, 1995)
    • Program/situation models + top-down planning
  • Cognitive Dimensions of Notation (Blackwell and Green, 1995)
    • Programming languages are used to try out ideas
    • Hidden dependencies, viscosity, consistency, ...
  • Rules of Discourse (Soloway and Ehrlich, 1984)
    • Unwritten rules internalized by experts
    • Expectations that drive understanding process

Kinds of Complexity

  • Problem/computational complexity
  • Representational complexity
  • Cognitive/psychological complexity
    • Influenced by problem, representational complexity
    • Function of programmer experience, mental resource constraints
    • Task dependent: reuse vs. debugging vs. modification

Quantitative Models

  • Cognitive Weights (Chhabra, 2011; Shao et al., 2003)
    • Assign weights to syntactic & semantic elements
    • Complexity = \(f({weights})\)
  • Cognitive Complexity Metric (Cant et al., 1995)
    • Process model based on chunking & tracing
    • Terms for chunk size, control structures, boolean expressions, etc.
    • Complexity = \(f({chunking}) + g({tracing})\)
  • Mr. Bits
    • Embodied process model based on eye movements, memory, spatial reasoning, inference
    • Task is to predict printed output
    • Complexity = time spent, errors made, etc.
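The cognitive-weights idea can be sketched in a few lines using Python's `ast` module. The weight table below is hypothetical (the published metrics define their own values); the point is only the mechanism of assigning weights to syntactic elements and summing:

```python
import ast

# Illustrative weights for basic control structures (hypothetical
# values in the spirit of cognitive-weight metrics, not the
# published tables).
WEIGHTS = {ast.If: 2, ast.For: 3, ast.While: 3, ast.FunctionDef: 2}

def cognitive_weight(source):
    """Sum assumed weights over all statement nodes; statements
    without an entry count as 1 (plain sequence)."""
    total = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt):
            total += WEIGHTS.get(type(node), 1)
    return total

simple = "x = 1\ny = x + 1\nprint(y)"
branchy = "for i in range(3):\n    if i % 2:\n        print(i)"
assert cognitive_weight(simple) < cognitive_weight(branchy)
```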

Complexity vs. Reuse

2. Psychology of Programming

  • Completed work
    • Literature review
    • Onward! workshop paper (cognitive architectures)
  • Proposed work
    • Text understanding models (Kintsch, 1978)
    • Other eye-tracking studies of programming

Cant's Cognitive Complexity Metric (CCM)

One feature which all of these [theoretical] approaches have in common is that they begin with certain characteristics of the software and attempt to determine what effect they might have on the difficulty of the various programmer tasks.

A more useful approach would be first to analyze the processes involved in programmer tasks, as well as the parameters which govern the effort involved in those processes. From this point one can deduce, or at least make informed guesses, about which code characteristics will affect those parameters.

Cant et al., 1995

Cant's Cognitive Complexity Metric (CCM)

  • Immediate chunk complexity (\(R_i\))
  • Sub-chunk complexity (\(C_j\))
  • Tracing difficulty (\(T_j\))

CCM Terms

Term Description
\(R_F\) Speed of recall or review (familiarity)
\(R_S\) Chunk size
\(R_C\) Type of control structure in which chunk is embedded
\(R_E\) Difficulty of understanding complex Boolean or other expressions
\(R_R\) Recognizability of chunk
\(R_V\) Effects of visual structure
\(R_D\) Disruptions caused by dependencies
 
\(T_F\) Dependency familiarity
\(T_L\) Localization
\(T_A\) Ambiguity
\(T_S\) Spatial distance
\(T_C\) Level of cueing
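How these terms combine can be sketched as a weighted sum: immediate chunk difficulty from the R-terms plus a tracing cost per dependency from the T-terms. The scores below are hypothetical placeholders; the actual CCM derives each term from measurable code features and recurses into sub-chunks:

```python
# Hypothetical scores for one chunk's R-terms and for the T-terms
# of a single dependency (illustrative values only).
chunk_terms = {"R_F": 1.0, "R_S": 2.5, "R_C": 1.5, "R_E": 2.0,
               "R_R": 0.5, "R_V": 1.0, "R_D": 1.5}
dependency_terms = {"T_F": 0.5, "T_L": 1.0, "T_A": 1.5,
                    "T_S": 2.0, "T_C": 0.5}

def chunk_complexity(r_terms, t_terms_per_dep):
    """CCM-style combination: immediate chunk difficulty plus the
    tracing cost of each dependency (sub-chunk recursion omitted)."""
    R = sum(r_terms.values())
    T = sum(sum(t.values()) for t in t_terms_per_dep)
    return R + T

chunk_complexity(chunk_terms, [dependency_terms])
```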

3a. Experiment 1: Output Prediction

  • Completed work
    • Data collection from Mechanical Turk and Bloomington (162 participants)
    • arXiv paper with statistical analysis
  • Proposed work
    • Journal article with additional complexity/performance metrics
    • Classifier analysis

Task

  • Predict printed output of 10 short Python programs
  • 2-3 versions of each program, randomly assigned
  • Pre/post surveys
  • No feedback or syntax highlighting

Participants

  • 162 total participants
    • 29 Bloomington ($10)
    • 130 Mechanical Turk ($0.75)
    • 3 via e-mail
  • 1,602 trials
    • 18 trials discarded

In [20]:
fig = plot.misc.demographics(experiments)
fig.savefig("images/demographics-all.png", dpi=50)
pyplot.clf()

Demographics (All Participants)

In [21]:
fig = plot.misc.demographics(experiments[experiments.location == "bloomington"])
fig.savefig("images/demographics-bloomington.png", dpi=50)
pyplot.clf()

Demographics (Bloomington Participants)

  • Younger participants, more experienced programmers

Home Screen

  • Program order is randomized

Trial Screen

  • Images instead of text, no feedback

Programs (1/2)

10 categories, 2-3 versions each (25 total)

  • between - filter two lists, intersection
    • functions - between/common in functions (24 lines)
    • inline - no functions (19 lines)
  • counting - simple for loop with bug
    • nospace - no blank lines in loop body (3 lines)
    • twospaces - 2 blank lines in loop body (5 lines)
  • funcall - simple function call with different values
    • nospace - calls on 1 line, no spaces (4 lines)
    • space - calls on 1 line, spaced out (4 lines)
    • vars - calls on 3 lines, different vars (7 lines)
  • overload - overloaded + operator (number strings)
    • multmixed - numeric *, string + (11 lines)
    • plusmixed - numeric +, string + (11 lines)
    • strings - string + (11 lines)
  • partition - partition list of numbers
    • balanced - odd number of items (5 lines)
    • unbalanced - even number of items (5 lines)
    • unbalanced_pivot - even number of items, pivot var (6 lines)

Programs (2/2)

10 categories, 2-3 versions each (25 total)

  • initvar - summation and factorial
    • bothbad - bug in both (9 lines)
    • good - no bugs (9 lines)
    • onebad - bug in summation (9 lines)
  • order - 3 simple functions called
    • inorder - call order = definition order (14 lines)
    • shuffled - call order \(\ne\) definition order (14 lines)
  • rectangle - compute area of 2 rectangles
    • basic - x,y,w,h in separate vars, area() in function (18 lines)
    • class - x,y,w,h,area() in class (21 lines)
    • tuples - x,y,w,h in tuples, area() in function (14 lines)
  • scope - function calls with no effect
    • diffname - local/global vars have different names (12 lines)
    • samename - local/global vars have the same name (12 lines)
  • whitespace - simple linear equations
    • linedup - code is aligned on operators (14 lines)
    • zigzag - code is not aligned (14 lines)

Grades

  • 0 to 10 (perfect)
  • \(\ge 7\) correct modulo formatting
print "1" + "2"
print 4 * 3

True Output

12
12

Correct (7)

"12",12

Common Error (4)

3
12

Incorrect (0)

barney

Anatomy of a Trial

  • Keystroke count = 4
    • True output characters = 2
  • Keystroke coefficient = 4/2 = 2
  • Response proportion \(\approx 0.5\)
  • Grade = 10 (perfect)
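The keystroke coefficient is simply keys pressed divided by characters in the true output. A one-line sketch with the numbers from this trial:

```python
def keystroke_coefficient(keystrokes, true_output):
    """Keys pressed divided by characters in the true output; a value
    near 1 means the participant typed the answer with few corrections."""
    return keystrokes / len(true_output)

# Numbers from the trial above: 4 keystrokes against a 2-character
# true output give a coefficient of 2.
assert keystroke_coefficient(4, "12") == 2.0
```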

Performance Metrics

Classifier Analysis

3b. Experiment 2: Eye Tracking

  • Completed work
    • Data collection from Bloomington (29 participants)
    • Videos and preliminary analysis available via web
    • Koli Calling workshop paper with automated coding
  • Proposed work
    • Paper with fixation metric and scanpath comparisons
    • Release Python library, data, and complete analyses
In [10]:
trial_id = 17
screen_img = data.hansen_2012.trial_screen(trial_id)
img = plot.fixations.circles(all_fixations[all_fixations.trial_id == trial_id], screen_img, alpha=0.6)
img.thumbnail((1000, 1000), Image.ANTIALIAS)
img.save("images/17-circles.jpg")

Eye-Tracking Hardware

  • Tobii TX300 - 300Hz
  • 23 in. screen, 1920x1080
  • Free-standing (no chinrest)
  • Tobii Studio 2.2
  • Fixations from single trial
  • between_functions program
  • Radii proportional to duration

Data Correction

  • Fixations were manually corrected by the experimenter
  • Vertical shifts only

Uncorrected

Corrected


Mapping Fixations to Areas of Interest

  • Multiple layers of AOIs, disjoint intra-layer
  • In each layer, fixation \(\rightarrow\) 0 or 1 AOI
  • Circle around fixation point, AOI with largest area overlap
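A sketch of this assignment rule, approximating the circle/rectangle overlap by grid sampling (AOI names and coordinates are made up for illustration):

```python
def overlap_area(cx, cy, r, rect, step=1.0):
    """Approximate area of the circle (cx, cy, r) inside the
    rectangle rect = (x, y, w, h) by sampling a point grid."""
    x, y, w, h = rect
    area = 0.0
    gy = cy - r
    while gy <= cy + r:
        gx = cx - r
        while gx <= cx + r:
            inside_circle = (gx - cx) ** 2 + (gy - cy) ** 2 <= r * r
            inside_rect = x <= gx <= x + w and y <= gy <= y + h
            if inside_circle and inside_rect:
                area += step * step
            gx += step
        gy += step
    return area

def best_aoi(cx, cy, r, aois):
    """Map a fixation to the AOI (name -> rect) with the largest
    overlap, or None when the circle touches no AOI."""
    area, name = max((overlap_area(cx, cy, r, rect), name)
                     for name, rect in aois.items())
    return name if area > 0 else None

# Two stacked line AOIs; a fixation centered inside the first line
aois = {"line1": (0, 0, 100, 20), "line2": (0, 20, 100, 20)}
assert best_aoi(50, 18, 5, aois) == "line1"
```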

Scanpath Comparisons

  • Levenshtein distance (string edit distance)
  • Needleman-Wunsch (DNA sequence matching)
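Levenshtein distance treats a scanpath as a string of AOI symbols and counts the insertions, deletions, and substitutions needed to turn one into the other. A minimal sketch:

```python
def levenshtein(a, b):
    """Edit distance between two AOI scanpaths, with insert,
    delete, and substitute each costing 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete
                           cur[j - 1] + 1,           # insert
                           prev[j - 1] + (x != y)))  # substitute
        prev = cur
    return prev[-1]

# Scanpaths as sequences of line-level AOI names (hypothetical data):
# one deletion turns the first scanpath into the second.
assert levenshtein(["L1", "L2", "L3"], ["L1", "L3"]) == 1
```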
In [30]:
from nltk.util import ngrams
fixations = data.hansen_2012.all_fixations()
aois = data.hansen_2012.areas_of_interest()

trial_id = 17
t_fixes = fixations[fixations.trial_id == trial_id]
t_aois = aois[aois.trial_id == trial_id]

line_scan = aoi.scanpath_from_fixations(
    t_fixes, repeats=False,
    aoi_names = { "line": [] })
tri_grams = pandas.Series(
    ngrams(line_scan, 3)).value_counts()
ax = tri_grams[:10].plot(kind="barh")
ax.invert_yaxis()
ax.figure.tight_layout()
ax.figure.savefig("images/example-hist.png")
pyplot.clf()

The eyeCode Library

  • Data importing/cleaning
  • Fixations \(\rightarrow\) AOIs
  • Scanpath construction/comparison
  • Visualization
# Load libraries and experiment data
import pandas
import nltk.util
from eyecode import data, aoi
fixes = data.hansen_2012.all_fixations()
aois = data.hansen_2012.areas_of_interest()

# Filter down to a single trial
trial_id = 17
t_fixes = fixes[fixes.trial_id == trial_id]
t_aois = aois[aois.trial_id == trial_id]

# Compute scanpath and plot top 10 tri-grams
line_scan = aoi.scanpath_from_fixations(
    t_fixes, repeats=False,
    aoi_names = { "line": [] })
pandas.Series(nltk.util.ngrams(line_scan, 3))\
    .value_counts()[:10].plot(kind="barh")

3c. Experiment 3: Follow-Up

  • Completed Work
    • Experiment design and software
    • New programs
  • Proposed Work
    • Run new Mechanical Turk experiment
    • Run new eye-tracking experiment (optional)

Updated Interface

New Programs

4. Modeling: Mr. Bits

Comparison to KLM and GOMS

Mr. Chips Retina

  • Open Questions
    • Add noise to saccades?
    • Distinguish numbers/letters/operators in parafoveal vision?

ACT-R Declarative Memory

  • Chunks (key/value pairs) with spreading activation
  • Decay and retrieval latency predictions
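The standard ACT-R base-level learning and retrieval-latency equations can be sketched directly (parameter values here are the conventional defaults, assumed rather than fitted):

```python
import math

def base_level_activation(access_times, now, d=0.5):
    """ACT-R base-level learning: B_i = ln(sum_j t_j^-d), where t_j
    is the time since the j-th access and d is the decay rate."""
    return math.log(sum((now - t) ** -d for t in access_times))

def retrieval_latency(activation, F=1.0, f=1.0):
    """Retrieval time falls exponentially with activation:
    T = F * exp(-f * A)."""
    return F * math.exp(-f * activation)

# A chunk accessed recently and often is retrieved faster
recent = base_level_activation([8.0, 9.0, 9.5], now=10.0)
stale = base_level_activation([1.0], now=10.0)
assert retrieval_latency(recent) < retrieval_latency(stale)
```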

Questions

  • Probe by identifier prefix? p_y when p_x is seen
  • Summary versus details for variables/functions
  • Existing programming ontology?

Glasgow Spatial Array

  • Qualitative spatial reasoning (left-of, contains, etc.)
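A minimal sketch of an array-based spatial medium: symbols placed at grid coordinates, with qualitative relations read off from their positions (the class and its API are hypothetical):

```python
class SpatialArray:
    """Toy spatial array: a sparse grid of symbols supporting
    qualitative queries such as left-of."""
    def __init__(self):
        self.cells = {}  # (col, row) -> symbol

    def place(self, symbol, col, row):
        self.cells[(col, row)] = symbol

    def find(self, symbol):
        for pos, s in self.cells.items():
            if s == symbol:
                return pos
        return None

    def left_of(self, a, b):
        # a is left of b when its column index is smaller
        (ca, _), (cb, _) = self.find(a), self.find(b)
        return ca < cb

arr = SpatialArray()
arr.place("x", 0, 0)
arr.place("y", 3, 0)
assert arr.left_of("x", "y")
```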


Questions

  • Time cost for array inspection and reasoning?
  • How to represent data flow? Time \(\approx\) space?

State Machine Behavior Models

Cognitive Domain Ontologies

  • Domain knowledge represented as entities, relationships, constraints
  • Constraint solver solutions are "possible worlds"
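A toy sketch of the idea: enumerate assignments over entity domains and keep only those that satisfy every constraint; each survivor is a "possible world" (the entities, values, and constraint below are illustrative, not from any real ontology):

```python
from itertools import product

# Entities with candidate values, plus constraints over them
domains = {
    "role":   ["loop_counter", "accumulator"],
    "update": ["increment", "add_item"],
}
constraints = [
    # a loop counter is updated by incrementing; an accumulator by adding
    lambda w: (w["role"] == "loop_counter") == (w["update"] == "increment"),
]

def possible_worlds(domains, constraints):
    """Yield every assignment consistent with all constraints."""
    names = list(domains)
    for values in product(*(domains[n] for n in names)):
        world = dict(zip(names, values))
        if all(c(world) for c in constraints):
            yield world

worlds = list(possible_worlds(domains, constraints))
# Two consistent worlds survive out of four candidate assignments
assert len(worlds) == 2
```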

Cognitive Domain Ontologies

  • Top-down: entity is asserted, evidence is searched for
  • Bottom-up: evidence is found, possible entities are considered

Prototype Model: Mr. Bits 0.1a

Threats to Validity

Conclusion and Future Work
