%run prelude.ipy
%%html
<link rel="stylesheet" href="files/custom_devel.css" />
Software complexity is "a measure of resources expended by a system [human or other] while interacting with a piece of software to perform a given task."
Basili, 1980
Property | Description |
---|---|
\((\exists P, Q)(|P| \ne |Q|)\) | Not all programs should have the same complexity |
\((\forall c)( \{ P \; | \; |P| = c \} \textrm{ is finite})\) | The set of programs whose complexity is \(c\) is finite |
\((\exists P, Q)(|P| = |Q| \textrm{ and } P \ne Q)\) | Some programs share the same complexity |
\((\exists P, Q)(P \equiv Q \textrm{ and } |P| \ne |Q|)\) | Functional equivalence does not imply complexity equivalence |
\((\forall P, Q)(|P| \le |P; Q| \textrm{ and } |Q| \le |P; Q|)\) | Concatenation cannot decrease complexity |
\((\exists P, Q, R)(|P| = |Q| \textrm{ and } |P; R| \ne |Q; R|)\) | Context matters for complexity after concatenation |
\((\exists P)(|P| \ne |\textrm{permute}(P)|)\) | The order of statements matters |
\((\forall P)(|P| = |\textrm{rename}(P)|)\) | Identifier and operator names do not matter |
\((\exists P, Q)(|P| + |Q| < |P; Q|)\) | Concatenated programs may be more complex than the sum of their parts |
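Properties like these can be sanity-checked mechanically against a candidate measure. Below is a minimal sketch that uses non-blank line count as a crude, hypothetical stand-in measure and checks a few of the properties on toy program texts; both the measure and the programs are made up for illustration and are not used elsewhere in this work.

def complexity(program):
    # Crude stand-in measure: count non-blank lines
    return sum(1 for line in program.splitlines() if line.strip())

P = "x = 1\ny = 2\nprint(x + y)"
Q = "print(3)"

# There exist programs with different complexities
assert complexity(P) != complexity(Q)

# Concatenation cannot decrease complexity
PQ = P + "\n" + Q
assert complexity(P) <= complexity(PQ) and complexity(Q) <= complexity(PQ)

# Renaming identifiers leaves this measure unchanged, consistent with |P| = |rename(P)|
assert complexity(P) == complexity(P.replace("x", "a").replace("y", "b"))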
One feature which all of these [theoretical] approaches have in common is that they begin with certain characteristics of the software and attempt to determine what effect they might have on the difficulty of the various programmer tasks.
A more useful approach would be first to analyze the processes involved in programmer tasks, as well as the parameters which govern the effort involved in those processes. From this point one can deduce, or at least make informed guesses, about which code characteristics will affect those parameters.
Cant et al., 1995
Term | Description |
---|---|
\(R_F\) | Speed of recall or review (familiarity) |
\(R_S\) | Chunk size |
\(R_C\) | Type of control structure in which chunk is embedded |
\(R_E\) | Difficulty of understanding complex Boolean or other expressions |
\(R_R\) | Recognizability of chunk |
\(R_V\) | Effects of visual structure |
\(R_D\) | Disruptions caused by dependencies |
\(T_F\) | Dependency familiarity |
\(T_L\) | Localization |
\(T_A\) | Ambiguity |
\(T_S\) | Spatial distance |
\(T_C\) | Level of cueing |
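Cant et al. combine terms like these into a recursive cognitive complexity model. As a rough sketch of its shape (not the authors' exact formulation), the effort to comprehend a chunk \(i\) is its immediate chunking cost plus, for each dependency \(j\) the reader must trace, the tracing cost and the cost of comprehending the traced chunk:

\[ C_i = R_i + \sum_{j \in D_i} \left( T_j + C_j \right) \]

Here \(R_i\) stands for the chunk-comprehension factors above (\(R_F\) through \(R_D\)), \(T_j\) for the tracing factors (\(T_F\) through \(T_C\)), and \(D_i\) for the set of chunks that \(i\) depends on.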
# Demographics for all experiments
fig = plot.misc.demographics(experiments)
fig.savefig("images/demographics-all.png", dpi=50)
pyplot.clf()

# Demographics for the Bloomington participants only
fig = plot.misc.demographics(experiments[experiments.location == "bloomington"])
fig.savefig("images/demographics-bloomington.png", dpi=50)
pyplot.clf()
- between: filter two lists, intersection
    - functions: between/common in functions (24 lines)
    - inline: no functions (19 lines)
- counting: simple for loop with bug (a sketch follows below)
    - nospace: no blank lines in loop body (3 lines)
    - twospaces: 2 blank lines in loop body (5 lines)
- funcall: simple function call with different values
    - nospace: calls on 1 line, no spaces (4 lines)
    - space: calls on 1 line, spaced out (4 lines)
    - vars: calls on 3 lines, different vars (7 lines)
- overload: overloaded + operator (number strings)
    - multmixed: numeric *, string + (11 lines)
    - plusmixed: numeric +, string + (11 lines)
    - strings: string + (11 lines)
- partition: partition list of numbers
    - balanced: odd number of items (5 lines)
    - unbalanced: even number of items (5 lines)
    - unbalanced_pivot: even number of items, pivot var (6 lines)
- initvar: summation and factorial
    - bothbad: bug in both (9 lines)
    - good: no bugs (9 lines)
    - onebad: bug in summation (9 lines)
- order: 3 simple functions called
    - inorder: call order = definition order (14 lines)
    - shuffled: call order \(\ne\) definition order (14 lines)
- rectangle: compute area of 2 rectangles
    - basic: x,y,w,h in separate vars, area() in function (18 lines)
    - class: x,y,w,h,area() in class (21 lines)
    - tuples: x,y,w,h in tuples, area() in function (14 lines)
- scope: function calls with no effect
    - diffname: local/global vars have different names (12 lines)
    - samename: local/global vars have the same name (12 lines)
- whitespace: simple linear equations
    - linedup: code is aligned on operators (14 lines)
    - zigzag: code is not aligned (14 lines)
The overload programs hinge on Python's + operator applied to strings versus numbers, e.g.:

print "1" + "2"
print 4 * 3

Answers given for lines like these varied, for example:

12 12
"12",12
3 12
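For another illustration, the counting base above is described as a short for loop with a bug. A hypothetical 3-line version in the spirit of its nospace variant (not the exact stimulus text) could look like the following, where the bug is that the closing message is printed on every iteration:

for i in [1, 2, 3, 4]:
    print "The count is", i
    print "Done counting"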
# Overlay fixation circles on the screenshot for a single trial
from eyecode import data, plot
from PIL import Image
all_fixations = data.hansen_2012.all_fixations()
trial_id = 17
screen_img = data.hansen_2012.trial_screen(trial_id)
img = plot.fixations.circles(all_fixations[all_fixations.trial_id == trial_id], screen_img, alpha=0.6)
img.thumbnail((1000, 1000), Image.ANTIALIAS)
img.save("images/17-circles.jpg")
# Count the most common AOI tri-grams in one trial's line-level scanpath
from nltk.util import ngrams
import pandas
from eyecode import data, aoi
from matplotlib import pyplot

fixations = data.hansen_2012.all_fixations()
aois = data.hansen_2012.areas_of_interest()
trial_id = 17
t_fixes = fixations[fixations.trial_id == trial_id]
t_aois = aois[aois.trial_id == trial_id]

line_scan = aoi.scanpath_from_fixations(
    t_fixes, repeats=False,
    aoi_names = { "line": [] })

tri_grams = pandas.Series(
    ngrams(line_scan, 3)).value_counts()
ax = tri_grams[:10].plot(kind="barh")
ax.invert_yaxis()
ax.figure.tight_layout()
ax.figure.savefig("images/example-hist.png")
pyplot.clf()
# Load libraries and experiment data
import nltk.util
import pandas
from eyecode import data, aoi
fixes = data.hansen_2012.all_fixations()
aois = data.hansen_2012.areas_of_interest()
# Filter down to a single trial
trial_id = 17
t_fixes = fixes[fixes.trial_id == trial_id]
t_aois = aois[aois.trial_id == trial_id]
# Compute scanpath and plot top 10 tri-grams
line_scan = aoi.scanpath_from_fixations(
t_fixes, repeats=False,
aoi_names = { "line": [] })
pandas.Series(nltk.util.ngrams(line_scan, 3))\
.value_counts()[:10].plot(kind="barh")
\(p(y \mid x)\): the probability that AOI \(y\) is seen when AOI \(x\) is seen.
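One way to make that concrete is to estimate \(p(y \mid x)\) directly from a scanpath by counting consecutive AOI pairs. The sketch below is not an eyecode API; it assumes line_scan is the sequence of AOI labels produced by scanpath_from_fixations above.

from collections import Counter

# Consecutive (x, y) fixation pairs from the line-level scanpath
pairs = list(zip(line_scan, line_scan[1:]))
pair_counts = Counter(pairs)
x_counts = Counter(x for x, _ in pairs)

# p(y | x): probability that AOI y is seen next, given AOI x is seen
p_transition = {
    (x, y): count / float(x_counts[x])
    for (x, y), count in pair_counts.items()
}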