19 October 2011

Programming Languages Suck at UX

I wish I could say that programming languages are notorious for their terrible usability—unfortunately, very few people seem to have noticed, and those that have noticed don’t seem to be particularly upset about it. Frankly, that pisses me off, so I’m going to rant about it, and hopefully give the problem some of the notoriety it deserves. I’d like to point out that this is by no means an exhaustive list of the (many, many) problems with programming languages!

(If you don’t have the time or desire to read all of the ranting, at least skim the boldfaced points. There’s some good stuff in there.)

The Humanity!

When it comes to any kind of functional design, user experience is not just some tangential concern: it is of the utmost importance. The moment you make a design choice that doesn’t benefit the user, you’re doing it wrong, and your design will be bad. I’ll get more articulate in a minute, but first I want to admonish all language designers. You are wrong and bad.

We’re all guilty of it. No one can design something useful that will satisfy everybody’s expectations. But we can avoid making language design choices that benefit the computer without benefiting the programmer—such choices are nonsensical and absurd.

If you’re not designing for humans, why are you designing at all?

That doesn’t mean we shouldn’t explore new computational paradigms, new ways of reasoning about and optimising programs. What it means is that we should do so in the spirit of making programming easier and more enjoyable, so that developers can focus on the problems that they want to solve, rather than the linguistic obstacles that get in their way.

♦ ♦ ♦

Non-linguistic Languages

I find it baffling that most designers of programming languages give no concern to the language aspect. We are linguistic creatures by nature, and we have strong intuitions about how language is supposed to work.

Larry Wall is a notable (partial) exception to the norm, appropriately enough considering he studied linguistics back in the day. Perl is what you might call a semi-naturalistic programming language. This has nothing to do with “natural-language programming”, which is largely bunk: natural language itself is not suitable for programming computers any more than Nahuatl is suitable for talking to a Welshman.

A naturalistic language, then, is one that exhibits usability characteristics of natural languages. I’ll wager that everyone’s first languages are either spoken or signed, followed shortly by the written forms (if any) of those languages. A second language can be much easier to learn if it is more similar to your native tongue, and I think this holds for programming languages as well.

We learn new things in terms of what we already know.

One of my favourite things about Perl (well, not Perl 6) is the fact that it has noun classes, inflected with sigils (basically typed dereferencing operators). $x (“the scalar x”) is a completely different thing from @x (“the array x”), which in turn is a completely different thing from %x (“the hash x”). Agreement in type is analogous to grammatical agreement in number or gender, which is very common in human language.

Different things should look different; related things, related.

Another thing Perl mimics in natural language is implicit reference. $_ is like the pronoun “it”, a default thing, the current subject of discussion, which in many situations can be assumed. Programming languages use explicit reference almost exclusively. In order to perform a series of operations on a value, the programmer must explicitly name that value for every operation. Oddly enough, this is even the case in the highly English-like Inform.

In one of the object-oriented languages of which everyone seems to be so fond, this could be as simple as keeping track of the current object “under discussion” and allowing it to be assumed where an object would otherwise be expected.

Languages should let us elide repetition.
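As a minimal sketch of that idea (in Python, with a made-up Topic class and then method that exist in no real library), an object can hold the value currently under discussion so that each operation in a series doesn’t have to name it again:

```python
class Topic:
    """Holds the value currently 'under discussion', like the pronoun 'it'."""

    def __init__(self, value):
        self.it = value

    def then(self, operation, *args):
        # Apply the operation to the current topic; the result becomes
        # the new topic, so the caller never names the value explicitly.
        self.it = operation(self.it, *args)
        return self


# Three operations in a row, and the string is named exactly once.
greeting = (
    Topic("  Hello, World  ")
    .then(str.strip)
    .then(str.lower)
    .then(str.replace, ",", "")
    .it
)
print(greeting)
```

This is only an approximation: the language itself still knows nothing about a default subject, which is rather the point of the complaint.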

One thing I like about concatenative languages such as Forth and Factor is that there is essentially no named state—you can create variables as a convenience, but at the heart of it, everything is just dataflow.

Computer languages are (characteristically) so wholly unlike and inferior to natural language that it’s almost comical to call them languages at all. You express ideas every day in your native tongue that have no analogue in any programming language in existence. In most natural languages, you can apply productive rules to derive new terms, whereas basically all programming languages are purely isolating with positional syntax.

Naturalistic programming languages will never be pretty. They are not minimal, or elegant, or simple, but despite all that, they are intuitive and useful. More importantly, they meet users’ expectations about how language is supposed to work.

Humans have expectations. Do not judge them; only exploit them.

All of the “ugly” features of natural languages evolved for specific reasons, and the designs that evolution has wrought can be borrowed to create better designs for our artificial languages.

♦ ♦ ♦

Erroneous Errors

Real-world languages have loads of redundancy, which greatly improves error recovery; a programming language with judicious syntactic redundancy can issue warnings instead of errors for imperfect but understandable input, improving the user’s experience.

If a compiler can deliver detailed diagnostic messages, then it must do so; if there is not enough information to provide a meaningful diagnostic, then it shouldn’t provide any more information than is relevant to the programmer.

If I write int f(X x), where X is an undeclared type, the compiler should not do what GCC does, which is to write the following:

error: expected ‘)’ before ‘x’

This tells me nothing about what is actually wrong, but in an effort to be specific, it gives me misleading specific information. If I were to insert ) before x, I would then get:

error: expected declaration specifiers before ‘x’

Followed by further, largely nonsensical errors that depend on what follows the declaration of f(). It should say either something both specific and helpful, such as:

error: use of undeclared type ‘X’ in parameter list of function ‘f’

Or, if that is not possible, then at least something not so specific that it becomes incorrect:

error: the parameter list of function ‘f’ is invalid

Errors should be useful, or vague enough not to be misleading.

This is not just an implementation issue. Languages are often structured in such a way that it is very difficult to provide meaningful error messages, because constructs are too context-dependent or informationally sparse to obtain a meaningful message from an appropriately narrow and targeted view of the source.
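The “specific if you can, vague if you can’t” policy can be sketched in a few lines of Python. This is a toy, not how GCC or any real compiler works: diagnose_parameters and KNOWN_TYPES are hypothetical names, and a type of None stands in for a parameter that an imaginary parser couldn’t make sense of at all.

```python
# Hypothetical symbol table of types that have been declared so far.
KNOWN_TYPES = {"int", "char", "float", "double", "void"}


def diagnose_parameters(function, parameters, known_types=KNOWN_TYPES):
    """Return diagnostics for a parameter list given as (type, name) pairs."""
    diagnostics = []
    for type_name, _param_name in parameters:
        if type_name is None:
            # Not enough information: stay vague rather than guess.
            diagnostics.append(
                f"error: the parameter list of function '{function}' is invalid"
            )
            break
        if type_name not in known_types:
            # Enough information: say exactly what is wrong, and where.
            diagnostics.append(
                f"error: use of undeclared type '{type_name}' "
                f"in parameter list of function '{function}'"
            )
    return diagnostics


# The declaration int f(X x), where X is undeclared.
for message in diagnose_parameters("f", [("X", "x")]):
    print(message)
```

The checker never reports anything it didn’t actually establish, so it can’t send you off to insert a parenthesis that was never missing.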

♦ ♦ ♦

Terrible Typography

Most programmers don’t seem to care about legibility. If they did, they would probably be angry that we still use fixed-width fonts for programming. Our programming languages aren’t designed in such a way that they look any good in proportional-width fonts, and monospaced fonts are needed to carry out tabular formatting in editors that don’t support the sanest known way to handle tabulation.

Sigh. At least most of us agree that it’s a good idea to keep code width low. It is a bit sad, though, that we call the rule of thumb the “80-column rule”.

I yearn for the day when we measure code width in ems.

Monospaced fonts arose in the first place due to technical limitations: first, because typewriters could only move a fixed distance per character typed, and second, because it was easier to address characters on a graphical display as a regular grid of fixed-size sprites. Fixed-width fonts are a typographical oddity that survived through tradition and little else.

Programming notation is the way it is primarily because of the fallacy that programming is mathematics. In reality, writing software is also very much like, well, writing.

Programming notation should evoke mathematics and prose alike.

In prose, for example, punctuation merely explicates the structure and organisation of the text, and gives hints as to cadence and prosody. In programming languages, as in mathematics, punctuation abbreviates common operations—but it is also abused to stand in for structures that, in mathematical notation, would ordinarily be indicated with richer formatting.

The formatting problem is a side-effect of the fact that we still live in the Stone Age when it comes to our notational character set. No offense to ASCII, but it was already showing its age in the late 80s when Unicode showed up.

But thanks to the peculiar tenacity of fixed-width typefaces, most programming languages can still be comfortably written on a 50-year-old typewriter. How’s that for backward compatibility?

♦ ♦ ♦

Impossible Input Methods

On account of the limited circa-1960 palette of characters we use in our languages, we’re constantly making notational compromises, approximating glyphs with digraphs and trigraphs. Should “<=” really mean “≤” when it actually looks like “⇐”?

When I tutored computer science, I saw students write => instead of >= many a time, because they expected the conceptual reverse of <= to be the visual reverse of it, as with ≤ and ≥. Similarly, the students frequently confused the left and right sides of assignment—i.e., they would write y = x when they meant x = y—because with “=” there is no visual indication of the direction of assignment, nor of the fact that mutating assignment is not the same as mathematical equality.

Why don’t we use hyphens for hyphenation, minus signs for subtraction, and dashes for ranges, instead of the hyphen-minus for all three? Why do we use straight quotes ("") when curved quotes (“”) can be nested (or contain straight quotes) without escaping? There’s a wealth of time-tested typographical convention in both mathematics and prose that programming language designers simply discard without a second thought. Without a first thought.

Throw away traditions that don’t work; respect those that do.

These compromises would be totally unnecessary thanks to Unicode, but our editors and input methods lag so far behind that it’s still infeasible to comfortably enter many non-ASCII characters without dedicated editor support.

And no such support exists, because a language that uses “->”, which can be written in any editor, is more marketable than one that uses “→” and relies (however little!) on outside tools. On Linux I can use a compose key, which is only a mild inconvenience, but on Windows I’m stuck with Alt codes or Character Map.

We value terseness in a language because it increases the information density of the code, so we can work with more meaning at once, both per screen and per brain. Punctuation symbols are generally the most concise way to express a concept, and some symbolic notation in a language is absolutely good, insofar as it reduces cognitive load and eye movement. But the legibility of a punctuation-heavy language suffers just as greatly as that of one without any punctuation at all.

♦ ♦ ♦

The Fear

Perhaps the biggest problem with programming language design is that, because it is so bad, people are afraid to use tools that can help them. They are afraid of the whitespace sensitivity of Python, because they assume it will violate their expectations, and therefore cause them a hassle. In reality, whitespace sensitivity in Python and Haskell is totally innocuous, and actually quite helpful, because it was designed with some thought behind it.

When you only know the bad, you quietly assume all things are bad.

It’s no wonder people stick so tenaciously to a single language or language family. Their expectations were violated time and again when they were learning how to program, so they assume that learning a new language is always like that—and sadly, it often is. They don’t want their expectations to be violated again, so they stick to what they know. Even if it’s bad, at least it’s familiar.

We need to get rid of that kind of thinking—not just so that language design can move forward, but so that we can quit worrying about languages and get shit done.

15 October 2011

Tabs vs. Spaces? I’ll Do You One Better

Now, I don’t ordinarily like to get involved in holy wars, because arguing about programming serves largely to waste time that could be spent actually programming. But I’m making an exception in this case in order to make, shall we say, a modest proposal.

While doing research for an article, I rediscovered elastic tabstops, which are a fairly sane way of managing indentation and tabulation, especially if you like to use proportional-width fonts for programming. Unfortunately, they’re not as widely implemented as they ought to be, and the reference implementation (a gedit plugin) leaves something to be desired in terms of efficiency.

The search for an Emacs implementation (which, sadly, proved fruitless) led me to all kinds of juicy “discussions” concerning the merits of tabs versus spaces. There are many arguments for both, but summarising and responding to them are beyond the scope of this article.

The peculiar thing to me was that people kept stressing the semantic nature of the tab character versus the space character. In typewriter terms, spaces are for moving the cursor a fixed width, while tabs are for moving until the next tabstop.

On a typewriter and in word-processing programs, you can easily change the global positions of the tab stops, but formatting code is a subtler problem, and I think, as long as we’re sticking to tabs versus spaces, elastic tabstops are the best solution.

But who says we must stick to just tabs and spaces? There are plenty of other ASCII control characters that just sit around doing nothing these days; why don’t we repurpose them? As long as we’re thinking semantically, the ASCII control characters offer some interesting possibilities in the way of code formatting and semantic markup:

  • HT (Horizontal Tabulation) would be used only for formatting of tabular data, in the same manner as elastic tabstops—but never for indentation.
  • VT (Vertical Tabulation) would serve to terminate rows in tabular data, allowing table cells to vary in height, e.g., by containing linefeeds.
  • SI (Shift In) and SO (Shift Out) could be repurposed to mark increases and decreases, respectively, in indentation. This has the benefit of not requiring indentation characters to be repeated at the start of every line, although repeating them anyway would let the file degrade more gracefully in tools that rely heavily on indentation, such as Python.
  • BS (Backspace) would appear at the beginning of a line to “dedent” code such as a line or case label in C, which is subordinate to its parent, yet not equal to its siblings.
  • FF (Form Feed) could separate logical sections of code, and thus be used like the “region” markers available in many IDEs. You still see some old source files with this convention from time to time.
  • US, RS, and GS (Unit, Record, and Group Separators) could provide further logical division of code, or semantically mark up data types and their related operations.

And that’s just in the C0 range. Consider the possibilities with some control codes from the C1 set:

  • BPH (Break Permitted Here) would indicate where lines should be wrapped before other text-wrapping methods are applied. This would eliminate the need to manually insert linefeeds to break long lines.
  • NBH (No Break Here), like its cousin above, would provide an otherwise absent means of indicating that no line break is to be inserted at the current position.
  • SSA and ESA (Start and End of Selected Area) could be used by editors to save the current selection, preserving editor state in a central location in the file.
  • HTS and VTS (Horizontal and Vertical Tabulation Set) would cause a tabstop to be set at the current position, overriding the automatic behavior of the elastic tabstops.
  • HTJ (Horizontal Tabulation with Justification) would produce a right-aligned rather than left-aligned field in a table.
  • PLD and PLU (Partial Line Down and Up) could automatically produce rich formatting for subscripts and superscripts.

Imagine just throwing a few simple control characters in your files, and being able to interact with beautifully typeset, richly formatted code, with all styling defined by local CSS so you never have to see code formatted in a way you don’t like.

Control characters have the advantage of being immune to interpretation by a lot of tools. Existing editors and compilers would need only minimal changes to accept code with semantic markup. An appropriately designed editor could even convert code on the fly for backward compatibility with unmodified compilers.
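For instance, the on-the-fly conversion for the SI/SO indentation markers might look something like this rough Python sketch. The function name expand_indentation is made up, and so is the rule that markers sit at the very start of a line and take effect from that line onward; the proposal above doesn’t pin that down.

```python
SHIFT_IN = "\x0f"   # SI: increase indentation (as proposed above)
SHIFT_OUT = "\x0e"  # SO: decrease indentation


def expand_indentation(marked_up, indent="    "):
    """Convert leading SI/SO markers into ordinary leading spaces.

    Assumption: markers appear only at the start of a line and change the
    indentation level for that line and every line after it.
    """
    level = 0
    plain_lines = []
    for line in marked_up.split("\n"):
        # Consume any run of leading markers, adjusting the level as we go.
        while line[:1] in (SHIFT_IN, SHIFT_OUT):
            level += 1 if line[0] == SHIFT_IN else -1
            line = line[1:]
        level = max(level, 0)  # tolerate stray SO characters
        # Leave blank lines blank instead of padding them with spaces.
        plain_lines.append(indent * level + line if line else line)
    return "\n".join(plain_lines)


source = "if x:\n\x0fprint(x)\nx -= 1\n\x0eprint('done')"
print(expand_indentation(source))
```

Going the other way, from spaces back to markers, is the same bookkeeping in reverse, which is about as much as an editor would need for round-tripping.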

I hope you get that I’m not really serious about this—an earnest proposal for something as broad as “inline language-agnostic semantic code markup” would need to take a lot more into consideration than some old telecom codes.

I think many would agree that the idea has some merit, but few would agree on any particular implementation of it. Then again, who knows? Literate programming has its adherents, and that’s not so very different in spirit.

(Oh, and for the love of all that is holy, don’t anybody get the idea that comments containing HTML or LaTeX or whatever would be a good solution. Just try it and see how far you get.)

08 October 2011

Please Write More Crappy Productivity Articles

Every article about productivity is either crappy or excellent. If it’s excellent, I read it all the way through in order to absorb every detail about how I can get more things done—in doing so, I get nothing done. If, however, the article sucks, then I quit reading it immediately and go work on something I care about.

For me, then, good articles about productivity are usually counter to my productivity, and crappy articles are helpful. I’m so disgusted with them that I can’t help but want to work on something to counteract the time I lost by starting to read them.

Therefore, bloggers, please write more crappy productivity articles, so I can get shit done.

Regexes Parse XML Just Fine, Actually

Despite what you may believe, or may have heard from a particular famous Lovecraftian answer on Stack Overflow, you can actually use regular expressions to parse arbitrary XML. Of course, they’re not regular expressions in the strict mathematical sense, but rather in the sense with which most of us are familiar—Perl regexen.

This relies on a seemingly little-known but tremendously handy feature of Perl regular expressions: (?N) recursively matches, at the current position, the pattern in capture group N. This lets you match nested delimiters, with recursion that consumes no input capped at a depth of 50 unless you build yourself a special Perl.

The only bit you can’t do in the regular expression engine itself is assert that the names of two matching tags are equal, because you can’t match backreferences in recursive submatches. This isn’t a problem if you assume your input is well-formed. I’m talking about parsing XML, not checking whether some input actually is XML. Correctness is a Boolean, after all: invalid XML is not XML.


I wrote a quick Perl script to demonstrate this using a number of test strings, both those well formed and those less than perfect. This was in part because I feel like doing “impossible” things lately, and in part because I wanted to brush up on my Perl. The point of it is not to be a good solution, but rather the opposite: there is a certain joy in getting something to work in completely the wrong way.

Anyway, the body of it goes like this:

my $element = qr{ ( $STag ( $CharData? (?: $Reference | $CDSect | $PI
    | $Comment | (?1) )* $CharData? ) $ETag ) }sx;
                 # ^
                 # The morsel that matches $element recursively.

my @stack = '<html><head><title>Title</title></head><body><h1>Lies.</h1>'
    . '<p><i>You</em>, my friend, have been told them.</p></body></html>';

$" = "\n\n";
while (@stack) {
    my $string = shift @stack // '';
    my @groups = $string =~ m/$element/g;
    print "@groups\n\n" if @groups;
    unshift @stack, map {
        s/^<($Name)(?:$S*$Attribute)*$S?>//; my $a = $1;
        s/<\/($Name)$S*>$//;                 my $b = $1;
        $a eq $b ? $_ : undef
    } @groups;
}

This recursively enumerates all tags and their contents in some rough semblance of hierarchical order:

  • <html>…</html>
  • <head>…</head><body>…</body>
  • <head>…</head>
  • <title>…</title>
  • <body>…</body>
  • <h1>…</h1><p>…</p>
  • <h1>…</h1>
  • Lies.
  • <p>…</p>
  • <i>…</em>, my friend, have been told them.
  • <i>…</em>
  • You
  • <title>…</title>
  • Title

Here for your enjoyment is the script in its entirety, which ought to match all matched tags that are valid according to the XML specification, including weird tag names, comments, <![CDATA[...]]>, and all that other fun stuff.

I’m admittedly unsure of whether the rules involving lookahead and lookbehind actually match the spec, but it was what came to mind. The rules in question are $CData, $CharData, $PITarget, $PI, and $Comment (but they seem to work okay from what little testing I’ve done).

#!/usr/bin/perl

use warnings;
use strict;

my $S = qr{ \x20 | \x09 | \x0D | \x0A }x;

my $NameStartChar = qr{ : | [A-Z] | _ | [a-z] | [\xC0-\xD6] | [\xD8-\xF6]
    | [\xF8-\x{2FF}] | [\x{370}-\x{37D}] | [\x{37F}-\x{1FFF}]
    | [\x{200C}-\x{200D}] | [\x{2070}-\x{218F}] | [\x{2C00}-\x{2FEF}]
    | [\x{3001}-\x{D7FF}] | [\x{F900}-\x{FDCF}] | [\x{FDF0}-\x{FFFD}]
    | [\x{10000}-\x{EFFFF}] }x;

my $NameChar = qr{ $NameStartChar | - | \. | [0-9] | \xB7
    | [\x{0300}-\x{036F}] | [\x{203F}-\x{2040}] }x;

my $Char = qr{ \x09 | \x0A | \x0D | [\x20-\x{D7FF}] | [\x{E000}-\x{FFFD}]
    | [\x{10000}-\x{10FFFF}] }x;

my $Name = qr{ $NameStartChar $NameChar* }x;

my $EntityRef = qr{ & $Name ; }x;

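# Character references are &#...; or &#x...; (%Name; would be a PEReference).
my $CharRef = qr{ &\# [0-9]+ ; | &\#x [0-9a-fA-F]+ ; }x;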

my $Reference = qr{ $EntityRef | $CharRef }x;

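# Attribute values may be delimited by either double or single quotes.
my $AttValue = qr{ " (?: [^<&"] | $Reference )* "
    | ' (?: [^<&'] | $Reference )* ' }x;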

my $Attribute = qr{ $Name $S? = $S? $AttValue }x;

my $STag = qr{ < $Name (?: $S* $Attribute )* $S? > }x;

my $ETag = qr{ </ $Name $S? > }x;

my $CDStart = qr{ <!\[CDATA\[ }x;

my $CDEnd = qr{ \]\]> }x;

my $CData = qr{ (?<! $CDEnd ) $Char* (?! $CDEnd ) }x;

my $CharData = qr{ (?<! $CDEnd ) [^<&]* (?! $CDEnd ) }x;

my $CDSect = qr{ $CDStart $CData $CDEnd }x;

my $PITarget = qr{ (?! [Xx][Mm][Ll] ) $Name }x;

my $PI = qr{ <\? $PITarget (?: $S (?: (?<! \?> ) $Char* (?! \?> )))? \?> }x;

my $Comment = qr{ <!-- (?: (?: (?! - ) $Char ) | (?: - (?! - ) $Char ) )*
    --> }x;

my $element = qr{ ( $STag ( $CharData? (?: $Reference | $CDSect | $PI
    | $Comment | (?1) )* $CharData? ) $ETag ) }sx;

my @stack = '<html><head><title>Title</title></head><body><h1>Lies.</h1>'
    . '<p><i>You</em>, my friend, have been told them.</p></body></html>';

$" = "\n\n";
while (@stack) {
    my $string = shift @stack // '';
    my @groups = $string =~ m/$element/g;
    print "@groups\n\n" if @groups;
    unshift @stack, map {
        s/^<($Name)(?:$S*$Attribute)*$S?>//; my $a = $1;
        s/<\/($Name)$S*>$//;                 my $b = $1;
        $a eq $b ? $_ : undef
    } @groups;
}

I would call this a good (okay, slow and overspecified, but still kinda neat) practical solution to a problem that’s theoretically unsolvable. That, incidentally, will be the theme of many of my upcoming posts, so stay tuned if you’re into it.

06 October 2011

Correctness is a Boolean

The most misleading thing about the way programming is taught is how programs are graded. Students are used to receiving a score that represents the sum of the correct parts of an assignment. You can think of it as starting with a zero and earning points up to a maximum of 100, or as starting with 100 and losing points from there. As for me, I prefer to simply avoid thinking about it.

But when it comes to software, this kind of grading makes almost no sense whatsoever. Especially in the imperative languages that are generally taught to beginning students of programming, the effect of a statement depends crucially on the effects of the statements preceding it, and not just lexically, but temporally.

An error in a single line of a program is not necessarily local to that line, but more likely has effects that propagate forward in time or outward in an expression to cause further errors. By grading programs subjectively according to how many of the statements seem appropriately designed and sequenced, teachers influence students to treat programs as though their objective correctness is dependent on the mere sum of the subjective correctness of their parts.

Beginning programmers often have difficulty thinking above the level of individual statements in order to develop a high-level understanding of program structure and meaning. Professors don’t need to further their confusion by giving them a skewed representation of what makes a program right.

A program ought to be like a proof: a single flaw, and it’s not correct. “Correct” means that it does everything it ought to, and nothing it oughtn’t. That says nothing about the amount of work required to make an incorrect program correct, because that’s entirely subjective; it could be a difference as small as a single character, or as great as the entire program. The changes needed are not necessarily local to individual statements, nor even comprehensible at the statement level.
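To put the two views side by side, here is a small Python sketch. The names sum_of_parts_score and is_correct are invented for illustration, and the list of check results is hypothetical; think of each entry as one thing the program ought to do.

```python
def sum_of_parts_score(check_results):
    """Partial credit: the percentage of individual checks that passed."""
    return 100 * sum(check_results) // len(check_results)


def is_correct(check_results):
    """Boolean correctness: one failed check and the program is not correct."""
    return all(check_results)


# Hypothetical outcome of ten checks on a submission, one of which failed.
results = [True] * 9 + [False]
print(sum_of_parts_score(results))
print(is_correct(results))
```

The first function hands back a comfortable-looking 90; the second says the program is not correct, and says nothing about whether the fix is one character or a rewrite.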

That’s why, if I ever end up teaching programming, I’ll grade my assignments on the basis of correctness and subjective style separately, and draw a clear distinction between the two. If your program is correct, you’re guaranteed a passing grade. All the points beyond that you gotta earn by writing elegantly. It’s only fair.