15 October 2011

Tabs vs. Spaces? I’ll Do You One Better

Now, I don’t ordinarily like to get involved in holy wars, because arguing about programming serves largely to waste time that could be spent actually programming. But I’m making an exception in this case in order to make, shall we say, a modest proposal.

While doing research for an article, I rediscovered elastic tabstops, which are a fairly sane way of managing indentation and tabulation, especially if you like to use proportional-width fonts for programming. Unfortunately, they’re not as widely implemented as they ought to be, and the reference implementation (a gedit plugin) leaves something to be desired in terms of efficiency.

The search for an Emacs implementation (which, sadly, proved fruitless) led me to all kinds of juicy “discussions” concerning the merits of tabs versus spaces. There are many arguments for both, but summarising and responding to them is beyond the scope of this article.

The peculiar thing to me was that people kept stressing the semantic nature of the tab character versus the space character. In typewriter terms, spaces are for moving the cursor a fixed width, while tabs are for moving until the next tabstop.
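
To make the typewriter distinction concrete, here is a minimal Python sketch. It assumes fixed tabstops every `width` columns; the names `next_tabstop` and `expand_tabs` are mine, purely for illustration. A space always advances one column, while a tab advances however far it takes to reach the next stop.

```python
def next_tabstop(column, width=8):
    """Return the first tabstop strictly to the right of `column`."""
    return (column // width + 1) * width


def expand_tabs(line, width=8):
    """Replace each tab with the spaces needed to reach the next tabstop."""
    out = []
    column = 0
    for ch in line:
        if ch == "\t":
            stop = next_tabstop(column, width)
            out.append(" " * (stop - column))  # variable width: depends on column
            column = stop
        else:
            out.append(ch)  # every other character advances exactly one column
            column += 1
    return "".join(out)
```

For single lines this should agree with Python's built-in `str.expandtabs`, which is what you would actually use; the point is only that a tab's width depends on where the cursor already is.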

On a typewriter and in word-processing programs, you can easily change the global positions of the tab stops, but formatting code is a subtler problem, and I think, as long as we’re sticking to tabs versus spaces, elastic tabstops are the best solution.

But who says we must stick to just tabs and spaces? There are plenty of other ASCII control characters that just sit around doing nothing these days; why don’t we repurpose them? As long as we’re thinking semantically, the ASCII control characters offer some interesting possibilities in the way of code formatting and semantic markup:

  • HT (Horizontal Tabulation) would be used only for formatting of tabular data, in the same manner as elastic tabstops—but never for indentation.
  • VT (Vertical Tabulation) would serve to terminate rows in tabular data, allowing table cells to vary in height, e.g., by containing linefeeds.
  • SI (Shift In) and SO (Shift Out) could be repurposed to mark increases and decreases, respectively, in indentation. This has the benefit of not requiring indentation characters to be repeated at the start of every line, although repeating them anyway would let files degrade more gracefully in tools that rely heavily on indentation, such as Python.
  • BS (Backspace) would appear at the beginning of a line to “dedent” code such as a label or case label in C, which is subordinate to its parent, yet not equal to its siblings.
  • FF (Form Feed) could separate logical sections of code, and thus be used like the “region” markers available in many IDEs. You still see some old source files with this convention from time to time.
  • US, RS, and GS (Unit, Record, and Group Separators) could provide further logical division of code, or semantically mark up data types and their related operations.
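
For the curious, the SI/SO/BS part of this could be sketched in a few lines of Python. Everything here is an assumption on my part: that the markers appear only at the start of a line, that one level equals four spaces, and the function name `expand_indent_markup` itself. It turns marked-up text back into conventionally indented text.

```python
SI = "\x0f"  # Shift In: indent one level deeper from this line on
SO = "\x0e"  # Shift Out: indent one level shallower from this line on
BS = "\x08"  # Backspace: dedent this line only


def expand_indent_markup(text, unit="    "):
    """Expand leading SI/SO/BS markers into ordinary indentation."""
    depth = 0
    out = []
    for line in text.split("\n"):
        once = 0  # one-line dedents requested by BS
        i = 0
        while i < len(line) and line[i] in (SI, SO, BS):
            if line[i] == SI:
                depth += 1
            elif line[i] == SO:
                depth = max(depth - 1, 0)
            else:
                once += 1
            i += 1
        body = line[i:]
        level = max(depth - once, 0)
        # Leave blank lines blank rather than padding them with spaces.
        out.append(unit * level + body if body else "")
    return "\n".join(out)


marked = "switch (x) {\n\x0f\x08case 1:\nfoo();\n\x0e}"
print(expand_indent_markup(marked))
```

Note that only the lines where the depth changes carry a marker; `foo();` has nothing in front of it and still comes out indented.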

And that’s just in the C0 range. Consider the possibilities with some control codes from the C1 set:

  • BPH (Break Permitted Here) would indicate where lines should be wrapped before other text-wrapping methods are applied. This would eliminate the need to manually insert linefeeds to break long lines.
  • NBH (No Break Here), like its cousin above, would provide an otherwise absent means of indicating that no line break is to be inserted at the current position.
  • SSA and ESA (Start and End of Selected Area) could be used by editors to save the current selection, preserving editor state in a central location in the file.
  • HTS and VTS (Horizontal and Vertical Tabulation Set) would cause a tabstop to be set at the current position, overriding the automatic behavior of the elastic tabstops.
  • HTJ (Horizontal Tabulation with Justification) would produce a right-aligned rather than left-aligned field in a table.
  • PLD and PLU (Partial Line Down and Up) could automatically produce rich formatting for subscripts and superscripts.
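
As a rough illustration of the BPH idea, here is a Python sketch that wraps a long line only at the marked positions. It assumes BPH is the C1 code point U+0082 and uses a simple greedy strategy; the name `wrap_at_bph` is hypothetical, and NBH is not needed here because nothing else is allowed to break the line.

```python
BPH = "\u0082"  # Break Permitted Here (C1 control code)


def wrap_at_bph(line, width):
    """Greedily wrap `line` to `width` columns, breaking only at BPH marks."""
    lines = []
    current = ""
    for segment in line.split(BPH):
        if current and len(current) + len(segment) > width:
            lines.append(current.rstrip())  # drop the space left at the break
            current = segment
        else:
            current += segment
    lines.append(current.rstrip())
    return lines


print(wrap_at_bph("foo(alpha, \u0082beta, \u0082gamma)", 12))
```

A segment longer than `width` is simply left overlong, since the markup gives no permission to break inside it.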

Imagine just throwing a few simple control characters in your files, and being able to interact with beautifully typeset, richly formatted code, with all styling defined by local CSS so you never have to see code formatted in a way you don’t like.

Control characters have the advantage of being immune to interpretation by a lot of tools. Existing editors and compilers would need only minimal changes to accept code with semantic markup. An appropriately designed editor could even convert code on the fly for backward compatibility with unmodified compilers.

I hope you get that I’m not really serious about this—an earnest proposal for something as broad as “inline language-agnostic semantic code markup” would need to take a lot more into consideration than some old telecom codes.

I think many would agree that the idea has some merit, but few would agree on any particular implementation of it. Then again, who knows? Literate programming has its adherents, and that’s not so very different in spirit.

(Oh, and for the love of all that is holy, don’t anybody get the idea that comments containing HTML or LaTeX or whatever would be a good solution. Just try it and see how far you get.)