ProleText Information

Why?

A way is needed to bring together the plain-text world used in most E-mail, USENET news and many ordinary files with the rich-text world of HTML, word processor formats and other structured document forms. Our "Invisible Formatting" (alternatively called invisible HTML or Proletext -- not very rich text) allows documents to contain formatting structure but still look like ordinary text files, viewable with any document viewer in the world. Existing types like RTF, MIME Richtext and even HTML look ugly as ordinary text.

ClariNet designed this format so that we could publish news stories that can be read with all ordinary tools (particularly USENET newsreaders) but that can also be displayed nicely formatted after conversion to HTML, either in a special server, with a CGI, or directly in a web browser.

It is worth noting that while Proletext worked well for delivering ClariNet's news, I have also proposed a far richer and more powerful system called Out of Band HTML which would make more sense as a netwide standard for an easy transition from plain to richer forms of text.

How?

Invisible formatting information is embedded in trailing spaces and tabs on the ends of lines in an ordinary looking document. In addition, "blank" lines contain spaces and tabs with hidden formatting meanings. Assuming typical 60 column lines, one can have over 300 different codings on the end of a line without going past 80 columns. (Far fewer are needed.) On a blank line, almost a billion codings are possible.

Documents with invisible formatting always start with a magic line, which begins with "<SP><SP><TAB><SP><SP><TAB>" followed by a version encoding. Thus documents can be spotted and formatted even without a MIME Content-Type header for this new text type. This otherwise useless combination of spaces and tabs on a blank line should virtually assure that random documents are not treated as formatted.
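As a rough illustration of the detection step, here is a minimal Python sketch. The function name is hypothetical. It assumes the magic line is the first line of the text it is given, and that the version encoding after the prefix is itself made only of spaces and tabs, so that the line still looks blank. The formal specification may define this differently.

```python
# Magic prefix described above: <SP><SP><TAB><SP><SP><TAB>
MAGIC_PREFIX = "  \t  \t"


def has_magic_line(text: str) -> bool:
    """Guess whether `text` starts with an invisible-formatting magic line."""
    first_line = text.split("\n", 1)[0].rstrip("\r")
    if not first_line.startswith(MAGIC_PREFIX):
        return False
    # Assumption: the version encoding is also only spaces and tabs,
    # so the whole line must look blank to an ordinary viewer.
    return all(ch in " \t" for ch in first_line)
```

A document whose tabs were expanded to spaces in transit fails this check, so it would simply be shown as plain text.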

The invisible formatting is thus entirely invisible, independent of special RFC822 headers, and can even be cut and pasted in an ordinary editor, with some care. In the event that the document goes through a process that strips trailing spaces, it still displays acceptably, since the absence of trailing spaces is the code that a line should be displayed in monospace, with no line folding -- i.e. plain text. In addition, a process that maps tabs to spaces or removes them will cause the master header line to be invalid, stopping formatting. The system will not survive something like a space being added to all lines or other more random changes, however.

In addition, certain character sequences are defined to support intra-line formatting. These are based on common conventions, but are only to be interpreted inside an invisibly formatted document, so random use in other documents will not cause problems. Invisibly formatted documents must use these formally. For example, bounding text in " *" and "* " makes it bold, " _" and "_ " makes it italic.
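The bold and italic conventions just described could be mapped to HTML roughly as follows. This is only a sketch: the function name is hypothetical, the rule that the marked text may not begin or end with a space is an added assumption to avoid matching things like "x * y", and the choice of <B> and <I> as output tags is a guess rather than something taken from the ClariNet translator.

```python
import html
import re

# " *text* " -> bold; the delimiters must be bounded by spaces on the outside.
_BOLD = re.compile(r"(?<= )\*([^*\s](?:[^*]*[^*\s])?)\*(?= )")
# " _text_ " -> italic, with the same bounding rule.
_ITALIC = re.compile(r"(?<= )_([^_\s](?:[^_]*[^_\s])?)_(?= )")


def inline_to_html(line: str) -> str:
    """Convert the intra-line bold/italic markers of one line to HTML tags."""
    escaped = html.escape(line, quote=False)  # protect <, > and & first
    escaped = _BOLD.sub(r"<B>\1</B>", escaped)
    return _ITALIC.sub(r"<I>\1</I>", escaped)
```

For example, `inline_to_html("This is *very* important and _subtle_ indeed")` gives `This is <B>very</B> important and <I>subtle</I> indeed`, while identifiers such as `a_b_c` are left alone because the underscores are not bounded by spaces.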

If you take a "$" to be the invisible end of a line, a typical paragraph  $ 
would look like this one, with two spaces at the end of the first $ 
line and a single space after the continuation lines $

You can also look at a more detailed Proletext example, with 4 views of a Proletext formatted document.

Features

The basic features of HTML are supported -- paragraphs, headers, monospace lines, formatted lines with line breaks, bold, italics, lists and definition tables, horizontal rules, and simple embedded images and hyperlinks.

However, the language is not designed to be a one to one mapping with HTML. HTML is more powerful, and its tag style is not fully amenable to a system where tags may only be marked at the ends of lines and between paragraphs. However, all features of the invisible formatting language will have a direct HTML counterpart, so that translation from Proletext to HTML is always easy to do.

As no-trailing-space maps to ordinary monospace presentation, the worst that can happen to a document is that it displays in the basic text form. The language is designed to be extensible in a number of ways, but it is inherently extensible in the sense that any unknown formatting codes found by an old browser or translator can always cause the text to be displayed as plain text -- this is generally correct behaviour.

Usage

On Feb 1, 1995 ClariNet switched to version 1 of ProleText. Until that time we had been using another version, known as version 0, that was never officially released. Version 1 is not compatible with version 0. Netcom uses Proletext in their specialized CGIs to display ClariNet articles for HTML browsers. It is possible that we will switch the numbering and rename version 1 to version 0, making it the first official release.

For examples of documents with this formatting, examine stories in the newsgroup biz.clarinet.sample. We can provide you with the HTML converter.

In the Browser

Ideally, Proletext should be supported directly in the browser. Code to detect the Proletext header and map the text to HTML for use inside the browser is easy to use. We have code available to do this on request. If this is done, ordinary text files and USENET news postings which contain invisible formatting would pop up directly as nicely formatted documents.

We hope that browser developers will work with us to help define this format and support it directly in their browsers. We think users will be very pleased to see browsers that make formatted text appear "like magic" where other viewers show only uglier plain text. We have no interest in financial gain from the format, and will make our own code to process the format available free.

Due to the slight chance of errors in formatting or the accidental appearance of space-tab sequences in ASCII artwork, documents with invisible formatting should cause menu items or buttons to appear asking the browser to display the document as plain-text, and offering a help URL.

Mapping between formats

An HTML to Proletext translator is needed. This would allow people to take HTML documents (or documents in a format that can be turned into HTML) and quickly turn them into Proletext, so they can leave them on servers as plain text files. ClariNet has not written this translator as it is not necessary for our own publishing goals, but we or another party might write one quickly. Adapting an existing HTML handling tool to do this should be quite simple.

Limitations

Clearly the full depth of HTML can't be handled in such a system, most notably low-level tags, including complex links and image inclusions. Some things just can't be made invisible, such as URLs and filenames. A tag that specifies, "This line is raw HTML" in theory supports everything, but of course means lines that will look ugly to those viewing the file as ordinary text.

What's missing from current tools?

Our translator does not yet support links and images the way we have planned. We would like to discuss the best methods for this before implementation, since we don't currently use these features. We imagine a suitable link syntax might well be something like

		*HLink:URLOptions (Description)

mapping to:

		<A href=http://URL [Options xlation]>Description</A>

We would also like to consider table support. To support tables, you would define blank-line "start table" codes which would encode the table options and the number of elements, width ratios and so on. In particular you would encode the table delimiter characters. Then the table would be presented in ASCII, with line-drawing characters like "-" and "|" used to delimit rows and columns, as well as physical column numbers in the ASCII lines. It may be necessary to make the first line of the table have visible data, such as column tag marks, to make this work well. The translator would then parse the ASCII table and turn it into an HTML table.

Support for HTML forms is also possible, though this may be going too far. In theory, any feature of HTML can be supported by simply defining a nice looking visible syntax for the feature in question, and a means to parse the visible and invisible data.

We do not plan to go to the limit and support every feature. Beyond a certain point, the document should be considered a rich document, meant for presentation in HTML or another suitable format, viewable only by special browsers.

Specifics

The language has the first line of any text block tagged with a trailing code. Continuation lines of the block are tagged by a single trailing space. Ordinary paragraphs have their first (control) line tagged with two trailing spaces. Thus most typical documents are bulked out very little. As noted, tabular lines have no trailing space. Any line with the HR tag (<TAB><SP>) is mapped to an HTML <HR> -- the line is ignored, but will probably be a line of dashes etc.
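The line-level rules above can be sketched as a small classifier in Python. The function names and the category labels are hypothetical and used only for illustration. Codes other than the ones named in this section are lumped together as "other" here, and blank lines (which carry block-level codes) are not decoded. A real translator would follow the formal specification instead.

```python
def split_trailing_code(line: str) -> tuple[str, str]:
    """Split a line into its visible text and its trailing space/tab code."""
    line = line.rstrip("\r\n")
    visible = line.rstrip(" \t")
    return visible, line[len(visible):]


def classify(line: str) -> str:
    """Label a line according to the trailing-whitespace rules in this section."""
    visible, code = split_trailing_code(line)
    if not visible:
        return "blank"         # block-level codes on blank lines: not decoded here
    if code == "":
        return "monospace"     # no trailing space: plain, unfolded text
    if code == " ":
        return "continuation"  # single space: continues the current block
    if code == "  ":
        return "paragraph"     # two spaces: first line of an ordinary paragraph
    if code == "\t ":
        return "hr"            # <TAB><SP>: maps to <HR>, visible text is ignored
    return "other"             # some other code, not covered by this sketch
```

A translator meeting an "other" code it does not understand can fall back to showing the line as plain text, which keeps old tools safe when new codes are added.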

Facilities are present to define comment lines which don't go into the HTML, headers, centered lines etc.

The system supports mappings for future opcodes, so that old translators will be able to guess appropriate behaviour, or be told what it is. This may or may not be used greatly.

Code to translate is already available; a formal spec will be forthcoming.

The Specification

You may wish to read the more formal specification of the formatting language.