ProleText Specification

ClariNet's "Invisible Formatting" language, known as ProleText, provides a mechanism to embed format control information in documents through the use of normally unseen characters, namely spaces and tabs at the end of lines and in apparently "blank" lines.

Spaces and tabs, so long as they don't take a line past the 80 column mark, do not affect the display of ordinary text documents viewed in a fixed-width font on most computer screens and on printers. As such, documents can be prepared that look perfectly normal and can be viewed with almost every file viewer in existence. However, the documents contain hidden information on paragraph arrangement, headers, lists and formatting, which can be used by an intelligent viewing program to display them in a more pleasant or structured way.

In addition, this specification formalizes certain visible codes which affect formatting within lines, to allow the formal specification of bold face and italics/underlining as well as the insertion of hyperlinks and graphics.

The functionality of the "language" is a subset of HTML. Documents prepared in Proletext can always be mapped well and easily to HTML. In general, they can also be mapped to most other rich text file formats, including Microsoft RTF and many formatter languages, unless they use the "raw HTML" tag which allows the insertion of unfiltered HTML into the document.

In general, the semantics of formatting in Proletext match the semantics of formatting for the corresponding HTML. The same principles of leaving many layout decisions to the formatting program apply.

Many documents can also be mapped from HTML or other languages to this format with some loss of information.

In this document, trailing spaces on lines are known as tags. The most common tags have been assigned the shortest encodings. For example, a "continuation of paragraph/block" encoding is a single trailing space.

There is one special case. The encoding for "Show this line verbatim, in a fixed-width font" is to have no trailing space. This means that even if segments of a document have their trailing space removed, they still display in a readable way.

Reference Implementation

A freely available reference implementation translates Proletext to HTML. It is known as inform and is available in inform.tar.Z on my FTP site.

Lines

Formatted lines in a Proletext document must be kept to no more than 60 characters, allowing up to 19 columns of tag. In general, column 80 should not be used, as many displays wrap in this event. Non-formatted lines may be up to 79 columns long, and may even be up to 160 characters long if the author is unconcerned about the effect of wrapping. This is particularly true if they contain links. The behaviour of processors on lines over 160 characters is undefined, though "undefined" never means that the processor will trap, abort, or perform memory overwrites or illegal instructions.

Tag Encoding

In theory, over 300 codes can be defined on lines that are 60 columns long and have 19 columns for the tag. Tabs, an essential part of tag encoding, are assumed to be placed every 8 characters, i.e. at columns 1, 9, 17 and so on (1-origin). As such, a tab may not be used past column 72.
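The column arithmetic above can be checked mechanically. The following is a minimal Python sketch, assuming tab stops every 8 columns and a limit of 79 usable columns as described in this specification; the names display_width and fits_on_screen are illustrative and do not come from the reference implementation.

```python
TAB_STOP = 8      # tabs advance to columns 1, 9, 17, ... (1-origin)
MAX_COLUMN = 79   # column 80 is avoided because many displays wrap there


def display_width(line: str) -> int:
    """Return how many columns the line occupies once tabs are expanded."""
    return len(line.expandtabs(TAB_STOP))


def fits_on_screen(line: str) -> bool:
    """True if the line, including its invisible tag, stays within 79 columns."""
    return display_width(line) <= MAX_COLUMN
```

A 60-character line followed by two tabs and a space ends in column 73, so it fits; a tab placed after 72 characters would reach column 80 and does not.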

Almost a billion encodings are possible on blank lines (line-tags). In theory, if we wished to make the requirement that tabs be set in every column, far more encodings could be used. The use of control characters, many of which do not display in any viewer, could expand the space vastly -- so vastly that any line-based encoding feature could be put in ProleText -- but this is not the goal of this system.

Lines shorter than 60 columns could of course have many more encodings. However, the need is not great. A future version of the language might well shorten the line length to gain more encodings.

For the purposes of this specification, the tags are mapped to integers through a system that encodes units (each consisting of a run of spaces and a possible trailing tab) into 4 bits. Each 4-bit nybble encodes the number of spaces, plus 1. A nybble of 0 indicates that the tag is over, and that no tab is present after the spaces of the previous unit. The low-order nybble is the first sequence of spaces.
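As an illustration of the nybble packing, here is a small Python sketch. It follows the rule as literally stated here (each nybble holds the space count plus 1, low-order nybble first, a 0 nybble ends the tag). The function names pack_tag and unpack_tag are hypothetical, and the actual definitions in invis.h may differ in detail.

```python
def pack_tag(space_runs):
    """Pack a tuple of space-run lengths into an integer, low-order nybble first."""
    value = 0
    for index, spaces in enumerate(space_runs):
        if not 0 <= spaces <= 14:
            raise ValueError("a nybble can only hold space counts 0..14")
        # Store count + 1 so that a nybble of 0 can mean "tag is over".
        value |= (spaces + 1) << (4 * index)
    return value


def unpack_tag(value):
    """Recover the tuple of space-run lengths from a packed integer."""
    runs = []
    while value & 0xF:                 # a 0 nybble terminates the tag
        runs.append((value & 0xF) - 1)
        value >>= 4
    return tuple(runs)
```

Under these assumptions, three spaces, a tab, two spaces, a tab and one space pack to the nybbles 4, 3, 2, that is 0x234.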

This encoding into integers is here only for the purpose of the reference implementations and the definitions in the header file invis.h. This encoding has holes, and does not encode all possible tags, and not all integers are tags. However, it encodes more than enough tags for current and foreseen purposes.

Tags are thus defined in the header file with tuples. For example, the tag (3,2,1) means 3 spaces, a tab, 2 spaces, a tab, one space.
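The tuple notation maps directly onto whitespace: the elements are runs of spaces, and consecutive elements are separated by a single tab. A minimal Python sketch of both directions follows; the helper names are illustrative only.

```python
def tag_to_whitespace(tag):
    """Render a tag tuple as the invisible characters it stands for."""
    return "\t".join(" " * spaces for spaces in tag)


def whitespace_to_tag(trailing):
    """Turn a run of trailing spaces and tabs back into a tag tuple."""
    if trailing.strip(" \t"):
        raise ValueError("a tag may contain only spaces and tabs")
    if trailing == "":
        return ()                      # no trailing whitespace at all
    return tuple(len(run) for run in trailing.split("\t"))
```

For example, tag_to_whitespace((3, 2, 1)) yields three spaces, a tab, two spaces, a tab and one space, and whitespace_to_tag reverses it.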

A very few tags (mostly line-tags) have arguments, and in this case the nybble encoding is important, since any tag with the proper low order nybbles is considered an instance of the tag, and the high order nybbles provide the arguments.

Start/End line-tags

All Proletext blocks must start with a HEADER line-tag. This is a special tag that has an argument. All lines beginning with (2,2,0) (space-space-tab-space-space-tab-[optional tab and more tuples]) are header tags. They indicate the start of a Proletext block. While normally an entire document will be in Proletext and as such this line will be the first line, it is possible to have a document slip in and out of Proletext, and more notably, in and out of different versions of Proletext.

It is possible that the header line-tag may appear out of band, in message headers. In this case, the software would start processing a document assuming it was in Proletext from the start, with the appropriate version information. How to do this is beyond the scope of this spec. It was decided to include the start line-tag so that Proletext documents could exist outside of any type headers, including Mime type headers. (Sadly, while the Mime spec states that unknown Mime "text" types should be treated as plain text, some popular software does not do this.) This way Proletext documents can be kept in ordinary files, mailed or posted to news easily. The worst case is that they simply get viewed as ordinary text.

The header line-tag comes with 3 additional elements in the tuple. A full header tag is (2,2,0,docmajor,docminor,minversion).

docmajor is the major component of the Proletext version number for this document. This means currently only 15 major version numbers are possible. docminor is a minor component of this version number.

The value minversion is the level of the earliest processor that can handle this document. It is not expected that this will commonly be more than 0. For it to change, there must be such a major revision of the language so that older processors will simply be unable to handle the new documents at all. Remember that normally old processors can handle new features by just displaying them as plain text. However, should a new functionality, such as forms, be added, in a way that doesn't match the extensible operators feature, it is possible that documents in this format would declare to old processors that they should not attempt to format the document. This could also be used if the fundamentals of the language are changed.
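To make the header layout concrete, the sketch below recognises a HEADER line and pulls out its three arguments. It is written in Python under the assumption that the arguments are read as plain space counts in tuple notation; the function name parse_header is hypothetical, and reporting absent arguments as None is a choice made here, not something the specification states.

```python
HEADER_PREFIX = (2, 2, 0)


def parse_header(line):
    """Return (docmajor, docminor, minversion) for a HEADER line, else None."""
    line = line.rstrip("\r\n")
    if line.strip(" \t"):
        return None                    # line-tags live on apparently blank lines
    runs = tuple(len(run) for run in line.split("\t"))
    if runs[:3] != HEADER_PREFIX:
        return None
    args = runs[3:6]
    # Assumption: missing arguments are reported as None rather than guessed.
    return args + (None,) * (3 - len(args))
```

A processor could compare the returned minversion against its own level before deciding whether to format the block at all.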

The TRAILER line-tag indicates the end of Proletext. If this is the last line in a document, it is not needed. Lines after this should not be formatted until another HEADER appears. Note that this could be used to have different parts of a document be formatted in different versions of Proletext.

Text Blocks & Continuation

As in most text formatting languages, text is processed in blocks or paragraphs, a series of lines that will be joined together. In Proletext 1.0, the first line in any block contains the tag for the block. Any lines to be joined to that first line (continuation lines) get a tag of CONTINUATION (1). This allows processing software to know from the first line what it is going to do with a block or paragraph.
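The block rule can be sketched in a few lines of Python. This is only an illustration: the names split_line and group_blocks are invented here, blank lines and line-tags are simply treated as block separators, and each untagged (MONO) line becomes its own block, which a real processor would probably merge.

```python
CONTINUATION = (1,)   # a single trailing space


def split_line(line):
    """Separate the visible text of a line from its trailing tag tuple."""
    body = line.rstrip(" \t")
    trailing = line[len(body):]
    tag = tuple(len(run) for run in trailing.split("\t")) if trailing else ()
    return body, tag


def group_blocks(lines):
    """Group lines into (tag, [text lines]) blocks using CONTINUATION tags."""
    blocks = []
    current = None
    for line in lines:
        body, tag = split_line(line.rstrip("\r\n"))
        if body == "":
            current = None             # blank line or line-tag ends the block
            continue
        if tag == CONTINUATION and current is not None:
            current[1].append(body)    # join to the block opened earlier
        else:
            current = (tag, [body])    # the first line carries the block's tag
            blocks.append(current)
    return blocks
```

Because the tag sits on the first line, the grouping never has to look ahead to decide how a block will be handled.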

A very small number of blocks treat the different lines of a multi-line block differently.

Truly blank lines indicate a paragraph break.

General Tags

PARA (2)

This block is an ordinary paragraph. HTML: <P> [Block]

BREAK (0,0)

This is ordinary text (formatted as desired) with a line break after it. HTML: [Block] <BR>

MONO (No tag)

This text is to be output verbatim, in a fixed width font. It may be tabular material. In HTML, blocks of this text should be bounded with <PRE> ... </PRE> or <XMP> ... </XMP> tags.

CENTHEAD (1,0)

The block is to be centered, and is a header of some sort.

H1 (2,0)

The block is a first level header. HTML: <H1> [Block]

H2 ... H5 (3,0) ... (6,0)

The blocks are lower level headers, levels 2 through 5, as the HTML header tags.

TITLE (7,0)

The block is a document title. Same semantics as HTML <TITLE> [Block] </TITLE>

H1TITLE (8,0)

The block is both top level header and document title. However, only the first line of a multi-line block is made the document title. The entire block is a top level header.

LI (3)

The block is an element in a list, or the definition part of a term/definition pair in a definition block. HTML is either <LI> [Block] or <DD> [Block] depending on context. If the first few characters of the block, after initial whitespace, are "* ", "o ", "nnn)" or "nnn." where nnn is a decimal number or a one or two character alphanumeric string, these characters should be removed, or optionally interpreted, to be replaced by the list enumeration method being used.
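One way to recognise these leading markers is a regular expression. The Python sketch below is an illustration under stated assumptions: the pattern and the name strip_list_marker are not from the reference implementation, and whitespace following the marker is also removed, which the text above does not spell out.

```python
import re

# "* ", "o ", or a number / one-or-two character label followed by ")" or "."
LIST_MARKER = re.compile(r"\s*(?:[*o] |(?:\d+|[A-Za-z0-9]{1,2})[.)])\s*")


def strip_list_marker(block):
    """Remove a leading bullet or enumeration marker from a list item, if any."""
    match = LIST_MARKER.match(block)
    return block[match.end():] if match else block
```

Note that a literal reading of the rule also strips short words that end in a period, such as "No." at the start of an item, so a cautious processor may want a stricter pattern.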

POINT (8)

The block is the term portion of a term/definition pair in a definition block. HTML: <DT> [Block]

RAW (4)

This block contains raw HTML. Non-HTML display programs should do their best to display this block, but the behaviour of this block in non-HTML programs is undefined.

COMMENT (5)

This block should not be included in the formatted output.

LINK (6)

This block contains a hypertext link to a URL. The URL should be repeated as the selection text for the user. If the URL begins, not with a protocol, but with the string "www.", it should be prefaced with the protocol string "http://" as it is presumed to be a web server.
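A hedged Python sketch of the LINK mapping follows. The function names are illustrative, the attribute quoting is a choice made here, and showing the URL exactly as written in the block (rather than with the added protocol) is an assumption.

```python
import html


def normalise_url(url):
    """Prefix "http://" when the URL starts with "www." instead of a protocol."""
    url = url.strip()
    return "http://" + url if url.startswith("www.") else url


def link_block_to_html(block):
    """Render a LINK block as an anchor whose text repeats the URL."""
    shown = block.strip()
    href = html.escape(normalise_url(shown), quote=True)
    return '<a href="{}">{}</a>'.format(href, html.escape(shown))
```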

LINK2 (9)

This is a special form of the LINK tag. The block should be multi line. The first line is the URL. Subsequent lines are the text that should be presented to the viewer to indicate the link. If the block is single line, it is treated like a LINK.

IMAGE (7)

The block contains the URL of an image file to be inserted into the document, along with any options for that insertion. In HTML: <IMG src=[Block] >

HR (0,1)

The block -- usually single line -- represents a horizontal line or other separator in the document. Usually in HTML the contents of the block will be discarded and replaced with an <HR> HTML tag. However, processors may analyse the block to decide which type of rule is appropriate at their discretion.

NOTE (1,1)

This block should be treated as a note, to be somehow set off specially from the main text. In HTML: <NOTE> [Block] </NOTE>

Undefined Tags

Should an unknown tag be detected, the lower 3 bits of its integer encoding in the reference implementation should be examined. In general, this is the number of spaces at the start of the tag, from 0 to 7. Based on this number:

4: Treat the block as a verbatim, fixed-width block, but append some notification to the lines that they are probably formatted incorrectly. A hypertext link such as <a href=http://proletext.clari.net/prole/lineform.html>[Bad Format]</a> is recommended.
5: Treat the block as a BREAK block.
6: Treat the block as a PARA block.
7: Treat the block as a COMMENT block.
Others (0 .. 3): Treat the block as a MONO block.
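The fallback rule above amounts to a small lookup on the low three bits. A minimal Python sketch, in which the function name and the returned labels (including "MONO+WARNING" for case 4) are purely illustrative:

```python
FALLBACKS = {
    4: "MONO+WARNING",   # verbatim, with a "[Bad Format]" notice appended
    5: "BREAK",
    6: "PARA",
    7: "COMMENT",
}


def fallback_for_unknown_tag(tag_value):
    """Pick a known treatment for an unknown tag from its lower 3 bits."""
    low_bits = tag_value & 0b111
    return FALLBACKS.get(low_bits, "MONO")   # 0 .. 3 fall back to MONO
```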

Line-tags

The following line-tags affect more global behaviour. Unlike HTML, some context is kept. Some of these tags start text regions with globally different behaviour. Instead of having start and end tags for each type of text region, many of these tags cause the old formatting attributes to be pushed on a stack, so that they can be restored by an END line-tag. This is not like HTML, but like many other formatting languages. It can easily be mapped to HTML.

BLANK (0)

A truly blank line-tag simply implies a paragraph break. However, if there was a paragraph break in the previous text block, a paragraph break should not be generated. Multiple BLANK tags should cause multiple paragraph breaks after this one elision, however.

END (1)

End a text region, restore to the state before the region began.

END2 (2)

End two text regions at once.

END3 (3)

End three text regions at once.

END4 (4)

End four text regions at once.
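The push-and-restore behaviour of regions and the END family can be modelled with an ordinary stack. The Python sketch below is illustrative only: the class name RegionStack is invented, regions are represented by their tag names, and silently ignoring an END with no open region is an assumption, since this case is not addressed here.

```python
class RegionStack:
    """Track open text regions so END line-tags can restore earlier state."""

    def __init__(self):
        self._stack = []

    def begin(self, region):
        """Enter a new text region such as "UL" or "QUOTE"."""
        self._stack.append(region)

    def end(self, count=1):
        """Close up to `count` regions (END, END2, END3, END4)."""
        closed = []
        for _ in range(count):
            if not self._stack:
                break                  # assumption: surplus ENDs are ignored
            closed.append(self._stack.pop())
        return closed

    @property
    def current(self):
        """The innermost open region, or None outside any region."""
        return self._stack[-1] if self._stack else None
```

When mapping to HTML, the list returned by end() gives the order in which closing tags would need to be emitted.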

TABLE (1,2)

This is not currently defined. However, for the future, it is planned that the text will be a table, and that fancy processors will parse the table in its ASCII, monospace form, and render it nicely. Current processors should just render the table in fixed-width.

UL (3,1)

The text region is an unordered list, with each element marked with a LI tag.

OL (3,2)

The text region is an ordered list, with each element marked with a LI tag.

DIR (3,3)

The text region is a directory of items, with each element marked with a LI tag. Items should be short, and can be put in columns.

RAW (3,4)

The text region is raw, fixed-width text. HTML: <XMP> [Region] </XMP>

PRE (1,1)

The text region is pre-formatted, fixed-width text. HTML: <PRE> [Region] </PRE>

QUOTE (3,5)

The text region is a block quote, usually intended to be indented or specially represented.

CENTER (3,6)

All text in the text region should be centered, if this is appropriate.

DEFL (3,7)

The text region is a definition list, consisting of "terms" (blocks with a POINT tag) and "definitions" (blocks with an LI tag).

MAGLINKS (4,1)

At this point in the document, the processor is encouraged, but not required, to put in a hypertext link to provide help on the concept of invisibly formatted documents. The URL http://proletext.clari.net/prole/help.html is available for this purpose. A link is not required. Help may also exist on menus or in some other location. This line-tag simply indicates the document authors felt help might be particularly useful on this document.

PLAINLINK (5,1)

At this point in the document, the processor is strongly encouraged to put in a hypertext link or other facility to allow the user to view the document as plain text, without any formatting processing. Document authors will use this line-tag when they are converting large numbers of documents, and fear that there may be formatting errors -- particularly the rendering of tabular data as ordinary text to be line-wrapped. As this renders the tabular data unreadable, a link or menu item to allow the viewing of a document in plain text will allow the table to be viewed. Viewers for Proletext are actually encouraged to always have a facility to turn Proletext viewing off, in the rare event that an unformatted document accidentally contains a HEADER tag.

ANCHOR (4,2)

The processor is to place a hypertext anchor at this point, with the name "a#" where "#" is the index number of the anchor, i.e. the first anchor is "a0", the second is "a1" and so on.

TRAILER (2,3,0)

This marks the end of Proletext processing. It is not needed at the end of a document. Subsequent lines should be treated as plain text, until a Proletext HEADER line-tag.

EMPTY (2,5,0)

All tags starting with this 3-tuple are EMPTY line tags. The 3-part 2-sp-tab-5sp-tab-0sp is removed, and the remaining tag is used as an ordinary (text line) tag, applied to a blank line. This allows the creation of list elements, titles, etc. that are blank. If the plain tag does not make sense with a blank line, the actions of this tag are undefined.

Line-Tag definition

Documents that use new tags but which wish them to be handled by earlier processors and viewers may use a simple macro facility to give the earlier processors some idea about how to handle the tag.

Any line-tag whose integer representation in the reference implementation has bit 0x80000000 (hexadecimal) set is a definition for a new line-tag. If the processor already knows how to handle such a line-tag, it should ignore the definition.

The definition comes on the next line, which should otherwise be ignored. The line-tag of the next line becomes the definition for the new line-tag. If the new line-tag is encountered, the old processor should implement it with the old tag it has been mapped to.

This allows documents to define a new line-tag and say, "old processors, please just treat this as PRE" or any other existing tag. Definitions should nest, so that in theory a document could describe a chain of mappings, and the processor should use the highest line-tag in the chain that it knows.

Mapping chains may not be more than 10 levels deep.
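The chain lookup can be sketched as follows in Python. The names resolve_line_tag, definitions and known are hypothetical; tags are represented by arbitrary hashable keys because how a definition line identifies the new tag is left to the reference implementation. Returning None for an unresolved or over-long chain is likewise an assumption.

```python
DEFINITION_BIT = 0x80000000
MAX_CHAIN_DEPTH = 10


def is_definition(tag_value):
    """True if the integer encoding marks a line-tag definition."""
    return bool(tag_value & DEFINITION_BIT)


def resolve_line_tag(tag, definitions, known):
    """Follow the mapping chain until a line-tag this processor knows is found."""
    for _ in range(MAX_CHAIN_DEPTH + 1):
        if tag in known:
            return tag                 # the highest tag in the chain we can handle
        if tag not in definitions:
            return None                # chain ends without a known tag
        tag = definitions[tag]
    return None                        # chain deeper than 10 levels, or a loop
```

The depth limit doubles as protection against a document whose definitions refer to each other in a cycle.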

Of course, this can result in lots of "meaningless" blank lines at the start of a document in the extreme case. It is hoped that this will not become too common.

Undefined line-tags

If a processor detects an undefined line-tag, it should examine once again the lower 3 bits of its reference implementation encoding, which is to say, the number of spaces starting the line-tag, from 0 to 7. Depending on this number:

0 .. 4 (default): Treat as a BLANK line-tag, a paragraph break.
5: Treat as an END line-tag, ending a text region, popping the stack.
6: Treat as BLANK but insert a visible warning into the text that the upcoming lines probably are formatted badly. A hypertext link such as <a href=http://proletext.clari.net/prole/blockform.html>[Bad Format]</a> is recommended. Consider also inserting a link or other mechanism to allow the document to be viewed unformatted.
7: Treat as a RAW text region, to be presented unformatted. Push this state on the stack, so the next END clears it.

Inline substitutions

Text in general lines is considered plain, so if mapping to HTML, characters such as "<" and "&" should be properly treated for viewing.

However, certain special mappings are defined to insert codings within lines. These of course can't be entirely invisible.

Bold-STRONG

Bold text should be bounded with " *" to turn on bold and "* " to turn it off. The space in the latter case may be the end of line, and in particular must not be present at the end of a line to avoid interfering with tags.

The character on the other side of the star must not be a space or a star, so that " * " has no special meaning, and " **" and "** " have no special meaning.

It is recommended that the HTML <STRONG> tag be used instead of <B>.

Bold text does not extend past the end of a plain text line. It is automatically turned off at the end of the line and must be turned on again at the start of the next line.
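The bold rules can be expressed as a small per-line scanner. The Python sketch below is one possible reading, not the reference implementation: the name bold_to_html is invented, an opening star is recognised only after a literal space (so a star in the very first column is left alone), and the surrounding text is HTML-escaped along the way.

```python
import html
import re

# " *x": star after a space, followed by something other than space or star.
OPEN = re.compile(r"(?<= )\*(?=[^\s*])")
# "x* ": star after a non-space, non-star, followed by a space or end of line.
CLOSE = re.compile(r"(?<=[^\s*])\*(?= |$)")


def bold_to_html(line):
    """Convert " *bold* " markers on one line to <STRONG> ... </STRONG>."""
    out = []
    pos = 0
    bold = False
    while True:
        match = (CLOSE if bold else OPEN).search(line, pos)
        if match is None:
            break
        out.append(html.escape(line[pos:match.start()]))
        out.append("</STRONG>" if bold else "<STRONG>")
        bold = not bold
        pos = match.end()
    out.append(html.escape(line[pos:]))
    if bold:
        out.append("</STRONG>")        # bold never extends past the end of a line
    return "".join(out)
```

The same scanner would serve for italics by substituting "_" for "*" in the two patterns and choosing the corresponding HTML tag.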

Italics

The semantics for italics/underlining are the same as for bold-strong, but the character "_" is used instead of "*".

Special escapes

#*: Outputs a "*" for those who just must encode " *" without turning on bold.
#_: Outputs a "_" for those who just must encode " _" without turning on italics.
#&: Outputs a "#", the odd encoding so that runs of ####### can be used at their proper size.
#<: Outputs "<a href=" -- and if the following text starts with "www." also includes the string "http://" after the "href=" portion. In non HTML processors, this is the sign that a URL is beginning.
#> or #}: Outputs a raw > into the HTML stream. Non-HTML processors should mark this as the end of a URL or Image inclusion, whichever is currently pending.
#:: Outputs a raw </A> into the HTML stream. Non-HTML processors should consider this the end of the text contents of a link. Thus a block will include text like #< www.clari.net #> ClariNet home page #: and the processor is expected to duplicate the HTML semantics of: <A href=http://www.clari.net> ClariNet home page </A>
#{: Outputs <img src= into the raw HTML text stream, indicating an image URL to be inserted into the text, along with options, to be terminated by a # escape sequence.
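The escapes listed above lend themselves to a table-driven substitution. The following Python sketch is an assumption-laden illustration: expand_escapes is a hypothetical name, whitespace between "#<" and the URL and before "#>" or "#}" is dropped so that the ClariNet example yields a well-formed tag, and escaping of the surrounding plain text is left out.

```python
import re

SIMPLE_ESCAPES = {
    "#*": "*",
    "#_": "_",
    "#&": "#",
    "#:": "</A>",
    "#{": "<img src=",
}
ESCAPE = re.compile(r"#<\s*|\s*#[>}]|#[*_&:{]")


def expand_escapes(text):
    """Expand the "#" escape sequences of one block into raw HTML."""
    def replace(match):
        token = match.group(0)
        if token.startswith("#<"):
            # Add the protocol when the URL that follows starts with "www."
            prefix = "http://" if text.startswith("www.", match.end()) else ""
            return "<a href=" + prefix
        if token.strip() in ("#>", "#}"):
            return ">"
        return SIMPLE_ESCAPES[token]

    return ESCAPE.sub(replace, text)
```

Applied to the example block above, this produces <a href=http://www.clari.net> ClariNet home page </A>, which differs from the quoted HTML only in letter case.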

URL escapes

A clear URL in the text not being handled in another way should be mapped to <a href="URL">URL</a>, the way that most web browsers handle such URLs in plain text articles they display. A URL must begin with a protocol; internal document URLs will not be detected this way. The character before the URL affects how it should be parsed. If it is a single or double quote, the URL should be parsed to the closing quote or end of line (no URL may take more than one line). If the character before the URL is an open bracket/paren/anglebracket/brace, then the URL terminates on the appropriate closing character or whitespace. Whitespace terminates any URL not enclosed in quotes.
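The termination rules can be sketched as below in Python. This is a rough illustration: the names url_end and extract_urls are invented, and "begins with a protocol" is approximated here as a scheme name followed by "://", which is an assumption about what counts as a protocol.

```python
import re

SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://")
CLOSERS = {"(": ")", "[": "]", "<": ">", "{": "}"}


def url_end(line, start):
    """Return the index just past a URL that begins at `start`."""
    before = line[start - 1] if start > 0 else ""
    if before and before in "\"'":
        end = line.find(before, start)           # up to the closing quote
        return len(line) if end == -1 else end   # or to the end of the line
    stop = " \t" + CLOSERS.get(before, "")       # whitespace, plus any closer
    for index in range(start, len(line)):
        if line[index] in stop:
            return index
    return len(line)


def extract_urls(line):
    """Find the clear URLs on one line, honouring quotes and brackets."""
    urls = []
    pos = 0
    while (match := SCHEME.search(line, pos)):
        end = url_end(line, match.start())
        urls.append(line[match.start():end])
        pos = max(end, match.end())
    return urls
```

Each URL found this way would then be wrapped in an anchor whose text repeats the URL.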

Better inline syntax

It is admitted that a better syntax for inline processing is desired. At present, it was not desired to require processors to have a sophisticated parser that could detect complex but pretty looking multi-line inline syntaxes for links, images and other special attributes and tags. Future versions of the system may support this, as well as table processing.

You can also look at a more detailed Proletext example, with 4 views of a Proletext formatted document.