Out of band encoding for HTML (and other SGMLs)

In order to merge the worlds of plain text tools such as mailers, newsreaders and ordinary text editors with the tagged, structured text world of HTML, SGMLs and other formatted text languages, a bridge format is required.

To do this efficiently, we must present the plain text as plain text for the older text-only tools to view without burden. Structured text tools, however, would like to be able to use any structure and formatting information available about the plain text to display it and manipulate it in a more attractive and more useful manner.

To do this the structural tags may be present, but must be invisible or unobtrusive to plain text tools.

One way to do this is ProleText, my method for encoding HTML-subset formatting information into hidden trailing spaces and tabs on the end of lines of ordinary text, plus limited character formatting information inside the text according to established conventions.

ProleText has the advantage of being totally invisible, and simpler than full HTML. It also increases the size of documents by only 2% or so on average. ProleText documents can also be copied in ordinary text editors without needing to include any header information, and they can even be hand edited to a limited degree. However, the full power of HTML is not supported.

Out of Band SGML

An alternative is Out of Band SGML. In such a system, a plain text document is delivered, along with instructions on how to modify the document into HTML or any other SGML. The instructions are sent "out of band", which is to say, not usually inside the plain text.

This is akin to how colour television was developed. Existing sets were black and white, so a way was devised to transmit the original B&W picture in one band, and the colour information in another band. The old sets could still show the B&W, but colour sets displayed in living colour.

These out of band markup differences can be sent in one of the following possible ways:

The headers of an RFC822/RFC1036 style E-mail or USENET message.
A non-text component of a MIME multipart/related or other multipart type message.
If no other method is available, they can be appended to the end of a plain text message. This method is considered ugly, with its only advantage that you now have a unit of text that can be copied.

The "instructions" are in effect commands on how to add markup to the text document and otherwise modify it to turn it into a document in the SGML. They involve opcodes to move a "cursor" around the plain ("source") document, to copy sections of it into the "object" document, insert markup into the object document, and also to skip over items.

Expressed in English, the instructions might read like "Go down 3 lines, add paragraph tag at end of line. Go down 1 line, over 20 characters, insert 'strong' tag." And so on.

Advantages

There are a number of interesting advantages to this method, on top of its principal advantage -- that it displays fine in plain text tools. For one, it is actually quite compact. Because it is not intended to be human generated or read, it can encode tags more compactly, and make use of a stack to efficiently express the opening and closing of paired tags. Simple documents -- the vast bulk of E-mails and USENET postings -- can be encoded at a cost of around 3%.

It is actually also a more powerful language than the underlying SGML, because it provides an automatic extension mechanism for new tags. If software encounters tags it does not understand (say, for example, older software that might not understand -- or want to understand -- the HTML "TABLE" tag) then it is always provided with a moderately valid monospace text representation of what the material inside the tags should look like. Thus even a simple parser can translate what it knows, but leave the rest as plain text.

And of course, it's always valid, unless the document is so rich as to be unrepresentable as plain text, to simply present any document in plain form if the markup can't be understood. With plain HTML, tools have no way to easily present a document with complex markup they don't understand.

Most of all it provides a single format that can be sent to any user on the net with the assurance they can read it. This is highly important for E-mail and electronic discussions such as USENET.

Process

A program translating OOB SGML (in this draft form) to an SGML uses the following procedure.

First it checks the CRCs, if present, to make sure that the OOB encoding and the plain text have both arrived intact. If a CRC is wrong, the program should elect to translate the document to HTML as a plain text document, through the use of PRE or XMP style tags. If other hashes are present from another system (such as Content-MD5) this test may not be needed.

As an option, an OOB SGML encoding may declare itself to be somewhat robust through the use of the "R" opcode (see below.) Optionally, a decoder may elect to attempt the translation, and compare a CRC of the resulting output SGML with one calculated by the source, and then elect to output that block as SGML or not based on a match of the CRC. Otherwise the block is to be output as plain text.

The instructions in effect move a "cursor" through the plain text document, and opcodes specify the insertion of additional tagging. While the cursor moves over the document, the text from the plain text document is usually inserted literally into the SGML (object) document. All characters are to appear literally, so that for example a "<" in the plain text must be rendered as "&lt;" in HTML. (Unless the "&" toggle has been turned on, in which case there may be literal markup in the plain text document that should be copied verbatim.)
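As a rough illustration of the copying rule above, here is a minimal Python sketch. The function name copy_text and the raw_markup flag (standing in for the state of the "&" toggle) are made up for this example and are not part of the draft.

```python
# Sketch only: how a decoder might copy plain text into the object document.
# copy_text and raw_markup are hypothetical names, not defined by the draft.
ESCAPES = {"<": "&lt;", ">": "&gt;", "&": "&amp;"}


def copy_text(chunk: str, raw_markup: bool = False) -> str:
    """Return chunk as it should appear in the SGML (object) document."""
    if raw_markup:
        # The "&" toggle is on: the plain text already holds literal markup.
        return chunk
    return "".join(ESCAPES.get(ch, ch) for ch in chunk)


print(copy_text("if a < b & c"))
print(copy_text("<EM>hi</EM>", raw_markup=True))
```

With the toggle off, markup characters are escaped; with it on, the text passes through untouched.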

The cursor may also move in "ignore" mode, where the characters from the plain text are not copied into the output document, they are simply discarded.

Various opcodes control the insertion of both SGML tags and plain text into the output document. A few special opcodes combine moving the cursor and tagging. For example, the "p" opcode moves the cursor to the end of the current line and inserts a paragraph tag, as this is a very common thing to do to add formatting to a plain text document.

(There is some debate over whether the code should be able to insert a <P> tag, as this is very common but incorrect HTML style. Comment is welcome. To keep the language compact, there is some merit in having an opcode that says, "There is a paragraph break here, end the current paragraph if any and start a new one" which is what <P> does in HTML and why it is commonly used.)

Other opcodes do common character tagging by suggesting a small string, all on one line, should be tagged in (for example) <B> and </B>, and do this compactly.

Other opcodes allow general tagging. They allow the insertion of SGML text directly into the output, with certain escapes to avoid the use of whitespace in the OOB SGML. Whitespace in OOB SGML tagging is generally to be ignored.

One of these opcodes ("<") asks the translator to remember the text that was provided with it in a numbered register, creating a new one if the text has not been seen before. This effectively defines macros of common tags that can be used again. In addition, the opcode ">" causes the insertion of the tag (but not attributes) from a numbered register, surrounded in SGML close tag syntax, ie. </TAG>. This allows you to quickly and compactly encode the very common SGML style of surrounding a block of text with <TAG attributes> and </TAG>.

Messages include a header of the form:

Content-type: text/oobhtml; oobver=VER; oobcrc=CRC[:CRC2]; markup=opcodes

It is preferred that all formatting be in one header, but for systems with a limit on the length of a field, it may be necessary to break the formatting up into more than one field. Most USENET and E-mail systems can support headers of over 1024 characters made up of lines of 80 columns delimited with folding whitespace. To do this, there are two possible options: define a new header (content-markup) that can contain markup, and/or define a multipart/related which has the plain text in one part and unlimited markup encoding in the other part.

VER is a version number, base64 encoded, alternately VER/VER, in which case the first number is the version number of the OOB encoding style, and the second number is a version number for decoders. If a decoder's version number is less than the decoder-version, it should abort and not handle the OOB SGML. If it is equal to or above, it should handle the OOB SGML but ignore unknown opcodes.
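The version rule can be sketched in a few lines of Python. This is only an illustration under assumptions the draft does not settle: that version numbers use the same low-order-first 6-bit digits as lengths, that a lone VER carries no decoder minimum, and that version digits never use "/" (which is both the separator and a character of the MIME set). The names b64_number and should_decode are invented here.

```python
# Sketch only: deciding whether a decoder should handle an oobver value.
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64_number(digits: str) -> int:
    """Decode 6-bit digits, low-order digit first (an assumption)."""
    value = 0
    for place, ch in enumerate(digits):
        value += B64.index(ch) << (6 * place)
    return value


def should_decode(oobver: str, my_version: int) -> bool:
    """True if a decoder of version my_version should handle the encoding."""
    _style, _sep, minimum = oobver.partition("/")
    if not minimum:
        # Assumption: a lone VER sets no minimum decoder version.
        return True
    return my_version >= b64_number(minimum)


print(should_decode("C/B", 1))  # decoder-version B=1, decoder is version 1
print(should_decode("C/C", 1))  # decoder-version C=2, decoder is too old
```

A decoder that passes this check would still be expected to skip opcodes it does not know.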

CRC is an n-character (6 bits per character, maximum 32 bits) CRC of the formatting data, to assure it has not been corrupted. This is optional but recommended. If provided it is delimited with a colon after the version tags. A 3 character (18 bit) CRC should usually suffice.

CRC2 is an optional n-character (6 bits per character, max 32 bits) CRC of the plain text, to detect damage to the plain text, so that the addition of formatting can be aborted. Leading and trailing whitespace to the document are not included in the CRC2, nor is trailing whitespace on the end of lines, but newlines and leading whitespace at the start of lines are.
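The whitespace rule for CRC2 can be sketched as follows. The draft does not name a CRC polynomial, so this example borrows zlib's CRC-32 truncated to the low bits purely as a stand-in; the low-order-first character order and the function names are likewise assumptions. Document-leading whitespace is taken here to include any indentation on the first line.

```python
# Sketch only: normalizing plain text before computing CRC2.
# zlib.crc32 is a placeholder; the draft does not specify the CRC used.
import zlib

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def normalize_for_crc2(text: str) -> str:
    """Drop trailing whitespace on each line and around the whole document."""
    lines = [line.rstrip(" \t\r") for line in text.split("\n")]
    return "\n".join(lines).strip()


def encode_crc(data: str, nchars: int = 3) -> str:
    """Encode the low 6*nchars bits of a CRC as MIME-set characters."""
    crc = zlib.crc32(data.encode("utf-8")) & ((1 << (6 * nchars)) - 1)
    return "".join(B64[(crc >> (6 * i)) & 63] for i in range(nchars))


original = "  hello\n  world\n"
mangled = "  hello   \n  world\t\n\n\n"  # a gateway added trailing whitespace
print(encode_crc(normalize_for_crc2(original)))
print(encode_crc(normalize_for_crc2(mangled)))  # same value as the line above
```

Because only trailing whitespace differs, both texts give the same check value, which is the point of excluding it.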

There may be other CRC or hash validations of the plain text or the OOB SGML encoding, such as the Content-MD5 header from MIME. If present, these headers should also be checked, and in fact the presence of CRC2 is discouraged in such events.

It's also possible that formatting could be stored at the end of an article after a special delimiter:

	%*%Begin SGML Enhancement Codes%*%	(trailing tab)

The use of tags at the end allows documents to be clipped or piped or treated independently of their headers, but of course the tags become a visible string of unknown codes at the end of the article.
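A reader that supports the trailing form has to separate the article from the codes. The Python sketch below shows one way, assuming the delimiter sits on a line of its own (surrounding tabs ignored) and that the last such line wins; split_article is a name invented for this example.

```python
# Sketch only: splitting an article from codes appended after the delimiter.
DELIMITER = "%*%Begin SGML Enhancement Codes%*%"


def split_article(article: str):
    """Return (plain_text, codes); codes is None if no delimiter is found."""
    lines = article.splitlines(keepends=True)
    # Scan from the end so a quoted delimiter earlier in the text is skipped.
    for index in range(len(lines) - 1, -1, -1):
        if lines[index].strip() == DELIMITER:
            text = "".join(lines[:index])
            codes = "".join(lines[index + 1:]).strip()
            return text, codes
    return article, None


sample = "Hello\nworld\n\t" + DELIMITER + "\t\nPSPp\n"
print(split_article(sample))
```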

The Stream

The encoding stream consists of a series of fields containing encoding of SGML/HTML formatting to insert into the document.

Entries consist of an encoded tag specifying the amount of text to "travel" (move over and output) since the last formatting insert, a type of formatting to insert and the contents of the formatting.

Numbers are encoded using the 64-character MIME (base64) ASCII alphabet, 6 bits per character. The number of 6-bit units is implicitly encoded in the opcode.

		    1         2         3         4         5         6
	  0123456789012345678901234567890123456789012345678901234567890123
Mime Set: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

Several opcodes take an argument which is a length. Normally such lengths are one byte, and the value "N" is taken by reading one character, and mapping it to a number from 0 to 63 using the above MIME encoding. If the length character is an exclamation point (!) then bytes should be read up to a second "!" and the resulting bytes represent a larger number in the MIME encoding -- 6 bits per character, low order bits in the first characters (little-endian). For example, "N" means a value of 13, while "!HB!" means a value of 71. (B=1*64 + H=7)
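The length rule is easy to get wrong by one, so here is a small Python sketch of it. The function name read_length and its (value, next position) return shape are choices made for this example only.

```python
# Sketch only: reading a length operand from the opcode stream.
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def read_length(stream: str, pos: int = 0):
    """Return (value, position just past the operand)."""
    if stream[pos] != "!":
        # Short form: a single character worth 0..63.
        return B64.index(stream[pos]), pos + 1
    # Long form: digits between two "!" marks, low-order digit first.
    end = stream.index("!", pos + 1)
    value = 0
    for place, ch in enumerate(stream[pos + 1:end]):
        value += B64.index(ch) << (6 * place)
    return value, end + 1


print(read_length("M"))     # (12, 1)
print(read_length("!HB!"))  # (71, 4): H=7 plus B=1 times 64
```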

The Opcodes

Below, when an opcode is shown to take an operand of the form <N>, this is a length, normally a single byte as described above. If zero is not a meaningful operand, 1 is often implicitly to be added to the value as indicated below.

To "Travel" is to move the cursor, copying text into the output as it moves. Travel down "lines" includes travelling to the end of the current line first.

S<N>: Travel <N>+1 bytes
s<N>: Travel <N>+65 bytes
E: Travel to end of line (EOL is before the newline)
2ef: Travel to end of article, after last newline
V: Travel to next printable (non-white) character. Does not move if cursor is on a printable character. Useful as a first opcode.
v: Travel to first column of next line that has a non white character on it. Useful as a first opcode when whitespace matters.
L<N>: Travel <N>+1 lines, to start of line. What remains of current line is first line skipped.
2mk: An indication that at some point later in the stream, the pointer will return here via the 'u' or similar backtracking opcode. A warning to the scanner that it needs to be able to seek back at least this far, or buffer from here on. Required if 'u' is to be used.
2eb: An indication that there is no longer a need to seek back prior to this point, so any buffers created due to 2mk can now be released.
u<N>: Move up <N>+1 lines, to start of line, do not output any text while moving. (For re-use of ASCII and defining shaped segments.) Requires use of 2mk.
l<N><N2>: Travel <N>+1 lines and <N2> bytes
I<N>: Ignore <N>+1 bytes from the plain text
g<N>: Ignore to EOL and go to column <N> next line
i<N>: Ignore to EOL plus <N>+1 lines of the plain text
#: Turn on ignoring until next "#" code. (ie. use "#", then travel codes, then '#')
N: Travel to start of next line
n<N>: Travel to next line, first printable, plus <N> bytes
whitespace: ignored
R<N><N><N>: An 18 bit CRC of the resulting output SGML since the last "R" opcode. Used for robust encodings to CRC the desired output of each paragraph.
&: Toggle mapping of <, > and & (ie. text has SGML) Normally these characters in the text are to be escaped, < to &lt; and so on when generating SGML.
P: Insert HTML "<P>" at this point
p: Travel to EOL, insert "</P>".
B: Insert HTML " " at this point
b: Travel to EOL, insert " ".
O: Insert HTML "<LI>" at this point
o: Travel to beginning of next line, insert "<LI>"
t<N>: Insert "", if N>0 travel <N> bytes, insert ""
d<N>: Insert "", if N>0 travel <N> bytes, insert ""
T<N>: Insert "", if N>0 travel <N> bytes, insert ""
m<N>: Insert "", if N>0 travel <N> bytes, insert ""
U<N>: Insert "", if N>0 travel <N> bytes, insert ""
t<N>: Insert "<TT>", if N>0 travel <N> bytes, insert "</TT>"
q<N>: Insert "<Q>", if N>0 travel <N> bytes, insert "</Q>"
(<N>[keyword]): Insert <keyword>, <N>+1 bytes of text from stream, and </keyword>, ie. some text of length <N>+1 will be bounded in a paired set of <KEYWORD>...</KEYWORD>. The keyword can't contain a ')'.
.: nop
!: Give an error (for use in macros)
/: When followed by another opcode which generates an opening SGML tag, causes the generation of the corresponding closing tag. Any byte count that might normally be included with the following opcode will not be present.
f<N>string: Search for string (length <N>) in text, Travel to start of it. Recommended as early opcode if documents are to move through environments where the plain text might be damaged. Search string encoded same as inserted HTML.
M<N1><N2>: Insert macro argument n (use in C definitions) Bytes starting at <N1> of length <N2> from the operands of the macro are inserted. If <N2> is 0, then all remaining bytes are inserted.
2<letter><letter>: General form of most 2 (actually 3) character opcodes -- for expansion
2er: Recommend that a link be inserted in the HTML to allow viewing of the plain text, or some other button such as "view document source" be able to do this.
2ei: Insist that a link be inserted in the HTML to view the plain text, if such a function is not otherwise available.
2lr: A link to a web page with help about OOB SGML and in particular OOB HTML is recommended, but not required.
2sg: Special "signature" opcode. Travel through document to find first instance of "<newline>-- <newline>". Do not output the "-- ", instead insert the personal signature "<SIG>" tag.
2sc: Signature close. The same as "2ef/2sg" -- travel to end of document, insert "</SIG>" tag.
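To make the cursor model concrete, here is a minimal Python sketch of an interpreter for a handful of the opcodes above (S, s, I, E, N, P, p, the nop, and ignored whitespace). It is not a full decoder. It assumes "P" emits "<P>" and "p" emits "</P>", as the worked example later in this document suggests, and it treats "bytes" as characters. The names run and read_length are invented for this sketch.

```python
# Sketch only: a tiny interpreter for a subset of the opcodes.
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
ESCAPES = {"<": "&lt;", ">": "&gt;", "&": "&amp;"}


def read_length(stream, pos):
    """Read a short or "!"-delimited long length; return (value, new pos)."""
    if stream[pos] != "!":
        return B64.index(stream[pos]), pos + 1
    end = stream.index("!", pos + 1)
    value = sum(B64.index(c) << (6 * i) for i, c in enumerate(stream[pos + 1:end]))
    return value, end + 1


def run(stream: str, text: str) -> str:
    """Apply the opcode stream to the plain text and return the output."""
    out = []
    cur = 0  # cursor position in the plain (source) text
    pos = 0  # read position in the opcode stream

    def travel(count):
        nonlocal cur
        chunk = text[cur:cur + count]
        out.append("".join(ESCAPES.get(c, c) for c in chunk))
        cur += len(chunk)

    def eol():
        end = text.find("\n", cur)
        return len(text) if end < 0 else end

    while pos < len(stream):
        op = stream[pos]
        pos += 1
        if op.isspace() or op == ".":
            continue  # whitespace and nop do nothing
        if op in "SsI":
            n, pos = read_length(stream, pos)
            if op == "S":
                travel(n + 1)
            elif op == "s":
                travel(n + 65)
            else:
                cur = min(len(text), cur + n + 1)  # ignore: skip, no output
        elif op == "E":
            travel(eol() - cur)
        elif op == "N":
            travel(eol() + 1 - cur)
        elif op == "P":
            out.append("<P>")
        elif op == "p":
            travel(eol() - cur)
            out.append("</P>")  # assumed closing tag, see note above
        else:
            raise ValueError(f"opcode {op!r} is not handled in this sketch")
    return "".join(out)


print(run("PSPp", "This word is in bold, how about that?"))
print(run("SBIAE", "a<*b"))
```

The second call travels over "a<" (escaping the "<"), ignores the "*", and travels to the end of the line.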

Raw Insertions

These opcodes allow the insertion of any sort of tag or stretch of raw SGML. They define a series of numbered "registers" to store the main tag names to allow for the compact closing of paired tags.

The encoded text is mostly literal, however certain characters must be mapped with escape sequences.

_	space
\\	\
\_	underbar
\n	Newline
\t	Tab
\000	Insert octal encoded byte
\x	End of stream of SGML
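The escape table can be read as the following Python sketch. It assumes the backslash escape is written as a doubled backslash, that octal escapes are always three digits, and that stray whitespace inside the encoded text is skipped like other whitespace in the stream. decode_insertion is a name made up for this example.

```python
# Sketch only: decoding the encoded SGML that follows a raw insertion opcode.
SIMPLE = {"n": "\n", "t": "\t", "_": "_", "\\": "\\"}


def decode_insertion(stream: str, pos: int = 0):
    """Decode up to the closing \\x; return (text, position after it)."""
    out = []
    while pos < len(stream):
        ch = stream[pos]
        pos += 1
        if ch.isspace():
            continue  # whitespace in the stream is ignored
        if ch == "_":
            out.append(" ")
        elif ch != "\\":
            out.append(ch)
        else:
            esc = stream[pos:pos + 1]
            pos += 1
            if esc == "x":
                return "".join(out), pos
            if esc and esc in "01234567":
                # Three octal digits, the first of which is esc itself.
                out.append(chr(int(stream[pos - 1:pos + 2], 8)))
                pos += 2
            elif esc in SIMPLE:
                out.append(SIMPLE[esc])
            else:
                raise ValueError(f"unknown escape \\{esc}")
    raise ValueError("missing \\x terminator")


print(decode_insertion(r"IMG_SRC=a\_b.gif\x"))
```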

H[encoded SGML]\x

Insert pre-encoded bytes of SGML or text at this point. Do not remember it in a register.

<[encoded SGML]\x

Insert bytes of SGML at this point, embedded in "<" and ">". In addition, remember the text, both the first tag name that was at the start of the SGML and any other arguments, and store them in register <N>, where N starts at zero and increments by one for each new use of this opcode.

Some references to the register will extract only the tag, most notably the ">" opcode which is used to close tags. Some will extract the tag and its arguments, for use in duplicating the same tag again.

[[encoded SGML]\x

Insert bytes of SGML, embedded in < and >, but don't bother remembering in a register.

><N>

Insert "</" and the tag from register <N> into stream, where the tag does not include any additional attributes saved with the tag.

@<N>

Insert "<" then the tag from register <N> into stream, coded as above, but without slash or closing ">" -- only the tag is used.

]<N>

Insert "<" then register <N> into stream, both the tag and any additional text that were included with it when the value was stored, including a closing ">" if that was part of the register.

h<N>

Insert "<", then the tag from register <N>, then ">".
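The register opcodes can be pictured with a small Python class. This is a sketch under assumptions: registers are numbered in order of first use, text already seen is not stored twice, and the "]" form emits a closing ">". The class and method names are invented here.

```python
# Sketch only: the numbered registers behind the "<", ">", "]" and "h" opcodes.
class Registers:
    def __init__(self):
        self.entries = []  # full remembered text, tag name plus attributes

    def define(self, sgml: str) -> str:
        """The "<" opcode: remember the text and emit it as a tag."""
        if sgml not in self.entries:
            self.entries.append(sgml)
        return "<" + sgml + ">"

    def tag(self, n: int) -> str:
        """Just the tag name from register n, without attributes."""
        return self.entries[n].split(None, 1)[0]

    def close(self, n: int) -> str:
        """The ">" opcode: emit the closing tag for register n."""
        return "</" + self.tag(n) + ">"

    def bare(self, n: int) -> str:
        """The "h" opcode: emit the tag alone, without attributes."""
        return "<" + self.tag(n) + ">"

    def repeat(self, n: int) -> str:
        """The "]" opcode: emit the tag again with its saved attributes."""
        return "<" + self.entries[n] + ">"


regs = Registers()
print(regs.define('A HREF="x.html"'))  # register 0
print(regs.close(0))
print(regs.bare(0))
```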

Mapping Opcodes

C<newopcode><N1><oldopcodes><N2>

Define mapping of new opcode to old opcodes (length <N1>), pulling in an additional <N2> bytes after the number interpreted by the old opcodes. Thus to define a new opcode you can map it to "." (nop) and specify how many bytes of operands it takes. All opcodes not in the initial spec should be defined in this way so old systems at least know how many operands they have. Bytes from the <N2> bytes of extra operands can be referred to with the M opcode, inserted into the stream.

This operator is not done if the system has a definition for the new opcode. Such definitions are to be used when you use a new operator but expect older systems to not know what to do with it. This way you can tell the old system how to best handle the new opcode -- treating it like a nop, an error, a signal to ignore material or comment it out or treat it raw or whatever.

Example:

CQAB

Define new opcode "Q", as the following A=0 byte string, ie. it's a nop. However, suck up B=1 bytes after the Q as operands.

CQB!B

Define a new opcode "Q" as the B=1 byte string "!" (signal an error). Suck up one byte of operand.
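The two definitions above can be parsed mechanically. The Python sketch below handles only single-character lengths (an assumption made to keep it short); Mapping and parse_mapping are names invented for the example.

```python
# Sketch only: parsing a C<newopcode><N1><oldopcodes><N2> definition.
from typing import NamedTuple

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


class Mapping(NamedTuple):
    new_opcode: str     # the opcode being defined
    old_opcodes: str    # what an older decoder should run instead
    operand_bytes: int  # extra operand bytes to consume after it


def parse_mapping(stream: str, pos: int = 0):
    """Return (Mapping, position just past the definition)."""
    if stream[pos] not in "Cc":
        raise ValueError("not a mapping opcode")
    new_opcode = stream[pos + 1]
    n1 = B64.index(stream[pos + 2])
    start = pos + 3
    old_opcodes = stream[start:start + n1]  # taken raw, exactly n1 characters
    n2 = B64.index(stream[start + n1])
    return Mapping(new_opcode, old_opcodes, n2), start + n1 + 1


print(parse_mapping("CQAB"))   # Q maps to nothing (a nop), takes 1 operand byte
print(parse_mapping("CQB!B"))  # Q maps to "!" (an error), takes 1 operand byte
```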

c<newopcode><N1><oldopcodes><N2>

Define a mapping of a new opcode to an old, which applies even if the new system has a meaning for the new opcode. Masks out the new meaning of course.

Notes

Doing nested tables and unusual shapes...

Some opcodes make this easier. The 'u' opcode allows the pointer to be moved back in the document to earlier lines. The 'g' opcode causes the rest of the line to be ignored and the pointer to move to a column in the next line.

This allows the OOB stream to scan arbitrary regions of the text, such as a left rectangle, and then back up and travel over a right rectangle. Thus two side-by-side table entries can be laid out side by side in the plain text, but linearly in the HTML.

This also applies to replacements for graphics.

The ignoring opcodes:

Allow you to effectively have alternate SGML that entirely replaces the ASCII text. So if in the ASCII, you bolden a word with *bold*, you can insert opcodes to ignore the stars and insert HTML <B> and </B> tags.

Example

HT-Form: PSPbCp

(meaning: Insert <P>, Travel (P=15)+1 bytes; insert bold tags around (c=3)+1 bytes; Travel to End of Line and Insert Paragraph marker.)

Text:

This word is in bold, how about that?

Maps to:

<P>This word is in <B>bold</B>, how about that?</P>

Which is to say:

This word is in bold, how about that?

Tables

Tables will be complex but doable. The table should be rendered in ASCII, possibly with the use of | and - characters to do borders. Codes should be inserted to ignore the border drawing characters and insert appropriate table tagging HTML.

Generating OOB SGML opcodes

Generally this will be done by a program that takes some original SGML or HTML, and renders it to plain ASCII text. As it renders it to the text, it will know where the SGML tags used to go, what extra characters it generated that don't belong in the SGML etc. and output the right opcodes to make this happen.

However, a program could also simply take some SGML and the resulting plain ASCII text and do a "diff" style analysis of the differences, and generate the OOB SGML opcodes needed to turn the one document into the other. The first method is better, though a fairly clever difference program could in fact come up with the same output.
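The "diff" approach can be tried with Python's standard difflib module. The sketch below produces abstract travel / ignore / insert steps rather than real opcodes, and it skips the entity escaping a real generator would need; diff_ops and apply_ops are names invented for this illustration.

```python
# Sketch only: deriving travel/ignore/insert steps from a character diff.
import difflib


def diff_ops(plain: str, sgml: str):
    """List the steps that turn the plain text into the SGML."""
    ops = []
    matcher = difflib.SequenceMatcher(None, plain, sgml, autojunk=False)
    for kind, i1, i2, j1, j2 in matcher.get_opcodes():
        if kind == "equal":
            ops.append(("travel", i2 - i1))
            continue
        if i2 > i1:
            ops.append(("ignore", i2 - i1))    # plain text with no SGML match
        if j2 > j1:
            ops.append(("insert", sgml[j1:j2]))  # markup to add here
    return ops


def apply_ops(plain: str, ops) -> str:
    """Replay the steps against the plain text, as a decoder would."""
    out, cur = [], 0
    for kind, arg in ops:
        if kind == "travel":
            out.append(plain[cur:cur + arg])
            cur += arg
        elif kind == "ignore":
            cur += arg
        else:
            out.append(arg)
    return "".join(out)


plain = "This word is in *bold*, how about that?"
sgml = "<P>This word is in <B>bold</B>, how about that?</P>"
steps = diff_ops(plain, sgml)
print(steps)
print(apply_ops(plain, steps) == sgml)
```

Replaying the steps always reproduces the SGML, though a plain character diff may not split the text at the most natural tag boundaries, which is one reason the renderer-driven method is better.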

Questions

Should we have a code that means "insert " or not? Is this legit? We don't want to have discretionary codes or the CRC of the output trick won't work. The CRC of the output technique is good for having this work even on documents that got modified.

Cleverly coded opcodes, using the f (search) opcode and the skip whitespace opcodes, could survive serious modification of the plain text and still generate valid output. Is this worth pursuing?

Is this too complex? I am a big stickler for extensibility, since I know nobody ever designs it right the first time. Thus the ability to define new opcodes, and the 2-part version number.

With the opcodes you can even say things like, "If you don't know opcode X and you see it, barf", vs. "If you don't know opcode Y and you see it, ignore it" and "If you don't know opcode Z and you see it, treat it like opcode W." Much more flexible than HTML.

But is a macro system overdoing it? It's a pretty simple one, not hard to code. Can make it optional in version 1, with the risk that V1 decoders will have trouble in the future. I sort of feel that extensibility options are the most important thing to put in version 1, more important than features sometimes.

Also possible: An efficient encoding for 'stupid' OOB HTML which is no more than a decent compact encoding of a diff between HTML and the ASCII output, with the DIFF made more compact by virtue of knowing that HTML doesn't care about extra whitespace, etc. Nested tables or text wrapped around objects need more than a DIFF however, so the stupid form would need to support the backwards moving opcodes.

Extra tags for USENET:

USENET needs some new tags that are not in HTML. They are structural, mostly, so browsers need not support them directly, but their addition to the language is encouraged.

It may also make sense to define these in terms of an XML.

<SIG> ... </SIG>

This tag, which would benefit web pages too, is to mark items that are not part of the text in question, but are general information provided on many articles. This tag would be suitable for toolbars, author taglines and of course USENET and E-mail signatures.

No particular formatting is implied, but if no other formatting is provided, signatures should be set off from the main article if at the bottom.

Text in SIG blocks should not normally be indexed by search engines or saved when plain text copies of an article are saved. It is possible that <SIG> should be considered a third part of a document after <HEAD> and <BODY> but it has value inside the body as well.

<INC> .. </INC>

This block marks text that has been included for reference from another posting. A BLOCKQUOTE is used with it to hold the quoted text. It contains sub-regions as below:

<INCD type=xxx> .. </INCD>

Types are:

Author -- the E-mail address and/or name of the author of the quoted text
Source -- the source of the quoted text. Usually inside the block will be a hyperlink to the URL of the text.

Text not inside an INCD or BLOCKQUOTE is "comment" on the included text, and would typically be set off from it. Text inside an <INC> should be given lower priority by search engines indexing documents, in particular if a URL is found inside an <incd type=source> block.

A typical inclusion

<INC>	In article <incd type=Source><a href=news:msgid>msgid</a></incd>,
        <incd type=Author>Brad Templeton
	(<a href=mailto:brad@clari.net>brad@clari.net</a>)</incd> writes:
<BLOCKQUOTE>
	Text from Brad Templeton
</BLOCKQUOTE>
</INC>

<page> .. </page>

A general tag that would be useful for many things. In a newsreader, indicates that scroll down commands should pause at a page break, though a subsequent scroll down will display the next page.

Used commonly in USENET postings for the "spoiler warning", so that text after a form feed is not shown without a user explicitly moving down to it, no matter how large the user's window.