Out of band encoding for HTML (and other SGMLs)

THIS IS A DRAFT. IT STILL HAS TYPOS IN IT. THIS IS A PROPOSED SYSTEM, NOT A STANDARD.

To merge the world of plain text tools, such as mailers, newsreaders and ordinary text editors, with the tagged text world of HTML, SGMLs and other formatted text languages, a bridge format is required. To do this efficiently, we must present the plain text as plain text, so that older text-only tools can view it without burden. Formatting tools, however, would like to use any formatting information available about the plain text to display it in a more attractive and more useful manner. To allow this, the formatting tags may be present, but they must be invisible or unobtrusive to plain text tools.

One way to do this is ProleText, ClariNet's method for encoding HTML-subset formatting information into hidden trailing spaces and tabs at the end of lines of ordinary text, plus limited character formatting information inside the text according to established conventions. ProleText has the advantage of being totally invisible, and it is simpler than full HTML. It also increases the size of documents by only about 2% on average. ProleText documents can be copied in ordinary text editors without needing to include any header information, and they can even be hand edited to a limited degree. However, the full power of HTML is not supported.

Out of Band SGML

An alternative is Out of Band SGML. In such a system, a plain text document is delivered along with instructions on how to turn the document into HTML or any other SGML. The instructions are sent "out of band", which is to say, not usually inside the plain text. Instead, they can be sent in the headers of an RFC822/RFC1036 style E-mail or USENET message. If this is a problem, they may also be contained in a second component of a MIME multipart/related message. Finally, if no other method is available, they can be appended to the end of a plain text message.
This method is considered ugly; its only advantage is that you then have a single unit of text that can be copied.

Process

A program translating OOB SGML to an SGML follows this procedure. First it checks the CRCs, if present, to ensure that the OOB encoding and the plain text have arrived intact. If a CRC is wrong, the program should translate the document to HTML as a plain text document, through the use of PRE or XMP style tags. As an option, an OOB SGML encoding may declare itself to be somewhat robust through the use of the "R" opcode (see below). Optionally, a decoder may then elect to attempt the translation, compare a CRC of the resulting output SGML with one calculated by the source, and output that block as SGML only if the CRCs match. Otherwise the block is to be output as plain text.

The instructions in effect move a "cursor" through the plain text document, and opcodes specify the insertion of additional tagging. While the cursor moves over the document, the text from the plain text document is inserted literally into the SGML document. All characters are to appear literally, so that, for example, a "<" in the plain text must be rendered as "&lt;" in HTML, unless the "&" toggle has been turned on, in which case there may be literal HTML in the plain text document that should be copied verbatim. The cursor may also move in "ignore" mode, where the characters from the plain text are not copied into the output document but simply discarded.

Various opcodes control the insertion of both SGML tags and plain text into the output document. A few special opcodes combine moving the cursor and tagging. For example, the "p" opcode moves the cursor to the end of the current line and inserts a <P> tag, as this is a very common thing to do to add formatting to a plain text document. (There is some debate over whether the code should be able to insert a <P> tag, as this is very common but incorrect HTML style. Comment is welcome. To keep the language compact, there is some merit in having an opcode that says, "There is a paragraph break here, end the current paragraph if any and start a new one", which is what <P> does in HTML and why it is commonly used.)

Other opcodes do common character tagging by indicating that a small string, all on one line, should be tagged in (for example) <B> and </B>, and do this compactly.

Other opcodes allow general tagging. They allow the insertion of SGML text directly into the output, with certain escapes to avoid the use of whitespace in the OOB SGML. Whitespace in OOB SGML tagging is generally to be ignored. One of these opcodes ("<") asks the translator to remember the text that was provided with it in a numbered register, creating a new one if the text has not been seen before. This effectively defines macros of common tags that can be used again. In addition, the opcode ">" causes the insertion of the tag (but not attributes) from a numbered register, surrounded in SGML close tag syntax, ie. </tag>. This allows you to quickly and compactly encode the very common SGML style of surrounding a block of text with <tag> and </tag>.

Messages include a header of the form:

    HT-Form: VER:CRC:CRC2 opcodes

with additional headers of the form:

    HT-Form1: more-opcodes
    HT-Form2: yet-more-opcodes

(and so on). It is preferred that all formatting be in one header, but for systems with a limit on the length of a field, it may be necessary to break the formatting up into more than one field.

VER is a version number, base64 encoded, or alternately VER/VER, in which case the first number is the version number of the OOB encoding style, and the second number is a version number for decoders. If a decoder's version number is less than the decoder-version, it should abort and not handle the OOB SGML. If it is equal to or above, it should handle the OOB SGML but ignore unknown opcodes.

CRC is an n-character (6 bits per character, maximum 32 bits) CRC of the formatting data, to ensure it has not been corrupted. This is optional but recommended. If provided, it is delimited with a colon after the version tags.
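The version rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: that multi-character numbers are written most significant digit first, that a bare VER serves as both the encoding version and the decoder version, and that the version field ends at the first colon or whitespace. The names `b64_number`, `parse_version` and `should_decode` are hypothetical and not part of the proposal.

```python
# Standard MIME base64 alphabet; each character carries 6 bits.
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64_number(digits):
    """Decode base64 digits to an integer (assumed most significant first)."""
    value = 0
    for ch in digits:
        value = (value << 6) | B64.index(ch)  # ValueError on a bad digit
    return value


def parse_version(header_value):
    """Return (encoding_version, decoder_version) from an HT-Form value."""
    # The first whitespace-delimited token holds VER:CRC:CRC2.
    first_token = header_value.split(None, 1)[0]
    ver_field = first_token.split(":", 1)[0]
    enc, sep, dec = ver_field.partition("/")
    enc_ver = b64_number(enc)
    # Assumption: with no "/", the single number plays both roles.
    dec_ver = b64_number(dec) if sep else enc_ver
    return enc_ver, dec_ver


def should_decode(my_version, header_value):
    """True if a decoder at my_version may handle this OOB SGML."""
    _, required = parse_version(header_value)
    return my_version >= required
```

Under these assumptions, a decoder at version 1 would handle a header whose VER field is C/B (encoding version 2, decoder version 1), while a decoder at version 0 would abort and leave the text as plain text.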
A 3 character (18 bit) CRC should usually suffice.

CRC2 is an optional n-character (6 bits per character, maximum 32 bits) CRC of the plain text, used to detect damage to the plain text so that the addition of formatting can be aborted. Leading and trailing whitespace of the document is not included in the CRC2, nor is trailing whitespace at the end of lines, but newlines and leading whitespace at the start of lines are.

There may be other CRC or hash validations of the plain text or the OOB SGML encoding, such as the Content-MD5 header from MIME. If present, these headers should also be checked, and they can in fact be used in place of the CRCs provided here.

Alternately, formatting can be stored at the end of an article after a special delimiter:

    %*%Begin SGML Enhancement Codes%*% (trailing tab)

The use of tags at the end allows documents to be clipped or piped or treated independently of their headers, but of course the tags become a visible string of unknown codes at the end of the article.

Alternately, tags could be stored in a "multipart/related" component, the first part being text/plain, the second part having a special name.

---------------------------------------------------------------------

Fields contain an encoding of the SGML/HTML formatting to insert into the document. Entries consist of an encoded tag specifying the amount of text to "travel" (move over and output) since the last formatting insert, a type of formatting to insert, and the contents of the formatting.

Numbers are encoded using the MIME 64 character ASCII encoding, 6 bits per character. The number of 6-bit units is encoded in the opcode.

                        1         2         3         4         5         6
              0123456789012345678901234567890123456789012345678901234567890123
    Mime Set: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

Length numbers may also be the character "!", which means that a series of bytes in "encoded" form is to be taken until the end-of-encoding delimiter.
This applies only to lengths that are within the OOB stream, not the text stream.

Formatting opcodes:

S       Travel n+1 bytes
s       Travel n+65 bytes
E       Travel to end of line (EOL is before the newline)
2ef     Travel to end of article, after last newline
V       Travel to next printable (non-white) character. Does not move if cursor is on a printable character. Useful as a first opcode.
v       Travel to first column of next line that has a non-white character on it. Useful as a first opcode when whitespace matters.
L       Travel n+1 lines, to start of line. What remains of the current line is the first line skipped.
2mk     An indication that at some point later in the stream, the pointer will return here via the 'u' or similar backtracking opcode. A warning to the scanner that it needs to be able to seek back at least this far, or buffer from here on. Required if 'u' is to be used.
2eb     An indication that there is no longer a need to seek back prior to this point, so any buffers created due to 2mk can now be released.
u       Move up n+1 lines, to start of line; do not output any text while moving. (For re-use of ASCII and defining shaped segments.) Requires use of 2mk.
l       Travel n+1 lines and n2 bytes
I       Ignore n+1 bytes from the plain text
g       Ignore to EOL and go to a column in the next line
i       Ignore to EOL plus n+1 lines of the plain text
#       Turn on ignoring until next "#" code (ie. use "#", then travel codes, then "#")
N       Travel to start of next line
n       Travel to next line, first printable, plus n bytes
H[encoded SGML]\x   Insert pre-encoded bytes of SGML or text at this point. Encoding is normal, but some characters must be escaped, notably:
        _       space
        \\      \
        \_      underbar
        \n      Newline
        \t      Tab
        \000    Insert octal encoded byte
        \x      End of stream of SGML
        Question: consider using space as the end of HTML delimiter marker?
<[encoded SGML]\x   Insert bytes of SGML at this point, embedded in < and >. In addition, remember the tag name that was at the start of the SGML, and store it in register N, where N increments by one for each new tag.
[[encoded SGML]\x   Insert bytes of SGML, embedded in < and >, but don't bother remembering in a register
>       Insert </tagn> into stream, where tagn is the tag (but not any additional attributes after the tag) remembered from an "<" opcode, numbered n. If n is 56-63, one more byte n2 follows to provide a 9 bit number formed from ((n-56) << 6) + n2, for 512 possible tags. Only the tag is used, not any other attributes that came with it.
whitespace   Ignored
@       Insert " -- only the tag is used.
]       Insert " if that was part of the register.
h       Insert into stream, with closing ">"
R       An 18 bit CRC of the resulting output SGML since the last "R" opcode. Used for robust encodings to CRC the desired output of each paragraph.
&       Toggle mapping of <, > and & (ie. text has SGML). Normally these characters in the text are to be escaped, < to &lt; and so on, when generating SGML.
P       Insert HTML <P> at this point
p       Travel to EOL, insert <P>
B       Insert HTML <BR> at this point
b       Travel to EOL, insert <BR>
O       Insert HTML <LI> at this point
o       Travel to beginning of next line, insert <LI>
t       Insert n+1 bytes
d       Insert n+1 bytes
T       Insert n+1 bytes
m       Insert n+1 bytes
U       Insert n+1 bytes
t       Insert n+1 bytes
([keyword])   Insert <keyword>, n1+1 bytes of text from stream, and </keyword>, ie. some text of length n1+1 will be bounded in a paired set of <keyword>...</keyword>. The keyword can't contain a ')'.
.       nop
!       Give an error (for use in macros)
/string   Search for string (length n) in text, travel to the start of it. Recommended as an early opcode if documents are to move through environments where the plain text might be damaged. Search string is encoded the same as inserted HTML.
M       Insert macro argument n (use in C definitions). Bytes starting at a given offset, of a given length, from the operands of the macro are inserted. If the length is 0, then all remaining bytes are inserted.
2       Two letter opcodes -- for expansion
2er     Recommend that a link be inserted in the HTML to allow viewing of the plain text, or that some other button such as "view document source" be able to do this.
2ei     Insist that a link be inserted in the HTML to view the plain text, or that this function be available.
2lr     A link to a web page with help about OOB SGML and in particular OOB HTML is recommended, but not required.
C       Define mapping of new opcode to old opcodes (length n1), pulling in an additional n2 bytes after the number interpreted by the old opcodes. Thus to define a new opcode you can map it to "." (nop) and specify how many bytes of operands it takes. All opcodes not in the initial spec should be defined in this way so old systems at least know how many operands they have. Bytes from the n2 bytes of extra operands can be referred to with the M opcode, inserted into the stream. If "n2" is "!" as noted above, rather than a mime64 number, bytes are taken until a delimiter. This operator is not done if the system has a definition for the new opcode. Such definitions are to be used when you use a new operator but expect older systems to not know what to do with it.
This way you can tell the old system how best to handle the new opcode -- treating it like a nop, an error, a signal to ignore material or comment it out or treat it raw or whatever.

Example:

CQAB    Define new opcode "Q" as the following A=0 byte string, ie. it's a nop. However, suck up B=1 bytes after the Q as operands.
CQB!B   Define a new opcode "Q" as the B=1 byte string "!" (signal an error). Suck up one byte of operand.

c       Define a mapping of a new opcode to an old one, which applies even if the new system has a meaning for the new opcode. Masks out the new meaning, of course.

Doing nested tables and unusual shapes...

Some opcodes make this easier. The 'u' opcode allows the pointer to be moved back in the document to earlier lines. The 'g' opcode causes the rest of the line to be ignored and the pointer to move to a column in the next line. This allows the OOB stream to scan arbitrary regions of the text, such as a left rectangle, and then back up and travel over a right rectangle. Thus two side-by-side table entries can be laid out side by side in the plain text, but linearly in the HTML. This also applies to replacements for graphics.

The ignoring opcodes allow you to effectively have alternate SGML that entirely replaces the ASCII text. So if in the ASCII you bolden a word with *bold*, you can insert opcodes to ignore the stars and insert HTML <B> and </B> tags.

Example:

    HT-Form: PSPbDp

(meaning: Insert <P>; Travel (P=15)+1 bytes; insert bold tags around (D=3)+1 bytes; Travel to End of Line and Insert Paragraph marker.)

Text:

    This word is in bold, how about that?

Maps to:

    <P>This word is in <B>bold</B>, how about that?<P>
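
The worked example can be traced with a toy decoder. The sketch below is not a reference implementation: it handles only the four opcodes the example uses, takes "b" in the sense the example gives it (wrap the next n+1 bytes in <B> and </B>), assumes the paragraph marker is a bare <P>, and uses the digit D, the base64 character for 3, for the four-byte word "bold". The function name `decode` is hypothetical.

```python
import html

# Standard MIME base64 alphabet; a digit's index is its 6-bit value.
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def decode(opcodes, text):
    """Apply a tiny subset of the opcodes (P, S, b, p) to plain text."""
    out = []
    pos = 0  # cursor position in the plain text
    i = 0    # position in the opcode stream

    def travel(count):
        # Copy `count` characters to the output, escaping <, > and &.
        nonlocal pos
        chunk = text[pos:pos + count]
        pos += len(chunk)
        return html.escape(chunk, quote=False)

    while i < len(opcodes):
        op = opcodes[i]
        i += 1
        if op.isspace():
            continue  # whitespace in the opcode stream is ignored
        if op == "P":
            out.append("<P>")  # assumed paragraph marker
        elif op == "S":
            n = B64.index(opcodes[i])
            i += 1
            out.append(travel(n + 1))
        elif op == "b":
            # Assumption: bold tagging, as in the worked example.
            n = B64.index(opcodes[i])
            i += 1
            out.append("<B>" + travel(n + 1) + "</B>")
        elif op == "p":
            eol = text.find("\n", pos)
            if eol == -1:
                eol = len(text)
            out.append(travel(eol - pos))
            out.append("<P>")
        else:
            raise ValueError("opcode not handled in this sketch: " + op)
    return "".join(out)


print(decode("PSPbDp", "This word is in bold, how about that?"))
```

Because the travel step escapes markup characters, a "<" in the plain text comes out as "&lt;", matching the default behaviour described for the "&" toggle.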

Encoding of tables:

Tables will be complex but doable. The table should be rendered in ASCII, possibly with the use of | and - characters to draw borders. Codes should be inserted to ignore the border drawing characters and insert the appropriate table tagging HTML.

Generating OOB SGML opcodes:

Generally this will be done by a program that takes some original SGML or HTML and renders it to plain ASCII text. As it renders the text, it will know where the SGML tags used to go, what extra characters it generated that don't belong in the SGML, etc., and can output the right opcodes to make this happen.

However, a program could also simply take some SGML and the resulting plain ASCII text, do a "diff" style analysis of the differences, and generate the OOB SGML opcodes needed to turn the one document into the other. The first method is better, though a fairly clever difference program could in fact come up with the same output.

Questions:

Should we have a code that means "insert <P>" or not? Is this legit? We don't want to have discretionary codes or the CRC of the output trick won't work.

The CRC of the output technique is good for having this work even on documents that got modified. Cleverly coded opcodes, using the / (search) opcode and the skip whitespace opcodes, could survive serious modification of the plain text and still generate valid output. Is this worth pursuing? Is this too complex?

I am a big stickler for extensibility, since I know nobody ever designs it right the first time. Thus the ability to define new opcodes, and the 2-part version number. With the opcodes you can even say things like, "If you don't know opcode X and you see it, barf", vs. "If you don't know opcode Y and you see it, ignore it" and "If you don't know opcode Z and you see it, treat it like opcode W." Much more flexible than HTML.

But is a macro system overdoing it? It's a pretty simple one, not hard to code. We can make it optional in version 1, with the risk that V1 decoders will have trouble in the future. I sort of feel that extensibility options are the most important thing to put in version 1, more important than features sometimes.

Also possible: an efficient encoding for 'stupid' OOB HTML which is no more than a decent compact encoding of a diff between the HTML and the ASCII output, with the diff made more compact by virtue of knowing that HTML doesn't care about extra whitespace, etc. Nested tables or text wrapped around objects need more than a diff, however, so the stupid form would need to support the backwards moving opcodes.

Extra tags for USENET:

USENET needs some new tags that are not in HTML. They are structural, mostly, so browsers need not support them directly, but their addition to the language is encouraged.

<SIG>...</SIG> This tag, which would benefit web pages too, is to mark items that are not part of the text in question, but are general information provided on many articles.
This tag would be suitable for toolbars, author taglines and of course USENET and E-mail signatures. No particular formatting is implied, but if no other formatting is provided, signatures should be set off from the main article if at the bottom. Text in SIG blocks should not normally be indexed by search engines or saved when plain text copies of an article are saved.

.. This block marks text that has been included for reference from another posting. Includes a blockquote. used with it. Contains sub-regions as below:

.. Types are:

Author -- the E-mail address and/or name of the author of the quoted text
Source -- the source of the quoted text. Usually inside the block will be a hyperlink to the URL of the text.

Text not inside an INCD or BLOCKQUOTE is "comment" on the included text, and would typically be set off from it.

A typical inclusion:

In article msgid, Brad Templeton (brad@clari.net) writes:

    Text from Brad Templeton
    .. A general tag that would be useful for many things. In a newsreader, indicates that scroll down commands should pause at a page break, though a subsequent scroll down will display the next page. Used commonly in USENET postings for the "spoiler warning", so that text after a form feed is not shown without a user explicitly moving down to it, no matter how large the user's window. This next tag is optional, if you want to do full USENET functionality you would do it but frankly there are better ways... .. Indicates material in the block is encoded with the "method." If warning is on, the browsers should prompt the user with the string before revealing the decoded component.