Out of band encoding for HTML (and other SGMLs)
THIS IS A DRAFT. IT STILL HAS TYPOS IN IT. THIS IS A PROPOSED SYSTEM,
NOT A STANDARD.
In order to merge the worlds of plain text tools such as mailers, newsreaders
and ordinary text editors with the tagged text world of HTML, SGMLs and other
formatted text languages, a bridge format is required.
To do this efficiently, we must present the plain text as plain text for
the older text-only tools to view without burden. Formatting tools, however,
would like to be able to use any formatting information available about
the plain text to display it in a more attractive and more useful manner.
To do this the formatting tags may be present, but must be invisible or
unobtrusive to plain text tools.
One way to do this is ProleText, ClariNet's
method for encoding HTML-subset formatting information into hidden trailing
spaces and tabs on the end of lines of ordinary text, plus limited
character formatting information inside the text according to established
conventions.
ProleText has the advantage of being totally invisible, and simpler than
full HTML. It also increases the size of documents by only 2% or so on
average. ProleText documents can also be copied in ordinary text editors
without needing to include any header information, and they can even be
hand edited to a limited degree. However, the full power of HTML is not
supported.
Out of Band SGML
An alternative is Out of Band SGML. In such a system, a plain text
document is delivered, along with instructions on how to modify the
document into HTML or any other SGML. The instructions are sent "out of
band", which is to say, not usually inside the plain text. Instead, they
can be sent in the headers of an RFC822/RFC1036 style E-mail or USENET
message. If this is a problem, they may also be contained in a second
component of a MIME multipart/related message. Finally, if no other method
is available, they can be appended to the end of a plain text message. This
method is considered ugly; its only advantage is that you then have a single
unit of text that can be copied.
Process
A program translating OOB SGML to an SGML follows this procedure.
First it checks the CRCs, if present, to ensure that the OOB encoding and
the plain text have arrived intact. If a CRC is wrong, the program should
instead translate the document to HTML as a plain text document,
through the use of PRE or XMP style tags.
As an option, an OOB SGML encoding may declare itself to be somewhat robust
through the use of the "R" opcode (see below.) Optionally, a decoder may
elect to attempt the translation, and compare a CRC of the resulting output
SGML with one calculated by the source, and then elect to output that block
as SGML or not based on a match of the CRC. Otherwise the block is to be
output as plain text.
The instructions in effect move a "cursor" through the plain text document,
and opcodes specify the insertion of additional tagging. While the cursor
moves over the document, the text from the plain text document is inserted
literally into the SGML document. All characters are to appear literally,
so that for example a "<" in the plain text must be rendered as "&lt;" in
HTML, unless the "&" toggle has been turned on, in which case there may be
literal HTML in the plain text document that should be copied verbatim.
The cursor may also move in "ignore" mode, where the characters from the
plain text are not copied into the output document but are simply
discarded.
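
The copying rule above can be sketched in a few lines of Python. This is
only an illustration, not part of the proposal: the function name copy_text
and its raw and ignore parameters are assumptions made for the example.

```python
import html

def copy_text(chunk: str, raw: bool = False, ignore: bool = False) -> str:
    """Return what the cursor emits while moving over `chunk`.

    raw    -- True while the "&" toggle is on (the text already holds SGML)
    ignore -- True while the cursor is moving in "ignore" mode
    """
    if ignore:
        return ""          # characters are discarded, not copied
    if raw:
        return chunk       # literal SGML in the text is copied verbatim
    # Default: escape <, > and & so they appear literally in the output.
    return html.escape(chunk, quote=False)
```

With the defaults, copy_text("a < b") yields "a &lt; b"; with raw=True the
same text passes through unchanged.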
Various opcodes control the insertion of both SGML tags and plain text
into the output document. A few special opcodes combine moving the
cursor and tagging. For example, the "p" opcode moves the cursor to the
end of the current line and inserts a <P> tag, as this is a very common
thing to do to add formatting to a plain text document.
(There is some debate over whether the code should be able to insert a
<P> tag, as this is very common but incorrect HTML style. Comment is welcome.
To keep the language compact, there is some merit in having an opcode that
says, "There is a paragraph break here, end the current paragraph if any
and start a new one" which is what <P> does in HTML and why it is commonly
used.)
Other opcodes do common character tagging by indicating that a small string,
all on one line, should be enclosed in (for example) <B> and </B>,
and do this compactly.
Other opcodes allow general tagging. They allow the insertion of
SGML text directly into the output, with certain escapes to avoid
the use of whitespace in the OOB SGML. Whitespace in OOB SGML tagging
is generally to be ignored.
One of these opcodes ("<") asks the translator to remember the text that was
provided with it in a numbered register, creating a new one if the text has
not been seen before. This effectively defines macros of common tags
that can be used again. In addition, the opcode ">" causes the insertion
of the tag (but not attributes) from a numbered register, surrounded in
SGML close tag syntax, i.e. </tag>. This allows you to quickly and
compactly encode the very common SGML style of surrounding a block of
text with <tag> and </tag>.
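
A minimal sketch of the register idea, in Python. The class and method names
are invented for illustration. Numbering registers from 0 and storing only
the tag name (not its attributes) are assumptions; the draft leaves both
open.

```python
class TagRegisters:
    """Numbered registers for tags introduced by the "<" opcode (sketch)."""

    def __init__(self) -> None:
        self.tags: list[str] = []          # register n holds self.tags[n]

    def open_tag(self, sgml: str) -> str:
        """Handle "<": emit <sgml> and remember its tag name if it is new."""
        name = sgml.split(None, 1)[0]      # tag name without attributes
        if name not in self.tags:
            self.tags.append(name)         # next free register number
        return f"<{sgml}>"

    def close_tag(self, n: int) -> str:
        """Handle ">": emit the close tag for register n."""
        return f"</{self.tags[n]}>"
```

After open_tag('A HREF="x.html"') the name A sits in register 0, so a later
close_tag(0) produces </A> without repeating the tag name in the stream.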
Messages include a header of the form:
HT-Form: VER:CRC:CRC2 opcodes
With additional headers of the form:
HT-Form1: more-opcodes
HT-Form2: yet-more-opcodes
(and so on). It is preferred that all formatting be in one header,
but for systems with a limit on the length of a field, it may be
needed to break the formatting up into more than one field.
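
As a rough sketch of how a decoder might gather these fields, the Python
below joins HT-Form, HT-Form1, HT-Form2 and so on into one opcode stream.
The function name collect_ht_form is made up for illustration, and treating
the first space as the separator between the VER:CRC:CRC2 prefix and the
opcodes is an assumption; the draft does not pin that down.

```python
from email import message_from_string

def collect_ht_form(raw_message: str):
    """Return (prefix_fields, opcodes) from the HT-Form headers, or None."""
    msg = message_from_string(raw_message)
    first = msg["HT-Form"]
    if first is None:
        return None                          # no OOB encoding in the headers
    prefix, _, opcodes = first.strip().partition(" ")
    fields = prefix.split(":")               # VER, then optional CRC and CRC2
    n = 1
    while msg[f"HT-Form{n}"] is not None:    # continuation headers, in order
        opcodes += msg[f"HT-Form{n}"]
        n += 1
    # Whitespace in the opcode stream is to be ignored, so drop it here.
    return fields, "".join(opcodes.split())
```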
VER is a version number, base64 encoded, or alternately VER/VER, in which case
the first number is the version number of the OOB encoding style, and the
second number is a version number for decoders. If a decoder's version number
is less than the decoder-version, it should abort and not handle the OOB
SGML. If it is equal to or above, it should handle the OOB SGML but
ignore unknown opcodes.
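
The version rule might be coded as below. This is a sketch under stated
assumptions: multi-character numbers are read most significant character
first, the first "/" in the field is taken as the separator (so a version
number cannot itself use the "/" digit), and a field with no decoder-version
imposes no minimum. The names b64_number and may_decode are illustrative.

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64_number(digits: str) -> int:
    """Decode a number written with 6 bits per character."""
    value = 0
    for ch in digits:
        value = value * 64 + B64.index(ch)
    return value

def may_decode(ver_field: str, my_version: int) -> bool:
    """Apply the VER or VER/VER rule: may this decoder handle the stream?"""
    _style, _, minimum = ver_field.partition("/")
    if not minimum:
        return True        # assumption: no decoder-version means no minimum
    return my_version >= b64_number(minimum)
```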
CRC is an n-character (6 bits per character, maximum 32 bits) CRC of the
formatting data, to assure it has not been corrupted. This is optional
but recommended. If provided it is delimited with a colon after the
version tags. A 3 character (18 bit) CRC should usually suffice.
CRC2 is an optional n-character (6 bits per character, max 32 bits) CRC
of the plain text, to detect damage to the plain text, so that the addition
of formatting can be aborted. Leading and trailing whitespace of the
document is not included in the CRC2, nor is trailing whitespace at the end
of lines, but newlines and leading whitespace at the start of lines are.
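
The whitespace rule for CRC2 can be illustrated as follows. The
normalization follows the paragraph above; the checksum itself is only a
stand-in, since the draft does not name a CRC algorithm. Using zlib's CRC-32
truncated to 18 bits, and UTF-8 as the byte encoding, are assumptions made
purely so the sketch runs.

```python
import zlib

def normalize_for_crc2(text: str) -> str:
    """Drop the whitespace that CRC2 is defined to exclude."""
    # strip() removes whitespace at the very start and end of the document;
    # rstrip() removes trailing whitespace on each line. Newlines and
    # leading whitespace on the remaining lines are kept.
    lines = [line.rstrip() for line in text.strip().split("\n")]
    return "\n".join(lines)

def crc2_18bit(text: str) -> int:
    # ASSUMPTION: zlib CRC-32 truncated to 18 bits stands in for the
    # unspecified CRC; a real implementation must use the agreed algorithm.
    data = normalize_for_crc2(text).encode("utf-8")
    return zlib.crc32(data) & 0x3FFFF
```

Two copies of a message that differ only in trailing blanks then give the
same CRC2, which is the point of the rule: mailers often add or strip such
whitespace in transit.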
There may be other CRC or hash validations of the plain text or the
OOB SGML encoding, such as the Content-MD5 header from MIME. If present,
these headers should also be checked, and they can in fact be used in
place of the CRCs provided here.
Alternately, formatting can be stored at the end of an article after
a special delimiter:
%*%Begin SGML Enhancement Codes%*% (trailing tab)
The use of tags at the end allows documents to be clipped or piped or
treated independently of their headers, but of course the tags become
a visible string of unknown codes at the end of the article.
Alternately, tags could be stored in a "multipart/related" component, the
first part being text/plain, the second part having a special name.
---------------------------------------------------------------------
Fields contain encoding of SGML/HTML formatting to insert into the document.
Entries consist of an encoded tag specifying the amount of text to "travel"
(move over and output) since the last formatting insert, a type of
formatting to insert and the contents of the formatting.
Numbers are encoded using MIME 64 character ascii encoding, 6 bits per
character. The number of 6-bit units is encoded in the opcode.
                    1         2         3         4         5         6
          0123456789012345678901234567890123456789012345678901234567890123
Mime Set: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
Length numbers may also be the character "!" which means a series of
bytes in "encoded" form are to be taken until the end-of-encoding
delimiter. This applies only to lengths that are within the OOB stream,
not the text stream.
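
A sketch of reading such a number from the opcode stream, in Python. The
function name read_number and its width parameter are illustrative, and
reading multi-character numbers most significant character first is an
assumption; the draft does not state the order.

```python
MIME_SET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def read_number(stream: str, pos: int, width: int = 1):
    """Read a `width`-character number at stream[pos].

    Returns (value, new_pos). A value of None means the "!" marker was
    seen, i.e. the operand runs until the end-of-encoding delimiter.
    """
    if stream[pos] == "!":
        return None, pos + 1
    value = 0
    for ch in stream[pos:pos + width]:
        value = (value << 6) | MIME_SET.index(ch)   # 6 bits per character
    return value, pos + width
```

For instance, read_number("P", 0) gives the value 15, so an opcode that
travels n+1 bytes would move 16 bytes when followed by "P".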
Formatting opcodes:
S       Travel n+1 bytes
s       Travel n+65 bytes
E       Travel to end of line (EOL is before the newline)
2ef     Travel to end of article, after last newline
V       Travel to next printable (non-white) character.
        Does not move if cursor is on a printable
        character. Useful as a first opcode.
v       Travel to first column of next line that has a
        non-white character on it. Useful as a first
        opcode when whitespace matters.
L       Travel n+1 lines, to start of line. What remains
        of current line is first line skipped.
2mk     An indication that at some point later in the
        stream, the pointer will return here via the 'u'
        or similar backtracking opcode. A warning to
        the scanner that it needs to be able to seek back
        at least this far, or buffer from here on. Required
        if 'u' is to be used.
2eb     An indication that there is no longer a need to
        seek back prior to this point, so any buffers created
        due to 2mk can now be released.
u       Move up n+1 lines, to start of line, do not output
        any text while moving. (For re-use of ASCII and
        defining shaped segments.) Requires use of 2mk.
l       Travel n+1 lines and n2 bytes
I       Ignore n+1 bytes from the plain text
g       Ignore to EOL and go to a column on the next line
i       Ignore to EOL plus n+1 lines of the plain text
#       Turn on ignoring until next "#" code.
        (ie. use "#", then travel codes, then '#')
N       Travel to start of next line
n       Travel to next line, first printable, plus n bytes
H[encoded SGML]\x
        Insert pre-encoded bytes of SGML or text at this
        point. Encoding is normal, but some characters
        must be escaped, notably:
            _     space
            \\    \
            \_    underbar
            \n    Newline
            \t    Tab
            \000  Insert octal encoded byte
            \x    End of stream of SGML
        Question: Consider using space as end of
        HTML delimiter marker?
<[encoded SGML]\x
        Insert bytes of SGML at this point,
        embedded in < and >. In addition, remember
        the tag name that was at the start of the
        SGML, and store it in register N, where N
        increments by one for each new tag.
[[encoded SGML]\x
        Insert bytes of SGML, embedded in < and >, but
        don't bother remembering in a register.
>       Insert </tagn> into stream, where tagn is the
        tag (but not any additional attributes after the
        tag) remembered from an "<" opcode, numbered n.
        If n is 56-63, one more byte n2 follows to provide 9
        bit number formed from ((n-56) << 6) + n2 for 512
        possible tags. Only the tag is used, not any
        other attributes that came with it.
whitespace
        ignored
@       Insert " -- only the tag is used.
]       Insert "
        if that was part of the register.
h       Insert into stream, with closing ">"
R       An 18 bit CRC of the resulting output SGML since
        the last "R" opcode. Used for robust encodings
        to CRC the desired output of each paragraph.
&       Toggle mapping of <, > and & (ie. text has SGML).
        Normally these characters in the text are to be
        escaped, < to &lt; and so on when generating SGML.
P       Insert HTML <P> at this point
p       Travel to EOL, insert <P>.
B       Insert HTML
        at this point
b       Travel to EOL, insert
        .
O       insert HTML at this point
o       Travel to beginning next line, insert
t       Insert n+1 bytes
d       Insert n+1 bytes
T       Insert n+1 bytes
m       Insert n+1 bytes
U       Insert n+1 bytes
t       Insert n+1 bytes
([keyword])
        Insert <keyword>, n1+1 bytes of text from
        stream, and </keyword>, ie. some text of length
        n1+1 will be bounded in a paired set of
        <keyword>...</keyword>.
        The keyword can't contain a ')'.
.       nop
!       Give an error (for use in macros)
/string Search for string (length n) in text, travel to
        start of it. Recommended as early opcode if
        documents are to move through environments where the
        plain text might be damaged. Search string encoded
        same as inserted HTML.
M       Insert macro argument n (use in C definitions)
        Bytes starting at of length from the
        operands of the macro are inserted. If is 0,
        then all remaining bytes are inserted.
2       Two letter opcodes -- for expansion
2er     Recommend that a link be inserted in the HTML to
        allow viewing of the plain text, or some other
        button such as "view document source" be able to
        do this.
2ei     Insist that a link be inserted in the HTML to
        view the plain text, or that this function be
        available.
2lr     A link to a web page with help about OOB SGML
        and in particular OOB HTML is recommended, but not
        required.
C       Define mapping of new opcode to old opcodes (length
        n1), pulling in an additional n2 bytes after the
        number interpreted by the old opcodes. Thus to define
        a new opcode you can map it to "." (nop) and specify
        how many bytes of operands it takes. All opcodes not
        in the initial spec should be defined in this way so
        old systems at least know how many operands they have.
        Bytes from the n2 bytes of extra operands can be
        referred to with the M opcode, inserted into the
        stream. If "n2" is "!" as noted above, rather
        than a mime64 number, bytes are taken until a
        delimiter.
        This operator is not done if the system has a
        definition for the new opcode. Such definitions
        are to be used when you use a new operator but
        expect older systems to not know what to do with it.
        This way you can tell the old system how to best
        handle the new opcode -- treating it like a nop, an
        error, a signal to ignore material or comment it out
        or treat it raw or whatever.
        Example:
        CQAB
            Define new opcode "Q", as the following A=0 byte
            string, ie. it's a nop. However, suck up B=1 bytes
            after the Q as operands.
        CQB!B
            Define a new opcode "Q" as the B=1 byte string "!"
            (signal an error). Suck up one byte of operand.
c       Define a mapping of a new opcode to an old, which
        applies even if the new system has a meaning for
        the new opcode. Masks out the new meaning of course.
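
The escaping used by the SGML-inserting opcodes (H, < and [) might be
decoded as in this Python sketch. The function name decode_sgml is
illustrative, and treating the octal escape as exactly three digits is an
assumption based on the \000 form shown in the table.

```python
def decode_sgml(stream: str, pos: int = 0):
    """Decode an escaped SGML operand that starts at stream[pos].

    Returns (decoded_text, index just past the closing backslash-x).
    """
    simple = {"\\": "\\", "_": "_", "n": "\n", "t": "\t"}
    out = []
    while pos < len(stream):
        ch = stream[pos]
        if ch == "_":                        # bare underbar encodes a space
            out.append(" ")
            pos += 1
        elif ch != "\\":                     # ordinary character
            out.append(ch)
            pos += 1
        elif pos + 1 == len(stream):         # dangling backslash at the end
            break
        elif stream[pos + 1] == "x":         # end of the SGML operand
            return "".join(out), pos + 2
        elif stream[pos + 1] in "01234567":  # three-digit octal byte
            out.append(chr(int(stream[pos + 1:pos + 4], 8)))
            pos += 4
        else:                                # escaped backslash, underbar,
            out.append(simple[stream[pos + 1]])  # newline or tab
            pos += 2
    raise ValueError("operand is missing its end-of-stream escape")
```

Given the operand text A_HREF=x followed by the end escape, the function
returns "A HREF=x" together with the position of the next opcode.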
Doing nested tables and unusual shapes...
Some opcodes make this easier. The 'u' opcode allows the
pointer to be moved back in the document to earlier lines.
The 'g' opcode causes the rest of the line to be ignored and
the pointer to move to a column in the next line.
This allows the OOB stream to scan arbitrary regions of the
text, such as a left rectangle, and then back up and travel over
a right rectangle. Thus two side-by-side table entries can be
laid out side by side in the plain text, but linearly in the HTML.
This also applies to replacements for graphics.
The ignoring opcodes:
These allow you to effectively have alternate SGML that entirely
replaces the ASCII text. So if, in the ASCII, you embolden a
word with *bold*, you can insert opcodes to ignore the stars
and insert HTML <B> and </B> tags.
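
To make the *bold* case concrete, here is a small Python sketch of a cursor
that can travel, ignore and insert. It uses symbolic step names rather than
the real opcode encoding; the names "travel", "ignore" and "insert" and the
function apply_ops are inventions for this illustration only.

```python
import html

def apply_ops(text: str, ops) -> str:
    """Run (op, arg) steps over `text`; the op names are illustrative."""
    out, pos = [], 0
    for op, arg in ops:
        if op == "travel":      # copy arg characters, escaping <, > and &
            out.append(html.escape(text[pos:pos + arg], quote=False))
            pos += arg
        elif op == "ignore":    # move over arg characters without output
            pos += arg
        elif op == "insert":    # add SGML that is not in the plain text
            out.append(arg)
        else:
            raise ValueError(f"unknown op {op!r}")
    return "".join(out)

text = "This word is in *bold*, how about that?"
ops = [("travel", 16), ("ignore", 1), ("insert", "<B>"), ("travel", 4),
       ("insert", "</B>"), ("ignore", 1), ("travel", 17)]
print(apply_ops(text, ops))
# prints: This word is in <B>bold</B>, how about that?
```

The stars stay in the plain text for the benefit of text-only readers, while
the HTML output carries real tags in their place.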
Example:
HT-Form: PSPbDp
(meaning: Insert <P>; travel (P=15)+1 bytes; insert bold tags around (D=3)+1
bytes; travel to end of line and insert paragraph marker.)
Text:
This word is in bold, how about that?
Maps to:
<P>This word is in <B>bold</B>, how about that?<P>
Encoding of tables: Tables will be complex but doable. The table should
be rendered in ASCII, possibly with the use of | and - characters to do
borders. Codes should be inserted to ignore the border drawing characters
and insert appropriate table tagging HTML.
Generating OOB SGML opcodes:
Generally this will be done by a program that takes some original
SGML or HTML, and renders it to plain ASCII text. As it renders
it to the text, it will know where the SGML tags used to go, what
extra characters it generated that don't belong in the SGML etc.
and output the right opcodes to make this happen.
However, a program could also simply take some SGML and the
resulting plain ascii text and do a "diff" style analysis of
the differences, and generate the OOB SGML opcodes needed to
turn the one document into the other. The first method is
better, though a fairly clever difference program could
in fact come up with the same output.
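
The "diff" approach can be sketched with Python's standard difflib. This is
a naive character-level version, not the clever difference program imagined
above. The step names and both function names are illustrative, and the
replay copies travelled text verbatim (as if the "&" toggle were on), which
is an assumption that keeps the example short.

```python
from difflib import SequenceMatcher

def derive_ops(plain: str, sgml: str):
    """Derive symbolic travel/ignore/insert steps from a character diff."""
    ops = []
    matcher = SequenceMatcher(None, plain, sgml, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("travel", i2 - i1))
            continue
        if i2 > i1:                     # plain text that the SGML lacks
            ops.append(("ignore", i2 - i1))
        if j2 > j1:                     # SGML that the plain text lacks
            ops.append(("insert", sgml[j1:j2]))
    return ops

def replay(plain: str, ops) -> str:
    """Apply the steps to the plain text, to check the derivation."""
    out, pos = [], 0
    for op, arg in ops:
        if op == "travel":
            out.append(plain[pos:pos + arg])
            pos += arg
        elif op == "ignore":
            pos += arg
        else:
            out.append(arg)
    return "".join(out)
```

Replaying the derived steps over the plain text reproduces the SGML exactly,
which gives a simple self-check for any generator built this way.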
Questions:
Should we have a code that means "insert " or not? Is this
legit? We don't want to have discretionary codes or the
CRC of the output trick won't work. The CRC of the output technique
is good for having this work even on documents that got modified.
Cleverly coded opcodes, using the / (search) opcode and the skip
whitespace opcodes could survive serious modification of the
plain text and still generate valid output. Is this worth
pursuing?
Is this too complex? I am a big stickler for extensibility, since
I know nobody ever designs it right the first time. Thus the
ability to define new opcodes, and the 2-part version number.
With the opcodes you can even say things like, "If you don't know
opcode X and you see it, barf", vs. "If you don't know opcode
Y and you see it, ignore it" and "If you don't know opcode Z and
you see it, treat it like opcode W." Much more flexible than HTML.
But is a macro system overdoing it? It's a pretty simple one, not
hard to code. Can make it optional in version 1, with the risk
that V1 decoders will have trouble in the future. I sort of
feel that extensibility options are the most important thing to
put in version 1, more important than features sometimes.
Also possible: An efficient encoding for 'stupid' OOB HTML which is
no more than a decent compact encoding of a diff between HTML and the
ASCII output, with the DIFF made more compact by virtue of knowing that
HTML doesn't care about extra whitespace, etc. Nested tables or text
wrapped around objects need more than a DIFF, however, so the stupid form
would need to support the backwards moving opcodes.
Extra tags for USENET:
USENET needs some new tags that are not in HTML. They are structural, mostly,
so browsers need not support them directly, but their addition to the
language is encouraged.
<SIG>...</SIG>
This tag, which would benefit web pages too, is to mark items that are not
part of the text in question, but are general information provided on
many articles. This tag would be suitable for toolbars, author taglines
and of course USENET and E-mail signatures.
No particular formatting is implied, but if no other formatting is provided,
signatures should be set off from the main article if at the bottom.
Text in SIG blocks should not normally be indexed by search engines or saved
when plain text copies of an article are saved.
..
This block marks text that has been included for reference from another
posting. Includes a blockquote. used with it. Contains sub-regions
as below:
..
Types are:
Author -- the E-mail address and/or name of the author of the quoted
text
Source -- the source of the quoted text. Usually inside the block
will be a hyperlink to the URL of the text.
Text not inside an INCD or BLOCKQUOTE is "comment" on the included text,
and would typically be set off from it.
A typical inclusion:
In article msgid,
Brad Templeton
(brad@clari.net) writes:
Text from Brad Templeton
..
A general tag that would be useful for many things. In a newsreader,
indicates that scroll down commands should pause at a page break,
though a subsequent scroll down will display the next page.
Used commonly in USENET postings for the "spoiler warning", so that
text after a form feed is not shown without a user explicitly moving
down to it, no matter how large the user's window.
This next tag is optional, if you want to do full USENET functionality you
would do it but frankly there are better ways...
..
Indicates material in the block is encoded with the "method." If warning
is on, the browsers should prompt the user with the string before revealing
the decoded component.