Coding In Paradise

This is my personal blog. The views expressed on these pages are mine alone and not those of my employer.

Monday, December 31, 2007

Straw Man Proposal for Purple Include Spec (Version 0.1)

This is a simple straw-man proposal giving a spec for Purple Includes. It's mainly meant to help folks like Kevin Burton who are asking if they can create server-side support for Purple Includes in their software, such as Spinn3r, Kevin's mojo-licious blog spider. Note that the existing JavaScript Purple Include doesn't fully support this spec yet. Once we hash out this straw-man proposal I'll update the JavaScript to match.

Let's call this version 0.1 of the spec. A Purple Include is a way to include pieces of other, remote documents into your own web page just like you can include images from all over the world using the IMG tag. The idea behind Purple Includes is based on Ted Nelson and Douglas Engelbart's work.

Purple Includes can be added to HTML's standard Q and BLOCKQUOTE tags. A Q is an inline quote (i.e. you can use it inside of other text), while a BLOCKQUOTE is block-level, like a DIV. To use, simply set the 'cite' attribute to the remote resource or piece of remote resource you want to include and the 'embed' attribute to 'true'. Two examples:


<q cite="http://codinginparadise.org/paperairplane#quote(What if community and editing...research and coding efforts can steer towards.)" embed="true"></q>

<blockquote cite="http://www.eekim.com/blog/2007/06/21/networkedtoolsemail#nidMDU" embed="true"></blockquote>

The 'embed' attribute is necessary so that we can differentiate normal, non-Purple Include quotes and blockquotes from ones that want the Purple Include magic.

If 'embed' is present and true, then the user agent should take the URI given in 'cite', follow it, and either grab the whole thing or a fragment and inline it into the quote or blockquote, replacing that element's older contents. If 'embed' is false or not present, the user agent should do nothing and simply display what is already inside the quote or blockquote.
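For server-side folks, here is a rough Python sketch of the "should I do anything with this element?" decision. It is not the JavaScript Purple Include code; the class and function names are made up for illustration, and treating only the exact value 'true' as true is my assumption.

```python
from html.parser import HTMLParser


class IncludeFinder(HTMLParser):
    """Collects (tag, cite) pairs for Q/BLOCKQUOTE elements that ask for a Purple Include."""

    def __init__(self):
        super().__init__()
        self.includes = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag not in ("q", "blockquote"):
            return
        attributes = dict(attrs)
        # Assumption: only the literal value "true" switches the include on.
        if attributes.get("embed") == "true" and attributes.get("cite"):
            self.includes.append((tag, attributes["cite"]))


def find_includes(html):
    """Return the Purple Includes requested by a page, in document order."""
    finder = IncludeFinder()
    finder.feed(html)
    return finder.includes
```

Elements without 'embed', or with embed="false", are simply skipped, so their existing contents stay untouched.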

For the above BLOCKQUOTE example, once the Purple Include has happened here is what the markup would look like after the remote content was inlined:


<blockquote cite="http://www.eekim.com/blog/2007/06/21/networkedtoolsemail#nidMDU"
 embed="true">
<p>
<a name="nidMDU" id="nidMDU"></a>

What is a collaborative tool?  It's a tool that facilitates
collaboration.  Certainly, a shared authoring tool like a Wiki has
affordances that facilitate collaboration.  But a plain old text
editor is just as legitimately a collaboration tool, because it can
also be used to facilitate collaboration (for example, when used on a
<a href="http://www.eekim.com/cgi-bin/wiki.pl?SharedDisplay" class="wikiword">SharedDisplay</a>). 

<a class="nid" title="MDU" href="http://www.eekim.com/blog/2007/06/21/networkedtoolsemail#nidMDU">(MDU)</a>
</p>
</blockquote>

And here is what that Purple Include actually looks like using the JavaScript Purple Include (under the covers using the older syntax since I don't support this spec fully yet):

If the user agent is visible to the user (e.g. a browser), it should display a spinning ball inside the element while grabbing the remote resource. Feel free to use this one, which is one of the Ajax spinner balls around the net that are free to use for any purpose:

If the user agent is again visible to the user, the user agent should set a 'class' value on the quote or blockquote so that document writers can add nifty CSS to style the Purple Include based on whether it succeeded or not. If the Purple Include succeeds, the quote or blockquote should be given the classes "included" and "include_ok". If it fails, it should be given the CSS class name "include_error".

If there was an error, an error message should be inlined into the element's value so that users can see what happened and perhaps address the error.
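The class names above could be applied with something like the following Python sketch. The function name is hypothetical, and keeping any classes the document writer already put on the element is my assumption; the spec text above only names the three classes.

```python
PURPLE_CLASSES = ("included", "include_ok", "include_error")


def include_classes(existing, succeeded):
    """Return the new 'class' attribute value after an include attempt."""
    # Drop any stale Purple Include classes, keep everything else the author set.
    classes = [name for name in existing.split() if name not in PURPLE_CLASSES]
    if succeeded:
        classes += ["included", "include_ok"]
    else:
        classes.append("include_error")
    return " ".join(classes)
```

For example, include_classes("pullquote", True) gives "pullquote included include_ok".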

Being able to grab portions of remote documents is the real magic of Purple Include. Existing server-side templating systems grab entire documents, which have limited utility in the real world when it comes to discourse and annotation. Let's go over the kinds of URIs that can be inside the 'cite' attribute.

First things first: only HTML, XHTML, and XML are supported for remote resources right now, i.e. the following MIME types:

text/html
application/xhtml+xml
text/xml
application/xml

Any other MIME types must throw an error, writing the error into the element's value so that the user can see what happened and possibly changing the 'class' value as described above if this is a user-facing user agent.
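A minimal sketch of that check in Python, assuming you have the raw Content-Type header from the remote response. The function name and the choice of ValueError are mine, not part of the spec.

```python
SUPPORTED_MIME_TYPES = {
    "text/html",
    "application/xhtml+xml",
    "text/xml",
    "application/xml",
}


def check_mime_type(content_type):
    """Return the bare MIME type, or raise ValueError if it is not supported."""
    # Strip parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime not in SUPPORTED_MIME_TYPES:
        raise ValueError("Purple Include cannot include resources of type " + repr(mime))
    return mime
```

The message of the raised error is what you would write into the element's value.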

Before applying any of the addressing schemes below, the user agent should transform HTML into XHTML by using a tidy program, whether tidy, JTidy, or another language specific tidy library. [Note: would it help to provide some of the options I give to JTidy?]

Now we are ready to look at the different kinds of URIs that can be inside the 'cite' attribute and what a user agent should do with them:

If you just give a full URI with no anchor (ex: http://codinginparadise.org/paperairplane), the user agent should grab the entire remote resource. If the remote resource is HTML/XHTML, it should return just the BODY tag and its children (in XPath notation this would be /html/body). If the resource is XML, it should return the root of the XML document plus its children.
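Here is one way the no-anchor case might look in Python, using the standard library's xml.dom.minidom on markup that has already been tidied into well-formed XHTML. The function name is hypothetical, and falling back to the root element when an HTML resource has no BODY is my assumption.

```python
from xml.dom import minidom

HTML_MIME_TYPES = ("text/html", "application/xhtml+xml")


def whole_resource(markup, mime):
    """Return the element to inline when 'cite' has no anchor."""
    document = minidom.parseString(markup)
    if mime in HTML_MIME_TYPES:
        bodies = document.getElementsByTagName("body")
        if bodies:
            return bodies[0]
    # Plain XML (or HTML with no BODY): return the document root.
    return document.documentElement
```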

If you have a full URI with a simple anchor name (ex: http://www.eekim.com/blog/2007/06/21/networkedtoolsemail#nidMDU), then we have some special behavior since these could be Purple Numbers.

Purple Numbers are supported by some publishing systems, and cause a unique anchor to be placed onto each paragraph of the page so that you can grab just specific parts. This makes it easy to bookmark and point at specific parts of a given document. In addition, even if this isn't a Purple Number and just an anchor name, generally anchors have no children -- the intent of someone making an anchor is to make its parent have a unique address. Here's an example:

<h1><a name="first_section"></a>The First Section</h1>

If we just made the behavior of Purple Include 'dumb' and simply grabbed the anchor and returned its children, this is probably not the right thing to do. Instead, we have some special logic here:

Find an anchor tag with the given name or ID. If found, grab its immediate parent and return that.
Find a paragraph with the given name or ID. If found, return that paragraph.

Here's what that logic looks like in XPath if you end up using XPath on your server-side to do this:

//a[@name=$anchorName or @id=$anchorName]/..
|
//p[@name=$anchorName or @id=$anchorName]

[Note: is the special logic here worth the complexity? Should we just drop it? It's nice for end-users because they can quickly grab specific anchors, however]
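If you would rather not pull in an XPath engine, the same anchor logic can be approximated by walking the DOM. This is only a sketch in Python with xml.dom.minidom; the function name is made up, it assumes lowercase tag names (i.e. tidied XHTML), and unlike the XPath union it returns just the first match in document order.

```python
from xml.dom import minidom


def resolve_anchor(root, anchor_name):
    """Return the element a simple anchor name points at, or None."""
    if not anchor_name:
        return None
    for element in root.getElementsByTagName("*"):
        names = (element.getAttribute("name"), element.getAttribute("id"))
        if anchor_name not in names:
            continue
        if element.tagName == "a":
            # An anchor usually labels its parent, so return the parent.
            return element.parentNode
        if element.tagName == "p":
            return element
    return None
```

With the H1 example above, resolve_anchor(root, "first_section") returns the whole H1 rather than the empty anchor.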

Now we get to the fun part: addressing schemes. After the anchor you can give an addressing scheme, such as:

http://codinginparadise.org/paperairplane#quote(What if community and editing...research and coding efforts can steer towards.)

Address schemes always have the form:

#address(input to address type)

These address schemes are meant to grab portions of the resource. Right now there are only two schemes, a quote() scheme and an xpath() scheme. The only scheme that must be supported right now is the quote() scheme, since the xpath() scheme has shown itself to be of limited usability.
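Splitting a 'cite' value into its parts could look roughly like this in Python. The function name and the "anchor" label for plain anchor names are my own inventions, and the sketch does no percent-decoding of the fragment.

```python
import re

# scheme name, then everything between the outermost parentheses
ADDRESS_SCHEME = re.compile(r"^(\w+)\((.*)\)$", re.DOTALL)


def parse_cite(cite):
    """Return (uri, scheme, argument) for a 'cite' attribute value."""
    uri, _, fragment = cite.partition("#")
    if not fragment:
        return uri, None, None            # no anchor: whole resource
    match = ADDRESS_SCHEME.match(fragment)
    if match:
        return uri, match.group(1), match.group(2)
    return uri, "anchor", fragment        # simple anchor name, maybe a Purple Number
```

A user agent would then dispatch on the second value: None, "anchor", "quote", or "xpath".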

The quote() scheme always has the following form:

#quote(Start of quote...End of quote)

where I give the start of the quote, followed by three dots, followed by the end of the quote.
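In Python, pulling the start and end strings out of the quote() argument might look like the sketch below. The function name is hypothetical, and splitting on the first run of three dots is my assumption about what to do if the start string itself contains dots.

```python
def split_quote(argument):
    """Split 'Start of quote...End of quote' into (start, end)."""
    start, dots, end = argument.partition("...")
    if not dots or not start or not end:
        raise ValueError("quote() needs the form: start...end")
    return start, end
```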

For example, if I had the following markup:

<p>What
if community and editing were a central and transparent part of the
web and browsers?</p>
<p>What
if the web was extremely integrated for usability, with instant
messaging, site creation, the web server, and more all integrated
into one whole?</p>
<p>What would this web look like if despite being
integrated it was massively decentralized on a peer-to-peer network,
able to exist and run without businesses or governments?</p>

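and I wanted to grab everything from partway through the first question to partway through the last question, that is, from "were a central and transparent part" through "massively decentralized":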

<p>What
if community and editing were a central and transparent part of the
web and browsers?</p>
<p>What
if the web was extremely integrated for usability, with instant
messaging, site creation, the web server, and more all integrated
into one whole?</p>
<p>What would this web look like if despite being
integrated it was massively decentralized on a peer-to-peer network,
able to exist and run without businesses or governments?</p>

I might do the following:

#quote(were a central and transparent part...massively decentralized)

This would cause the following fragment to be returned:

<p>were a central and transparent part of the
web and browsers?</p>
<p>What
if the web was extremely integrated for usability, with instant
messaging, site creation, the web server, and more all integrated
into one whole?</p>
<p>What would this web look like if despite being
integrated it was massively decentralized</p>

Notice that the HTML is well-formed; we don't just grab the raw substring and return broken HTML. 'Dumb' results would look like this:

were a central and transparent part of the
web and browsers?</p>
<p>What
if the web was extremely integrated for usability, with instant
messaging, site creation, the web server, and more all integrated
into one whole?</p>
<p>What would this web look like if despite being
integrated it was massively decentralized

This would be useless, since we would 99% of the time get bad markup and no one would want to use this system.

Now, doing this right is hard, which is why I suggest you cheat: let someone else do the hard work. On the server-side portion of the JavaScript Purple Include, for example, I use the Xerces DOM Range support. The DOM Range spec (and the Xerces implementation) lets you specify a range that might cut across various elements of an HTML document. You simply set the beginning of the range, then the end of the range. This is what I do on the server-side, and I suggest you do the same, since DOM Range does all the hard work of correctly closing the start and end tags that the range cuts across. It transforms what would probably be several weeks of work into several hours of work (which is how long it took me to do the quote() scheme myself). Once you have the range, you can simply ask it for its contents using cloneContents(), which returns a fragment with the markup correctly set up.

Here's another tip that will make things easier to implement. On the server-side I also use the DOM Traversal functionality of Xerces to grab all the text and CDATA nodes, and then just iterate over all of these to find the start and end strings. DOM Traversal is another nifty spec that lets you grab just certain types of nodes.

One tricky thing you will need to keep in mind is that the start and end strings might fall across different text nodes so you should match strings that fall across node boundaries, and also remember to turn off hidden whitespace so you match correctly. Man I wish I could just give you the code (which can be viewed here in the QuoteAddress class), but unfortunately due to its heritage it is under a GPL license [Note: Eugene, can we just relicense this all as BSD code?]. Viral licenses are a pain in the butt. Studying the algorithm should be ok [Note: is that correct? Studying can't also be viral].

If you manage to do all this successfully (and without bugs) using SAX, good luck. If it is Java, send me the code when you do so I can replace what I have.
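If you are not on Xerces, the general idea can still be sketched without DOM Range. The Python below is not the QuoteAddress class and is not a port of it; it is a simplified stand-in using xml.dom.minidom. It joins all text and CDATA nodes in document order so that matches can straddle node boundaries, then rebuilds only the elements that contain part of the match, which keeps the result well-formed. All names are hypothetical, whitespace normalization is left out, and the result is expressed relative to the root you pass in rather than the range's common ancestor.

```python
from xml.dom import minidom


def text_nodes(node):
    """Yield text and CDATA nodes under node in pre-order (reading) order."""
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            yield child
        else:
            yield from text_nodes(child)


def extract_quote(root, start, end):
    """Return well-formed markup from the first 'start' through the next 'end'."""
    nodes = list(text_nodes(root))
    joined = "".join(node.data for node in nodes)
    begin = joined.find(start)                      # first match wins
    if begin < 0:
        return None
    finish = joined.find(end, begin + len(start))
    if finish < 0:
        return None
    finish += len(end)

    # Remember which slice of the joined text each text node covers.
    spans, position = {}, 0
    for node in nodes:
        spans[node] = (position, position + len(node.data))
        position += len(node.data)

    document = root.ownerDocument

    def prune(node):
        """Copy node, keeping only the parts that overlap the matched range."""
        if node in spans:
            low, high = spans[node]
            low, high = max(low, begin), min(high, finish)
            return document.createTextNode(joined[low:high]) if low < high else None
        if node.nodeType != node.ELEMENT_NODE:
            return None
        kept = [copy for copy in map(prune, node.childNodes) if copy is not None]
        if not kept:
            return None
        clone = node.cloneNode(False)               # shallow: tag and attributes only
        for copy in kept:
            clone.appendChild(copy)
        return clone

    parts = [copy for copy in map(prune, root.childNodes) if copy is not None]
    return "".join(part.toxml() for part in parts)
```

Because every surviving text slice is re-wrapped in copies of its ancestors, start and end tags that the quote cuts across come out balanced, which is the property the DOM Range approach gives you for free.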

Some final notes about the quote() scheme: if you want to use a parenthesis in the quote scheme, just backslash it:

#quote(This is some text\)...and here is some text that ends with a parenthesis\))

Also, you should scan from the top of the document to the bottom in the same way that a human would read the document (i.e. traverse the document using pre-order traversal). This means that if someone gives a start or end string that has multiple matches, the one that will be found is the one that occurs first in the document.
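Both notes are small in code. The Python sketch below assumes the escape character is a backslash, as described above; the function names are hypothetical.

```python
import re


def unescape_parens(text):
    """Turn backslashed parentheses in a quote() argument back into plain ones."""
    return re.sub(r"\\([()])", r"\1", text)


def first_match(document_text, needle):
    """Return the offset of the earliest occurrence in reading order, or -1."""
    # str.find scans left to right, so duplicates later in the text are ignored.
    return document_text.find(needle)
```

You would run unescape_parens on the start and end strings after splitting them, then search the document text with the unescaped values.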

I mentioned that there is an xpath() scheme. You don't need to support this, since it turned out not to be useful for the majority of users (the quote scheme is much more usable), but it can be fun to have for more obscure and advanced usage. If you want to get some extra credit and implement it, the one thing to keep in mind is that it must support XPath version 2 and not just XPath version 1. What this means in practice is that your user agent must use an XPath version 2 parser that can handle things like the following:

http://codinginparadise.org/paperairplane#xpath(for $i in (4 to 10) return //p[$i])

The reason for this is that XPath version 2 has some extra functionality that makes being able to have an xpath() scheme actually useful, like the 'for' loop above, while XPath version 1 is just too limited to be useful for this use case.

After doing the above, you should filter what you return to the client to prevent XSS (cross-site scripting) attacks carried in the returned content. You should:

Strip out SCRIPT blocks
Strip out javascript: URLs
Strip out eval() values in inline CSS

Everything else should be left in the returned values. [Note: should we get into more details on stripping out SCRIPT blocks and javascript: urls since there is a little trickery here?]. One of the chief reasons Firefox would never include, um, inclusions was because of XSS attacks, but since the nifty algorithm above helps to prevent them... maybe this stuff will show up in teh (yes, teh) browser.
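To give a feel for the three stripping rules, here is a deliberately small Python sketch over a tidied (lowercase, well-formed) DOM. It is not a complete XSS filter and should not be relied on as one. The function name, the list of URL-carrying attributes, and the choice to drop a whole 'style' attribute that mentions eval( are all my assumptions.

```python
from xml.dom import minidom

URL_ATTRIBUTES = ("href", "src", "cite")   # assumed list, not exhaustive


def sanitize(root):
    """Strip SCRIPT blocks, javascript: URLs, and eval() in inline CSS, in place."""
    for script in list(root.getElementsByTagName("script")):
        script.parentNode.removeChild(script)
    for element in root.getElementsByTagName("*"):
        for name in URL_ATTRIBUTES:
            if element.hasAttribute(name):
                # Remove whitespace first so "java script:" tricks do not slip by.
                value = "".join(element.getAttribute(name).split()).lower()
                if value.startswith("javascript:"):
                    element.removeAttribute(name)
        style = "".join(element.getAttribute("style").split()).lower()
        if "eval(" in style:
            element.removeAttribute("style")
    return root
```

Everything not matched by one of the three rules is left alone, as described above.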

Since the current web does not allow sites to easily work cross-domain, the JavaScript Purple Include has a server-side that 'proxies' all this stuff. The JavaScript Purple Include defaults to my web site, codinginparadise.org (I know I'm going to regret that some day... or maybe have a happy weblogs.com payoff). You can change this with a META tag; if your user agent does something similar, you should have the same META tag:

<meta name="purple.include.addressService" content="http://brad.com:8000/purple_include/"></meta>
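A server-side user agent could look for that META tag with a few lines of Python. This is only a sketch; the class and function names are hypothetical, and returning None when the tag is absent (so the caller can fall back to its own default) is my assumption.

```python
from html.parser import HTMLParser


class AddressServiceFinder(HTMLParser):
    """Remembers the first purple.include.addressService META value on a page."""

    def __init__(self):
        super().__init__()
        self.service = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "meta" and self.service is None
                and attributes.get("name") == "purple.include.addressService"):
            self.service = attributes.get("content")


def find_address_service(html):
    """Return the configured address service URL, or None if the page has none."""
    finder = AddressServiceFinder()
    finder.feed(html)
    return finder.service
```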

One final note for server-side folks: remember that when you are working with a Q or BLOCKQUOTE tag, you should automatically add quotes to Q elements. In fact, here is what the HTML spec says about these using a Purple Include:

Whew, there you go; you kids have fun. :)

Labels: purple include, spec

// posted by Brad Neuberg @ Monday, December 31, 2007