Coding In Paradise

This is my personal blog. The views expressed on these pages are mine alone and not those of my employer.

Monday, December 31, 2007

Straw Man Proposal for Purple Include Spec (Version 0.1)

This is a simple straw-man proposal giving a spec for Purple Includes. It's mainly meant to help folks like Kevin Burton who are asking if they can create server-side support for Purple Includes in their software, such as Spinn3r, Kevin's mojo-licious blog spider. Note that the existing JavaScript Purple Include actually doesn't fully support this spec yet. Once we hash out this strawman proposal I'll update things in the JavaScript.

Let's call this version 0.1 of the spec. A Purple Include is a way to include pieces of other, remote documents into your own web page just like you can include images from all over the world using the IMG tag. The idea behind Purple Includes is based on Ted Nelson and Douglas Engelbart's work.

Purple Includes can be added to HTML's standard Q and BLOCKQUOTE tags. A Q is an inline quote (i.e. you can use it inside of other text), while a BLOCKQUOTE is block-level, like a DIV. To use, simply set the 'cite' attribute to the remote resource or piece of remote resource you want to include and the 'embed' attribute to 'true'. Two examples:


<q cite="http://codinginparadise.org/paperairplane#quote(What if community and editing...research and coding efforts can steer towards.)" embed="true"></q>

<blockquote cite="http://www.eekim.com/blog/2007/06/21/networkedtoolsemail#nidMDU" embed="true"></blockquote>

The 'embed' attribute is necessary so that we can differentiate normal, non-Purple Include quotes and blockquotes from ones that want the Purple Include magic.

If 'embed' is present and true, then the user agent should take the URI given in 'cite', follow it, and either grab the whole thing or a fragment and inline it into the quote or blockquote, replacing that element's older contents. If 'embed' is false or not present, the user agent should do nothing and simply display what is already inside the quote or blockquote.
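Here's a rough sketch of that decision in JavaScript, assuming the element's attributes have already been read into a plain object. The function name `shouldEmbed` and the attribute-object shape are made up for illustration and are not part of the spec; treating only the string "true" (in any letter case) as true is one possible reading of "present and true".

```javascript
// Hypothetical helper: decide whether a Q or BLOCKQUOTE should be treated
// as a Purple Include. `attrs` is a plain object of attribute name -> value.
function shouldEmbed(tagName, attrs) {
  const tag = String(tagName).toLowerCase();
  if (tag !== "q" && tag !== "blockquote") return false;

  // Without a 'cite' URI there is nothing to fetch.
  const cite = attrs.cite;
  if (typeof cite !== "string" || cite.trim() === "") return false;

  // Assumption: only the literal value "true" switches the include on.
  const embed = attrs.embed;
  return typeof embed === "string" && embed.trim().toLowerCase() === "true";
}
```

When this returns false the user agent leaves the element's existing contents alone.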

For the above BLOCKQUOTE example, once the Purple Include has happened here is what the markup would look like after the remote content was inlined:


<blockquote cite="http://www.eekim.com/blog/2007/06/21/networkedtoolsemail#nidMDU"
 embed="true">
<p>
<a name="nidMDU" id="nidMDU"></a>

What is a collaborative tool?  It's a tool that facilitates
collaboration.  Certainly, a shared authoring tool like a Wiki has
affordances that facilitate collaboration.  But a plain old text
editor is just as legitimately a collaboration tool, because it can
also be used to facilitate collaboration (for example, when used on a
<a href="http://www.eekim.com/cgi-bin/wiki.pl?SharedDisplay" class="wikiword">SharedDisplay</a>). 

<a class="nid" title="MDU" href="http://www.eekim.com/blog/2007/06/21/networkedtoolsemail#nidMDU">(MDU)</a>
</p>
</blockquote>

And here is what that Purple Include actually looks like using the JavaScript Purple Include (under the covers using the older syntax since I don't support this spec fully yet):

If the user agent is visible to the user (i.e. a browser), it should display a spinning ball inside the element while grabbing the remote resource. Feel free to use this one, which is one of the free Ajax spinner balls around the net that are free to use for any purpose:

If the user agent is again visible to the user, the user agent should set a 'class' value on the quote or blockquote so that document writers can add nifty CSS to style the Purple Include based on whether it succeeded or not. If the Purple Include succeeds, the quote or blockquote should be given the classes "included" and "include_ok". If it fails, it should be given the CSS class name "include_error".

If there was an error, an error message should be inlined into the element's value so that users can see what happened and perhaps address the error.
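A minimal sketch of the class and error handling described above, assuming the fetch result has been boiled down to a small object. The names `applyIncludeResult` and `escapeText` and the `{ ok, html, message }` shape are illustrative assumptions, not something the existing JavaScript exposes.

```javascript
// Escape an error message so it shows up as text, not markup.
function escapeText(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Compute the new 'class' value and contents for a quote or blockquote.
// `result` is { ok: true, html: "..." } or { ok: false, message: "..." }.
function applyIncludeResult(existingClass, result) {
  // Drop class names left over from an earlier include attempt.
  const stale = ["included", "include_ok", "include_error"];
  const kept = String(existingClass || "")
    .split(/\s+/)
    .filter((name) => name !== "" && !stale.includes(name));

  if (result.ok) {
    return {
      className: kept.concat(["included", "include_ok"]).join(" "),
      content: result.html,
    };
  }
  return {
    className: kept.concat(["include_error"]).join(" "),
    content: escapeText(result.message),
  };
}
```

A page author can then style `.include_ok` and `.include_error` differently without caring how the include was performed.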

Being able to grab portions of remote documents is the real magic of Purple Include. Existing server-side templating systems grab entire documents, which have limited utility in the real world when it comes to discourse and annotation. Let's go over the kinds of URIs that can be inside the 'cite' attribute.

First things first: only HTML, XHTML, and XML are supported for remote resources right now, i.e. the following MIME types:

text/html
application/xhtml+xml
text/xml
application/xml

Any other MIME types must throw an error, writing the error into the element's value so that the user can see what happened and possibly changing the 'class' value as described above if this is a user-facing user agent.
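As a sketch, the check might look like the following, assuming the user agent has the raw Content-Type header value in hand. The function name `checkMimeType` and the wording of the error message are assumptions made for this example.

```javascript
const SUPPORTED_MIME_TYPES = [
  "text/html",
  "application/xhtml+xml",
  "text/xml",
  "application/xml",
];

// Return the bare MIME type if it is supported, otherwise throw.
function checkMimeType(contentTypeHeader) {
  // Drop parameters such as "; charset=utf-8" and normalize case.
  const mime = String(contentTypeHeader || "")
    .split(";")[0]
    .trim()
    .toLowerCase();
  if (!SUPPORTED_MIME_TYPES.includes(mime)) {
    throw new Error("Purple Include: unsupported MIME type '" + mime + "'");
  }
  return mime;
}
```

The caller would catch the error and write its message into the element as described above.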

Before applying any of the addressing schemes below, the user agent should transform HTML into XHTML by using a tidy program, whether tidy, JTidy, or another language specific tidy library. [Note: would it help to provide some of the options I give to JTidy?]

Now we are ready to look at the different kinds of URIs that can be inside the 'cite' attribute and what a user agent should do with them:

If you just give a full URI with no anchor (ex: http://codinginparadise.org/paperairplane), the user agent should grab the entire remote resource. If the remote resource is HTML/XHTML, it should return just the BODY tag and its children (in XPath notation this would be /html/body). If the resource is XML, it should return the root of the XML document plus its children.

If you have a full URI with a simple anchor name (ex: http://www.eekim.com/blog/2007/06/21/networkedtoolsemail#nidMDU), then we have some special behavior since these could be Purple Numbers.

Purple Numbers are supported by some publishing systems, and cause a unique anchor to be placed onto each paragraph of the page so that you can grab just specific parts. This makes it easy to bookmark and point at specific parts of a given document. In addition, even if this isn't a Purple Number and just an anchor name, generally anchors have no children -- the intent of someone making an anchor is to make its parent have a unique address. Here's an example:

<h1><a name="first_section"></a>The First Section</h1>

If we just made the behavior of Purple Include 'dumb' and simply grabbed the anchor and returned its children, this is probably not the right thing to do. Instead, we have some special logic here:

Find an anchor tag with the given name or ID. If found, grab its immediate parent and return that.
Find a paragraph with the given name or ID. If found, return that paragraph.

Here's what that logic looks like in XPath if you end up using XPath on your server-side to do this:

//a[@name=anchorName or @id=anchorName]/..
|
//p[@name=anchorName or @id=anchorName]

[Note: is the special logic here worth the complexity? Should we just drop it? It's nice for end-users because they can quickly grab specific anchors, however]
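If you aren't using XPath, the same lookup can be sketched as a plain tree walk. The example below assumes a simplified node shape, `{ tag, attrs, children }` with text nodes as strings, rather than a real DOM, and the name `resolveAnchor` is made up. It tries anchors first and falls back to paragraphs, following the order of the two steps above; a strict reading of the XPath union would instead return every match in document order.

```javascript
// Node shape assumed for this sketch: { tag, attrs, children }, with text
// nodes represented as plain strings.
function resolveAnchor(root, anchorName) {
  let paragraphMatch = null;

  function matches(node, tag) {
    return node.tag === tag &&
      (node.attrs.name === anchorName || node.attrs.id === anchorName);
  }

  // Pre-order walk. Returns the parent of the first matching anchor, and
  // remembers the first matching paragraph seen along the way.
  function walk(node, parent) {
    if (typeof node === "string") return null;
    if (matches(node, "a") && parent !== null) return parent;
    if (paragraphMatch === null && matches(node, "p")) paragraphMatch = node;
    for (const child of node.children) {
      const found = walk(child, node);
      if (found !== null) return found;
    }
    return null;
  }

  const anchorParent = walk(root, null);
  return anchorParent !== null ? anchorParent : paragraphMatch;
}
```

For the H1 example above, asking for "first_section" returns the whole H1 rather than the empty anchor inside it.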

Now we get to the fun part: addressing schemes. After the anchor you can give an addressing scheme, such as:

http://codinginparadise.org/paperairplane#quote(What if community and editing...research and coding efforts can steer towards.)

Address schemes always have the form:

#address(input to address type)

These address schemes are meant to grab portions of the resource. Right now there are only two schemes, a quote() scheme and an xpath() scheme. The only scheme that must be supported right now is the quote() scheme, since the xpath() scheme has shown itself to be of limited usability.

The quote() scheme always has the following form:

#quote(Start of quote...End of quote)

where I give the start of the quote, followed by three dots, followed by the end of the quote.
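Pulling a 'cite' value apart into those pieces might look roughly like this. It is only a sketch: `parseCite` and the shape of its return value are assumptions, it splits the quote at the first run of three dots, and it ignores URL-encoding and backslash escapes.

```javascript
// Split a 'cite' URI into the resource and an optional address.
function parseCite(cite) {
  const hashIndex = cite.indexOf("#");
  if (hashIndex === -1) return { resource: cite, kind: "whole" };

  const resource = cite.slice(0, hashIndex);
  const fragment = cite.slice(hashIndex + 1);

  // "#name(input)" is an address scheme; anything else is a plain anchor.
  const scheme = /^([A-Za-z]+)\(([\s\S]*)\)$/.exec(fragment);
  if (scheme === null) return { resource, kind: "anchor", name: fragment };

  if (scheme[1] === "quote") {
    // Assumption: the first "..." separates the start and end strings.
    const dots = scheme[2].indexOf("...");
    if (dots === -1) throw new Error("quote() needs 'start...end'");
    return {
      resource,
      kind: "quote",
      start: scheme[2].slice(0, dots),
      end: scheme[2].slice(dots + 3),
    };
  }
  return { resource, kind: scheme[1], input: scheme[2] };
}
```

A URI with no anchor comes back as kind "whole", a simple anchor as kind "anchor", and anything of the form name(input) as that scheme with its raw input.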

For example, if I had the following markup:

<p>What
if community and editing were a central and transparent part of the
web and browsers?</p>
<p>What
if the web was extremely integrated for usability, with instant
messaging, site creation, the web server, and more all integrated
into one whole?</p>
<p>What would this web look like if despite being
integrated it was massively decentralized on a peer-to-peer network,
able to exist and run without businesses or governments?</p>

and I wanted to grab some of the first question and some of the last question, shown in bold:

<p>What
if community and editing were a central and transparent part of the
web and browsers?</p>
<p>What
if the web was extremely integrated for usability, with instant
messaging, site creation, the web server, and more all integrated
into one whole?</p>
<p>What would this web look like if despite being
integrated it was massively decentralized on a peer-to-peer network,
able to exist and run without businesses or governments?</p>

I might do the following:

#quote(were a central and transparent part...massively decentralized)

This would cause the following fragment to be returned:

<p>were a central and transparent part of the
web and browsers?</p>
<p>What
if the web was extremely integrated for usability, with instant
messaging, site creation, the web server, and more all integrated
into one whole?</p>
<p>What would this web look like if despite being
integrated it was massively decentralized</p>

Notice that the HTML is correct; we don't just grab the substring and return incorrect HTML. 'Dumb' return results would look like this:

were a central and transparent part of the
web and browsers?</p>
<p>What
if the web was extremely integrated for usability, with instant
messaging, site creation, the web server, and more all integrated
into one whole?</p>
<p>What would this web look like if despite being
integrated it was massively decentralized

This would be useless, since we would 99% of the time get bad markup and no one would want to use this system.

Now, doing this right is hard, which is why I suggest you cheat: let someone else do the hard work. Specifically, on the server-side portion of the JavaScript Purple Include, for example, I use the Xerces DOM Range support. The DOM Range spec (and Xerces implementation) lets you specify a range that might cut across various elements of an HTML document. You simply set the beginning of the range, then the end of the range. This is what I do on the server-side, and I suggest you do the same since the DOM Range stuff will do all the hard work of correctly closing start and end tags that cut across ranges. It transforms what would probably be several weeks of work into several hours of work (which is how long it took me to do the quote() scheme myself). Once you have the range, you can simply ask it to give you its contents using cloneContents(), which will have everything correctly set up in the markup.

Here's another tip that will make things easier to implement. On the server-side I also use the DOM Traversal functionality of Xerces to grab all the text and CDATA nodes, and then just iterate over all of these to find the start and end strings. The DOM Traversal stuff is another nifty spec that lets you grab just some type of nodes.

One tricky thing you will need to keep in mind is that the start and end strings might fall across different text nodes so you should match strings that fall across node boundaries, and also remember to turn off hidden whitespace so you match correctly. Man I wish I could just give you the code (which can be viewed here in the QuoteAddress class), but unfortunately due to its heritage it is under a GPL license [Note: Eugene, can we just relicense this all as BSD code?]. Viral licenses are a pain in the butt. Studying the algorithm should be ok [Note: is that correct? Studying can't also be viral].
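To make the node-boundary problem concrete, here is an independent sketch (not the QuoteAddress code, and not derived from it) that works on a plain array of text-node strings in document order. It assumes whitespace runs should be collapsed to a single space before matching, and that the end string is searched for only after the start match; the function names are invented for the example.

```javascript
// Collapse whitespace the same way for needles and for document text.
function normalizeSpace(text) {
  return text.trim().replace(/\s+/g, " ");
}

// Join all text nodes into one normalized string. origin[i] records which
// node and offset produced character i of that string.
function buildTextIndex(textNodes) {
  let text = "";
  const origin = [];
  let lastWasSpace = true; // drops leading whitespace
  textNodes.forEach((value, node) => {
    for (let offset = 0; offset < value.length; offset++) {
      const isSpace = /\s/.test(value[offset]);
      if (isSpace && lastWasSpace) continue; // collapse whitespace runs
      text += isSpace ? " " : value[offset];
      origin.push({ node, offset });
      lastWasSpace = isSpace;
    }
  });
  return { text, origin };
}

// Find the first start match, then the first end match after it. Returns
// node/offset pairs (end offset is exclusive), or null if either is missing.
function locateQuote(textNodes, startText, endText) {
  const start = normalizeSpace(startText);
  const end = normalizeSpace(endText);
  if (start === "" || end === "") return null;
  const { text, origin } = buildTextIndex(textNodes);
  const startAt = text.indexOf(start);
  if (startAt === -1) return null;
  const endAt = text.indexOf(end, startAt + start.length);
  if (endAt === -1) return null;
  const last = origin[endAt + end.length - 1];
  return {
    start: origin[startAt],
    end: { node: last.node, offset: last.offset + 1 },
  };
}
```

The two node/offset pairs are the kind of boundary points a DOM Range takes in setStart() and setEnd(), after which cloneContents() does the tag-closing work.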

If you manage to do all this successfully (and without bugs) using SAX, good luck. Send me the code when you do if it is Java so I can replace what I have.

Some final notes about the quote() scheme: if you want to use a parenthesis in the quote scheme, just backslash it:

#quote(This is some text\(...and here is some text that ends with a parenthesis\))

Also, you should scan from the top of the document to the bottom in the same way that a human would read the document (i.e. traverse the document using pre-order traversal). This means that if someone gives a start or end string that has multiple matches, the one that will be found is the one that occurs first in the document.
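Reading the input of a quote() address while honoring backslash-escaped parentheses could be sketched like this. The function name `readQuoteInput` is an assumption, and so is the choice to end the scheme at the first unescaped closing parenthesis.

```javascript
// Read the input of "quote(...)" up to the first unescaped ')'.
// `fragment` is the part of the URI after the '#'.
function readQuoteInput(fragment) {
  const prefix = "quote(";
  if (!fragment.startsWith(prefix)) return null;
  let out = "";
  for (let i = prefix.length; i < fragment.length; i++) {
    const ch = fragment[i];
    const next = fragment[i + 1];
    if (ch === "\\" && (next === "(" || next === ")")) {
      out += next; // escaped parenthesis becomes a literal one
      i++;
    } else if (ch === ")") {
      return out; // unescaped ')' closes the scheme
    } else {
      out += ch;
    }
  }
  return null; // the scheme was never closed
}
```

The unescaped string is what gets split at the three dots and matched against the document text.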

I mentioned that there is an xpath() scheme. You don't need to support this, since it turned out to not be useful for the majority of users (the quote scheme is much more usable), but it can be fun to have for more obscure and advanced usage. If you want to get some extra credit and implement it, the one thing to keep in mind is that it must be able to support XPath version 2 and not just XPath version 1. What this means in practice is that your user-agent must use an XPath version 2 parser that can handle things like the following:

http://codinginparadise.org/paperairplane#xpath(for $i in (4 to 10) return //p[$i])

The reason for this is that XPath version 2 has some extra functionality that makes being able to have an xpath() scheme actually useful, like the 'for' loop above, while XPath version 1 is just too limited to be useful for this use case.

After doing the above, you should filter what you return to the client to prevent XSS (cross-site scripting) attacks carried in the returned content. You should:

Strip out SCRIPT blocks
Strip out javascript: URLs
Strip out eval() values in inline CSS

Everything else should be left in the returned values. [Note: should we get into more details on stripping out SCRIPT blocks and javascript: urls since there is a little trickery here?]. One of the chief reasons Firefox would never include, um, inclusions was because of XSS attacks, but since the nifty algorithm above helps to prevent them... maybe this stuff will show up in teh (yes, teh) browser.
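A sketch of those three rules applied to an already-parsed tree, under some loud assumptions: the node shape `{ tag, attrs, children }` (text nodes as strings) is a stand-in for a real DOM, the name `sanitize` is invented, and the checks cover only what the list above names, so this is an illustration rather than a complete XSS filter.

```javascript
// Node shape assumed: { tag, attrs, children }, text nodes as strings.
// Returns a filtered copy; the input tree is not modified.
function sanitize(node) {
  if (typeof node === "string") return node;

  // Rule 1: strip SCRIPT blocks entirely.
  if (node.tag.toLowerCase() === "script") return null;

  const attrs = {};
  for (const [name, value] of Object.entries(node.attrs)) {
    // Remove whitespace and control characters, which can be used to
    // disguise a URL scheme, before checking for "javascript:".
    const compact = String(value).replace(/[\u0000-\u0020]/g, "").toLowerCase();

    // Rule 2: strip javascript: URLs, whichever attribute carries them.
    if (compact.startsWith("javascript:")) continue;

    // Rule 3: strip inline CSS that contains eval().
    if (name.toLowerCase() === "style" && /eval\s*\(/i.test(String(value))) continue;

    attrs[name] = value;
  }

  const children = node.children
    .map(sanitize)
    .filter((child) => child !== null);
  return { tag: node.tag, attrs, children };
}
```

Working on the parsed tree instead of the raw markup sidesteps most of the string-matching trickery around SCRIPT tags.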

Since the current web does not allow sites to easily work cross-domain, the JavaScript Purple Include has a server-side that 'proxies' all this stuff. The JavaScript Purple Include defaults to my web site, codinginparadise.org (I know I'm going to regret that some day... or maybe have a happy weblogs.com payoff). You can change this with a META tag; if your user agent does something similar, you should have the same META tag:

<meta name="purple.include.addressService" content="http://brad.com:8000/purple_include/"></meta>

One final note for server-side folks; remember that when you are working with a Q or BLOCKQUOTE tag that you should automatically add quotes to Q elements. In fact, here is what the HTML spec says about these using a Purple Include:

Whew, there you go; you kids have fun. :)

Labels: purple include, spec

// posted by Brad Neuberg @ Monday, December 31, 2007 5 Comments Links to this post

Saturday, December 29, 2007

Niall's Suggestion

Niall Kennedy gave a great suggestion in the comments about using HTML's standard BLOCKQUOTE and Q tags along with the 'cite' attribute for Purple Include. Modifying what he suggested a bit, this is what it would look like:

<q cite="http://codinginparadise.org/paperairplane#quote(Start of range...End of range)"></q>

He also provided a test page that showed some of what he was thinking (BTW, he ran into a known bug in Purple Include where if you include something from the same page twice the second time you just see the spinning icon -- I need to fix this).

Niall suggested using the 'title' attribute to hold the span; I want to keep the full address in the URL 'cite' attribute however for several reasons. One, it really is a full URL: the anchor at the end actually specifies a range within the document. Just as everything to the left of the anchor specifies a given file/resource to grab, everything after the anchor is an infile-address.

This will become more clear in later iterations as I tie this into granular addressability where you can "jump through" the quote into the larger document, causing the browser to scroll to this quoted text; creating "out-of-band" links where you can annotate and create links inside a document without having to change it even while someone is looking at it by specifying the ranges of where the links should be; and other fun hypertext geekiness/madness.

A lot of this hypertext work is just to have fun and see how far we can push the hypertext model on the web, finding out what stands and can be useful and what falls over, similar to how I was pushing Ajax in weird/new directions around history, storage, offline, etc.

Soon I'll roll Niall's suggestions in and remove the 'href' attribute I was using before. Thanks for the great suggestion Niall!

Labels: purple include

// posted by Brad Neuberg @ Saturday, December 29, 2007 5 Comments Links to this post

Friday, December 28, 2007

New Purple Include Release

I've put up a new release of Purple Include, iteration 3. Purple Include is a client-side JavaScript library that allows you to do client-side transclusions.

What the heck does that mean?

It means that you can include and display fragments of one HTML page in another without copying and pasting any content.

To use Purple Include, just add the following SCRIPT tag to the top of your HTML page:

<script src="http://codinginparadise.org/projects/purple-include/purple-include.js"></script>

The big new feature of this release is a new addressing scheme that is much easier to use, named 'quote'. Here's an example:

<div href="http://docs.google.com/View?docid=dhkhksk4_8gdp9gr&pli=1#quote(In the background Dojo Offline is checking...when they are on- or off-line so you don't have to)"></div>

Notice the 'href' attribute; this will cause Purple Include to fetch the given page and inline it into that DIV (you can add the href attribute to DIVs, SPANs, BLOCKQUOTEs, Ps, and more). Also notice the #quote() at the end. This is the new addressing scheme in this release; you can now quote and grab a range of any remote page by just having a few words from the page, followed by three dots, followed by the end of the text to grab. In the example above we have:

#quote(In the background Dojo Offline is checking...when they are on- or off-line so you don't have to)

This goes to the Dojo Offline tutorial, starts grabbing HTML and text from the beginning of 'In the background Dojo Offline is checking' and continues doing so until it encounters 'when they are on- or off-line so you don't have to'. Purple Include has proper HTML range support, which means that nested tags are correctly set up when they are returned and inlined into your page.

Here is the example actually running; after a few seconds you should see the content in-line (if you are reading this in an RSS feed click here to see the example):

The older XPath addressing scheme is still there for advanced usage; the quote syntax is just much simpler and has true range support. Details on the XPath scheme here.

The other big change in this release is that I have refactored the server-side addressing system to be truly generic, so that it is much easier to drop in new addressing schemes. For example, we could create a line() scheme to grab ranges of lines from source code files.

Please note that this software is still in development. In particular, the quote() scheme needs to be tested more, and I haven't done performance work on large documents yet. I wanted to 'release early release often' on this before getting it perfect.

You don't need to host anything to use this. If you want to run your own server, however, here is the ZIP file for you to download. In your HTML you will need to point Purple Include at your custom server using a META tag. See tests/example.html for details on how to do this at the top. The client-side is BSD licensed, the server-side is GPL.

Labels: announcement, engelbart, hypertext, hypertext geekery, open source, purple include, release

// posted by Brad Neuberg @ Friday, December 28, 2007 5 Comments Links to this post

Friday, August 17, 2007

Purple Include Update

I integrated Purple Include and Blogger
I pushed a small new update today that fixes some address parsing bugs for complex XPath 2.0 expressions -- these are automatically on the server, so you don't have to do anything. I also updated the JavaScript to ease the Blogger integration a bit (added some extra styling to the spinner image and a class name)
I forgot to mention in my post yesterday that Purple Include can now take XPath 2.0 expressions (the old one could only do XPath 1.0 stuff)
Purple Include works on the iPhone!

Labels: announcement, hypertext, hypertext geekery, purple include

// posted by Brad Neuberg @ Friday, August 17, 2007 2 Comments Links to this post

Purple Include Test of Blogger Integration

This is a test post using a Purple Include to get it working on my blog. Here is a paragraph from my Paper Airplane, um, paper from a few years ago:

You should see some paragraphs above; all I had to do was add a script tag to the top of my Blogger template:

<script src="http://codinginparadise.org/projects/purple-include/purple-include.js"
type="text/javascript"></script>

I also added some default styles to my blog template; Purple Include automatically adds class names, both when it embeds something and when there is an error, that we can style against:

<style>
.included{ display: block; padding-left: 2em; padding-right: 2em; background-color: #486F6F; }
.include_error{ display: block; background-color: red; color: black; }
.include_roller{ border: none; }
.included p{ margin-top: 1em; margin-bottom: 1em; }
</style>

Here's the tag I added above to include things:

<div href="http://codinginparadise.org/paperairplane#xpath(for $i in (4 to 10) return //p[$i])"></div>

The XPath expression is an XPath 2.0 expression:

for $i in (4 to 10) return //p[$i]

This basically just returns a range of paragraph nodes, from the 4th to the 10th one, which is the Introduction part of the Paper Airplane paper.

Labels: hypertext geekery, purple include

// posted by Brad Neuberg @ Friday, August 17, 2007 3 Comments Links to this post

Wednesday, August 15, 2007

New Purple Include Release: Include Pieces of Web Pages Like You Would Images

Hi everyone; I'm back from my sabbatical and raring to go. The first thing I wanted to hack on was to take the cool Purple Include work that Jonathan, Eugene, and I hacked together last month to a new version. I'm proud to announce another release of Purple Include, version 1.9 -- as you will see below, this is a major refactoring that simplifies working with the library and now works across all major browsers.

Purple Include is a client-side JavaScript library that allows you to do client-side transclusions.

What the heck does that mean?

It means that you can include and display fragments of one HTML page in another without copying and pasting any content. For example, you could quote the second paragraph from another person's blog entry by embedding something like:

<div href="http://foo.com/bar.html#xpath(/p[2])"></div>

in your blog page. The expression inside xpath() at the end of the URL is an XPath expression.

If the page you want to transclude has a fragment identifier or a purple number, you can transclude that directly:

<div href="http://foo.com/purple.html#nid32"></div>

In fact, all you have to do is add an 'href' attribute to any of the following types of HTML tags in order to have that URL transcluded right into the page when the page loads:

<p href="http://foobar.com#nid32"></p>

<blockquote href="http://foobar.com#xpath(/p[2])"></blockquote>

<div href="includeme.html#foobar"></div>

<span href="../../relativefile.html#foobar"></span>

<q href="http://foobar.com#foobar"></q>

Here's the great thing about this new release -- there's nothing to install! We do some magic (see the Release Notes below to see how) to make it so that you have absolutely no server to install. I host everything on my webserver now, at codinginparadise.org, even the inclusion service and JavaScript, so all you have to do is add the following JavaScript to the top of your page:

<script src="http://codinginparadise.org/projects/purple-include/purple-include.js"></script>

Plus, Purple Include now works across Safari, Internet Explorer, and Firefox (Opera probably too -- I just haven't tested). The client-side JavaScript is now just about 9K.

See the example page for examples and usage. Also see the README file. You can view the JavaScript file as well if you want.

This is all beta stuff, so as usual, if you see bugs, tell me -- or even better, fix it ;)

RELEASE NOTES for August 15th, 2007 Purple Include 1.9

This is a pretty radical refactoring of the Purple Include code. The big highlights in this version:

* Now works cross-browser: Internet Explorer, Firefox, and Safari (Opera should work but has not been tested)

* There are no server-side requirements anymore. Instead, the inclusion service is hosted on my web site at codinginparadise.org and we do a trick in the JavaScript (the JSONP/Script tag trick) in order to do transclusions through a third-party web site. All you have to do is drop the JavaScript into your page and start using it.

* We no longer have an <hx:include> tag; instead, you can simply add an 'href' attribute to many different HTML element types and have that element transclude its contents:

<p href="http://codinginparadise.org/paperairplane#xpath(//*[@id='table_of_contents'])"></p>

This works for the following tags: P, BLOCKQUOTE, Q, PRE, DIV, SPAN

* The client-side script has gotten vastly smaller and simpler -- the script is now only about 9K.

* We now use the notation #xpath(//p) around an XPath expression rather than using an exclamation point, such as #xpath!//p. This is in keeping with the pseudo-standard that has developed around this practice, such as #xpointer() -- it also opens up the possibility of chaining together expressions in the future, such as #xpath(//expression1)xpath(//expression2), which would return the results of both expressions.

* The little roller image, roller.gif, used to be a pain in the butt to configure because it was always relative to the page you are using transclusions on, which therefore required you to specify a relative path or something using a META tag. To cut down on configuration, I now just host this image on my web server -- you can set it to your own path using the META tag 'purple.include.rollerURL' but we default it inside the code to my webserver. This means that all you have to do now to use Purple Include is have the JavaScript file -- that's it.
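For readers who haven't seen the JSONP/script tag trick mentioned in the release notes above, here is a generic sketch of the idea. It is not the actual purple-include.js code: the query parameter names `url` and `callback`, the callback naming, and the function names are all assumptions, since the post doesn't spell out the inclusion service's request format.

```javascript
let purpleCallbackCounter = 0;

// Build the script URL. The parameter names 'url' and 'callback' are
// assumptions for this sketch, not the real service's API.
function buildIncludeUrl(serviceUrl, citeUri, callbackName) {
  const separator = serviceUrl.includes("?") ? "&" : "?";
  return serviceUrl + separator +
    "url=" + encodeURIComponent(citeUri) +
    "&callback=" + encodeURIComponent(callbackName);
}

// Inject a SCRIPT tag whose response is expected to call the named global
// function with the included content. `doc` is the page's document object.
function requestInclude(doc, serviceUrl, citeUri, onResult) {
  const callbackName = "purpleIncludeCallback" + purpleCallbackCounter++;
  globalThis[callbackName] = (result) => {
    delete globalThis[callbackName]; // one-shot callback
    onResult(result);
  };
  const script = doc.createElement("script");
  script.src = buildIncludeUrl(serviceUrl, citeUri, callbackName);
  doc.head.appendChild(script);
  return callbackName;
}
```

Because SCRIPT tags can load from any domain, the third-party server can answer with a small piece of JavaScript that calls the named function, which is what lets the page receive content from another site.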