Coding In Paradise: Tuesday, November 28, 2006

This is my personal blog. The views expressed on these pages are mine alone and not those of my employer.

Tuesday, November 28, 2006

HTML Transformer for HyperScope: Apply Advanced Hyperlinks to Normal HTML Documents

An HTML transformer for HyperScope is now finished and up! This transformer can dynamically take HTML documents and bring them into the HyperScope.

Why is this cool? Well, it means you can now use HyperScope's advanced hyperlinks with normal HTML documents. This is very useful for complicated documents, such as legal briefs, software specifications, and more.

For example, let's say we are having an email discussion and want to point at a specific section in the official XML specification. How would we have done this before HyperScope? The original author would have had to put HTML anchors into every part of the document we might want to talk about, which no one does.

With the XHTML transformer and HyperScope, this is as easy as this.

That URL points at a specific paragraph that we might need to talk about.

By the way, the transformer is a generic system for turning HTML into OPML, so other OPML hackers might find it useful for their own apps (this is all open source).

There is a bookmarklet available that you can drag to your links toolbar to use the HTML transformer; when browsing the web, you can press this button to suck the page into the HyperScope. There is also a web-based form that you can plug a URL into to transform. Both are available here.

For this release, the focus of the HTML transformer is technical specifications, in particular the ones at the W3C's site. Here are some example specs sucked into the HyperScope dynamically:

You can now apply HyperScope's tools to these documents, including studying tools and advanced addressing and hyperlinks.

Important: Note that the focus of the transformer for this release is the W3C documents at their website. I wrote the HTML transformer to be generic, so it will work well with many 'document'-oriented web pages. However, some pages will not work, and some pages will give errors. For this release the most important thing was getting the W3C docs to work right.

Here is an example of a 'normal', non-W3C web page being pulled into the transformer and HyperScope; this is the Paper Airplane research report.

I have also uploaded a new build of HyperScope to my webserver. There are some small bug fixes in here, along with better default viewspecs, such as completely showing the document on page load (viewspec g) rather than just the top-level outline (viewspec x), and having purple numbers on by default (m). You can download a ZIP file of this new release here.

Here are some technical details on the XHTML implementation:
The core of the HTML transformer is an XSLT stylesheet; I used Les Orchard's XOXO-to-OPML stylesheet as the foundation for getting started, so many props to him!

There are three big challenges the XSLT stylesheet had to tackle:

1) HTML does not show hierarchy through nesting; instead, hierarchy is implied by sibling order, such as an H1 element followed by a P element. This is tricky to code for.
2) Folks use different idioms in their HTML to indicate hierarchy; the W3C site in particular uses different ways to indicate hierarchy across different specs - even they don't write clean semantic markup.
3) Most of the specifications on the W3C site are *huge*. For example, the XSLT spec is 1.5 megabytes of text! Getting this to be fast was a challenge.
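
To make the first challenge concrete, here is a minimal sketch in Python of recovering nesting from flat sibling markup. It is not the XSLT stylesheet the transformer actually uses; the names `OutlineBuilder` and `build_outline` and the choice of which tags count as body text are assumptions made for illustration, and it ignores the W3C-specific idioms mentioned above.

```python
from html.parser import HTMLParser

# Assumption: only h1-h6 define sections, and only these tags hold body text.
HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}
BLOCK_TAGS = {"p", "li", "pre"}


class OutlineBuilder(HTMLParser):
    """Turn flat heading/paragraph siblings into a nested outline."""

    def __init__(self):
        super().__init__()
        self.root = {"text": "", "level": 0, "children": []}
        self.stack = [self.root]  # open sections, outermost first
        self.current = None       # node currently collecting text

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_LEVELS:
            level = HEADING_LEVELS[tag]
            # Close every open section at the same or a deeper level.
            while self.stack[-1]["level"] >= level:
                self.stack.pop()
            node = {"text": "", "level": level, "children": []}
            self.stack[-1]["children"].append(node)
            self.stack.append(node)
            self.current = node
        elif tag in BLOCK_TAGS:
            # Body text becomes a leaf under the nearest open heading.
            node = {"text": "", "level": None, "children": []}
            self.stack[-1]["children"].append(node)
            self.current = node

    def handle_endtag(self, tag):
        if self.current is not None and (tag in HEADING_LEVELS or tag in BLOCK_TAGS):
            # Collapse runs of whitespace before closing the node.
            self.current["text"] = " ".join(self.current["text"].split())
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data


def build_outline(html):
    """Return the top-level outline nodes for an HTML string."""
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return builder.root["children"]
```

Feeding it `<h1>Intro</h1><p>First.</p><h2>Scope</h2><p>Second.</p>` gives one top-level node for "Intro" whose children are the first paragraph and a nested "Scope" section. The stack of open sections is what turns "an H2 that comes after an H1" into "an H2 inside an H1".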

The basic process behind the transformer is:
1) Make sure the URL requested is safe
2) Fetch the HTML contents of the URL
3) Clean up the HTML and convert it into XHTML
4) Shoot the XHTML into the XSLT stylesheet to get our OPML
5) Return the OPML to the web browser
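
The steps above can be sketched roughly as follows in Python. This is an illustration only, not the transformer's code: the post does not say what "safe" means, so the policy in `is_safe_url` (http/https only, no localhost or private addresses) is an assumption, and `transform`, `fetch`, `tidy`, and `to_opml` are hypothetical names for the stages.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_safe_url(url):
    """Step 1, under an assumed policy: http(s) only, no local targets."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    host = parts.hostname
    if host == "localhost":
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A hostname rather than a literal IP; DNS is not checked here.
        return True
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)


def transform(url, fetch, tidy, to_opml):
    """Run steps 1-5; each stage is passed in as a plain function."""
    if not is_safe_url(url):
        raise ValueError("refusing to fetch unsafe URL: " + url)
    html = fetch(url)       # step 2: fetch the HTML contents
    xhtml = tidy(html)      # step 3: clean up and convert to XHTML
    return to_opml(xhtml)   # steps 4-5: produce OPML for the caller to return
```

Passing the stages in as functions keeps the sketch runnable without a network connection, and it mirrors how each stage could be swapped for a faster or more tolerant implementation.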

I initially thought this process would take about a month to code, but it took me about a month and a half. The two hard things were performance and reliability. The W3C specs are huge, and they are written in different ways, which made the hierarchy tricky to capture in some cases.

I initially wrote the steps above in PHP, but found that the XSLT engine inside of it was much too slow for some of the W3C specs. I benchmarked XSLT engines, and found that the Java-based Saxon engine is very fast for us, and therefore rewrote things in Java as a servlet that internally uses Apache HTTP-Client to fetch the resource; JTidy to clean it up and turn it into XHTML; and Saxon to format the results into OPML.
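
For readers who have not seen the target format, OPML is essentially nested outline elements with a text attribute. Here is a small Python sketch of emitting it from a nested structure; it is not the Saxon/XSLT code described above, and the function name `outline_to_opml`, the dictionary shape, and the `version="2.0"` attribute are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def outline_to_opml(title, nodes):
    """Serialize nested {'text', 'children'} dicts as an OPML string."""
    opml = ET.Element("opml", {"version": "2.0"})  # version is an assumption
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")

    def add(parent, node):
        # Each node becomes an <outline>; its children nest inside it.
        element = ET.SubElement(parent, "outline", {"text": node["text"]})
        for child in node.get("children", []):
            add(element, child)

    for node in nodes:
        add(body, node)
    return ET.tostring(opml, encoding="unicode")
```

Calling `outline_to_opml("Spec", [{"text": "1 Intro", "children": [{"text": "A paragraph"}]}])` yields an `opml` document whose `body` holds one `outline` element with a nested child.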

You can download the WAR file for the transformer yourself if you want to install a local version on your web server. You must have Tomcat 5, Java 1.4, and Apache 1.x or 2.x. Simply drop the xhtml_transformer.war file into your Tomcat's webapps directory. You should also have Tomcat and Apache integrated so that Apache delegates to Tomcat when it needs to load a resource, rather than having you contact Tomcat directly at an address such as http://myserver.com:8080 (where 8080 is Tomcat's port).

Over the next few days I'm going to be hitting some small housekeeping tasks, such as bringing the HyperScope design document up to date and putting known bugs into a bug database. Unfortunately, this is the last paid week of development we have on HyperScope right now; all of us are working to put together a grant for another phase of development.

// posted by Brad GNUberg @ Tuesday, November 28, 2006
