Java: A platform for platforms
Sun's reorg may seem promising to shareholders but it's also a scramble for position. The question now is whether Sun can, or wants to, maintain its hold on Java technology. Especially with enterprise leaders like SpringSource and RedHat investing heavily in Java's future as a platform for platforms

Also see:

Discuss: Tim Bray on 'What Sun Should Do'

Featured Whitepapers
Newsletter sign-up
View all newsletters

Sign up for our technology specific newsletters.

Enterprise Java
Email Address:

Access the world's biggest database with Web DataBase Connectivity

Use SQL to query Web-based information

  • Digg
  • Reddit
  • SlashDot
  • Stumble
  • del.icio.us
  • Technorati
  • dzone
What's your hobby? Mine is amateur investing and the Web helps me a lot. I can log on to one of my favorite financial information sites and look up the latest share prices, or I can log on to a different site to see the news items for the stocks I hold. I have a secret recipe for choosing shares, which I can't tell you, but suppose it's something like: "I will buy shares in a company whose price drops to its 52-week low and has appeared in the news this week." For the first part of that formula, how do I find all companies at their 52-week low? In the past, I viewed the information for each company individually and did a mental calculation. Now I simply type something like this:

SELECT ... AS name, ... AS price, ... AS yearLow
FROM http://www.asharesite.com/lookup
WITHPOST symbol=ALL
WHERE price=yearLow


Besides my hobby, I also have a business -- the Java business. Therefore, I like to stay informed of the latest Java news and views, which I can do at this site, which might be familiar to you (Figure 1):

Figure 1. The JavaWorld homepage. Click on thumbnail to view full-size image (69 KB).

You will see from the picture that today I'm interested in Enterprise Java Beans, so I've typed "ejb" into the search box. The JavaWorld search engine does its job and rewards me with Figure 2:

Figure 2. Search Results for "ejb." Click on thumbnail to view full-size image (68 KB).

The results appear in descending date order and -- although you can't see it in the picture -- the result set contains entries for prior months as well as the current month. Now I want to see the results in ascending date order rather than descending date order but with results prior to December omitted. If I had some software capable of running it, I could express exactly what I want to see with this piece of SQL:

SELECT ... AS linkText, ... AS description, ... AS date
FROM http://search.itworld.com:8765/query.html
WITHGET qt=ejb
WHERE date MATCHES '*Dec*'
ORDER BY date


Figure 3. SQL results. Click on
thumbnail to view full size image (13 KB).

Look at how I've placed the Webpage URL in the FROM clause, in the space where a table name would usually go. You will also notice a new keyword, WITHGET, which I've introduced to trigger a form submission. The result of running this SQL statement would be as shown in Figure 3. I could export the data to another application, a RDBMS or a spreadsheet program. If I issued the query from within a Java program -- using something like JDBC -- I could present the data in my own style, manipulate the data, and combine it with results from other sites. Imagine the possibilities.

That's enough for setting the scene. You should have a fairly good idea of what I'm proposing and why I think it's such a great idea, so now I'll tell you a bit more about this new variation on the SQL theme.

From HTML to SQL

Figure 4. Hierarchy of HTML elements. Click on thumbnail to view full-size image (29 KB).

As designed, SQL works with regular two-dimensional tables with data arranged neatly into rows and columns. Web-based information -- in the form of HTML -- just isn't like that, as you can see from the hierarchy of elements comprising the JavaWorld search page (Figure 4).

To move from the HTML hierarchy to the regular SQL table paradigm, you must think of SQL in a slightly different way. As a simple example, look at how I've transformed the following SQL statement that runs against an employee table with two columns:

 
> SELECT name, salary FROM employee WHERE name='Tony'
> SELECT row[*].column['name'], row[*].column['salary'] FROM employee WHERE row[*].column['name']='Tony'
> SELECT row[*].column[1] AS name, row[*].column[2] AS salary FROM employee WHERE name='Tony'


The final statement means exactly the same, semantically, as the first. It reads, "Select the first column (call it name) of any row and the second column (call it salary) of any row in the employees table where the name is equal to 'Tony.'"

Using this syntax I could propose a new query language called 3D-SQL that operates on tables with the traditional two dimensions plus an additional third dimension, called history. Each table cell is now addressable as row[x].column[y].history[z], which allows me to construct statements like:

> SELECT row[*].column[1].history[1] AS name, row[*].column[2].history[*] AS salary FROM employee WHERE salary>10000


The statement means, "Select the first historic element of any row's first column (call it name) and any historic element of any row's second column (call it salary) where the salary is greater than 10,000." In plain English it means, "Give me the names of those employees who have ever had a salary greater than 10,000."

Although I've placed the wildcard characters within row specifications (row[*]), I could reverse the sense and use column[*] to do something rather more difficult with traditional SQL: search across the table columns rather than down the table rows.

Now when you look at Figure 4, you might see it in a new light. It's not quite so frightening now, I hope, and I can return to my earlier query against this page and fill in the blanks (note the placement of the wildcard characters) to give:

SELECT .table[0001].tr[0000].td[0002].table[0002].tr[0000].td[0001].table[0000]
.tr[0001].td[0000].table[*].tr[0000].td[0000].htxt[0000] AS linkText,
.table[0001].tr[0000].td[0002].table[0002].tr[0000].td[0001].table[0000]
.tr[0001].td[0000].table[*].tr[0000].td[0000].text[0001] AS description,
.table[0001].tr[0000].td[0002].table[0002].tr[0000].td[0001].table[0000]
.tr[0001].td[0000].table[*].tr[0000].td[0001].table[0000].tr[0001].td[0000]
.text[0000] AS date
FROM http://search.itworld.com:8765/query.html
WITHGET qt=ejb
WHERE date MATCHES '*Dec*'
ORDER BY date


The proposed 3D-SQL name will have to go because we're now working in more than three dimensions, and since the lowest common denominator for these queries will be HTML, I'll now settle on the name HTMSQL (HyperText Markup Structured Query Language). I suspect that this SQL has a lot in common with the versions of SQL for object databases, but that's not my specialty.

Of course, coming up with these kinds of queries proves difficult if all you've got to work from is the Webpage as presented in a browser. Nor are we all fluent in raw HTML, so I propose some new tools.

  • Digg
  • Reddit
  • SlashDot
  • Stumble
  • del.icio.us
  • Technorati
  • dzone
Comment
Login
Forgot your account info?
Add comment
Anonymous comments subject to approval. Register here for member benefits.
Have a JavaWorld account? Log in here. Register now for a free account.
Resources