SELECT ... AS name, ... AS price, ... AS yearLow FROM http://www.asharesite.com/lookup WITHPOST symbol=ALL WHERE price=yearLow
Besides my hobby, I also have a business -- the Java business. Therefore, I like to stay informed of the latest Java news and views, which I can do at this site, which might be familiar to you (Figure 1):
Figure 1. The JavaWorld homepage. Click on thumbnail to view full-size image (69 KB).
You will see from the picture that today I'm interested in Enterprise Java Beans, so I've typed "ejb" into the search box. The JavaWorld search engine does its job and rewards me with Figure 2:
Figure 2. Search Results for "ejb." Click on thumbnail to view full-size image (68 KB).
The results appear in descending date order and -- although you can't see it in the picture -- the result set contains entries for prior months as well as the current month. Now I want to see the results in ascending date order rather than descending date order but with results prior to December omitted. If I had some software capable of running it, I could express exactly what I want to see with this piece of SQL:
SELECT ... AS linkText, ... AS description, ... AS date FROM http://search.itworld.com:8765/query.html WITHGET qt=ejb WHERE date MATCHES '*Dec*' ORDER BY date
Figure 3. SQL results. Click on
thumbnail to view full size image (13 KB).
Look at how I've placed the Webpage URL in the FROM clause, in the space where a table name would usually go. You will also notice a new keyword, WITHGET, which I've introduced to trigger a form submission. The result of running this SQL statement would be as shown in Figure
3. I could export the data to another application, a RDBMS or a spreadsheet program. If I issued the query from within a Java
program -- using something like JDBC -- I could present the data in my own style, manipulate the data, and combine it with
results from other sites. Imagine the possibilities.
That's enough for setting the scene. You should have a fairly good idea of what I'm proposing and why I think it's such a great idea, so now I'll tell you a bit more about this new variation on the SQL theme.
Figure 4. Hierarchy of HTML elements. Click on thumbnail to view full-size image (29 KB).
As designed, SQL works with regular two-dimensional tables with data arranged neatly into rows and columns. Web-based information -- in the form of HTML -- just isn't like that, as you can see from the hierarchy of elements comprising the JavaWorld search page (Figure 4).
To move from the HTML hierarchy to the regular SQL table paradigm, you must think of SQL in a slightly different way. As a simple example, look at how I've transformed the following SQL statement that runs against an employee table with two columns:
> SELECT name, salary FROM employee WHERE name='Tony' > SELECT row[*].column['name'], row[*].column['salary'] FROM employee WHERE row[*].column['name']='Tony' > SELECT row[*].column[1] AS name, row[*].column[2] AS salary FROM employee WHERE name='Tony'
The final statement means exactly the same, semantically, as the first. It reads, "Select the first column (call it name) of any row and the second column (call it salary) of any row in the employees table where the name is equal to 'Tony.'"
Using this syntax I could propose a new query language called 3D-SQL that operates on tables with the traditional two dimensions plus an additional third dimension, called history. Each table cell is now addressable as row[x].column[y].history[z], which allows me to construct statements like:
> SELECT row[*].column[1].history[1] AS name, row[*].column[2].history[*] AS salary FROM employee WHERE salary>10000
The statement means, "Select the first historic element of any row's first column (call it name) and any historic element of any row's second column (call it salary) where the salary is greater than 10,000." In plain English it means, "Give me the names of those employees who have ever had a salary greater than 10,000."
Although I've placed the wildcard characters within row specifications (row[*]), I could reverse the sense and use column[*] to do something rather more difficult with traditional SQL: search across the table columns rather than down the table rows.
Now when you look at Figure 4, you might see it in a new light. It's not quite so frightening now, I hope, and I can return to my earlier query against this page and fill in the blanks (note the placement of the wildcard characters) to give:
SELECT .table[0001].tr[0000].td[0002].table[0002].tr[0000].td[0001].table[0000] .tr[0001].td[0000].table[*].tr[0000].td[0000].htxt[0000] AS linkText, .table[0001].tr[0000].td[0002].table[0002].tr[0000].td[0001].table[0000] .tr[0001].td[0000].table[*].tr[0000].td[0000].text[0001] AS description, .table[0001].tr[0000].td[0002].table[0002].tr[0000].td[0001].table[0000] .tr[0001].td[0000].table[*].tr[0000].td[0001].table[0000].tr[0001].td[0000] .text[0000] AS date FROM http://search.itworld.com:8765/query.html WITHGET qt=ejb WHERE date MATCHES '*Dec*' ORDER BY date
The proposed 3D-SQL name will have to go because we're now working in more than three dimensions, and since the lowest common denominator for these queries will be HTML, I'll now settle on the name HTMSQL (HyperText Markup Structured Query Language). I suspect that this SQL has a lot in common with the versions of SQL for object databases, but that's not my specialty.
Of course, coming up with these kinds of queries proves difficult if all you've got to work from is the Webpage as presented in a browser. Nor are we all fluent in raw HTML, so I propose some new tools.