HTML parser

Post by Bob Nemec » Sat Nov 11, 2006 3:42 pm

Is there a simple way to parse HTML in VA?

I can find HTML parsers for VW and Squeak, and SgmlParser in VA looks interesting... but I figured I'd ask before diving in head first.

Any suggestions?
Bob Nemec
Northwater Capital

HTML Parser

Post by solveig » Tue Nov 14, 2006 1:26 pm

Bob,

The SGML parser in the product might suit your needs, but it might be too strict.

What do you want to parse?

Solveig

Post by Bob Nemec » Wed Nov 15, 2006 11:45 am

I'm trying to parse saved web pages. I'm using an iMacro script (http://www.iopus.com/) to download pages from a financial site. I then want to pull data out of the page.

iMacro can do most of the parsing, but I'd like to position my app to eventually be 100% Smalltalk. The web browser scripting will probably be tricky, but being able to deal with the raw page data is a first step.

I'm doing this as an add-on exercise to Rob Vens' MijnGeld. To get started, my code is VA domain classes only (no UI). My plan is to automatically download a person's daily bank transactions and email them to the user, along with current budget status info.

I've got the download and email working (with iMacro and VA to do the emailing). Next is the budget and transaction hooks.
Bob Nemec
Northwater Capital

Post by tc » Wed Nov 15, 2006 6:35 pm

Hi Bob,

The XML parser in Smalltalk uses an SGML parser underneath to parse XML. HTML is similar but allows many things that are not allowed in XML. A few examples are:

1. HTML start tags do not always have matching end tags.
2. Start and end tags might overlap other start and end tags rather than nest.
3. Attribute values might not be quoted.
4. The HTML page might not have a single root element.
5. Things like 'A&P' in HTML would have to be 'A&amp;P' in XML.

I am sure the Smalltalk XML parser would complain about some of these things. If you know the data you are after in the HTML, maybe using the Smalltalk string functions would work better?
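
To make the five points above concrete, here is a minimal sketch in Python rather than VA Smalltalk, using Python's standard `xml.etree.ElementTree` as a stand-in for any strict XML parser. The sample snippets, the function names `is_well_formed` and `text_between`, and the markers in the usage note are all hypothetical; the second function only illustrates the "string functions" idea and is not a VA Smalltalk API.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippets, one per point in the list above.
SAMPLES = {
    "missing end tag": "<p>one<br>two</p>",
    "overlapping tags": "<p><b>bold <i>both</b> italic</i></p>",
    "unquoted attribute": "<td width=50>x</td>",
    "no single root": "<p>one</p><p>two</p>",
    "bare ampersand": "<p>A&P</p>",
}


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def text_between(page, start_marker, end_marker, start=0):
    """Return the text between two known markers, or None if either is missing.

    This uses plain substring searches only, so it does not care whether
    the surrounding page is well-formed.
    """
    i = page.find(start_marker, start)
    if i < 0:
        return None
    i += len(start_marker)
    j = page.find(end_marker, i)
    if j < 0:
        return None
    return page[i:j]
```

Every entry in `SAMPLES` is rejected by the strict parser, while `<p>A&amp;P</p>` is accepted. For a saved page where the surrounding markup is known in advance, a call such as `text_between(page, '<td class="balance">', '</td>')` pulls out the value without parsing the page at all. The trade-off is that the markers break as soon as the site changes its layout.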

--tc

HTML tidy

Post by quixotik » Thu Nov 16, 2006 6:22 am

Or one could run the HTML, assuming it's not well-formed to begin with, through HTML Tidy (http://tidy.sourceforge.net/) and then use the SGML parser? Just a thought.
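
The tidy-then-parse idea can be sketched as follows, again in Python rather than VA Smalltalk. It assumes the `tidy` command-line tool is installed and on the PATH, and it uses Python's `xml.etree.ElementTree` in place of the SGML parser. The function names `tidy_to_xhtml` and `cell_texts` are hypothetical.

```python
import subprocess
import xml.etree.ElementTree as ET

# tidy's XHTML output puts every element in this namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def tidy_to_xhtml(html_text):
    """Run the tidy command-line tool (assumed to be on PATH) and return XHTML.

    -asxml asks for well-formed XHTML, and -numeric writes entities as
    numeric references so a plain XML parser does not need the HTML DTD.
    """
    result = subprocess.run(
        ["tidy", "-quiet", "-asxml", "-numeric", "-utf8", "--show-warnings", "no"],
        input=html_text,
        capture_output=True,
        encoding="utf-8",
    )
    # tidy exits with 0 (clean), 1 (warnings) or 2 (errors it could not repair).
    if result.returncode > 1:
        raise ValueError("tidy could not repair the page: " + result.stderr.strip())
    return result.stdout


def cell_texts(xhtml):
    """Parse well-formed XHTML and return the stripped text of every <td>."""
    root = ET.fromstring(xhtml)
    return ["".join(td.itertext()).strip() for td in root.iter(XHTML_NS + "td")]
```

With these two pieces, `cell_texts(tidy_to_xhtml(saved_page))` would return the table cell contents of a saved page. The cleanup step is what lets a strict parser cope with the problems in tc's list above.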

Robin

Post by tc » Sun Dec 10, 2006 1:42 am

Hi Robin,

Good suggestion. Thanks.

--tc

