PixieRobot: Web Scraping Manual


Page Contents

1. Introduction
2. Programming a conversation: in brief
3. Scripting
3.1 ExecuteWWW
3.2 XForm
3.3.1 XCommon
3.3.2 XRun
3.4 OutputToFile
3.5 FileGetText
3.6 LoggedMessage
3.7 LogMessage
3.8 Monitor
3.9 Silent
3.10 The Document Object
3.11 Extracting Pictures
3.11.1 GetImageList
3.11.2 GetImageByNumber
3.11.3 GetImageByName
3.11.4 GetFilenameFromAddress


1. Introduction

PixieRobot's WWW functions are used to extract unstructured data from the WWW and reformat it into structured data formats such as spreadsheets and databases.

PixieRobot now provides a script interface for driving a browser object, so that navigation, form entry, form submission and data extraction can all be scripted.


2. Programming a conversation: in brief

Scripts are plain text files written in the VBSCRIPT language.  By default the browsing script is called "onInterval.txt",
but you can change its name in the configuration form (Menu:Config ---> WWW)
or by editing PixieRobot.ini, e.g.
ScriptOnInterval=Cleopatra.txt

You need to name your master routine
Sub Main

The script statement  Monitor=True
causes the conversation to be made visible. 
Monitor=True should always be the first line in a new Sub Main.
Change to Monitor=False when you "go production".

You then script one step at a time, test-running with Menu:WWW --> Immediate  (Hotkey = F7).
At each step the monitor, which is built with objects from "Internet Explorer", displays the result. 
Click the "WriteToFile" button to save the current page for further inspection.
For a faster way to work, you can right-click on the displayed page area, then select "View Source" from the popup menu.

The most commonly used of the library functions we have added to VBSCRIPT are:
ExecuteWWW, a function used for navigation and form submission.
XForm, an object consisting of the collection of elements in the first form on the page. It is mostly used for entering data into pages, but can also be used to scrape data out of a target page through the "document object model".
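
The steps described in this section can be put together as a minimal script. This is only a sketch: the URL and the field name "q" are hypothetical and must be replaced with those of the page you are working on, and the variable name pageText is our own choice.

```vbscript
' Minimal sketch of a browsing script.
' The URL and the field name "q" are hypothetical placeholders.
Sub Main
   Dim pageText

   ' Show the conversation while developing; change to False for production.
   Monitor = True

   ' Step 1: navigate to the page that holds the form.
   pageText = ExecuteWWW("http://www.example.com/search")

   ' Step 2: enter data into a field of the first form on the page.
   XForm("q").Value = "widgets"

   ' Step 3: submit the form and keep whatever ExecuteWWW returns.
   pageText = ExecuteWWW("SUBMIT")
End Sub
```

Run it one step at a time with F7 and check the monitor after each step before adding the next one.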


3. Scripting

Scripts are plain text files written in the VBSCRIPT language.  The language manual is published at
http://msdn.microsoft.com/scripting
but that is a rather dry theoretical document.  We aim here to cover common PixieRobot usage by example.

Features and points of our implementation

General VBSCRIPT note: all keywords and file/path strings are case-insensitive.

IW (Internet WWW Object): details of the scripting properties and methods, grouped by purpose.

3.1 ExecuteWWW

Call ExecuteWWW( URL, [PostData], [EndOfTx]  )
stringData = ExecuteWWW( URL, [PostData], [EndOfTx] )
stringData = ExecuteWWW( "SUBMIT", [PostData] )
Navigate to a page, or submit a form by using "SUBMIT" as the URL argument.

Part       Description
URL        String: Address to navigate to, OR the command "SUBMIT".
PostData   String: Optional data to submit in low-level format, like "name=Michael&age=37".
           Or, the relative number on the page of the form being submitted.
           Form count is zero-based, so the first form on the page has the number 0.
EndOfTx    String: Optional text to recognise as end of transmission.  This is a bandwidth-saving device: when this text arrives in the page download, PixieRobot knows that it has received all that you need and can stop further downloading.  Use distinctive text from near the end of the page.
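
As an illustration of the optional arguments, the sketch below navigates with post data and an end-of-transmission marker, then submits the second form on the resulting page. The URL and the marker text "End of results" are hypothetical, and passing the form number as a numeric value (rather than as a string) is an assumption.

```vbscript
Sub Main
   Dim pageText
   Monitor = True

   ' Navigate with low-level post data, and stop downloading once the
   ' (hypothetical) marker text "End of results" has arrived.
   pageText = ExecuteWWW("http://www.example.com/lookup", _
                         "name=Michael&age=37", "End of results")

   ' Submit the second form on the current page (form numbers start at 0).
   ' Passing the number as a numeric value is an assumption.
   pageText = ExecuteWWW("SUBMIT", 1)
End Sub
```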

3.2 XForm

XForm(FieldNameOrIndex [,FormIndex] ) [(ArrayIndex)].Value = Value
Call XForm( FieldNameOrIndex [,FormIndex] ).Click
Manipulate a form field directly through an abbreviation of the HTML Document Object Model, e.g.:
XForm("Password").Value = "biscuit"

Note, this is equivalent to
WWW.Document.Forms(0).Elements("Password").Value = "biscuit"
but it is much easier to type.  You may need the longer WWW "low level" version if the page uses unusual
on-form-submission methods eg "remote scripting" or "field-by-field".

Part               Description
FieldNameOrIndex   String or Integer: Field Name or Index to identify the field to manipulate
FormIndex          Optional Integer: zero-based index of the form on the page; the first, default form is 0
ArrayIndex         Optional Integer, only needed when multiple fields have the same name; counts from 0
Value              Value for the field, usually a string, but can be numeric or, for checkboxes, Boolean (True/False)


More XForm Examples:
XForm("chkAutoTrans", 1).Checked = True
There are 2 forms on this page, and the checkbox "chkAutoTrans" that
we want to tick is in the second form.  The first, default form has a
FormIndex of 0, so a second form needs a FormIndex specified of 1.

xPrice2 = XForm("Price2").Value
Read the value of field "Price2" into the variable xPrice2.

XForm("optType")(2).Checked = True
There is a group of option radiobuttons.  They all have the name "optType".
You want to check the 3rd one, which requires an ArrayIndex of 2 because they count from 0.
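
The individual statements above might be combined in a single script along the following lines. This is a sketch only: the URL and the names "UserName" and "btnOrder" are hypothetical, while "optType" and "Price2" are taken from the examples above.

```vbscript
Sub Main
   Dim currentPrice
   Monitor = True

   ' Hypothetical page holding the form.
   Call ExecuteWWW("http://www.example.com/order")

   ' Text field in the first (default) form.
   XForm("UserName").Value = "michael"

   ' Third radio button of a group whose members share one name.
   XForm("optType")(2).Checked = True

   ' Read a value back and record it before submitting.
   currentPrice = XForm("Price2").Value
   Call LogMessage("Price2 is " & currentPrice)

   ' Click a button by name instead of calling ExecuteWWW("SUBMIT").
   Call XForm("btnOrder").Click
End Sub
```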

3.3.1 XCommon

A common area used to pass data between a calling VBScript program and a called VBScript.

Examples:
XCommon("DelCountry") = oDelCountry
XCommon("ProviderRef") = "32319970815"

3.3.2 XRun

A VBSCRIPT program may be broken up into separate scripts and PixieRobot provides a method for
one script to call another. This is the PixieRobot proprietary function "XRun".

Example:
sRet = XRun("TESTMOCK.vbs")
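
A calling script might use XCommon and XRun together as sketched below. The file name "Lookup.vbs" is hypothetical, and both the value that XRun returns and its being a string that can be logged are assumptions, since this manual does not describe the return value.

```vbscript
' Calling script. "Lookup.vbs" is a hypothetical file name.
Sub Main
   Dim sRet

   ' Place the data for the called script in the common area.
   XCommon("ProviderRef") = "32319970815"

   ' Run the second script; it is expected to read XCommon("ProviderRef").
   sRet = XRun("Lookup.vbs")

   ' Assumption: the return value can be converted to a string.
   Call LogMessage("Lookup.vbs returned: " & sRet)
End Sub
```

Inside the called script the value would presumably be read back with the same syntax, for example ref = XCommon("ProviderRef").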

3.4 OutputToFile

Call OutputToFile( Data, Destination [, True] )
Write string data as a disk file.  Very useful for logging results of scripting when developing scripts.

Part          Description
Data          String to write to disk as data
Destination   String: full path of the new file, including file name
True          Optional. Appends to the end of an existing file; if omitted, the contents of the file are overwritten.
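
The sketch below shows both modes: overwriting a snapshot of the last page, and appending a line to a running log. The URL and both file paths are hypothetical, and it assumes that ExecuteWWW returns the page as a string.

```vbscript
Sub Main
   Dim pageText, logFile
   logFile = "C:\Temp\scrape-log.txt"        ' hypothetical path

   Monitor = True
   pageText = ExecuteWWW("http://www.example.com/")

   ' No third argument: any existing file is overwritten.
   Call OutputToFile(pageText, "C:\Temp\lastpage.html")

   ' Third argument True: the line is appended to the existing log.
   Call OutputToFile(Now & " fetched " & Len(pageText) & " characters" & vbCrLf, _
                     logFile, True)
End Sub
```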

3.5 FileGetText

stringData = FileGetText(File_with_full_path_name)
Read the contents of a file into stringData.  Useful for reading the contents of an attachment to feed into other systems like databases, e.g.
stringData = FileGetText(PathAttachIn & AttachFile(Index))

Part                       Description
File_with_full_path_name   String: full path of the file to read, including file name
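
Once the file is in a string it can be processed with ordinary VBSCRIPT functions. The sketch below logs each non-empty line of a file; the path is hypothetical and the file is assumed to use CR+LF line endings.

```vbscript
Sub Main
   Dim stringData, fileLines, i

   stringData = FileGetText("C:\Temp\prices.txt")   ' hypothetical path

   ' Assumes CR+LF line endings; adjust the delimiter for other files.
   fileLines = Split(stringData, vbCrLf)

   For i = 0 To UBound(fileLines)
      If Len(Trim(fileLines(i))) > 0 Then
         Call LogMessage("Line " & (i + 1) & ": " & fileLines(i))
      End If
   Next
End Sub
```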

3.6 LoggedMessage

LoggedMessage
String property; returns the previous log message sent to the log file and monitor.

3.7 LogMessage

Call LogMessage(Message)
Displays string Message on the PixieRobot Monitor, as well as writing it to its logbook.
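
One possible use of LogMessage together with LoggedMessage is to avoid writing the same message twice in a row. The helper name LogOnce is our own and is not part of PixieRobot.

```vbscript
' Hypothetical helper: log a message only if it differs from the
' message that was logged immediately before it.
Sub LogOnce(message)
   If LoggedMessage <> message Then
      Call LogMessage(message)
   End If
End Sub

Sub Main
   Call LogOnce("Waiting for results page")
   Call LogOnce("Waiting for results page")   ' skipped: same as previous
   Call LogOnce("Results page received")
End Sub
```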

3.8 Monitor

The script statement Monitor=True causes the conversation to be made visible.
Monitor=True should always be the first line in a new Sub Main, to allow for debugging.
Change to Monitor=False when you "go production".

Examples:
Monitor = False
Monitor = True

3.9 Silent

The script statement Silent=True causes "pop-up boxes" to be ignored.

Example:  Silent = True


3.10 The Document Object

URL Extraction Using The Document Object

The Document Object represents the HTML document in a given browser window.
Use the document object to examine, modify, or add content to a HTML document
and to process events within that document. The URL property sets or retrieves
the URL for the current document.
e.g. wFi = Mid(WWW.Document.URL, 1, 25)
places the first 25 characters of the URL, for example http://www.abcsports.com/, in the variable wFi.

A specific use for the URL property with PixieRobot arises when a web farm
is encountered. Web farms are set up to handle large visitor numbers by
having multiple web servers share the load. Which web server you get
is selected at random when a session is first established. If you try
entering a fixed URL (e.g. www4.abcsports.com), it will be ignored and
the URL of the server allocated to you is returned instead. From then on,
every request in the session needs to use the address of your allocated web server.

So for example the following PixieRobot script will extract the server variable
for subsequent use.

Sub Main
   ' ABC Sports Web Farm Test
   ' Make the conversation visible while the script is being tested
   Monitor = True
   ' Ignore pop-up windows while running
   Silent = True
   ' Navigate to a web page
   s = ExecuteWWW("http://www.abcsports.com")
   ' Obtain the URL and extract the web farm identifier into wFi.
   ' The URL returned is e.g. http://www4.abcsports.com, and the 11th character from the left is the web farm identifier
   wFi = Mid(WWW.Document.URL, 11, 1)
   ' Navigate to a new web page after combining all elements of the URL
   s = ExecuteWWW("http://www" & wFi & ".abcsports.com/")
End Sub
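
Reading the identifier from a fixed character position only works while it is a single character and the address starts with "http://www". As a sketch of a more tolerant alternative, the helper below returns whatever sits between "://www" and the next dot. The function name FarmId is our own and is not part of PixieRobot.

```vbscript
' Hypothetical helper: return the text between "://www" and the next ".".
' Returns "" when the URL has no farm identifier (e.g. http://www.abcsports.com).
Function FarmId(url)
   Dim markerPos, startPos, dotPos
   FarmId = ""
   markerPos = InStr(1, url, "://www", vbTextCompare)
   If markerPos = 0 Then Exit Function
   startPos = markerPos + Len("://www")
   dotPos = InStr(startPos, url, ".")
   If dotPos = 0 Then Exit Function
   FarmId = Mid(url, startPos, dotPos - startPos)
End Function

Sub Main
   Dim s, wFi
   Monitor = True
   Silent = True
   s = ExecuteWWW("http://www.abcsports.com")
   wFi = FarmId(WWW.Document.URL)
   s = ExecuteWWW("http://www" & wFi & ".abcsports.com/")
End Sub
```

With this helper an address such as http://www12.abcsports.com/ yields "12" instead of only the first digit.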

Other ways of using PixieRobot to navigate web pages include:

Setting a Form Element value: WWW.Document.Forms(0).Elements("zipcode").value = "10010"
Checking if a page is loaded: If WWW.document.ReadyState = "complete" Then
Setting a Form Element Index value: WWW.Document.Forms(1).Elements("cspecialty").selectedIndex = 0
Retrieving a Form Element Value: a2=WWW.Document.Forms(1).Elements("cspecialty").Value
Clicking an Element on a Form: Call www.document.forms(1).elements(9).click


3.11 Extracting Pictures

These functions are used to get a list of the URLs of the images
on the page at which the script is currently positioned, then to
extract the required picture by its index or by its name, and to
store the picture in the folder specified. The following code is an example:

If Instr(1, t, "nophoto", 1) <> 0 Then
   call logMessage ("No Picture details page")
   oPicid = "None"
Else
   iMglist = Split(GetImageList(), Chr(254))
   For i = 0 To Ubound(iMglist)
      ' The next line searches for an image with "auto" in the name
      If Instr(1, iMglist(i), "auto", 1) > 0 Then
         oPicid = iMglist(i)
         on error resume next
         ' The next line downloads the image to the specified folder
         Call GetImageByNumber(i, "C:\Prog Files\PR\djphotos")
         on error goto 0
         Exit For
      End If
   Next
End If

3.11.1 GetImageList

Public Function GetImageList() As String()

GetImageList - This function returns an array containing the URLs of every
image on the document. This array is zero-based. You can use the indices
of this array in the GetImageByNumber function call, or you can put the
URL through the GetFilenameFromAddress function and pass the returned
filename to the GetImageByName function.

3.11.2 GetImageByNumber

Public Function GetImageByNumber(ByVal index As Integer, ByVal directory As String) As String

GetImageByNumber - This function downloads an image based on its index in
the web page.
* index: The index of the image
* directory: The directory (NOT filename) you wish to download to. The image will
retain its own filename.
* return value: The function returns the path to the downloaded image. If the function
fails, the return value is a zero-length string. The function will not return until
the image has been downloaded.
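
Because a failed download is reported by a zero-length string, the return value can be tested directly. The sketch below downloads the first image on the current page; the URL and the target directory are hypothetical.

```vbscript
Sub Main
   Dim savedPath
   Monitor = True
   Call ExecuteWWW("http://www.example.com/gallery")   ' hypothetical URL

   ' Index 0 is the first image on the page.
   savedPath = GetImageByNumber(0, "C:\Temp\photos")    ' hypothetical directory

   If Len(savedPath) = 0 Then
      Call LogMessage("Image 0 could not be downloaded")
   Else
      Call LogMessage("Image 0 saved as " & savedPath)
   End If
End Sub
```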

3.11.3 GetImageByName

Public Function GetImageByName(ByVal imgname As String, ByVal directory As String) As String

GetImageByName - This function downloads an image by its filename, name, or id.
* imgname: The filename, name or id of an image on the web page. Not all images
have names or ids; it depends on the exact HTML code used.
* directory: see GetImageByNumber
* return value: see GetImageByNumber

3.11.4 GetFilenameFromAddress

Private Function GetFilenameFromAddress(url As String) As String

GetFilenameFromAddress - This function takes a URL and extracts the filename from it.
The filename is defined as the segment of the URL past the last slash character
(either '/' or '\').
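
Since the function is declared Private, it may not be callable from every script. The rule described above is simple enough to sketch in plain VBSCRIPT; the function name FilenameFromUrl below is our own and this is not PixieRobot's actual implementation.

```vbscript
' Hypothetical equivalent of the rule described above: return the part
' of the address after the last '/' or '\'. If the address contains no
' slash at all, the whole string is returned.
Function FilenameFromUrl(url)
   Dim slashPos, backslashPos
   slashPos = InStrRev(url, "/")
   backslashPos = InStrRev(url, "\")
   If backslashPos > slashPos Then slashPos = backslashPos
   FilenameFromUrl = Mid(url, slashPos + 1)
End Function

Sub Main
   Dim imgName, savedPath
   ' Hypothetical image address and target directory.
   imgName = FilenameFromUrl("http://www.example.com/images/auto_123.jpg")
   savedPath = GetImageByName(imgName, "C:\Temp\photos")
   If Len(savedPath) = 0 Then Call LogMessage("Could not download " & imgName)
End Sub
```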