PixieRobot: Web Scraping Manual
Page Contents
1. Introduction
2. Programming a conversation: in brief
3. Scripting
3.1 ExecuteWWW
3.2 XForm
3.3.1 XCommon
3.3.2 XRun
3.4 OutputToFile
3.5 FileGetText
3.6 LoggedMessage
3.7 LogMessage
3.8 Monitor
3.9 Silent
3.10 The Document Object
3.11 Extracting Pictures
3.11.1 GetImageList
3.11.2 GetImageByNumber
3.11.3 GetImageByName
3.11.4 GetFilenameFromAddress
1. Introduction
PixieRobot's WWW functions are used to extract unstructured data from the WWW and reformat it into structured data formats such as spreadsheets and databases.
PixieRobot now provides a script interface for driving a browser object. You can script:
2. Programming a conversation: in brief
Scripts are plain text files written in the VBSCRIPT language. By default the browsing script is called "onInterval.txt", but you can change its name in the configuration form (Menu:Config ---> WWW) or by editing PixieRobot.ini, eg
ScriptOnInterval=Cleopatra.txt
You need to name your master routine Sub Main.
The script statement Monitor=True causes the conversation to be made visible. Monitor=True should always be the first line in a new Sub Main. Change to Monitor=False when you "go production".
You then script one step at a time, test-running with Menu:WWW --> Immediate (Hotkey = F7). At each step the monitor, which is built with objects from "Internet Explorer", displays the result.
Click the "WriteToFile" button to save the current page for further inspection. For a faster way to work, right-click on the displayed page area and select "View Source" from the popup menu.
The most commonly used of the library functions we have added to VBSCRIPT are:
ExecuteWWW, a function used for navigation and form submission.
XForm, an object consisting of the collection of elements in the first form on the page. It is mostly used for data entry into pages, but also for scraping desired data out of a target page through the "document object model".
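The steps described in this section can be put together as a minimal sketch of a browsing script. The address and the field names "UserName" and "Password" are hypothetical placeholders, not part of any real site; substitute the ones from your target page.

```vbscript
Sub Main
    ' Make the conversation visible while developing
    Monitor = True

    ' Step 1: navigate to the start page (hypothetical address)
    Call ExecuteWWW("http://www.example.com/login")

    ' Step 2: fill in fields on the first form (hypothetical field names)
    XForm("UserName").Value = "michael"
    XForm("Password").Value = "biscuit"

    ' Step 3: submit the form using the "SUBMIT" command
    Call ExecuteWWW("SUBMIT")
End Sub
```

Test-run after adding each step, so that the monitor shows the page reached before the next step is written.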
3. Scripting
Scripts are plain text files written in the VBSCRIPT language. The language manual is published at
http://msdn.microsoft.com/scripting
but that is a rather dry, theoretical document. We aim here to cover common PixieRobot usage by example.
Features and points of our implementation
The PixieRobot methods and properties are supplied as an implicit object, ie functions such as ExecuteWWW and FileGetText become part of the VBSCRIPT language for this environment.
If you wish to write in object syntax, that implicit object is called "IW", so both of the following mean the same:
Call IW.ExecuteWWW("http://www.pixieware.com")
Call ExecuteWWW("http://www.pixieware.com")
Some VARs may wish to script using the Scripting.FileSystemObject. An implicit one of these, named FS, is already provided, and you can use all of its methods and properties directly, eg
Set drv = GetDrive(GetDriveName(drvPath))
NOTE that you still need to Set and use sub-objects, eg drv above.
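As a further illustration, the sketch below uses three standard FileSystemObject methods (FolderExists, CreateFolder and GetFolder) without an object prefix, on the assumption that they are exposed directly as described above. The folder path is a hypothetical example, and its parent folder is assumed to exist already.

```vbscript
' Hypothetical working folder for downloaded pictures
sFolder = "C:\Temp\Pics"

' Create the folder on first use (the parent C:\Temp must already exist)
If Not FolderExists(sFolder) Then
    Call CreateFolder(sFolder)
End If

' A sub-object still has to be Set before it can be used
Set fld = GetFolder(sFolder)
Call LogMessage(sFolder & " holds " & fld.Files.Count & " file(s)")
```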
General VBSCRIPT note:
The For ... Next loop requires that the closing keyword "Next" stand on its own, with no variable after it. That can be tricky to get used to, especially given the less than helpful error message it produces:
"1025, Expected end of statement" eg:
'
'distribute multiple attachments
For i = 1 To AttachNumber
Call AttachMove(i, sDestination(i))
Next
'note this must NOT be "Next i"
IW (Internet WWW Object): details of the scripting properties and methods
Grouped by purpose. All keywords and file/path strings are case-insensitive.
3.1 ExecuteWWW
Call ExecuteWWW( URL, [PostData], [EndOfTx] )
| Part | Description |
| URL | String: Address to navigate to, OR command "SUBMIT" |
| PostData | String: Optional data to submit in low-level format like "name=Michael&age=37" |
| EndOfTx | String: Optional text to recognise as end of transmission. This is a bandwidth-saving device: when this text arrives in the page download, PixieRobot knows that it has received all that you need and can stop further downloading. Use distinctive text from near the end of the page. |
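The three forms of the call might look like the sketch below. The address and the marker text "End of results" are hypothetical; the marker should be replaced by distinctive text that really appears near the end of your target page.

```vbscript
' Plain navigation
Call ExecuteWWW("http://www.example.com/search")

' Navigation with low-level post data
Call ExecuteWWW("http://www.example.com/search", "name=Michael&age=37")

' Post data plus an end-of-transmission marker, so that downloading
' can stop as soon as the marker text has arrived
Call ExecuteWWW("http://www.example.com/search", "name=Michael&age=37", "End of results")
```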
3.2 XForm
XForm( FieldNameOrIndex [,FormIndex] ) [(ArrayIndex)].Value = Value
Call XForm( FieldNameOrIndex [,FormIndex] ).Click
Manipulate a form field directly through an abbreviation of the HTML Document Object Model, eg:
XForm("Password").Value = "biscuit"
Note, this is equivalent to
WWW.Document.Forms(0).Elements("Password").Value = "biscuit"
but it is much easier to type. You may need the longer WWW "low level" version if the page uses unusual on-form-submission methods, eg "remote scripting" or "field-by-field".
| Part | Description |
| FieldNameOrIndex | String or Integer: Field Name or Index to identify the field to manipulate |
| FormIndex | Optional Integer: index of the form on the page, counting from 0. Defaults to the first form (0) |
| ArrayIndex | Optional Integer: only needed when multiple fields have the same name |
| Value | Value for the field, usually a string, but can be numeric or, for checkboxes, Boolean (True/False) |
More XForm Examples:
XForm("chkAutoTrans", 1).Checked = True
There are 2 forms on this page, and the checkbox "chkAutoTrans" that we want to tick is in the second form. The first, default form has a FormIndex of 0, so the second form needs a FormIndex of 1.
xPrice2 = XForm("Price2").Value
Read the value of field "Price2" into variable xPrice2.
XForm("optType")(2).Checked = True
There is a group of option radiobuttons. They all have the name "optType". You want to check the 3rd one, which requires an ArrayIndex of 2 because they count from 0.
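The examples above can be combined into one short sequence. This is only a sketch: the button name "btnSubmit" is hypothetical, and it assumes that the Checked property can be read as well as written, as it can in the underlying Document Object Model.

```vbscript
' Tick the checkbox in the second form only if it is not ticked already
If Not XForm("chkAutoTrans", 1).Checked Then
    XForm("chkAutoTrans", 1).Checked = True
End If

' Select the 3rd radiobutton of the "optType" group (indexes count from 0)
XForm("optType")(2).Checked = True

' Read a field and record what was found
xPrice2 = XForm("Price2").Value
Call LogMessage("Price2 = " & xPrice2)

' Click a button on the first form (hypothetical field name)
Call XForm("btnSubmit").Click
```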
3.3.1 XCommon
A common area used to pass data between a calling VBScript program and a called VBScript.
Examples:
XCommon("DelCountry") = oDelCountry
XCommon("ProviderRef") = "32319970815"
3.3.2 XRun
A VBSCRIPT program may be broken up into separate scripts and PixieRobot provides a method for
one script to call another. This is the PixieRobot proprietary function "XRun".
Example:
sRet = XRun("TESTMOCK.vbs")
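XCommon and XRun are typically used together. The sketch below shows a calling script passing a value to a called script. It assumes that the called script also starts at Sub Main and that XRun hands back a value that can be logged; neither point is confirmed by this manual, so verify both in your installation.

```vbscript
' Calling script
Sub Main
    ' Place a value in the common area before handing over
    XCommon("ProviderRef") = "32319970815"
    sRet = XRun("TESTMOCK.vbs")
    Call LogMessage("TESTMOCK.vbs returned: " & sRet)
End Sub
```

```vbscript
' Called script (TESTMOCK.vbs)
Sub Main
    ' Pick up the value left in the common area by the caller
    sRef = XCommon("ProviderRef")
    Call LogMessage("Received ProviderRef " & sRef)
End Sub
```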
3.4 OutputToFile
Call OutputToFile( Data, Destination, True )
Write string data as a disk file. Very useful for
logging results of scripting when developing scripts.
| Part | Description |
| Data | String to write to disk as data |
| Destination | String: full path of new file including file name |
| True | Optional. If True, appends at the end of an existing file; otherwise overwrites the contents of the file |
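A simple development log can be sketched as follows. The log path is hypothetical, and the line breaks are added explicitly with vbCrLf because this manual does not say whether OutputToFile appends one itself.

```vbscript
' Hypothetical log file
sLog = "C:\Temp\scrape.log"

' No third argument: start a fresh file, overwriting any old contents
Call OutputToFile("Run started " & Now & vbCrLf, sLog)

' Third argument True: append to the file just created
Call OutputToFile("Price2 = " & XForm("Price2").Value & vbCrLf, sLog, True)
```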
3.5 FileGetText
stringData = FileGetText(File_with_full_path_name)
Read contents of a file into stringData. Useful for reading the contents of an attachment to feed into other systems like databases, eg
stringData = FileGetText(PathAttachIn & AttachFile(Index))
| Part | Description |
| MessageString | Message to log |
| Destination | String: full path of new file including file name |
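Once a file has been read with FileGetText, its contents can be processed line by line with standard VBSCRIPT string functions. The sketch below assumes a hypothetical text file whose lines end in carriage return plus line feed.

```vbscript
' Hypothetical input file
sText = FileGetText("C:\Temp\orders.txt")

' Break the text into lines and log each non-blank one
aLines = Split(sText, vbCrLf)
For i = 0 To UBound(aLines)
    If Len(Trim(aLines(i))) > 0 Then
        Call LogMessage("Line " & (i + 1) & ": " & aLines(i))
    End If
Next
```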
3.6 LoggedMessage
LoggedMessage
String property. Returns the previous log message sent to the log file and monitor.
3.7 LogMessage
Call LogMessage(Message)
3.8 Monitor
The script statement Monitor=True causes the conversation to be made visible.
Monitor=True should always be the first line in a new Sub Main, to allow for debugging.
Change to Monitor=False when you "go production".
Examples:
Monitor = False
Monitor = True
3.9 Silent
The script statement Silent=True causes "pop-up boxes" to be ignored.
Example: Silent = True
3.10 The Document Object
URL Extraction Using The Document Object
The Document Object represents the HTML document in a given browser window.
Use the document object to examine, modify, or add content to a HTML document
and to process events within that document. The URL property sets or retrieves
the URL for the current document.
e.g. wFi = Mid(WWW.Document.URL, 1, 25)
Places the first 25 characters of the URL (for example http://www.abcsports.com/) in the variable wFi.
A specific use for the URL property with PixieRobot arises when a web farm is encountered. Web farms are set up to handle large visitor numbers by having multiple web servers share the load. Which web server you get is randomly selected when a session is first established. If you try entering a constant URL (e.g. www4.abcsports.com), it will be ignored and the URL of the server allocated to you is returned instead. From then on, the addresses you request need to match your allocated web server.
So, for example, the following PixieRobot script extracts the server identifier for subsequent use.
Sub Main
' ABC Sports Web Farm Test
' PixieRobot command to make the conversation visible while testing
Monitor = True
' PixieRobot command to ignore pop-up windows while running
Silent = True
' PixieRobot command to navigate to a web page
s = ExecuteWWW("http://www.abcsports.com")
' PixieRobot command to obtain the URL and extract the web farm address in - wFi
' The URL returned is: http://www4.abcsports.com and the 11th character from left is web farm identifier
wFi = Mid(WWW.Document.URL, 11, 1)
' Navigate to new web page after combining all elements of URL
s = ExecuteWWW("http://www" & wFi & ".abcsports.com/")
End Sub
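The script above relies on the server identifier being a single character at a fixed position. Where that may not hold (for example a two-digit server number), the whole host name can be extracted instead. HostFromUrl below is a hypothetical helper written in plain VBSCRIPT, not a PixieRobot function, and the page name in the usage line is a placeholder.

```vbscript
' Hypothetical helper: return the host name part of a URL,
' e.g. "www4.abcsports.com" from "http://www4.abcsports.com/index.html"
Function HostFromUrl(sUrl)
    Dim iStart, iEnd
    ' Skip past the "//" that follows the protocol, if present
    iStart = InStr(1, sUrl, "//")
    If iStart = 0 Then
        iStart = 1
    Else
        iStart = iStart + 2
    End If
    ' The host name ends at the next "/" or at the end of the string
    iEnd = InStr(iStart, sUrl, "/")
    If iEnd = 0 Then iEnd = Len(sUrl) + 1
    HostFromUrl = Mid(sUrl, iStart, iEnd - iStart)
End Function

' Usage: stay on the allocated server for the next page
sHost = HostFromUrl(WWW.Document.URL)
s = ExecuteWWW("http://" & sHost & "/results.html")
```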
Other ways of using PixieRobot to work with web pages through the Document Object include:
Setting a Form Element value: WWW.Document.Forms(0).Elements("zipcode").value = "10010"
Checking if a page is loaded: If WWW.document.ReadyState = "complete" Then
Setting a Form Element Index value: WWW.Document.Forms(1).Elements("cspecialty").selectedIndex = 0
Retrieving a Form Element Value: a2=WWW.Document.Forms(1).Elements("cspecialty").Value
Clicking an Element on a Form: Call www.document.forms(1).elements(9).click
3.11 Extracting Pictures
These functions are intended to be used to get a list of the URLs of the images on the page that the script is currently positioned at, then to extract the required picture by its index or by its name and store it in the folder specified. The following code is an example:
If Instr(1, t, "nophoto", 1) <> 0 Then
call logMessage ("No Picture details page")
oPicid = "None"
Else
iMglist = Split(GetImageList(), Chr(254))
For i = 0 To Ubound(iMglist)
' The next line searches for an image with "auto" in the name
If Instr(1, iMglist(i), "auto", 1) > 0 Then
oPicid = iMglist(i)
on error resume next
' The next line downloads the image to the specified folder
Call GetImageByNumber(i, "C:\Prog Files\PR\djphotos")
on error goto 0
Exit For
End If
Next
End If
3.11.1 GetImageList
Public Function GetImageList() As String()
GetImageList - This function returns an array containing the URLs of every
image on the document. This array is zero-based. You can use the indices
of this array in the GetImageByNumber function call, or you can put the
URL through the GetFilenameFromAddress function and pass the returned
filename to the GetImageByName function.
3.11.2 GetImageByNumber
Public Function GetImageByNumber(ByVal index As Integer, ByVal directory As String) As String
GetImageByNumber - This function downloads an image based on its index in
the web page.
* index: The index of the image
* directory: The directory (NOT filename) you wish to download to. The image will
retain its own filename.
* return value: The function returns the path to the downloaded image. If the function
fails, the return value is a zero-length string. The function will not return until
the image has been downloaded.
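Because a failed download is signalled only by a zero-length string, it is worth testing the return value. A minimal sketch, using a hypothetical destination folder:

```vbscript
' Download the first image on the page (index 0) to a hypothetical folder
sPath = GetImageByNumber(0, "C:\Temp\Pics")

If sPath = "" Then
    ' A zero-length string means the download failed
    Call LogMessage("Image 0 could not be downloaded")
Else
    Call LogMessage("Image 0 saved as " & sPath)
End If
```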
3.11.3 GetImageByName
Public Function GetImageByName(ByVal imgname As String, ByVal directory As String) As String
GetImageByName - This function downloads an image by its filename, name, or id.
* imgname: The filename, name or id of an image on the web page. Not all images
have names or ids; it depends on the exact HTML code used.
* directory: see GetImageByNumber
* return value: see GetImageByNumber
3.11.4 GetFilenameFromAddress
Private Function GetFilenameFromAddress(url As String) As String
GetFilenameFromAddress - This function takes a URL and extracts the filename from it.
The filename is defined as the segment of the URL past the last slash character
(either '/' or '\').
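Since GetFilenameFromAddress is declared Private, it may not be callable from a script. The rule it follows is easy to reproduce in plain VBSCRIPT; FileNameFromUrl below is a hypothetical stand-in written for illustration, not PixieRobot's own implementation.

```vbscript
' Hypothetical stand-in: return the segment of a URL past the last
' slash character (either "/" or "\")
Function FileNameFromUrl(sUrl)
    Dim iPos, iBack
    iPos = InStrRev(sUrl, "/")
    iBack = InStrRev(sUrl, "\")
    If iBack > iPos Then iPos = iBack
    ' With no slash at all, iPos is 0 and the whole string is returned
    FileNameFromUrl = Mid(sUrl, iPos + 1)
End Function

' Usage with a hypothetical image address
sName = FileNameFromUrl("http://www.example.com/images/auto123.jpg")
' sName now holds "auto123.jpg"
```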