DataSource API

The DataSource API is used to define a web crawler. A web crawler is written as a JSON document that contains the following attributes.

Attributes

Name             Required  Type               Description
origin_url       Required  origin_url_object  The URL(s) to get data from, given as a String, an Array, or an Object
method           Optional  method_object      A String that indicates the HTTP request method used to load the page
post_data        Optional  post_data_object   An Object that holds the form values to send along in an HTTP POST request
columns          Required  columns            An Array of column_objects
next_page        Optional  next_page_object   An Object that describes the hyperlink to the next page of this listing
cookies          Optional  cookies            An Array that lists the cookies to use for the site
disable_cookies  Optional  Boolean            When set to true, disables the sending and receiving of cookies by our crawlers
headers          Optional  headers_object     An Object that is sent as the request headers

Example

{
  "origin_url"  : "http://some_url.com",
  "method"      : "post",
  "post_data"   : {
    "username"  : "My User Name",
    "password"  : "My Password"
  },
  "columns"     : [{
    "col_name"  : "transaction date",
    "dom_query" : "td.transaction_date"
  },{
    "col_name"  : "transaction name",
    "dom_query" : "td.transaction_name"  
  },{
    "col_name"  : "transaction amount",
    "dom_query" : "td.transaction_amount"      
  }],
  "next_page"   : {
    "dom_query" : "a.next_page"
  },
  "cookies"     : [{
    "domain": "localhost",
    "name": "__profilin",
    "value": "p%3Dt"
  }],
  "headers"     : {
    "Referer": "http://localhost:3000/docs/define-data",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
  }
}
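A definition like the one above can also be assembled and checked in code before it is submitted. The following is a minimal Python sketch under stated assumptions: `check_datasource` is a hypothetical helper, not part of the DataSource API, and it only checks the attributes marked Required in the table above and the method values listed in the method_object section.

```python
import json

# Attributes marked Required in the attribute table.
REQUIRED_ATTRIBUTES = ("origin_url", "columns")
# Methods listed as supported; "get" is the documented default.
SUPPORTED_METHODS = ("get", "post")


def check_datasource(definition):
    """Return a list of problems found in a DataSource definition.

    Hypothetical helper for illustration; not part of the DataSource API.
    """
    problems = []
    for name in REQUIRED_ATTRIBUTES:
        if name not in definition:
            problems.append("missing required attribute: " + name)
    method = definition.get("method", "get")
    if method not in SUPPORTED_METHODS:
        problems.append("unsupported method: " + str(method))
    return problems


definition = {
    "origin_url": "http://some_url.com",
    "method": "post",
    "post_data": {"username": "My User Name", "password": "My Password"},
    "columns": [
        {"col_name": "transaction date", "dom_query": "td.transaction_date"},
    ],
    "next_page": {"dom_query": "a.next_page"},
}

print(check_datasource(definition))      # an empty list means no problems were found
print(json.dumps(definition, indent=2))  # the JSON document that defines the crawler
```

Building the definition as a dictionary and serializing it with `json.dumps` guarantees that every key is quoted, which hand-written JSON can easily miss.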


origin_url_object

The origin_url_object is a required attribute that indicates the URL(s) to get data from. It has 4 variants.

Variant 1

A single URL String to get data from.

Example

"http://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=phones"

Variant 2

An array of URL Strings to get data from.

Example

[
  "http://www.amazon.com/s/ref=nb_sb_noss?field-keywords=phones",
  "http://www.amazon.com/s/ref=nb_sb_noss?field-keywords=android",
  "http://www.amazon.com/s/ref=nb_sb_noss?field-keywords=blackberry"
]

Variant 3

An Object with the following attributes

Attributes

Name            Required  Type    Description
origin_pattern  Required  String  The general pattern of the URLs you want to get data from, with the @@origin_value@@ parameter marking the replaceable part of the URL
origin_value    Required  Array   An Array of Strings, each of which is substituted for the @@origin_value@@ parameter

Example

{  
  "origin_pattern" : "http://www.amazon.com/s/ref=nb_sb_noss?field-keywords=@@origin_value@@",
  "origin_value" : [
    "Iphones",
    "blackberry",        
    "Android",
    "Nokia"
  ]
} 
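The first three variants all describe a list of URLs. The Python sketch below shows one way that list could be derived. The function name `expand_origin_url` is hypothetical, and URL-encoding each origin_value is an assumption of this sketch; the documentation does not say whether the crawler encodes the values.

```python
from urllib.parse import quote_plus

PLACEHOLDER = "@@origin_value@@"


def expand_origin_url(origin_url):
    """Turn an origin_url_object (Variant 1, 2, or 3) into a list of URLs."""
    if isinstance(origin_url, str):   # Variant 1: a single URL String
        return [origin_url]
    if isinstance(origin_url, list):  # Variant 2: an Array of URL Strings
        return list(origin_url)
    # Variant 3: substitute each origin_value into origin_pattern.
    # Encoding the value with quote_plus is an assumption of this sketch.
    pattern = origin_url["origin_pattern"]
    return [pattern.replace(PLACEHOLDER, quote_plus(value))
            for value in origin_url["origin_value"]]


urls = expand_origin_url({
    "origin_pattern": "http://www.amazon.com/s/ref=nb_sb_noss?field-keywords=@@origin_value@@",
    "origin_value": ["Iphones", "blackberry", "Android", "Nokia"],
})
for url in urls:
    print(url)
```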

Variant 4

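A URL String pattern to get data from, in which a parameter such as @@some_col_name@@ stands for the value of the column with that col_name. It is used only in the options_object.

Example

"http://www.amazon.com/s/ref=nb_sb_noss?field-keywords=@@some_col_name@@"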


method_object

The method_object is a String that indicates the HTTP method used to load the page. It defaults to get if not indicated.

Supported Methods

  • post
  • get

post_data_object

The post_data_object is an Object that holds the form data to be sent along in an HTTP POST request.

Example

{
  "key_1": "value_1",
  "key_2": "value_2",
  "key_3": "value_3",
  ...
}

to_click_object

An Array of jQuery selector Strings that describe the elements in the page to click before getting data.

It can be used for

  • Showing a section of the page
  • Expanding a section of the page

It cannot be used for

  • Loading Ajax content in a page

Example

[
  "a.load-javascript-controlled-content",
  "span.expand-section",
  "div.load-all-details"
]

columns

An Array of column_objects, one for each data attribute you want to get from a web page.

Example

[
  column_object_1, 
  column_object_2, 
  column_object_3
]


column_object

An Object that indicates a data attribute you want to get from the page. It has 5 variants. The Object contains the following attributes.

Attributes

Name                Required              Type                     Description
col_name            Required              String                   An arbitrary String that describes the data attribute represented by this column_object
dom_query           Conditional Required  String                   A valid jQuery CSS selector that describes the list of DOM Elements you want to get from a web page. Either dom_query or xpath must be present.
dom_container       Optional              String                   A valid jQuery CSS selector that describes the set of parent DOM Elements to extract the actual list of DOM Elements from. If no actual DOM Element is found within an existing parent DOM Element, an empty String is returned. dom_query must be present.
xpath               Conditional Required  String                   A valid XPath selector that describes the list of DOM Elements you want to get from a web page. Either xpath or dom_query must be present.
required_attribute  Optional              required_attribute_type  The attribute to get from the matching DOM Elements. If not indicated, defaults to the innerHTML of the matching DOM Elements.
column_type         Optional              column_type              Parses the extracted text for the corresponding text pattern
options             Optional              options_object           The nested page to get data from. When used, either
regex_pattern       Optional              String                   Regular expression pattern
regex_flag          Optional              String                   Regular expression flags
regex_group         Optional              Integer                  If a regular expression group is used, indicates the Nth matching value to return. If * is used instead of an Integer, all matching values are combined into one comma-separated String.
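The conditional rules in the attribute list can be checked mechanically. The Python sketch below is one possible check; `check_column` is a hypothetical helper and not part of the DataSource API. It accepts a column_object that carries both dom_query and xpath, since the text does not say whether the two may be combined.

```python
def check_column(column):
    """Return a list of rule violations for one column_object (hypothetical helper)."""
    problems = []
    if "col_name" not in column:
        problems.append("col_name is required")
    has_dom_query = "dom_query" in column
    has_xpath = "xpath" in column
    if not (has_dom_query or has_xpath):
        problems.append("either dom_query or xpath must be present")
    if "dom_container" in column and not has_dom_query:
        problems.append("dom_container requires dom_query")
    return problems


print(check_column({"col_name": "Column name", "dom_query": "div.product-name"}))
print(check_column({"col_name": "Price", "dom_container": "div.row", "xpath": "//td"}))
```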

Variant 1

Gets a list of innerHTML values from matching DOM Elements using a jQuery selector.

Example

{
  "col_name" : "Column name",
  "dom_query" : "div.product-name"
}

Variant 2

Gets a list of innerHTML values from matching DOM Elements using XPath.

Example

{
  "col_name" : "column name",
  "xpath" : "//xpath/to/elements"
}

Variant 3

Gets a list of DOM attribute values from matching DOM Elements.

Example

{
  "col_name" : "Price of Products",
  "xpath" : "//xpath/to/elements",
  "required_attribute" : required_attribute_type  
} 

Variant 4

Gets the URLs to the nested pages and then continues to get data from these pages.

Example

{
  "col_name" : "Product detailed page",
  "dom_query" : "a.detailed-url",
  "required_attribute" : "href",
  "options" : options_object
}

Variant 5

Gets the innerHTML from matching DOM Elements and keeps only the part of the String that matches a regular expression pattern.

Example

{
  "col_name" : "Product Price",
  "dom_query" : "div.product-description",
  "required_attribute" : "innerHTML",
  "regex_pattern" : "[0-9.]+",
  "regex_flag" : "gi",
  "regex_group" : 1
}
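To make the three regex attributes concrete, here is a Python sketch of how they might interact. It is an interpretation, not the crawler's actual implementation: `apply_regex` is a hypothetical name, regex_group is read as a 1-based index into the list of matches, and returning an empty String when there is no Nth match is an assumption.

```python
import re

# Flag letters mapped to Python equivalents. "g" has no Python flag
# because re.finditer already scans for every match.
FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}


def apply_regex(text, regex_pattern, regex_flag="", regex_group=1):
    """Extract part of text according to the regex attributes (hypothetical helper)."""
    flags = 0
    for letter in regex_flag:
        flags |= FLAG_MAP.get(letter, 0)
    matches = [m.group(0) for m in re.finditer(regex_pattern, text, flags)]
    if regex_group == "*":
        # Combine all matching values into one comma-separated String.
        return ",".join(matches)
    index = regex_group - 1  # assumption: the first match is number 1
    # Assumption: an empty String stands for "no Nth match".
    return matches[index] if 0 <= index < len(matches) else ""


text = "Price: 19.99 USD, was 24.50"
print(apply_regex(text, "[0-9.]+", "gi", 1))
print(apply_regex(text, "[0-9.]+", "gi", "*"))
```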


required_attribute

The required_attribute is a String that indicates the attribute to get from a DOM Element.

The values below are commonly available on DOM Elements. Other, less common attributes can also be used.

General Types

  • innerHTML
  • innerText
  • textContent
  • href
  • src

column_type

The column_type is a String that indicates the matching text pattern type to get from the extracted text.

The supported types are listed below.

Type     Description
phone    Finds and extracts valid phone numbers from text
email    Finds and extracts valid email addresses from text
numbers  Finds and extracts numbers from text, whether normalized or internationalized, based on the Accept-Language setting in the headers

options_object

The options_object is an Object that describes the data you want to get from a nested page (sub-page). It has 2 variants.
The Object has the following attributes.

Attributes

Name        Required              Type               Description
columns     Required              columns            An Array of column_objects
origin_url  Conditional Required  origin_url_object  The URL pattern to get data from (origin_url_object Variant 4)
next_page   Optional              next_page_object   An Object that describes the hyperlink to the next page of this listing
cookies     Optional              cookies            An Array that lists the cookies to use for the site
headers     Optional              headers_object     An Object that is sent as the request headers
wait        Optional              Integer            See wait for more details
to_click    Optional              to_click_object    An Array that lists the DOM Elements in the page to click before getting data

Variant 1

Crawls from a parent page to a nested page to get more data, following the href that is indicated as the required_attribute of the parent column_object.

Example

{
  "columns" : columns,
  "next_page" : next_page_object
} 

Variant 2

Jumps from a page to another arbitrary page, which need not be linked from it, by forming a new URL from the value of the parent column_object. href need not be indicated as the required_attribute of the parent column_object.

Example

{
  "origin_url" : origin_url_object,
  "columns" : columns,
  "next_page" : next_page_object
}
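The difference between the two variants comes down to whether the options_object carries its own origin_url. The Python sketch below classifies a column_object accordingly. `nested_page_variant` is a hypothetical helper, and treating any other combination as an error is an assumption, since the text does not say how such a definition is handled.

```python
def nested_page_variant(column):
    """Return 1 or 2 for the options_object variant in use, or None without options."""
    options = column.get("options")
    if options is None:
        return None  # this column does not lead to a nested page
    if "origin_url" in options:
        return 2     # a new URL is formed from the parent column's value
    if column.get("required_attribute") == "href":
        return 1     # the href extracted by the parent column is followed
    # Assumption: a nested page needs one of the two URL sources above.
    raise ValueError("options needs origin_url, or the column needs required_attribute href")


follow_link = {
    "col_name": "Product detailed page",
    "dom_query": "a.detailed-url",
    "required_attribute": "href",
    "options": {"columns": []},
}
print(nested_page_variant(follow_link))
```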


next_page_object

The next_page_object is an Object that describes the hyperlink to the next page in a listing page. It has 3 variants. The Object has the following attributes.

Attributes

Name       Required              Type     Description
dom_query  Conditional Required  String   A valid jQuery CSS selector that describes the list of DOM Elements you want to get from a web page. Either dom_query or xpath must be present.
xpath      Conditional Required  String   A valid XPath selector that describes the list of DOM Elements you want to get from a web page. Either xpath or dom_query must be present.
click      Optional              Boolean  When true, clicks on the DOM Element and detects the URL the web page then fetches data from, instead of extracting the href attribute of the DOM Element

Variant 1

This variant uses a jQuery selector to detect the link to the next page.

Example

{
  "dom_query" : ".listing_next_page"
}

Variant 2

This variant uses an XPath selector to detect the link to the next page.

Example

{
  "xpath" : "//xPath/to/next/page"
}

Variant 3

This variant gets the link to the next page by clicking on the DOM Element and detecting the URL it loads in the background.

Example

{
  "xpath" : "//xPath/to/next/page",
  "click" : true
}
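The three variants can be told apart from the attributes alone, as the Python sketch below illustrates. `next_page_variant` is a hypothetical helper. It assumes that click combined with dom_query is also Variant 3, and that dom_query takes precedence when both selectors are given; the text does not cover either case.

```python
def next_page_variant(next_page):
    """Return the variant number (1, 2, or 3) of a next_page_object."""
    has_dom_query = "dom_query" in next_page
    has_xpath = "xpath" in next_page
    if not (has_dom_query or has_xpath):
        raise ValueError("either dom_query or xpath must be present")
    if next_page.get("click", False) is True:
        return 3  # click the element and detect the URL it loads
    # Assumption: dom_query wins when both selectors are present.
    return 1 if has_dom_query else 2


print(next_page_variant({"dom_query": ".listing_next_page"}))
print(next_page_variant({"xpath": "//xPath/to/next/page", "click": True}))
```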


cookies

The cookies attribute is an Array of cookie_objects that are sent along in the request headers of each page request to emulate an authenticated user session.

Example

[
  cookie_object_1, 
  cookie_object_2, 
  cookie_object_3
]
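As a rough illustration of what sending cookies in the request header means, the Python sketch below builds the value of a Cookie header. It assumes a cookie_object has the domain, name, and value attributes shown in the example at the top of this page, and that a cookie applies to its own domain and to subdomains of it. `cookie_header` is a hypothetical helper, not part of the DataSource API.

```python
def cookie_header(cookies, host):
    """Build a Cookie header value from the cookie_objects that apply to host."""
    pairs = []
    for cookie in cookies:
        domain = cookie["domain"].lstrip(".")
        # Assumption: a cookie applies to its domain and to any subdomain.
        if host == domain or host.endswith("." + domain):
            pairs.append(cookie["name"] + "=" + cookie["value"])
    return "; ".join(pairs)


cookies = [
    {"domain": "localhost", "name": "__profilin", "value": "p%3Dt"},
    {"domain": "example.com", "name": "session", "value": "abc123"},
]
print(cookie_header(cookies, "localhost"))
print(cookie_header(cookies, "www.example.com"))
```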


disable_cookies

The disable_cookies attribute is a Boolean that, when set to true, disables the sending and receiving of cookies between our engine and the recipient server.


headers_object

The headers_object is an Object sent along as the request headers when making a request to a URL.
The full list of supported header fields can be found on Wikipedia.

Example

{
  "Accept-Language": "es",
  "X-Test": "foo",
  "DNT": "1"
}
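The method, post_data, and headers attributes together describe one HTTP request. The Python sketch below builds, without sending, an equivalent request with the standard library. This is only an illustration: the HTTP client the crawler actually uses is not documented here, and `build_request` is a hypothetical helper.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(definition, url):
    """Build an HTTP request from the method, post_data and headers attributes."""
    method = definition.get("method", "get").upper()  # documented default is get
    data = None
    if method == "POST" and "post_data" in definition:
        # Form values are sent as a URL-encoded request body.
        data = urlencode(definition["post_data"]).encode("ascii")
    return Request(url, data=data, headers=definition.get("headers", {}), method=method)


request = build_request(
    {
        "method": "post",
        "post_data": {"username": "My User Name", "password": "My Password"},
        "headers": {"Accept-Language": "es", "X-Test": "foo", "DNT": "1"},
    },
    "http://some_url.com",
)
print(request.get_method())
print(request.data)
print(request.header_items())
```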

variables

Variables are placeholders in the origin_url_object, written as @@SOME_COL_NAME@@, where SOME_COL_NAME is the col_name of an ancestor column_object. The value extracted for that column is used to form the new URL.

Example

{
  "origin_url" : "http://www.imdb.com/movies-in-theaters/",
  "columns": [
    {
      "col_name": "NEW_MOVIES_IN_THEATER",
      "dom_query": "h4[itemprop='name']",
      "options": {
        "origin_url" : "http://www.rottentomatoes.com/search/?sitesearch=rt&search=@@NEW_MOVIES_IN_THEATER@@",
        "columns": [
          {
            "col_name": "movie rating",
            "dom_query": ".tMeterScore"
          }
        ]
      }
    }
  ]
}
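The substitution in the example above can be sketched in Python as follows. `fill_variables` is a hypothetical helper, the movie title is made-up sample data, and URL-encoding the substituted value is an assumption; the documentation does not say how the crawler encodes it.

```python
import re
from urllib.parse import quote_plus

# A variable is any text wrapped in a pair of @@ markers.
VARIABLE = re.compile(r"@@(.+?)@@")


def fill_variables(url_pattern, values):
    """Replace each @@col_name@@ in url_pattern with the value for that col_name."""
    def replace(match):
        name = match.group(1)
        if name not in values:
            return match.group(0)      # leave unknown variables untouched
        return quote_plus(values[name])  # assumption: values are URL-encoded
    return VARIABLE.sub(replace, url_pattern)


pattern = "http://www.rottentomatoes.com/search/?sitesearch=rt&search=@@NEW_MOVIES_IN_THEATER@@"
print(fill_variables(pattern, {"NEW_MOVIES_IN_THEATER": "The Jungle Book"}))
```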