The Semantic Query Language

The Semantic Query Language (SQL) allows you to declare the data you need gathered from the web. Your declarations are called recipes (documents in JSON format). These recipes are interpreted by our decentralized ODBC engine. The engine fetches the desired data on your behalf

The SQL supports the following clauses.


Clauses

engine Optional engine_object Indicates which type of web crawler to use
origin_url Required origin_url_object An Object that list the URL(s) to get data from
headers Optional headers_object An Object that is sent as the REQUEST HEADER
method Optional method_object A String that indicates the HTTP REQUEST method to use for making the page load call
cookies Optional cookies An Array that list the Cookies to use for the site
disable_cookies Optional Boolean A Boolean value when set to true disables the sending and receiving of cookies from our engines
post_data Optional post_data_object A Object that indicates the form values to send along in a HTTP POST REQUEST
actions Optional actions An Array of action_objects that defines the list of actions to perform on the page
columns Required columns An Array of column_objects that defines the data to get from page
next_page Optional next_page_object An Object indicates the hyperlink to the next page of this listing
source_format Optional source_format_object A String that indicates what format the web page is served in
required_json_attr Optional required_json_attr_object A String that indicates the JSONattribute which the targeted HTML data is embedded in
groupings Optional groupings_object An Object that indicates groups attributes in your data set as sub-attributes for your JSON exports

Sample Recipe

{
  "engine"      : "chrome",
  "origin_url"  : "http://some_url.com",
  "method"      : "post",
  "post_data"   : {
    "username"  : "My User Name",
    "password"  : "My Password"
  },
  "actions":    : [{
    "action_name": "scroll_bottom",
    "times": 10
  },{
    "action_name": "click",
    "dom_query": "button.show_more"
  },{
    "action_name": "click",
    "xpath": "//button[[contains(@class, 'show_more')]]"
  },{
    "action_name": "wait",
    "milliseconds": [1000, 5000]
  }],
  "columns"     : [{
    "col_name"  : "merchant name",
    "dom_query" : "td.merchant_name",
    "is_index"  : true
  },{
    "col_name"  : "transaction date",
    "dom_query" : "td.transaction_date"
  },{
    "col_name"  : "transaction name",
    "dom_query" : "td.transaction_name"  
  },{
    "col_name"  : "transaction amount",
    "dom_query" : "td.transaction_amount"      
  }],
  "next_page"   : {
    "dom_query" : "a.next_page"
  },
  "cookies"     : [{
    "domain": "localhost",
    "name": "__profilin",
    "value": "p%3Dt"
  }],
  "headers"     : {
    Referer: "http://localhost:3000/docs/define-data",
    User-Agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko)"
  },
  "groupings": [{
    "group_name"    : "transactions",
    "group_members" : [
      "transaction date", 
      "transaction name",
      "transaction amount"
    ]
  }]
}


engine_object

The engine_object is an optional clause that indicates the type of engine to get data with. There are currently 2 available engines to choose from

  • nokogiri - the Nokogiri engine is useful for extracting data from static html pages
  • chrome - the Chrome engine is useful for extracting data from pages that are Ajax heavy. it is experimental and does not support POST methods


origin_url_object

The origin_url_object is a required clause that indicates the URL(s) to get data from. It has 3 variations

Variant 1

The URL Strings to get data from.

Example

{
  "origin_url": "http://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=phones"
  ...
}

Variant 2

An array of URL Strings to get data from.

Example

{
  "origin_url": [
    "http://www.amazon.com/s/ref=nb_sb_noss?field-keywords=phones",
    "http://www.amazon.com/s/ref=nb_sb_noss?field-keywords=android",
    "http://www.amazon.com/s/ref=nb_sb_noss?field-keywords=blackberry"
  ],
  ...
}

Variant 3

An Object with the following attributes

Attributes

origin_pattern Required String A String that describes the general pattern of the list of URLs you want to
get data from with {{origin_value}} parameter indicating the replacable value of the URL
origin_value Required Array An Array of Strings that indicate the value to replace into the {{origin_value}} parameter

Example

{  
  "origin_pattern" : "http://www.amazon.com/s/ref=nb_sb_noss?field-keywords={{origin_value}}",
  "origin_value" : [
    "Iphones",
    "blackberry",        
    "Android",
    "Nokia"
  ],
  ...
} 

Variant 4

The URL String pattern to get data from. It is used only in the options_object

Example

{
  "origin_url": "http://www.amazon.com/s/ref=nb_sb_noss?field-keywords={{some_col_name}}",
  ...
}


headers_object

The headers_object is an Object sent along as the Request Header when making a request to a URL.
The full list of support header attributes can be referenced on WikiPedia

Example

{
  ...
  "headers": {
    "Accept-Language": "es",
    "X-Test": "foo",
    "DNT": "1"
  },
  ...
}

method_object

The method_object is a String the indicates the HTTP method to use to make the page load call. Defaults to get if this not indicated.

Supported Methods

  • post
  • get

cookies

The cookies is an Array of cookie_objects sent along with the page request in the Request Header to emulate an Authenticated User Session.

Example

{
  ...
  "cookies": [
    cookie_object_1, 
    cookie_object_2, 
    cookie_object_3
  ],
  ...
}


disable_cookies

The disable_cookies is a Boolean when set to true disables the sending and getting of cookies from our engine to the receipient server.


post_data_object

The post_data_object is an Object that indicates the form data to be sent along in a HTTP POST REQUEST.

Example

{
  ...
  "post_data" : {
    "key_1": "value_1",
    "key_2": "value_2",
    "key_3": "value_3",
    ...
  },
  ...
}

actions

An Array of action_objects to sequences the list of actions you want perform on a web page before getting data from it

Only supported by the chrome engine

Example

{
  ...  
  "actions": [
    action_object_1, 
    action_object_2, 
    action_object_3
  ]
  ...
}


action_object

An Object that indicates the sequence of action to perform before you get data from the Page. It has 5 Variants. The Object contains the following attributes.

Attributes

action_name Required String The action to perform on this step. Support variables options include
  • click
  • wait
  • scroll_bottom
xpath Conditional Required String A valid XPath Selector that describes the DOM Element you want to click on if action_name is click
dom_query Conditional Required String A valid CSS Selector that describes the DOM Element you want to click on if action_name is click
milliseconds Optional Integer The time to wait if action_name is wait or scroll_bottom. Defaults to 1000 milliseconds if not indicated.
times Optional Integer The number of times to scroll to the bottom if action_name is scroll_bottom. Defaults to 1 if not indicated.

Variant 1

Scrolls to the bottom of the page to cause rendering of more elements by the page.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "scroll_bottom",
      "times": 10 # Number of times to scroll to the bottom of the page. Defaults to 1
    }
    ...
  ]
  ...
}

Variant 2

Click on matching DOM Elements with XPath.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "click",
      "dom_query": "CSS selector of element to click"
    }
    ...
  ]
  ...
}

Variant 3

Click on matching DOM Elements with CSS Selector.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "click",
      "xpath": "XPath selector of element to click"
    }
    ...
  ]
  ...
}

Variant 4

Waits for specific period of time before next action.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "wait",
      "milliseconds": 1000 // Seconds to wait before next action
    }
    ...
  ]
  ...
}

Variant 5

Waits for random period of time within specific range before next action.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "wait",
      "milliseconds": [Integer, Integer]
    }
    ...
  ]
  ...
}


columns

An Array of column_objects to get all the data attributes you want from a web page

Example

{
  ...  
  "columns": [
    column_object_1, 
    column_object_2, 
    column_object_3
  ]
  ...
}


column_object

An Object that indicates a Data Attribute you want to get from the Page. It has 6 Variants. The Object contains the following attributes.

Attributes

col_name Required String An arbitrary string that describes the data attribute represented by this column_object
dom_query Conditional Required String A valid jQuery CSS Selector that describes the list of DOM Elements you want to get from a web page.
Either dom_query or xpath must be present
is_index Optional Boolean Indicates it values in this column should be used as the primary key for this data set. It affects how our Engine interprets the groupings_object you provide in your recipe
xpath Conditional Required String A valid XPath Selector that describes the list of DOM Elements you want to get from a web page. Either xpath or dom_query must be present
required_attribute Optional required_attribute_type The attribute to get from the matching DOM Elements. If not indicated, defaults to the innerHTML of the matching DOM Elements.
column_type Optional column_type Parses the extracted text for the corresponding matching text pattern.
options Optional options_object The nested page to get data from. When used, either
regex_pattern Optional String Regular Expression Pattern
regex_flag Optional String Regular Expression Flags
regex_group Optional Integer If a regular expression group is used, indicates the Nth matching value to return.
If * is used instead of an Integer, combines all matching values into a String together in a comma separated format.
prefix Optional String Prepends the string provided to the beginning of the column that was collected from the page.
postfix Optional String Appends the string provided to the end of the column that was collected from the page.
listagg Optional Boolean

If set to true, collapses all the values of this column on the page into a single line.

false by default.

listagg_delimiter Optional String

If provided, uses the string provide as the delimiter between the values of this column in a listagg operation

"" by default.

Variant 1

Gets a list of innerHTML from matching DOM Elements with jQuery selectors.

Example

{
  ...
  "columns": [{ 
    "col_name" :  'Column name',
    "dom_query" : 'div.product-name'  
  }]
  ...
}

Variant 2

Gets a list of innerHTML from matching DOM Elements with XPath.

Example

{
  ...
  "columns": [{
    "col_name" : "column name",
    "xpath" : '//xpath/to/elements'
  }]
  ...
}

Variant 3

Gets a list of DOM Attribute from matching DOM Elements.

Example

{
  ...
  "columns": [{
    "col_name" : "Price of Products",
    "xpath" : "//xpath/to/elements",
    "required_attribute" : required_attribute_type  
  }]
  ...
}

Variant 4

Gets the URLs to the nested pages and then continues to get data from these pages.

Example

{
  ...
  "columns": [{
    "col_name" : 'Product detailed page',
    "dom_query" : 'a.detailed-url',
    "required_attribute" : 'href',
    "options" : options_object
  }]
  ...
}

Variant 5

Gets the innerHTML from matching DOM Elements and only gets a subset of the String that matches a Regex Pattern

Example

{
  ...
  "columns": [{
    "col_name" : 'Product Price',
    "dom_query" : 'div.product-description',
    "required_attribute" : "innerHTML",
    "regex_pattern" : /[0-9.]+/gi,
    "regex_group" : 1
  }]
  ...
}


required_attribute

The required_attribute is a String that indicates the attribute to get from a DOM Element.

These are the general String values available in normal DOM Elements on top of other non-common ones.

General Types

  • innerHTML
  • innerText
  • textContent
  • href
  • src

column_type

The column_type is a String that indicates the matching text pattern type to get from the extracted text.

The support types are listed below

phone Finds and extract valid phone numbers from text
email Finds and extract valid email address from text
numbers Finds and extract numbers from text be they normalized or internationalized based on settings of accepted-language in the headers

next_page_object

The next_page_object is an Object that describes the hyperlink to the next page in an listing page. It has 2 variants. The Object has the following attributes.

Attributes

dom_query Conditional Required String A valid jQuery CSS Selector that describes the list of DOM Elements you want to get from a web page.
Either dom_query or xpath must be present
xpath Conditional Required String A valid XPath Selector that describes the list of DOM Elements you want to get from a web page. Either xpath or dom_query must be present
click Optional Boolean A Boolean when true clicks on a DOM Element to get the subsequent URL a web page is fetching data from instead of extracting the href attribute of the DOM Element

Variant 1

This variant uses jQuery selector to detect the link to the next page

Example

{
  ...
  "next_page": {  
    "dom_query" : '.listing_next_page' 
  },
  ...
}

Variant 2

This variant uses the xPath selector to detect the link to the next page

Example

{
  ...
  "next_page": {
    "xpath" : '//xPath/to/next/page' 
  },  
}

Variant 3

This variant gets the link to the next page by clicking on the DOM Element and detecting the URL it loads in the background.

Example

{
  ...
  "next_page": {  
    "xpath" : '//xPath/to/next/page',
    "click" : true
  },
  ...
}


source_format

The source_format is a String that indicates the format a web page is served in

Supported formats

  • html - the data on the page is in HTML format
  • json - the data on the page is in JSON format
  • json_attr - the data on the page has HTML format embedded in JSON format

required_json_attr

The required_json_attr is a String which indicates the JSON attribute to extract HTML data from. It only works if the source_format_object is also provided in the recipe


groupings_object

The groupings_object is an Array . It allows you to cluster attributes as sub-attributes in your JSON export. For it to work, you will need to indicate is_index as true for one column_object.

Sample Input

{
  ...
  "columns": [{
    "col_name": "Author",
    "dom_query": ".author",
    "is_index": true
  },{
    "col_name": "Book name",
    "dom_query": ".book_name"
  },{
    "col_name": "Published date",
    "dom_query": ".published_at"
  }],{
    "col_name": "Rater Name",
    "dom_query": ".rater"
  },{
    "col_name": "Rating",
    "dom_query": ".rating"
  }],
  ...
  "groupings": [{
    "group_name" : "publised_books",
    "group_members": ["Book name", "Published date"]
  },{
    "group_name" : "ratings",
    "group_members": ["Rater Name", "Rating"]
  }]
  ...
}

variables

Variables are col_names in the origin_url_object that are written like {{SOME_PATTERN}} where patterns correspond to the col_name values in an ancestor column_object

Example

{
    "engine": "nokogiri",
    "client_version": "4.0.20",
    "origin_url": "https://www.imdb.com/title/tt0119177/fullcredits",
    "columns": [
        {
            "col_name": "cast_name",
            "dom_query": "table.cast_list tr td:nth-child(2).itemprop a span.itemprop",
            "options": {
                "origin_url": "https://www.google.com/search?q={{cast_name}}+date+of+birth",
                "columns": [
                    {
                        "col_name": "date_of_birth",
                        "dom_query": "div[role='heading'][aria-level='3'].HwtpBd.kno-fb-ctx div:nth-child(1)"
                    }
                ]
            }
        }
    ]
}

options_object

The options_object is an Object that describes the data you want to get in a nested/sub-page. It has 2 variants.
The Object has the following attributes

Attributes

columns Required columns An Array of column_objects
origin_url Conditional Required origin_url_object An Object that list the URL(s) to get data from: Variant 4
next_page Optional next_page_object An Object indicates the hyperlink to the next page of this listing
cookies Optional cookies_object An Array that list the Cookies to use for the site
headers Optional headers_object An Object that is sent as the REQUEST HEADER
wait Optional Integer See wait_object for more details
to_click Optional to_click_object An Array that list the DOM Elements in the page to click before getting data

Variant 1

Crawls from a parent page to a nested page to get more data via the href that is indicated as the required_attribute of the parent column_object

Example

{
  ...
  "columns" : [{
    "col_name" : "product_page",
    "dom_query" : ".product a",
    "required_attribute" : "href",
    "options" : {
      "columns" : columns,
      "next_page" : next_page_object
    } 
  }],
  ...
}

Variant 2

Jumps from a page to another arbitrary page that is not necessarily linked by forming a new URL using the value from parent column_object. href need not be indicated as the required_attribute of the parent column_object.

Example

{
  ...
  "columns" : [{
    "col_name" : "product_name",
    "dom_query" : ".product",
    "options" : {
      "origin_url" : "https://www.google.com/search?q={{product_name}}+price",
      "columns" : columns,
      "next_page" : next_page_object,  
    }
  }],
  ...
} 


Next Steps

To find out more about how GetData.IO works