The Semantic Query Language

The Semantic Query Language (SQL) allows you to declare the data you need gathered from the web. Your declarations are called recipes (documents in JSON format). These recipes are interpreted by our decentralized ODBC engine. The engine fetches the desired data on your behalf

The SQL supports the following clauses.


Clauses

engine Optional engine_object Indicates which type of web crawler to use
origin_url Required origin_url_object An Object that list the URL(s) to get data from
headers Optional headers_object An Object that is sent as the REQUEST HEADER
method Optional method_object A String that indicates the HTTP REQUEST method to use for making the page load call
cookies Optional cookies An Array that list the Cookies to use for the site
disable_cookies Optional Boolean A Boolean value when set to true disables the sending and receiving of cookies from our engines
post_data Optional post_data_object A Object that indicates the form values to send along in a HTTP POST REQUEST
actions Optional actions An Array of action_objects that defines the list of actions to perform on the page
columns Required columns An Array of column_objects that defines the data to get from page
next_page Optional next_page_object An Object indicates the hyperlink to the next page of this listing
source_format Optional source_format_object A String that indicates what format the web page is served in
required_json_attr Optional required_json_attr_object A String that indicates the JSONattribute which the targeted HTML data is embedded in
groupings Optional groupings_object An Object that indicates groups attributes in your data set as sub-attributes for your JSON exports
order_by Optional order_by_object An Array that indicates the order in which records in your data should be sorted by

Sample Recipe

{
  "engine"      : "chrome",
  "origin_url"  : "http://some_url.com",
  "method"      : "post",
  "post_data"   : {
    "username"  : "My User Name",
    "password"  : "My Password"
  },
  "actions":    : [{
      "action_name": "scroll_bottom",
      "times": 10
    },{
      "action_name": "click",
      "dom_query": "button.show_more"
    },{
      "action_name": "click",
      "xpath": "//button[[contains(@class, 'show_more')]]"
    },{
      "action_name": "wait",
      "milliseconds": [1000, 5000]
  }],
  "columns"     : [{
      "col_name"  : "merchant name",
      "dom_query" : "td.merchant_name",
      "is_index"  : true
    },{
      "col_name"  : "transaction date",
      "dom_query" : "td.transaction_date"
    },{
      "col_name"  : "transaction name",
      "dom_query" : "td.transaction_name"  
    },{
      "col_name"  : "transaction amount",
      "dom_query" : "td.transaction_amount"      
  }],
  "next_page"   : {
    "dom_query" : "a.next_page"
  },
  "cookies"     : [{
    "domain": "localhost",
    "name": "__profilin",
    "value": "p%3Dt"
  }],
  "headers"     : {
    Referer: "http://localhost:3000/docs/define-data",
    User-Agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko)"
  },
  "groupings": [{
    "group_name"    : "transactions",
    "group_members" : [
      "transaction date", 
      "transaction name",
      "transaction amount"
    ]
  }],
  "order_by": [{
      "col_name" : "merchant name",
      "order": "asc"
    },{
      "col_name" : "transaction date",
      "order": "desc"
  }]  
}


engine_object

The engine_object is an optional clause that indicates the type of engine to get data with. There are currently 2 available engines to choose from

  • nokogiri - the Nokogiri engine is useful for extracting data from static html pages
  • chrome - the Chrome engine is useful for extracting data from pages that are Ajax heavy. it is experimental and does not support POST methods


origin_url_object

The origin_url_object, a required clause, is a String that indicates the URL(s) to get data from. It supports Dynamic variables.

Variant 1

The String indicating the URL to get data from.

Example

{
  "origin_url": "http://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=phones"
  ...
}

Variant 2

When used in a options_object, the String is constructed at run time with the use of Dynamic variables that are already collected.

Example

{
  "origin_url": "http://www.amazon.com/s/ref=nb_sb_noss?field-keywords={{some_col_name}}",
  ...
}


headers_object

The headers_object is an Object sent along as the Request Header when making a request to a URL.
The full list of support header attributes can be referenced on WikiPedia

Example

{
  ...
  "headers": {
    "Accept-Language": "es",
    "X-Test": "foo",
    "DNT": "1"
  },
  ...
}

method_object

The method_object is a String the indicates the HTTP method to use to make the page load call. Defaults to get if this not indicated.

Supported Methods

  • post
  • get

cookies

The cookies is an Array of cookie_objects sent along with the page request in the Request Header to emulate an Authenticated User Session.

Example

{
  ...
  "cookies": [
    cookie_object_1, 
    cookie_object_2, 
    cookie_object_3
  ],
  ...
}


disable_cookies

The disable_cookies is a Boolean when set to true disables the sending and getting of cookies from our engine to the receipient server.


post_data_object

The post_data_object is an Object that indicates the form data to be sent along in a HTTP POST REQUEST.

Example

{
  ...
  "post_data" : {
    "key_1": "value_1",
    "key_2": "value_2",
    "key_3": "value_3",
    ...
  },
  ...
}

actions

An Array of action_objects to sequences the list of actions you want perform on a web page before getting data from it

Only supported by the chrome engine

Example

{
  ...  
  "actions": [
    action_object_1, 
    action_object_2, 
    action_object_3
  ]
  ...
}


action_object

An Object that indicates the sequence of action to perform before you get data from the Page. It has 5 Variants. The Object contains the following attributes.

Attributes

action_name Required String The action to perform on this step. Support variables options include
  • click
  • insert
  • scroll_bottom
  • wait
  • trigger_change
  • goto
  • switch_window
xpath Conditional Required String A valid XPath Selector that describes the DOM Element you want to click on if action_name is click
dom_query Conditional Required String A valid CSS Selector that describes the DOM Element you want to click on if action_name is click. It supports Dynamic variables.
milliseconds Optional Integer The time to wait if action_name is wait or scroll_bottom. Defaults to 1000 milliseconds if not indicated.
times Optional Integer The number of times to scroll to the bottom if action_name is scroll_bottom. Defaults to 1 if not indicated.
origin_url Optional String This is used in conjuection with the goto action. It specifies the next URL to go within the same window after a prior action is completed.
title Optional String This is used in conjuection with the switch_window action. It specifies the title of the new window to focus on after a prior action is completed.

Variant 1

Scrolls to the bottom of the page to cause rendering of more elements by the page.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "scroll_bottom",
      "times": 10, // Number of times to scroll to the bottom of the page. Defaults to 1
      "milliseconds": 1000 // Time to wait before next action
    }
    ...
  ]
  ...
}

Variant 2

Click on matching DOM Elements with XPath.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "click",
      "dom_query": "CSS selector of element to click",
      "times": 1, // Numbers of times to click and wait before clicking again. Defaults to 1      
      "milliseconds": 1000 // time to wait before next action
    }
    ...
  ]
  ...
}

Variant 3

Click on matching DOM Elements with CSS Selector.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "click",
      "xpath": "XPath selector of element to click",
      "times": 1, // Numbers of times to click and wait before clicking again. Defaults to 1      
      "milliseconds": 1000 // Time to wait before next action
    }
    ...
  ]
  ...
}

Variant 4

Waits for specific period of time before next action.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "wait",
      "milliseconds": 1000 // Time to wait before next action
    }
    ...
  ]
  ...
}

Variant 5

Waits for random period of time within specific range before next action.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "wait",
      "milliseconds": [Integer, Integer] // Random time to wait before next action within specified range
    }
    ...
  ]
  ...
}

Variant 6

Inserting value into input field using CSS selectors

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "insert",
      "dom_query": "#userid",
      "value": "MY_USERID",
      "milliseconds": 1000 // Time to wait before next action
    }
    ...
  ]
  ...
}

Variant 7

Inserting value into input field using XPath selectors

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "insert",
      "xpath": "//input[contains(@id,'password')]",
      "value": "MY_PASSWORD",
      "milliseconds": 1000 // Time to wait before next action
    }
    ...
  ]
  ...
}

Variant 8

Triggering javascript change event on selected dom elements. This is useful for getting fresh data to load after a new option is selected in a dropdown

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "trigger_change",
      "dom_query": "CSS selector of element to receive event",
      "milliseconds": 1000 // Time to wait before next action
    }
    ...
  ]
  ...
}

Variant 9

Directing the crawler to go another URL after completing prior actions on the current page. This is useful for sites where modify your account settings page before fetching data from another part of the site.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "goto",
      "origin_url": "The next URL to go to"
    }
    ...
  ]
  ...
}

Variant 10

Directing the crawler to focus on another window in the browser. This is useful for sites where actions trigger a popup window and the data you want is within the popup window.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "switch_window",
      "title": "Matrix"
    }
    ...
  ]
  ...
}


columns

An Array of column_objects to get all the data attributes you want from a web page. It supports Dynamic variables.

Example

{
  ...  
  "columns": [
    column_object_1, 
    column_object_2, 
    column_object_3
  ]
  ...
}


column_object

An Object that indicates a Data Attribute you want to get from the Page. It has 6 Variants. The Object contains the following attributes.

Attributes

col_name Required String An arbitrary string that describes the data attribute represented by this column_object
dom_query Conditional Required String A valid jQuery CSS Selector that describes the list of DOM Elements you want to get from a web page.
Either dom_query, xpath or row_number must be present
outer_dom_query Optional String A valid jQuery CSS Selector that describes the list of DOM Containers to fetch DOM Elements you want to get from a web page.
Is active on when dom_query is present
xpath Conditional Required String A valid XPath Selector that describes the list of DOM Elements you want to get from a web page. Either dom_query, xpath or row_number must be present
outer_xpath Optional String A valid XPath Selector that describes the list of DOM Containers to fetch DOM Elements you want to get from a web page. Is active on when xpath is present
is_index Optional Boolean Indicates it values in this column should be used as the primary key for this data set. It affects how our Engine interprets the groupings_object you provide in your recipe
required_attribute Optional required_attribute_type The attribute to get from the matching DOM Elements. If not indicated, defaults to the innerHTML of the matching DOM Elements.
column_type Optional column_type Parses the extracted text for the corresponding matching text pattern.
options Optional options_object The nested page to get data from. When used, either
regex_pattern Optional String Regular Expression Pattern that is global and case insensitive.
regex_group Optional Integer If a regular expression group is used, indicates the Nth matching value to return.
If * is used instead of an Integer, combines all matching values into a String together in a comma separated format.
prefix Optional String Prepends the string provided to the beginning of the column that was collected from the page.
postfix Optional String Appends the string provided to the end of the column that was collected from the page.
listagg Optional Boolean

If set to true, collapses all the values of this column on the page into a single line.

false by default.

listagg_delimiter Optional String

If provided, uses the string provide as the delimiter between the values of this column in a listagg operation

"" by default.

row_number Conditional Required Boolean When flagged as true, a sequential integer is assigned to this column for each row of result from a page. Either dom_query, xpath or row_number must be present
order_by Optional order_by_object Used in conjunction with the row_number, determines the order in which sequential integers should be assigned. When omitted, the natural order in which records are gathered is assumed, typically from the top to the bottom of the page.
partition_by Optional Array ofcol_names Used in conjunction with the row_number, divides rows of results from a page into groups. When omitted, all rows of results gathered from the page is in a single partition.

Variant 1

Gets a list of innerHTML from matching DOM Elements with jQuery selectors.

Example

{
  ...
  "columns": [{ 
    "col_name" :  'Column name',
    "dom_query" : 'div.product-name'  
  }]
  ...
}

Variant 2

Gets a list of innerHTML from matching DOM Elements with XPath.

Example

{
  ...
  "columns": [{
    "col_name" : "column name",
    "xpath" : '//xpath/to/elements'
  }]
  ...
}

Variant 3

Gets a list of DOM Attribute from matching DOM Elements.

Example

{
  ...
  "columns": [{
    "col_name" : "Price of Products",
    "xpath" : "//xpath/to/elements",
    "required_attribute" : required_attribute_type  
  }]
  ...
}

Variant 4

Gets the URLs to the nested pages and then continues to get data from these pages.

Example

{
  ...
  "columns": [{
    "col_name" : 'Product detailed page',
    "dom_query" : 'a.detailed-url',
    "required_attribute" : 'href',
    "options" : options_object
  }]
  ...
}

Variant 5

Gets the innerHTML from matching DOM Elements and only gets a subset of the String that matches a Regex Pattern

Example

{
  ...
  "columns": [{
    "col_name" : 'Product Price',
    "dom_query" : 'div.product-description',
    "required_attribute" : "innerHTML",
    "regex_pattern" : "[0-9.]+",
    "regex_group" : 1
  }]
  ...
}

Variant 6

Applies row_number to a column for rows of results gathered from a page.

Example

{
  ...
  "columns": [{
      "col_name" : 'product_ranking',
      "row_number" : true,
      "partition_by": ['product_category', 'product_sub_category'],
      "order_by": [{
        "col_name": "product_price",
        "order": "desc"
      },{
        "col_name": "product_rating",
        "order": "desc"
      }]
    },
    ...
    {
      "col_name" : 'product_category',
      "dom_query" : '.category'
    },{
      "col_name" : 'product_sub_category',
      "dom_query" : '.sub-category'  
    },{
      "col_name" : 'product_name',
      "dom_query" : '.prod-name'  
    },{
      "col_name" : 'product_price',
      "dom_query" : '.prod-price'  
    },{
      "col_name" : 'product_rating',
      "dom_query" : '.prod-rating'  
  }]
  ...
}


required_attribute

The required_attribute is a String that indicates the attribute to get from a DOM Element.

These are the general String values available in normal DOM Elements on top of other non-common ones.

General Types

  • inner_html
  • outer_html
  • text
  • href
  • src

column_type

The column_type is a String that indicates the matching text pattern type to get from the extracted text.

The support types are listed below

phone Finds and extract valid phone numbers from text
email Finds and extract valid email address from text
numbers Finds and extract numbers from text be they normalized or internationalized based on settings of accepted-language in the headers

next_page_object

The next_page_object is an Object that describes the hyperlink to the next page in an listing page. It has 2 variants. The Object has the following attributes.

Attributes

dom_query Conditional Required String A valid jQuery CSS Selector that describes the list of DOM Elements you want to get from a web page.
Either dom_query or xpath must be present
xpath Conditional Required String A valid XPath Selector that describes the list of DOM Elements you want to get from a web page. Either xpath or dom_query must be present
persist Optional Boolean Only supported by the chrome engine. A Boolean when true loads the next page within the same Window by performing an actual click on the described DOM Element. This is useful for handling paginations on sites where the next page is loaded via Ajax.
refresh Optional Boolean Only supported by the chrome engine. This is used in conjunction with persist. The page is refreshed after the next page is loaded. This is useful for clearing out javascript memory leaks in pages. It cannot be used for Ajax enabled paginations that reset upon refresh. It defaults to false
milliseconds Optional Integer Only supported by the chrome engine. This is used in conjunction with persist to ensure the contents from the next page is loaded before we execute the columns clause. It defaults to a one second wait time if not declared.

Variant 1

This variant uses jQuery selector to detect the link to the next page

Example

{
  ...
  "next_page": {  
    "dom_query" : '.listing_next_page' 
  },
  ...
}

Variant 2

This variant uses the xPath selector to detect the link to the next page

Example

{
  ...
  "next_page": {
    "xpath" : '//xPath/to/next/page' 
  },  
}

Variant 3

This variant clicks on the link next page within the same browser window.

Example

{
  ...
  "next_page": {  
    "dom_query" : "#pagination .next_page a",
    "presist" : true,
    "refresh": true,
    "milliseconds": 1000
  },
  ...
}


source_format

The source_format is a String that indicates the format a web page is served in

Supported formats

  • html - the data on the page is in HTML format
  • json - the data on the page is in JSON format
  • json_attr - the data on the page has HTML format embedded in JSON format

required_json_attr

The required_json_attr is a String which indicates the JSON attribute to extract HTML data from. It only works if the source_format_object is also provided in the recipe


groupings_object

The groupings_object is an Array . It allows you to cluster attributes as sub-attributes in your JSON export. For it to work, you will need to indicate is_index as true for one column_object.

Sample Input

{
  ...
  "columns": [{
      "col_name": "Author",
      "dom_query": ".author",
      "is_index": true
    },{
      "col_name": "Book name",
      "dom_query": ".book_name"
    },{
      "col_name": "Published date",
      "dom_query": ".published_at"
    }],{
      "col_name": "Rater Name",
      "dom_query": ".rater"
    },{
      "col_name": "Rating",
      "dom_query": ".rating"
  }],
  ...
  "groupings": [{
      "group_name" : "publised_books",
      "group_members": ["Book name", "Published date"]
    },{
      "group_name" : "ratings",
      "group_members": ["Rater Name", "Rating"]
  }]
  ...
}

order_by_object

The order_by is an Array . It allows you to that indicates the order in which records in your data should be sorted by.

Attributes

col_name Required String A String that references the column to sort the records in the data set by
order Optional String A String that indicates the order in which the records should be sorted by. When not provided, it defaults to ascending.

Supported

  • asc
  • desc

Sample Input

{
  ...
  "order_by": [{
      "col_name": "Author name",
      "order": "asc"
    },{
      "col_name": "Rating",
      "order": "desc"
  }]
  ...
}

variables

Variables correspond to col_names within the recipe whose values have already been gathered. When utilized the attributes, their corresponding values are constructed dynamically. They are written in the format {{SOME_PATTERN}}.

Example usage in the origin_url_object

{
    "engine": "nokogiri",
    "client_version": "4.0.20",
    "origin_url": "https://www.imdb.com/title/tt0119177/fullcredits",
    "columns": [
        {
            "col_name": "CAST_NAME",
            "dom_query": "table.cast_list tr td:nth-child(2).itemprop a span.itemprop",
            "options": {
                "origin_url": "https://www.google.com/search?q={{CAST_NAME}}+date+of+birth",
                "columns": [
                    {
                        "col_name": "date_of_birth",
                        "dom_query": "div[role='heading'][aria-level='3'].HwtpBd.kno-fb-ctx div:nth-child(1)"
                    }
                ]
            }
        }
    ]
}

Example usage in the column_object.dom_query

{
    "engine": "nokogiri",
    "client_version": "4.0.20",
    "origin_url": "https://sanfrancisco.postalcodes.com",
    "columns": [
        {
            "col_name": "POSTAL_CODE",
            "dom_query": "td.postal_code",
            "options": {
                "origin_url": "https://real-estate-agents.com",
                "columns": [
                    {
                        "col_name": "real_estate_agent",
                        "dom_query": "tr:contains('{{POSTAL_CODE}}') td.agent_name"
                    }
                ]
            }
        }
    ]
}

Example usage in the actions.dom_query

{
    "engine": "nokogiri",
    "client_version": "4.0.20",
    "origin_url": "https://real-estate-agents.com",
    "columns": [
        {
            "col_name": "AGENT_NAME",
            "dom_query": "td.agent_name",
            "options": {
                "origin_url": "https://real-estate-agents.com",
                "actions": [{
                    "action_name": "click",
                    "dom_query": "tr:contains('{{AGENT_NAME}}') td button.show_contacts"
                }],
                "columns": [
                    {
                        "col_name": "phone_number",
                        "dom_query": "tr:contains('{{AGENT_NAME}}') td.contact_details"
                    }
                ]
            }
        }
    ]
}


options_object

The options_object is an Object that describes the data you want to get in a nested/sub-page. It has 2 variants.
The Object has the following attributes

Attributes

columns Required columns An Array of column_objects
origin_url Conditional Required origin_url_object An Object that list the URL(s) to get data from: Variant 4
next_page Optional next_page_object An Object indicates the hyperlink to the next page of this listing
cookies Optional cookies_object An Array that list the Cookies to use for the site
headers Optional headers_object An Object that is sent as the REQUEST HEADER
wait Optional Integer See wait_object for more details
to_click Optional to_click_object An Array that list the DOM Elements in the page to click before getting data

Variant 1

Crawls from a parent page to a nested page to get more data via the href that is indicated as the required_attribute of the parent column_object

Example

{
  ...
  "columns" : [{
    "col_name" : "product_page",
    "dom_query" : ".product a",
    "required_attribute" : "href",
    "options" : {
      "columns" : columns,
      "next_page" : next_page_object
    } 
  }],
  ...
}

Variant 2

Jumps from a page to another arbitrary page that is not necessarily linked by forming a new URL using the value from parent column_object. href need not be indicated as the required_attribute of the parent column_object.

Example

{
  ...
  "columns" : [{
    "col_name" : "product_name",
    "dom_query" : ".product",
    "options" : {
      "origin_url" : "https://www.google.com/search?q={{product_name}}+price",
      "columns" : columns,
      "next_page" : next_page_object,  
    }
  }],
  ...
} 


Next Steps

To find out more about how GetData.IO works