The Semantic Query Language (SQL) allows you to declare the data you need gathered from the web.
Your declarations are called recipes (documents in JSON
format).
These recipes are interpreted by our decentralized ODBC engine. The engine fetches the desired data on your behalf
The SQL supports the following clauses.
name | Optional String | Artibrary description you use to describe the level of in the site hierarchy this crawler is at for easier debugging purposes |
---|---|---|
engine | Optional engine_object |
Indicates which type of web crawler to use |
origin_url | Required origin_url_object |
An Object that list the URL(s) to get data from |
headers | Optional headers_object |
An Object that is sent as the REQUEST HEADER |
method | Optional method_object |
A String that indicates the HTTP REQUEST method to use for making the page load call |
cookies | Optional cookies |
An Array that list the Cookies to use for the site |
disable_cookies | Optional Boolean | A Boolean value when set to true disables the sending and receiving of cookies from our engines |
post_data | Optional post_data_object |
A Object that indicates the form values to send along in a HTTP POST REQUEST |
actions | Optional actions |
An Array of action_object s that defines the list of actions to perform on the page |
columns | Required columns |
An Array of column_object s that defines the data to get from page |
next_page | Optional next_page_object |
An Object indicates the hyperlink to the next page of this listing |
source_format | Optional source_format_object |
A String that indicates what format the web page is served in |
required_json_attr | Optional required_json_attr_object |
A String that indicates the JSON attribute which the targeted HTML data is embedded in |
groupings | Optional groupings_object |
An Object that indicates groups attributes in your data set as sub-attributes for your JSON exports |
order_by | Optional order_by_object |
An Array that indicates the order in which records in your data should be sorted by |
{ "engine" : "chrome", "origin_url" : "http://some_url.com", "method" : "post", "post_data" : { "username" : "My User Name", "password" : "My Password" }, "actions": : [{ "action_name": "scroll_bottom", "times": 10 },{ "action_name": "click", "dom_query": "button.show_more" },{ "action_name": "click", "xpath": "//button[[contains(@class, 'show_more')]]" },{ "action_name": "wait", "milliseconds": [1000, 5000] }], "columns" : [{ "col_name" : "merchant name", "dom_query" : "td.merchant_name", "is_index" : true },{ "col_name" : "transaction date", "dom_query" : "td.transaction_date" },{ "col_name" : "transaction name", "dom_query" : "td.transaction_name" },{ "col_name" : "transaction amount", "dom_query" : "td.transaction_amount" }], "next_page" : { "dom_query" : "a.next_page" }, "cookies" : [{ "domain": "localhost", "name": "__profilin", "value": "p%3Dt" }], "headers" : { Referer: "http://localhost:3000/docs/define-data", User-Agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko)" }, "groupings": [{ "group_name" : "transactions", "group_members" : [ "transaction date", "transaction name", "transaction amount" ] }], "order_by": [{ "col_name" : "merchant name", "order": "asc" },{ "col_name" : "transaction date", "order": "desc" }] }
The engine_object
is an optional clause that indicates the type of engine to get data with.
There are currently 2 available engines to choose from
nokogiri
- the Nokogiri engine is useful for extracting data from static html pageschrome
- the Chrome engine is useful for extracting data from pages that are Ajax heavy. it is experimental and does not support POST methodspdf_reader
- the Pdf Reader engine is used for extracting data from PDF documents published on websitestext_reader
- the Text Reader engine is used for extracting data from unstructured text files published on websites
The origin_url_object
, a required clause, is a String that indicates the URL(s) to get data from.
It supports Dynamic variables
.
{ "origin_url": "http://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=phones" ... }
options_object
, the String is constructed at run time with the use of Dynamic variables
that are already collected.
{ "origin_url": "http://www.amazon.com/s/ref=nb_sb_noss?field-keywords={{some_col_name}}", ... }
The headers_object
is an Object sent along as the Request Header when making a request to a URL.
The full list of support header attributes can be referenced on
WikiPedia
{ ... "headers": { "Content-Type": "application/json", "Accept-Language": "es", "X-Test": "foo", "DNT": "1" }, ... }
The method_object
is a String the indicates the HTTP method to use to make the page load call. Defaults to
get
if this not indicated.
post
get
The post_data_object
is an Object that indicates the form data to be sent along in a HTTP POST REQUEST.
{ ... "post_data" : { "key_1": "value_1", "key_2": "value_2", "key_3": "value_3", ... }, ... }
An Array of action_object
s to sequences the list of actions you want perform on a web page
before getting data from it
Only supported by the chrome
engine
{
...
"actions": [
action_object_1,
action_object_2,
action_object_3
]
...
}
An Object that indicates the sequence of action to perform before you get data from the Page. It has 5 Variants. The Object contains the following attributes.
action_name | Required String |
The action to perform on this step. Support variables options include
|
---|---|---|
xpath | Conditional Required String |
A valid XPath Selector that describes the DOM Element you want to click on if action_name is click
|
dom_query | Conditional Required String |
A valid CSS Selector that describes the DOM Element you want to click on if action_name is click .
It supports Dynamic variables .
|
milliseconds | Optional Integer |
The time to wait if action_name is wait or scroll_bottom . Defaults to 1000 milliseconds if not indicated.
|
times | Optional Integer |
The number of times to scroll to the bottom if action_name is scroll_bottom . Defaults to 1 if not indicated.
|
origin_url | Optional String |
This is used in conjuection with the goto action. It specifies the next URL to go within the same window after a prior action is completed.
|
title | Optional String |
This is used in conjuection with the switch_window action. It specifies the title of the new window to focus on after a prior action is completed.
|
scroll_bottom | Optional Boolean |
This is used in conjuection with the click action. It scrolls the window to the bottom of the document after the click event occurs.
It defaults to false .
|
scroll_pixel | Optional Integer |
This is used in conjunction with the scroll_bottom action. It scrolls to down the targeted dom element by the value of pixel indicated.
It defaults to the bottom of the dom element.
|
gather | Optional Boolean |
This is used in conjunction with the scroll_bottom action. It gathers the results in between each scroll event. It is very useful for tackling the situation when elements at the top of the list are removed during scrolling.
|
gather_skip | Optional Integer |
The number of times to skip for each scroll_bottom action before gathering the results.
It is very useful for tackling the situation when the inifinite scroll page has a large volume of records.
This helps reduce waiting time before the crawler completely scrolls through all the records on the page.
|
{ ... "actions": [ ... { "action_name": "scroll_bottom", "times": 10, // Number of times to scroll to the bottom of the page. Defaults to 1 "milliseconds": 1000, // Time to wait before next action "dom_query": "CSS selector of element to scroll bottom in" // Defaults to the body if not indicated, "gather": true, // Boolean (optional) gather the data available before further scrolling. "gather_skip": 10, // Integer (optional) the number of times to wait before we gather results from the page. "scroll_pixel": 100 // pixels to scroll. Defaults to the bottom of the page } ... ] ... }
{ ... "actions": [ ... { "action_name": "click", "dom_query": "CSS selector of element to click", "times": 1, // Numbers of times to click and wait before clicking again. Defaults to 1 "milliseconds": 1000 // time to wait before next action } ... ] ... }
{ ... "actions": [ ... { "action_name": "click", "xpath": "XPath selector of element to click", "times": 1, // Numbers of times to click and wait before clicking again. Defaults to 1 "milliseconds": 1000 // Time to wait before next action } ... ] ... }
{ ... "actions": [ ... { "action_name": "wait", "milliseconds": 1000 // Time to wait before next action } ... ] ... }
{ ... "actions": [ ... { "action_name": "wait", "milliseconds": [Integer, Integer] // Random time to wait before next action within specified range } ... ] ... }
{ ... "actions": [ ... { "action_name": "insert", "dom_query": "#userid", "value": "MY_USERID", "milliseconds": 1000 // Time to wait before next action } ... ] ... }
{ ... "actions": [ ... { "action_name": "insert", "xpath": "//input[contains(@id,'password')]", "value": "MY_PASSWORD", "milliseconds": 1000 // Time to wait before next action } ... ] ... }
{ ... "actions": [ ... { "action_name": "trigger_change", "dom_query": "CSS selector of element to receive event", "milliseconds": 1000 // Time to wait before next action } ... ] ... }
{ ... "actions": [ ... { "action_name": "goto", "origin_url": "The next URL to go to" } ... ] ... }
{ ... "actions": [ ... { "action_name": "switch_window", "title": "Matrix" } ... ] ... }
{ ... "actions": [ ... { "action_name": "solve_captcha", "dom_query": "CSS selector of the captcha element on the page" // The crawler will keep on waiting while this element is found on the page } ... ] ... }
An Array of column_object
s to get all the data attributes you want from a web page.
It supports Dynamic variables
.
{
...
"columns": [
column_object_1,
column_object_2,
column_object_3
]
...
}
An Object that indicates a Data Attribute you want to get from the Page. It has 6 Variants. The Object contains the following attributes.
col_name | Required String | An arbitrary string that describes the data attribute represented by this column_object |
---|---|---|
dom_query | Conditional Required String |
A valid jQuery CSS Selector that describes the list of DOM Elements you want to get from a web page.
Either dom_query , xpath or row_number must be present. When none are present,
the engine defaults to returning the entire documents text.
Supported only by the following engines
|
outer_dom_query | Optional String |
A valid jQuery CSS Selector that describes the list of DOM Containers to fetch DOM Elements you want to get from a web page.
Is active on when dom_query is present
Supported only by the following engines
|
xpath | Conditional Required String |
A valid XPath Selector that describes the list of DOM Elements you want to get from a web page.
Either dom_query , xpath or row_number must be present. When none are present,
the engine defaults to returning the entire documents text.
Supported only by the following engines
|
outer_xpath | Optional String |
A valid XPath Selector that describes the list of DOM Containers to fetch DOM Elements you want to get from a web page.
Is active on when xpath is present
Supported only by the following engines
|
required_attribute | Optional required_attribute_type |
The attribute to get from the matching DOM Elements. If not indicated, defaults to the innerHTML of the matching DOM Elements.
Supported only by the following engines
|
options | Optional options_object |
The nested page to get data from. When used, either
|
regex_pattern | Optional String | Regular Expression Pattern that is global and case insensitive. |
regex_group | Optional Integer |
If a regular expression group is used, indicates the Nth matching value to return. If * is used instead of an Integer, combines all matching values into a String together in a comma separated format.
|
pdf_page | Optional Integer |
If indicated only returns text from a specific page of a PDF document. Pagination starts from 1. It otherwise defaults to returning the text for the entire PDF document. Supported only by the following engines
|
prefix | Optional String | Prepends the string provided to the beginning of the column that was collected from the page. |
postfix | Optional String | Appends the string provided to the end of the column that was collected from the page. |
is_index | Optional Boolean |
Indicates it values in this column should be used as the primary key for this data set. It affects how our Engine interprets the
groupings_object you provide in your recipe
|
listagg | Optional Boolean |
If set to true, collapses all the values of this column on the page into a single line.
|
listagg_delimiter | Optional String |
If provided, uses the string provide as the delimiter between the values of this column in a
|
row_number | Conditional Required Boolean |
When flagged as true, a sequential integer is assigned to this column for each row of result from a page.
Either dom_query , xpath or row_number must be present
|
order_by | Optional order_by_object |
Used in conjunction with the row_number , determines the order in which sequential integers should be assigned.
When omitted, the natural order in which records are gathered is assumed, typically from the top to the bottom of the page.
|
partition_by | Optional Array ofcol_name s |
Used in conjunction with the row_number , divides rows of results from a page into groups.
When omitted, all rows of results gathered from the page is in a single partition.
|
{ ... "columns": [{ "col_name" : 'Column name', "dom_query" : 'div.product-name' }] ... }
{ ... "columns": [{ "col_name" : "column name", "xpath" : '//xpath/to/elements' }] ... }
{ ... "columns": [{ "col_name" : "Price of Products", "xpath" : "//xpath/to/elements", "required_attribute" : required_attribute_type }] ... }
{ ... "columns": [{ "col_name" : 'Product detailed page', "dom_query" : 'a.detailed-url', "required_attribute" : 'href', "options" : options_object }] ... }
{ ... "columns": [{ "col_name" : 'Product Price', "dom_query" : 'div.product-description', "required_attribute" : "innerHTML", "regex_pattern" : "[0-9.]+", "regex_group" : 1 }] ... }
row_number
to a column for rows of results gathered from a page.
{ ... "columns": [{ "col_name" : 'product_ranking', "row_number" : true, "partition_by": ['product_category', 'product_sub_category'], "order_by": [{ "col_name": "product_price", "order": "desc" },{ "col_name": "product_rating", "order": "desc" }] }, ... { "col_name" : 'product_category', "dom_query" : '.category' },{ "col_name" : 'product_sub_category', "dom_query" : '.sub-category' },{ "col_name" : 'product_name', "dom_query" : '.prod-name' },{ "col_name" : 'product_price', "dom_query" : '.prod-price' },{ "col_name" : 'product_rating', "dom_query" : '.prod-rating' }] ... }
The required_attribute
is a String that indicates the attribute to get from a DOM Element.
These are the general String values available in normal DOM Elements on top of other non-common ones.
inner_html
outer_html
text
href
src
The next_page_object
is an Object that describes the hyperlink to the next page in an listing page.
It has 2 variants. The Object has the following attributes.
dom_query | Conditional Required String |
A valid jQuery CSS Selector that describes the list of DOM Elements you want to get from a web page.
Either dom_query or xpath must be present
|
---|---|---|
xpath | Conditional Required String |
A valid XPath Selector that describes the list of DOM Elements you want to get from a web page.
Either xpath or dom_query must be present
|
persist | Optional Boolean |
Only supported by the chrome engine. A Boolean when true loads the next page within the same Window by performing an actual click on the described DOM Element.
This is useful for handling paginations on sites where the next page is loaded via Ajax.
|
refresh | Optional Boolean |
Only supported by the chrome engine. This is used in conjunction with persist . The page is refreshed after the next page is loaded.
This is useful for clearing out javascript memory leaks in pages.
It cannot be used for Ajax enabled paginations that reset upon refresh. It defaults to false
|
milliseconds | Optional Integer |
Only supported by the chrome engine. This is used in conjunction with persist to ensure the contents from the next page is loaded before we execute the columns clause.
It defaults to a one second wait time if not declared.
|
scroll_bottom | Optional Boolean |
Only supported by the chrome engine. This is used in conjunction with persist .
It scrolls the window to the bottom of the page after the click event occurs. It defaults to false .
|
max_page_dom_query | Optional String |
Only supported by the chrome engine. This is used in conjunction with persist .
A valid jQuery CSS Selector that describes the dom element to extract the max page number value from.
Once a valid value is extracted, the crawler will continue to crawl through to the next page the last indicated page is reached.
This is useful if a page loads the next page via Ajax and the next page link does not automatically get disabled when the last page is reached
|
max_page_xpath | Optional String |
Only supported by the chrome engine. This is used in conjunction with persist .
A valid XPath Selector that describes the dom element to extract the max page number value from.
Once a valid value is extracted, the crawler will continue to crawl through to the next page the last indicated page is reached.
This is useful if a page loads the next page via Ajax and the next page link does not automatically get disabled when the last page is reached
|
times | Optional Integer |
Only supported by the chrome engine. This is used in conjunction with persist .
It indicates the number of times to crawl to the next page before the crawler should stop.
|
This variant uses jQuery selector to detect the link to the next page
{ ... "next_page": { "dom_query" : '.listing_next_page' }, ... }
This variant uses the xPath selector to detect the link to the next page
{ ... "next_page": { "xpath" : '//xPath/to/next/page' }, }
This variant clicks on the link next page within the same browser window.
{ ... "next_page": { "dom_query" : "#pagination .next_page a", "presist" : true, "refresh": true, "milliseconds": 1000 }, ... }
The source_format
is a String that indicates the format a web page is served in
html
- the data on the page is in HTML formatjson
- the data on the page is in JSON formatjson_attr
- the data on the page has HTML format embedded in JSON format
The required_json_attr
is a String which indicates the JSON attribute to extract HTML data from.
It only works if the source_format_object
is also provided in the recipe
The groupings_object
is an Array . It allows you to cluster attributes as sub-attributes in your JSON
export.
For it to work, you will need to indicate is_index
as true
for one
column_object
.
{ ... "columns": [{ "col_name": "Author", "dom_query": ".author", "is_index": true },{ "col_name": "Book name", "dom_query": ".book_name" },{ "col_name": "Published date", "dom_query": ".published_at" }],{ "col_name": "Rater Name", "dom_query": ".rater" },{ "col_name": "Rating", "dom_query": ".rating" }], ... "groupings": [{ "group_name" : "publised_books", "group_members": ["Book name", "Published date"] },{ "group_name" : "ratings", "group_members": ["Rater Name", "Rating"] }] ... }
The order_by
is an Array . It allows you to that indicates the order in which records in your data should be sorted by.
col_name | Required String | A String that references the column to sort the records in the data set by |
---|---|---|
order | Optional String |
A String that indicates the order in which the records should be sorted by. When not provided, it defaults to ascending.
Supported
|
{ ... "order_by": [{ "col_name": "Author name", "order": "asc" },{ "col_name": "Rating", "order": "desc" }] ... }
The data_object
is an Object which allows you to repopulate your crawler with desired columns and values.
{ ... "data": { "internal_reference_code": "for_fun" } ... }
Variables
correspond to col_name
s within the recipe whose values have already been gathered.
When utilized the attributes, their corresponding values are constructed dynamically. They are written in the format {{SOME_PATTERN}}
.
{ "engine": "nokogiri", "client_version": "4.0.20", "origin_url": "https://www.imdb.com/title/tt0119177/fullcredits", "columns": [ { "col_name": "CAST_NAME", "dom_query": "table.cast_list tr td:nth-child(2).itemprop a span.itemprop", "options": { "origin_url": "https://www.google.com/search?q={{CAST_NAME}}+date+of+birth", "columns": [ { "col_name": "date_of_birth", "dom_query": "div[role='heading'][aria-level='3'].HwtpBd.kno-fb-ctx div:nth-child(1)" } ] } } ] }
{ "engine": "nokogiri", "client_version": "4.0.20", "origin_url": "https://sanfrancisco.postalcodes.com", "columns": [ { "col_name": "POSTAL_CODE", "dom_query": "td.postal_code", "options": { "origin_url": "https://real-estate-agents.com", "columns": [ { "col_name": "real_estate_agent", "dom_query": "tr:contains('{{POSTAL_CODE}}') td.agent_name" } ] } } ] }
{ "engine": "nokogiri", "client_version": "4.0.20", "origin_url": "https://real-estate-agents.com", "columns": [ { "col_name": "AGENT_NAME", "dom_query": "td.agent_name", "options": { "origin_url": "https://real-estate-agents.com", "actions": [{ "action_name": "click", "dom_query": "tr:contains('{{AGENT_NAME}}') td button.show_contacts" }], "columns": [ { "col_name": "phone_number", "dom_query": "tr:contains('{{AGENT_NAME}}') td.contact_details" } ] } } ] }
{ "origin_url": "https://start.getdata.io", "columns": [{ "col_name": "product_name", "dom_query": ".prod-name", "options": { "method": "post", "origin_url": "https://search_engine.com/search/getcards?term=", "headers": { "Content-Type": "application/json" }, "post_data": { "filters": [ { "name": "projectCode", "values": [ "{{product_name}}" ] } ], }, "columns": [{ "col_name": "price", "dom_query": ".price" }] } }] }
{ "origin_url": "https://start.getdata.io", "columns": [{ "col_name": "product_name", "dom_query": ".prod-name" }], "data": { "internal_reference_code": "INTERNAL_PREFIX_{{product_name}}" } }
The options_object
is an Object that describes the data you want to get in a nested/sub-page. It has 2 variants.
The Object has the following attributes
columns | Required columns |
An Array of column_object s |
---|---|---|
origin_url | Conditional Required origin_url_object |
An Object that list the URL(s) to get data from: Variant 4 |
next_page | Optional next_page_object |
An Object indicates the hyperlink to the next page of this listing |
cookies | Optional cookies_object |
An Array that list the Cookies to use for the site |
headers | Optional headers_object |
An Object that is sent as the REQUEST HEADER |
wait | Optional Integer | See wait_object for more details |
to_click | Optional to_click_object |
An Array that list the DOM Elements in the page to click before getting data |
Crawls from a parent page to a nested page to get more data via the
href
that is indicated as the required_attribute
of the parent column_object
{ ... "columns" : [{ "col_name" : "product_page", "dom_query" : ".product a", "required_attribute" : "href", "options" : { "columns" : columns, "next_page" : next_page_object } }], ... }
Jumps from a page to another arbitrary page that is not necessarily linked by forming a new URL
using
the value from parent column_object
.
href
need not be indicated as the required_attribute
of
the parent column_object
.
{ ... "columns" : [{ "col_name" : "product_name", "dom_query" : ".product", "options" : { "origin_url" : "https://www.google.com/search?q={{product_name}}+price", "columns" : columns, "next_page" : next_page_object, } }], ... }
To find out more about how GetData.IO works