Drafting your recipe

The Semantic Query Language (SQL) allows you to declare the data you need gathered from the web. Your declarations are called recipes (documents in JSON format). These recipes are interpreted by our decentralized ODBC engine. The engine fetches the desired data on your behalf

The SQL supports the following clauses.

Clauses

name	Optional String	Artibrary description you use to describe the level of in the site hierarchy this crawler is at for easier debugging purposes
engine	Optional `engine_object`	Indicates which type of web crawler to use
origin_url	Required `origin_url_object`	An Object that list the URL(s) to get data from
headers	Optional `headers_object`	An Object that is sent as the REQUEST HEADER
method	Optional `method_object`	A String that indicates the HTTP REQUEST method to use for making the page load call
cookies	Optional `cookies`	An Array that list the Cookies to use for the site
disable_cookies	Optional Boolean	A Boolean value when set to true disables the sending and receiving of cookies from our engines
post_data	Optional `post_data_object`	A Object that indicates the form values to send along in a HTTP POST REQUEST
actions	Optional `actions`	An Array of `action_object`s that defines the list of actions to perform on the page
columns	Required `columns`	An Array of `column_object`s that defines the data to get from page
next_page	Optional `next_page_object`	An Object indicates the hyperlink to the next page of this listing
source_format	Optional `source_format_object`	A String that indicates what format the web page is served in
required_json_attr	Optional `required_json_attr_object`	A String that indicates the `JSON`attribute which the targeted `HTML` data is embedded in
groupings	Optional `groupings_object`	An Object that indicates groups attributes in your data set as sub-attributes for your `JSON` exports
order_by	Optional `order_by_object`	An Array that indicates the order in which records in your data should be sorted by

Sample Recipe

{
  "engine"      : "chrome",
  "origin_url"  : "http://some_url.com",
  "method"      : "post",
  "post_data"   : {
    "username"  : "My User Name",
    "password"  : "My Password"
  },
  "actions":    : [{
      "action_name": "scroll_bottom",
      "times": 10
    },{
      "action_name": "click",
      "dom_query": "button.show_more"
    },{
      "action_name": "click",
      "xpath": "//button[[contains(@class, 'show_more')]]"
    },{
      "action_name": "wait",
      "milliseconds": [1000, 5000]
  }],
  "columns"     : [{
      "col_name"  : "merchant name",
      "dom_query" : "td.merchant_name",
      "is_index"  : true
    },{
      "col_name"  : "transaction date",
      "dom_query" : "td.transaction_date"
    },{
      "col_name"  : "transaction name",
      "dom_query" : "td.transaction_name"  
    },{
      "col_name"  : "transaction amount",
      "dom_query" : "td.transaction_amount"      
  }],
  "next_page"   : {
    "dom_query" : "a.next_page"
  },
  "cookies"     : [{
    "domain": "localhost",
    "name": "__profilin",
    "value": "p%3Dt"
  }],
  "headers"     : {
    Referer: "http://localhost:3000/docs/define-data",
    User-Agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko)"
  },
  "groupings": [{
    "group_name"    : "transactions",
    "group_members" : [
      "transaction date", 
      "transaction name",
      "transaction amount"
    ]
  }],
  "order_by": [{
      "col_name" : "merchant name",
      "order": "asc"
    },{
      "col_name" : "transaction date",
      "order": "desc"
  }]  
}

scroll to top

engine_object

The engine_object is an optional clause that indicates the type of engine to get data with. There are currently 2 available engines to choose from

nokogiri - the Nokogiri engine is useful for extracting data from static html pages
chrome - the Chrome engine is useful for extracting data from pages that are Ajax heavy. it is experimental and does not support POST methods
pdf_reader - the Pdf Reader engine is used for extracting data from PDF documents published on websites
text_reader - the Text Reader engine is used for extracting data from unstructured text files published on websites

scroll to top

origin_url_object

The origin_url_object, a required clause, is a String that indicates the URL(s) to get data from. It supports Dynamic variables.

Variant 1

The String indicating the URL to get data from.

Example

{
  "origin_url": "http://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=phones"
  ...
}

Variant 2

When used in a options_object, the String is constructed at run time with the use of Dynamic variables that are already collected.

Example

{
  "origin_url": "http://www.amazon.com/s/ref=nb_sb_noss?field-keywords={{some_col_name}}",
  ...
}

scroll to top

headers_object

The headers_object is an Object sent along as the Request Header when making a request to a URL.
The full list of support header attributes can be referenced on WikiPedia

Example

{
  ...
  "headers": {
    "Content-Type": "application/json",
    "Accept-Language": "es",
    "X-Test": "foo",
    "DNT": "1"
  },
  ...
}

scroll to top

method_object

The method_object is a String the indicates the HTTP method to use to make the page load call. Defaults to get if this not indicated.

Supported Methods

post
get

scroll to top

cookies

The cookies is an Array of cookie_objects sent along with the page request in the Request Header to emulate an Authenticated User Session.

Example

{
  ...
  "cookies": [
    cookie_object_1, 
    cookie_object_2, 
    cookie_object_3
  ],
  ...
}

scroll to top

cookie_object

The cookie_object is an Object that is attached to the header of our HTTP Request as a Cookie when making a request to a URL to get data.

name	Required String	The name of the cookie
value	Required String	The name of the value of the cookie
domain	Required String	The domain of the cookie

Example

{
  "name" : 'SSID',
  "value" : 'AnocULQskbWigvbl6',
  "domain" : '.google.com'
}

scroll to top

disable_cookies

The disable_cookies is a Boolean when set to true disables the sending and getting of cookies from our engine to the receipient server.

scroll to top

post_data_object

The post_data_object is an Object that indicates the form data to be sent along in a HTTP POST REQUEST.

Example

{
  ...
  "post_data" : {
    "key_1": "value_1",
    "key_2": "value_2",
    "key_3": "value_3",
    ...
  },
  ...
}

scroll to top

actions

An Array of action_objects to sequences the list of actions you want perform on a web page before getting data from it

Only supported by the chrome engine

Example

{
  ...  
  "actions": [
    action_object_1, 
    action_object_2, 
    action_object_3
  ]
  ...
}

scroll to top

action_object

An Object that indicates the sequence of action to perform before you get data from the Page. It has 5 Variants. The Object contains the following attributes.

Attributes

action_name	Required String	The action to perform on this step. Support variables options include `click` `insert` `scroll_bottom` `scroll_pixel` `gather` `gather_skip` `wait` `trigger_change` `goto` `switch_window` `solve_captcha`
xpath	Conditional Required String	A valid XPath Selector that describes the DOM Element you want to click on if `action_name` is `click`
dom_query	Conditional Required String	A valid CSS Selector that describes the DOM Element you want to click on if `action_name` is `click`. It supports `Dynamic variables`.
milliseconds	Optional Integer	The time to wait if `action_name` is `wait` or `scroll_bottom`. Defaults to 1000 milliseconds if not indicated.
times	Optional Integer	The number of times to scroll to the bottom if `action_name` is `scroll_bottom`. Defaults to 1 if not indicated.
origin_url	Optional String	This is used in conjuection with the `goto` action. It specifies the next URL to go within the same window after a prior action is completed.
title	Optional String	This is used in conjuection with the `switch_window` action. It specifies the title of the new window to focus on after a prior action is completed.
scroll_bottom	Optional Boolean	This is used in conjuection with the `click` action. It scrolls the window to the bottom of the document after the `click` event occurs. It defaults to `false`.
scroll_pixel	Optional Integer	This is used in conjunction with the `scroll_bottom` action. It scrolls to down the targeted dom element by the value of pixel indicated. It defaults to the bottom of the dom element.
gather	Optional Boolean	This is used in conjunction with the `scroll_bottom` action. It gathers the results in between each scroll event. It is very useful for tackling the situation when elements at the top of the list are removed during scrolling.
gather_skip	Optional Integer	The number of times to skip for each `scroll_bottom` action before gathering the results. It is very useful for tackling the situation when the inifinite scroll page has a large volume of records. This helps reduce waiting time before the crawler completely scrolls through all the records on the page.

Variant 1

Scrolls to the bottom of the page to cause rendering of more elements by the page.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "scroll_bottom",
      "times": 10, // Number of times to scroll to the bottom of the page. Defaults to 1
      "milliseconds": 1000, // Time to wait before next action
      "dom_query": "CSS selector of element to scroll bottom in" // Defaults to the body if not indicated,
      "gather": true, // Boolean (optional) gather the data available before further scrolling. 
      "gather_skip": 10, // Integer (optional) the number of times to wait before we gather results from the page.
      "scroll_pixel": 100 // pixels to scroll. Defaults to the bottom of the page

    }
    ...
  ]
  ...
}

Variant 2

Click on matching DOM Elements with XPath.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "click",
      "dom_query": "CSS selector of element to click",
      "times": 1, // Numbers of times to click and wait before clicking again. Defaults to 1      
      "milliseconds": 1000 // time to wait before next action
    }
    ...
  ]
  ...
}

Variant 3

Click on matching DOM Elements with CSS Selector.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "click",
      "xpath": "XPath selector of element to click",
      "times": 1, // Numbers of times to click and wait before clicking again. Defaults to 1      
      "milliseconds": 1000 // Time to wait before next action
    }
    ...
  ]
  ...
}

Variant 4

Waits for specific period of time before next action.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "wait",
      "milliseconds": 1000 // Time to wait before next action
    }
    ...
  ]
  ...
}

Variant 5

Waits for random period of time within specific range before next action.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "wait",
      "milliseconds": [Integer, Integer] // Random time to wait before next action within specified range
    }
    ...
  ]
  ...
}

Variant 6

Inserting value into input field using CSS selectors

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "insert",
      "dom_query": "#userid",
      "value": "MY_USERID",
      "milliseconds": 1000 // Time to wait before next action
    }
    ...
  ]
  ...
}

Variant 7

Inserting value into input field using XPath selectors

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "insert",
      "xpath": "//input[contains(@id,'password')]",
      "value": "MY_PASSWORD",
      "milliseconds": 1000 // Time to wait before next action
    }
    ...
  ]
  ...
}

Variant 8

Triggering javascript change event on selected dom elements. This is useful for getting fresh data to load after a new option is selected in a dropdown

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "trigger_change",
      "dom_query": "CSS selector of element to receive event",
      "milliseconds": 1000 // Time to wait before next action
    }
    ...
  ]
  ...
}

Variant 9

Directing the crawler to go another URL after completing prior actions on the current page. This is useful for sites where modify your account settings page before fetching data from another part of the site.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "goto",
      "origin_url": "The next URL to go to"
    }
    ...
  ]
  ...
}

Variant 10

Directing the crawler to focus on another window in the browser. This is useful for sites where actions trigger a popup window and the data you want is within the popup window.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "switch_window",
      "title": "Matrix"
    }
    ...
  ]
  ...
}

Variant 11

Have the crawler wait while a person solves the Captcha that is being served up on a page.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "solve_captcha",
      "dom_query": "CSS selector of the captcha element on the page" // The crawler will keep on waiting while this element is found on the page
    }
    ...
  ]
  ...
}

scroll to top

columns

An Array of column_objects to get all the data attributes you want from a web page. It supports Dynamic variables.

Example

{
  ...  
  "columns": [
    column_object_1, 
    column_object_2, 
    column_object_3
  ]
  ...
}

scroll to top

column_object

An Object that indicates a Data Attribute you want to get from the Page. It has 6 Variants. The Object contains the following attributes.

Attributes

col_name	Required String	An arbitrary string that describes the data attribute represented by this `column_object`
dom_query	Conditional Required String	A valid jQuery CSS Selector that describes the list of DOM Elements you want to get from a web page. Either `dom_query`, `xpath` or `row_number` must be present. When none are present, the engine defaults to returning the entire documents text. Supported only by the following engines `nokogiri` `chrome`
outer_dom_query	Optional String	A valid jQuery CSS Selector that describes the list of DOM Containers to fetch DOM Elements you want to get from a web page. Is active on when `dom_query` is present Supported only by the following engines `nokogiri` `chrome`
xpath	Conditional Required String	A valid XPath Selector that describes the list of DOM Elements you want to get from a web page. Either `dom_query`, `xpath` or `row_number` must be present. When none are present, the engine defaults to returning the entire documents text. Supported only by the following engines `nokogiri` `chrome`
outer_xpath	Optional String	A valid XPath Selector that describes the list of DOM Containers to fetch DOM Elements you want to get from a web page. Is active on when `xpath` is present Supported only by the following engines `nokogiri` `chrome`
required_attribute	Optional `required_attribute_type`	The attribute to get from the matching DOM Elements. If not indicated, defaults to the innerHTML of the matching DOM Elements. Supported only by the following engines `nokogiri` `chrome`
options	Optional `options_object`	The nested page to get data from. When used, either `required_attribute` of `column_object` must be href or `options_object` must contain a `origin_url_object`
regex_pattern	Optional String	Regular Expression Pattern that is global and case insensitive.
regex_group	Optional Integer	If a regular expression group is used, indicates the Nth matching value to return. If `*` is used instead of an Integer, combines all matching values into a String together in a comma separated format.
pdf_page	Optional Integer	If indicated only returns text from a specific page of a PDF document. Pagination starts from 1. It otherwise defaults to returning the text for the entire PDF document. Supported only by the following engines `pdf_reader`
prefix	Optional String	Prepends the string provided to the beginning of the column that was collected from the page.
postfix	Optional String	Appends the string provided to the end of the column that was collected from the page.
is_index	Optional Boolean	Indicates it values in this column should be used as the primary key for this data set. It affects how our Engine interprets the `groupings_object` you provide in your recipe
listagg	Optional Boolean	If set to true, collapses all the values of this column on the page into a single line. `false` by default.
listagg_delimiter	Optional String	If provided, uses the string provide as the delimiter between the values of this column in a `listagg` operation `""` by default.
row_number	Conditional Required Boolean	When flagged as true, a sequential integer is assigned to this column for each row of result from a page. Either `dom_query`, `xpath` or `row_number` must be present
order_by	Optional `order_by_object`	Used in conjunction with the `row_number`, determines the order in which sequential integers should be assigned. When omitted, the natural order in which records are gathered is assumed, typically from the top to the bottom of the page.
partition_by	Optional Array of`col_name`s	Used in conjunction with the `row_number`, divides rows of results from a page into groups. When omitted, all rows of results gathered from the page is in a single partition.

Variant 1

Gets a list of innerHTML from matching DOM Elements with jQuery selectors.

Example

{
  ...
  "columns": [{ 
    "col_name" :  'Column name',
    "dom_query" : 'div.product-name'  
  }]
  ...
}

Variant 2

Gets a list of innerHTML from matching DOM Elements with XPath.

Example

{
  ...
  "columns": [{
    "col_name" : "column name",
    "xpath" : '//xpath/to/elements'
  }]
  ...
}

Variant 3

Gets a list of DOM Attribute from matching DOM Elements.

Example

{
  ...
  "columns": [{
    "col_name" : "Price of Products",
    "xpath" : "//xpath/to/elements",
    "required_attribute" : required_attribute_type  
  }]
  ...
}

Variant 4

Gets the URLs to the nested pages and then continues to get data from these pages.

Example

{
  ...
  "columns": [{
    "col_name" : 'Product detailed page',
    "dom_query" : 'a.detailed-url',
    "required_attribute" : 'href',
    "options" : options_object
  }]
  ...
}

Variant 5

Gets the innerHTML from matching DOM Elements and only gets a subset of the String that matches a Regex Pattern

Example

{
  ...
  "columns": [{
    "col_name" : 'Product Price',
    "dom_query" : 'div.product-description',
    "required_attribute" : "innerHTML",
    "regex_pattern" : "[0-9.]+",
    "regex_group" : 1
  }]
  ...
}

Variant 6

Applies row_number to a column for rows of results gathered from a page.

Example

{
  ...
  "columns": [{
      "col_name" : 'product_ranking',
      "row_number" : true,
      "partition_by": ['product_category', 'product_sub_category'],
      "order_by": [{
        "col_name": "product_price",
        "order": "desc"
      },{
        "col_name": "product_rating",
        "order": "desc"
      }]
    },
    ...
    {
      "col_name" : 'product_category',
      "dom_query" : '.category'
    },{
      "col_name" : 'product_sub_category',
      "dom_query" : '.sub-category'  
    },{
      "col_name" : 'product_name',
      "dom_query" : '.prod-name'  
    },{
      "col_name" : 'product_price',
      "dom_query" : '.prod-price'  
    },{
      "col_name" : 'product_rating',
      "dom_query" : '.prod-rating'  
  }]
  ...
}

scroll to top

required_attribute

The required_attribute is a String that indicates the attribute to get from a DOM Element.

These are the general String values available in normal DOM Elements on top of other non-common ones.

General Types

inner_html
outer_html
text
href
src

scroll to top

next_page_object

The next_page_object is an Object that describes the hyperlink to the next page in an listing page. It has 2 variants. The Object has the following attributes.

Attributes

dom_query	Conditional Required String	A valid jQuery CSS Selector that describes the list of DOM Elements you want to get from a web page. Either `dom_query` or `xpath` must be present
xpath	Conditional Required String	A valid XPath Selector that describes the list of DOM Elements you want to get from a web page. Either `xpath` or `dom_query` must be present
persist	Optional Boolean	Only supported by the `chrome` engine. A Boolean when `true` loads the next page within the same Window by performing an actual click on the described DOM Element. This is useful for handling paginations on sites where the next page is loaded via Ajax.
refresh	Optional Boolean	Only supported by the `chrome` engine. This is used in conjunction with `persist`. The page is refreshed after the next page is loaded. This is useful for clearing out javascript memory leaks in pages. It cannot be used for Ajax enabled paginations that reset upon refresh. It defaults to `false`
milliseconds	Optional Integer	Only supported by the `chrome` engine. This is used in conjunction with `persist` to ensure the contents from the next page is loaded before we execute the `columns` clause. It defaults to a one second wait time if not declared.
scroll_bottom	Optional Boolean	Only supported by the `chrome` engine. This is used in conjunction with `persist`. It scrolls the window to the bottom of the page after the click event occurs. It defaults to `false`.
max_page_dom_query	Optional String	Only supported by the `chrome` engine. This is used in conjunction with `persist`. A valid jQuery CSS Selector that describes the dom element to extract the max page number value from. Once a valid value is extracted, the crawler will continue to crawl through to the next page the last indicated page is reached. This is useful if a page loads the next page via Ajax and the next page link does not automatically get disabled when the last page is reached
max_page_xpath	Optional String	Only supported by the `chrome` engine. This is used in conjunction with `persist`. A valid XPath Selector that describes the dom element to extract the max page number value from. Once a valid value is extracted, the crawler will continue to crawl through to the next page the last indicated page is reached. This is useful if a page loads the next page via Ajax and the next page link does not automatically get disabled when the last page is reached
times	Optional Integer	Only supported by the `chrome` engine. This is used in conjunction with `persist`. It indicates the number of times to crawl to the next page before the crawler should stop.

Variant 1

This variant uses jQuery selector to detect the link to the next page

Example

{
  ...
  "next_page": {  
    "dom_query" : '.listing_next_page' 
  },
  ...
}

Variant 2

This variant uses the xPath selector to detect the link to the next page

Example

{
  ...
  "next_page": {
    "xpath" : '//xPath/to/next/page' 
  },  
}

Variant 3

This variant clicks on the link next page within the same browser window.

Example

{
  ...
  "next_page": {  
    "dom_query" : "#pagination .next_page a",
    "presist" : true,
    "refresh": true,
    "milliseconds": 1000
  },
  ...
}

scroll to top

source_format

The source_format is a String that indicates the format a web page is served in

Supported formats

html - the data on the page is in HTML format
json - the data on the page is in JSON format
json_attr - the data on the page has HTML format embedded in JSON format

scroll to top

required_json_attr

The required_json_attr is a String which indicates the JSON attribute to extract HTML data from. It only works if the source_format_object is also provided in the recipe

scroll to top

groupings_object

The groupings_object is an Array . It allows you to cluster attributes as sub-attributes in your JSON export. For it to work, you will need to indicate is_index as true for one column_object.

Sample Input

{
  ...
  "columns": [{
      "col_name": "Author",
      "dom_query": ".author",
      "is_index": true
    },{
      "col_name": "Book name",
      "dom_query": ".book_name"
    },{
      "col_name": "Published date",
      "dom_query": ".published_at"
    }],{
      "col_name": "Rater Name",
      "dom_query": ".rater"
    },{
      "col_name": "Rating",
      "dom_query": ".rating"
  }],
  ...
  "groupings": [{
      "group_name" : "publised_books",
      "group_members": ["Book name", "Published date"]
    },{
      "group_name" : "ratings",
      "group_members": ["Rater Name", "Rating"]
  }]
  ...
}

scroll to top

order_by_object

The order_by is an Array . It allows you to that indicates the order in which records in your data should be sorted by.

Attributes

col_name Required String A String that references the column to sort the records in the data set by

order

col_name	Required String	A String that references the column to sort the records in the data set by
order	Optional String	A String that indicates the order in which the records should be sorted by. When not provided, it defaults to ascending. Supported `asc` `desc`

Optional String

A String that indicates the order in which the records should be sorted by. When not provided, it defaults to ascending.

Supported

asc
desc

Sample Input

{
  ...
  "order_by": [{
      "col_name": "Author name",
      "order": "asc"
    },{
      "col_name": "Rating",
      "order": "desc"
  }]
  ...
}

scroll to top

data object

The data_object is an Object which allows you to repopulate your crawler with desired columns and values.

Example

{
  ...
  "data": {
    "internal_reference_code": "for_fun"
  }
  ...
}

scroll to top

variables

Variables correspond to col_names within the recipe whose values have already been gathered. When utilized the attributes, their corresponding values are constructed dynamically. They are written in the format {{SOME_PATTERN}}.

Example usage in the origin_url object

{
    "engine": "nokogiri",
    "client_version": "4.0.20",
    "origin_url": "https://www.imdb.com/title/tt0119177/fullcredits",
    "columns": [
        {
            "col_name": "CAST_NAME",
            "dom_query": "table.cast_list tr td:nth-child(2).itemprop a span.itemprop",
            "options": {
                "origin_url": "https://www.google.com/search?q={{CAST_NAME}}+date+of+birth",
                "columns": [
                    {
                        "col_name": "date_of_birth",
                        "dom_query": "div[role='heading'][aria-level='3'].HwtpBd.kno-fb-ctx div:nth-child(1)"
                    }
                ]
            }
        }
    ]
}

Example usage in the column_object.dom_query

{
    "engine": "nokogiri",
    "client_version": "4.0.20",
    "origin_url": "https://sanfrancisco.postalcodes.com",
    "columns": [
        {
            "col_name": "POSTAL_CODE",
            "dom_query": "td.postal_code",
            "options": {
                "origin_url": "https://real-estate-agents.com",
                "columns": [
                    {
                        "col_name": "real_estate_agent",
                        "dom_query": "tr:contains('{{POSTAL_CODE}}') td.agent_name"
                    }
                ]
            }
        }
    ]
}

Example usage in the actions.dom_query

{
    "engine": "nokogiri",
    "client_version": "4.0.20",
    "origin_url": "https://real-estate-agents.com",
    "columns": [
        {
            "col_name": "AGENT_NAME",
            "dom_query": "td.agent_name",
            "options": {
                "origin_url": "https://real-estate-agents.com",
                "actions": [{
                    "action_name": "click",
                    "dom_query": "tr:contains('{{AGENT_NAME}}') td button.show_contacts"
                }],
                "columns": [
                    {
                        "col_name": "phone_number",
                        "dom_query": "tr:contains('{{AGENT_NAME}}') td.contact_details"
                    }
                ]
            }
        }
    ]
}

Example usage in the post_data object

{
  "origin_url": "https://start.getdata.io",
  "columns": [{
    "col_name": "product_name",
    "dom_query": ".prod-name",
    "options": {
        "method": "post",
        "origin_url": "https://search_engine.com/search/getcards?term=",
        "headers": {
            "Content-Type": "application/json"
        },
        "post_data": {
            "filters": [
                {
                    "name": "projectCode",
                    "values": [
                        "{{product_name}}"
                    ]
                }
            ],
        },
        "columns": [{
            "col_name": "price",
            "dom_query": ".price"
        }]
    }
  }]
}

Example usage in the data object

{
  "origin_url": "https://start.getdata.io",
  "columns": [{
    "col_name": "product_name",
    "dom_query": ".prod-name"
  }],
  "data": {
    "internal_reference_code": "INTERNAL_PREFIX_{{product_name}}"
  }
}

scroll to top

options_object

The options_object is an Object that describes the data you want to get in a nested/sub-page. It has 2 variants.
The Object has the following attributes

Attributes

columns	Required `columns`	An Array of `column_object`s
origin_url	Conditional Required `origin_url_object`	An Object that list the URL(s) to get data from: Variant 4
next_page	Optional `next_page_object`	An Object indicates the hyperlink to the next page of this listing
cookies	Optional `cookies_object`	An Array that list the Cookies to use for the site
headers	Optional `headers_object`	An Object that is sent as the REQUEST HEADER
wait	Optional Integer	See `wait_object` for more details
to_click	Optional `to_click_object`	An Array that list the DOM Elements in the page to click before getting data

Variant 1

Crawls from a parent page to a nested page to get more data via the href that is indicated as the required_attribute of the parent column_object

Example

{
  ...
  "columns" : [{
    "col_name" : "product_page",
    "dom_query" : ".product a",
    "required_attribute" : "href",
    "options" : {
      "columns" : columns,
      "next_page" : next_page_object
    } 
  }],
  ...
}

Variant 2

Jumps from a page to another arbitrary page that is not necessarily linked by forming a new URL using the value from parent column_object. href need not be indicated as the required_attribute of the parent column_object.

Example

{
  ...
  "columns" : [{
    "col_name" : "product_name",
    "dom_query" : ".product",
    "options" : {
      "origin_url" : "https://www.google.com/search?q={{product_name}}+price",
      "columns" : columns,
      "next_page" : next_page_object,  
    }
  }],
  ...
}

3 simple steps to get you started

Step 1. Watching this 55 seconds tutorial

Step 2. Get

Step 3. Get data in a few clicks

Need more help?

3 simple steps to get you started

Step 1. Watching this 55 seconds tutorial

Step 2. Get

Step 3. Get data in a few clicks

Need more help?

Supported Clauses

Additional Language Features

Next Steps

Drafting your recipe

Clauses

Sample Recipe

engine_object

origin_url_object

Variant 1

Example

Variant 2

Example

headers_object

Example

method_object

Supported Methods

cookies

Example

cookie_object

Example

disable_cookies

post_data_object

Example

actions

Example

action_object

Attributes

Variant 1

Example

Variant 2

Example

Variant 3

Example

Variant 4

Example

Variant 5

Example

Variant 6

Example

Variant 7

Example

Variant 8

Example

Variant 9

Example

Variant 10

Example

Variant 11

Example

columns

Example

column_object

Attributes

Variant 1

Example

Variant 2

Example

Variant 3

Example

Variant 4

Example

Variant 5

Example

Variant 6

Example

required_attribute

General Types

next_page_object

Attributes

Variant 1