Drafting your recipe

The Semantic Query Language (SQL) allows you to declare the data you need gathered from the web. Your declarations are called recipes (documents in JSON format). These recipes are interpreted by our decentralized ODBC engine. The engine fetches the desired data on your behalf

The SQL supports the following clauses.


Clauses

name Optional String Artibrary description you use to describe the level of in the site hierarchy this crawler is at for easier debugging purposes
engine Optional engine_object Indicates which type of web crawler to use
origin_url Required origin_url_object An Object that list the URL(s) to get data from
headers Optional headers_object An Object that is sent as the REQUEST HEADER
method Optional method_object A String that indicates the HTTP REQUEST method to use for making the page load call
cookies Optional cookies An Array that list the Cookies to use for the site
disable_cookies Optional Boolean A Boolean value when set to true disables the sending and receiving of cookies from our engines
post_data Optional post_data_object A Object that indicates the form values to send along in a HTTP POST REQUEST
actions Optional actions An Array of action_objects that defines the list of actions to perform on the page
columns Required columns An Array of column_objects that defines the data to get from page
next_page Optional next_page_object An Object indicates the hyperlink to the next page of this listing
source_format Optional source_format_object A String that indicates what format the web page is served in
required_json_attr Optional required_json_attr_object A String that indicates the JSONattribute which the targeted HTML data is embedded in
groupings Optional groupings_object An Object that indicates groups attributes in your data set as sub-attributes for your JSON exports
order_by Optional order_by_object An Array that indicates the order in which records in your data should be sorted by

Sample Recipe

{
  "engine"      : "chrome",
  "origin_url"  : "http://some_url.com",
  "method"      : "post",
  "post_data"   : {
    "username"  : "My User Name",
    "password"  : "My Password"
  },
  "actions":    : [{
      "action_name": "scroll_bottom",
      "times": 10
    },{
      "action_name": "click",
      "dom_query": "button.show_more"
    },{
      "action_name": "click",
      "xpath": "//button[[contains(@class, 'show_more')]]"
    },{
      "action_name": "wait",
      "milliseconds": [1000, 5000]
  }],
  "columns"     : [{
      "col_name"  : "merchant name",
      "dom_query" : "td.merchant_name",
      "is_index"  : true
    },{
      "col_name"  : "transaction date",
      "dom_query" : "td.transaction_date"
    },{
      "col_name"  : "transaction name",
      "dom_query" : "td.transaction_name"  
    },{
      "col_name"  : "transaction amount",
      "dom_query" : "td.transaction_amount"      
  }],
  "next_page"   : {
    "dom_query" : "a.next_page"
  },
  "cookies"     : [{
    "domain": "localhost",
    "name": "__profilin",
    "value": "p%3Dt"
  }],
  "headers"     : {
    Referer: "http://localhost:3000/docs/define-data",
    User-Agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko)"
  },
  "groupings": [{
    "group_name"    : "transactions",
    "group_members" : [
      "transaction date", 
      "transaction name",
      "transaction amount"
    ]
  }],
  "order_by": [{
      "col_name" : "merchant name",
      "order": "asc"
    },{
      "col_name" : "transaction date",
      "order": "desc"
  }]  
}


engine_object

The engine_object is an optional clause that indicates the type of engine to get data with. There are currently 2 available engines to choose from

  • nokogiri - the Nokogiri engine is useful for extracting data from static html pages
  • chrome - the Chrome engine is useful for extracting data from pages that are Ajax heavy. it is experimental and does not support POST methods
  • pdf_reader - the Pdf Reader engine is used for extracting data from PDF documents published on websites
  • text_reader - the Text Reader engine is used for extracting data from unstructured text files published on websites


origin_url_object

The origin_url_object, a required clause, is a String that indicates the URL(s) to get data from. It supports Dynamic variables.

Variant 1

The String indicating the URL to get data from.

Example

{
  "origin_url": "http://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=phones"
  ...
}

Variant 2

When used in a options_object, the String is constructed at run time with the use of Dynamic variables that are already collected.

Example

{
  "origin_url": "http://www.amazon.com/s/ref=nb_sb_noss?field-keywords={{some_col_name}}",
  ...
}


headers_object

The headers_object is an Object sent along as the Request Header when making a request to a URL.
The full list of support header attributes can be referenced on WikiPedia

Example

{
  ...
  "headers": {
    "Content-Type": "application/json",
    "Accept-Language": "es",
    "X-Test": "foo",
    "DNT": "1"
  },
  ...
}

method_object

The method_object is a String the indicates the HTTP method to use to make the page load call. Defaults to get if this not indicated.

Supported Methods

  • post
  • get

cookies

The cookies is an Array of cookie_objects sent along with the page request in the Request Header to emulate an Authenticated User Session.

Example

{
  ...
  "cookies": [
    cookie_object_1, 
    cookie_object_2, 
    cookie_object_3
  ],
  ...
}


disable_cookies

The disable_cookies is a Boolean when set to true disables the sending and getting of cookies from our engine to the receipient server.


post_data_object

The post_data_object is an Object that indicates the form data to be sent along in a HTTP POST REQUEST.

Example

{
  ...
  "post_data" : {
    "key_1": "value_1",
    "key_2": "value_2",
    "key_3": "value_3",
    ...
  },
  ...
}

actions

An Array of action_objects to sequences the list of actions you want perform on a web page before getting data from it

Only supported by the chrome engine

Example

{
  ...  
  "actions": [
    action_object_1, 
    action_object_2, 
    action_object_3
  ]
  ...
}


action_object

An Object that indicates the sequence of action to perform before you get data from the Page. It has 5 Variants. The Object contains the following attributes.

Attributes

action_name Required String The action to perform on this step. Support variables options include
  • click
  • insert
  • scroll_bottom
  • scroll_pixel
  • gather
  • gather_skip
  • wait
  • trigger_change
  • goto
  • switch_window
  • solve_captcha
xpath Conditional Required String A valid XPath Selector that describes the DOM Element you want to click on if action_name is click
dom_query Conditional Required String A valid CSS Selector that describes the DOM Element you want to click on if action_name is click. It supports Dynamic variables.
milliseconds Optional Integer The time to wait if action_name is wait or scroll_bottom. Defaults to 1000 milliseconds if not indicated.
times Optional Integer The number of times to scroll to the bottom if action_name is scroll_bottom. Defaults to 1 if not indicated.
origin_url Optional String This is used in conjuection with the goto action. It specifies the next URL to go within the same window after a prior action is completed.
title Optional String This is used in conjuection with the switch_window action. It specifies the title of the new window to focus on after a prior action is completed.
scroll_bottom Optional Boolean This is used in conjuection with the click action. It scrolls the window to the bottom of the document after the click event occurs. It defaults to false.
scroll_pixel Optional Integer This is used in conjunction with the scroll_bottom action. It scrolls to down the targeted dom element by the value of pixel indicated. It defaults to the bottom of the dom element.
gather Optional Boolean This is used in conjunction with the scroll_bottom action. It gathers the results in between each scroll event. It is very useful for tackling the situation when elements at the top of the list are removed during scrolling.
gather_skip Optional Integer The number of times to skip for each scroll_bottom action before gathering the results. It is very useful for tackling the situation when the inifinite scroll page has a large volume of records. This helps reduce waiting time before the crawler completely scrolls through all the records on the page.

Variant 1

Scrolls to the bottom of the page to cause rendering of more elements by the page.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "scroll_bottom",
      "times": 10, // Number of times to scroll to the bottom of the page. Defaults to 1
      "milliseconds": 1000, // Time to wait before next action
      "dom_query": "CSS selector of element to scroll bottom in" // Defaults to the body if not indicated,
      "gather": true, // Boolean (optional) gather the data available before further scrolling. 
      "gather_skip": 10, // Integer (optional) the number of times to wait before we gather results from the page.
      "scroll_pixel": 100 // pixels to scroll. Defaults to the bottom of the page

    }
    ...
  ]
  ...
}

Variant 2

Click on matching DOM Elements with XPath.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "click",
      "dom_query": "CSS selector of element to click",
      "times": 1, // Numbers of times to click and wait before clicking again. Defaults to 1      
      "milliseconds": 1000 // time to wait before next action
    }
    ...
  ]
  ...
}

Variant 3

Click on matching DOM Elements with CSS Selector.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "click",
      "xpath": "XPath selector of element to click",
      "times": 1, // Numbers of times to click and wait before clicking again. Defaults to 1      
      "milliseconds": 1000 // Time to wait before next action
    }
    ...
  ]
  ...
}

Variant 4

Waits for specific period of time before next action.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "wait",
      "milliseconds": 1000 // Time to wait before next action
    }
    ...
  ]
  ...
}

Variant 5

Waits for random period of time within specific range before next action.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "wait",
      "milliseconds": [Integer, Integer] // Random time to wait before next action within specified range
    }
    ...
  ]
  ...
}

Variant 6

Inserting value into input field using CSS selectors

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "insert",
      "dom_query": "#userid",
      "value": "MY_USERID",
      "milliseconds": 1000 // Time to wait before next action
    }
    ...
  ]
  ...
}

Variant 7

Inserting value into input field using XPath selectors

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "insert",
      "xpath": "//input[contains(@id,'password')]",
      "value": "MY_PASSWORD",
      "milliseconds": 1000 // Time to wait before next action
    }
    ...
  ]
  ...
}

Variant 8

Triggering javascript change event on selected dom elements. This is useful for getting fresh data to load after a new option is selected in a dropdown

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "trigger_change",
      "dom_query": "CSS selector of element to receive event",
      "milliseconds": 1000 // Time to wait before next action
    }
    ...
  ]
  ...
}

Variant 9

Directing the crawler to go another URL after completing prior actions on the current page. This is useful for sites where modify your account settings page before fetching data from another part of the site.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "goto",
      "origin_url": "The next URL to go to"
    }
    ...
  ]
  ...
}

Variant 10

Directing the crawler to focus on another window in the browser. This is useful for sites where actions trigger a popup window and the data you want is within the popup window.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "switch_window",
      "title": "Matrix"
    }
    ...
  ]
  ...
}

Variant 11

Have the crawler wait while a person solves the Captcha that is being served up on a page.

Example

{
  ...
  "actions": [
    ...
    {
      "action_name": "solve_captcha",
      "dom_query": "CSS selector of the captcha element on the page" // The crawler will keep on waiting while this element is found on the page
    }
    ...
  ]
  ...
}


columns

An Array of column_objects to get all the data attributes you want from a web page. It supports Dynamic variables.

Example

{
  ...  
  "columns": [
    column_object_1, 
    column_object_2, 
    column_object_3
  ]
  ...
}


column_object

An Object that indicates a Data Attribute you want to get from the Page. It has 6 Variants. The Object contains the following attributes.

Attributes

col_name Required String An arbitrary string that describes the data attribute represented by this column_object
dom_query Conditional Required String A valid jQuery CSS Selector that describes the list of DOM Elements you want to get from a web page.
Either dom_query, xpath or row_number must be present. When none are present, the engine defaults to returning the entire documents text.

Supported only by the following engines
  • nokogiri
  • chrome
outer_dom_query Optional String A valid jQuery CSS Selector that describes the list of DOM Containers to fetch DOM Elements you want to get from a web page.
Is active on when dom_query is present

Supported only by the following engines
  • nokogiri
  • chrome
xpath Conditional Required String A valid XPath Selector that describes the list of DOM Elements you want to get from a web page. Either dom_query, xpath or row_number must be present. When none are present, the engine defaults to returning the entire documents text.

Supported only by the following engines
  • nokogiri
  • chrome
outer_xpath Optional String A valid XPath Selector that describes the list of DOM Containers to fetch DOM Elements you want to get from a web page. Is active on when xpath is present

Supported only by the following engines
  • nokogiri
  • chrome
required_attribute Optional required_attribute_type The attribute to get from the matching DOM Elements. If not indicated, defaults to the innerHTML of the matching DOM Elements.

Supported only by the following engines
  • nokogiri
  • chrome
options Optional options_object The nested page to get data from. When used, either
regex_pattern Optional String Regular Expression Pattern that is global and case insensitive.
regex_group Optional Integer If a regular expression group is used, indicates the Nth matching value to return.
If * is used instead of an Integer, combines all matching values into a String together in a comma separated format.
pdf_page Optional Integer If indicated only returns text from a specific page of a PDF document. Pagination starts from 1.
It otherwise defaults to returning the text for the entire PDF document.

Supported only by the following engines
  • pdf_reader
prefix Optional String Prepends the string provided to the beginning of the column that was collected from the page.
postfix Optional String Appends the string provided to the end of the column that was collected from the page.
is_index Optional Boolean Indicates it values in this column should be used as the primary key for this data set. It affects how our Engine interprets the groupings_object you provide in your recipe
listagg Optional Boolean

If set to true, collapses all the values of this column on the page into a single line.

false by default.

listagg_delimiter Optional String

If provided, uses the string provide as the delimiter between the values of this column in a listagg operation

"" by default.

row_number Conditional Required Boolean When flagged as true, a sequential integer is assigned to this column for each row of result from a page. Either dom_query, xpath or row_number must be present
order_by Optional order_by_object Used in conjunction with the row_number, determines the order in which sequential integers should be assigned. When omitted, the natural order in which records are gathered is assumed, typically from the top to the bottom of the page.
partition_by Optional Array ofcol_names Used in conjunction with the row_number, divides rows of results from a page into groups. When omitted, all rows of results gathered from the page is in a single partition.

Variant 1

Gets a list of innerHTML from matching DOM Elements with jQuery selectors.

Example

{
  ...
  "columns": [{ 
    "col_name" :  'Column name',
    "dom_query" : 'div.product-name'  
  }]
  ...
}

Variant 2

Gets a list of innerHTML from matching DOM Elements with XPath.

Example

{
  ...
  "columns": [{
    "col_name" : "column name",
    "xpath" : '//xpath/to/elements'
  }]
  ...
}

Variant 3

Gets a list of DOM Attribute from matching DOM Elements.

Example

{
  ...
  "columns": [{
    "col_name" : "Price of Products",
    "xpath" : "//xpath/to/elements",
    "required_attribute" : required_attribute_type  
  }]
  ...
}

Variant 4

Gets the URLs to the nested pages and then continues to get data from these pages.

Example

{
  ...
  "columns": [{
    "col_name" : 'Product detailed page',
    "dom_query" : 'a.detailed-url',
    "required_attribute" : 'href',
    "options" : options_object
  }]
  ...
}

Variant 5

Gets the innerHTML from matching DOM Elements and only gets a subset of the String that matches a Regex Pattern

Example

{
  ...
  "columns": [{
    "col_name" : 'Product Price',
    "dom_query" : 'div.product-description',
    "required_attribute" : "innerHTML",
    "regex_pattern" : "[0-9.]+",
    "regex_group" : 1
  }]
  ...
}

Variant 6

Applies row_number to a column for rows of results gathered from a page.

Example

{
  ...
  "columns": [{
      "col_name" : 'product_ranking',
      "row_number" : true,
      "partition_by": ['product_category', 'product_sub_category'],
      "order_by": [{
        "col_name": "product_price",
        "order": "desc"
      },{
        "col_name": "product_rating",
        "order": "desc"
      }]
    },
    ...
    {
      "col_name" : 'product_category',
      "dom_query" : '.category'
    },{
      "col_name" : 'product_sub_category',
      "dom_query" : '.sub-category'  
    },{
      "col_name" : 'product_name',
      "dom_query" : '.prod-name'  
    },{
      "col_name" : 'product_price',
      "dom_query" : '.prod-price'  
    },{
      "col_name" : 'product_rating',
      "dom_query" : '.prod-rating'  
  }]
  ...
}


required_attribute

The required_attribute is a String that indicates the attribute to get from a DOM Element.

These are the general String values available in normal DOM Elements on top of other non-common ones.

General Types

  • inner_html
  • outer_html
  • text
  • href
  • src

next_page_object

The next_page_object is an Object that describes the hyperlink to the next page in an listing page. It has 2 variants. The Object has the following attributes.

Attributes

dom_query Conditional Required String A valid jQuery CSS Selector that describes the list of DOM Elements you want to get from a web page.
Either dom_query or xpath must be present
xpath Conditional Required String A valid XPath Selector that describes the list of DOM Elements you want to get from a web page. Either xpath or dom_query must be present
persist Optional Boolean Only supported by the chrome engine. A Boolean when true loads the next page within the same Window by performing an actual click on the described DOM Element. This is useful for handling paginations on sites where the next page is loaded via Ajax.
refresh Optional Boolean Only supported by the chrome engine. This is used in conjunction with persist. The page is refreshed after the next page is loaded. This is useful for clearing out javascript memory leaks in pages. It cannot be used for Ajax enabled paginations that reset upon refresh. It defaults to false
milliseconds Optional Integer Only supported by the chrome engine. This is used in conjunction with persist to ensure the contents from the next page is loaded before we execute the columns clause. It defaults to a one second wait time if not declared.
scroll_bottom Optional Boolean Only supported by the chrome engine. This is used in conjunction with persist. It scrolls the window to the bottom of the page after the click event occurs. It defaults to false.
max_page_dom_query Optional String Only supported by the chrome engine. This is used in conjunction with persist. A valid jQuery CSS Selector that describes the dom element to extract the max page number value from. Once a valid value is extracted, the crawler will continue to crawl through to the next page the last indicated page is reached. This is useful if a page loads the next page via Ajax and the next page link does not automatically get disabled when the last page is reached
max_page_xpath Optional String Only supported by the chrome engine. This is used in conjunction with persist. A valid XPath Selector that describes the dom element to extract the max page number value from. Once a valid value is extracted, the crawler will continue to crawl through to the next page the last indicated page is reached. This is useful if a page loads the next page via Ajax and the next page link does not automatically get disabled when the last page is reached
times Optional Integer Only supported by the chrome engine. This is used in conjunction with persist. It indicates the number of times to crawl to the next page before the crawler should stop.

Variant 1

This variant uses jQuery selector to detect the link to the next page

Example

{
  ...
  "next_page": {  
    "dom_query" : '.listing_next_page' 
  },
  ...
}

Variant 2

This variant uses the xPath selector to detect the link to the next page

Example

{
  ...
  "next_page": {
    "xpath" : '//xPath/to/next/page' 
  },  
}

Variant 3

This variant clicks on the link next page within the same browser window.

Example

{
  ...
  "next_page": {  
    "dom_query" : "#pagination .next_page a",
    "presist" : true,
    "refresh": true,
    "milliseconds": 1000
  },
  ...
}


source_format

The source_format is a String that indicates the format a web page is served in

Supported formats

  • html - the data on the page is in HTML format
  • json - the data on the page is in JSON format
  • json_attr - the data on the page has HTML format embedded in JSON format

required_json_attr

The required_json_attr is a String which indicates the JSON attribute to extract HTML data from. It only works if the source_format_object is also provided in the recipe


groupings_object

The groupings_object is an Array . It allows you to cluster attributes as sub-attributes in your JSON export. For it to work, you will need to indicate is_index as true for one column_object.

Sample Input

{
  ...
  "columns": [{
      "col_name": "Author",
      "dom_query": ".author",
      "is_index": true
    },{
      "col_name": "Book name",
      "dom_query": ".book_name"
    },{
      "col_name": "Published date",
      "dom_query": ".published_at"
    }],{
      "col_name": "Rater Name",
      "dom_query": ".rater"
    },{
      "col_name": "Rating",
      "dom_query": ".rating"
  }],
  ...
  "groupings": [{
      "group_name" : "publised_books",
      "group_members": ["Book name", "Published date"]
    },{
      "group_name" : "ratings",
      "group_members": ["Rater Name", "Rating"]
  }]
  ...
}

order_by_object

The order_by is an Array . It allows you to that indicates the order in which records in your data should be sorted by.

Attributes

col_name Required String A String that references the column to sort the records in the data set by
order Optional String A String that indicates the order in which the records should be sorted by. When not provided, it defaults to ascending.

Supported

  • asc
  • desc

Sample Input

{
  ...
  "order_by": [{
      "col_name": "Author name",
      "order": "asc"
    },{
      "col_name": "Rating",
      "order": "desc"
  }]
  ...
}

data object

The data_object is an Object which allows you to repopulate your crawler with desired columns and values.

Example

{
  ...
  "data": {
    "internal_reference_code": "for_fun"
  }
  ...
}

variables

Variables correspond to col_names within the recipe whose values have already been gathered. When utilized the attributes, their corresponding values are constructed dynamically. They are written in the format {{SOME_PATTERN}}.

Example usage in the origin_url object

{
    "engine": "nokogiri",
    "client_version": "4.0.20",
    "origin_url": "https://www.imdb.com/title/tt0119177/fullcredits",
    "columns": [
        {
            "col_name": "CAST_NAME",
            "dom_query": "table.cast_list tr td:nth-child(2).itemprop a span.itemprop",
            "options": {
                "origin_url": "https://www.google.com/search?q={{CAST_NAME}}+date+of+birth",
                "columns": [
                    {
                        "col_name": "date_of_birth",
                        "dom_query": "div[role='heading'][aria-level='3'].HwtpBd.kno-fb-ctx div:nth-child(1)"
                    }
                ]
            }
        }
    ]
}

Example usage in the column_object.dom_query

{
    "engine": "nokogiri",
    "client_version": "4.0.20",
    "origin_url": "https://sanfrancisco.postalcodes.com",
    "columns": [
        {
            "col_name": "POSTAL_CODE",
            "dom_query": "td.postal_code",
            "options": {
                "origin_url": "https://real-estate-agents.com",
                "columns": [
                    {
                        "col_name": "real_estate_agent",
                        "dom_query": "tr:contains('{{POSTAL_CODE}}') td.agent_name"
                    }
                ]
            }
        }
    ]
}

Example usage in the actions.dom_query

{
    "engine": "nokogiri",
    "client_version": "4.0.20",
    "origin_url": "https://real-estate-agents.com",
    "columns": [
        {
            "col_name": "AGENT_NAME",
            "dom_query": "td.agent_name",
            "options": {
                "origin_url": "https://real-estate-agents.com",
                "actions": [{
                    "action_name": "click",
                    "dom_query": "tr:contains('{{AGENT_NAME}}') td button.show_contacts"
                }],
                "columns": [
                    {
                        "col_name": "phone_number",
                        "dom_query": "tr:contains('{{AGENT_NAME}}') td.contact_details"
                    }
                ]
            }
        }
    ]
}

Example usage in the post_data object

{
  "origin_url": "https://start.getdata.io",
  "columns": [{
    "col_name": "product_name",
    "dom_query": ".prod-name",
    "options": {
        "method": "post",
        "origin_url": "https://search_engine.com/search/getcards?term=",
        "headers": {
            "Content-Type": "application/json"
        },
        "post_data": {
            "filters": [
                {
                    "name": "projectCode",
                    "values": [
                        "{{product_name}}"
                    ]
                }
            ],
        },
        "columns": [{
            "col_name": "price",
            "dom_query": ".price"
        }]
    }
  }]
}

Example usage in the data object

{
  "origin_url": "https://start.getdata.io",
  "columns": [{
    "col_name": "product_name",
    "dom_query": ".prod-name"
  }],
  "data": {
    "internal_reference_code": "INTERNAL_PREFIX_{{product_name}}"
  }
}


options_object

The options_object is an Object that describes the data you want to get in a nested/sub-page. It has 2 variants.
The Object has the following attributes

Attributes

columns Required columns An Array of column_objects
origin_url Conditional Required origin_url_object An Object that list the URL(s) to get data from: Variant 4
next_page Optional next_page_object An Object indicates the hyperlink to the next page of this listing
cookies Optional cookies_object An Array that list the Cookies to use for the site
headers Optional headers_object An Object that is sent as the REQUEST HEADER
wait Optional Integer See wait_object for more details
to_click Optional to_click_object An Array that list the DOM Elements in the page to click before getting data

Variant 1

Crawls from a parent page to a nested page to get more data via the href that is indicated as the required_attribute of the parent column_object

Example

{
  ...
  "columns" : [{
    "col_name" : "product_page",
    "dom_query" : ".product a",
    "required_attribute" : "href",
    "options" : {
      "columns" : columns,
      "next_page" : next_page_object
    } 
  }],
  ...
}

Variant 2

Jumps from a page to another arbitrary page that is not necessarily linked by forming a new URL using the value from parent column_object. href need not be indicated as the required_attribute of the parent column_object.

Example

{
  ...
  "columns" : [{
    "col_name" : "product_name",
    "dom_query" : ".product",
    "options" : {
      "origin_url" : "https://www.google.com/search?q={{product_name}}+price",
      "columns" : columns,
      "next_page" : next_page_object,  
    }
  }],
  ...
} 


Next Steps

To find out more about how GetData.IO works