Harvesting data from Amazon

This tutorial walks you step by step through harvesting data from Amazon.


Register an account

To get started, you need an account. If you don't have one yet, register for one.


Log in to your dashboard

Once you have logged in to your account, you will see your dashboard. This is where all your projects are listed.


Create a new Krake

In your dashboard, create your krake. Give it a name and a description. The daily frequency tells your krake how often you want it to run.

If you want other people from the community to be able to access your Krake, make it public. If you are creating this Krake solely for private use within your company, make it private.

Now this is where the fun starts. In the text field of the edit tab, start with the definition below. In this tutorial we want to scrape pricing information for iPhones listed on Amazon.com.

{
  origin_url: 'http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=iphone'
}
    

Next, we define the attributes we want on this page. In this tutorial, we want to extract the product names, product image and product price.

For demonstration purposes, we will use a CSS selector to get the product name, and XPath to get the product price and product image. Notice that in the definition below, I am using dom_query for the product name and xpath for the product price and product image.

Notice also that for the product image, I have included an additional attribute called required_attribute, set to src. This tells our engine to extract the value of the src attribute of this DOM element.

{
  origin_url: 'http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=iphone',
  columns : [{
      col_name : 'product name',
      dom_query : 'span.lrg.bold'
    },{
      col_name : 'product price',
      xpath : '//ul[1]/li/a[1]/span[@class="bld lrg red"]'
    },{
      col_name : 'product image',
      xpath : '//img[@class="productImage"]',
      required_attribute : 'src'
  }]
}
    

This should work for version 1. Click Test it! to see the output you will get. You should see 3 columns of output in your Run tab.
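
Before testing, it can help to check the shape of a definition locally. The sketch below is only an illustration: checkDefinition is a hypothetical helper, not part of Krake, and the rule that each column sets exactly one of dom_query or xpath is an assumption drawn from the examples in this tutorial.

```javascript
// Hypothetical helper (not part of Krake): checks a definition against the
// shape used in this tutorial and returns a list of problems found.
function checkDefinition(definition) {
  const problems = [];
  if (typeof definition.origin_url !== 'string' || definition.origin_url === '') {
    problems.push('origin_url is missing');
  }
  if (!Array.isArray(definition.columns)) {
    problems.push('columns must be an array');
    return problems;
  }
  definition.columns.forEach((col, i) => {
    if (typeof col.col_name !== 'string' || col.col_name === '') {
      problems.push(`columns[${i}]: col_name is missing`);
    }
    // Assumption: a column uses either dom_query (CSS) or xpath, not both.
    const hasCss = typeof col.dom_query === 'string';
    const hasXpath = typeof col.xpath === 'string';
    if (hasCss === hasXpath) {
      problems.push(`columns[${i}]: set exactly one of dom_query or xpath`);
    }
  });
  return problems;
}

// The version 1 definition from this tutorial.
const definition = {
  origin_url: 'http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=iphone',
  columns: [{
    col_name: 'product name',
    dom_query: 'span.lrg.bold'
  }, {
    col_name: 'product price',
    xpath: '//ul[1]/li/a[1]/span[@class="bld lrg red"]'
  }, {
    col_name: 'product image',
    xpath: '//img[@class="productImage"]',
    required_attribute: 'src'
  }]
};

console.log(checkDefinition(definition)); // prints []
```

An empty list means the definition matches the shape shown here. It says nothing about whether the selectors still match Amazon's pages.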


Diving even deeper with your Krake

Now that your Krake has run, let's get it to dive even deeper into Amazon's sitemap and extract each individual seller's profile.

Notice that I have declared a new col_name called diving deeper.

This tells Krake to go one level deeper via the product name link.

{
  origin_url: 'http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=iphone',
  columns : [{
      col_name : 'product name',
      dom_query : 'h3.newaps a span.lrg.bold'
    },{
      col_name : 'product price',
      xpath : '//ul[1]/li/a[1]/span[@class="bld lrg red"]'
    },{
      col_name : 'product image',
      xpath : '//img[@class="productImage"]',
      required_attribute : 'src'
    },{
      col_name : 'diving deeper',
      dom_query : 'h3.newaps a',
      required_attribute : 'href',
      options : {

      }
  }]
}
    

Now that Krake is one level deeper, let's get the seller's name.

{
  origin_url: 'http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=iphone',
  columns : [{
      col_name : 'product name',
      dom_query : 'h3.newaps a span.lrg.bold'
    },{
      col_name : 'product price',
      xpath : '//ul[1]/li/a[1]/span[@class="bld lrg red"]'
    },{
      col_name : 'product image',
      xpath : '//img[@class="productImage"]',
      required_attribute : 'src'
    },{            
      col_name : 'diving deeper',
      dom_query : 'h3.newaps a',
      required_attribute : 'href',
      options : {
        columns : [{
          col_name : 'seller name',
          xpath : '//*[@id="handleBuy"]/div[1]/span/a'
        }]
      }
  }]
}
    

And now you are done. Click Preview to see the new output you will get. You should see 4 columns of output in your Run tab.
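
To see which columns are read from the search results page and which from the page one level deeper, the nested definition can be walked recursively. This is a minimal sketch: columnsByDepth is a hypothetical helper, not part of Krake, and it only assumes that nested columns live under options.columns as in the definition above.

```javascript
// Hypothetical helper (not part of Krake): lists every col_name in a
// definition together with how many levels below origin_url it is read.
function columnsByDepth(columns, depth = 0, out = []) {
  for (const col of columns) {
    out.push({ depth, col_name: col.col_name });
    // Nested columns sit under options.columns, as in the tutorial.
    const nested = col.options && col.options.columns;
    if (Array.isArray(nested)) {
      columnsByDepth(nested, depth + 1, out);
    }
  }
  return out;
}

// Column layout of the final definition; selectors omitted for brevity.
const columns = [
  { col_name: 'product name' },
  { col_name: 'product price' },
  { col_name: 'product image' },
  {
    col_name: 'diving deeper',
    options: { columns: [{ col_name: 'seller name' }] }
  }
];

for (const entry of columnsByDepth(columns)) {
  console.log(`${entry.depth}: ${entry.col_name}`);
}
```

Here only seller name is reported at depth 1, because it is the only column read from the product page that Krake reaches through the diving deeper link.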

Now that you have finished editing and testing your krake, click the Save button. You will be directed to your krake's page.


Feed your krake

Like all sea creatures, your krake needs to be fed. Your krake consumes 1 token from your account's quota for each page it dives to on the internet to harvest data from.

Your krake stops working when your account runs out of quota, so make sure the quota lasts for as long as your krake is working. Otherwise your krake starves.

Before you tell your krake to start working, ensure it has enough quota to finish the task you assigned to it.

To be safe, top up your account's quota at least once by clicking the Top up quota with paypal button on your krake's page.
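
A rough quota estimate follows from the 1 token per page rule. The sketch below assumes that the search results page itself costs a token and that each product link on it leads to one more page; tokensNeeded and hasEnoughQuota are hypothetical helpers, and the figure of 16 links is only an illustrative input.

```javascript
// Hypothetical helpers (not part of Krake), based on 1 token per page.
// Assumption: every listing page costs 1 token, plus 1 token for each
// product page reached from it through the diving deeper column.
function tokensNeeded(listingPages, linksPerListingPage) {
  return listingPages + listingPages * linksPerListingPage;
}

function hasEnoughQuota(quota, listingPages, linksPerListingPage) {
  return quota >= tokensNeeded(listingPages, linksPerListingPage);
}

// One search results page with 16 product links (illustrative number).
console.log(tokensNeeded(1, 16));       // prints 17
console.log(hasEnoughQuota(10, 1, 16)); // prints false
```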


Get your krake to start harvesting data

Now that you have ensured your krake has enough food to eat while it goes about its work, get your krake to start harvesting data from Amazon.com by clicking on the Run Krake button. Your krake will now start harvesting the data you need.

Your krake will finish harvesting the data after a while. The time required depends on the size of your harvest. A job that requires your krake to dive to 100 pages typically takes 15 minutes.
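
For a rough idea of how long a larger job might take, the 100 pages in 15 minutes figure can be scaled linearly. This is only a sketch: estimateMinutes is a hypothetical helper, and the assumption that run time grows in proportion to the number of pages is not something the service guarantees.

```javascript
// Hypothetical helper (not part of Krake): scales the typical figure of
// 15 minutes per 100 pages linearly. Real run times may differ.
function estimateMinutes(pages) {
  return (pages * 15) / 100;
}

console.log(estimateMinutes(100)); // prints 15
console.log(estimateMinutes(400)); // prints 60
```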


Consuming the data your krake has harvested for you

Now that your krake has finished harvesting data from your web pages, you are ready to consume it.

You can download your data in CSV format. To do so, click on the download csv button.

Or you can download your data in JSON format. To do so, click on the download json button.

You can also integrate your own application with these URL sources directly via HTTP.
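
Once your application has the JSON text, whether from the downloaded file or fetched over HTTP, it can be turned into records. The sketch below rests on an assumption that is not confirmed by this tutorial: that the JSON is an array of objects keyed by col_name. The helper names parsePrice and summarize are hypothetical and the sample rows are made up for illustration.

```javascript
// Hypothetical helpers (not part of Krake).
// Pulls the first number out of a price string such as "$1,299.00".
function parsePrice(text) {
  const match = /\d[\d,]*(?:\.\d+)?/.exec(String(text));
  return match ? Number(match[0].replace(/,/g, '')) : null;
}

// Assumption: the JSON is an array of rows keyed by col_name.
function summarize(jsonText) {
  const rows = JSON.parse(jsonText);
  return rows.map((row) => ({
    name: row['product name'],
    price: parsePrice(row['product price'])
  }));
}

// Made-up sample rows in the assumed format.
const sample = JSON.stringify([
  { 'product name': 'Example phone A', 'product price': '$1,299.00' },
  { 'product name': 'Example phone B', 'product price': 'N/A' }
]);

console.log(summarize(sample));
```

Rows whose price text contains no number get a price of null, so they can be filtered out or reported separately.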