Pages

Sunday, February 27, 2011

Web scraping with Google Spreadsheets and XPath

This is part one of a series of video tutorials on web scraping and web crawling.

In this first video, I show how to grab parts of a web page (scraping) using Google Docs Spreadsheets and XPath.

Google Spreadsheets has a nice function called importXML which will read in a web page. You can then apply an XPath to that page, to grab various parts of it, such as one particular value, or all of the hyperlinks. This is a convenient method, as your data will be in a format that is easily downloadable in Excel.

Watch the video here:



Part B of the video is here (sorry about the crap sound, working on it):



Useful XPaths:

//a
grabs all the anchors (hyperlinks) in a document

//a/@href
grabs all the URLs in hyperlinks in a document

//div[starts-with(@class, 'left')]
grabs all the div elements whose css class start with 'left'

//input[@type='text']/..
grabs the parent element of all input text elements

count(//p)
returns the number of paragraph elements in a page

//a[contains(@href, 'craigslist')]/@href
find all the hyperlinks that contain the word 'craigslist'

//blockquote/p[not(@align='center')]
find all the paragraphs that do NOT have center alignment

You can read more about XPath here:

https://developer.mozilla.org/en/XPath/Functions

and here:

http://www.w3schools.com/XPath/xpath_syntax.asp

Part 1: Web scraping with Google Spreadsheets and XPath
Part 2: Web Crawling with RapidMiner
Part 3: Web Scraping with RapidMiner and Xpath
Part 4: Web Scraping AJAX Pages

16 comments:

  1. HURRAY!
    Thank you!
    Now we only have to hope that this is going to be an at least 10 part series :))

    ReplyDelete
  2. The Xpath idea is really great for allowing real time data. I played around a bit with HTTRack in the past, but the Xpath option seems to be much more flexible and effective. Thanks!

    ReplyDelete
  3. Tnx! More please :-)

    ReplyDelete
  4. Informative video, which gives us more knowledge about web scraing with google spreadsheet. This spreadsheet will scrap a table from an HTML web page. The url of the target web page and the target table element need to be in double quotes.

    ReplyDelete
  5. Have any of you guys tried Helium Scraper (http://www.heliumscraper.com)? Looks kinda promising...

    ReplyDelete
  6. This is really cool, I'm looking forward to the other posts in the series.

    ReplyDelete
  7. Thanks a lot! This is just great.

    I have a question regarding the hyperlinks:

    - I want to put the results returned by crawling the hyperlinks returned with //a/@href, in a different sheet, but I do not know how to make reference to them, since each hyperlink is in a cell with the following reference: CONTINUE(B5,2,1), CONTINUE(B5, 3,1), etc.

    Any thoughts on how I could procede?

    Thanks a lot!

    ReplyDelete
  8. Thanks for these videos, They are really informative... Just a thought tho, you really pound on that keyboard, can't be good for it lol

    Thanks again!

    ReplyDelete
  9. @fash, ya, the microphone is built into my laptop, so it picks up all the keystrokes really loudly! i just got a nice external mic, so hopefully the next video sounds better.

    ReplyDelete
  10. @anonymous, you should be able to reference they cell values the same as any other value. the continue() function just allows another function to return values to multiple cells.

    ReplyDelete
  11. If you look for web data scraping tool, you can try http://smokedoc.org - a best known data extracting software!
    Quick, precise and very efficient!

    Helen.

    ReplyDelete
  12. Great post!! I'm definitely going to check out RapidMiner! I also started a collection of ImportXML & XPaths because we use them at work. I would love to share more recipes with anyone! :) Check it out ..

    ImportXML Cookbook:
    http://www.seerinteractive.com/blog/importxml-cookbook/2011/09/17/

    ReplyDelete
  13. Great videos! How does one obtain the links of the images that an ad has? They seem to be embedded into a table but I've tried accessing them and I can't seem to make it work. I would really love to know how this is done. Thanks again!

    ReplyDelete
  14. @anonymous, you should be able to find img tags and use their @src attribute. hth

    ReplyDelete
  15. This is the first time that i watch this video tutorial and i can say that i learn about web scraping .To be honest i am doing data analysis but aside from that i wanna learn different kind of things specially in online things.

    ReplyDelete