Simple web scraper

From time to time it is helpful to quickly rip some information from a group of web pages.  For example, I was doing some keyword analysis and wanted to know how many Google results there are for each keyword in my list (both exact and broad matches).  The answer is right there on the page when you do the Google search, so all there is to it is automating the process of running the queries.

The “requests” library makes accessing URLs from Python a snap.  It can handle HTTPS authentication and keeps track of cookies.  Another useful library is PyQuery, a Python port of jQuery, which makes grabbing specific data out of a page super simple.  Here is the code I ended up using, which is easy to reuse whenever a similar situation comes up:

'''Quickly get data from web pages in Python.'''
import csv
import time

import requests                       # pip install requests if you don't have it
from pyquery import PyQuery as pq     # you'll probably have to get this too (pip install pyquery)

sesh = requests.Session()

def number_of_results(phrase):
    '''Return the number of results google reports for a given phrase'''
    url = 'http://google.com/search?q='+phrase.replace(' ','+')
    result = sesh.get(url)
    doc = pq(result.content)

    #Looking at the page source from a browser, I see the number of results
    # is marked with <div id="resultStats">About 1,090,000,000 results</div>
    # In a PyQuery selector, an id is indicated by '#' and a class by '.'

    stat = doc('div#resultStats')[0]     # there's only one result
    #print stat.text   # 'About 1,090,000,000 results'

    #slice off 'About ' and ' results', drop the ','s, and convert to an integer
    if stat.text[0] == 'A':
        return int(stat.text[6:len(stat.text)-8].replace(',',''))
    else:
        return int(stat.text[0:len(stat.text)-8].replace(',',''))

if __name__ == '__main__':
    '''Print tab-delimited Google competition results for the phrases in keyword_ideas.csv'''
    with open('keyword_ideas.csv','rb') as f:
        reader = csv.reader(f, delimiter='\t')
        for line in reader:
            #printing like this can be copied and pasted into a spreadsheet
            #broad match
            print line[0]+'\t'+str(number_of_results(line[0].strip()))
            time.sleep(.5)  #todo: replace this with an exponential backoff decorator
            #exact match
            print '['+line[0]+']\t'+str(number_of_results('"'+line[0].strip()+'"'))
            time.sleep(.5)
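
The hard-coded half-second sleeps are just a crude way of not hammering Google; the todo comment mentions replacing them with an exponential backoff decorator.  A minimal sketch of what that might look like is below.  The names retry_with_backoff, tries, and base_delay are my own, made up for illustration, and it simply retries on any exception, doubling the wait each time:

'''Sketch of an exponential backoff decorator (names are illustrative).'''
import time
from functools import wraps

def retry_with_backoff(tries=5, base_delay=0.5):
    '''Retry the wrapped function, doubling the wait between attempts.'''
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(tries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == tries - 1:
                        raise          # out of retries, give up
                    time.sleep(delay)  # wait, then try again with a longer pause
                    delay *= 2
        return wrapper
    return decorator

# Decorating the scraping function would look like:
# @retry_with_backoff(tries=5, base_delay=0.5)
# def number_of_results(phrase):
#     ...

With something like that in place, a failed request retries itself with increasing pauses instead of relying on a fixed sleep between every query.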
