Requests with concurrent.futures in Python 2.7

Python 3 is making great steps toward easy concurrency, and some of those have been backported into Python 2.7. The concurrent.futures module is available after you `pip install futures`. This package brings very convenient classes for doing threading (ThreadPoolExecutor) or multiprocessing (ProcessPoolExecutor).

Threads are useful when the code is blocked by something other than bytecode execution, such as I/O or external process execution (C code, system calls, etc.). If bytecode execution is holding things up, the ProcessPoolExecutor starts multiple interpreters that can execute in parallel. However, there is more overhead in spinning up these interpreters and in having them communicate with the main process through serialized representations (pickled objects passed between processes, if I understand correctly).
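To make the process side concrete, here is a minimal sketch of a CPU-bound job on a ProcessPoolExecutor. The fib function and the inputs are made up for illustration; note that submitted functions must be picklable, so they have to be defined at module level, and the pool should be created under the __main__ guard:

from concurrent.futures import ProcessPoolExecutor

def fib(n):
    # deliberately slow, CPU-bound recursion
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

if __name__ == '__main__':
    pool = ProcessPoolExecutor(4)  # roughly one worker per core is a reasonable default
    # Executor.map returns results in input order
    results = list(pool.map(fib, [28, 29, 30, 31]))
    print results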

Here is an example with requests, which is I/O bound:

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
from time import time

urls = ['google.com', 'cnn.com', 'reddit.com', 'imgur.com', 'yahoo.com']
urls = ["http://" + url for url in urls]

# Time requests running synchronously.
# In Python 2, map is eager, so all the requests complete on this line.
then = time()
sync_results = map(requests.get, urls)
print "Synchronous done in %s" % (time() - then)

# Time requests running in threads
then = time()
pool = ThreadPoolExecutor(len(urls))  # for many urls, this should probably be capped at some value

futures = [pool.submit(requests.get, url) for url in urls]
# as_completed yields each future as soon as its request finishes
results = [f.result() for f in as_completed(futures)]
print "Threadpool done in %s" % (time() - then)

The results:

Synchronous done in 46.8979928493
Threadpool done in 14.2200219631

With a longer list of URLs, these numbers are:

Synchronous done in 164.506973982
Threadpool done in 16.3909759521

This makes sense: the synchronous version takes the sum of all of the request times to complete, while the thread pool only takes about as long as the slowest single request.
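One thing the example above glosses over: as_completed yields futures in completion order, not submission order, so the results list no longer lines up with urls. A common pattern (a sketch here, not part of the timing code above) is to keep a dict from each future back to its URL, which also makes it natural to cap the worker count as the comment in the example suggests:

pool = ThreadPoolExecutor(10)  # capped, even if len(urls) is much larger
future_to_url = {pool.submit(requests.get, url): url for url in urls}
for future in as_completed(future_to_url):
    url = future_to_url[future]
    try:
        response = future.result()  # re-raises any exception from the worker thread
        print "%s returned %s" % (url, response.status_code)
    except Exception as e:
        print "%s raised %s" % (url, e)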
