Python scraping: parallel requests with urllib2


I want to scrape a site. There are about 8000 items to scrape. The problem is that if it takes 1 second to request one item, the whole job will take approximately 8000 seconds, which is about 133 minutes, or roughly 2.2 hours. How can I make multiple requests at the same time? I am using Python urllib2 to request the content.

  1. Use a better HTTP client. urllib2 uses Connection: close, so every request starts a conversation on a new TCP connection. With requests, you can reuse the same TCP connection:

      import requests

      s = requests.Session()
      r = s.get("http://example.org")
  2. Make the requests in parallel. Since this is I/O-bound work, the GIL is not a problem and you can use threads. You could start a few simple threads that each download a batch of URLs and then wait for them to finish, but something like a "parallel map" is usually simpler; see the sketch after this list.

    If you share anything between the threads, make sure it is thread-safe; the requests Session object appears to be thread-safe.

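    Here is a minimal sketch of the "parallel map" idea from point 2, assuming a made-up list of URLs; multiprocessing.dummy provides a thread-backed Pool with the same API as multiprocessing:

      # Minimal sketch only: the URL list below is a placeholder, not real data.
      import requests
      from multiprocessing.dummy import Pool  # thread-based Pool, same API as multiprocessing.Pool

      urls = ["http://example.org/item/%d" % i for i in range(10)]  # placeholder URLs
      session = requests.Session()  # shared between threads; Session appears to be thread-safe

      def fetch(url):
          # download one URL and return something small (the status code here)
          r = session.get(url)
          return url, r.status_code

      pool = Pool(5)                   # 5 worker threads
      results = pool.map(fetch, urls)  # blocks until all URLs are done, keeps input order
      pool.close()
      pool.join()
      print(results)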

    Update - a fuller example:

      #!/usr/bin/env python
      import threading

      import lxml.html
      import requests
      import multiprocessing.dummy  # thread-based Pool

      first_url = "http://api.stackexchange.com/2.2/questions?pagesize=10&order=desc&sort=activity&site=stackoverflow"
      rs = requests.session()
      r = rs.get(first_url)
      links = [item["link"] for item in r.json()["items"]]
      lock = threading.Lock()

      def f(data):
          n, link = data
          r = rs.get(link)
          doc = lxml.html.document_fromstring(r.content)
          names = [el.text for el in doc.xpath("//div[@class='user-details']/a")]
          with lock:  # serialize printing so output from different threads does not interleave
              print("%s. %s" % (n + 1, link))
              print(", ".join(names))
              print("---")
          # you can also return a value; the results come back from pool.map()
          # in the order corresponding to links
          return names

      pool = multiprocessing.dummy.Pool(5)
      name_list = pool.map(f, enumerate(links))
      print(name_list)
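    To relate this back to the original question: the same pattern works for a flat list of item URLs. The sketch below is only an illustration; the list item_urls, the parse_item function, and the pool size of 20 are assumptions, not anything taken from the question or the answer. With 20 worker threads and about 1 second per request, 8000 requests take roughly 8000 / 20 = 400 seconds (under 7 minutes) instead of more than 2 hours.

      # Hedged sketch: item_urls, parse_item() and the pool size are placeholders
      # for whatever the real site and parsing logic require.
      import requests
      from multiprocessing.dummy import Pool

      item_urls = ["http://example.org/item/%d" % i for i in range(8000)]  # placeholder list
      session = requests.Session()

      def parse_item(url):
          r = session.get(url)
          return len(r.content)  # replace with real parsing of the page

      pool = Pool(20)  # 20 concurrent downloads
      results = pool.map(parse_item, item_urls)
      pool.close()
      pool.join()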
