Web Scraping with Python 3.4: Getting Started

In summary, the original code passes the entire list of URLs to urlopen() instead of a single URL, which raises a confusing AttributeError; indexing into the list (or, more idiomatically, a for loop) fixes it.
  • #1
TheDemx27
I'm starting on this web scraper, and all I'm trying to do so far is retrieve the source code from the sites.

Code:
import urllib.request
from urllib.request import urlopen

urls = ["[PLAIN]http://google.com",[/PLAIN]  "[PLAIN]http://nytimes.com",[/PLAIN]  "[PLAIN]http://www.rockpapershotgun.com/"][/PLAIN] 

i = 0

while i < len(urls):

	htmlfile = urllib.request.urlopen(urls)
	htmltext = htmlfile.read()
	print(htmltext)
	i+=1

I run it, and I get this error:

Code:
Traceback (most recent call last):
  File "C:\Python34\WebScraper.py", line 10, in <module>
    htmlfile = urllib.request.urlopen(urls)
  File "C:\Python34\lib\urllib\request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 446, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
[Finished in 0.2s with exit code 1]

I'm using Python 3.4.
 
  • #2
AlephZero
Why are you passing the whole list of URLs to urlopen()? Shouldn't you be passing just one URL each time through the loop?

The bad thing about languages with no static type checking (like Python) is that when things go wrong, the result can be total confusion, hence the not-very-helpful error message about timeout.
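As a sketch of one way to fail faster, an explicit check turns this mistake into a clear error (the fetch() helper is just illustrative, not from the thread):

Code:
# Illustrative helper: reject anything that isn't a single URL string,
# so the mistake surfaces as a clear TypeError instead of the confusing
# AttributeError about 'timeout'.
from urllib.request import urlopen

def fetch(url):
    if not isinstance(url, str):
        raise TypeError("fetch() expects one URL string, got %r" % type(url))
    return urlopen(url).read()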
 
  • #3
Also, you might consider writing something like this:

Code:
import urllib.request

urls = ["http://www.google.com", "http://www.nytimes.com", "http://www.rockpapershotgun.com"]

# create a function that returns the result of a page scrape
def scrape_url(url):
    htmlfile = urllib.request.urlopen(url)
    htmltext = htmlfile.read()
    return htmltext

# apply the scrape_url function to every element in the urls list
# and convert the resulting iterator into a list
scrape = list(map(scrape_url, urls))

# print the list
print(scrape)
 
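Worth noting: in Python 3, map() returns a lazy iterator, so it is the list() call above that actually triggers the three downloads.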
  • #4
AlephZero said:
Why are you passing the whole list of URLs to urlopen()? Shouldn't you be passing just one URL each time through the loop?

Good point. Programming never fails to make me feel idiotic. :P

Code:
import urllib.request

urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

incrementSite = 0  # index to choose from the urls list

while incrementSite < len(urls):
    htmlfile = urllib.request.urlopen(urls[incrementSite])
    htmltext = htmlfile.read()
    print(htmltext)
    incrementSite += 1  # next url

Works for me. Thank you.
 
  • #5
What is all this?

Code:
while i < len (x):
    print (x[i])
    i +=1

Is that Python? It looks like a C programmer tried to write Python.

Might I suggest this:
Code:
things = ["alice", "bob", "carol"]
for t in things:
    print(t)

If you need the index, use this:

Code:
things = ["alice", "bob", "carol"]
for i, t in enumerate(things):
    print(t, "has list index", i)
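As a sketch, the same idiom applied to the scraper from post #1 (same URLs, same stdlib call, no index bookkeeping):

Code:
# The thread's scraper rewritten with a direct for loop.
from urllib.request import urlopen

urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

for url in urls:
    htmltext = urlopen(url).read()
    print(htmltext)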
 

Related to Web Scraping with Python 3.4: Getting Started

1. What is web scraping and why is it useful?

Web scraping is the process of extracting information from websites using automated scripts or programs. It is useful for collecting large amounts of data from various sources quickly and efficiently. This data can then be analyzed and used for various purposes, such as market research, competitive analysis, or data-driven decision making.

2. What tools do I need to get started with web scraping in Python 3.4?

To get started with web scraping in Python 3.4, you will need the following tools (a minimal example follows the list):
- A text editor or IDE for writing your code (e.g. Sublime Text, PyCharm)
- The Python 3.4 interpreter installed on your computer
- The BeautifulSoup library for parsing HTML
- The requests library for making HTTP requests to websites
- A basic understanding of HTML and CSS
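As a minimal sketch of those tools working together (the URL is just an example; assumes requests and beautifulsoup4 have been installed with pip):

Code:
import requests
from bs4 import BeautifulSoup

# fetch a page and raise an error for any 4xx/5xx response
response = requests.get("http://www.rockpapershotgun.com/")
response.raise_for_status()

# parse the HTML and pull out a couple of things
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.string)           # the page <title>, if present
for link in soup.find_all("a"):    # every anchor tag on the page
    print(link.get("href"))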

3. Is web scraping legal?

Web scraping is a gray area in terms of legality. While it is not explicitly illegal, there are certain ethical and legal considerations to keep in mind. It is generally considered acceptable to scrape public data from websites, but scraping private or copyrighted information without permission may be illegal. It is important to always check the terms of service and robots.txt file of a website before scraping it.
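For the robots.txt part, the standard library's urllib.robotparser can do the check; a minimal sketch (the URL is illustrative):

Code:
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.rockpapershotgun.com/robots.txt")
rp.read()

# can_fetch(user_agent, url) reports whether robots.txt permits the request
print(rp.can_fetch("*", "http://www.rockpapershotgun.com/"))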

4. Can I use web scraping for any website?

Technically, you can attempt to scrape any website. However, some websites may have measures in place to prevent scraping, such as CAPTCHAs or IP blocking. It is important to be respectful of a website's policies and to not overload their servers with too many requests. Additionally, some websites may have anti-scraping measures in their terms of service, so it is important to check before scraping.

5. How can I handle errors while web scraping in Python 3.4?

There are a few ways to handle errors while web scraping in Python 3.4 (combined in the sketch after this list):
- Use try-except blocks to catch and handle specific errors
- Use the status_code attribute of the response object to check for successful requests
- Use the sleep() function from the time module to add a delay between requests
- Use proxies or rotating user agents to avoid getting blocked by websites
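A sketch combining several of these points, using the requests library and the URLs from this thread (the timeout and delay values are just illustrative):

Code:
import time
import requests

urls = ["http://google.com", "http://nytimes.com", "http://www.rockpapershotgun.com/"]

for url in urls:
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            print(url, "->", len(response.text), "bytes")
        else:
            print(url, "returned status", response.status_code)
    except requests.RequestException as err:
        print(url, "failed:", err)
    time.sleep(1)  # be polite: pause between requests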
