python - Trouble parsing HTML using BeautifulSoup -

- April 15, 2012

I'm trying to use BeautifulSoup to parse some HTML in Python. Specifically, I'm trying to create two arrays of soup objects: one for the dates of postings on a website, and one for the postings themselves. However, when I use findAll on the div class that matches the postings, only the initial tag is returned, not the text inside the tag. On the other hand, my code works fine for the dates. What is going on?

    # store texts of posts
    texts = soup.findAll("div", {"class": "quote"})

    # store dates of posts
    dates = soup.findAll("div", {"class": "datetab"})

The first line above returns only

<div class="quote">

which is not what I want. The second line returns

<div class="datetab">feb<span>2</span></div>

which is what I want (pre-refining).

I have no idea what I'm doing wrong. Here is the website I'm trying to parse. This is for homework, and I'm really desperate.

Which version of BeautifulSoup are you using? Version 3.1.0 performs worse on real-world HTML (read: invalid HTML) than 3.0.8. This code works with 3.0.8:

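    import urllib2
    from BeautifulSoup import BeautifulSoup

    # fetch the page and parse it (Python 2, BeautifulSoup 3.0.8)
    page = urllib2.urlopen("http://harvardfml.com/")
    soup = BeautifulSoup(page)

    # the post text sits in span elements with class "quote"
    for incident in soup.findAll('span', {"class": "quote"}):
        print incident.contents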
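If you want to check what text actually sits inside the matching elements without depending on a particular BeautifulSoup release, here is a rough sketch that swaps in Python 3's standard-library html.parser instead of BeautifulSoup. The ClassTextCollector class, the collect_text helper and the SAMPLE markup are made up for illustration; only the datetab line is taken from the question, and the real page may be structured differently.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text inside every element with a given tag and class."""

    # Void elements never get a closing tag, so they must not change depth.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # > 0 while inside a matching element
        self.parts = []     # text pieces of the current match
        self.results = []   # finished text, one entry per match

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1             # nested tag inside a match
        elif tag == self.tag and self.css_class in classes:
            self.depth = 1              # a new match starts here
            self.parts = []

    def handle_endtag(self, tag):
        if tag in self.VOID or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:             # the matching element just closed
            self.results.append("".join(self.parts).strip())

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def collect_text(html, tag, css_class):
    """Hypothetical helper: text of each <tag class="css_class"> element."""
    parser = ClassTextCollector(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up sample markup; the post structure is an assumption.
SAMPLE = """
<div class="datetab">feb<span>2</span></div>
<div class="quote"><span class="quote">First post text</span></div>
<div class="datetab">feb<span>3</span></div>
<div class="quote"><span class="quote">Second post text</span></div>
"""

dates = collect_text(SAMPLE, "div", "datetab")
texts = collect_text(SAMPLE, "span", "quote")
print(dates)
print(texts)
```

Pairing the two lists with zip(dates, texts) would give one (date, post) tuple per posting, assuming the page always emits them in matching order.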
