python - Trouble parsing HTML using BeautifulSoup
I'm trying to use BeautifulSoup to parse HTML in Python. Specifically, I'm trying to create two arrays of soup objects: one for the dates of the postings on a website, and one for the postings themselves. However, when I use findAll on the div class that matches the postings, only the initial tag is returned, not the text inside the tag. On the other hand, my code works fine for the dates. What is going on??
# Store the texts of the posts
texts = soup.findAll("div", {"class": "quote"})
# Store the dates of the posts
dates = soup.findAll("div", {"class": "datetab"})
The first line above returns only
<div class="quote">
which is not what I want. The second line returns
<div class="datetab">feb<span>2</span></div>
which is what I want (pre-refining).
I have no idea what I'm doing wrong. Here is the website I'm trying to parse. This is for homework, and I'm really desperate.
Which version of BeautifulSoup are you using? Version 3.1.0 performs worse on real-world HTML (read: invalid HTML) than 3.0.8. This code works with 3.0.8:
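import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch the page and parse it (Python 2, BeautifulSoup 3.0.8)
page = urllib2.urlopen("http://harvardfml.com/")
soup = BeautifulSoup(page)

# Print the contents of every element with class "quote"
for incident in soup.findAll('span', { "class" : "quote" }):
    print incident.contents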
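If you are on Python 3, the same idea with the newer bs4 package (BeautifulSoup 4) might look roughly like the sketch below. It assumes bs4 is installed and uses the built-in "html.parser" backend. SAMPLE_HTML and extract_posts are made up for illustration: the markup only mimics the class names from the question and is not the real page, so the tag names may need adjusting for the actual site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names from the question;
# it is NOT the real page.
SAMPLE_HTML = """
<div class="datetab">Feb<span>2</span></div>
<div class="quote">First example post.</div>
<div class="datetab">Feb<span>3</span></div>
<div class="quote">Second example post.</div>
"""


def extract_posts(html):
    """Return (dates, texts) as two lists of plain strings."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text() collects the text of a tag and all of its children,
    # so the nested <span> inside "datetab" is included.
    dates = [tag.get_text(" ", strip=True)
             for tag in soup.find_all("div", class_="datetab")]
    texts = [tag.get_text(strip=True)
             for tag in soup.find_all("div", class_="quote")]
    return dates, texts


dates, texts = extract_posts(SAMPLE_HTML)
print(dates)
print(texts)
```

If a tag still comes back empty on the real page, the markup is probably malformed, and trying a more lenient parser backend may help.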