regex - Attribute Error for strings created from lists
I'm trying to create a data-scraping file for a class, and the data I have to scrape requires that I use while loops to get the right data into separate arrays, i.e. for states, SAT averages, etc.
However, once I set up the while loops, the regex that cleared the majority of the HTML tags from the data broke, and I am getting an error that reads:
AttributeError: 'NoneType' object has no attribute 'groups'
My code is:
    import re, util
    from BeautifulSoup import BeautifulStoneSoup

    # create a comma-delineated file
    delim = ", "

    # base url for the SAT data
    base = "http://www.usatoday.com/news/education/2007-08-28-sat-table_n.htm"

    # get a webpage object for the site
    soup = util.mysoupopen(base)

    # get the column headings
    colcols = soup.findAll("td", {"class":"vatextbold"})

    # get the data
    datacols = soup.findAll("td", {"class":"vatext"})

    # append the data to cols
    for i in range(len(datacols)):
        colcols.append(datacols[i])

    # open a csv file to write the data to
    fob=open("sat.csv", 'a')

    # initiate the 5 arrays
    states = []
    participate = []
    math = []
    read = []
    write = []

    # split into 5 lists for each row
    for i in range(len(colcols)):
        if i%5 == 0:
            states.append(colcols[i])
    i=1
    while i<=250:
        participate.append(colcols[i])
        i = i+5
    i=2
    while i<=250:
        math.append(colcols[i])
        i = i+5
    i=3
    while i<=250:
        read.append(colcols[i])
        i = i+5
    i=4
    while i<=250:
        write.append(colcols[i])
        i = i+5

    # write the data to the file
    for i in range(len(states)):
        states = str(states[i])
        participate = str(participate[i])
        math = str(math[i])
        read = str(read[i])
        write = str(write[i])
        # regex to remove html from the data scraped
        # remove <td> tags
        line = re.search(">(.*)<", states).groups()[0] + delim + re.search(">(.*)<", participate).groups()[0]+ delim + re.search(">(.*)<", math).groups()[0] + delim + re.search(">(.*)<", read).groups()[0] + delim + re.search(">(.*)<", write).groups()[0]
        # append the data point to the file
        fob.write(line)
Any ideas regarding why this error appeared? The regex was working fine until I tried to split the data into different lists. I have tried printing the various strings inside the final "for" loop to see if any of them were "None" for the first i value (0), but they are all the strings they are supposed to be.
Any help is appreciated!
It looks like the regex search is failing on (one of) the strings, so it returns None instead of a MatchObject.
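As a minimal illustration of that failure mode, here is a sketch with made-up input strings (not the scraped data). `re.search` returns `None` when the pattern does not match, and calling `.groups()` on `None` raises the error from the question. The names `ok`, `missing` and `error_message` are only for this example.

```python
import re

pattern = ">(.*)<"

# A string shaped like a table cell matches, so we get a match object back.
ok = re.search(pattern, "<td>Alabama</td>")
ok_text = ok.groups()[0]

# A string with no ">...<" in it does not match, so re.search returns None.
missing = re.search(pattern, "t")

# Calling a match-object method on None reproduces the reported error.
try:
    missing.groups()
except AttributeError as exc:
    error_message = str(exc)
```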
Try the following instead of the long #remove <td> tags line:
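    # needs "import sys" at the top of the script for sys.exit()
    out_list = []
    for item in (states, participate, math, read, write):
        try:
            out_list.append(re.search(">(.*)<", item).groups()[0])
        except AttributeError:
            # re.search returned None, so show the string that did not match
            print "regex match failed on", item
            sys.exit()
    line = delim.join(out_list)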
That way, you can find out where your regex is failing.
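One likely culprit, judging only from the code as posted: inside the final loop, `states = str(states[i])` rebinds the name `states` from the list to a single string, so on the next pass `states[i]` is one character of that string rather than the next cell. That would also explain why everything looks right for the first i value (0). The sketch below reproduces this with stand-in cell strings and shows a variant that uses a separate name for the per-row value; `cells`, `seen`, `state` and `lines` are hypothetical names for illustration.

```python
import re

# Stand-ins for the scraped <td> cells (assumed shape, not the real data).
cells = ["<td>Alabama</td>", "<td>Alaska</td>"]

# Pattern from the question: the loop overwrites the list with a string.
states = list(cells)
seen = []
for i in range(len(states)):
    states = str(states[i])  # after the first pass, states is a str
    seen.append(states)      # second pass picks out a single character

# Variant: keep the list intact and use a different name for each row's value.
states = list(cells)
lines = []
for i in range(len(states)):
    state = str(states[i])
    lines.append(re.search(">(.*)<", state).group(1))
```

If this is what is happening, the same renaming would apply to `participate`, `math`, `read` and `write`.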
Also, I suggest you use .group(1) instead of .groups()[0]. The former is more explicit.
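A short sketch of the two spellings side by side, using a made-up cell string. Both return the first captured group; `.group(1)` asks for it by number, while `.groups()[0]` builds the tuple of all groups and then indexes it.

```python
import re

# Made-up cell value for illustration.
match = re.search(">(.*)<", "<td>517</td>")

first_by_group = match.group(1)      # first capture group, by number
first_by_groups = match.groups()[0]  # tuple of all groups, then index 0
```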