python - How to wrap this into a recursive call -
i'm writing little spider in python. hits page, gets links on page according pattern, goes each of pages, , repeats.
i need make recursive somehow. below url patterns:
www.example.com
then links based on regex, visit each page.
recursive part:
say visiting page url like:
www.example.com/category/1
now if page contains links like:
www.example.com/category/1_234
(basically same url, except has additional "_234234")
i visit page, check url like:
www.example.com/category/1_234_4232
(again, same url plus underscore , number)
i keep doing until there no more links fitting pattern.
1. visit category page 2. contain links same url + "_dddd" if yes, visit page 3. #2 unless no links
i don't need regex, need structuring recursive call.
just-recursion, in manner asking for, might not best approach.
for 1 thing, if there more 1000 pages indexed, might 'bottom out' call stack , crash. another, if not careful releasing variables during recursion, might end consuming quite bit of memory.
i recommend more like:
visit_next = set(['www.example.com']) visited = set() while len(visit_next): next_link = visit_next.pop() if next_link not in visited: visited.add(next_link) link in find_all_links_from(next_link): if link not in visited: visit_next.add(link)
edit:
as suggested, have rewritten using sets; should use less memory (on long traversal, visit_next list have collected many duplicate links).
Comments
Post a Comment