python - How to wrap this into a recursive call -


i'm writing little spider in python. hits page, gets links on page according pattern, goes each of pages, , repeats.

i need make recursive somehow. below url patterns:

www.example.com 

then links based on regex, visit each page.

recursive part:

say visiting page url like:

www.example.com/category/1 

now if page contains links like:

www.example.com/category/1_234 

(basically same url, except has additional "_234234")

i visit page, check url like:

www.example.com/category/1_234_4232 

(again, same url plus underscore , number)

i keep doing until there no more links fitting pattern.

1. visit category page 2. contain links same url + "_dddd" if yes, visit page 3. #2 unless no links 

i don't need regex, need structuring recursive call.

just-recursion, in manner asking for, might not best approach.

for 1 thing, if there more 1000 pages indexed, might 'bottom out' call stack , crash. another, if not careful releasing variables during recursion, might end consuming quite bit of memory.

i recommend more like:

visit_next = set(['www.example.com']) visited = set()  while len(visit_next):     next_link = visit_next.pop()     if next_link not in visited:         visited.add(next_link)         link in find_all_links_from(next_link):             if link not in visited:                 visit_next.add(link) 

edit:

as suggested, have rewritten using sets; should use less memory (on long traversal, visit_next list have collected many duplicate links).


Comments

Popular posts from this blog

python - Scipy curvefit RuntimeError:Optimal parameters not found: Number of calls to function has reached maxfev = 1000 -

binding - How can you make the color of elements of a WPF DrawingImage dynamic? -

c# - How to add a new treeview at the selected node? -