Specifying end-of-string sentinels in Python prior to constructing a suffix array


I'm implementing the algorithms in http://portal.acm.org/citation.cfm?id=1813708, which use suffix arrays to find longest common substrings. The algorithms involve constructing a suffix array over the concatenation of a set of given strings, with string separators called sentinels. For example, given strings a, b, and c, a new string d is created as a$1b$2c$3, where $1, $2, and $3 are sentinel characters marking the end of each string. The sentinel characters must be unique and lexicographically less than all other characters in a, b, and c.
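For concreteness, here is a minimal sketch of that construction in Python, under the assumption that the inputs contain only printable ASCII (codepoints 32 and above), so the low control codepoints are free to act as sentinels; the function name and sample strings are just illustrative:

    def concat_with_sentinels(strings):
        """Join the input strings, terminating each one with a unique
        sentinel character that sorts below every printable ASCII character.

        Assumes the inputs contain only printable ASCII (codepoints >= 32),
        so chr(1), chr(2), ... are free to act as $1, $2, ...
        """
        parts = []
        for i, s in enumerate(strings, start=1):
            parts.append(s)
            parts.append(chr(i))  # sentinel $i: unique and below printable ASCII
        return "".join(parts)

    d = concat_with_sentinels(["abab", "baba", "abba"])
    # d == 'abab\x01baba\x02abba\x03'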

My question revolves around the representation of the sentinel characters in Python. If a, b, and c are ASCII strings, I'm thinking I might need to convert the strings to UTF-8 and shift the 0-127 range up to a higher range, so that there are characters available that are lexicographically less than anything in the strings. If that seems reasonable, what is an efficient mechanism for remapping the characters in Python to the range n..127+n, where n is the number of strings provided?
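One way to do the remapping described here is to work with lists of integer code points rather than str objects: shift every character up by n and hand out 0 .. n-1 as sentinels. This is only a sketch of that idea, and the helper name is made up:

    def build_shifted_input(strings):
        """Remap characters upward by n (the number of strings) so that the
        integers 0 .. n-1 become available as unique sentinels that compare
        less than every remapped character.

        Returns a list of integer symbols; most suffix array constructions
        can sort integer sequences just as well as strings.
        """
        n = len(strings)
        symbols = []
        for i, s in enumerate(strings):
            symbols.extend(ord(c) + n for c in s)  # shift the alphabet up by n
            symbols.append(i)                      # sentinel for string i
        return symbols

    d = build_shifted_input(["abab", "baba", "abba"])
    # ordinary characters now occupy codes >= 3; the sentinels are 0, 1, 2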

I think you should use a tokenizer and replace each string with integers. For the sentinels, there are plenty of integers left over. Probably, it's more convenient to use larger integers as sentinels rather than small ones. For printout, you can use whatever Unicode character you want, and you may use the same character for all of them.
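A sketch of that tokenization idea (the helper name is invented): ordinary tokens get ids counting up from n, so the ids 0 .. n-1 stay free for the sentinels and remain lexicographically smallest, as the question requires; the suggestion above of using larger sentinel ids and printable placeholders is an alternative reading.

    def tokenize_with_sentinels(strings):
        """Replace word tokens with integer ids and terminate each string
        with its own sentinel id.

        Token ids start at n (the number of strings), so ids 0 .. n-1 are
        left over for sentinels that compare below every ordinary token.
        """
        n = len(strings)
        vocab = {}        # token -> integer id
        sequence = []
        for i, s in enumerate(strings):
            for tok in s.split():
                sequence.append(vocab.setdefault(tok, len(vocab) + n))
            sequence.append(i)   # sentinel id for string i
        return sequence, vocab

    seq, vocab = tokenize_with_sentinels(["a rose is a rose", "is a rose a rose"])
    # for display, any placeholder character (even the same one for all of
    # them) can stand in for the sentinel ids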

Are you implementing Yamamoto & Church? If so, have a look at the newer literature before you start. I recommend Abouelhoda et al. on the extended suffix array, and Kim, Kim & Park on linearized suffix trees. And if you like combinatorics, look at: Schürmann, Klaus-Bernd, Suffix Arrays in Theory and Practice.

Also, I recommend 3-way radix quicksort, as opposed to a specialized suffix sorting algorithm. You only need a suffix sorting algorithm when there are redundancies in the corpus. These redundancies are unnecessary anyway, and they screw up the statistics.
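For reference, here is a compact sketch of 3-way radix (multikey) quicksort applied directly to suffixes. It is not code from the paper, just one way to do the suffix sort recommended above, and it assumes plain Python recursion is acceptable for the corpus size:

    import sys

    def suffix_array_mkqs(s):
        """Build a suffix array with multikey (3-way radix) quicksort.
        `s` may be a str or a sequence of integers (e.g. token ids)."""
        t = [ord(c) for c in s] if isinstance(s, str) else list(s)
        n = len(t)
        sa = list(range(n))
        sys.setrecursionlimit(max(1000, 4 * n + 100))

        def key(i, d):
            # d-th symbol of suffix i; -1 makes shorter suffixes sort first
            return t[i + d] if i + d < n else -1

        def sort(lo, hi, d):
            while hi - lo > 1:
                pivot = key(sa[(lo + hi) // 2], d)
                lt, i, gt = lo, lo, hi
                while i < gt:                      # 3-way partition on symbol d
                    c = key(sa[i], d)
                    if c < pivot:
                        sa[lt], sa[i] = sa[i], sa[lt]
                        lt += 1
                        i += 1
                    elif c > pivot:
                        gt -= 1
                        sa[gt], sa[i] = sa[i], sa[gt]
                    else:
                        i += 1
                sort(lo, lt, d)                    # suffixes with symbol < pivot
                if pivot != -1:
                    sort(lt, gt, d + 1)            # equal symbol: compare next position
                lo = gt                            # loop on suffixes with symbol > pivot

        sort(0, n, 0)
        return sa

    print(suffix_array_mkqs("banana"))  # [5, 3, 1, 0, 4, 2]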

And if you make something interesting, I'd be interested to see it.

Dale Gerdemann

