regex - What would be the best (runtime performance) application or pattern or code or library for matching string patterns -



I have been trying to figure out a decent way of matching string patterns. I will try my best to provide all the information I can about what I am trying to do.

The simplest way to put it: there are some specified patterns, and we want to know which of these patterns match a given request completely or partially. The specified patterns hardly change. The number of requests is about 10k per day, but the results have to be provided ASAP, so runtime performance is the highest priority.

I have been thinking of using assembly-compiled regular expressions in C# for this, but I am not sure if I am headed in the right direction.

Scenario:
Data file:
Let's assume that the data is provided as an XML request in a known schema format. It has anywhere between 5 and 20 rows of data. Each row has 10-30 columns. Each of the columns can only have data in a pre-defined pattern. For example:

  1. a1 - "3 digits" followed by "." followed by "2 digits" - [0-9]{3}.[0-9]{2}
  2. a2 - "1 character" followed by "4 digits" - [a-z][0-9]{4}

The sample would look like:

    <data>
      <r1>
        <a1>123.45</a1>
        <a2>a5567</a2>
        <a4>456ev</a4>
        <an>xxx</an>
      </r1>
    </data>
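As a rough sketch of how such a request could be split into per-row column values before any matching happens: Python is used here only for brevity, the function name `rows_from_request` is made up for illustration, and the real schema is assumed to follow the sample above (one child element per row, one grandchild per column).

```python
import xml.etree.ElementTree as ET

# The sample request from above.
SAMPLE = """<data>
  <r1>
    <a1>123.45</a1>
    <a2>a5567</a2>
    <a4>456ev</a4>
    <an>xxx</an>
  </r1>
</data>"""

def rows_from_request(xml_text):
    """Return one {column name: value} dict per row element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [{cell.tag: (cell.text or "") for cell in row} for row in root]

rows = rows_from_request(SAMPLE)
print(rows)
```

Keeping the column name next to each value is what later lets a1 values be checked only against a1 patterns.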

Rule file:

    rule id    a1                 a2
    1001       [0-9]{3}.45        a55[0-8]{2}
    2002       12[0-9].55         [x-z][0-9]{4}
    3055       [0-9]{3}.45        [x-z][0-9]{4}

Rule location - I am planning to store the rule IDs in some sort of bit mask.
So each rule ID is assigned a location in a string:

    rule id     location (from left to right)
    1001        1
    2002        2
    3055        3

Pattern file: (this is not the final structure, just a thought)

    column   pattern                rule location
    a1       [0-9]{3}.45            101
    a1       12[0-9].55             010
    a2       a55[0-8]{2}            100
    a2       [x-z][0-9]{4}          011
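To make the derivation of the "rule location" strings concrete, here is a minimal sketch in Python that builds the pattern table from the rule file. The names `RULES` and `build_pattern_table` are hypothetical, and the rule order in the dictionary is assumed to define the location from left to right.

```python
from collections import defaultdict

# Rule file from above: rule id -> {column: pattern}.
RULES = {
    1001: {"a1": "[0-9]{3}.45", "a2": "a55[0-8]{2}"},
    2002: {"a1": "12[0-9].55", "a2": "[x-z][0-9]{4}"},
    3055: {"a1": "[0-9]{3}.45", "a2": "[x-z][0-9]{4}"},
}

def build_pattern_table(rules):
    """Map each (column, pattern) pair to a bit string marking the rules that use it."""
    rule_ids = list(rules)  # dict order gives the location, left to right
    table = defaultdict(lambda: ["0"] * len(rule_ids))
    for location, rule_id in enumerate(rule_ids):
        for column, pattern in rules[rule_id].items():
            table[(column, pattern)][location] = "1"
    return {key: "".join(bits) for key, bits in table.items()}

pattern_table = build_pattern_table(RULES)
for (column, pattern), mask in pattern_table.items():
    print(column, pattern, mask)
```

Because identical patterns in the same column are merged, each distinct pattern only has to be evaluated once per value, however many rules share it.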

Now let's assume that somehow (I am not sure how I am going to limit the search to save time) I run the regexes and make sure that the a1 column is only matched against a1 patterns and the a2 column against a2 patterns. I would end up with the following results for "rule location":

    column   pattern                rule location
    a1       [0-9]{3}.45            101
    a2       a55[0-8]{2}            100

  • Doing an AND on each of the locations gives me location 1 - 1001 - complete match.
  • Doing an XOR on each of the locations gives me location 3 - 3055 - partial match. (I am purposely not doing an OR, because that would have returned both 1001 and 3055, which would be the wrong result for a partial match.)

The final results I am looking for are:
1001 - complete match
3055 - partial match

Start EDIT_1: explanation of the matching results

  • Complete match - occurs when all of the patterns in a given rule are matched.
  • Partial match - occurs when not all of the patterns in a given rule are matched, but at least 1 pattern matches.

    Example of a complete match (AND):
    Rule ID 1001 matched for a1 (101) and a2 (100). If you look at the first character in 101 and 100, it is "1" in both. When you AND them - 1 AND 1 - the result is 1. Hence position 1, i.e. 1001, is a complete match.

    Example of a partial match (XOR):
    Rule ID 3055 matched for a1 (101) only. If you look at the last character in 101 and 100, they are "1" and "0". When you XOR them - 1 XOR 0 - the result is 1. Hence position 3, i.e. 3055, is a partial match.

End EDIT_1
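The AND/XOR step can be sketched with plain integers as bit masks. This is only an illustration in Python, with hypothetical names (`RULE_IDS`, `rules_at`). It assumes every rule has a pattern in every column being combined. It also computes the partial set as "OR minus AND" rather than XOR: the two agree for the two-column example here, but XOR stops matching the definition of a partial match once three or more columns are combined.

```python
from functools import reduce
from operator import and_, or_

RULE_IDS = [1001, 2002, 3055]  # locations 1, 2, 3 from left to right
WIDTH = len(RULE_IDS)

# Masks of the patterns that matched, grouped by column (the example above).
matched = {"a1": ["101"], "a2": ["100"]}

def rules_at(mask):
    """Translate an integer mask back into the rule IDs whose bit is set."""
    bits = format(mask, "0{}b".format(WIDTH))
    return [rule_id for rule_id, bit in zip(RULE_IDS, bits) if bit == "1"]

# Within one column any matching pattern counts, so OR that column's masks.
per_column = [reduce(or_, (int(m, 2) for m in masks), 0) for masks in matched.values()]

complete = reduce(and_, per_column)   # bit set in every column
any_match = reduce(or_, per_column)   # bit set in at least one column
partial = any_match & ~complete       # matched somewhere, but not everywhere

print(rules_at(complete))
print(rules_at(partial))
```

Python integers are arbitrary precision, so the same operations apply unchanged to a mask with 100k bits. In C#, `System.Numerics.BigInteger` or `System.Collections.BitArray` would be candidates for the same role; this sketch says nothing about which is faster.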

Input:
The data will be provided in some sort of XML request. It can be 1 big request with 100k data nodes, or 100k requests with 1 data node only.
Rules:
The matching values have to be initially saved as some sort of pattern to make them easier to write and edit. Let's assume that there are approximately 100k rules.

Output:
I need to know which rules matched completely and which matched partially.

Preferences:
I would prefer doing as much of the coding as I can in C#. However, if there is a major performance boost, I can use a different language.

The "input" and "output" are my requirements; how I manage to get the "output" does not matter. It has to be fast: let's say each data node has to be processed in approximately 1 second.

questions:

  1. Are there any existing patterns or frameworks to do this?
  2. Is using regex the right path, specifically assembly-compiled regex?
  3. If I end up using regex, how can I specify that a1 patterns are matched only against the a1 column?
  4. If I do specify rule locations in a bit-type pattern, how do I process the ANDs and XORs when it grows to be 100k characters long?

I am looking for any suggestions or options that I should consider.

thanks..

The regular expression API only tells you when a pattern matched, not when it partially matched. Therefore you need some variation on a regular expression API that lets you try to match multiple regular expressions at once, and at the end can tell you which matched fully and which partially matched. Ideally it would be one that lets you precompile a set of patterns, so that you can avoid compilation at runtime.

If you had that, you could match the a1 patterns against the a1 column, the a2 patterns against the a2 column, and so on. Then work from the resulting list of partially and fully matched regular expressions.
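Short of such an API, the per-column routing can at least be approximated with ordinary precompiled regexes. The following is a minimal sketch in Python, not a tuned implementation: `PATTERNS` and `column_masks` are hypothetical names, the masks are the "rule location" values from the question written as binary literals, and each value is assumed to have to match a pattern in full.

```python
import re

# Patterns are compiled once, up front, and grouped by the column they apply to.
PATTERNS = {
    "a1": [(re.compile(r"[0-9]{3}.45"), 0b101), (re.compile(r"12[0-9].55"), 0b010)],
    "a2": [(re.compile(r"a55[0-8]{2}"), 0b100), (re.compile(r"[x-z][0-9]{4}"), 0b011)],
}

def column_masks(row):
    """For each column that has patterns, OR together the masks of the patterns it matches."""
    masks = {}
    for column, value in row.items():
        candidates = PATTERNS.get(column)
        if candidates is None:
            continue  # no patterns defined for this column, e.g. a4
        mask = 0
        for regex, rule_mask in candidates:
            if regex.fullmatch(value):
                mask |= rule_mask
        masks[column] = mask
    return masks

row = {"a1": "123.45", "a2": "a5567", "a4": "456ev"}
masks = column_masks(row)
print({column: format(mask, "03b") for column, mask in masks.items()})
```

The dictionary lookup is what keeps a1 values away from a2 patterns. The cost per value still grows with the number of patterns in that column, which is the part a multi-pattern matcher would remove.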

The bad news is that I don't know of any software out there that implements this.

The good news is that with the strategy described in http://swtch.com/~rsc/regexp/regexp1.html you should be able to implement this. In particular, the state sets can be extended to hold information about your current state in multiple patterns at the same time. This extended set of state sets will result in a more complex state diagram (because you're tracking more stuff) and a more complex return at the end (you're returning a set of state sets), but the runtime won't be changed a bit, whether you're matching 1 pattern or 50.
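To give a feel for the idea, here is a deliberately restricted sketch in Python that steps several patterns in lockstep over a single pass of the input, keeping one shared state set. It is not the full construction from the article. It only understands literals, `.`, bracket classes with ranges, and `{n}` counts (the subset used in the question), so each pattern is a linear chain and its state is just a position. All names (`compile_linear`, `atom_to_set`, `match_all`) are made up for illustration, and "partial" here means "the input ended while the pattern was still alive but unfinished".

```python
import re

# One atom (a bracket class or a single character) with an optional {n} count.
TOKEN = re.compile(r"(\[[^\]]+\]|.)(?:\{(\d+)\})?")

def compile_linear(pattern):
    """Expand a pattern into one atom per expected input character."""
    steps = []
    for atom, count in TOKEN.findall(pattern):
        steps.extend([atom] * int(count or 1))
    return steps

def atom_to_set(atom):
    """Turn an atom into the set of characters it accepts (None means any character)."""
    if atom == ".":
        return None
    if not atom.startswith("["):
        return {atom}
    body, chars, i = atom[1:-1], set(), 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            chars.update(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.add(body[i])
            i += 1
    return chars

def match_all(patterns, text):
    """Advance all patterns together; return (fully matched, alive but unfinished)."""
    compiled = {name: [atom_to_set(a) for a in compile_linear(p)]
                for name, p in patterns.items()}
    states = {(name, 0) for name in compiled}  # shared state set across all patterns
    for ch in text:
        next_states = set()
        for name, pos in states:
            steps = compiled[name]
            if pos < len(steps) and (steps[pos] is None or ch in steps[pos]):
                next_states.add((name, pos + 1))
        states = next_states
    full = sorted(n for n, pos in states if pos == len(compiled[n]))
    partial = sorted(n for n, pos in states if pos < len(compiled[n]))
    return full, partial

A1_PATTERNS = {"p1": "[0-9]{3}.45", "p2": "12[0-9].55"}
print(match_all(A1_PATTERNS, "123.45"))
print(match_all(A1_PATTERNS, "123."))
```

In this sketch the input is read once regardless of how many patterns are loaded, but each character still loops over every live state. Patterns with alternation or repetition would need real NFA states and the cached state sets described in the article.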

