python - Creating a new corpus with NLTK


I reckoned that often the answer to my title is to go and read the documentation, but I ran through the NLTK book and it doesn't give the answer. I'm kind of new to Python.

I have a bunch of .txt files and I want to be able to use the corpus functions that NLTK provides for the corpus nltk_data.

I've tried PlaintextCorpusReader but I couldn't get further than:

>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = './'
>>> newcorpus = PlaintextCorpusReader(corpus_root, '.*')
>>> newcorpus.words()

How do I segment the newcorpus sentences using punkt? I tried using the punkt functions, but the punkt functions couldn't read the PlaintextCorpusReader class?

Can you also lead me to how I can write the segmented data into text files?

Edit: This question has had one bounty already, and it now has a second bounty. See the text in the bounty box.

I think the PlaintextCorpusReader already segments the input with a punkt tokenizer, at least if your input language is English.

Documentation of PlaintextCorpusReader's __init__:

__init__(
    self,
    root,
    fileids,
    word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, disc...,
    sent_tokenizer=nltk.data.LazyLoader('tokenizers/punkt/english.pickle'),
    para_block_reader=<function read_blankline_block at 0x1836d30>,
    encoding=None
)
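In other words, the reader built in the question should already expose punkt-segmented sentences through sents(). A minimal sketch of what that looks like (reusing corpus_root = './' and the '.*' pattern from the question):

>>> from nltk.corpus import PlaintextCorpusReader
>>> newcorpus = PlaintextCorpusReader('./', '.*')   # same reader as in the question
>>> newcorpus.sents()      # all sentences, segmented by the default punkt tokenizer
>>> newcorpus.sents()[0]   # the first sentence, as a list of word tokens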

You can pass the reader a word and sentence tokenizer, but for the latter the default already is nltk.data.LazyLoader('tokenizers/punkt/english.pickle').
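If you do want to pass them explicitly anyway, something like this should be equivalent to the defaults (a sketch; the tokenizer arguments just mirror the signature above):

>>> import nltk.data
>>> from nltk.tokenize import WordPunctTokenizer
>>> from nltk.corpus import PlaintextCorpusReader
>>> newcorpus = PlaintextCorpusReader(
...     './', '.*',
...     word_tokenizer=WordPunctTokenizer(),
...     sent_tokenizer=nltk.data.LazyLoader('tokenizers/punkt/english.pickle'))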

For a single string, a tokenizer would be used as follows (explained here, see section 5 for the punkt tokenizer).

>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... """
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(text.strip())
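As for writing the segmented data back into text files: since sents() yields each sentence as a list of tokens, one simple approach is to write one space-joined sentence per line. A sketch, assuming the newcorpus reader from above ('segmented.txt' is a made-up output name); note that joining on spaces won't restore the original spacing around punctuation:

>>> with open('segmented.txt', 'w') as out:
...     for sent in newcorpus.sents():   # each sent is a list of word tokens
...         out.write(' '.join(sent) + '\n')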
