python - Creating a new corpus with NLTK
I reckoned the answer to my title would be to go and read the documentation, but I ran through the NLTK book and it doesn't give the answer. I'm kind of new to Python.
I have a bunch of .txt files, and I want to be able to use the corpus functions that NLTK provides for the corpora in nltk_data.
I've tried PlaintextCorpusReader but couldn't get further than:
>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = './'
>>> newcorpus = PlaintextCorpusReader(corpus_root, '.*')
>>> newcorpus.words()
How do I segment the sentences of newcorpus using punkt? I tried using the punkt functions, but they couldn't read the PlaintextCorpusReader class?
Can you also lead me to how I can write the segmented data into text files?
Edit: This question had a bounty once, and it now has a second bounty. See the text in the bounty box.
I think the PlaintextCorpusReader already segments its input with a punkt tokenizer, at least if the input language is English.
From the documentation of PlaintextCorpusReader's __init__:
__init__(self, root, fileids,
         word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, disc...),
         sent_tokenizer=nltk.data.LazyLoader('tokenizers/punkt/english.pickle'),
         para_block_reader=<function read_blankline_block at 0x1836d30>,
         encoding=None)
You can pass the reader a word and a sentence tokenizer, but the latter already defaults to nltk.data.LazyLoader('tokenizers/punkt/english.pickle').
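For example, a minimal sketch (the corpus_root path and the file pattern are just placeholders for your own setup) showing that the default reader already gives you sentence segmentation:

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = './'                      # placeholder; point at the folder with your .txt files
>>> newcorpus = PlaintextCorpusReader(corpus_root, '.*\.txt')
>>> newcorpus.fileids()                     # the files the reader picked up
>>> newcorpus.words()                       # tokenized words
>>> newcorpus.sents()                       # sentences, segmented by the default punkt tokenizer
>>> newcorpus.paras()                       # paragraphs, split on blank lines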
For a single string, the tokenizer would be used as follows (explained here; see section 5 for the punkt tokenizer).
>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... """
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(text.strip())
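As for writing the segmented data back out to text files, a rough sketch (the output filename segmented.txt is made up for illustration) that writes one sentence per line:

>>> sents = tokenizer.tokenize(text.strip())
>>> with open('segmented.txt', 'w') as out:    # placeholder filename
...     for sent in sents:
...         out.write(sent + '\n')

The same loop works over newcorpus.sents() if you first join each word list back into a string, e.g. ' '.join(sent).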