In a recent unannounced project I needed to extract keywords (specifically, named entities) from news articles. I found this code online and modified it a bit for my purposes.
```python
import nltk

def get_named_entities(text):
    # Tokenize, POS-tag, then run NLTK's named-entity chunker.
    tokenized = nltk.word_tokenize(text)
    pos_tagged = nltk.pos_tag(tokenized)
    chunked = nltk.chunk.ne_chunk(pos_tagged, binary=False)

    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if isinstance(i, nltk.tree.Tree):
            # Inside an entity subtree: collect its words.
            current_chunk.append(" ".join(token for token, pos in i.leaves()))
        elif current_chunk:
            # Left the entity: join adjacent chunks into one entity string.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []

    # Flush a trailing entity in case the text ends on an entity chunk,
    # otherwise the last entity is silently dropped.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk
```
On a side note, don’t name your .py file after the module you’re importing (e.g. nltk.py). Python will import your file instead of the real module, and you’ll blow away a couple of hours wondering why nothing you want is defined in it.
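The shadowing is easy to reproduce and diagnose. A sketch using a stdlib module as the decoy (so it runs without NLTK installed): a local `json.py` silently wins over the real `json`, and checking `__file__` reveals which one you actually got.

```python
import os
import subprocess
import sys
import tempfile

# Create a directory containing a decoy json.py, then import json from there.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "json.py"), "w") as f:
        f.write("# empty decoy module\n")
    out = subprocess.run(
        [sys.executable, "-c", "import json; print(json.__file__)"],
        cwd=tmp, capture_output=True, text=True,
    )
    # __file__ points at the decoy, not the standard library.
    print(out.stdout.strip())
```

Printing `nltk.__file__` right after the import is the quickest way to confirm you're hitting this.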