Extract keywords from news articles with Python and NLTK

In a recent (as yet unannounced) project I needed to extract keywords, or more precisely named entities, from news articles. I found the code below online and adapted it slightly for my purposes.

import nltk

def get_named_entities(text):
    # Tokenize, POS-tag, then chunk into a tree whose subtrees are named entities
    tokenized = nltk.word_tokenize(text)
    pos_tagged = nltk.pos_tag(tokenized)
    chunked = nltk.chunk.ne_chunk(pos_tagged, binary=False)
    continuous_chunk = []
    current_chunk = []
    for i in chunked:
        if isinstance(i, nltk.tree.Tree):
            # Part of a named entity: collect its tokens
            current_chunk.append(" ".join(token for token, pos in i.leaves()))
        elif current_chunk:
            # Entity ended: merge adjacent entity chunks into one string
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []  # reset even when the entity was a duplicate
    # Don't drop an entity that runs right up to the end of the text
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
    return continuous_chunk

On a side note, don't give your .py files the same name as a module you're importing (e.g. nltk.py) — Python will import your file instead of the library, and you'll lose a couple of hours wondering why nothing in the module you want is defined.