What Is NLTK

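NLTK stands for Natural Language Toolkit, a library widely used in natural language processing. By default NLTK works on English text, and its support for other languages is more limited, so it is mostly used for processing English.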

How to Install NLTK

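Installing NLTK is easy. If you use a modern IDE such as IntelliJ IDEA or PyCharm, create a .py file and enter the code below. If NLTK is not installed yet, the IDE flags the import as an error; hover over it and the IDE will offer to install the package. Running the code then starts NLTK's data downloader.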

import nltk
nltk.download()  # with no arguments, opens the interactive data downloader

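That said, because of the GFW (Great Firewall), the download often fails. In that case you can download the data packages yourself and import them manually: once they are downloaded, extract them to the root of the C: drive or of another drive.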
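The manual route can be sketched as follows. This is only a minimal sketch under stated assumptions: `unpack_nltk_package` is a hypothetical helper (not part of NLTK), the zip name and target directory in the example call are placeholders, and the package is assumed to have been downloaded as a zip archive. It relies on NLTK looking up resources in category subfolders such as tokenizers, taggers, and corpora inside an nltk_data directory.

```python
import zipfile
from pathlib import Path


def unpack_nltk_package(zip_path, data_root, category):
    """Extract a manually downloaded package zip into <data_root>/<category>/.

    Hypothetical helper, not part of NLTK. `category` is the subfolder the
    resource belongs to, e.g. "tokenizers", "taggers" or "corpora".
    """
    target = Path(data_root) / category
    target.mkdir(parents=True, exist_ok=True)  # create nltk_data/<category> if missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


# Example call (both paths are placeholders):
# unpack_nltk_package("punkt.zip", "C:/nltk_data", "tokenizers")
```

If the chosen directory is not one that NLTK searches by default, it can be registered at runtime by appending it to the nltk.data.path list before any resource is loaded.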

Of course, you can also install NLTK with pip by running the following command in a terminal:

pip install nltk

Then follow the steps above to download the data.

Using NLTK

Feature Table

[Figure: nltk-1.png]

Tokenization

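In most cases a piece of raw text cannot be analyzed directly, so it first has to be broken down into units such as characters, words, sentences, or paragraphs. This process is called tokenization. NLTK provides two functions for it: word_tokenize(), which splits text into words, and sent_tokenize(), which splits text into sentences. Both return a list of strings (list[str]). Note that word_tokenize() returns punctuation marks as separate tokens.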

import nltk

article = "I believe there is a person who brings sunshine into your life. That person may have enough to spread " \
          "around. But if you really have to wait for someone to bring you the sun and give you a good feeling, " \
          "then you may have to wait a long time."
# Split into words
words = nltk.word_tokenize(article)
''' ['I', 'believe', 'there', 'is', 'a', 'person', 'who', 'brings', 'sunshine', 'into', 'your', 'life', '.', 'That',
'person', 'may', 'have', 'enough', 'to', 'spread', 'around', '.', 'But', 'if', 'you', 'really', 'have', 'to',
'wait', 'for', 'someone', 'to', 'bring', 'you', 'the', 'sun', 'and', 'give', 'you', 'a', 'good', 'feeling', ',',
'then', 'you', 'may', 'have', 'to', 'wait', 'a', 'long', 'time', '.']'''
# Split into sentences
sentences = nltk.sent_tokenize(article)
'''['I believe there is a person who brings sunshine into your life.', 'That person may have enough to spread 
around.', 'But if you really have to wait for someone to bring you the sun and give you a good feeling, then you may 
have to wait a long time.']'''
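Once the text is tokenized, simple analyses become straightforward. As a small illustration, the sketch below counts word frequencies with the standard library's collections.Counter. The token list is copied from the word_tokenize() output above so that the snippet runs without NLTK; the filtering rule (alphabetic tokens only, case-insensitive) is my own choice, not something NLTK prescribes.

```python
from collections import Counter

# Tokens copied from the word_tokenize() output above.
words = [
    'I', 'believe', 'there', 'is', 'a', 'person', 'who', 'brings', 'sunshine',
    'into', 'your', 'life', '.', 'That', 'person', 'may', 'have', 'enough',
    'to', 'spread', 'around', '.', 'But', 'if', 'you', 'really', 'have', 'to',
    'wait', 'for', 'someone', 'to', 'bring', 'you', 'the', 'sun', 'and',
    'give', 'you', 'a', 'good', 'feeling', ',', 'then', 'you', 'may', 'have',
    'to', 'wait', 'a', 'long', 'time', '.',
]

# Keep alphabetic tokens only (drops '.' and ',') and ignore case.
counts = Counter(w.lower() for w in words if w.isalpha())

print(counts.most_common(2))  # [('to', 4), ('you', 4)]
```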

Tag (Part-of-Speech Tagging)

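Part-of-speech tagging is another very useful tool. The function is pos_tag(), which takes a list of tokens as its argument. Note that the list should preferably contain words in context, such as a tokenized sentence, rather than an arbitrary jumble of words; otherwise the tags may come out wrong. The return value is a list of tuples, list[tuple[str, str]], where each tuple holds a word and its tag. The tags are listed below.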

Tag    Description                                 Example
CC     coordinating conjunction
CD     cardinal number
DT     determiner                                  the, some
EX     existential "there"                         "there is"
FW     foreign word
IN     preposition / subordinating conjunction
JJ     adjective                                   big
JJR    adjective, comparative                      bigger
JJS    adjective, superlative                      biggest
LS     list item marker                            1)
MD     modal verb                                  could, will
NN     noun, singular                              desk
NNS    noun, plural                                desks
NNP    proper noun, singular                       Harrison
NNPS   proper noun, plural                         Americans
PDT    predeterminer                               "all" in "all the kids"
POS    possessive ending                           's in "parent's"
PRP    personal pronoun                            I, he, she
PRP$   possessive pronoun                          my, his, her
RB     adverb                                      very, silently
RBR    adverb, comparative                         better
RBS    adverb, superlative                         best
RP     particle                                    "up" in "give up"
TO     "to"                                        go "to" the store
UH     interjection                                errrrrrrrm
VB     verb, base form                             take
VBD    verb, past tense                            took
VBG    verb, gerund / present participle           taking
VBN    verb, past participle                       taken
VBP    verb, non-3rd person singular present       take
VBZ    verb, 3rd person singular present           takes
WDT    wh-determiner                               which
WP     wh-pronoun                                  who, what
WP$    possessive wh-pronoun                       whose
WRB    wh-adverb                                   where, when
Source: https://wenku.baidu.com/view/c63bec3b366baf1ffc4ffe4733687e21af45ffab.html

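If this tag set is more detailed than you need and you only want to tell apart nouns, adjectives, adverbs, and so on, pass tagset='universal' to pos_tag(). The result then uses a coarser, more general set of tags. Both forms are shown below.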

tag = nltk.pos_tag(words)
'''[('I', 'PRP'), ('believe', 'VBP'), ('there', 'EX'), ('is', 'VBZ'), ('a', 'DT'), ('person', 'NN'), ('who', 'WP'),
('brings', 'VBZ'), ('sunshine', 'NN'), ('into', 'IN'), ('your', 'PRP$'), ('life', 'NN'), ('.', '.'), ('That', 'IN'),
('person', 'NN'), ('may', 'MD'), ('have', 'VB'), ('enough', 'NN'), ('to', 'TO'), ('spread', 'VB'), ('around', 'IN'),
('.', '.'), ('But', 'CC'), ('if', 'IN'), ('you', 'PRP'), ('really', 'RB'), ('have', 'VBP'), ('to', 'TO'), ('wait',
'VB'), ('for', 'IN'), ('someone', 'NN'), ('to', 'TO'), ('bring', 'VB'), ('you', 'PRP'), ('the', 'DT'), ('sun', 'NN'),
('and', 'CC'), ('give', 'VB'), ('you', 'PRP'), ('a', 'DT'), ('good', 'JJ'), ('feeling', 'NN'), (',', ','), ('then',
'RB'), ('you', 'PRP'), ('may', 'MD'), ('have', 'VB'), ('to', 'TO'), ('wait', 'VB'), ('a', 'DT'), ('long', 'JJ'),
('time', 'NN'), ('.', '.')]'''
tag_universal = nltk.pos_tag(words, tagset='universal')
'''[('I', 'PRON'), ('believe', 'VERB'), ('there', 'DET'), ('is', 'VERB'), ('a', 'DET'), ('person', 'NOUN'), ('who',
'PRON'), ('brings', 'VERB'), ('sunshine', 'NOUN'), ('into', 'ADP'), ('your', 'PRON'), ('life', 'NOUN'), ('.', '.'),
('That', 'ADP'), ('person', 'NOUN'), ('may', 'VERB'), ('have', 'VERB'), ('enough', 'NOUN'), ('to', 'PRT'), ('spread',
'VERB'), ('around', 'ADP'), ('.', '.'), ('But', 'CONJ'), ('if', 'ADP'), ('you', 'PRON'), ('really', 'ADV'), ('have',
'VERB'), ('to', 'PRT'), ('wait', 'VERB'), ('for', 'ADP'), ('someone', 'NOUN'), ('to', 'PRT'), ('bring', 'VERB'),
('you', 'PRON'), ('the', 'DET'), ('sun', 'NOUN'), ('and', 'CONJ'), ('give', 'VERB'), ('you', 'PRON'), ('a', 'DET'),
('good', 'ADJ'), ('feeling', 'NOUN'), (',', '.'), ('then', 'ADV'), ('you', 'PRON'), ('may', 'VERB'), ('have',
'VERB'), ('to', 'PRT'), ('wait', 'VERB'), ('a', 'DET'), ('long', 'ADJ'), ('time', 'NOUN'), ('.', '.')]'''
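The coarser universal tags are convenient for summary statistics. The following sketch, which is my own illustration rather than an NLTK feature, tallies how often each tag occurs. To stay runnable without NLTK's models, it hard-codes the universal-tagged tokens of the first sentence from the output above; `tagged_sample` and `tag_counts` are names chosen for this example.

```python
from collections import Counter

# First sentence of the tagset='universal' output above.
tagged_sample = [
    ('I', 'PRON'), ('believe', 'VERB'), ('there', 'DET'), ('is', 'VERB'),
    ('a', 'DET'), ('person', 'NOUN'), ('who', 'PRON'), ('brings', 'VERB'),
    ('sunshine', 'NOUN'), ('into', 'ADP'), ('your', 'PRON'), ('life', 'NOUN'),
    ('.', '.'),
]

# Count occurrences of each tag, ignoring the words themselves.
tag_counts = Counter(tag for _word, tag in tagged_sample)

for tag, n in tag_counts.most_common():
    print(f"{tag:5} {n}")
```

With a full pos_tag(words, tagset='universal') result in place of the hard-coded sample, the same two lines give the tag distribution of the whole article.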

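To extract the words of one part of speech, for example nouns, note that the tag table lists NN, NNS, NNP, and NNPS as the noun tags. All nouns can therefore be collected as follows: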

noun_word_list = []  # collects every noun found
for word, tag in nltk.pos_tag(words):  # words: the token list from word_tokenize() above
    if tag in ('NN', 'NNS', 'NNP', 'NNPS'):
        noun_word_list.append(word)
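The same idea extends to other parts of speech. Because the Penn Treebank tags in the table share prefixes (every noun tag starts with NN, every verb tag with VB, every adjective tag with JJ), a prefix test can stand in for the explicit tuple of tags. In this sketch `words_by_prefix` is a hypothetical helper name, and the tagged sample is the first sentence of the pos_tag() output above, hard-coded so the snippet runs without NLTK.

```python
def words_by_prefix(tagged, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


# First sentence of the pos_tag() output above.
tagged_sample = [
    ('I', 'PRP'), ('believe', 'VBP'), ('there', 'EX'), ('is', 'VBZ'),
    ('a', 'DT'), ('person', 'NN'), ('who', 'WP'), ('brings', 'VBZ'),
    ('sunshine', 'NN'), ('into', 'IN'), ('your', 'PRP$'), ('life', 'NN'),
    ('.', '.'),
]

nouns = words_by_prefix(tagged_sample, 'NN')  # ['person', 'sunshine', 'life']
verbs = words_by_prefix(tagged_sample, 'VB')  # ['believe', 'is', 'brings']
```

One caveat of the prefix approach: 'PRP' would also match 'PRP$', so an explicit tuple of tags remains the safer choice where two categories share a prefix.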

More to come...