word2vec Python interface: installation and usage
https://github.com/danielfrg/word2vec
Installation
I recommend the Anaconda Python distribution.
pip install word2vec
Wheel: wheel packages for OS X and Windows are provided on PyPI on a best-effort basis. The code is quite easy to compile, so consider using --no-use-wheel on Linux and OS X.
Linux: there is no wheel support for Linux, so you have to compile the C code. The only requirement is gcc. You can override the compilation flags if needed: CFLAGS="-march=corei7" pip install word2vec
Windows: very experimental support, based on this win32 port.
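A quick smoke test after installing is simply importing the package from Python:

import word2vec  # if this import fails, the pip install did not complete
print(word2vec)  # shows the module and where it was loaded from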
%load_ext autoreload
%autoreload 2
This notebook is equivalent to demo-word.sh, demo-analogy.sh, demo-phrases.sh and demo-classes.sh from Google.
Training
Download some data, for example: http://mattmahoney.net/dc/text8.zip
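If you prefer to do the download from Python, a minimal sketch (using Python 3's urllib; the paths are just examples) is:

import urllib.request
import zipfile

# Fetch the text8 corpus (~31 MB zipped) and unpack it next to the script.
# Paths are examples only; adjust them to your own machine.
urllib.request.urlretrieve("http://mattmahoney.net/dc/text8.zip", "text8.zip")
with zipfile.ZipFile("text8.zip") as zf:
    zf.extractall(".")  # produces a plain-text file named "text8"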
In [2]: import word2vec
Run word2phrase to group similar words, turning "Los Angeles" into "Los_Angeles".
word2vec.word2phrase("/Users/drodriguez/Downloads/text8", "/Users/drodriguez/Downloads/text8-phrases", verbose=True)
[u"word2phrase", u"-train", u"/Users/drodriguez/Downloads/text8", u"-output", u"/Users/drodriguez/Downloads/text8-phrases", u"-min-count", u"5", u"-threshold", u"100", u"-debug", u"2"] Starting training using file /Users/drodriguez/Downloads/text8 Words processed: 17000K Vocab size: 4399K Vocab size (unigrams + bigrams): 2419827 Words in train file: 17005206
This will create a text8-phrases file that we can use as a better input for word2vec. Note that you could easily skip this previous step and use the original data as input for word2vec.
Train the model using the word2phrase output.
word2vec.word2vec("/Users/drodriguez/Downloads/text8-phrases", "/Users/drodriguez/Downloads/text8.bin", size=100, verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8-phrases
Vocab size: 98331
Words in train file: 15857306
Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 286.52k
That generated a text8.bin file containing the word vectors in a binary format.
Do the clustering of the vectors based on the trained model.
In [5]: word2vec.word2clusters("/Users/drodriguez/Downloads/text8", "/Users/drodriguez/Downloads/text8-clusters.txt", 100, verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.02%  Words/thread/sec: 287.55k
That created a text8-clusters.txt file with the cluster number for every word in the vocabulary.
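Since it is a plain-text file, we can also peek at it directly; assuming the usual word2vec classes output format (one "word cluster-id" pair per line), a quick check looks like:

# Print the first few lines of the clusters file.
# Assumes the standard per-line "word cluster-id" format produced by word2vec -classes.
with open("/Users/drodriguez/Downloads/text8-clusters.txt") as f:
    for _ in range(5):
        print(f.readline().rstrip())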
Predictions
In [1]: import word2vec
Import the word2vec binary file created above.
model = word2vec.load("/Users/drodriguez/Downloads/text8.bin")
We can take a look at the vocabulary as a numpy array.
In [3]: model.vocab
Out[3]: array([u"</s>", u"the", u"of", ..., u"dakotas", u"nias", u"burlesques"], dtype="<U78")
Or take a look at the whole matrix
In [4]: model.vectors.shape
Out[4]: (98331, 100)

In [5]: model.vectors
Out[5]:
array([[ 0.14333282,  0.15825513, -0.13715845, ...,  0.05456942,  0.10955409,  0.00693387],
       [ 0.1220774 ,  0.04939618,  0.09545057, ..., -0.00804222, -0.05441621, -0.10076696],
       [ 0.16844609,  0.03734054,  0.22085373, ...,  0.05854521,  0.04685341,  0.02546694],
       ...,
       [-0.06760896,  0.03737842,  0.09344187, ...,  0.14559349, -0.11704484, -0.05246212],
       [ 0.02228479, -0.07340827,  0.15247506, ...,  0.01872172, -0.18154132, -0.06813737],
       [ 0.02778879, -0.06457976,  0.07102411, ..., -0.00270281, -0.0471223 , -0.135444  ]])
We can retrieve the vector of individual words.
In [6]:model["dog"].shapeOut[6]:
(100,)In [7]:
model["dog"][:10]Out[7]:
array([ 0.05753701, 0.0585594 , 0.11341395, 0.02016246, 0.11514406, 0.01246986, 0.00801256, 0.17529851, 0.02899276, 0.0203866 ])
We can do simple queries to retrieve words similar to "socks" based on cosine similarity:
In [8]: indexes, metrics = model.cosine("socks")
        indexes, metrics
Out[8]:
(array([20002, 28915, 30711, 33874, 27482, 14631, 22992, 24195, 25857, 23705]),
 array([ 0.8375354 ,  0.83590846,  0.82818749,  0.82533614,  0.82278399,  0.81476386,  0.8139092 ,  0.81253798,  0.8105933 ,  0.80850171]))
This returned a tuple with 2 items:
- a numpy array with the indexes of the similar words in the vocabulary
- a numpy array with the cosine similarity to each word
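The metric is just the cosine between the two word vectors. A hand-rolled check with numpy (normalizing explicitly, in case the stored vectors are not already unit length) should give roughly the same number as metrics[0]:

import numpy as np

# Cosine similarity between "socks" and its top neighbour, computed by hand.
a = model["socks"]
b = model.vectors[indexes[0]]
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # ~0.84 in this run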
It is possible to get the words for those indexes.
In [9]: model.vocab[indexes]
Out[9]: array([u"hairy", u"pumpkin", u"gravy", u"nosed", u"plum", u"winged", u"bock", u"petals", u"biscuits", u"striped"], dtype="<U78")
There is a helper function to create a combined response: a numpy record array
In [10]: model.generate_response(indexes, metrics)
Out[10]:
rec.array([(u"hairy", 0.8375353970603848), (u"pumpkin", 0.8359084628493809), (u"gravy", 0.8281874915608026), (u"nosed", 0.8253361379785071), (u"plum", 0.8227839904046932), (u"winged", 0.8147638561412592), (u"bock", 0.8139092031538545), (u"petals", 0.8125379796045767), (u"biscuits", 0.8105933044655644), (u"striped", 0.8085017054444408)],
          dtype=[(u"word", "<U78"), (u"metric", "<f8")])
It is easy to turn that numpy array into a pure Python response:
In [11]: model.generate_response(indexes, metrics).tolist()
Out[11]:
[(u"hairy", 0.8375353970603848), (u"pumpkin", 0.8359084628493809), (u"gravy", 0.8281874915608026), (u"nosed", 0.8253361379785071), (u"plum", 0.8227839904046932), (u"winged", 0.8147638561412592), (u"bock", 0.8139092031538545), (u"petals", 0.8125379796045767), (u"biscuits", 0.8105933044655644), (u"striped", 0.8085017054444408)]
Phrases
Since we trained the model with the output of word2phrase, we can ask for the similarity of "phrases":
indexes, metrics = model.cosine("los_angeles") model.generate_response(indexes, metrics).tolist()Out[12]:
[(u"san_francisco", 0.886558000570455), (u"san_diego", 0.8731961018831669), (u"seattle", 0.8455603712285231), (u"las_vegas", 0.8407843553947962), (u"miami", 0.8341796009062884), (u"detroit", 0.8235412519780195), (u"cincinnati", 0.8199138493085706), (u"st_louis", 0.8160655356728751), (u"chicago", 0.8156786240847214), (u"california", 0.8154244925085712)]
Analogies
It is possible to do more complex queries like analogies, such as: king - man + woman = queen.
This method returns the same as cosine: the indexes of the words in the vocab and the metric.
In [13]: indexes, metrics = model.analogy(pos=["king", "woman"], neg=["man"], n=10)
         indexes, metrics
Out[13]:
(array([1087, 1145, 7523, 3141, 6768, 1335, 8419, 1826,  648, 1426]),
 array([ 0.2917969 ,  0.27353295,  0.26877692,  0.26596514,  0.26487509,  0.26428581,  0.26315492,  0.26261258,  0.26136635,  0.26099078]))

In [14]: model.generate_response(indexes, metrics).tolist()
Out[14]:
[(u"queen", 0.2917968955611075), (u"prince", 0.27353295205311695), (u"empress", 0.2687769174818083), (u"monarch", 0.2659651399832089), (u"regent", 0.26487508713026797), (u"wife", 0.2642858109968327), (u"aragon", 0.2631549214361766), (u"throne", 0.26261257728511833), (u"emperor", 0.2613663460665488), (u"bishop", 0.26099078142148696)]
Clusters
In [15]: clusters = word2vec.load_clusters("/Users/drodriguez/Downloads/text8-clusters.txt")
We can get the cluster number for individual words.
In [16]:clusters["dog"]Out[16]:
11
We can get all the words grouped in a specific cluster.
In [17]: clusters.get_words_on_cluster(90).shape
Out[17]: (221,)

In [18]: clusters.get_words_on_cluster(90)[:10]
Out[18]: array(["along", "together", "associated", "relationship", "deal", "combined", "contact", "connection", "bond", "respect"], dtype=object)
We can add the clusters to the word2vec model and generate a response that includes the clusters.
In [19]: model.clusters = clusters

In [20]: indexes, metrics = model.analogy(pos=["paris", "germany"], neg=["france"], n=10)

In [21]: model.generate_response(indexes, metrics).tolist()
Out[21]:
[(u"berlin", 0.32333651414395953, 20), (u"munich", 0.28851564633559, 20), (u"vienna", 0.2768927258877336, 12), (u"leipzig", 0.2690537010929304, 91), (u"moscow", 0.26531859560322785, 74), (u"st_petersburg", 0.259534503067277, 61), (u"prague", 0.25000637367753303, 72), (u"dresden", 0.2495974800117785, 71), (u"bonn", 0.24403155303236473, 8), (u"frankfurt", 0.24199720792200027, 31)]