Can word embeddings be used to improve psychological text analysis?

Part 1 - setting up shop

2016-07-16

Mattias Östmar, Amateur researcher in Computational Media Studies

twitter.com/mattiasostmar

Background

  1. Created typealyzer.com in 2008 with @jonkagstrom

  2. The idea is to analyze personality type based on stylometry, like IBM's Watson Personality Insights

  3. Based on Jon's great Naive Bayesian classifier uClassify.com

  4. ... And my hand-selected examples of different psychological writing styles

The problem with that approach is that the choice of features (and examples) is highly subjective: it depends on what I happen to think is valid

Let's try: http://polyglot-nlp.com by Rami Al-Rfou

A multilingual text (NLP) processing toolkit based on Word Embeddings

- unsupervised machine learning on Wikipedia dump

In [ ]:
!pip install polyglot

Download the pre-processed Word Embeddings

For all download options, read the docs

In [ ]:
from polyglot.downloader import downloader
downloader.download("embeddings2.sv")  # the suffix is an ISO 639-1 language code: 'sv' for Swedish; use "embeddings2.en" for the English embeddings loaded below

And then we load the embeddings made from Wikipedia articles

In [1]:
from polyglot.mapping import Embedding
embeddings = Embedding.load("/Users/mos/polyglot_data/embeddings2/en/embeddings_pkl.tar.bz2")
print("Number of unique words: {} Vectors describing each: {}".format(embeddings.shape[0],embeddings.shape[1]))
Number of unique words: 100004 Vectors describing each: 64

Intro: personality in language style

  • The way we use language says something about our personality, i.e. how we perceive and engage with others
  • A computer tirelessly observing linguistic details can reveal subconscious patterns, e.g. rationality vs. sociability
  • IBM Watson Personality Insights analyzes the Big Five and Typealyzer.com analyzes the Myers-Briggs model.

Different personalities?

In [2]:
print(" "*30 + "-"*10 + "Two different personality styles" + "-"*10)
from IPython.display import Image
Image(filename="./images/the_dude_vs_spock.jpg")
                              ----------Two different personality styles----------
Out[2]:

How can we start mapping out "The Dude" Lebowski's (awesome) style?

Maybe Word Embeddings can help?

Let's look for nice style-related linguistic features (*)

(*) In machine learning, a feature is an individual measurable property of a phenomenon being observed ~ Wikipedia
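
To make "feature" concrete, here is a toy sketch of my own (not part of polyglot or Typealyzer): one measurable style feature could simply be the share of tokens in a text that belong to a hand-picked list of casual words.

In [ ]:
# Toy sketch of a measurable style feature; the seed list is purely illustrative.
casual_seeds = {"nice", "cute", "cool", "awesome", "man"}

def casual_ratio(text):
    """Fraction of tokens that belong to the casual seed list."""
    tokens = text.lower().split()
    return sum(t in casual_seeds for t in tokens) / len(tokens) if tokens else 0.0

casual_ratio("hey nice marmot")  # 1 of 3 tokens is 'casual'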

Hey, nice marmot. ~ The Dude

In [3]:
embeddings.nearest_neighbors("nice", top_k=10)
Out[3]:
['weird',
 'neat',
 'wonderful',
 'messy',
 'lovely',
 'curious',
 'tricky',
 'decent',
 'silly',
 'delightful']

How can we start mapping out Mr Spock's (formal) style?

Captain, I see no reason to stand here and be insulted. ~ Mr. Spock

In [4]:
embeddings.nearest_neighbors("reason",top_k=10)
Out[4]:
['context',
 'suggestion',
 'concern',
 'excuse',
 'justification',
 'case',
 'phrase',
 'occasion',
 'explanation',
 'thing']

Hmm. It says something about style that seems intuitively right.

But...

... let's try with more extreme words on those two dimensions

More social, casual...

In [5]:
embeddings.nearest_neighbors("cute", top_k=10)
Out[5]:
['sexy',
 'goofy',
 'scary',
 'creepy',
 'cheesy',
 'gorgeous',
 'stupid',
 'weird',
 'pathetic',
 'disgusting']

Yeah, man! :-)

Let's try more rational, formal ...

In [27]:
embeddings.nearest_neighbors("probabilistic", top_k=10)
Out[27]:
['deterministic',
 'multidimensional',
 'stochastic',
 'categorical',
 'propositional',
 'deductive',
 'qualitative',
 'causal',
 'macroscopic',
 'phenomenological']

Definitely more accurate, fellow word-nerd.

Can word vectors give a mathematical basis for psychological opposites, e.g. Thinking-Feeling and Sensing-iNtuition? (See the sketch after the normalized distance comparison further down.)

First, let's normalize the word vectors so that differences in vector length (which partly reflect word frequency) don't dominate the distances

In [7]:
normalized_embeddings = embeddings.normalize_words()

Boring prerequisites: a table class for pretty printing

In [9]:
class ListTable(list):
    """ Overridden list class which takes a 2-dimensional list of 
        the form [[1,2,3],[4,5,6]], and renders an HTML Table in 
        IPython Notebook. """
    
    def _repr_html_(self):
        html = ["<table>"]
        for row in self:
            html.append("<tr>")
            
            for col in row:
                html.append("<td>{0}</td>".format(col))
            
            html.append("</tr>")
        html.append("</table>")
        return ''.join(html)
In [10]:
table = ListTable()
neighbors = embeddings.nearest_neighbors("theoretical")
table.append(['Word', 'Distance'])
for w,d in zip(neighbors, embeddings.distances("theoretical", neighbors)):
    table.append([w,d])
table
Out[10]:
Word            Distance
mathematical    1.3104431629180908
analytical      1.3861656188964844
conceptual      1.4288334846496582
computational   1.4675350189208984
evolutionary    1.4727905988693237
scientific      1.4766939878463745
qualitative     1.4767696857452393
physiological   1.476980209350586
empirical       1.4911034107208252
probabilistic   1.4911788702011108
In [11]:
norm_table = ListTable()
neighbors = normalized_embeddings.nearest_neighbors("theoretical")
norm_table.append(["Word","Distance"])
for w,d in zip(neighbors, normalized_embeddings.distances("theoretical", neighbors)):
    norm_table.append([w, d])
norm_table
Out[11]:
Word            Distance
mathematical    0.4411289095878601
scientific      0.5036850571632385
analytical      0.5096677541732788
computational   0.5240076780319214
evolutionary    0.5250150561332703
physiological   0.5262303352355957
conceptual      0.5288679003715515
empirical       0.5415818691253662
physical        0.5443376898765564
qualitative     0.5497934222221375
In [12]:
vec_table = ListTable()
vec_table.append(["Normalized","Max","Min","Diff"])
vec_table.append(["No",embeddings.vectors.max(), embeddings.vectors.max(),(embeddings.vectors.max() - embeddings.vectors.max())])
vec_table.append(["Yes",normalized_embeddings.vectors.max(), normalized_embeddings.vectors.min(), normalized_embeddings.vectors.max() - normalized_embeddings.vectors.min()])
vec_table
Out[12]:
Normalized    Max                  Min                    Diff
No            5.219781398773193    5.219781398773193      0.0
Yes           0.618454098701477    -0.5578411817550659    1.176295280456543

Let's try un-normalized distances (note "reasonable" at the end)

In [13]:
list(embeddings.distances("probabilistic",["cute","friendly","reasonable"]))
Out[13]:
[2.8954368, 3.3609173, 2.2395024]

Now let's try normalized vectors (note "reasonable" at the end)

In [15]:
list(normalized_embeddings.distances("probabilistic",["cute","friendly","reasonable"]))
Out[15]:
[1.1978812, 1.2580826, 0.81279546]
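
Coming back to the question of psychological opposites above: one way to formalize an axis between two poles is to take the difference between two anchor word vectors and project other words onto it. This is only a sketch of mine, not something polyglot provides out of the box; it assumes that indexing the Embedding with a word (e.g. normalized_embeddings["cute"]) returns its 64-dimensional vector, and the anchors "probabilistic" and "cute" merely stand in for a formal/casual (Thinking-Feeling-ish) axis.

In [ ]:
import numpy as np

# Sketch: build a formal-vs-casual axis from two anchor words and project
# candidate words onto it. Positive scores lean towards the formal pole.
formal = np.asarray(normalized_embeddings["probabilistic"])
casual = np.asarray(normalized_embeddings["cute"])
axis = formal - casual
axis = axis / np.linalg.norm(axis)  # unit-length direction

for word in ["reasonable", "friendly", "theoretical", "silly"]:
    score = float(np.dot(np.asarray(normalized_embeddings[word]), axis))
    print("{:<12} {:+.3f}".format(word, score))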

Conclusion: Word Embeddings seem to be pretty good at clustering words of a certain "cognitive style".

Ergo: Let's try to improve the training data for different personality styles using word2vec next.
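
As a hedged preview of that next step, one simple way to grow the hand-picked seed examples into larger training lexicons is to expand each seed word with its embedding neighbors. The sketch below reuses the polyglot embeddings already loaded and purely illustrative seed lists; a word2vec model trained on style-specific corpora could be plugged in the same way.

In [ ]:
# Sketch: expand hand-picked seed words into larger candidate lexicons per style.
# The seed lists are hypothetical illustrations, not Typealyzer's actual training data.
seed_words = {
    "casual": ["nice", "cute", "friendly"],
    "formal": ["reason", "probabilistic", "theoretical"],
}

expanded = {}
for style, seeds in seed_words.items():
    candidates = set(seeds)
    for seed in seeds:
        candidates.update(normalized_embeddings.nearest_neighbors(seed, top_k=10))
    expanded[style] = sorted(candidates)

for style, words in expanded.items():
    print(style, "->", len(words), "candidate words:", ", ".join(words[:5]), "...")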

Thank you, and to be continued! :-)

/Mattias Östmar 2016-07-16
