
t-SNE and Reddit

Online forums provide a wealth of language data and, thanks to the topical nature of the forums, even come complete with some labeling. One can assume, with some confidence, that the language of a forum titled 'teenagers' has a strong relationship with the language of teenagers themselves, and the occasional misleading title is manageable. Researchers have already made use of such loosely labeled data to study patterns in the mental health community. The work below was inspired by Al-Mosaiwi and Johnstone's In an Absolute State, which showed increased use of absolutist words ('always', 'never', 'completely', etc.) in forums concerning depression, anxiety, and suicidal ideation. This analysis uses Al-Mosaiwi and Johnstone's list of absolutist words alongside several other sentiment dictionaries (see also the Linguistic Inquiry and Word Count (LIWC) software) representing categories such as functional, negation, positive-emotion, negative-emotion, body, work, leisure, and money-related words.
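To make 'dictionary frequency' concrete: for each word collection, it is simply the share of a text's tokens that belong to that collection. Below is a minimal sketch; the five-word list is a placeholder subset (the published absolutist dictionary is longer), and the real computation lives in the removed code mentioned below.

# Placeholder subset of Al-Mosaiwi and Johnstone's absolutist dictionary
absolutist = {'always', 'never', 'completely', 'totally', 'entirely'}

def dict_freq(text, words):
    # Fraction of tokens in `text` that belong to the word collection
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in words for t in tokens) / len(tokens)

dict_freq("I always mess this up and it never gets better", absolutist)  # 0.2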

The remarkable result is that the language use of each subreddit, even when represented by these dictionary frequencies alone, is completely distinguishable from that of other subreddits, and subreddits also tend to form clusters within the categories they were selected to represent (video gaming, mental health, computers, relationships, recreational drugs, sports, and media). The easiest way to see this result is via t-SNE dimensionality reduction, which takes as input the dictionary frequencies (or a normalized, PCA-reduced representation thereof) and outputs two-dimensional points, which are graphed below. For more information on the t-SNE algorithm, see the original paper or these slides introducing the concept (additional presentation notes to come).

The notebook code and graphs below present the data and resulting visualizations. I've removed the least interesting code blocks; for the curious, that code, as well as a few other examples from other projects, is available in this folder on my GitHub. If you're interested in these results, my work on forum language, or how else I think t-SNE visualizations could help make NLP and DL more understandable, please email me!

Smallest Dataset

In [4]:
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

norm = Normalizer()
pca = PCA(n_components=30)
model = TSNE(n_iter=2000, perplexity=5.0)

# Load the one-month dataset: the full dataframe, its numeric feature
# columns, and a row-normalized copy of those features
small_r, small_r_nums, small_r_norm = reset_input(small_filename)
small_r_pca = pca.fit_transform(small_r_norm)

# Reduce the 30 PCA components down to two t-SNE dimensions
twod = model.fit_transform(small_r_pca)

small_r['Tsne1'] = twod[:, 0]
small_r['Tsne2'] = twod[:, 1]
small_r['category'] = small_r.subreddit.apply(lambda x: sub_to_cat[x])

small_r.head(10)
Out[4]:
| i | subreddit | month | count(1) | sum(wordcount) | absolutist_freq | funct_freq | pronoun_freq | i_freq | we_freq | you_freq | ... | home_freq | money_freq | relig_freq | death_freq | assent_freq | nonfl_freq | filler_freq | Tsne1 | Tsne2 | category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (buildapc, 16-12) | buildapc | 16-12 | 11394 | 2670024 | 0.0059 | 0.4225 | 0.1157 | 0.0534 | 0.0013 | 0.0152 | ... | 0.0015 | 0.0119 | 0.0004 | 0.0003 | 0.0036 | 0.0008 | 0.0027 | -147.54 | 49.93 | computers |
| (mentalhealth, 16-12) | mentalhealth | 16-12 | 442 | 145483 | 0.0137 | 0.6270 | 0.2119 | 0.1144 | 0.0038 | 0.0051 | ... | 0.0040 | 0.0026 | 0.0008 | 0.0013 | 0.0011 | 0.0011 | 0.0063 | 190.50 | -145.64 | mental_health |
| (DestinyTheGame, 16-12) | DestinyTheGame | 16-12 | 1455 | 460088 | 0.0113 | 0.5270 | 0.1238 | 0.0352 | 0.0068 | 0.0182 | ... | 0.0015 | 0.0036 | 0.0015 | 0.0032 | 0.0018 | 0.0012 | 0.0034 | -15.35 | 147.39 | gaming |
| (buildapcforme, 16-12) | buildapcforme | 16-12 | 1271 | 458336 | 0.0037 | 0.5228 | 0.1201 | 0.0263 | 0.0036 | 0.0503 | ... | 0.0038 | 0.0152 | 0.0005 | 0.0002 | 0.0047 | 0.0005 | 0.0058 | -289.26 | 43.90 | computers |
| (opiates, 16-12) | opiates | 16-12 | 800 | 208320 | 0.0129 | 0.5955 | 0.1795 | 0.0886 | 0.0041 | 0.0110 | ... | 0.0051 | 0.0061 | 0.0017 | 0.0012 | 0.0025 | 0.0017 | 0.0046 | 73.85 | -235.85 | drugs |
| (proED, 16-12) | proED | 16-12 | 439 | 96588 | 0.0144 | 0.6191 | 0.2079 | 0.1256 | 0.0024 | 0.0067 | ... | 0.0035 | 0.0027 | 0.0019 | 0.0006 | 0.0025 | 0.0012 | 0.0062 | 86.84 | -165.39 | mental_health |
| (wow, 16-12) | wow | 16-12 | 1522 | 336830 | 0.0114 | 0.5712 | 0.1488 | 0.0592 | 0.0069 | 0.0128 | ... | 0.0012 | 0.0043 | 0.0025 | 0.0018 | 0.0020 | 0.0011 | 0.0045 | -48.05 | 181.58 | gaming |
| (r4r, 16-12) | r4r | 16-12 | 1692 | 396283 | 0.0100 | 0.5960 | 0.1947 | 0.1045 | 0.0068 | 0.0196 | ... | 0.0025 | 0.0036 | 0.0012 | 0.0006 | 0.0032 | 0.0019 | 0.0075 | -67.44 | 221.13 | general |
| (DotA2, 16-12) | DotA2 | 16-12 | 2553 | 644136 | 0.0109 | 0.5245 | 0.1271 | 0.0359 | 0.0051 | 0.0155 | ... | 0.0005 | 0.0054 | 0.0018 | 0.0017 | 0.0017 | 0.0010 | 0.0042 | -66.38 | 135.39 | gaming |
| (baseball, 16-12) | baseball | 16-12 | 209 | 88815 | 0.0074 | 0.4772 | 0.0896 | 0.0167 | 0.0039 | 0.0058 | ... | 0.0019 | 0.0052 | 0.0011 | 0.0076 | 0.0013 | 0.0010 | 0.0025 | 15.16 | 149.87 | sports |

(dictionary frequencies rounded to four decimal places for display)

10 rows × 63 columns

This dataset includes 1 month(s) of data from 51 subreddits.

These data include the frequency of words from 55 word collections in each month of posts to each subreddit. The total number of posts to a given subreddit in that month and the total word count in those posts are also recorded.

The names of these word collections are:
	 absolutist   funct   pronoun   i
	 we   you   shehe   they
	 article   verb   auxverb   past
	 present   future   adverb   preps
	 conjunctions   negate   quant   number
	 swear   social   family   friend
	 humans   affect   posemo   negemo
	 anx   anger   sad   cogmech
	 insight   cause   discrep   tentat
	 certain   inhib   percept   bio
	 body   ingest   relativ   motion
	 space   time   work   achieve
	 leisure   home   money   relig
	 death   assent   nonfl   filler
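As an aside, reset_input itself is among the removed code blocks, so here is a guess at its behavior based on how its three outputs are used in the cell above; treat the column handling as an assumption rather than the original implementation.

import pandas as pd

def reset_input(filename):
    # Load one cached table like the one shown above, indexed by (subreddit, month)
    df = pd.read_csv(filename, index_col='i')
    # Numeric feature columns: the 55 *_freq values plus count(1) and sum(wordcount)
    nums = df.select_dtypes('number')
    # Row-wise L2 normalization (Normalizer from the sklearn imports above)
    return df, nums, Normalizer().fit_transform(nums)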

One Month Visual

The image at the beginning of this section is a scatter plot showing one month of data, in which each point is the t-SNE dimensionality reduction of the numerical features for one subreddit. The name of the subreddit associated with each point can be seen by hovering over the data point.

Of note is the way the points cluster by category. The main exception is the 'general' category, which includes the following, largely unrelated subreddits:

- tifu  (today I fucked up) 
- r4r
- AskReddit
- reddit.com
- tipofmytongue (in which people describe words they are searching for but can't remember)
- Life
- Advice
- jobs
- teenagers
- HomeImprovement
- redditinreddit

Also of interest is the way the category clusters form two larger groups: one containing the computers, video gaming, and sports categories, the other the relationships, mental health, and drugs categories.

Big Reddit Data

In [10]:
import pandas as pd

# Load the cached t-SNE coordinates for the full five-year dataset
big_r = pd.read_csv('vis_data_cache.csv')
big_r.head(10)
Out[10]:
|   | i | subreddit | month | Tsne1 | Tsne2 | category | count(1) |
|---|---|---|---|---|---|---|---|
| 0 | ('mentalhealth', '15-11') | mentalhealth | 15-11 | 35.442670 | -41.172127 | mental_health | 279 |
| 1 | ('buildapc', '16-12') | buildapc | 16-12 | -31.161697 | 4.441895 | computers | 11394 |
| 2 | ('relationship_advice', '15-12') | relationship_advice | 15-12 | 52.036568 | 42.083908 | relationships | 2282 |
| 3 | ('trees', '16-11') | trees | 16-11 | -42.680676 | 39.775772 | drugs | 1408 |
| 4 | ('hardwareswap', '16-01') | hardwareswap | 16-01 | -46.412006 | 16.485046 | computers | 938 |
| 5 | ('buildapcforme', '15-06') | buildapcforme | 15-06 | 57.582960 | -0.978246 | computers | 2747 |
| 6 | ('Overwatch', '15-07') | Overwatch | 15-07 | -5.354017 | -8.762697 | gaming | 123 |
| 7 | ('mentalhealth', '16-04') | mentalhealth | 16-04 | 37.051860 | -40.462860 | mental_health | 261 |
| 8 | ('bipolar', '13-06') | bipolar | 13-06 | 32.023180 | -44.671570 | mental_health | 253 |
| 9 | ('Drugs', '12-10') | Drugs | 12-10 | -20.675976 | 40.302505 | drugs | 715 |
This dataset includes 70 month(s) of data from 57 subreddits.

These data include the frequency of words from 55 word collections in each month of posts to each subreddit, 
as well as the total number of posts in that month and the total word count in those posts.

In total, this data represents 4733344 text posts.

Five Year Visual

The following visualization is a scatter plot representing five years of data, in which each point is the t-SNE dimensionality reduction of the numerical features for one month of text posted to one subreddit. The subreddit and month represented can be seen by hovering over any data point. Note also that the Bokeh toolbar to the right of the image allows zooming.

While the categories are not as clearly delineated in this image as in the visualization of just one month of data (likely because the t-SNE algorithm struggles to represent global relations between so many points), months of data from the same subreddit land shockingly close to one another. This suggests that each forum has its own linguistic habits that persist across time.
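One way to make that closeness concrete (a quick check of my own, not part of the original notebook) is to compare the average t-SNE distance between months of the same subreddit with the average distance between arbitrary points:

from scipy.spatial.distance import pdist

# Mean pairwise distance among each subreddit's monthly points, averaged over subreddits
within = big_r.groupby('subreddit').apply(
    lambda g: pdist(g[['Tsne1', 'Tsne2']].values).mean()
).mean()

# Mean pairwise distance across all points, regardless of subreddit
overall = pdist(big_r[['Tsne1', 'Tsne2']].values).mean()

print('within-subreddit: {:.1f}, overall: {:.1f}'.format(within, overall))

If each forum really has persistent linguistic habits, the within-subreddit figure should come out much smaller than the overall one.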

In [12]:
from bokeh.models import CategoricalColorMapper, ColumnDataSource
from bokeh.palettes import d3
from bokeh.plotting import figure, show

data = big_r

# One distinct color per category from the d3 Category10 palette
palette = d3['Category10'][len(data['category'].unique())]
color_map = CategoricalColorMapper(factors=data['category'].unique(), palette=palette)

source = ColumnDataSource(data)

# Hovering shows the (subreddit, month) index of each point
TOOLTIPS = [
    ("index", "@i"),
]

p_big = figure(x_axis_label='Tsne 1', y_axis_label='Tsne 2',
               plot_width=800, plot_height=800, tooltips=TOOLTIPS)
p_big.cross(x='Tsne1', y='Tsne2',
            color={'field': 'category', 'transform': color_map},
            legend='category',
            source=source,
            size=7)

show(p_big)  # render inline; assumes output_notebook() was called in an earlier cell

Big Reddit data, without metadata

One wonders whether the primary reason subreddits form such distinct clusters is simply their consistent number of posts and total word count from month to month. I've repeated the analysis above with the post-count and word-count metadata removed, and the results show similarly distinct clusters.
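The raw feature table for the big dataset isn't loaded in the cells shown here, so here is a sketch of that variant using the small dataset's features instead; it assumes the metadata sits in the count(1) and sum(wordcount) columns, as in the tables above.

# Drop the metadata so only the 55 dictionary frequencies remain as features,
# then rerun the same normalize -> PCA -> t-SNE pipeline as before
no_meta = small_r_nums.drop(columns=['count(1)', 'sum(wordcount)'])
no_meta_norm = Normalizer().fit_transform(no_meta)
no_meta_pca = PCA(n_components=30).fit_transform(no_meta_norm)
twod_no_meta = TSNE(n_iter=2000, perplexity=5.0).fit_transform(no_meta_pca)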