Online forums provide a wealth of language data and even, by the topical nature of the forums, come complete with some labeling. One can assume, with some level of confidence, that the language of a forum titled 'teenagers' has a strong relationship with the language of teenagers themselves, and the occasional misleading title is manageable. Researchers have already made some use of loosely labeled data to study patterns in the mental health community. The work below was inspired by Al-Mosaiwi and Johnstone's *In an Absolute State*, which showed increased use of absolutist words ('always', 'never', 'completely', etc.) in forums concerning depression, anxiety, and suicidal ideation. This analysis uses Al-Mosaiwi and Johnstone's list of absolutist words alongside several other sentiment dictionaries (see also the Linguistic Inquiry and Word Count (LIWC) software) representing categories such as function words, negation, positive emotion, negative emotion, body, work, leisure, and money-related words.
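The core feature extraction is simple enough to sketch: for each dictionary, count what fraction of a text's tokens it matches. Here is a minimal illustration; the tiny word sets below are hypothetical stand-ins (Al-Mosaiwi and Johnstone's actual absolutist list has 19 words, and LIWC categories are far larger), and the tokenizer is a deliberately crude one.

```python
from collections import Counter
import re

# Hypothetical mini-dictionaries for illustration only; the real word
# lists used in this analysis are much larger.
DICTIONARIES = {
    "absolutist": {"always", "never", "completely", "totally", "nothing"},
    "negation": {"no", "not", "never", "none"},
}

def dictionary_frequencies(text):
    """Return, for each dictionary, the fraction of tokens it matches."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values()) or 1  # avoid dividing by zero on empty text
    return {
        name: sum(counts[w] for w in words) / total
        for name, words in DICTIONARIES.items()
    }

freqs = dictionary_frequencies("I never win. Nothing ever works, not once.")
# 2 of 8 tokens match each mini-dictionary, so both frequencies are 0.25
```

A vector of such frequencies, one entry per dictionary, is what represents each subreddit (or each subreddit-month) in everything that follows.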
The remarkable result is that the language use of each subreddit, even when represented by these dictionary frequencies alone, is completely distinguishable from that of other subreddits, and subreddits also tend to form clusters within the categories they were selected to represent (video gaming, mental health, computers, relationships, recreational drugs, sports, and media). The easiest way to see this result is via t-SNE dimensionality reduction, which takes as input the dictionary frequencies (or a normalized, PCA-reduced representation thereof) and outputs two-dimensional points, which are graphed below. For more information on the t-SNE algorithm, see the original paper or these slides introducing the concept (additional presentation notes to come).
The notebook code and graphs below present the data and resulting visualizations. I've removed the least interesting code blocks; for the curious, that code, as well as a few other examples from other projects, is available in this folder on my GitHub. If you're interested in these results, my work on forum language, or how else I think t-SNE visualizations could help make NLP and DL more understandable, please email me!
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Normalize the dictionary frequencies, reduce to 30 PCA components,
# then embed in two dimensions with t-SNE.
norm = Normalizer()
pca = PCA(n_components=30)
model = TSNE(n_iter=2000, perplexity=5.0)

small_r, small_r_nums, small_r_norm = reset_input(small_filename)
small_r_pca = pca.fit_transform(small_r_norm)
twod = model.fit_transform(small_r_pca)

# Attach the 2-D coordinates and each subreddit's category for plotting.
small_r['Tsne1'] = twod[:, 0]
small_r['Tsne2'] = twod[:, 1]
small_r['category'] = small_r.subreddit.apply(lambda x: sub_to_cat[x])
small_r.head(10)
The image at the beginning of this section is a scatter plot showing one month of data in which each point represents the t-SNE dimensionality reduction of numerical features representing one subreddit. The name of the subreddit associated with each point can be seen by hovering over the data point.
Of note is the way points representing each subreddit cluster with each other. The main exception is the 'general' category which includes the following, largely unrelated subreddits:
- tifu (today I fucked up)
- r4r
- AskReddit
- reddit.com
- tipofmytongue (in which people describe words they are searching for but can't remember)
- Life
- Advice
- jobs
- teenagers
- HomeImprovement
- redditinreddit
Also of interest is the way the category clusters themselves form two larger groups: one containing the computers, video gaming, and sports categories, the other the relationships, mental health, and drugs categories.
# Load the cached five-year feature table.
big_r = pd.read_csv('vis_data_cache.csv')
big_r.head(10)
The following visualization is a scatter plot representing 5 years of data, in which each point represents the t-SNE dimensionality reduction of numerical features representing one month of text posted to one subreddit. The name of the subreddit and the month represented can be seen by hovering over any data point. Note also that the Bokeh toolbar to the right of the image should allow zooming.
While the categories are not so clearly delineated in this image as they are in the visualization of just one month of data (likely because the t-SNE algorithm struggles to represent global relations between so many points), points representing different months of data from the same subreddit sit shockingly close together. This suggests that each forum has its own linguistic habits that persist across time.
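One way to put a number on "shockingly close" is to compare how far points sit from their own subreddit's centroid versus from the global centroid of the embedding. This is a sketch, not code from the notebook; it assumes a DataFrame shaped like `big_r` above, with `Tsne1`, `Tsne2`, and `subreddit` columns, and is demonstrated here on toy data.

```python
import numpy as np
import pandas as pd

def embedding_spread(df):
    """Mean distance from each point to its subreddit's centroid,
    alongside the mean distance of all points from the global centroid."""
    pts = df[["Tsne1", "Tsne2"]].to_numpy()
    # groupby().transform("mean") broadcasts each subreddit's centroid
    # back onto its own rows.
    centroids = df.groupby("subreddit")[["Tsne1", "Tsne2"]].transform("mean").to_numpy()
    within = np.linalg.norm(pts - centroids, axis=1).mean()
    overall = np.linalg.norm(pts - pts.mean(axis=0), axis=1).mean()
    return within, overall

# Toy data: two tight subreddit clusters far apart.
toy = pd.DataFrame({
    "subreddit": ["a", "a", "b", "b"],
    "Tsne1": [0.0, 1.0, 10.0, 11.0],
    "Tsne2": [0.0, 0.0, 0.0, 0.0],
})
within, overall = embedding_spread(toy)
# within = 0.5, overall = 5.0: points hug their own cluster
```

A within-subreddit spread far below the overall spread is the quantitative version of the visual impression described above.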
from bokeh.palettes import d3
from bokeh.models import CategoricalColorMapper, ColumnDataSource
from bokeh.plotting import figure, show

data = big_r

# Color each point by its category; hovering shows the index column.
palette = d3['Category10'][len(data['category'].unique())]
color_map = CategoricalColorMapper(factors=data['category'].unique(), palette=palette)
source = ColumnDataSource(data)
TOOLTIPS = [
    ("index", "@i"),
]

p_big = figure(x_axis_label='Tsne 1', y_axis_label='Tsne 2',
               plot_width=800, plot_height=800, tooltips=TOOLTIPS)
p_big.cross(x='Tsne1', y='Tsne2',
            color={'field': 'category', 'transform': color_map},
            legend='category',
            source=source,
            size=7)
show(p_big)
One might wonder whether the subreddits form such distinct clusters primarily because each keeps a consistent number of posts and total words from month to month. I've replicated the analysis above with the post- and word-count metadata removed, and the results show a similarly clustered dataset.
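That replication amounts to dropping the count columns before rerunning the same normalize, PCA, t-SNE pipeline. A sketch follows; the column names `num_posts` and `total_words` are hypothetical (the real names live in the preprocessing code removed from this notebook), and t-SNE's defaults are used rather than the exact parameters above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical metadata column names standing in for the real ones.
METADATA_COLS = ["num_posts", "total_words"]

def embed_without_counts(df, perplexity=5.0):
    """Drop count metadata, then rerun normalize -> PCA -> t-SNE."""
    feats = df.drop(columns=[c for c in METADATA_COLS if c in df.columns])
    x = Normalizer().fit_transform(feats.to_numpy(dtype=float))
    # PCA components can't exceed the number of rows or remaining features.
    x = PCA(n_components=min(30, *x.shape)).fit_transform(x)
    # Perplexity must stay below the number of samples.
    return TSNE(perplexity=min(perplexity, len(x) - 1)).fit_transform(x)

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((10, 6)), columns=[f"dict_{i}" for i in range(6)])
toy["num_posts"] = 100
toy["total_words"] = 5000
emb = embed_without_counts(toy)  # shape (10, 2): one 2-D point per row
```

If the clusters survive this ablation, as they did here, volume alone can't explain them; the dictionary frequencies themselves carry the signal.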