Online forums provide a wealth of language data and, because forums are organized by topic, even come with some labeling built in. One can assume, with some level of confidence, that the language of a forum titled 'teenagers' has a strong relationship with the language of teenagers themselves, and the occasional misleading title is manageable. Researchers have already used loosely labeled data of this kind to study patterns in the mental health community. The work below was inspired by Al-Mosaiwi and Johnstone's "In an Absolute State", which showed increased use of absolutist words ('always', 'never', 'completely', etc.) in forums concerning depression, anxiety, and suicidal ideation. This analysis uses Al-Mosaiwi and Johnstone's list of absolutist words alongside several other sentiment dictionaries (see also the Linguistic Inquiry and Word Count (LIWC) software) representing categories such as function words, negation, positive emotion, negative emotion, body, work, leisure, and money.
The remarkable result is that the language use of each subreddit, even when represented by these dictionary frequencies alone, is clearly distinguishable from that of other subreddits, and subreddits also tend to form clusters within the categories they were selected to represent (video gaming, mental health, computers, relationships, recreational drugs, sports, and media). The easiest way to see this result is via t-SNE dimensionality reduction, which takes the dictionary frequencies (or a normalized, PCA-reduced representation thereof) as input and outputs two-dimensional points, which are graphed below. For more information on the t-SNE algorithm, see the original paper on it or these slides introducing the concept (additional presentation notes to come).
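To make "dictionary frequencies" concrete, here is a minimal sketch of how such a feature could be computed for one block of text. It is not the code used for this analysis: the function name, the tokenizer, and the tiny word lists are assumptions for illustration. The absolutist entries are just the three examples quoted above, and the negation list is a made-up stand-in rather than the LIWC category.

```python
import re
from collections import Counter

# Illustrative word lists only (assumptions, not the real dictionaries).
DICTIONARIES = {
    "absolutist": {"always", "never", "completely"},
    "negation": {"no", "not", "never"},
}


def dictionary_frequencies(text, dictionaries=DICTIONARIES):
    """Return, per dictionary, the fraction of tokens in `text` that belong to it."""
    # Very rough tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {name: 0.0 for name in dictionaries}
    return {
        name: sum(counts[word] for word in words) / total
        for name, words in dictionaries.items()
    }


sample = "I never sleep and I am always tired, completely and totally."
freqs = dictionary_frequencies(sample)
print(freqs)
```

Computing one such vector per subreddit (or per subreddit and month) would give the kind of numerical features that are normalized, PCA-reduced, and passed to t-SNE.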
The notebook code and graphs below present the data and resulting visualizations. I've removed the least interesting code blocks; for the curious, that code, as well as a few other examples from other projects, is available in this folder on my GitHub. If you're interested in these results, my work on forum language, or how else I think t-SNE visualizations could help make NLP and DL more understandable, please email me!
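from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import Normalizer

# reset_input, small_filename and sub_to_cat come from the omitted notebook cells.
norm = Normalizer()
pca = PCA(n_components=30)
# Newer scikit-learn releases rename n_iter to max_iter.
model = TSNE(n_iter=2000, perplexity=5.0)
small_r, small_r_nums, small_r_norm = reset_input(small_filename)
# Reduce the normalized dictionary frequencies to 30 components, then embed in 2D.
small_r_pca = pca.fit_transform(small_r_norm)
twod = model.fit_transform(small_r_pca)
small_r['Tsne1'] = twod[:, 0]
small_r['Tsne2'] = twod[:, 1]
small_r['category'] = small_r.subreddit.apply(lambda x: sub_to_cat[x])
small_r.head(10)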
The image at the beginning of this section is a scatter plot showing one month of data in which each point represents the t-SNE dimensionality reduction of numerical features representing one subreddit. The name of the subreddit associated with each point can be seen by hovering over the data point.
Of note is the way points for subreddits in the same category cluster together. The main exception is the 'general' category, which includes the following largely unrelated subreddits:
- tifu (today I fucked up)
- r4r
- AskReddit
- reddit.com
- tipofmytongue (in which people describe words they are searching for but can't remember)
- Life
- Advice
- jobs
- teenagers
- HomeImprovement
- redditinreddit
Also of interest is the way the category clusters form two larger groups, one containing the computers, video game, and sports categories and the other containing forums about relationships, mental health, and drugs.
import pandas as pd

big_r = pd.read_csv('vis_data_cache.csv')
big_r.head(10)
The following visualization is a scatter plot representing 5 years of data in which each point represents the t-SNE dimensionality reduction of numerical features representing one month of text posted to one subreddit. The name of the subreddit and the month represented can be seen by hovering over any data point. Note also that the Bokeh toolbar to the right of the image should allow zooming.
While the categories are not as clearly delineated in this image as they are in the visualization of just one month of data (likely because the t-SNE algorithm struggles to represent global relations between so many points), points for different months of the same subreddit sit strikingly close together. This suggests that each forum has its own linguistic habits that persist over time.
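from bokeh.models import CategoricalColorMapper, ColumnDataSource
from bokeh.palettes import d3
from bokeh.plotting import figure

data = big_r
categories = list(data['category'].unique())
# One Category10 color per category (the palette covers 3 to 10 categories).
palette = d3['Category10'][len(categories)]
color_map = CategoricalColorMapper(factors=categories, palette=palette)
source = ColumnDataSource(data)
TOOLTIPS = [
    ("index", "@i"),
]
# plot_width/plot_height and legend= are the older Bokeh spellings; Bokeh 3 uses
# width/height and legend_field (or legend_group) instead.
p_big = figure(x_axis_label='Tsne 1', y_axis_label='Tsne 2',
               plot_width=800, plot_height=800, tooltips=TOOLTIPS)
p_big.cross(x='Tsne1', y='Tsne2',
            color={'field': 'category', 'transform': color_map},
            legend='category',
            source=source,
            size=7)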
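One rough way to put a number on how close the months of a single subreddit sit is to compare the average distance between points from the same subreddit with the average distance between points from different subreddits. The sketch below does this on made-up coordinates. The function name and the toy points are assumptions, and since t-SNE distances are only loosely meaningful, the ratio is a sanity check rather than a statistic.

```python
from itertools import combinations
from math import dist
from statistics import mean


def within_between_distances(points):
    """points: list of (subreddit, x, y) tuples.

    Returns (mean distance within a subreddit, mean distance between subreddits).
    Assumes at least one same-subreddit pair and one cross-subreddit pair exist.
    """
    within, between = [], []
    for (sub_a, xa, ya), (sub_b, xb, yb) in combinations(points, 2):
        d = dist((xa, ya), (xb, yb))
        (within if sub_a == sub_b else between).append(d)
    return mean(within), mean(between)


# Made-up 2D coordinates standing in for the Tsne1/Tsne2 columns.
toy_points = [
    ("depression", 0.0, 0.0),
    ("depression", 0.0, 1.0),
    ("nba", 10.0, 0.0),
    ("nba", 10.0, 1.0),
]
within, between = within_between_distances(toy_points)
print(within, between)
```

On real data the tuples could be built from the subreddit, Tsne1, and Tsne2 columns; a within-subreddit mean far below the between-subreddit mean would agree with what the plot shows.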
One might wonder whether the main reason subreddits form such distinct clusters is simply their consistent number of posts and total words from month to month. I repeated the analysis above with the post and word count metadata removed, and the results show a similarly clustered dataset.
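For anyone who wants to try the same check, the step amounts to dropping the count columns before normalizing and reducing the features. Here is a minimal sketch on a toy table. The column names post_count and word_count and the helper name are hypothetical, since the real metadata columns are not shown in this post.

```python
import pandas as pd

# Hypothetical names for the post and word count metadata columns.
METADATA_COLUMNS = ["post_count", "word_count"]


def drop_count_metadata(frame, metadata_columns=METADATA_COLUMNS):
    """Return a copy of `frame` without the post/word count columns."""
    present = [c for c in metadata_columns if c in frame.columns]
    return frame.drop(columns=present)


# Toy rows with made-up values, just to show the shape of the operation.
toy = pd.DataFrame({
    "subreddit": ["depression", "nba"],
    "post_count": [1200, 900],
    "word_count": [250000, 90000],
    "absolutist": [0.018, 0.009],
    "negation": [0.031, 0.015],
})
features_only = drop_count_metadata(toy)
print(list(features_only.columns))
```

The remaining dictionary-frequency columns would then go through the same normalization, PCA, and t-SNE steps as before.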