Building Language Models for Various Channels of Social Media

Project Abstract

In this project, N-gram and LSTM RNN based language models are built for the Facebook, Twitter, and Instagram social media platforms. Data from the news domain was crawled and preprocessed for all three channels, and language models were trained using the open-source libraries ‘nltk’ and ‘keras’.

A single custom preprocessing script was written to clean and tokenize the raw (crawled) data for all three platforms.

The statistics for the preprocessed data of all three platforms are given below.

Twitter: 40.6 lakh (4.06 million) sequences

Instagram: 1.38 lakh (138,000) sequences, including Instagram image captions and comments

Facebook: 11.4 lakh (1.14 million) sequences, including Facebook posts and comments

The general findings of this project are:

1. The LSTM-based language model performs better than the N-gram based models for all three platforms.

2. Both the conditioned and unconditioned text generated using the LSTM model is of better quality than that generated by the other models.

Data Collection Challenges

1. Collecting the data was a major challenge, especially for Instagram and Facebook, because of the unavailability of datasets online and differences in how the data is presented to the user, which require different crawlers for different platforms.

2. Due to several limitations of their official APIs, it is not possible to crawl a large amount of data, so we used Selenium-based crawlers available online to crawl the data from these platforms.

3. Selecting a common domain for all three social media platforms was an issue. We chose the “news” domain, as it is one of the most commonly used domains in NLP and plenty of data is available online.

4. Since the crawlers use Selenium, which consumes a lot of RAM and takes significant time, we decided to collect around 1 million posts each from Facebook and Instagram.

Data Preprocessing Steps

1. Fix encoding and Unicode-related problems.

2. Reject a sequence if it contains improperly encoded characters, using this check: “seq.encode(encoding='utf-8').decode('ascii')” (the sequence is rejected if the decode fails).

3. Emojis were removed.

4. While processing a sequence, any social media acronyms (slang short forms) it contains are converted to their corresponding full forms, e.g., 2gethr → together.

5. We removed URLs, email addresses, and ‘@’ user mentions (common on social media).

6. We do not remove hashtags, as they might contain useful information; instead we use their segmented form. For example, #TwinPeaks → twin peaks.

7. We do not remove tokens that contain numbers, such as $500 (money), 8212121923 (phone number), decimals, dates, and times. These tokens can have an impact on training the language model and so are important.

8. Perform language identification using ‘spacy’ to filter out code-mixed or non-English sequences. A minimal sketch of the overall cleaning pipeline is given after this list.
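The sketch below illustrates the cleaning pipeline described above, assuming a simple regex-based implementation. The slang dictionary, the regular expressions, and the helper names (clean_sequence, is_ascii_clean, segment_hashtag) are illustrative, and the spacy-based language identification step is omitted; the actual project script may differ.

```python
import re

# Hypothetical mini slang dictionary; the real mapping is much larger.
SLANG = {"2gethr": "together", "gr8": "great"}

URL_RE     = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE   = re.compile(r"\S+@\S+\.\S+")
MENTION_RE = re.compile(r"@\w+")                          # '@' user mentions
EMOJI_RE   = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
HASHTAG_RE = re.compile(r"#(\w+)")

def segment_hashtag(match):
    """#TwinPeaks -> 'twin peaks' (simple CamelCase split)."""
    tag = match.group(1)
    parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", tag)
    return " ".join(p.lower() for p in parts) if parts else tag.lower()

def is_ascii_clean(seq):
    """Check used to reject improperly encoded (non-ASCII) sequences."""
    try:
        seq.encode(encoding="utf-8").decode("ascii")
        return True
    except UnicodeDecodeError:
        return False

def clean_sequence(seq):
    """Return the cleaned sequence, or None if it should be rejected."""
    seq = URL_RE.sub(" ", seq)                            # drop URLs
    seq = EMAIL_RE.sub(" ", seq)                          # drop email addresses
    seq = MENTION_RE.sub(" ", seq)                        # drop @mentions
    seq = EMOJI_RE.sub(" ", seq)                          # drop emojis
    seq = HASHTAG_RE.sub(segment_hashtag, seq)            # segment hashtags
    tokens = [SLANG.get(tok.lower(), tok) for tok in seq.split()]
    seq = " ".join(tokens)                                # expand slang, normalize spaces
    return seq if is_ascii_clean(seq) else None

print(clean_sequence("Watching #TwinPeaks 2gethr with @alice https://t.co/xyz"))
# -> "Watching twin peaks together with"
```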

Baseline Methodologies

N-gram based models. We experimented with different N-gram models (a training sketch using nltk follows this list):

1. Maximum Likelihood Estimation (MLE) model.

2. MLE with Lidstone smoothing.

3. Linear Interpolation N-gram model with Witten-Bell smoothing.

4. Linear Interpolation N-gram model with Kneser-Ney smoothing.
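A minimal training sketch of these four variants using nltk’s lm module is given below; the toy corpus, the Lidstone gamma value, and the trigram order are illustrative assumptions, not the project’s actual settings.

```python
from nltk.lm import MLE, Lidstone, WittenBellInterpolated, KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

order = 3  # trigram; performance improved as the order went from 1 to 3

# Placeholder corpus: each sequence is already tokenized by the preprocessing script.
tokenized_corpus = [
    ["breaking", "news", "today"],
    ["the", "election", "results", "are", "out"],
]

models = {
    "mle":         MLE(order),
    "lidstone":    Lidstone(0.1, order),           # gamma = 0.1 (illustrative)
    "witten_bell": WittenBellInterpolated(order),
    "kneser_ney":  KneserNeyInterpolated(order),
}

for name, lm in models.items():
    # The pipeline yields generators, so it is rebuilt for every model.
    train_ngrams, vocab = padded_everygram_pipeline(order, tokenized_corpus)
    lm.fit(train_ngrams, vocab)
    print(name, lm.score("news", ["breaking"]))    # P(news | breaking)
```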

Architecture

[Architecture diagram]

Analysis

N-gram model

1. The performance of the N-gram models on every platform's data increases as the order of the language model increases from 1 to 3. For each social platform, they perform worse than the LSTM-based language model.

2. We experimented with 4 types of models and found that the interpolated model with Kneser-Ney smoothing performs better than the remaining 3 for all three platforms. For the Twitter data, the Witten-Bell model gave almost the same performance as Kneser-Ney.

3. With respect to the amount of data, we find that training the N-gram based models consumes a lot of RAM and also takes a long time.

4. The quality of text generated using the interpolated models is almost identical across all three platforms, but is worse than that of the LSTM-based model.

LSTM Model

1. Since the data is huge, we had memory issues; even with 25 GB of RAM we were running out of memory. To resolve this, we used sparse categorical cross-entropy as the loss function, which uses integer labels. Before that we were using categorical cross-entropy, which required one-hot encoded labels and resulted in huge memory consumption.

2. The Adam optimizer gave better models than RMSprop and SGD.

3. For the Adam optimizer, we tried different learning rates; the best results were at 0.001.

4. We included a dropout layer with a rate of 0.1 after the last LSTM layer to avoid the overfitting we observed with our previously trained models.

5. We tried different batch sizes, but for the larger batch sizes (512, 1024) the model showed very poor convergence. Hence, we trained with a batch size of 128. A Keras sketch of this final configuration is given after this list.
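Below is a minimal Keras sketch reflecting the configuration described above (sparse categorical cross-entropy, Adam with learning rate 0.001, dropout 0.1 after the last LSTM layer, batch size 128). The vocabulary size, sequence length, layer sizes, and the random training arrays are illustrative assumptions, not the project’s actual data or hyperparameters.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam

vocab_size = 20000   # assumption: tokenizer vocabulary size
seq_len    = 20      # assumption: length of each input context

model = Sequential([
    Embedding(vocab_size, 128),                # token ids -> dense vectors
    LSTM(256, return_sequences=True),
    LSTM(256),
    Dropout(0.1),                              # dropout after the last LSTM layer
    Dense(vocab_size, activation="softmax"),   # next-word distribution
])

# Integer labels + sparse categorical cross-entropy avoid the one-hot memory blow-up.
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer=Adam(learning_rate=0.001),
    metrics=["sparse_categorical_accuracy"],
)

# X: (num_sequences, seq_len) integer token ids; y: (num_sequences,) next-token ids.
X = np.random.randint(0, vocab_size, size=(1024, seq_len))
y = np.random.randint(0, vocab_size, size=(1024,))
model.fit(X, y, batch_size=128, epochs=1)
```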

Video Tutorial