In this project, N-gram and LSTM RNN based language models are built for the Facebook, Twitter, and Instagram social media platforms. News-domain data was crawled and preprocessed for all three channels, and language models were trained using the open-source libraries ‘nltk’ and ‘keras’.
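The N-gram approach estimates the probability of a word given its preceding context from corpus counts. The project uses nltk's implementation; the sketch below illustrates the idea with a plain-Python bigram maximum-likelihood model (the function names here are illustrative, not the project's actual code).

```python
from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    """Count bigrams per preceding word to estimate P(next | prev) by MLE."""
    bigram_counts = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]  # sentence boundary markers
        for prev, nxt in zip(padded, padded[1:]):
            bigram_counts[prev][nxt] += 1
    return bigram_counts

def next_word_prob(bigram_counts, prev, nxt):
    """MLE estimate: count(prev, nxt) / count(prev, *)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

corpus = [["the", "match", "ended"], ["the", "match", "began"]]
lm = train_bigram_lm(corpus)
print(next_word_prob(lm, "the", "match"))  # 1.0: "the" is always followed by "match"
```

In practice nltk's `nltk.lm` module adds smoothing and higher-order n-grams on top of the same counting scheme.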
A single custom preprocessing script was written to clean and tokenize the raw crawled data for all three platforms.
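Social media text typically needs URLs, mentions, and hashtag markers stripped before tokenization. The project's actual script is not shown; the following is a hypothetical minimal sketch of such a cleaning step.

```python
import re

def clean_and_tokenize(text):
    """Lowercase the text, strip URLs, mentions, and hashtags, then split
    into word tokens. A hypothetical sketch, not the project's script."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)        # drop @mentions and #hashtags
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # drop punctuation and symbols
    return text.split()

print(clean_and_tokenize("Breaking: India wins! https://t.co/x #cricket @bbc"))
# ['breaking', 'india', 'wins']
```

A real pipeline would likely also handle emoji, repeated characters, and platform-specific markup, which are omitted here for brevity.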
The statistics of the preprocessed data for each platform are given below.
Twitter: 40.6 lakh (4.06 million) sequences
Instagram: 1.38 lakh (138,000) sequences, including Instagram image captions and comments
Facebook: 11.4 lakh (1.14 million) sequences, including Facebook posts and comments
The general findings of this project are:
1. The LSTM based language model performs better than the N-gram based model for all three platforms.
2. The conditioned and unconditioned text generated using
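Conditioned generation seeds the model with a prompt token and samples continuations from the learned distribution; unconditioned generation starts from the sentence-start marker instead. The sketch below illustrates this with a plain-Python bigram model (function names and the toy corpus are illustrative assumptions, not the project's code).

```python
import random
from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    """Count bigrams per preceding word (MLE bigram language model)."""
    bigram_counts = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, nxt in zip(padded, padded[1:]):
            bigram_counts[prev][nxt] += 1
    return bigram_counts

def generate(bigram_counts, seed="<s>", max_len=10, rng=None):
    """Sample a sequence conditioned on a seed token;
    seed='<s>' gives unconditioned generation."""
    rng = rng or random.Random(0)
    out, prev = [], seed
    for _ in range(max_len):
        counts = bigram_counts.get(prev)
        if not counts:          # unseen context: stop
            break
        words = list(counts)
        weights = [counts[w] for w in words]
        prev = rng.choices(words, weights=weights)[0]
        if prev == "</s>":      # end-of-sentence marker
            break
        out.append(prev)
    return out

lm = train_bigram_lm([["news", "update", "today"], ["news", "today"]])
print(generate(lm, seed="news"))
```

An LSTM generator follows the same loop, but predicts the next-word distribution from a learned hidden state rather than from raw bigram counts.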