Decoding the Covid-19 tweets using NLP and Graph Database | by Shyam Pratap Singh | Dec, 2020


Analysis Flow:

Figure: data processing flow (image source: author)

Data source: I have taken a Twitter data set related to corona from IEEE DataPort (corona_tweets_268.csv: 3,190,245 tweets, collected from December 11, 2020, 08:00 AM to December 12, 2020, 10:10 AM).

Twitter doesn’t allow full user posts to be stored and redistributed, so public data sets share only tweet IDs. Once you gather the tweet IDs, you have to use a tool like Hydrator to fetch ("hydrate") the full post information from those IDs.

Once you get the tweet information in a JSONL file, convert it to JSON, do the data cleaning and massaging in a programming language of your choice, and then import the file into Neo4j. This is one of the more tedious tasks.

Note: for the sake of simplicity, you can download twitter.json, which contains 5,000 tweets, from the GitHub link mentioned in the references.
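Once twitter.json sits in Neo4j's import folder, the bulk load (described under Text Analysis below) can be sketched with APOC roughly like this. This is a sketch, not the article's exact query: the node label `Tweets` and the field names `id_str`, `full_text`, and `created_at` are assumptions based on the standard Twitter API payload, so adjust them to your file.

```cypher
// Sketch: bulk-load tweets from the import folder with apoc.load.json
// (requires apoc.import.file.enabled=true in neo4j.conf)
CALL apoc.load.json('file:///twitter.json') YIELD value
MERGE (t:Tweets {id: value.id_str})
SET t.text = value.full_text,
    t.created_at = value.created_at;
```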

Core Tools and Libraries:

We are going to use the Neo4j database to store the tweets and do further analysis.

Plugins to be installed in neo4j:

APOC plugin
Graph Data Science plugin

GraphAware: GraphAware offers several libraries that bring NLP capability to Neo4j, so we will use these libraries.

Libraries provided by GraphAware:

stanford-english-corenlp (language model): this language model file needs to be downloaded separately.

Setup is one of the most difficult parts because of the many dependent libraries and versions.

Follow the installation steps strictly in the order mentioned in the guidelines to set up the GraphAware libraries in Neo4j.

Due to a lack of proper documentation, you may get lost finding the right libraries and versions, so here is a document I prepared to help you find the libraries and their appropriate versions.

I am using the Bloom tool for visualization, which is optional.

At the end of the setup, your neo4j.conf file should look like this:

# nlp settings
The plugins folder should contain these jar files, and the import folder should contain the twitter.json file you created earlier (you can also download it from my GitHub link).
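For orientation, here is a minimal sketch of the NLP-related part of neo4j.conf. The setting keys follow the APOC, GDS, and GraphAware installation guides, but the module index and extension path are assumptions; verify every key against the exact plugin versions you installed.

```
# nlp settings (sketch; check keys against your plugin versions)
dbms.security.procedures.unrestricted=apoc.*,gds.*,ga.nlp.*
dbms.security.procedures.whitelist=apoc.*,gds.*,ga.nlp.*
apoc.import.file.enabled=true
com.graphaware.runtime.enabled=true
com.graphaware.module.NLP.1=com.graphaware.nlp.module.NLPBootstrapper
dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware
```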

Text Analysis:

We are going to follow these steps for text analysis:

1: Bulk-upload the tweets using apoc.load.json and create the Tweets nodes
2: Create the User nodes from tweets using regex
3: Create the HashTag nodes from tweets using regex
4: Annotate the tweets using the NLP pipeline
5: Query to analyse tweets using sentiment analysis, Named Entity Recognition, etc.
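Steps 2 to 5 can be sketched in Cypher roughly as follows. These are sketches, not the article's exact queries: the property name `t.text`, the pipeline name, and the relationship types are assumptions, and the `ga.nlp.*` procedure calls and the AnnotatedText/Sentence/Tag schema follow the GraphAware NLP documentation and may differ between versions.

```cypher
// 2: User nodes from @mentions, using APOC's regex helper
MATCH (t:Tweets)
UNWIND apoc.text.regexGroups(t.text, '@(\\w+)') AS m
MERGE (u:User {screen_name: toLower(m[1])})
MERGE (u)-[:MENTIONED_IN]->(t);

// 3: HashTag nodes from #tags in the tweet text
MATCH (t:Tweets)
UNWIND apoc.text.regexGroups(t.text, '#(\\w+)') AS h
MERGE (ht:HashTag {value: toLower(h[1])})
MERGE (t)-[:HAS_TAG]->(ht);

// 4: annotate each tweet with the GraphAware NLP pipeline
MATCH (t:Tweets)
CALL ga.nlp.annotate({text: t.text, id: id(t),
                      pipeline: 'customPipeline', checkLanguage: false})
YIELD result
MERGE (t)-[:HAS_ANNOTATED_TEXT]->(result);

// 5a: sentiment — run the sentiment procedure, then count tweets
//     whose sentences were labelled Positive
MATCH (a:AnnotatedText)
CALL ga.nlp.sentiment(a) YIELD result
RETURN count(result);

MATCH (t:Tweets)-[:HAS_ANNOTATED_TEXT]->()-[:CONTAINS_SENTENCE]->(s:Positive)
RETURN count(DISTINCT t) AS positiveTweets;

// 5b: NER — most frequently mentioned organisations
MATCH (:Tweets)-[:HAS_ANNOTATED_TEXT]->()-[:CONTAINS_SENTENCE]->()
      -[:HAS_TAG]->(tag:NER_Organization)
RETURN tag.value AS organisation, count(*) AS mentions
ORDER BY mentions DESC LIMIT 10;
```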
