
Web and X Sentiment Analysis for a Banking Company


About this project

Background

It is always challenging for companies to get consolidated reports that shed light on how the company and its activities are being perceived online. Most vendor software for sentiment analysis performs social media listening but does not provide the detailed, granular data needed to make informed, data-driven business decisions.

Against this backdrop, a client in the banking sector needed to know what its customers and potential customers were saying about its products, services, and events. I was tasked with performing sentiment analysis on the business's data and delivering a proof of concept (PoC) of a working web and X sentiment analysis system.

Project Objectives

  • An implementable PoC of a working sentiment analysis system.
  • Insightful dashboards with granular data.
  • User-friendly, easily consumable, and presentable dashboards.

Tech stack

  • Twitter API v2.
  • MongoDB.
  • Databricks.
  • SQL Server.
  • Microsoft Power BI.
  • Python.
  • SQL.

Solution

I scraped X data using a Python script and wrote the data to MongoDB. The web data was scraped with a Node.js script and similarly stored in MongoDB. I then ran an X sentiment analysis Python script that read the stored data into a pandas DataFrame.
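The scraping scripts themselves are not reproduced in this write-up. A minimal sketch of the X side, assuming tweepy for the Twitter API v2 and with hypothetical credentials, query, and field names:

import tweepy
import pymongo

# hypothetical credentials and names -- not the actual project values
client = tweepy.Client(bearer_token="##")
collection = pymongo.MongoClient("mongodb+srv://admin:##")["##"]["##"]

# fetch recent tweets mentioning the bank (illustrative query)
response = client.search_recent_tweets(
    query="MyBank -is:retweet lang:en",
    tweet_fields=["created_at", "public_metrics"],
    max_results=100,
)

# store one document per tweet with the fields the analysis below expects
for tweet in response.data or []:
    collection.insert_one({
        "Tweet Text": tweet.text,
        "Favorite Count": tweet.public_metrics["like_count"],
        "Retweets": tweet.public_metrics["retweet_count"],
        "Created At": tweet.created_at,
    })

The analysis script then connects to the same MongoDB collection: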

import pymongo
from nltk.corpus import stopwords

# connect to the MongoDB collection holding the scraped data
myclient = pymongo.MongoClient("mongodb+srv://admin:##")
mydb = myclient["##"]
mycol = mydb["##"]
stop_words = set(stopwords.words('english'))

I then cleaned the data to remove user handles, punctuation, special characters, and short words, and added the total number of interactions each tweet had.
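The cleaning function below relies on two helpers that the original snippet does not show: get_tweets, which reads the stored documents from MongoDB into a pandas DataFrame, and remove_pattern, which strips every match of a regex pattern from a piece of text. Minimal sketches, assuming the collection and field names used above:

import re
import numpy as np
import pandas as pd

def get_tweets(n=5000):
    # read up to n stored tweets from the MongoDB collection into a DataFrame
    return pd.DataFrame(list(mycol.find().limit(n)))

def remove_pattern(text, pattern):
    # delete every match of the regex pattern (e.g. @handles) from the text
    for match in re.findall(pattern, text):
        text = re.sub(re.escape(match), '', text)
    return text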


def clean_tweets(a=5000):
    df_tweet = get_tweets(a)
    # remove X user handles
    df_tweet['df_Tweet'] = np.vectorize(remove_pattern)(df_tweet['Tweet Text'], r"@[\w]*")
    # remove punctuation and special characters
    df_tweet['df_Tweet'] = df_tweet['df_Tweet'].str.replace("[^a-zA-Z#]", " ", regex=True)
    # remove links
    df_tweet['df_Tweet'] = df_tweet['df_Tweet'].apply(lambda x: re.split(r'https:\/\/.*', str(x))[0])
    # remove stop words and short words, which are usually connectors with little meaning
    df_tweet['df_Tweet'] = df_tweet['df_Tweet'].apply(lambda x: ' '.join([w for w in x.split() if w not in stop_words]))
    df_tweet['df_Tweet'] = df_tweet['df_Tweet'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 2]))
    df_tweet['df_Tweet'] = df_tweet['df_Tweet'].str.lower()
    # add the total number of interactions a tweet has had
    df_tweet['interactions'] = df_tweet[['Favorite Count', 'Retweets']].sum(axis=1)
    return df_tweet.head(a)


I then wrote a function to calculate the sentiment of each tweet text, using the TextBlob library to extract polarity. The polarity scale runs from -1 to 1, where -1 is extremely negative, 0 is neutral, and 1 is extremely positive. The function below extracts this polarity for each tweet.

from textblob import TextBlob

def sentiment_calc(text):
    # return the TextBlob polarity, or None if the text cannot be parsed
    try:
        return TextBlob(text).sentiment.polarity
    except Exception:
        return None
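For intuition on the scale, a quick hypothetical check (the example tweets and exact scores are illustrative, not project data; a factual statement with no sentiment words scores 0.0, i.e. neutral):

print(sentiment_calc("Great service at the new branch!"))   # positive, e.g. ~0.8
print(sentiment_calc("The branch opens at 9 am."))          # neutral, 0.0
print(sentiment_calc("Terrible app, very disappointed."))   # negative, e.g. ~-0.9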

def get_tweet_sentiment(a=5000):
    df_twit = clean_tweets(a)
    # score the original tweet text and round the polarity to one decimal place
    df_twit['sentiment'] = round(df_twit['Tweet Text'].apply(sentiment_calc), 1)
    return df_twit.head(a)


After calculating the sentiments, I converted the pandas DataFrame to a Spark DataFrame in preparation for larger future data volumes: processing large datasets is usually slow in pandas and fast in Spark, so the script stays efficient as the data grows. I also cast the columns to strings and standardized their names to avoid conflicts when reloading the Delta tables.

from pyspark.sql.functions import col

# cast every column to string and convert the pandas DataFrame to Spark
tweets_df = spark.createDataFrame(tweets_df.astype(str))

# standardize column names: lowercase with underscores instead of spaces
cleaned_tweets_df = tweets_df.select([col(c).alias(c.replace(' ', '_').lower()) for c in tweets_df.columns])


At this stage, it is best practice to write the DataFrame directly to an analytics server. However, I was using Databricks Community Edition for this demo together with a local SQL Server instance, and I could not establish a connection between the two due to networking restrictions. I therefore exported the Delta table as a CSV and imported the file into SQL Server for further analysis.

# coalesce to a single partition so the output is a single CSV file
df_single_partition = cleaned_tweets_df.coalesce(1)
df_single_partition.write \
  .format("csv") \
  .option("header", "true") \
  .option("delimiter", ",") \
  .mode("overwrite") \
  .save("dbfs:/FileStore/tables/crdb.csv")
cleaned_tweets_df.display()

After loading the CSV into a SQL Server database, I extracted various metrics, such as topic by polarity and tweet device by location, using SQL, and saved them as views (virtual tables) for visualization. I then established a connection between SQL Server and Power BI and used the views to build the dashboards.
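The original view definitions were written directly in SQL Server and are not reproduced in this write-up. As an illustration only, one such metric view could be created from Python with pyodbc, assuming hypothetical database, table, and column names:

import pyodbc

# hypothetical connection string for the local SQL Server instance
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=sentiment_db;Trusted_Connection=yes;"
)

# aggregate tweet counts and interactions per topic and polarity score
conn.cursor().execute("""
CREATE VIEW topic_by_polarity AS
SELECT topic,
       sentiment,
       COUNT(*)          AS tweet_count,
       SUM(interactions) AS total_interactions
FROM cleaned_tweets
GROUP BY topic, sentiment
""")
conn.commit()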

NOTE:

The description given for this project uses X sentiment analysis for illustration. The same architecture was used for web sentiment analysis.
