SAP HANA Tutorial, Material and Certification Guide

Since long i am seeing various post on COVID-19 data analysis and forecasting and having the same thought i decided to post my recent work mainly on text mining & text analysis on COVID-19 tweets & data.

IDEA is to connect twitter from python and insert data into SAP HANA and utilize the built-in powerful capabilities in SAP HANA for text mining & text analysis & displaying word cloud and using few text mining functions as well.

we have seen people are expressing their views and thoughts onto twitter with different sentiments and you know we have lot of unstructured data & to dig into that is really a very difficult job hence i pick this area to try my hands on and thought to showcase few insights of it hope you will like it and please feel free to share your suggestion and feedback.

Technical flow for twitter and HANA communication – we have now python which is going to make our job easy and provides a smooth integration with SAP HANA + so many built-in python functions used for mining and analysis.

HANA ML Python APIs – For more information about HANA Machine learning API please refer SAP HANA document.

SAP ABAP Tutorial and Material, SAP ABAP Certification, SAP ABAP Exam Prep, SAP ABAP COVID-19, SAP ABAP Learning

Pre-requisites: HANA ML Installation environment + SAP HANA DB (UP and Running)

Let’s go step by step

◉ First of all we need one twitter developer account this is pretty easy and straightforward go to twitter for developer url and setup one app for yourself and save all the tokens mentioned there somewhere in notepad

◉ After successful creation of twitter account you will see below & please have them handy & save them somewhere in notepad

Twitter developer tokens

◉ Go to your conda/python command prompt and install the twitter API (there are few APIs for python install based on your comfort or like)

◉ execute the below command

pip install tweepy

## if this is successful execute the below command to check

import tweepy as twitter

◉ Initialize all the tokens , give them the value we noted from twitter developer account this is going to be used for connecting with twitter

consumer_key = 'e6QvccFkjlkdsjlksadjlkjdsa'

consumer_secret = 'aslkdjsalkdjsalkdjlsakd'

token='1384909104-2alakdlsakdlsakdlakdslkdsalkdsaladkads'

token_secret='rT0ZEUdasldkasdkjsadlkalkdjlkdajlkdsa'

◉ Start with twitter authentication if connection is successful test for downloading few tweets based on the Tags

# Authenticate to Twitter passing key and secret tokens

auth = twitter.OAuthHandler(consumer_key, consumer_secret)

auth.set_access_token(token, token_secret)

# Create API object - Wait on rate limit is true so that we can download the tweets

#without any connection break

api = twitter.API(auth, wait_on_rate_limit=True)

# Create a tweet - let' see this

api.update_status("Hi from HANA ML API to Twitter - Analysing tweets")

◉ Tweeting using python – last command will post a tweet using my account

◉ The connectivity between python and twitter is fine now – let’s try to download few tweets and see how it’s coming at client and then cleansing and some formatting will be required before sending them to SAP HANA DB

◉ Try fetching first 10 tweets based on Hash tag (COVID19) & lets what we get :

hashtag = '#COVID19'

date = '2020-06-30'

tweets = twitter.Cursor(api.search, q=hashtag,lang='en',since=date, ).items(10)

for tweet in tweets:

for i in range(len(tweet.entities.get('hashtags'))):

print(tweet.entities.get('hashtags')[i]['text'])

◉ Here we are passing date as well so it’s going to give us 10 items since 30th June & lets have a look at output.It’s up to you what parameter you choose & you are free to change Hash Tag .

◉ If you see in below image mainly people are talking about mask / air travel etc. we will see this in text analysis in detail this is just to check if we can download the tweets or not.

◉ Now execute the same command for 5K items

tweets = twitter.Cursor(api.search, q=hashtag,lang='en',since=date).items(5000)

tweet_data = [[tweet.geo,tweet.text,tweet.user.screen_name,tweet.user.location,tweet.created_at,tweet.entities['hashtags']]for tweet in tweets]

◉ This is going to download 5000 tweets for you and surely we can do some text analysis with this data – you can go for 50K items as well it’s gonna take sometime & save those 50K times somewhere in the excel-sheet , and off-course you can try more function for text mining.

Columns

◉ GEO (Geo Location)

◉ Tweet Text

◉ User ID

◉ Location

◉ Date

◉ tags

Create a dataframe for storing the tweets

import pandas as pd

tweet_df = pd.DataFrame(data=tweet_data,columns=['geo','text','user_name','loc','date','tags'])

pd.set_option('max_colwidth',800)

Define a function to fetch Hash Tags

def hash_tag(tdata):

tags = ''

for i in range(len(tdata)):

tags = tags + str(tdata[i]['text']) + ','

tags = tags.strip(',')

return tags

◉ Tweets Processing only Hash Tag column

#process tweets , to have clean data for hastags

for i in range(len(tweet_df['tags'])):

tweet_df['tags'][i] = hash_tag(tweet_df['tags'][i])

Let’s check how many records are tweet_df is having ? its (5000 X 6)

Regex function for cleaning the text

Let’s build now some regex to clean the text before any text mining and also before inserting the data into SAP HANA

import re

re.sub("RT @[\w]*:", "" , tweet_df.text[0])

Output

' BREAKING NEWS Kimberly Guilfoyle Trump Jr s girlfriend has tested positive for COVID19'

# RT and @ user name replaced with space , lets clean more

##Complete function which is going to remove @ , RT , HTTPS , New lines , # Tags , Idea is to clean the ##text.

def clean_data(tweet):

tweet = re.sub("RT @[\w]*:", "", tweet)

tweet = re.sub("@[\w]*", "", tweet)

tweet = re.sub("https://[A-Za-z0-9./]", "", tweet)

tweet = re.sub("\n", "", tweet)

tweet = re.sub("&amp", "", tweet)

tweet = re.sub("#", "", tweet)

tweet = re.sub(r"[^\w]", ' ', tweet )

return tweet

Apply the regex function and clean the text

tweet_df['text'] = tweet_df['text'].apply(lambda x: clean_data(x))

Print the text

tweet_df['text']

Output

if you look at the text i see most of things are clean but still scope is always there for improvement , i see here if i can remove the “HEY” but still HANA is going to remove it using the stop-words

Save DataFrame

Save this dataframe in your local file so that you have it for later use –

tweet_df.to_csv('test_new.csv')

Establish the connection to SAP HANA System

from hana_ml import dataframe

from data_load_utils import DataSets, Settings

url, port, user, pwd = Settings.load_config(r"C:\Users\ABC\config\abc.ini")

connection_context = dataframe.ConnectionContext(url, port, user, pwd)

print(connection_context.connection.isconnected())

##if connection is successful it will print

True

Go to your HANA System now –

◉ Choose the schema where you want to create your table for storing tweets go to sql console

CREATE COLUMN TABLE TWEETS_DATA(

"ID" INTEGER NOT NULL,

"USER_NAME" NVARCHAR(100),

"CREATED_AT" DATE,

"TEXT" NVARCHAR (140),

"HASH_TAGS" NVARCHAR (100),

"GEO" NVARCHAR(100),

"LOCATION" NVARCHAR(100),

PRIMARY KEY("ID")

);

Note Once this is done create the full text index and text mining as well for this table

CREATE FULLTEXT INDEX index_table_name ON table_name (column_name)

ASYNC

FAST PREPROCESS OFF

SEARCH ONLY OFF

TEXT ANALYSIS ON

TEXT MINING ON

TEXT MINING CONFIGURATION MYTMCONFIGURATION.textminingconfig

LANGUAGE DETECTION ('EN')

CONFIGURATION 'LINGANALYSIS_FULL'

Come back to Jupyter notebook (I am using jupyter for all the python work i have posted above)

◉ Create an empty list example data = [ ]

◉ Append all the tweets into this list and insert this data into tweet table

data = []

for indx , row in tweet_df.iterrows():

data.append([indx , row['geo'], row['text'], row['user_name'], row['loc'], row['date'], row['tags']])

sql = ('insert into ABC.TWEETS("ID", "GEO", "TEXT", "USER_NAME", "LOC", "DATE", "HASH_TAGS")' 'VALUES (?, ?, ?, ?, ?, ?, ?)')

with connection_context.connection.cursor() as cur:

rows_inserted = cur.executemany(sql, data)

print(rows_inserted)

Note – you can also use csv upload approach instead of it up to you which approach to follow

Text Mining / Text Analysis

Let’s see word cloud first from HANA –

Let’s try the same thing using Python which will be talking to SAP HANA using Dataframe.

# Start with loading all necessary libraries

import numpy as np

import pandas as pd

from os import path

from PIL import Image

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import matplotlib.pyplot as plt

% matplotlib inline

## Prepare SQL for Tokens

sql_new = """SELECT STRING_AGG( "TA_TOKEN", ',' ) AS TOKENS FROM "ABC"."$TA_TWEET_DATA" """

##Collect will fetch the data

df_new_sql = connection_context.sql(sql_new).collect()

Generate word cloud using python

Set the properties for word cloud before display

wordcloud = WordCloud(width = 800, height = 800,

background_color ='white',

min_font_size = 12).generate(df_new_sql.TOKENS[0])

# plot the WordCloud image

plt.figure(figsize = (12, 12), facecolor = None)

plt.imshow(wordcloud)

plt.axis("off")

plt.tight_layout(pad = 5)

plt.show()

Note : – you may see some bad data as well (it’s not 100 % cleaned) but yes you will see the important text mining output as well

Execute the above python code to generate word cloud

◉ if you see the word cloud – data coming directly from SAP HANA – you see the covid 19 tag off-course and addition to this you see words like “severe” , “respiratory” , “sick” , “neurological”, ” cases” , “mask” etc.

◉ Important words – banned , plasma , President

So using this way you can insert more information and for sure it’s going to give you more relevant data for displaying into word cloud.

Fetching terms and checking importance using SAP HANA (text mining) –

Function Used – GET_SUGGESTED_TERM (See SAP HANA SQL for more info)

◉ Let’s see how text mining capability of SAP HANA is going to score the probability of word “CORONA”

input_tokens

['CORONA']

◉ Call the function

##Create DataFrame to Store Result

df_terms = pd.DataFrame()

## Call TM function for to fetch score

for token in input_tokens:

df_new = GET_SUGGESTED_TERMS(token,"TEXT")

df_terms = df_terms.append(df_new)

Output

See the score for first record is 0.75 means 75% probability that this is getting used in text so ranking of this term is very high

Let’s check for different token –

input_tokens = ['trump']

Output

df_terms

Ranking for this word is really High 97 %

We can also do the analysis for sentence as well and you will see the different ranking and probability of each word in sentence.

input_text = "corona is spreading in india during lockdown as well"

◉ India is also ranked very high with 98% so you can say recent tweets are more related to india and you know people are also talking about lock-down its 45% around

Note – We have taken only 5000 tweets for analysis i am sure this can be better if we take more data , may be in my next post i will be uploading dataset from kaggle and will be playing more with TA & TM functions.

SAP HANA Central

Pages

Monday 13 July 2020

COVID-19 Tweets analysis using HANA ML Python APIs.

No comments:

Post a Comment