Friday 17 January 2020

HANA ML DataFrame: End-to-End Methods and Their Usage

A small write-up on the HANA ML DataFrame. Writing about something you learn along with your day-to-day job is really a learning, an exposure, and a knowledge-sharing process. So, holding my passion for technology in both hands, here comes my first post of 2020, and the topic is interesting enough to be everyone's favorite: SAP HANA machine learning, and it's about the DataFrame this time.

DataFrame methods, their meaning, and their Python implementation.

Let’s go step by step.

HANA ML Dataframe – A Skeleton for data

◉ Represents a frame that is backed by a database SQL statement and can also be created from a table.

◉ The SAP HANA DataFrame provides a set of methods for analyzing the data in SAP HANA without bringing the data to the client.

This module represents a database query as a dataframe. Most operations are designed not to bring data back from the database unless explicitly requested.
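This lazy-evaluation idea can be illustrated with a tiny sketch in plain Python (no HANA system needed). Note that `LazyFrame` and its methods are invented here for illustration; they are not the actual hana_ml internals:

```python
class LazyFrame:
    """Toy illustration of a SQL-backed dataframe: every method only
    rewrites the SELECT statement; nothing runs until data is requested."""

    def __init__(self, select_statement):
        self.select_statement = select_statement

    def filter(self, condition):
        # Wrap the current statement in a subquery; no data is touched yet
        return LazyFrame(
            f"SELECT * FROM ({self.select_statement}) dt WHERE {condition}"
        )

    def head(self, n):
        return LazyFrame(f"SELECT TOP {n} * FROM ({self.select_statement}) dt")


df = LazyFrame('SELECT * FROM "ADULT_DATA_FULL_TBL"')
q = df.filter("AGE > 60").head(10)
print(q.select_statement)  # only a string so far; nothing executed
```

Chaining methods only grows the SQL string; a real hana_ml dataframe works the same way, executing the statement only when results are requested.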


DataFrame Methods –

Import dataframe library from hana_ml package.

import hana_ml
from hana_ml import dataframe   ##import dataframe from HANA ML

Before diving deep into dataframes, let's first connect to the SAP HANA system and load some data into a dataframe for manipulation, filtering, slicing, etc.

from data_load_utils import DataSets , Settings
url , port , usr , pwd = Settings.load_config(r"C:\Users\abc\config\e2edata.ini")

ConnectionContext – This represents a connection to the SAP HANA system. Let's code this:

connection_context = dataframe.ConnectionContext(url, port, usr, pwd) ##pass host, port, user id, and password for the connection

connection_context.connection.isconnected()     ##returns True in the console if the connection is established

Load Data – Create a dataframe from a table. If you choose the table method, pass the table name from the corresponding schema.

dataset1 = connection_context.table("ADULT_DATA_FULL_TBL") ##Table Name of Schema 

##Another way of achieving it 

dataset1 = connection_context.sql("SELECT * FROM ADULT_DATA_FULL_TBL") ##SQL Query for dataframe 

This returns a dataframe which is just a skeleton; it doesn't hold any data yet. Most operations on dataframes are optimized to execute inside SAP HANA on the fly, which results in a huge performance benefit.

dataset1.select_statement   ## The select statement for the table which we used to create the dataframe
Output
'SELECT * FROM "ADULT_DATA_FULL_TBL"'

Execute the dataset1 Jupyter cell and the output is the HANA ML dataframe in return; it doesn't contain data as of now. We need to call the .collect() method, which executes the select statement and gives the result back to the client (in our case a Jupyter notebook or Python environment).


count() – Returns the number of records in the dataframe.

dataset1.count()

Output
48822

collect() – This triggers the select statement of the dataframe and fetches data from SAP HANA. Use it when you need the data in the Python environment to display, validate, or meet any other requirement. The design goal is to perform calculations on the fly and hand the result back to the Python environment for further processing.

example:

dataset1.head(10).collect() ## Use collect() to load data from SAP HANA into the client; this fetches the first 10 records

head(10) means the top 10 records.

Result: the first 10 records are displayed.

columns – Lists the columns in the dataset.

dataset1.columns

Output
['ID',
 'AGE',
 'WORKCLASS',
 'FNLWGT',
 'EDUCATION',
 'EDUCATIONNUM',
 'MARITALSTATUS',
 'OCCUPATION',
 'RELATIONSHIP',
 'RACE',
 'CAPITALGAIN',
 'CAPITALLOSS',
 'HOURSPERWEEK',
 'NATIVECOUNTRY',
 'INCOME']

Limit rows during display – just pass the lower and upper limits in square brackets. This is useful to get an overview of the data, like sampling.

Copy a dataframe into a new dataframe – this is for the case when you don't want to change/alter the original dataframe.

dataset3 = dataset1.head(100).collect()

dataset3 ##it will display the 100 records copied from dataset1 

Drop duplicates – drop duplicate rows from the dataset. This data-cleansing step is required when you feed the data into algorithms which need only unique entries.

dataset4 = dataset1.drop_duplicates() 

##Check the select statement for dataset4 

dataset4.select_statement 
'SELECT DISTINCT * FROM (SELECT * FROM "ADULT_DATA_TEST_TBL") AS "DT_4"'

When removing duplicates from the dataset, the on-the-fly select statement selects only the unique records from the SAP HANA database table, so your original table in the SAP HANA database remains intact.
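As a client-side analogy (runnable without a HANA system), pandas `drop_duplicates` behaves the same way: the original frame stays intact and the copy is deduplicated. The sample data here is made up for illustration:

```python
import pandas as pd

# Toy stand-in for the table data, with one duplicated row
df = pd.DataFrame({"ID": [1, 1, 2], "AGE": [30, 30, 40]})

deduped = df.drop_duplicates()

print(len(df))       # the original frame still has all 3 rows
print(len(deduped))  # the copy has the duplicate removed
```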

Drop columns from the dataset – Drop the columns which are not required. This is used to remove useless columns so that algorithms perform better and regression/classification results are more accurate.

dataset5 = dataset1.drop(["AGE"])

dataset5.select_statement      ##There is no AGE column here; this is also called slicing the data

'SELECT "ID", "WORKCLASS", "FNLWGT", "EDUCATION", "EDUCATIONNUM", "MARITALSTATUS", "OCCUPATION", "RELATIONSHIP", "RACE", "CAPITALGAIN", "CAPITALLOSS", "HOURSPERWEEK", "NATIVECOUNTRY", "INCOME" FROM (SELECT * FROM "ADULT_DATA_TEST_TBL") AS "DT_4"'

So every time you perform an operation on a dataset using any method, it prepares the corresponding select statement, executes it on the fly, and gives the result back to the client.

Select the TOP 10 records from the dataset, but for two columns only –

The select statement on the dataframe contains only two columns, and only 10 records are selected once the collect method is called.

dataset6 = dataset1.select('ID' , 'WORKCLASS').head(10)

dataset6.select_statement  ##Dynamic select statement
'SELECT TOP 10 * FROM (SELECT "ID", "WORKCLASS" FROM (SELECT * FROM "ADULT_DATA_FULL_TBL") AS "DT_6") dt'

dataset6.collect()

ID WORKCLASS
0 28 Private
1 75 Private
2 196 Private
3 264 Private
4 303 Private
5 324 Private
6 390 Private
7 442 Private
8 447 Private
9 472 Private

NAs in the dataset – Check for null values in the dataset. This is required before feeding the data into any algorithm: drop the nulls or fill them with an average value.

dataset1.hasna('CAPITALGAIN') ##Check if there is any NULL value in the CAPITALGAIN column; returns True or False

Output
True

Fill NAs with a chosen value –

dataset1 = dataset1.fillna(10 , ["CAPITALGAIN"])
dataset1.hasna('CAPITALGAIN')

Output
False
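The same check-and-fill pattern can be mimicked client-side in pandas (illustrative only; hana_ml instead pushes a COALESCE into the SQL, as the select statement below shows). The sample column values are made up:

```python
import pandas as pd

df = pd.DataFrame({"CAPITALGAIN": [100.0, None, 250.0]})

print(df["CAPITALGAIN"].isna().any())        # True: a null is present
df["CAPITALGAIN"] = df["CAPITALGAIN"].fillna(10)
print(df["CAPITALGAIN"].isna().any())        # False: no nulls after filling
```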

Operate on the dataframe and do some manipulation on fields –

A new column TWICE_AGE has been added, computed as twice the original age.

dsp = dataset1.select('ID' , 'AGE'  , ('"AGE"*2' , "TWICE_AGE"))     ##A TWICE_AGE column is added as AGE multiplied by 2

dsp.head(5).collect()

Output –
ID AGE TWICE_AGE
0 1 71 142
1 2 44 88
2 3 39 78
3 4 27 54
4 5 44 88

Check out the select statement of the above dataframe (dsp):

dsp.select_statement

Output

'SELECT "ID", "AGE", "AGE"*2 AS "TWICE_AGE" FROM 
(SELECT "ID", "AGE", "WORKCLASS", "FNLWGT", "EDUCATION", 
"EDUCATIONNUM", "MARITALSTATUS", "OCCUPATION",
"RELATIONSHIP", "RACE", COALESCE("CAPITALGAIN", 10) 
AS "CAPITALGAIN", "CAPITALLOSS", "HOURSPERWEEK", 
"NATIVECOUNTRY", "INCOME" 
FROM (SELECT * FROM "ADULT_DATA_TEST_TBL") dt) AS "DT_18"'

Data Filtration methods – 

dataset4.head(5).collect()  ## Select 5 Sample records from the dataset 

Output

ID AGE WORKCLASS FNLWGT EDUCATION EDUCATIONNUM MARITALSTATUS OCCUPATION RELATIONSHIP RACE GENDER CAPITALGAIN CAPITALLOSS HOURSPERWEEK NATIVECOUNTRY INCOME
0 41959 39 Private 286789 Doctorate 16 Married-civ-spouse Exec-managerial Husband White Male 0 0 45 United-States >50K
1 33040 22 Private 349212 Some-college 10 Never-married Other-service Own-child White Female 0 0 20 United-States <=50K
2 46142 53 Private 191389 HS-grad 9 Divorced Exec-managerial Not-in-family White Female 0 0 40 United-States <=50K
3 1319 37 Private 162424 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 45 United-States <=50K
4 35823 56 Private 200316 Some-college 10 Married-civ-spouse Adm-clerical Husband White Male 0 0

Query – We want the records having AGE greater than 60.

dataset_age = dataset4.filter('AGE > 60') ## records AGE greater than 60

dataset_age.collect()
Output:

dataset_age.count()
Output:
3606

dataset_age.select_statement
Output:
'SELECT * FROM (SELECT DISTINCT * FROM (SELECT * FROM "ADULT_DATA_FULL_TBL") 
AS "DT_6") AS "DT_20" WHERE AGE > 60'

Output: 3,606 records displayed.

Add Multiple conditions for multiple columns – 

dataset4_res = dataset4.filter('AGE > 70 AND EDUCATIONNUM = 8')
dataset4_res.select_statement

Output:
'SELECT * FROM (SELECT DISTINCT * FROM 
(SELECT * FROM "ADULT_DATA_FULL_TBL") AS "DT_6") 
AS "DT_20" 
WHERE AGE > 70 AND EDUCATIONNUM = 8'

Here multiple conditions have been added; see the WHERE clause of this dynamic select statement.

dataset4_res.count()

Output:
7

Only 7 records have AGE > 70 and EDUCATIONNUM = 8.

## Suppose we want those records where EDUCATIONNUM = 9 

dataset4.filter('EDUCATIONNUM = 9').head(5).collect()

## Suppose we want those records where AGE = 27

dataset4.filter('AGE = 27').head(5).collect()

Multiple filter conditions on a pandas dataframe –

dataset3[(dataset3['AGE'] > 27) & ( dataset3['OCCUPATION'] == 'Craft-repair')]

If you have a pandas dataframe at the client and you want to filter records at the client end only, the above filter syntax will suffice.

Output: the matching records are displayed.
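Here is a self-contained version of that client-side filter, with made-up sample data so it runs anywhere:

```python
import pandas as pd

dataset3 = pd.DataFrame({
    "AGE": [25, 35, 45],
    "OCCUPATION": ["Sales", "Craft-repair", "Craft-repair"],
})

# Each condition needs its own parentheses; combine with & (and) or | (or)
mask = (dataset3["AGE"] > 27) & (dataset3["OCCUPATION"] == "Craft-repair")
print(dataset3[mask])
```

Note the parentheses around each condition: without them, Python's operator precedence makes `&` bind before the comparisons and the expression fails.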

Slicing of Dataframe – 

Slice few columns from dataset

## Copy the original dataset into a new one 
new_dataset1 = dataset1.head(5).collect()

new_dataset1[['ID' , 'AGE']]

Output:
ID AGE
0 28 27
1 75 57
2 196 48
3 264 52
4 303 35

Copy the sliced data into a third one:
new_dataset2 = new_dataset1[['ID' , 'AGE']]

new_dataset2
Output:
ID AGE
0 28 27
1 75 57
2 196 48
3 264 52
4 303 35

Create a dataframe for a few columns only –

dataset_id_age = connection_context.sql('SELECT TOP 10 ID , AGE FROM ADULT_DATA_FULL_TBL')

dataset_id_age.collect()

Output:
ID AGE
0 28 27
1 75 57
2 196 48
3 264 52
4 303 35
5 324 24
6 390 34
7 442 49
8 447 35
9 472 42

dataset_id_age.select_statement
'SELECT TOP 10 ID , AGE FROM ADULT_DATA_FULL_TBL'

In short, a dataframe is a select statement that just waits to be triggered until it is called.

describe() – Describes the dataframe with all the statistical information.

dataset4.describe().collect() 


empty() – Returns True if this DataFrame has 0 rows

dataset2.empty()

Output:
False

has() – Returns True if a column is in the DataFrame.

dataset1.columns

Output:
['ID',
 'AGE',
 'WORKCLASS',
 'FNLWGT',
 'EDUCATION',
 'EDUCATIONNUM',
 'MARITALSTATUS',
 'OCCUPATION',
 'RELATIONSHIP',
 'RACE',
 'GENDER',
 'CAPITALGAIN',
 'CAPITALLOSS',
 'HOURSPERWEEK',
 'NATIVECOUNTRY',
 'INCOME']

dataset1.has('ID')

Output:
True

join() – Joins two datasets based on a condition like ID = ID.

condition = '{}."ID"={}."ID"'.format(dataset4.quoted_name, dataset2.quoted_name)

## Display Condition

condition 

Output:
'"DT_7"."ID"="DT_3"."ID"'

## Let's join dataset4 and dataset2 and copy the result into a new dataset7
dataset7 = dataset4.join(dataset2, condition)

dataset7.head(5).collect()
Output: the first 5 joined records are displayed.

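For intuition, the equivalent client-side join in pandas looks like this (sample data made up for illustration; the hana_ml join above runs as SQL inside HANA instead):

```python
import pandas as pd

left = pd.DataFrame({"ID": [1, 2, 3], "AGE": [40, 30, 50]})
right = pd.DataFrame({"ID": [2, 3, 4], "INCOME": ["<=50K", ">50K", "<=50K"]})

# Inner join on ID = ID, like the join condition built above
joined = left.merge(right, on="ID")
print(joined)  # only IDs 2 and 3 appear in both frames
```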

Save a dataframe with save() –

dataset4.save('#SAVEDATA')

## retrieve the saved dataframe into a new dataframe 
new_dataset4 = connection_context.table('#SAVEDATA')

##Fetch first 10 records and display 
new_dataset4.head(10).collect()

Output: 10 records displayed.

new_dataset4.select_statement   ##Data now comes from a temporary database table (or a pointer to the original table)

'SELECT * FROM "#SAVEDATA"'

AGGREGATE FUNCTIONS – one of the most important topics under dataframe methods. Calculating the average age, summing up prices, counting the items sold: such requirements led to the invention of these functions. Let's code and see how they work.
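Before the hana_ml calls, here is the idea as a client-side pandas analogy (runnable without HANA; the data and numbers are made up). Each `agg_list` triple maps to an aggregation function, a source column, and a new column name:

```python
import pandas as pd

df = pd.DataFrame({
    "WORKCLASS": ["Private", "Private", "State-gov"],
    "AGE": [30, 50, 40],
})

# Client-side equivalent of agg_list=[('AVG', 'AGE', 'AVG_AGE')], group_by='WORKCLASS'
avg_age = (
    df.groupby("WORKCLASS")["AGE"].mean().rename("AVG_AGE").reset_index()
)
print(avg_age)
```

In hana_ml the same grouping is compiled into a SQL GROUP BY and executed inside the database.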

## Suppose we need to find the average age of employees in each workclass

agg_list = [('AVG' , 'AGE' , 'AVG_AGE')] ## Aggregation function, source column, and new column name

dataset_avg_age = dataset4.agg(agg_list=agg_list , group_by='WORKCLASS').collect()

## Display the results 

dataset_avg_age

Grouped by WORKCLASS, the average age of employees is displayed here.

## Count the number of people of same age

dataset_age_count = dataset4.agg(agg_list=[('COUNT' , 'ID' , 'ID_COUNT')] , group_by='AGE').collect()

dataset_age_count 

Output:

from IPython.display import HTML
HTML(dataset_age_count.head(5).to_html())


## Maximum AGE of employees in each group

dataset4.agg([('max', 'AGE', 'MAX_AGE')], group_by='WORKCLASS').collect()

Output: the maximum AGE per WORKCLASS group is displayed.

## MIN Age of employees in each workclass

dataset4.agg([('min', 'AGE', 'MIN_AGE')], group_by='WORKCLASS').collect()

Output: the minimum AGE per WORKCLASS group is displayed.

corr() – Finds the correlation between two columns of a dataset.

dataset4.corr('ID' , 'AGE').collect()

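The pandas equivalent makes the idea easy to check locally. With made-up, perfectly linear columns the Pearson correlation comes out as 1.0:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4], "AGE": [2, 4, 6, 8]})

# Pearson correlation; perfectly linear columns correlate at 1.0
print(df["ID"].corr(df["AGE"]))
```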

Create a dataframe from pandas – creates an SAP HANA DataFrame that contains the data of the pandas dataframe.

import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}  

df = pd.DataFrame(data=d)

df_new = hana_ml.dataframe.create_dataframe_from_pandas(connection_context, df, 'pd_df', force=False, replace=True)

df   ##Pandas dataframe 

Output:
col1 col2
0 1 3
1 2 4

df_new.collect()  ##HANA ML dataframe

Output:
col1 col2
0 1 3
1 2 4
