
This post shows how to train a classification model with APL (Automated Predictive Library) using the Python package hana_ml. With APL, preprocessing is automated to some extent.

Environment

The environment is as follows.

◉ Python: 3.7.14(Google Colaboratory)

◉ HANA: Cloud Edition 2022.16

◉ APL: 2209

Python packages and their versions.

◉ hana_ml: 2.14.22091801

◉ pandas: 1.3.5

◉ scikit-learn: 1.0.2

On HANA Cloud, I activated the script server and created my users. I am not aware of any other special configuration, but I may have missed something because our HANA Cloud instance was created a long time ago.

I did not use HDI here, to keep the environment simple.

Python Script

1. Install Python packages

Install the Python package hana_ml, which is not pre-installed on Google Colaboratory.

For pandas and scikit-learn, I used the pre-installed versions.

!pip install hana_ml
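After installing, it can help to confirm which versions are actually active in the notebook, since the outputs in this post depend on them. The helper below is a minimal sketch: `installed_version` is a hypothetical name, and it assumes each package exposes a `__version__` attribute, which is common but not guaranteed.

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__, 'unknown' if it has none,
    or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


# scikit-learn is imported under the module name "sklearn".
for name in ("hana_ml", "pandas", "sklearn"):
    print(name, installed_version(name))
```

A result of None means the package is not installed in the current runtime.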

2. Import modules

Import python package modules.

import pprint

from hana_ml.algorithms.apl.apl_base import get_apl_version
from hana_ml.algorithms.apl.gradient_boosting_classification \
    import GradientBoostingBinaryClassifier
from hana_ml.algorithms.pal.partition import train_test_val_split
from hana_ml.dataframe import ConnectionContext, create_dataframe_from_pandas
from hana_ml.model_storage import ModelStorage
from hana_ml.visualizers.unified_report import UnifiedReport
import pandas as pd
from sklearn.datasets import make_classification

3. Connect to HANA Cloud

Connect to HANA Cloud and check its version.

The ConnectionContext class manages the connection to HANA. You can check the APL version with the get_apl_version function.

HOST = '<HANA HOST NAME>'
SCHEMA = USER = '<USER NAME>'
PASS = '<PASSWORD>'

conn = ConnectionContext(address=HOST, port=443, user=USER,
                         password=PASS, schema=SCHEMA)
print(conn.hana_version())

# APL.Version.ServicePack is the APL version (2209 here)
print(get_apl_version(conn))

4.00.000.00.1660640318 (fa/CE2022.16)

name value

0 APL.Version.Major 4

1 APL.Version.Minor 400

2 APL.Version.ServicePack 2209

3 APL.Version.Patch 1

4 APL.Info Automated Predictive Library

5 AFLSDK.Version.Major 2

6 AFLSDK.Version.Minor 16

7 AFLSDK.Version.Patch 0

8 AFLSDK.Info 2.16.0

9 AFLSDK.Build.Version.Major 2

10 AFLSDK.Build.Version.Minor 13

11 AFLSDK.Build.Version.Patch 0

12 AutomatedAnalytics.Version.Major 10

13 AutomatedAnalytics.Version.Minor 2209

14 AutomatedAnalytics.Version.ServicePack 1

15 AutomatedAnalytics.Version.Patch 0

16 AutomatedAnalytics.Info Automated Analytics

17 HDB.Version 4.00.000.00.1660640318

18 SQLAutoContent.Date 2022-04-19

19 SQLAutoContent.Version 4.400.2209.1

20 SQLAutoContent.Caption Automated Predictive SQL Library for Hana Cloud
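Hard-coded credentials are easy to leak from a shared notebook. As a hedged alternative, the sketch below builds the keyword arguments for ConnectionContext from environment variables. The variable names HANA_HOST, HANA_PORT, HANA_USER and HANA_PASS, and the helper name `hana_connection_kwargs`, are assumptions made for illustration and are not part of hana_ml.

```python
import os


def hana_connection_kwargs(env=os.environ):
    """Build ConnectionContext keyword arguments from environment variables.

    HANA_HOST, HANA_PORT, HANA_USER and HANA_PASS are hypothetical names.
    """
    required = ("HANA_HOST", "HANA_USER", "HANA_PASS")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    user = env["HANA_USER"]
    return {
        "address": env["HANA_HOST"],
        "port": int(env.get("HANA_PORT", "443")),  # 443 as in the script above
        "user": user,
        "password": env["HANA_PASS"],
        "schema": user,  # same convention as SCHEMA = USER
    }


# Usage, assuming the variables are set and hana_ml is installed:
# conn = ConnectionContext(**hana_connection_kwargs())
```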

4. Create test data

Create test data using scikit-learn.

There are 3 features and 1 target variable.

def make_df():
    X, y = make_classification(n_samples=1000,
                               n_features=3, n_redundant=0)
    df = pd.DataFrame(X, columns=['X1', 'X2', 'X3'])
    df['CLASS'] = y
    return df

df = make_df()
print(df)
df.info()

Here is an overview of the dataframe.

X1 X2 X3 CLASS

0 0.964229 1.995667 0.244143 1

1 -1.358062 -0.254956 0.502890 0

2 1.732057 0.261251 -2.214177 1

3 -1.519878 1.023710 -0.262691 0

4 4.020262 1.381454 -1.582143 1

.. ... ... ... ...

995 -0.247950 0.500666 -0.219276 1

996 -1.918810 0.183850 -1.448264 0

997 -0.605083 -0.491902 1.889303 0

998 -0.742692 0.265878 -0.792163 0

999 2.189423 0.742682 -2.075825 1

[1000 rows x 4 columns]

RangeIndex: 1000 entries, 0 to 999

Data columns (total 4 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 X1 1000 non-null float64

1 X2 1000 non-null float64

2 X3 1000 non-null float64

3 CLASS 1000 non-null int64

dtypes: float64(3), int64(1)

memory usage: 31.4 KB
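Because make_classification draws random data, each run of the function above produces different values. If you want repeatable results, one option is to pass a fixed random_state. The sketch below is a variant of make_df under that assumption. The `seed` parameter is my addition and is not in the original script.

```python
import pandas as pd
from sklearn.datasets import make_classification


def make_df(seed=0):
    """Create the same 3-feature test data, reproducibly for a given seed."""
    X, y = make_classification(n_samples=1000, n_features=3,
                               n_redundant=0, random_state=seed)
    df = pd.DataFrame(X, columns=['X1', 'X2', 'X3'])
    df['CLASS'] = y
    return df


df = make_df(seed=0)
# Check how balanced the two classes are before training.
print(df['CLASS'].value_counts())
```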

5. Define table and upload data

Define the HANA table and upload the data with the function “create_dataframe_from_pandas”.

The function is very useful, since it defines the table and uploads the data in a single call. Please check its options for further detail.

TRAIN_TABLE = 'PAL_TRAIN'
dfh = create_dataframe_from_pandas(conn, df, TRAIN_TABLE,
                                   schema=SCHEMA,
                                   force=True,    # True: truncate and insert
                                   replace=True)  # True: Null is replaced by 0

6. Split data into train and test dataset

Split the dataset with the function “train_test_val_split”. The function needs a key column, so I added one (named ID by default) with the function “add_id”.

train, test, _ = train_test_val_split(dfh.add_id(),
                                      testing_percentage=0.2,
                                      validation_percentage=0)
print(f'Train shape: {train.shape}, Test Shape: {test.shape}')

Train shape: [8000, 5], Test Shape: [2000, 5]
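A quick arithmetic check of the split sizes can catch a mismatch between the data you uploaded and the data you split. The shapes printed above, 8000 and 2000 rows, correspond to an 80/20 split of 10,000 rows, whereas the data set created in step 4 has 1,000 rows, for which the same percentages would give about 800 and 200. The helper below is a rough sketch. `expected_split_sizes` is a hypothetical name, and the rounding PAL actually applies is an assumption.

```python
def expected_split_sizes(n_rows, testing_percentage, validation_percentage=0.0):
    """Approximate (train, test, validation) row counts for a percentage split."""
    test_rows = round(n_rows * testing_percentage)
    val_rows = round(n_rows * validation_percentage)
    train_rows = n_rows - test_rows - val_rows
    return train_rows, test_rows, val_rows


print(expected_split_sizes(1000, 0.2))   # (800, 200, 0)
print(expected_split_sizes(10000, 0.2))  # (8000, 2000, 0)
```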

7. Training

Train a gradient boosting model with the class “GradientBoostingBinaryClassifier”. Please note that the class AutoClassifier is deprecated.

model = GradientBoostingBinaryClassifier()

model.fit(train, label='CLASS', key='ID', build_report=True)

8. Training result

8.1. Unified Report

The code below displays the model report. For the report content, which is basically the same, please see another article, “Python hana_ml: PAL Classification Training(UnifiedClassification)”.

model.generate_notebook_iframe_report()

model.generate_html_report('apl')

8.2. Score

The score function returns the mean accuracy on the given dataset.

# score: mean accuracy; it cannot output other metrics

score = model.score(test)

print(score)
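Since score only reports accuracy, other metrics have to be computed separately. The sketch below derives accuracy, precision and recall from two plain lists of labels. With hana_ml, such lists could plausibly be taken from the TRUE_LABEL and PREDICTED columns of the collected predict output. The function name `binary_metrics` and the sample labels are made up for illustration and are not results of this model.

```python
def binary_metrics(true_labels, predicted_labels, positive=1):
    """Compute accuracy, precision and recall for binary labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels for illustration only.
true_labels = [1, 0, 1, 1, 0, 0]
predicted_labels = [1, 0, 0, 1, 1, 0]
print(binary_metrics(true_labels, predicted_labels))
```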

8.3. Summary

get_summary function returns model summary.

model.get_summary().deselect('OID').collect()


8.4. Metrics

get_performance_metrics function returns metrics information.

>> pprint.pprint(model.get_performance_metrics())

{'AUC': 0.991,

'BalancedClassificationRate': 0.964590677634156,

'BalancedErrorRate': 0.03540932236584404,

'BestIteration': 69,

'ClassificationRate': 0.9646017699115044,

'CohenKappa': 0.9291813552683117,

'GINI': 0.4823,

'KS': 0.9195,

'LogLoss': 0.12414480396790141,

'PredictionConfidence': 0.991,

'PredictivePower': 0.982,

'perf_per_iteration': {'LogLoss': [0.617163,

0.554102,

0.499026,

<omit>

0.125448,

0.125588]}}
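Some of the values above can be cross-checked against each other with simple arithmetic. In this output, BalancedErrorRate equals 1 minus BalancedClassificationRate, and PredictivePower equals 2 × AUC − 1. The sketch below only verifies this for the numbers of this run. Treating these as the general APL definitions would be an assumption, so please check the APL documentation.

```python
import math

# Values copied from the get_performance_metrics output above.
metrics = {
    'AUC': 0.991,
    'BalancedClassificationRate': 0.964590677634156,
    'BalancedErrorRate': 0.03540932236584404,
    'PredictivePower': 0.982,
}

error_from_rate = 1 - metrics['BalancedClassificationRate']
power_from_auc = 2 * metrics['AUC'] - 1

print(math.isclose(error_from_rate, metrics['BalancedErrorRate'], abs_tol=1e-9))
print(math.isclose(power_from_auc, metrics['PredictivePower'], abs_tol=1e-9))
```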

8.5. Statistical Report

The get_debrief_report function returns several types of statistical reports. Please see “Statistical Reports” in the SAP HANA APL Reference Guide.

reports = ['Statistics_Partition',
           'Statistics_Variables',
           'Statistics_CategoryFrequencies',
           'Statistics_GroupFrequencies',
           'Statistics_ContinuousVariables',
           'ClassificationRegression_VariablesCorrelation',
           'ClassificationRegression_VariablesContribution',
           'ClassificationRegression_VariablesExclusion',
           'Classification_BinaryClass_ConfusionMatrix']

for report in reports:
    print('\n'+report)
    display(model.get_debrief_report(report).deselect('Oid').head(3).collect())

8.6. Indicators

The get_indicators function returns all indicators in a unified format.

model.get_indicators().collect()

8.7. Model info

The get_model_info function returns several types of reports.

for model_info in model.get_model_info():
    print('\n', model_info.source_table['TABLE_NAME'])
    display(model_info.deselect('OID').head(3).collect())

9. Predict

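You can predict with the predict function. Setting 'APL/ApplyExtraMode' to 'Individual Contributions' beforehand adds the per-feature contribution columns (gb_contrib_*) to the output.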

>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})

>> apply_out = model.predict(test)

>> print(apply_out.head(3).collect())

ID TRUE_LABEL PREDICTED gb_score_CLASS gb_contrib_X1 gb_contrib_X2 gb_contrib_X3 gb_contrib_constant_bias

0 12 0 0 2.592326 -0.222146 3.193908 -0.383197 0.003759

1 13 1 1 -4.876161 0.141867 -4.717393 -0.304394 0.003759

2 19 1 1 -4.074210 0.433828 -4.438335 -0.073464 0.003759
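In the rows above, the contribution columns appear to add up to gb_score_CLASS, up to rounding of the printed values. The sketch below checks this for the three rows shown. It is only a check on this output, not a statement of how APL defines the score in general, and `contribution_total` is a hypothetical helper name.

```python
# Rows copied from the predict output above.
rows = [
    {"ID": 12, "gb_score_CLASS": 2.592326, "gb_contrib_X1": -0.222146,
     "gb_contrib_X2": 3.193908, "gb_contrib_X3": -0.383197,
     "gb_contrib_constant_bias": 0.003759},
    {"ID": 13, "gb_score_CLASS": -4.876161, "gb_contrib_X1": 0.141867,
     "gb_contrib_X2": -4.717393, "gb_contrib_X3": -0.304394,
     "gb_contrib_constant_bias": 0.003759},
    {"ID": 19, "gb_score_CLASS": -4.074210, "gb_contrib_X1": 0.433828,
     "gb_contrib_X2": -4.438335, "gb_contrib_X3": -0.073464,
     "gb_contrib_constant_bias": 0.003759},
]


def contribution_total(row):
    """Sum all gb_contrib_* columns of one row, including the constant bias."""
    return sum(v for k, v in row.items() if k.startswith("gb_contrib_"))


for row in rows:
    total = contribution_total(row)
    print(row["ID"], round(total, 6), row["gb_score_CLASS"])
```

In each row the X2 contribution dominates, which is the kind of per-row explanation this apply mode is meant to provide.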

10. Save model

Save the model with the class “ModelStorage” and its function “save_model”.

ms = ModelStorage(conn)

# ms.clean_up()

model.name = 'My classification model name'

ms.save_model(model, if_exists='replace')

You can see the saved model.

# display(ms.list_models())

pprint.pprint(ms.list_models().to_dict())

{'CLASS': {0: 'hana_ml.algorithms.apl.gradient_boosting_classification.GradientBoostingBinaryClassifier'},

'JSON': {0: '{"model_attributes": {"name": "My classification model name", '

'"version": 1, "log_level": 8, "model_format": "bin", "language": '

'"en", "label": "CLASS", "auto_metric_sampling": false}, '

'"fit_params": {}, "artifacts": {"schema": "I348221", '

'"model_tables": ["HANAML_APL_MODELS_DEFAULT"], "library": '

'"APL"}, "pal_meta": {}}'},

'LIBRARY': {0: 'APL'},

'MODEL_REPORT': {0: None},

'MODEL_STORAGE_VER': {0: 1},

'NAME': {0: 'My classification model name'},

'SCHEDULE': {0: '{"schedule": {"status": "inactive", "schedule_time": "every '

'1 hours", "pid": null, "client": null, "connection": '

'{"userkey": "your_userkey", "encrypt": "false", '

'"sslValidateCertificate": "true"}, "hana_ml_obj": '

'"hana_ml.algorithms.pal.xx", "init_params": {}, '

'"fit_params": {}, "training_dataset_select_statement": '

'"SELECT * FROM YOUR_TABLE"}}'},

'STORAGE_TYPE': {0: 'default'},

'TIMESTAMP': {0: Timestamp('2022-09-21 08:57:33')},

'VERSION': {0: 1}}
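The JSON column in this listing is a plain JSON string, so it can be parsed with the standard json module to read individual attributes. The sketch below uses the string from the output above. In practice you would take it from the JSON column of ms.list_models().

```python
import json

# JSON string copied from the list_models output above.
model_json = (
    '{"model_attributes": {"name": "My classification model name", '
    '"version": 1, "log_level": 8, "model_format": "bin", "language": '
    '"en", "label": "CLASS", "auto_metric_sampling": false}, '
    '"fit_params": {}, "artifacts": {"schema": "I348221", '
    '"model_tables": ["HANAML_APL_MODELS_DEFAULT"], "library": '
    '"APL"}, "pal_meta": {}}'
)

meta = json.loads(model_json)
print(meta["model_attributes"]["label"])   # target column of the model
print(meta["artifacts"]["model_tables"])   # table(s) holding the saved model
```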

11. Close connection

Last but not least, close the connection.

conn.close()
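If a step in between raises an error, the final conn.close() is never reached. One way to guard against that is contextlib.closing, which calls close() on exit in every case. The sketch below uses a stand-in class, FakeConnection, so that it runs without a database. Both the class and the helper `run_with_connection` are hypothetical names, and the pattern only relies on the connection object having a close() method, as ConnectionContext does in the script above.

```python
from contextlib import closing


class FakeConnection:
    """Stand-in for ConnectionContext so the sketch runs without a database."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def run_with_connection(conn, work):
    """Run work(conn) and always close conn, even if work raises."""
    with closing(conn):
        return work(conn)


fake = FakeConnection()
result = run_with_connection(fake, lambda c: "done")
print(result, fake.closed)
```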


Monday 3 October 2022

Python hana_ml: Classification Training with APL(GradientBoostingBinaryClassifier)
