Monday 3 October 2022

Python hana_ml: Classification Training with APL(GradientBoostingBinaryClassifier)

This post shows how to train a classification model with APL (Automated Predictive Library) using the Python package hana_ml. APL automates part of the preprocessing for you.

Environment


The environment is as follows.

◉ Python: 3.7.14 (Google Colaboratory)
◉ HANA: Cloud Edition 2022.16
◉ APL: 2209

Python packages and their versions:

◉ hana_ml: 2.14.22091801
◉ pandas: 1.3.5
◉ scikit-learn: 1.0.2

On HANA Cloud, I activated the script server and created my users. I am not aware of any other special configuration, but I may have missed something, since our HANA Cloud instance was created a long time ago.

I did not use HDI here, to keep the environment simple.

Python Script


1. Install Python packages

Install the Python package hana_ml, which is not pre-installed on Google Colaboratory.

For pandas and scikit-learn, I used the pre-installed versions.

!pip install hana_ml

2. Import modules

Import the required modules.

import pprint

from hana_ml.algorithms.apl.apl_base import get_apl_version
from hana_ml.algorithms.apl.gradient_boosting_classification \
    import GradientBoostingBinaryClassifier
from hana_ml.algorithms.pal.partition import train_test_val_split
from hana_ml.dataframe import ConnectionContext, create_dataframe_from_pandas
from hana_ml.model_storage import ModelStorage
from hana_ml.visualizers.unified_report import UnifiedReport
import pandas as pd
from sklearn.datasets import make_classification

3. Connect to HANA Cloud

Connect to HANA Cloud and check its version.

The ConnectionContext class manages the connection to HANA. You can check the APL version with the get_apl_version function.

HOST = '<HANA HOST NAME>'
SCHEMA = USER = '<USER NAME>'
PASS = '<PASSWORD>'
conn = ConnectionContext(address=HOST, port=443, user=USER,
                           password=PASS, schema=SCHEMA) 
print(conn.hana_version())

# The APL version is shown in APL.Version.ServicePack
print(get_apl_version(conn))
4.00.000.00.1660640318 (fa/CE2022.16)
                                      name                                            value
0                        APL.Version.Major                                                4
1                        APL.Version.Minor                                              400
2                  APL.Version.ServicePack                                             2209
3                        APL.Version.Patch                                                1
4                                 APL.Info                     Automated Predictive Library
5                     AFLSDK.Version.Major                                                2
6                     AFLSDK.Version.Minor                                               16
7                     AFLSDK.Version.Patch                                                0
8                              AFLSDK.Info                                           2.16.0
9               AFLSDK.Build.Version.Major                                                2
10              AFLSDK.Build.Version.Minor                                               13
11              AFLSDK.Build.Version.Patch                                                0
12        AutomatedAnalytics.Version.Major                                               10
13        AutomatedAnalytics.Version.Minor                                             2209
14  AutomatedAnalytics.Version.ServicePack                                                1
15        AutomatedAnalytics.Version.Patch                                                0
16                 AutomatedAnalytics.Info                              Automated Analytics
17                             HDB.Version                           4.00.000.00.1660640318
18                     SQLAutoContent.Date                                       2022-04-19
19                  SQLAutoContent.Version                                     4.400.2209.1
20                  SQLAutoContent.Caption  Automated Predictive SQL Library for Hana Cloud
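
To avoid hard-coding credentials in a notebook, the connection parameters can be read from environment variables. The following is only a sketch: the variable names HANA_HOST, HANA_PORT, HANA_USER, HANA_PASSWORD and HANA_SCHEMA and the helper connection_kwargs_from_env are hypothetical and not part of hana_ml. The keyword names mirror the ConnectionContext call above.

```python
import os


def connection_kwargs_from_env(env=os.environ):
    """Build ConnectionContext keyword arguments from environment variables.

    The variable names used here (HANA_HOST, HANA_USER, ...) are
    hypothetical; use whatever names fit your setup.
    """
    user = env["HANA_USER"]
    return {
        "address": env["HANA_HOST"],
        "port": int(env.get("HANA_PORT", "443")),
        "user": user,
        "password": env["HANA_PASSWORD"],
        # Same default as "SCHEMA = USER" in the script above
        "schema": env.get("HANA_SCHEMA", user),
    }


# Usage in the notebook (requires hana_ml and a reachable HANA instance):
# conn = ConnectionContext(**connection_kwargs_from_env())
```

A missing required variable raises a KeyError before any connection attempt is made.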

4. Create test data

Create test data using scikit-learn.

There are 3 features and 1 target variable.

def make_df():
    X, y = make_classification(n_samples=1000, 
                               n_features=3, n_redundant=0)
    df = pd.DataFrame(X, columns=['X1', 'X2', 'X3'])
    df['CLASS'] = y
    return df

df = make_df()
print(df)
df.info()

Here is an overview of the dataframe.

           X1        X2        X3  CLASS
0    0.964229  1.995667  0.244143      1
1   -1.358062 -0.254956  0.502890      0
2    1.732057  0.261251 -2.214177      1
3   -1.519878  1.023710 -0.262691      0
4    4.020262  1.381454 -1.582143      1
..        ...       ...       ...    ...
995 -0.247950  0.500666 -0.219276      1
996 -1.918810  0.183850 -1.448264      0
997 -0.605083 -0.491902  1.889303      0
998 -0.742692  0.265878 -0.792163      0
999  2.189423  0.742682 -2.075825      1

[1000 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X1      1000 non-null   float64
 1   X2      1000 non-null   float64
 2   X3      1000 non-null   float64
 3   CLASS   1000 non-null   int64  
dtypes: float64(3), int64(1)
memory usage: 31.4 KB

5. Define table and upload data

Define a HANA table and upload the data using the function “create_dataframe_from_pandas”.

The function is very useful, since it defines the table and uploads the data in one step. Please check its options for further details.

TRAIN_TABLE = 'PAL_TRAIN'
dfh = create_dataframe_from_pandas(conn, df, TRAIN_TABLE,
                             schema=SCHEMA, 
                             force=True, # True: truncate and insert
                             replace=True) # True: Null is replaced by 0

6. Split data into train and test dataset

Split the dataset using the function “train_test_val_split”. The function needs a key column, so I added one using the function “add_id”.

train, test, _ = train_test_val_split(dfh.add_id(), 
                                      testing_percentage=0.2,
                                      validation_percentage=0)
print(f'Train shape: {train.shape}, Test Shape: {test.shape}')
Train shape: [8000, 5], Test Shape: [2000, 5]

7. Training

Train a gradient boosting model using the class “GradientBoostingBinaryClassifier”. Please note that the class AutoClassifier is deprecated.

model = GradientBoostingBinaryClassifier()
model.fit(train, label='CLASS', key='ID', build_report=True)

8. Training result

8.1. Unified Report

The code below displays the model report. For the report content, which is basically the same, please see another article, “Python hana_ml: PAL Classification Training(UnifiedClassification)”.

model.generate_notebook_iframe_report()
model.generate_html_report('apl')

8.2. Score

The score function returns the mean accuracy on the given dataset.

# score: mean accuracy. It cannot output other metrics
score = model.score(test)
print(score)
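
Mean accuracy is the fraction of rows whose predicted label equals the true label. As a minimal illustration in plain Python, independent of hana_ml (the helper mean_accuracy and the labels below are made up for this example):

```python
def mean_accuracy(true_labels, predicted_labels):
    """Fraction of rows whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one row is required")
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


# Hypothetical labels, for illustration only: 3 of 4 rows are correct
print(mean_accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```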

8.3. Summary

The get_summary function returns the model summary.

model.get_summary().deselect('OID').collect()

8.4. Metrics

The get_performance_metrics function returns the performance metrics.

pprint.pprint(model.get_performance_metrics())

{'AUC': 0.991,
 'BalancedClassificationRate': 0.964590677634156,
 'BalancedErrorRate': 0.03540932236584404,
 'BestIteration': 69,
 'ClassificationRate': 0.9646017699115044,
 'CohenKappa': 0.9291813552683117,
 'GINI': 0.4823,
 'KS': 0.9195,
 'LogLoss': 0.12414480396790141,
 'PredictionConfidence': 0.991,
 'PredictivePower': 0.982,
 'perf_per_iteration': {'LogLoss': [0.617163,
                                    0.554102,
                                    0.499026,
<omit>
                                    0.125448,
                                    0.125588]}}
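
Some of these metrics are related to each other, which makes for a quick sanity check. The sketch below uses only values copied from the output above. The first relation (error rate as the complement of the classification rate) follows from the definitions. The second (PredictivePower equal to 2 * AUC - 1) is an observation that holds for the numbers shown here.

```python
import math

# Values copied from the get_performance_metrics() output above
metrics = {
    "AUC": 0.991,
    "BalancedClassificationRate": 0.964590677634156,
    "BalancedErrorRate": 0.03540932236584404,
    "PredictivePower": 0.982,
}

# The balanced error rate is the complement of the balanced classification rate
error_check = math.isclose(
    metrics["BalancedErrorRate"],
    1 - metrics["BalancedClassificationRate"],
    abs_tol=1e-9,
)

# In this output, PredictivePower also matches 2 * AUC - 1
power_check = math.isclose(
    metrics["PredictivePower"], 2 * metrics["AUC"] - 1, abs_tol=1e-9
)

print(error_check, power_check)
```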

8.5. Statistical Report

The get_debrief_report function returns several types of statistical reports. Please see “Statistical Reports” in the SAP HANA APL Reference Guide.

reports = ['Statistics_Partition',
           'Statistics_Variables',
           'Statistics_CategoryFrequencies',
           'Statistics_GroupFrequencies',
           'Statistics_ContinuousVariables',
           'ClassificationRegression_VariablesCorrelation',
           'ClassificationRegression_VariablesContribution',
           'ClassificationRegression_VariablesExclusion',
           'Classification_BinaryClass_ConfusionMatrix']

for report in reports:
    print('\n'+report)
    display(model.get_debrief_report(report).deselect('Oid').head(3).collect())

8.6. Indicators

The get_indicators function returns all indicators in a unified format.

model.get_indicators().collect()

8.7. Model info

The get_model_info function returns several types of reports.

for model_info in model.get_model_info():
    print('\n', model_info.source_table['TABLE_NAME'])
    display(model_info.deselect('OID').head(3).collect())

9. Predict

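You can predict with the predict function. Setting 'APL/ApplyExtraMode' to 'Individual Contributions' adds the per-feature contribution columns (gb_contrib_*) to the output.

model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
apply_out = model.predict(test)
print(apply_out.head(3).collect())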

   ID  TRUE_LABEL  PREDICTED  gb_score_CLASS  gb_contrib_X1  gb_contrib_X2  gb_contrib_X3  gb_contrib_constant_bias
0  12           0          0        2.592326      -0.222146       3.193908      -0.383197                  0.003759
1  13           1          1       -4.876161       0.141867      -4.717393      -0.304394                  0.003759
2  19           1          1       -4.074210       0.433828      -4.438335      -0.073464                  0.003759
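
In the three rows printed above, the individual contributions and the constant bias add up to gb_score_CLASS, up to the rounding of the printed values. The check below is a small sketch over those rows only. It does not claim anything about rows that are not shown.

```python
# Rows copied from the predict() output above, in this column order:
# gb_score_CLASS, gb_contrib_X1, gb_contrib_X2, gb_contrib_X3,
# gb_contrib_constant_bias
rows = [
    (2.592326, -0.222146, 3.193908, -0.383197, 0.003759),
    (-4.876161, 0.141867, -4.717393, -0.304394, 0.003759),
    (-4.074210, 0.433828, -4.438335, -0.073464, 0.003759),
]

# The printed values are rounded to 6 decimals, so allow a small tolerance
TOLERANCE = 1e-5

# Difference between the score and the sum of its contributions, per row
differences = [score - sum(contribs) for score, *contribs in rows]
for diff in differences:
    print(f"{diff:+.6f}")

all_additive = all(abs(diff) <= TOLERANCE for diff in differences)
print(all_additive)
```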

10. Save model

Save the trained model with ModelStorage.

ms = ModelStorage(conn)
# ms.clean_up()
model.name = 'My classification model name'
ms.save_model(model, if_exists='replace')

You can list the saved models.

# display(ms.list_models())
pprint.pprint(ms.list_models().to_dict())

{'CLASS': {0: 'hana_ml.algorithms.apl.gradient_boosting_classification.GradientBoostingBinaryClassifier'},
 'JSON': {0: '{"model_attributes": {"name": "My classification model name", '
             '"version": 1, "log_level": 8, "model_format": "bin", "language": '
             '"en", "label": "CLASS", "auto_metric_sampling": false}, '
             '"fit_params": {}, "artifacts": {"schema": "I348221", '
             '"model_tables": ["HANAML_APL_MODELS_DEFAULT"], "library": '
             '"APL"}, "pal_meta": {}}'},
 'LIBRARY': {0: 'APL'},
 'MODEL_REPORT': {0: None},
 'MODEL_STORAGE_VER': {0: 1},
 'NAME': {0: 'My classification model name'},
 'SCHEDULE': {0: '{"schedule": {"status": "inactive", "schedule_time": "every '
                 '1 hours", "pid": null, "client": null, "connection": '
                 '{"userkey": "your_userkey", "encrypt": "false", '
                 '"sslValidateCertificate": "true"}, "hana_ml_obj": '
                 '"hana_ml.algorithms.pal.xx", "init_params": {}, '
                 '"fit_params": {}, "training_dataset_select_statement": '
                 '"SELECT * FROM YOUR_TABLE"}}'},
 'STORAGE_TYPE': {0: 'default'},
 'TIMESTAMP': {0: Timestamp('2022-09-21 08:57:33')},
 'VERSION': {0: 1}}
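
A saved model can be loaded back later with ModelStorage.load_model and applied to new data. The helper reload_and_predict below is a hypothetical wrapper written for this post, not a hana_ml function. It assumes that leaving version as None makes load_model pick the latest stored version.

```python
def reload_and_predict(model_storage, name, data, version=None):
    """Load a stored model by name and apply it to `data`.

    `model_storage` is expected to be a hana_ml ModelStorage instance and
    `data` a hana_ml DataFrame. With version=None, load_model() chooses the
    version itself (assumed here to be the latest one).
    """
    model = model_storage.load_model(name=name, version=version)
    return model.predict(data)


# Usage in the notebook (requires the objects created in the steps above):
# apply_out = reload_and_predict(ms, 'My classification model name', test)
# print(apply_out.head(3).collect())
```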
 
11. Close connection

Last but not least, close the connection.

conn.close()
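
If an earlier step raises an exception, conn.close() at the end of the script is never reached. One way to guard against that is contextlib.closing from the standard library, which calls close() on any object when the block exits. The helper run_with_connection is a hypothetical name used only for this sketch.

```python
from contextlib import closing


def run_with_connection(connect, work):
    """Open a connection, run `work(conn)`, and always close the connection.

    `connect` is any zero-argument callable returning an object with a
    close() method, for example a lambda that creates a ConnectionContext.
    """
    with closing(connect()) as conn:
        return work(conn)


# Usage in the notebook (HOST, USER, PASS as defined in step 3):
# version = run_with_connection(
#     lambda: ConnectionContext(address=HOST, port=443, user=USER, password=PASS),
#     lambda conn: conn.hana_version(),
# )
```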
