Friday 26 November 2021

Large-scale forecasting using HANA, APL and Node.js

Background

Last year I became involved in a project for a retailer based in the Netherlands that had finished construction of a new distribution center. This innovative DC is fully mechanized and operates with a minimal amount of personnel, unlike conventional distribution centers where sudden large order spikes are fulfilled by having more staff pick orders in parallel. Because of the automated machinery, the order picking process in this new DC has a fixed capacity and cannot be scaled up easily by adding more personnel on the fly. To optimize its capacity, the DC therefore also performs tasks during quiet hours.

These tasks are derived from a forecast the client must supply to the software controlling the DC. In practice a 7-day forecast needs to be provided to the DC on a daily basis.

Forecasting setup and project approach

For generating these forecasts, an SAP HANA system was loaded with a data mart of previous daily goods movements from the DC to all stores on an individual article level. The data mart contains about 5 years of history for each of the 70 stores, with an assortment of about 10,000 articles. This is quite a large dataset, but the long history is very helpful in determining seasonality patterns that occur yearly.

During the project we loosely applied the CRISP-DM project approach in an iterative fashion. This approach makes sure that phases such as data understanding, modeling and evaluation are conducted iteratively, improving upon the setup each time around.

The technology platform was based on the Automated Predictive Library (APL) on a HANA 2.0 SPS05 system, with an SAP Data Services system triggering the database stored procedures and performing ETL on the resulting forecast.


Models are trained at store/article level, which means that 700,000 models are trained during a single run. As the underlying hardware has a large number of available cores, all stores can be trained in parallel. Every model has about 150 input variables, ranging from data about the articles themselves, the stores, bank holidays and the weather to the past goods movements of that article over the preceding days. All of these variables are fed to APL, which handles the selection of the most relevant variables, removes variables that are correlated with each other, and so on. The modeling process takes about 12 hours to complete, running in parallel for all stores.

Technical challenges

The main technical challenge of this setup was optimizing the performance of both the train and apply procedures. As indicated, the system manages to train the 700,000 models in about 12 hours, which was acceptable because this process is scheduled to run weekly. It is assumed that the models retain their value during the week, as they have been trained on multiple years of data and the external circumstances will not change that quickly. However, the in-database APL apply procedure ran for about 9 hours, which would not fit in the appointed time slot between the other tasks the system performs daily.

The solution came in the form of a JavaScript-based apply engine which, as of a recent release, is bundled together with the APL. This engine performs the same calculations as the in-database apply procedure but is much faster at it, performing a single prediction in a few milliseconds. Running the code through Node.js allows the JavaScript engine to be run standalone on the command line. Moreover, Node.js allows for efficient multithreaded programming, which was extremely useful for parallelizing the workload of applying the 700,000 models. This brought the apply process down to about 20 minutes.

The high-level steps to make this setup work are as follows:

1. Train a model using the in-database APL train functions as normal

2. Export the model to JSON format using the EXPORT_APPLY_CODE built-in function

3. Extract the JSON model, together with a JSON-formatted extract of the future days the model needs to be applied on, from the database to the file system

4. Apply the model onto the future days to get the forecast

As you can see, this involves a few extra steps compared to simply calling an in-database procedure to perform a forecast within HANA. But even when taking into account the additional exports of the model and forecast data, the total runtime is roughly an order of magnitude faster than the in-database apply: the process now runs in about 45 minutes, including the time taken to move data to the file system.

Detailed technical explanation

I will now give a detailed explanation of the high-level steps so you can try a similar setup for your own forecasting scenario. You will need access to a HANA environment with APL version 2018.2 or later, using Eclipse (or HANA studio).

Training the APL model

For the brevity of this blog post I am assuming some prior knowledge of training a model using APL. There are good examples available in the APL documentation on https://help.sap.com under the CREATE_MODEL_AND_TRAIN function reference. A good set of examples is also available by downloading the APL package from the SAP support site and looking into the 'samples' folder.

In this setup I will be training a regression model with a target called TARGET_GOODS_MOVEMENTS based on a set of independent variables. For the sake of this example I will use just two: DISCOUNT_INDICATOR, a boolean (0 or 1) variable indicating whether the article is discounted, and DISCOUNT_VALUE, a continuous value indicating the applied discount. A real-world model may well use hundreds of independent variables.

This gives the following table, which we will call MOVEMENT_HISTORY_TABLE:

Variable                 Storage   Value type    Description
DATE                     Date      Continuous    Date of the goods movement
DISCOUNT_INDICATOR       Number    Ordinal       Indicates if the article is discounted
DISCOUNT_VALUE           Number    Continuous    Applied discount to the article
TARGET_GOODS_MOVEMENTS   Number    Continuous    Number of goods movements

Note that this is the structure of the historical dataset. Obviously the future dataset will not contain values for the TARGET_GOODS_MOVEMENTS column; however, you do need future information on whether the article is discounted. This needs to be available for the same time horizon on which you want to perform the forecast.
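In SQL terms the historical table could be created as follows. This is only a sketch; the exact HANA column types are my own assumption based on the storage types listed above:

create column table MOVEMENT_HISTORY_TABLE (
    "DATE" DATE,                         -- continuous date variable
    "DISCOUNT_INDICATOR" INTEGER,        -- 0 or 1
    "DISCOUNT_VALUE" DOUBLE,             -- applied discount
    "TARGET_GOODS_MOVEMENTS" INTEGER     -- the target to be predicted
);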

When calling CREATE_MODEL_AND_TRAIN make sure to output the trained model into a table called OUT_MODEL_REGRESSION.
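For reference, a minimal training call could look like the sketch below. It follows the pattern of the APL samples; the exact table type names can differ between APL versions, so treat this as a starting point rather than definitive code:

drop table FUNC_HEADER;
create table FUNC_HEADER like "SAP_PA_APL"."sap.pa.apl.base::BASE.T.FUNCTION_HEADER";

drop table CREATE_AND_TRAIN_CONFIG;
create table CREATE_AND_TRAIN_CONFIG like "SAP_PA_APL"."sap.pa.apl.base::BASE.T.OPERATION_CONFIG_EXTENDED";
insert into CREATE_AND_TRAIN_CONFIG values ('APL/ModelType', 'regression', null);

drop table VARIABLE_ROLES;
create table VARIABLE_ROLES like "SAP_PA_APL"."sap.pa.apl.base::BASE.T.VARIABLE_ROLES";
-- mark the target; the other columns default to input variables
insert into VARIABLE_ROLES ("NAME", "ROLE") values ('TARGET_GOODS_MOVEMENTS', 'target');

drop table VARIABLE_DESC;
create table VARIABLE_DESC like "SAP_PA_APL"."sap.pa.apl.base::BASE.T.VARIABLE_DESC_OID";
-- left empty on purpose: APL will then infer storage and value types itself

drop table OUT_MODEL_REGRESSION;
create table OUT_MODEL_REGRESSION like "SAP_PA_APL"."sap.pa.apl.base::BASE.T.MODEL_BIN_OID";

DO BEGIN
    header    = select * from FUNC_HEADER;
    config    = select * from CREATE_AND_TRAIN_CONFIG;
    var_desc  = select * from VARIABLE_DESC;
    var_roles = select * from VARIABLE_ROLES;

    "SAP_PA_APL"."sap.pa.apl.base::CREATE_MODEL_AND_TRAIN"(:header, :config, :var_desc, :var_roles,
        'USER_APL', 'MOVEMENT_HISTORY_TABLE', out_model, out_log, out_summary, out_indicators);

    -- store the trained model so it can be exported in the next step
    insert into OUT_MODEL_REGRESSION select * from :out_model;
END;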

Export the model to JSON format

After training, the model is stored in binary format in table OUT_MODEL_REGRESSION. It is necessary to convert it into the JSON-based format that the JavaScript apply engine can consume. The APL code to perform this conversion is below:

-- --------------------------------------------------------------------------
-- Create the input/output tables used as arguments for the APL function
-- --------------------------------------------------------------------------
drop table FUNC_HEADER;
create table FUNC_HEADER like "SAP_PA_APL"."sap.pa.apl.base::BASE.T.FUNCTION_HEADER";

drop table OUT_EXPORT_JSON_REGRESSION;
create table OUT_EXPORT_JSON_REGRESSION (
    "OID" VARCHAR(50),
    "KEY" VARCHAR(100),
    "VALUE" NCLOB
);

-- --------------------------------------------------------------------------
-- Execute the APL function using its AFL wrapper and the actual input/output tables
-- --------------------------------------------------------------------------

-- Export JSON

drop table EXPORT_CODE_CONFIG;
create table EXPORT_CODE_CONFIG like "SAP_PA_APL"."sap.pa.apl.base::BASE.T.OPERATION_CONFIG_EXTENDED";
insert into EXPORT_CODE_CONFIG values ('APL/CodeType', 'JSON',null);
insert into EXPORT_CODE_CONFIG values ('APL/CodeTarget', 'class',null);

DO BEGIN    
    header           = select * from FUNC_HEADER;            
    model_in         = select * from OUT_MODEL_REGRESSION;
    export_config    = select * from EXPORT_CODE_CONFIG;

    "SAP_PA_APL"."sap.pa.apl.base::EXPORT_APPLY_CODE"(:header, :model_in, :export_config, :out_code);

    -- store result into table
    insert into "USER_APL"."OUT_EXPORT_JSON_REGRESSION" select * from :out_code;
END;

At this point table OUT_EXPORT_JSON_REGRESSION will contain your model in JSON format. This is a human-readable format which may expose some details about the model you haven't noticed earlier. To view it in HANA studio, make sure to convert the model payload to a VARCHAR:

SELECT TO_VARCHAR(VALUE) FROM OUT_EXPORT_JSON_REGRESSION

Now right-click on the output cell, select "Copy cells" and paste the result into your favorite text editor to inspect the JSON. Note for instance that the DATE variable is automatically expanded into derived variables like "MonthOfYear", "DayOfWeek" and many more, to make a single date more descriptive. This is one of the activities that APL automates for you.

Extract the JSON format of the model and future days

At this point we are ready to extract the table records from the database into files which can be processed by the JavaScript engine. To perform a forecast we obviously need the model that was exported to JSON format in the previous step, but also one or more records representing future days on which the forecast needs to be applied.

To be able to run the next few programs you will need a Node.js runtime installed. Node.js is a scripting runtime similar to environments like Python or Perl, but using JavaScript as its language. It is in fact the same JavaScript engine as is used in Google's Chrome browser, packaged as a standalone executable so it can be used as a general-purpose programming environment.

Please refer to https://nodejs.org for instructions on installing Node.js; native binaries are available for Windows and macOS. When using Linux you should look into using your package manager.

Node.js comes with a package manager called 'npm', which is used for installing and managing the libraries or add-ons your program requires. For this example we need the SAP HANA client package to connect Node.js natively to HANA. Run 'npm install @sap/hana-client' from the working directory where your scripts are saved.

Save the code below to a file (e.g. extract-models.js) and run it using 'node extract-models.js'. The program sets up a connection to the HANA system and writes the model from the table to a single file. Note: do not forget to modify the connection parameters to match your HANA system.

var hana = require('@sap/hana-client');
var hanaStream = require('@sap/hana-client/extension/Stream');
var fs = require('fs');

var connOptions = { serverNode: 'hanasystem:32015', UID: 'USER_APL', PWD: '012345', 
   sslValidateCertificate: 'false' };

var connection = hana.createConnection();

connection.connect(connOptions, function(err) {
    if (err) {
        return console.error(err);
    }

    var stmt = connection.prepare('select * from USER_APL.OUT_EXPORT_JSON_REGRESSION');
    var rs = stmt.execQuery();
    var stream = hanaStream.createObjectStream(rs);
    var writeStream = fs.createWriteStream('model-output.json');

    stream.on('readable', function () {
        var data;
        while (null !== (data = stream.read())) {
            // write the JSON model record to file
            writeStream.write(data.VALUE);    // the model is in the VALUE column
        }
    });

    stream.on('end', function () {
        // done, flush the file and close the connection
        writeStream.end();
        connection.close();
    });
});

To extract the future days you will be forecasting, create a new file called 'extract-forecast-days.js' with the code below. Compared to the previous extraction program, the changes are the query and the record structure that is written to file.

var hana = require('@sap/hana-client');
var hanaStream = require('@sap/hana-client/extension/Stream');
var fs = require('fs');

var connOptions = { serverNode: 'hanasystem:32015', UID: 'USER_APL', PWD: '012345',
   sslValidateCertificate: 'false' };
var connection = hana.createConnection();

connection.connect(connOptions, function(err) {
    if (err) {
        return console.error(err);
    }

    var stmt = connection.prepare('select DATE, DISCOUNT_INDICATOR, DISCOUNT_VALUE ' +
        'from USER_APL.MOVEMENT_HISTORY_TABLE');   // select the future day(s) here; add a WHERE clause on DATE as needed

    var rs = stmt.execQuery();
    var stream = hanaStream.createObjectStream(rs);
    var writeStream = fs.createWriteStream('forecast-days-output.json');

    stream.on('readable', function () {
        var data;
        while (null !== (data = stream.read())) {
            // convert the row into the variable/value structure the scoring engine expects
            var struct = [
              { variable: 'DATE', value: data.DATE },
              { variable: 'DISCOUNT_INDICATOR', value: data.DISCOUNT_INDICATOR },
              { variable: 'DISCOUNT_VALUE', value: data.DISCOUNT_VALUE }
            ];

            // note: this example assumes a single future day per file
            writeStream.write(JSON.stringify(struct));
        }
    });

    stream.on('end', function () {
        // done, flush the file and close the connection
        writeStream.end();
        connection.close();
    });
});

After running the modified code you should have an additional file called 'forecast-days-output.json' alongside the 'model-output.json' file created by the previous program.
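For a single future day, the contents of 'forecast-days-output.json' will look something like this (the values shown are illustrative):

[{"variable":"DATE","value":"2021-12-01"},{"variable":"DISCOUNT_INDICATOR","value":1},{"variable":"DISCOUNT_VALUE","value":0.25}]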

Apply the model onto future days to get the forecast

The two files created in the previous steps now need to be combined to generate the forecast. The model encodes information about the expected target based on the independent variables DATE (and its derived variables), DISCOUNT_INDICATOR and DISCOUNT_VALUE. These independent variables are part of the future days in the forecasting horizon and are expected to be known upfront.

Before running the code, download the APL package from https://support.sap.com by navigating to 'Software Downloads' and searching for 'APL'. After extracting the downloaded package you will find the scoring runtime in the 'samples/runtimes/javascript' directory. Copy the files from this directory into a folder called 'lib' inside the working directory where you stored the extraction scripts.

Now save the below code into a file called ‘predict.js’:

const runtime = require("./lib/autoRuntime");
const fs = require('fs');

// read the exported model and the future day from the files created earlier
let modelDefinition = JSON.parse(fs.readFileSync('model-output.json', 'utf8'));
let predictRow = JSON.parse(fs.readFileSync('forecast-days-output.json', 'utf8'));

let forecastEngine = runtime.createEngine(modelDefinition);
let prediction = forecastEngine.getScore(predictRow);
let score = prediction["score"];

console.log('number of goods movements predicted: ' + score);

After running the code you will see a statement logging the prediction for the future day you want to forecast. These steps need to be repeated for each day in the forecasting horizon, so you will probably need to create a 'forecast-days-output.json' file for each of those days. The prediction program should then loop over those files to get the full forecast, possibly logging the results to a local file or writing them back into the HANA database, as sketched below.
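A minimal sketch of that loop, assuming one extracted file per day named 'forecast-day-1.json' through 'forecast-day-7.json' (a naming convention of my own, not produced by the scripts above), could look like this:

const runtime = require('./lib/autoRuntime');
const fs = require('fs');

// load the model once and reuse the engine for every day in the horizon
let modelDefinition = JSON.parse(fs.readFileSync('model-output.json', 'utf8'));
let forecastEngine = runtime.createEngine(modelDefinition);

let results = [];
for (let day = 1; day <= 7; day++) {
    let predictRow = JSON.parse(fs.readFileSync('forecast-day-' + day + '.json', 'utf8'));
    let prediction = forecastEngine.getScore(predictRow);
    results.push({ day: day, forecast: prediction['score'] });
}

// write the 7-day forecast to a local file; alternatively insert it back into HANA
fs.writeFileSync('forecast-results.json', JSON.stringify(results, null, 2));
console.log(results);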

Wrapup

I have shown an approach for using a model trained in HANA with APL in a standalone environment based on Node.js, by exporting the model to a dedicated format and applying it to the forecasting horizon. In our solution the end result was a huge runtime improvement over the in-database apply functions that APL provides within HANA: the apply step went down from hours to minutes. This makes the solution very useful for large-scale processing of forecasts.

Of course there are many improvements that could be made to this setup which require more in-depth knowledge of the Node.js engine and its programming model, such as managing a larger number of models or enabling multithreading to allow for parallel model scoring. The current approach gives you a baseline setup; if you are interested, there are plenty of resources available online.
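To give an idea of the multithreading direction, below is a minimal sketch using the Node.js worker_threads module. The per-store model files ('model-output-1.json' and so on) are hypothetical, and the bundled autoRuntime engine is used exactly as in predict.js above:

// worker-pool.js: distribute model scoring over worker threads (sketch)
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');
const os = require('os');

if (isMainThread) {
    // main thread: split the list of model files over one worker per CPU core
    const modelFiles = ['model-output-1.json', 'model-output-2.json'];   // hypothetical per-store model files
    const workers = os.cpus().length;
    const chunks = Array.from({ length: workers }, () => []);
    modelFiles.forEach((file, i) => chunks[i % workers].push(file));

    chunks.filter(c => c.length > 0).forEach(chunk => {
        const worker = new Worker(__filename, { workerData: chunk });
        worker.on('message', result => console.log(result));
    });
} else {
    // worker thread: load each assigned model and score the shared forecast day
    const runtime = require('./lib/autoRuntime');
    const fs = require('fs');
    const predictRow = JSON.parse(fs.readFileSync('forecast-days-output.json', 'utf8'));

    for (const file of workerData) {
        const engine = runtime.createEngine(JSON.parse(fs.readFileSync(file, 'utf8')));
        parentPort.postMessage({ model: file, forecast: engine.getScore(predictRow)['score'] });
    }
}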
