Wednesday 21 February 2018

SAP HANA Text Mining Functions – Part 1

In this blog, we’ll discuss Text Mining Functions.  Functions available to find top ranked related and relevant documents and terms.

Figure 1 shows the permutations and combinations available for doing Text Mining.


SAP HANA Mining, SAP HANA Certifications, SAP HANA Guides, SAP HANA Tutorials and Materials, SAP HANA Learning

Figure 1: Text Mining Functionality

Available Functions in SAP HANA Text Mining
  • First Block is to identify related and suggested terms. Functions available for these operations
    • TM_GET_RELATED_TERMS
    • TM_GET_SUGGESTED_TERMS
  • Second Block is to identify relevant or similar documents. Function available for this operation
    • TM_GET_RELATED_DOCUMENTS
    • TM_GET_RELEVANT_DOCUMENTS
  • Third Block is to identify relevant terms of a documents. Function available for this operation
    • TM_GET_RELEVANT_TERMS
  • Fourth Block is to categorize or classify documents. Function available for this operation
    • TM_CATEGORIZE_KNN

Document Functions

TM_GET_RELATED_DOCUMENTS

This text mining function returns the top-ranked related documents for a query document within a search request and stores these documents (including metadata) in the return table.

Syntax:

TM_GET_RELATED_DOCUMENTS
( <tm_document>
<tm_search>
<tm_return_document> )

where
<tm_document> := 
DOCUMENT { <string>  [ LANGUAGE <string> ] [ MIME TYPE <string> ]    
| (  <subquery> )   [ LANGUAGE <string> ] [ MIME TYPE <string> ]    
| IN FULLTEXT INDEX WHERE <condition> }

Either provide text as string or provide a select query or specify query document part of full text index using where clause for restriction.

<tm_search> := 
SEARCH <reference column> FROM <reference table> [ WHERE <condition> ]
[ WITH TERM TYPE <string>, ... ]

Specifies the set of reference documents by specifying <reference column> and <reference table>. Specified reference column must be of type TEXT or must have full text index. To restrict the set of reference documents to be used in calculations specify where condition. Further restriction can be introduced using WITH TERM TYPE example 'proper*','noun' in which will only consider the proper names or nouns. 

<tm_return_document> :=
RETURN      [ PRINCIPAL COMPONENTS <pc int> ]  -- output FACTORS, ROTATED_FACTORS     [ CLUSTERING [<string>] ]                                   -- output CLUSTER_LEVEL, CLUSTER_LEFT, CLUSTER_RIGHT 
[ CORRELATION ]                         -- output CORRELATIONS       
[ HIGHLIGHTED ]                                -- output HIGHLIGHTED_DOCUMENT,                                           
                                               -- HIGHLIGHTED_TERMTYPES     
TOP { <top int> | DEFAULT }   
[ <column> [as <alias>], ... ]           -- output columns out of <table> in
                                               -- <tm_search>

Principal Component keyword when specified, a principal components analysis (factor analysis) is calculated on the correlation matrix of the found documents. Factor Analysis is data reduction method. This is a method of extracting important variables (in form of components) from a large set of variables available in a data set. It extracts low dimensional set of features from a high dimensional data set with a motive to capture as much information as possible. It is always performed on a symmetric correlation or covariance matrix.  The <top int> factors will be returned as arrays in the column FACTORS of the result table and the rotated factors will be returned as ARRAYs in the column ROTATED FACTORS. Below graphic in figure 2 shows the transformation of 3-dimensional data to 2-dimensional data (from High to low dimension) using PCA.

SAP HANA Mining, SAP HANA Certifications, SAP HANA Guides, SAP HANA Tutorials and Materials, SAP HANA Learning

Figure 2 PCA Example

Clustering: If [CLUSTERING <string>] is specified, a hierarchical bottom-up cluster analysis will be performed on the found related documents.

In data mining, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters using bottom up approach.  Set of nested clusters organised as a hierarchical tree. This can be visualized as a dendrogram. Two types of Hierarchical clustering algorithms are Agglomerative and Divisive. Agglomerative starts with data points are individual clusters and at each step, merge the closest pair of clusters till the point just k clusters are formed. Decisive starts with one, all-inclusive cluster, at each step split the cluster till the point k clusters are formed. Figure 3 shows the graphic for HCA.

SAP HANA Mining, SAP HANA Certifications, SAP HANA Guides, SAP HANA Tutorials and Materials, SAP HANA Learning

Figure 3 Nested Cluster and Dendrogram

Find below the list of algorithms available –

‘SINGLE_LINKAGE’: In single linkage, we define the distance between two clusters to be the minimum distance between any single data point in the first cluster and any single data point in the second cluster. This algorithm is sensitive to noise and outliers. Figure 4 shows the graphic for single linkage algorithm.

SAP HANA Mining, SAP HANA Certifications, SAP HANA Guides, SAP HANA Tutorials and Materials, SAP HANA Learning

Figure 4: Single Linkage Nested Cluster and Dendrogram

‘COMPLETE_LINKAGE’: In complete linkage, we define the distance between two clusters to be the maximum distance between any single data point in the first cluster and any single data point in the second cluster. This is more balanced cluster with approximately equal diameter. Figure 5 shows the graphic for complete linkage algorithm.

SAP HANA Mining, SAP HANA Certifications, SAP HANA Guides, SAP HANA Tutorials and Materials, SAP HANA Learning

Figure 5: Complete Linkage Nested Cluster and Dendrogram

‘AVG_DISTANCE_WITHIN’ and ‘AVG_DISTANCE_BETWEEN’: In average linkage, we define the distance between two clusters to be the average distance between data points in the first cluster and data points in the second cluster. Figure 6 shows graphic for Average distance algorithm.

SAP HANA Mining, SAP HANA Certifications, SAP HANA Guides, SAP HANA Tutorials and Materials, SAP HANA Learning

Figure 6: Average Distance Nested Cluster and Dendrogram

‘WARD’: This method looks at cluster analysis as an analysis of variance problem, instead of using distance metrics or measures of association. As per this method, the distance between two clusters, A and B, is how much the sum of squares will increase when we merge them. Figure 7 shows graphic for Ward Algorithm.

SAP HANA Mining, SAP HANA Certifications, SAP HANA Guides, SAP HANA Tutorials and Materials, SAP HANA Learning

Figure 7: Ward Nested Cluster and Dendrogram

The result of the cluster analysis is stored in the columns CLUSTER_LEVEL, CLUSTER_LEFT, and CLUSTER_RIGHT of the result table. Correlation is the association between two variables.

Correlation keyword returns the correlation matrix between the found documents as arrays in the column CORRELATIONS of the result table.  Highlighted keyword returns the document texts with highlighted information.

First Example: In this case, we are pinning it down to one document “Federal_award_id_number = 1304684”. Input is query as part of full text index with document number which is run against the term document matrix/text mining index to fetch top 5 related/similar documents. Score depicts the similarity between the documents, higher the value more similar are the documents. Figure 8a shows the result of function TM_GET_RELATED_DOCUMENTS.

SAP HANA Mining, SAP HANA Certifications, SAP HANA Guides, SAP HANA Tutorials and Materials, SAP HANA Learning

Figure 8a: Result Set of TM_GET_RELATED_DOCUMENTS

First two top ranked documents have same score value ‘1’ means both documents exactly match( Award_abstract column has same content for both the documents). Further list shows that the documents isn’t similar and score value is reducing.

Second Example: This example does statistical analysis.  In this case, we are pinning it down to one document “Federal_award_id_number = 1304684”. Input is query as part of full text index with document number which is run against the term document matrix/text mining index to fetch top 5 related/similar documents. We have principal components, clustering algorithm and correlation matrix. Figure 8b shows the result of function TM_GET_RELATED_DOCUMENTS.

SAP HANA Mining, SAP HANA Certifications, SAP HANA Guides, SAP HANA Tutorials and Materials, SAP HANA Learning

Figure 8b: Result Set of TM_GET_RELATED_DOCUMENTS

TM_GET_RELEVANT_DOCUMENTS

This text mining function returns the top-ranked documents that are relevant to a term.

Syntax:
TM_GET_RELEVANT_DOCUMENTS ( 
<tm_term> 
<tm_search>
<tm_return_document> )

where 
<tm_term> :=   TERM <string>   [ LANGUAGE <string> ]

In this case, specify the term and language to be processed. 

<tm_search> :=   
SEARCH <reference column> FROM <reference table> [ WHERE <condition> ] [ WITH TERM TYPE <string>, ... ]
Specifies the set of reference documents in <reference column> and <reference table>. The specified column must be of type text or must have a full-text index. The set of reference documents can be restricted by WHERE <condition> or With Term Type.

<tm_return_document> :=   RETURN      
[ PRINCIPAL COMPONENTS <pc int> ] -- output FACTORS, ROTATED_FACTORS    
[ CLUSTERING [<string>] ]          -- output CLUSTER_LEVEL, CLUSTER_LEFT, 
                                        -- CLUSTER_RIGHT     
[ CORRELATION ]                    -- output CORRELATIONS       
[ HIGHLIGHTED ]                    -- output HIGHLIGHTED_DOCUMENT,                                        
                                        -- HIGHLIGHTED_TERMTYPES     
TOP { <top int> | DEFAULT }  
[ <column> [as <alias>], ... ]    -- output columns out of                                        
                                        -- <table> in <tm_search> 

For explanation of above options, refer to previous section. If specified, the options PRINCIPAL COMPONENTS, CLUSTERING, CORRELATION, and [HIGHLIGHTED] must be used in this order. TOP must always be specified as the last option.

Example: Input as “Ocean” which is run against the document matrix/text mining index to fetch top 5 relevant documents. Figure 9 shows the result set of function TM_GET_RELAVANT_DOCUMENTS.

SAP HANA Mining, SAP HANA Certifications, SAP HANA Guides, SAP HANA Tutorials and Materials, SAP HANA Learning

Figure 9: Result Set of TM_GET_RELAVANT_DOCUMENTS

No comments:

Post a Comment