Wednesday 20 May 2020

Working with Esri shape files, Python and SAP HANA

Working with SHP files is not always as easy as it seems to be.

For example, we have a lot of interesting geo-spatial capabilities embedded in SAP HANA, but in one way or another we need to ingest this kind of information to the system. Usually, this spatial information is distributed on what is known as Shape files. Taking wikipedia definition, a shape file is:

The shapefile format is a geospatial vector data format for geographic information system (GIS) software. It is developed and regulated by Esri as a mostly open specification for data interoperability among Esri and other GIS software products. The shapefile format can spatially describe vector features: points, lines, and polygons, representing, for example, water wells, rivers, and lakes. Each item usually has attributes that describe it, such as name or temperature.

So, we have a great information richness when we find a good source of shape files and the next step is to query or to process this information on SAP HANA. You may start working with this information importing it using SAP HANA Studio… or you may start importing it with a Python script.

Note: This is a very naive approach for importing spatial information using Python. You may modify it for your convenience.

In order to import this information into SAP HANA we are going to use a Python library (PySHP) to read ESRI files, and we are going to combine it with HDBCli python implementation to comunicate with SAP HANA database.

Note: We are going to use SAP HANA Express Edition as an example, but the main thing is to establish a connection with your HANA system.

We start reading our shape file with python importing pyshp library in a python script. You may want to include unicode library in case your information no-geometry data contains special characters, this is my case.

import shapefile
import unicodedata

I prefer to define a variable with file path. In case of SHP files, you actually need more than one file to read the whole information, so it is sufficient to declare path file without extension; python library will do the rest. In this example, I’m going to work with file 01m.shp, so my FILE variable will be this:

FILE = "C:/spatialdata/data/00ent"

I will use with notation and make use of fields function

with shapefile.Reader(FILE, encoding="latin1") as shp:
fields = shp.fields
print(fields)

For this file, this is the console output:

[('DeletionFlag', 'C', 1, 0), ['CVEGEO', 'C', 2, 0], ['CVE_ENT', 'C', 2, 0], ['NOMGEO', 'C', 80, 0]]

If we want to import SHP file content to HANA, we need DDL instructions. If we don’t have it a priori, we can get it from that function result. From this last output we can see every field data type. According to pyshp library, this is the meaning:

◉ “C”: Characters, text.
◉ “N”: Numbers, with or without decimals.
◉ “F”: Floats (same as “N”).
◉ “L”: Logical, for boolean True/False values.
◉ “D”: Dates.

With this, we can build a basic DDL definition. For the completeness of SAP HANA DDL definition, we need to know which is the spatial reference identifier for this data set. So, this is our python code so far:

import shapefile
import unicodedata

FILE = "C:/spatialdata/data/00ent"
ESQUEMA = 'INEGI'
TABLA = '00ent'
SRID = 6372

def setVal(field):

if field[1] == 'C':
DEF = '"'+str(field[0])+'" NVARCHAR('+str(field[2])+'),'

elif field[1] == 'D' or field[1] == 'F':
DEF = '"'+str(field[0])+'" DECIMAL('+str(field[2])+','+str(field[3])+'),'

return DEF


def getDDL(esquema, tabla, srid, fields):
DDL = 'CREATE COLUMN TABLE "'+esquema+'"."'+tabla+'" ('

for field in fields[1:]:
DDL += setVal(field)

DDL += '"SHAPE" ST_GEOMETRY('+str(srid)+'))'
return DDL


with shapefile.Reader(FILE, encoding="latin1") as shp:
fields = shp.fields
records = shp.shapeRecords()

DDL = getDDL(ESQUEMA, TABLA, SRID, fields)

print(DDL)

To get the actual records value we can use shapeRecords() function; with records value we can generate the data insertion.

def doInsert(esquema, tabla, srid, records):

for r in records:
INSERT = 'INSERT INTO "'+esquema+'"."'+tabla+'" VALUES ('
INSERT += "'{0}'".format("','".join(r.record[:]))+","

if len(r.shape.parts) > 1:
INSERT += "NEW ST_MultiPolygon('MultiPolygon("

for t in range(len(r.shape.parts)-1):
l = ["%f %f" % p for p in r.shape.points[r.shape.parts[t]:r.shape.parts[t+1]]]
INSERT += "(({0})),".format(','.join(l))

l = ["%f %f" % p for p in r.shape.points[r.shape.parts[-1]:]]
INSERT += "(({0}))".format(','.join(l))

else:
INSERT += "NEW ST_POLYGON('POLYGON("
l = ["%f %f" % p for p in r.shape.points]
INSERT += "({0})".format(','.join(l))

INSERT += ")',"+str(srid)+"))"

return INSERT

And now we have

with shapefile.Reader(FILE, encoding="latin1") as shp:
fields = shp.fields
records = shp.shapeRecords()

DDL = getDDL(ESQUEMA, TABLA, SRID, fields)
INSERT = doInsert(ESQUEMA, TABLA, SRID, records)

Let’s generate the connection to SAP HANA so we can create the table and insert every shape founded in SHP Esri file. So:

from hdbcli import dbapi
connection = dbapi.connect(
    address='hxehost',
    port= 39015,
    user='SYSTEM',
    password='PASSWORD'
)
if connection.isconnected():
print("Conexion establecida")
else:
print("Error de conexion")

For SQL statements execution we must open a cursor and execute it

connection = dbapi.connect(
    address='hxehost',
    port= 39015,
    user='SYSTEM',
    password='PASSWORD'
)
if connection.isconnected():
print("Conexion establecida")

cursor = connection.cursor()
cursor.execute(DDL)
cursor.close()
else:
print("Error de conexion")

For the INSERT operation we have to modify our previous code to execute it on every record’s iteration.

Finally this is the whole Python code:

import shapefile
import unicodedata
from hdbcli import dbapi

FILE = "C:/spatialdata/data/00ent"
ESQUEMA = 'INEGI'
TABLA = 'INEGIPRUEBA'
SRID = 6372

def setVal(field):

if field[1] == 'C':
DEF = '"'+str(field[0])+'" NVARCHAR('+str(field[2])+'),'

elif field[1] == 'D' or field[1] == 'F':
DEF = '"'+str(field[0])+'" DECIMAL('+str(field[2])+','+str(field[3])+'),'

return DEF

def getDDL(esquema, tabla, srid, fields):
DDL = 'CREATE COLUMN TABLE "'+esquema+'"."'+tabla+'" ('

for field in fields[1:]:
DDL += setVal(field)

DDL += '"SHAPE" ST_GEOMETRY('+str(srid)+'))'
return DDL

def doInsert(esquema, tabla, srid, records, cursor):

for r in records:
INSERT = 'INSERT INTO "'+esquema+'"."'+tabla+'" VALUES ('
INSERT += "'{0}'".format("','".join(r.record[:]))+","

if len(r.shape.parts) > 1:
INSERT += "NEW ST_MultiPolygon('MultiPolygon("

for t in range(len(r.shape.parts)-1):
l = ["%f %f" % p for p in r.shape.points[r.shape.parts[t]:r.shape.parts[t+1]]]
INSERT += "(({0})),".format(','.join(l))

l = ["%f %f" % p for p in r.shape.points[r.shape.parts[-1]:]]
INSERT += "(({0}))".format(','.join(l))

else:
INSERT += "NEW ST_POLYGON('POLYGON("
l = ["%f %f" % p for p in r.shape.points]
INSERT += "({0})".format(','.join(l))

INSERT += ")',"+str(srid)+"))"

cursor.execute(INSERT)

with shapefile.Reader(FILE, encoding="latin1") as shp:
fields = shp.fields
records = shp.shapeRecords()

connection = dbapi.connect(
    address='hxehost',
    port= 39015,
    user='SYSTEM',
    password='PASSWORD'
)

if connection.isconnected():
print("Conexion establecida")
cursor = connection.cursor()

DDL = getDDL(ESQUEMA, TABLA, SRID, fields)
cursor.execute(DDL)

INSERT = doInsert(ESQUEMA, TABLA, SRID, records, cursor)

cursor.close()
else:
print("Error de conexion")

You can check the imported data from any client that is able to connect to your SAP HANA instance. If you want to visualize your data, it depends on SRID that you have used, but in general you need to transform it to SRID 4326. GEOJson format is also a good way to visualize your information.

For example, you can use this site to visualize your spatial data: geojson.tools. If you run this query on imported data you will have the necessary inputs for geojson.tools:

SELECT *, SHAPE.ST_TRANSFORM(4326).ST_ASGEOJSON() FROM INEGI.INEGIPRUEBA

These are the 32 polygons for each of every state in Mexico. This is how it looks Zacatecas on the map. Coordinates are coming from the imported data.

SAP HANA Tutorial and Material, SAP HANA Learning, SAP HANA Certification, SAP HANA Exam Prep

No comments:

Post a Comment