Wednesday 28 October 2015

Environment Southland Information Management Conference

Regional councils and government agencies are increasingly under pressure to resolve data questions and discover how best to acquire, manage, collate, analyse, report and disseminate data, while managing the associated costs. Steering organisations through these complex issues requires a solid understanding of what technologies are available and the information demands of the future.

I had the great opportunity to speak at the Environment Southland Information Management Conference in Invercargill. It was a great event, well organised and very informative, and I believe I could contribute my part to the line-up and fill a few more gaps in the whole picture.
This was not a business-as-usual conference; it was obvious that the speakers took it seriously and catered their presentations to the needs of the stakeholders. The roughly 70 attendees came from regional and central government, along with visitors from research and industry.


It was great to see the emerging patterns around NZ and similar approaches to a holistic, comprehensive and modern data strategy. If you are interested, here is a link to the programme, and please see below for my slides. You can also watch the talk on YouTube.





Sunday 25 October 2015

A Spatial Data Infrastructure Approach for the Characterization of New Zealand's Groundwater Systems


I was very happy to be informed that our latest research article had been published in "Transactions in GIS". While embedded in our New Zealand SMART Aquifer Characterisation programme, it was a great joint effort whose results went beyond SMART.

Kmoch, A., Klug, H., Ritchie, A. B. H., Schmidt, J. and White, P. A. (2015), A Spatial Data Infrastructure Approach for the Characterization of New Zealand's Groundwater Systems. Transactions in GIS. doi: 10.1111/tgis.12171

It explains technical and methodological approaches to a web-based groundwater data infrastructure, uses NIWA's CLIDB and GNS's NGMP as examples, and describes the long way from stakeholder interaction, workshops and meetings towards an NZ data standard for the National Environmental Monitoring Standards (NEMS) framework - the Environmental Observation Data Profile (EODP) as a data access and transfer blueprint for NEMS.

Wiley Online Library "Transactions in GIS" Journal

The technical work with colleagues, collaborators and partners from New Zealand Crown Research Institutes (CRIs) like GNS, NIWA and Landcare Research, and from New Zealand regional councils like Horizons, Waikato (WRC), Hawke's Bay (HBRC) and Bay of Plenty (BOPRC), culminated in a draft OGC profile that is being incorporated into national standards in New Zealand (GitHub EODP). It builds on the work of the last two years, including a great review paper on OGC standards for groundwater, customising the 52°North SOS server to interchangeably encode WaterML2 time-series data alongside O&M2 through a Google Summer of Code (GSoC) programme, and then customising the SOS server's database access to demonstrate it as an adaptor on legacy databases of national significance, specifically an "NGMP/GGW-SOS" and a "CLIDB-SOS" demo through a NZ eResearch programme.
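
To make the "adaptor" idea concrete: such an SOS façade lets clients pull time series from those legacy databases through standard OGC requests. A minimal hedged sketch of a KVP call against a 52°North SOS endpoint (the host and offering identifier are placeholders, not the actual demo services):

# SOS 2.0 GetObservation via KVP, requesting a WaterML 2.0 encoded time series
$> curl "http://example.org/sos/service?service=SOS&version=2.0.0&request=GetObservation&offering=groundwater-level&responseFormat=http%3A%2F%2Fwww.opengis.net%2Fwaterml%2F2.0"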


Friday 9 October 2015

Digital Earth Conference 2015, Canada


"Towards a One-World Vision for the Blue Planet" was the slogan for the 9th Symposium of the International Society for Digital Earth (ISDE) from 5th - 9th of October, 2015, in Halifax, Nova Scotia, Canada.
At Digital Earth 2015 scientists, engineers, technologists, and environmental managers from around the world will meet to share concepts, research findings, technologies, and practical applications relating to the Digital Earth vision. *()
I presented an experimental approach towards integrating scientific legacy codes into the OGC web services framework, using the example of exposing USGS MODFLOW through a vanilla, open-source WPS (without using proprietary tools like ESRI ArcGIS Server and Arc Hydro Tools).



I was very honoured and greatly appreciated the "Best Student Poster" award. It seems I really had the right audience at ISDE.

Furthermore it was extremely informative, and in the session about "Discrete Global Grid Systems" I learned a lot about this new approach of unifying traditional raster/coverage data with an equal-area-per-pixel advantage ... one of the great shortfalls of today's popular Web Mercator projection. Spinning this further, it could be a great new approach for groundwater, geology, ocean modelling and atmospheric sciences, which would benefit from an equal-volume grid, particularly when mashing up resources from different environmental/scientific/governmental/industrial domains.

There is even a DGGS OGC Standards Working Group under way.

Sunday 20 September 2015

Archived posts from my Github pages


Disclaimer: There might be some duplication with older posts here, too :-)

Tuesday 11 August 2015

ResearchGate milestone and GSoC finale ahead

What an exciting start of this week:

Google is re-organising itself into Alphabet, Google Summer of Code 2015 announced the 'pencils down' date two weeks from now, and ResearchGate informed me about reaching 200 publication downloads.


Well, being in the 3rd year of my PhD, the number of publications, views and citations might not be outstanding, but it is continuous progress along the early-researcher path and a form of acknowledgement.

The GSoC project work with GeoTrellis and Azavea is also a great opportunity to get more involved with Big Data technologies like Apache Spark and Cassandra, and with cloud technologies. The support from the project team, notably Rob and Chris, but also the vibe on the GeoTrellis Gitter channel, is fantastic.

Although it's time to wrap up GSoC in the next few weeks, it is also the starting point for applying these new software development insights to my research in the SMART aquifer characterisation (SAC) programme, where I develop tools and web platforms for the SMART data portal. If all goes well, the SMART data portal will be overhauled by the end of this year and my PhD completed early next year :-)

Thursday 16 July 2015

Google Cloud Platform, Cassandra and Mesos

Useful gcloud and gsutil command lines from the Google Cloud Platform (GCP) SDK to get started immediately. When you sign up for the Google Cloud Platform right now, you might get a 300 USD credit to be used within the next 2 months to play with.

Before starting with the shell and later maybe also API examples, you have to create a project. The easiest way to do that is in the Developer Console. A project is like a workspace, where all used resources are billed. So for different clients or your different projects, particularly if they are going to be paid from different sources, you might want to have separate GCP projects. Alternatively, if something, particularly API orchestration, would suffer under a strong resource separation, then you better keep more within one GCP project. To get started, one GCP project is enough anyway.

So after you've created the project in the Google Developer Console, you need to enable the different Google Cloud APIs for your project, so you can manage resources from outside the web console.

Keep the console open as a reference to see how you manipulate GCP objects and instances via command line and API.
# install google cloud sdk
$> curl https://sdk.cloud.google.com | bash

# list authenticated accounts, will likely ask you to login
$> gcloud auth list

# will send you to web site to verify credentials and retrieve security token
$> gcloud auth login

# list config items, important to see you can set default compute region/zone, and project
$> gcloud config list --all

[app]
admin_host (unset)
api_host (unset)
host (unset)
hosted_registry (unset)
[component_manager]
additional_repositories (unset)
disable_update_check (unset)
fixed_sdk_version (unset)
[compute]
region (unset)
zone (unset)
[container]
cluster (unset)
[core]
account = allixender@googlemail.com
credentialed_hosted_repo_domains (unset)
disable_color (unset)
disable_prompts (unset)
disable_usage_reporting = False
project = groundwater-cloud
user_output_enabled (unset)
verbosity (unset)
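
Since the compute region/zone show as unset above, it is convenient to set defaults so that later commands can omit --project and --zone (the values are the ones used throughout this post):

# set the default project and compute zone
$> gcloud config set project groundwater-cloud
$> gcloud config set compute/zone us-central1-f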

$> gcloud compute instances --help
$> gcloud compute instances list

# with and without --project, because the default project comes from the config
$> gcloud compute instances describe --project cloud-project1 --zone us-central1-f cloud-instance1
$> gcloud compute instances describe --zone us-central1-f sparkcas1
  
$> gcloud compute instances start --zone us-central1-f sparkcas1

$> gcloud compute instances describe --zone us-central1-f sparkcas1
$> gcloud compute instances list

$> gcloud compute --project "cloud-project1" copy-files myfile.bin user1@cloud-instance1:~ --zone "us-central1-f"

$> gcloud compute --project "cloud-project1" ssh --zone "us-central1-f" "cloud-instance1"

# given that service account was created and the json file contains the key export from creation
# on GCE instances it's better to use service accounts which authenticate the instance to access other resources like 
# cloud storage buckets
$> gcloud auth activate-service-account  --key-file cloud-project1.json

# then you can easily access those buckets from the instance
$> gsutil cp gs://big-geodata1/tasmax_day_CCSM4_rcp60_r1i1p1_20060101-20401231.nc .

# or rsync like with parallel threads
$> gsutil -m rsync -r _site/ gs://www.maerchenland-rostock.de/

Create full base instance commandline

$> gcloud compute --project "groundwater-cloud" instances create "mesos1" --description "Mesos Master and Zookeeper" 
--zone "us-central1-f" --machine-type "g1-small" --network "default" 
--metadata "mesos=master" "zookeeper=master" "cassandra=seed" "startup-script=#!/bin/bash\u000asudo apt-get update && apt-get upgrade -y" 
"sshKeys=akmoch:ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDjHZb0zPCijcfSez05RcW0PhVSzMRKgqYTCeQX5gMzDxxiYoL6TKBEhoqbBxbWg35rgfjqVtgH8ViyncbYUP+XcxIe72qQakbCM0hwMRq7I/yKt3TOmXMdF2uDoY23slNaJSQA2J+CUzohFHNNymYr2SJTwRsQv/XjzCBWmu6zoaUkFZK4CTVKEiQul0T2MjKDpH/ly2l1c1R5p04m7+X1QauX67ESN2Z7u/srF/7irvGRtklPeZartVtpCQYq7NEK0pLOCTRAMZH1po1Mi+oBJt/VUeywbQY8pyzbJYxbrTRpShCfeSfZdTljth2792hQET356S9fmaBr2HuH517Z akmoch@acer1\u000aakmoch:ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDVIIG7ZCeU5SOIgjmFc9W0MmNferqg1wNgzdvuU+fAp/ZN0UY7p/5V+XQQ3jEVESO8BFHJfhmt38Xca8IwP9FLIYHFWnYUajEH941oWYcQ+5idmBRNWvgYwsU7prlMnc5BeZ+AhZLbrz08xEu1FxncipLdWByrHABCunXaXJzrpggAgJ4ey7pQqzlkWyR9oZCUVJkv/sY+Y6hKMPTvP4/QuSQ2/TLad/32+mAZkvpV+MtpldUwAih8AD6Z40DBSy4lMZvwIKmZ+LlSq2YkDH2z8U5KJTsTXudqSP2c0DSVJIHAwbPSyFjk+2aJzPqZUofVRE/Cw4uWeWRPIpkf9NZ3 akmoch@acer1" 
--maintenance-policy "MIGRATE" --scopes "https://www.googleapis.com/auth/userinfo.email" "https://www.googleapis.com/auth/devstorage.read_only" 
"https://www.googleapis.com/auth/logging.write" --tags "http-server" "https-server" "mesos-master" "zk-master" 
--image "https://www.googleapis.com/compute/v1/projects/ubuntu-os-cloud/global/images/ubuntu-1404-trusty-v20150316" 
--no-boot-disk-auto-delete --boot-disk-type "pd-standard" --boot-disk-device-name "mesos1"

# newer gcloud command versions expect the scopes and tags as comma-separated lists
# also be aware: if you want to automatically enable cloud storage (bucket) access for the instance via the project service account, change
# the scope to read_write (can only be modified when the instance is stopped or at creation time IIRC)
$> gcloud compute --project "groundwater-cloud" instances create "mesos-3" --zone "us-central1-f" --machine-type "n1-standard-2" \
--network "default" --maintenance-policy "MIGRATE" \
--scopes https://www.googleapis.com/auth/devstorage.read_write,https://www.googleapis.com/auth/logging.write \
--tags http-server,https-server --disk name=mesos-3,device-name=mesos-3,mode=rw,boot=yes
Or via the REST API:
POST https://www.googleapis.com/compute/v1/projects/groundwater-cloud/zones/us-central1-f/instances
{
  "name": "mesos1",
  "zone": "https://www.googleapis.com/compute/v1/projects/groundwater-cloud/zones/us-central1-f",
  "machineType": "https://www.googleapis.com/compute/v1/projects/groundwater-cloud/zones/us-central1-f/machineTypes/g1-small",
  "metadata": {
    "items": [
      {
        "key": "mesos",
        "value": "master"
      },
      {
        "key": "zookeeper",
        "value": "master"
      },
      {
        "key": "cassandra",
        "value": "seed"
      },
      {
        "key": "startup-script",
        "value": "#!/bin/bash\nsudo apt-get update && apt-get upgrade -y"
      },
      {
        "key": "sshKeys",
        "value": "akmoch:ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDjHZb0zPCijcfSez05RcW0PhVSzMRKgqYTCeQX5gMzDxxiYoL6TKBEhoqbBxbWg35rgfjqVtgH8ViyncbYUP+XcxIe72qQakbCM0hwMRq7I/yKt3TOmXMdF2uDoY23slNaJSQA2J+CUzohFHNNymYr2SJTwRsQv/XjzCBWmu6zoaUkFZK4CTVKEiQul0T2MjKDpH/ly2l1c1R5p04m7+X1QauX67ESN2Z7u/srF/7irvGRtklPeZartVtpCQYq7NEK0pLOCTRAMZH1po1Mi+oBJt/VUeywbQY8pyzbJYxbrTRpShCfeSfZdTljth2792hQET356S9fmaBr2HuH517Z akmoch@acer1\nakmoch:ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDVIIG7ZCeU5SOIgjmFc9W0MmNferqg1wNgzdvuU+fAp/ZN0UY7p/5V+XQQ3jEVESO8BFHJfhmt38Xca8IwP9FLIYHFWnYUajEH941oWYcQ+5idmBRNWvgYwsU7prlMnc5BeZ+AhZLbrz08xEu1FxncipLdWByrHABCunXaXJzrpggAgJ4ey7pQqzlkWyR9oZCUVJkv/sY+Y6hKMPTvP4/QuSQ2/TLad/32+mAZkvpV+MtpldUwAih8AD6Z40DBSy4lMZvwIKmZ+LlSq2YkDH2z8U5KJTsTXudqSP2c0DSVJIHAwbPSyFjk+2aJzPqZUofVRE/Cw4uWeWRPIpkf9NZ3 akmoch@acer1"
      }
    ]
  },
  "tags": {
    "items": [
      "http-server",
      "https-server",
      "mesos-master",
      "zk-master"
    ]
  },
  "disks": [
    {
      "type": "PERSISTENT",
      "boot": true,
      "mode": "READ_WRITE",
      "deviceName": "mesos1",
      "autoDelete": false,
      "initializeParams": {
        "sourceImage": "https://www.googleapis.com/compute/v1/projects/ubuntu-os-cloud/global/images/ubuntu-1404-trusty-v20150316",
        "diskType": "https://www.googleapis.com/compute/v1/projects/groundwater-cloud/zones/us-central1-f/diskTypes/pd-standard",
        "diskSizeGb": "10"
      }
    }
  ],
  "canIpForward": false,
  "networkInterfaces": [
    {
      "network": "https://www.googleapis.com/compute/v1/projects/groundwater-cloud/global/networks/default",
      "accessConfigs": [
        {
          "name": "External NAT",
          "type": "ONE_TO_ONE_NAT"
        }
      ]
    }
  ],
  "description": "Mesos Master and Zookeeper",
  "scheduling": {
    "preemptible": false,
    "onHostMaintenance": "MIGRATE",
    "automaticRestart": true
  },
  "serviceAccounts": [
    {
      "email": "default",
      "scopes": [
        "https://www.googleapis.com/auth/userinfo.email",
        "https://www.googleapis.com/auth/devstorage.read_only",
        "https://www.googleapis.com/auth/logging.write"
      ]
    }
  ]
}

Little exercise on Zookeeper and Mesos cluster


Just to be prepared to build anything along the lines of Mesos, Zookeeper, Cassandra, Spark, GeoTrellis, GDAL …

$> sudo apt-get -y install libgdal-java libgdal-dev gdal-bin netcdf-bin libnetcdf-dev openjdk-7-jdk git build-essential autoconf automake \
libtool zlib1g-dev swig ant libstdc++6-4.6-dev libstdc++5 libstdc++-4.8-dev \
libc++-dev ruby-dev make autoconf nodejs nodejs-legacy python-dev python-boto libcurl4-nss-dev libsasl2-dev maven libapr1-dev libsvn-dev nginx

This can be factored out into e.g. Ansible roles or the like.

# Sourcing out the main install packages to a central place in the project.
$> gsutil -m cp -e -c apache-cassandra-2.1.2-bin.tar.gz apache-cassandra-2.1.5-bin.tar.gz cassandra-mesos-2.1.2-1.tgz jdk-8u45-linux-x64.gz jdk-8u45-linux-i586.gz jdk-7u80-linux-x64.tgz jdk-7u80-linux-i586.tgz gs://install-swd5ef6/

Those install files can now be pulled onto the machine:

$> gsutil cp gs://install-swd5ef6/{cassandra-mesos-2.1.2-1.tgz,apache-cassandra-2.1.2-bin.tar.gz,jdk-7u80-linux-x64.tgz,mesos-0.22.1.tar.gz,spark-1.2.2-bin-hadoop2.4.tgz} .

Zookeeper basics


$> sudo apt-get -y install zookeeper zookeeper-bin zookeeperd libzookeeper-java

# edit /etc/zookeeper/conf/myid
# set up aliases for the cluster members in zoo.cfg and /etc/hosts (or via DNS; this can always be made more automatic, certainly)
# restart services
# check status
$> echo stat | nc localhost 2181 | grep Mode
Mode: follower

# or leader, of course, depending on the node
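
As a hedged sketch, the zoo.cfg aliases mentioned above boil down to a few ensemble lines plus a per-node id (hostnames mesos1 to mesos3 are assumed from this walkthrough; 2888/3888 are the ZooKeeper defaults):

# append the ensemble members to the stock config
$> sudo tee -a /etc/zookeeper/conf/zoo.cfg <<'EOF'
server.1=mesos1:2888:3888
server.2=mesos2:2888:3888
server.3=mesos3:2888:3888
EOF
# each node's myid must contain its own number, e.g. on mesos1:
$> echo 1 | sudo tee /etc/zookeeper/conf/myid
$> sudo service zookeeper restart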
# extract mesos, spark, cassandra and the jdk, and create ENV vars and export PATHs
$> vi .bash_profile
JAVA_HOME=/usr/lib/jvm/jdk1.7.0_80
JRE_HOME=$JAVA_HOME/jre
PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH

export JAVA_HOME
export JRE_HOME
export PATH

MESOS_HOME=/home/akmoch/mesos-0.22.1
SPARK_HOME=/home/akmoch/spark-1.2.2-bin-hadoop2.4
CASSANDRA_HOME=/home/akmoch/cassandra-mesos-2.1.2-1

export MESOS_HOME
export SPARK_HOME
export CASSANDRA_HOME

Mesos basics


Gotta install it for Ubuntu?!
  • unzip, cd into it and run ./bootstrap
  • mkdir build, cd build and run ../configure
  • make
… oh wait, better to get a binary install from Mesosphere via the apt package
# Setup
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
CODENAME=$(lsb_release -cs)

# Add the repository
echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
  sudo tee /etc/apt/sources.list.d/mesosphere.list
sudo apt-get -y update
sudo apt-get -y install mesos marathon
On the one test master start the mesos-master, and on the slaves start the mesos-slaves. Through their local zookeeper instances they'll find the master. For Marathon, just run sudo service marathon start on all nodes; it'll negotiate through zookeeper and the mesos-master.
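
With the Mesosphere packages the glue is the ZooKeeper URL; a minimal sketch, assuming the hostnames used in this post:

# point masters and slaves at the zookeeper ensemble (this file is read by the init scripts)
$> echo "zk://mesos1:2181,mesos2:2181,mesos3:2181/mesos" | sudo tee /etc/mesos/zk
$> sudo service mesos-master start   # on the master node only
$> sudo service mesos-slave start    # on each slave node
$> sudo service marathon start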

Create image of a base install


$> gcloud compute --project "groundwater-cloud" images create "my-mesos-base-image" --description "ubu 14.04 LTS, updated, zookeeper installed, jdk available, mesos, cassandra and spark packages on disk" --source-disk https://www.googleapis.com/compute/v1/projects/groundwater-cloud/zones/us-central1-f/disks/mesos1

# and fire up 3 instances from that image
$> gcloud compute --project "groundwater-cloud" instances create "mesos-2" --description "Basic Mesos and Zookeeper" --zone "us-central1-f" 
--machine-type "g1-small" --network "default" --metadata "mesos=master,zookeeper=master,cassandra=seed,sshKeys=akmoch:ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDjHZb0zPCijcfSez05RcW0PhVSzMRKgqYTCeQX5gMzDxxiYoL6TKBEhoqbBxbWg35rgfjqVtgH8ViyncbYUP+XcxIe72qQakbCM0hwMRq7I/yKt3TOmXMdF2uDoY23slNaJSQA2J+CUzohFHNNymYr2SJTwRsQv/XjzCBWmu6zoaUkFZK4CTVKEiQul0T2MjKDpH/ly2l1c1R5p04m7+X1QauX67ESN2Z7u/srF/7irvGRtklPeZartVtpCQYq7NEK0pLOCTRAMZH1po1Mi+oBJt/VUeywbQY8pyzbJYxbrTRpShCfeSfZdTljth2792hQET356S9fmaBr2HuH517Z akmoch@acer1\u000aakmoch:ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDVIIG7ZCeU5SOIgjmFc9W0MmNferqg1wNgzdvuU+fAp/ZN0UY7p/5V+XQQ3jEVESO8BFHJfhmt38Xca8IwP9FLIYHFWnYUajEH941oWYcQ+5idmBRNWvgYwsU7prlMnc5BeZ+AhZLbrz08xEu1FxncipLdWByrHABCunXaXJzrpggAgJ4ey7pQqzlkWyR9oZCUVJkv/sY+Y6hKMPTvP4/QuSQ2/TLad/32+mAZkvpV+MtpldUwAih8AD6Z40DBSy4lMZvwIKmZ+LlSq2YkDH2z8U5KJTsTXudqSP2c0DSVJIHAwbPSyFjk+2aJzPqZUofVRE/Cw4uWeWRPIpkf9NZ3 akmoch@acer1" 
--maintenance-policy "MIGRATE" --scopes "https://www.googleapis.com/auth/userinfo.email,https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write" 
--tags "http-server,https-server,mesos-master,zk-master" --image "https://www.googleapis.com/compute/v1/projects/groundwater-cloud/global/images/my-mesos-base-image" 
--no-boot-disk-auto-delete --boot-disk-type "pd-standard" --boot-disk-device-name "mesos-2-hdd"
TBC …

More automation links for advanced playgrounds

  • most Mesos automation tools target either Mesosphere, local Vagrant and VirtualBox, or EC2
  • Playa Mesos helps you quickly create Apache Mesos test environments. This project relies on VirtualBox, Vagrant, and an Ubuntu box image which has Mesos and Marathon pre-installed.
  • vagrant-mesos Spin up your Mesos cluster with Vagrant! (Both Virtualbox and AWS are supported.)

Mesos and Cassandra


Mesosphere folks currently recommend deployment via Marathon and
provide a marathon.json with a job description to deploy Cassandra. Their Mesos Cassandra framework is supposedly intelligent enough to bootstrap the Cassandra cluster, which does initially need known and available seed nodes. Further nodes can subsequently be added to the cluster without being seeds, because the maximum recommended number of seed nodes is only 3.

Ok, it's almost that simple; the devil is in the details.


Vanilla Cassandra bootstrap cassandra.yaml

This is a vanilla cassandra.yaml config file. The most important details here are the listen_address / rpc_address and seed parameters. Also, all nodes in the cluster should share the same cluster_name. You want to bind the native transport and rpc ports on a networked interface, not on localhost. And in vanilla Cassandra the seed nodes (1 to max 3) need to be defined beforehand and, caveat, they apparently need to be IP addresses, not hostnames. However, I set listen_address and rpc_address to the actual resolvable DNS hostname, which could alternatively be achieved by setting listen_interface to eth0, for example.

cluster_name: 'Google Cluster'
num_tokens: 256
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000 # 3 hours
hinted_handoff_throttle_in_kb: 1024
max_hints_delivery_threads: 2
batchlog_replay_throttle_in_kb: 1024
authenticator: AllowAllAuthenticator
authorizer: AllowAllAuthorizer
permissions_validity_in_ms: 2000
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
disk_failure_policy: stop
commit_failure_policy: stop
key_cache_size_in_mb:
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0
counter_cache_size_in_mb:
counter_cache_save_period: 7200
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.240.1.54"
concurrent_reads: 32
concurrent_writes: 32
concurrent_counter_writes: 32
memtable_allocation_type: heap_buffers
index_summary_capacity_in_mb:
index_summary_resize_interval_in_minutes: 60
trickle_fsync: false
trickle_fsync_interval_in_kb: 10240
storage_port: 7000
ssl_storage_port: 7001
listen_address: mesos-2
start_native_transport: true
native_transport_port: 9042
start_rpc: true
rpc_address: mesos-2
rpc_port: 9160
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
incremental_backups: false
snapshot_before_compaction: false
auto_snapshot: true
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000
column_index_size_in_kb: 64
batch_size_warn_threshold_in_kb: 5
compaction_throughput_mb_per_sec: 16
sstable_preemptive_open_interval_in_mb: 50
read_request_timeout_in_ms: 5000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 2000
counter_write_request_timeout_in_ms: 5000
cas_contention_timeout_in_ms: 1000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
cross_node_timeout: false
endpoint_snitch: SimpleSnitch
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 0.1
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
server_encryption_options:
    internode_encryption: none
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra
client_encryption_options:
    enabled: false
    keystore: conf/.keystore
    keystore_password: cassandra
internode_compression: all
inter_dc_tcp_nodelay: false
Alternatively, as explained, to avoid setting a different hostname in each node's yaml:
listen_address:
listen_interface: eth0
rpc_address:
rpc_interface: eth0

And, to be clear, each node's Cassandra installation (maybe already included in a base image) should of course already exist and be identical.

For comparison, the mesos-cassandra pre-packaged cassandra.yaml config file. Most obvious are the path name and seed node placeholders. Also, for some reason rpc_address is set to localhost and listen_address is left empty, which should result in either getting the right address via hostname, or localhost. Initially this combination is ok for a single-node test, but in a multi-node cluster all nodes should be able to talk to each other (gossip) and be reachable from Spark and other applications via the native transport port 9042.

The mesos-cassandra package does in fact not need to be pre-installed. Rather, the package with the customised mesos.yaml and cassandra.yaml should reside somewhere web/network-accessible, so the Mesos scheduler can pull and deploy it on the nodes it designates.

Problematic, or rather deceiving, is the initially provided mesos.yaml and cassandra.yaml config:

# mesos.yaml
mesos.executor.uri: 'http://downloads.mesosphere.io/cassandra/cassandra-mesos-2.1.2-1.tgz'
mesos.master.url: 'zk://localhost:2181/mesos'
state.zk: 'localhost:2181'
java.library.path: '/usr/local/lib/libmesos.so'
cassandra.noOfHwNodes: 1
cassandra.minNoOfSeedNodes: 1
resource.cpus: 1.0
resource.mem: 2048
resource.disk: 2000

# cassandra.yaml
cluster_name: 'MesosCluster'
num_tokens: 256
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000 # 3 hours
hinted_handoff_throttle_in_kb: 1024
max_hints_delivery_threads: 2
batchlog_replay_throttle_in_kb: 1024
authenticator: AllowAllAuthenticator
authorizer: AllowAllAuthorizer
permissions_validity_in_ms: 2000
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
data_file_directories:
    - /tmp/cassandra/${clusterName}/data
commitlog_directory: /tmp/cassandra/${clusterName}/commitlog
disk_failure_policy: stop
commit_failure_policy: stop
key_cache_size_in_mb:
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0
counter_cache_size_in_mb:
counter_cache_save_period: 7200
saved_caches_directory: /tmp/cassandra/${clusterName}/saved_caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "${seedNodes}"
concurrent_reads: 32
concurrent_writes: 32
concurrent_counter_writes: 32
memtable_allocation_type: heap_buffers
index_summary_capacity_in_mb:
index_summary_resize_interval_in_minutes: 60
trickle_fsync: false
trickle_fsync_interval_in_kb: 10240
storage_port: 7000
ssl_storage_port: 7001
listen_address:
start_native_transport: true
native_transport_port: 9042
start_rpc: true
rpc_address: localhost
rpc_port: 9160
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
incremental_backups: false
snapshot_before_compaction: false
auto_snapshot: true
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000
column_index_size_in_kb: 64
batch_size_warn_threshold_in_kb: 5
compaction_throughput_mb_per_sec: 16
sstable_preemptive_open_interval_in_mb: 50
read_request_timeout_in_ms: 5000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 2000
counter_write_request_timeout_in_ms: 5000
cas_contention_timeout_in_ms: 1000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
cross_node_timeout: false
endpoint_snitch: SimpleSnitch
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 0.1
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
server_encryption_options:
    internode_encryption: none
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra
client_encryption_options:
    enabled: false
    keystore: conf/.keystore
    keystore_password: cassandra
internode_compression: all
inter_dc_tcp_nodelay: false

The mesos.yaml provides the zookeeper endpoint to find the mesos-master, as well as a few scaling parameters … AND the URI of the package that will be deployed on the nodes in the cluster. This package in turn contains the provided cassandra.yaml config that needs to be fitted to our deployment!

Reconciling so far: update the cassandra.yaml and mesos.yaml to reflect our deployment, including the designated URL from where to pull our customised package, upload the now-configured cassandra-mesos folder to that location, and then initiate the framework via bin/cassandra-mesos (just like bin/cassandra).
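
A minimal sketch of that re-pack-and-stage step, reusing the storage bucket from earlier in this post (the custom file name and the public URL form are assumptions; the bucket/object must be readable by the nodes):

# re-pack the framework folder containing the customised yaml files and stage it
$> tar czf cassandra-mesos-2.1.2-1-custom.tgz cassandra-mesos-2.1.2-1/
$> gsutil cp cassandra-mesos-2.1.2-1-custom.tgz gs://install-swd5ef6/
# mesos.yaml's mesos.executor.uri would then point at e.g.
# https://storage.googleapis.com/install-swd5ef6/cassandra-mesos-2.1.2-1-custom.tgz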

So far so good. That deploys and starts a Cassandra cluster with multiple instances throughout our Mesos cluster, and you don't have to think about seed nodes yourself; all bootstrapping is included. You still need the Cassandra nodetool and cqlsh commands handy in your "work directory" if you want to have a look into the cluster yourself.
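
For example, from an unpacked vanilla Cassandra directory, pointed at one of the nodes from this setup (the paths are assumptions):

# cluster/ring state and an interactive CQL shell against a running node
$> apache-cassandra-2.1.2/bin/nodetool -h mesos-2 status
$> apache-cassandra-2.1.2/bin/cqlsh mesos-2 9042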

The deployment happens with the predefined size, BUT there are no handy means to add more nodes/instances via standard mesos techniques (TBC?).

Here is where Mesos Marathon comes into play again. Marathon basically handles deployments as long-running apps and keeps a handle on them, so you can scale them up and down throughout their lifetime. Marathon also provides a REST API where you POST your app definition, as described by Mesosphere…
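
As a hedged sketch, submitting and later rescaling such an app definition against Marathon's REST API could look like this (the host is an assumption; 8080 is Marathon's default port, and marathon-cas12 is the app id from the JSON excerpt below):

# submit the app definition
$> curl -X POST -H "Content-Type: application/json" http://mesos1:8080/v2/apps -d @marathon.json
# later, scale the running app to 3 instances
$> curl -X PUT -H "Content-Type: application/json" http://mesos1:8080/v2/apps/marathon-cas12 -d '{"instances": 3}'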


Different Mesos Cassandra examples


So here the literature gets a bit blurry. In fact, all the examples are easier to reproduce on single nodes; on multiple nodes I just can't seem to get it deployed properly …

There is this latest approach via Marathon, where an executor and framework jar file are deployed together with a vanilla Cassandra package. But there often tend to be resource conflicts, so I tried bigger and bigger machines, up to a GCE n1-standard-2 with 2 vCPUs and 3.7 GB RAM.

And most importantly, in the deploy JSON description:

"id": "marathon-cas12",
"instances": 1,
"cpus": 0.5,
"mem": 512,
"ports": [
 0
],

The upper part seems to declare resources to deploy Cassandra, not to run Cassandra .. but then for the rescaling in Marathon it doesn't quite seem to make sense! Therefore be generous where Cassandra's resources are declared; Cassandra will eventually claim at least 1 - 1.2 CPU units and 2 GB+ RAM.

And if you use that marathon.json you have to use the provided JRE too; the framework apparently relies on it.

Conclusion so far


I find the current state of Cassandra-on-Mesos difficult to reproduce for production. The scaling is not easy to follow up on, and its reliability is hard to judge…

And thinking of the resource-hungry Cassandra database, you might just add the vanilla install with a template configuration to a base image or the like, and when bootstrapping the cluster set one or two seed nodes known from your cloud environment. The bootstrap and seed node speciality only happens once in the lifetime of the database cluster. If that seed node goes down later, pointing new nodes at a different seed node (basically just one that is alive) will do to dynamically add nodes to the cluster.

Tuesday 30 June 2015

A GeoTrellis Spark Cassandra Experiment on Google Compute Engine

TL;DR: I manually created a GCE instance (8 vCPU, 30 GB RAM, 100 GB SSD) in the Google Web Console, installed Spark and Cassandra, and tested a spark-submit GeoTrellis ingest of a NetCDF climate file into the Cassandra database, using a current GeoTrellis 0.10.0-SNAPSHOT with Cassandra support.

Some thoughts on the process

Initially I had only provided a driver memory of 3 GB, because I started the first experiments on my laptop. The NetCDF file to ingest at first was "only" 1.8 GB, but my local Spark run would always abort and say I don't have sufficient space on my HDD. Maybe it was a RAM issue? I tried with --conf spark.shuffle.consolidateFiles=true but it didn't change anything.
So I created the Google Compute Engine instance, with 8 vCPUs, 30 GB RAM and a 100 GB local SSD. Then I re-iterated the whole lot of install and config manually, to see what needs to be done etc.
I particularly struggled for a long time with the GDAL Java bindings, until I realised that the Spark workers / executors also need to know the LD_LIBRARY_PATH or -Djava.library.path for the native libraries. In the meantime I also compiled the whole GDAL stack to manually create the GDAL Java bindings, and then compiled and linked the GeoTrellis GDAL modules against this local install. Not sure if that's actually necessary.
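A hedged sketch of how that native library path can be handed to both the driver and the executors at spark-submit time (the /usr/local/lib path and the assembly jar name are assumptions for illustration):
# make the GDAL JNI libraries visible to the driver and to every executor
$> spark-submit --master spark://127.0.0.1:7077 \
   --driver-java-options "-Djava.library.path=/usr/local/lib" \
   --conf spark.executor.extraJavaOptions="-Djava.library.path=/usr/local/lib" \
   --conf spark.executorEnv.LD_LIBRARY_PATH=/usr/local/lib \
   geotrellis-ingest-assembly.jar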
I configured a sort of stand-alone Spark cluster, starting the local master with sbin/start-master.sh and starting two workers with:
bin/spark-class org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077 >> worker1.log &
I installed an Nginx reverse proxy to see the spark master web console.
location / {
      proxy_pass http://127.0.0.1:8080;    
}
This GCE instance would cost roughly 200 USD per month, so just for the fun of it I cannot afford to have it running all the time. Presumably the whole run incl. installation, compile times, ingests and then the benchmark took me maybe 10-20 hours so far, which would amount to about 20 USD, if I don't push it much further. For a company that's peanuts; for a student it's still great to be as resourceful as possible.
The actual ingest also complained about not having enough memory to cache RDDs. I now believe this is because I forgot to remove the driver and executor memory limits of 3 GB which came from my laptop tests. Those spark-submit command lines grow really long parameter lists. In the second run of the ingest I was more careful, but the same messages still appeared sometimes:
[Stage 4:=================================================================>      (23 + 1) / 24]
11:02:11 MemoryStore: Not enough space to cache rdd_10_23 in memory! (computed 648.7 MB so far)
The second ingest took about 30 minutes.
The data size in Cassandra increased a lot:
user1@cloud-instance1:~$ du -sh apache-cassandra-2.1.7/data/*
4.0G    apache-cassandra-2.1.7/data/commitlog
15G     apache-cassandra-2.1.7/data/data
396K    apache-cassandra-2.1.7/data/saved_caches
BTW, Cassandra determined its own memory requirements. As it was written somewhere, beyond an 8 GB heap, garbage collection in Cassandra becomes a real problem. But ps -ef shows ` -Xms7540M -Xmx7540M`, which seems good (based on the machine with 30 GB RAM overall).
After the second ingest, I could try to run the benchmark and every two to five seconds top would basically switch between these two views:
… much Cassandra (a short burst)
PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2476 user1     20   0 12.029g 6.373g  30160 S 800.0 21.6  89:14.95 java (i.e. Cassandra)
4572 user1     20   0 12.633g 614064  34084 S   1.0  2.0   0:12.81 java (i.e. geotrellis spark driver)
2030 user1     20   0 3515344 256996   7284 S   0.7  0.8   0:24.71 java (spark master)
2145 user1     20   0 3450064 235220   7244 S   0.3  0.8   0:20.30 java (spark worker)
2264 user1     20   0 3450064 223608   7204 S   0.3  0.7   0:20.33 java (spark worker)
… no Cassandra, but also the spark workers don't have much to do:
PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2030 user1     20   0 3515344 256996   7284 S   0.7  0.8   0:25.20 java (spark master)
4572 user1     20   0 12.633g 616512  34084 S   0.7  2.0   0:14.03 java (i.e. geotrellis spark driver)
2145 user1     20   0 3450064 235220   7244 S   0.3  0.8   0:20.73 java (spark worker)
2264 user1     20   0 3450064 223608   7204 S   0.3  0.7   0:20.76 java (spark worker)
After maybe half an hour, the driver program ran at about 100% CPU, presumably doing the actual first run of the benchmark. The first result appeared another 2000 seconds later…
   12:12:28 CasBenchmark$: YEARLY AVERAGE OF DIFFERENCES ArraySeq((2024-01-01T00:00:00.000Z,-5.742096174281091), 
       (2040-01-01T00:00:00.000Z,-4.082999541556946), (2022-01-01T00:00:00.000Z,-6.916643693113496), 
       (2020-01-01T00:00:00.000Z,-0.12434280860278991), (2032-01-01T00:00:00.000Z,-2.076048279608781), 
       (2037-01-01T00:00:00.000Z,-3.251362344770221), (2016-01-01T00:00:00.000Z,-2.9338103677725176), 
       (2008-01-01T00:00:00.000Z,-4.526800691703725), (2025-01-01T00:00:00.000Z,-4.444756317590644), 
       (2015-01-01T00:00:00.000Z,-1.7933621930024923), (2018-01-01T00:00:00.000Z,1.8317703519666457), 
       (2030-01-01T00:00:00.000Z,-11.56568235966663), (2029-01-01T00:00:00.000Z,-5.528298940438058), 
       (2014-01-01T00:00:00.000Z,8.163989162639911), (2012-01-01T00:00:00.000Z,-0.05441557471953986), 
       (2006-01-01T00:00:00.000Z,-2.6361151595649175), (2007-01-01T00:00:00.000Z,-2.2866075483874537), 
       (2010-01-01T00:00:00.000Z,-3.5406655772197655), (2011-01-01T00:00:00.000Z,-5.0540570398991695), 
       (2036-01-01T00:00:00.000Z,-2.0227119828134787), (2023-01-01T00:00:00.000Z,-2.6087549888935255), 
       (2028-01-01T00:00:00.000Z,-9.611160488520209), (2035-01-01T00:00:00.000Z,-6.24592506571773), 
       (2027-01-01T00:00:00.000Z,-4.980359765178967), (2009-01-01T00:00:00.000Z,-1.114338171308529), 
   ...
   12:12:28 CasBenchmark$: Benchmark: {type: MultiModel-localSubtract-Average, 
        name: philadelphia, layers: List(Layer(name = "tasmaxrcp60ccsm4", zoom = 3), Layer(name = "tasmaxrcp60ccsm420", zoom = 3))} in 2050063 ms
The calculated data, I realised later, was not really meaningful. However, the setup is going in the right direction :-)

Future work

Concluding, something is really slow, it seems. Rob from Azavea has mentioned that before, too.
There already exists an AWS EC2 cluster deployment project for GeoTrellis, and the Cassandra integration is being developed. Hmm, I should actually test that :-)
Furthermore, based on information from some, I'd call them relevant, technology leaders on Spark with Mesos on GCE, Mesosphere or Docker …
… it'll certainly be intriguing to get larger GeoTrellis Cassandra Spark clusters up and running together with those automagical mass deployment tools, e.g. with Mesos, not necessarily with Docker, on the Google Cloud Platform.
quote start
What about other deploy modes?
  • Local mode: OK for testing, but you can also just run standalone mode with one or two workers on your laptop.
  • EC2 scripts: A convenience for frequently bringing up / shutting down Spark, which is probably not what you're doing if you're co-installing Spark with Cassandra.
  • YARN: Don’t do this unless you have a YARN cluster already
  • Mesos: Don’t do this unless:
    • you have a Mesos cluster already AND
    • need fine-grained sharing of CPU resources, AND
    • like looking through twice as many developer mailing lists when something doesn’t work the way you expect.
quote end
yeah, right

Friday 8 May 2015

Google Summer of Code 2015 with GeoTrellis, Cassandra and Spark!

Awesome news :-) Found this in my inbox:

<snip>
 Congratulations! Your proposal 'GeoTrellis: Cassandra Backend to GeoTrellis-Spark' submitted to The Eclipse Foundation has been accepted for Google Summer of Code 2015.

Welcome to Google Summer of Code 2015! We look forward to having you with us.

With best regards,
The Google Summer of Code Program Administration Team
</snip>


So, what is this about? (Full proposal)

Cassandra Backend to GeoTrellis-Spark


1. Introduction


GeoTrellis is a Scala-based LocationTech project, a framework for fast, parallel processing of geospatial data. Recent development efforts have allowed GeoTrellis to give the Apache Spark cluster compute engine geospatial capabilities, focusing on the processing of large-scale raster data.

GeoTrellis's recent integration with Apache Spark currently supports Hadoop HDFS and Accumulo as backends to store and retrieve raster data across a cluster. Cassandra is another distributed data store that could provide a rich set of features and performance opportunities to GeoTrellis running on top of Spark. It's also a popular distributed data store that a number of people interested in doing large-scale geospatial computations are already using. A prototypical GeoTrellis catalog implementation for raster data in Cassandra is in development, yet it doesn't filter in the way we need.

This project would improve the GeoTrellis catalog implementation for Cassandra, which allows us to save and load raster layers as Spark RDDs, as well as metadata. An important factor distinguishing GeoTrellis from other geospatial libraries is its focus on performance. A performance-based indexing scheme needs to be integrated to be able to run spatial and spatio-temporal queries against Cassandra data as fast as possible. Eventually we will also be storing vector data in these data stores, and this project should support the efficient storing, indexing and retrieving of vector data with high-performance spatial and spatio-temporal filtering as well.

2. Background


GeoTrellis is a Scala- and Akka-based high-performance geospatial processing framework. With the linking to Spray/Akka-Http, GeoTrellis functionality can easily be exposed via web services, and with the latest integration with Apache Spark, GeoTrellis can now be run in massive large-scale big data environments. Originally GeoTrellis had its own raster file type and a native catalog implementation that allowed for fast tile access, filtering and additional metadata. Now GeoTrellis also supports the widely used open GeoTIFF format.
To efficiently read, ingest, process and write data on Spark, data sources need to be exposed as Spark RDDs (Resilient Distributed Datasets). This allows the use of Spark's great native algorithms to widely parallelise data processing and manipulation. GeoTrellis as of now supports the Hadoop HDFS filesystem for distributed file access and storage, and the Accumulo database for the GeoTrellis catalogue and advanced raster array access. A Cassandra raster integration is in early-stage development, yet filtering and indexing need to be improved as well.

3. The idea


Apache Cassandra is a zero-dependency, distributed, schema-less NoSQL database. It doesn't support referential integrity or joins like a relational database, and it has no spatial capabilities. With GeoTrellis on top of Cassandra, this high-performance data store could be used in massively big data applications with spatio-temporal requirements. There is an active development community around the Apache Spark Cassandra connector, and the use of Cassandra data stores in the big data framework Spark can almost be considered mainstream production.

GeoTrellis has its roots in high-performance raster data processing, and as rasters are basically arrays of values with spatial and non-spatial metadata, GeoTrellis has its own (metadata) catalog implementation and can flexibly support an arbitrary variety of data stores for array data. Recent developments are prototyping raster storage and retrieval with Cassandra, and vector support is planned. As Cassandra doesn't support any spatial indexing on its own (as of now), a high-performance indexing scheme needs to be implemented to be able to run spatial and spatio-temporal queries against Cassandra data as fast as possible. Here the vector- and raster-based indexing and filtering methods against Cassandra and the GeoTrellis catalog need to be optimised for vector data in general and Cassandra in particular. Cassandra does support custom indexes on columns by reference to a (presumably Java-based) index class. Here typical discrete spatial indexes (in reconciliation with the GeoTrellis catalog) like R-trees, quadtrees or possibly space-filling curves might be directly added to Cassandra to support spatial data indexing. The raster indexing/filtering is dependent on the GeoTrellis raster/array ingestion and referencing via the GeoTrellis catalog (documentation is a bit sparse, so I could also help document the functionality along the way).

4. Future ideas / How can your idea be expanded?


There are several other great GIS, hydro-climate and geo-science (FOSS4G) toolkits that are based on Java, and it would be great if they or their functionality could subsequently be exposed and used under the GeoTrellis framework (like JTS already is, or the alternative GeoTIFF reader), particularly under the large-scale parallel processing capabilities on top of Apache Spark. This would allow for big data applications (business as well as environmental science) under an enormously powerful framework.

Many scientific codes have been and are still developed in Fortran, mainly because of its superior, fast and efficient matrix manipulation functionality. Yet parallelising Fortran is hard, and in fact only viable on actual HPC facilities. With Scala, Akka and Breeze (or directly Spark and Java native BLAS etc.), parallelising scientific codes should scale nicely across commodity cloud servers. Maybe it could be evaluated at some point how to easily port popular Fortran codes to Scala/Spark and run them, with unprecedented simplicity, in mainstream IT/server/cloud deployments at a very comparable rate of performance.



About Google Summer of Code

Google Summer of Code (GSoC) is a global programme that offers stipends to students to write code for open source projects. Google works with the open source community to identify and fund exciting projects. Alexander Kmoch, a student supported by the SMART Aquifer Characterisation (SAC) programme, has been accepted as one of 1051 students in this year's Google Summer of Code programme. Alex will help to improve the GeoTrellis database implementation for the cloud database Cassandra, to allow processing of raster layers and vector data via Apache Spark, the fast cluster engine for large-scale data processing. The GeoTrellis software will also be incorporated into the groundwater data portal that is being developed through the SMART project.

Thursday 23 April 2015

An approach towards a unified hydro-climate metadata search across New Zealand

The SMART project presented at the joint Water Symposium 2014 of the New Zealand Hydrological Society, New Zealand Freshwater Sciences Society and the IPENZ Rivers Group. The Symposium was held at the Marlborough Convention Centre, 24 - 28 November 2014. The theme was "Integration: 'The Final Frontier' ~ Whakakotahi te amine rohenga"; integration recognises the continuity of hydrological processes in space and time. In that context we talked about our approach towards a unified hydro-climate metadata search across New Zealand.

Acquisition of relevant and useful data for hydrological, meteorological or hydrogeological assessments and research projects in New Zealand's regions is primarily based on personal communication and reliance on the knowledge of domain specialists. To support the task of identification and subsequent retrieval of available datasets for an area of interest, a new unified method for metadata and data search in New Zealand, with a focus on hydro-climate and geo-scientific datasets, is presented. The method is implemented within an online accessible search website which searches the many distributed New Zealand based data providers and web portals of scientific and governmental organisations.

The research work undertaken comprises two parts. First, the publicly accessible abstracts from the website of the Journal of Hydrology New Zealand (NZHS 2014) are analysed. The title, date, authors and abstract text were indexed and, where possible, a spatial context was derived. Additional keywords were selected for a thesaurus based on the occurrence of domain-specific terms in the abstract text. Subsequently, metadata records in the New Zealand metadata standard format are created and provided online. Secondly, the web search algorithm distributes the search to a selection of pre-registered geo- and hydro-climate data portals like "LINZ Data Service" (LINZ 2014), "Geodata.govt.nz" (NZGO 2014), "NIWA DC" (NIWA 2014), "Landcare LRIS" (Landcare 2014), "Koordinates" (Koordinates 2014) and "data.govt.nz" (NZ_DIA 2014). The search query can be enriched with related terms from a thesaurus or glossary, and spatial context with place names or coordinates. Results are ranked based on spatial and semantic correlations with the New Zealand place names register and hydrological terms from the developed glossary.
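
Several of the portals listed above expose standard OGC catalogue service (CSW) interfaces, so the fan-out part of such a distributed search essentially boils down to keyword queries like the following hedged sketch (the endpoint URL is a placeholder, not one of the registered portals' actual addresses):

# CSW 2.0.2 GetRecords via KVP, full-text filtered on an (assumed) enriched search term
$> curl "http://catalogue.example.org/csw?service=CSW&version=2.0.2&request=GetRecords&typeNames=csw:Record&resultType=results&elementSetName=summary&constraintLanguage=CQL_TEXT&constraint_language_version=1.1.0&constraint=AnyText%20like%20'%25aquifer%25'"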

The slides of the presentation are now publicly accessible. Please click here to view the presentation slides.

Thursday 26 March 2015

ZOO-Project WPS Java-API and JGrasstools Java Hydrological Toolbox


The ZOO-Project (http://zoo-project.org/) is a solid Open Geospatial Consortium (OGC) Web Processing Service (WPS - http://www.opengeospatial.org/standards/wps) standard server implementation with an open, flexible API that works well with many different programming languages. The Java bindings have never been tested in advanced configurations and with complex data types, and to date only implement the minimum necessary interfaces. The JGrasstools project is a modular processing library whose highly annotated nature makes it quite easy to adapt to other toolboxes. JGrasstools contains a wide variety of powerful and efficient GIS, hydrology and geomorphology tools and processes that can be exposed to and used by other libraries and toolkits. One example has been the adaptation to the GeoTools Process API. The JGrasstools project, as well as other Java-based projects (such as JTS, Sextante or even GeoTools), would benefit greatly from the possibility to be used within a web-enabled WPS execution environment, as well as from being integrated with the open standards suite of the OGC. Some time ago Moovida tried integrating the JGrasstools libraries with the ZOO-Project Java binding to expose them as native WPS processes. This would allow them to work inside the ZOO-Project and serve its modules under the WPS standard.

Some Background


The ZOO-Project WPS implementation is a flexible, modular, high-performance HTTP CGI implementation. The ZOO-Kernel is a powerful server-side C kernel which makes it possible to manage and chain web services by loading dynamic libraries and handling them as on-demand web services. The ZOO-Kernel is written in the C language and supports several common programming languages in order to connect to numerous libraries and models (http://zoo-project.org/trac/wiki/ZooWebSite/ZooKernel). The generic ZOO API is basically accessible to every programming and web scripting language that can be run under the CGI interface. The main API implementations, the ZOO services, are available for C/C++, Python, JavaScript, PHP, Fortran and Java. Some API bindings are more advanced and complete and make the full ZOO-API (http://zoo-project.org/trac/wiki/ZooWebSite/ZOOAPI/Classes#ZOOAPIClasses) accessible to the ZOO service in the particular programming language (e.g. C, Python or JavaScript). In comparison, the Java API binding only exposes the minimum functionality to run from the ZOO-Kernel.

JGrasstools (http://moovida.github.io/jgrasstools/) is a powerful GIS toolkit with functionality reaching from standard geoprocessing algorithms to advanced processing features used in hydrology and geomorphology. JGrasstools is based on a Maven (http://maven.apache.org/) build process, which takes care of dependency resolution and creates the succinct jar packages with the compiled classes. Maven is a de-facto standard for managing (sources and dependencies), building and deploying (jar packaging, resource copying, publishing, archiving and installing) Java-based software projects. JGrasstools is also used as a toolbox in the uDig desktop GIS software (http://udig.refractions.net/). If JGrasstools could be exposed via an open-standards, web-based processing and execution environment (like the ZOO-Project provides), it could be widely used in WebGIS deployments and large-scale cloud-based processing chains.

The next level

Andrea from Moovida said he didn't have enough time to continue developing this idea. He had developed a generator which programmatically scans through the annotated JGrasstools modules and generates a respective ZooJavaWps class per JGrasstools module/method, plus the corresponding ZOO-Project .zcfg config file. The only struggle I had was getting the CLASSPATH properly set up, as the ZOO-Project is basically an HTTP CGI application which starts a JVM per request. When I picked this up in preparation for a GSoC proposal, I found that there were a few little botches in the parameter mapping from the ZOO Java API into the very nicely annotated JGrasstools methods. So I took one example generated (WPS-ified) JGrasstools process, adjusted the parameter mapping, and got it running with the ZOO-Project. Additionally I adjusted the JGrasstools Maven config files to download the necessary dependencies into the target folders, to copy them collectively into the ZOO-Project Java CLASSPATH.
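
Once a generated process and its .zcfg file are deployed next to the ZOO-Kernel, it can be exercised through the standard WPS request chain; a hedged sketch against a local ZOO installation (the process identifier is a placeholder, not the actual generated module name):

# discover the deployed processes, then inspect one of them
$> curl "http://localhost/cgi-bin/zoo_loader.cgi?service=WPS&version=1.0.0&request=GetCapabilities"
$> curl "http://localhost/cgi-bin/zoo_loader.cgi?service=WPS&version=1.0.0&request=DescribeProcess&Identifier=JGrasstoolsModule"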

Unfortunately I also didn't have time to drive this further yet. However, it is just soooo close really :-) Alternatively, a 52°North WPS implementation based on the super practical JGrasstools annotations and Andrea's generator would also be relatively straightforward.