Friday 8 May 2015

Google Summer of Code 2015 with GeoTrellis, Cassandra and Spark!

Awesome news :-) Found this in my inbox:

<snip>
 Congratulations! Your proposal 'GeoTrellis: Cassandra Backend to GeoTrellis-Spark' submitted to The Eclipse Foundation has been accepted for Google Summer of Code 2015.

Welcome to Google Summer of Code 2015! We look forward to having you with us.

With best regards,
The Google Summer of Code Program Administration Team
</snip>


So, what is this about? (Full proposal below)

Cassandra Backend to GeoTrellis-Spark


1. Introduction


GeoTrellis is a Scala-based LocationTech project providing a framework for fast, parallel processing of geospatial data. Recent development efforts have allowed GeoTrellis to give the Apache Spark cluster compute engine geospatial capabilities, focusing on the processing of large-scale raster data.

GeoTrellis's recent integration with Apache Spark currently supports Hadoop HDFS and Accumulo as backends to store and retrieve raster data across a cluster. Cassandra is another distributed data store that could provide a rich set of features and performance opportunities to GeoTrellis running on top of Spark. It's also a popular distributed data store that a number of people interested in doing large-scale geospatial computations are already using. A prototypical GeoTrellis catalog implementation for raster data in Cassandra is in development, yet it doesn't filter in the way we need.

This project would improve the GeoTrellis catalog implementation for Cassandra, which allows us to save and load raster layers as Spark RDDs, along with their metadata. An important factor distinguishing GeoTrellis from other geospatial libraries is its focus on performance. A performance-oriented indexing scheme needs to be integrated so that spatial and spatio-temporal queries against Cassandra data can run as fast as possible. Eventually we will also be storing vector data in these data stores, and this project should support the efficient storing, indexing and retrieval of vector data with high-performance spatial and spatio-temporal filtering as well.

2. Background


GeoTrellis is a Scala- and Akka-based high-performance geospatial processing framework. Through its link to Spray/Akka HTTP, GeoTrellis functionality can easily be exposed via web services, and with the latest integration with Apache Spark, GeoTrellis can now run in massive, large-scale big-data environments. Originally GeoTrellis had its own raster file type and a native catalog implementation that allowed fast tile access, filtering and additional metadata. GeoTrellis now also supports the widely used open GeoTIFF format.
To efficiently read, ingest, process and write data on Spark, data sources need to be exposed as Spark RDDs (Resilient Distributed Datasets). This makes it possible to use Spark's native algorithms to parallelise data processing and manipulation across the cluster. GeoTrellis currently supports the Hadoop HDFS filesystem for distributed file access and storage, and the Accumulo database for the GeoTrellis catalogue and advanced raster array access. A Cassandra raster integration is in early-stage development, but its filtering and indexing still need to be improved.
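As a rough sketch of what "exposing a layer as an RDD" means in practice: the key and tile types below are simplified placeholders invented for illustration, not the actual GeoTrellis API, but the shape of a raster layer as a distributed collection of (key, tile) pairs is the core idea.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Simplified stand-ins for GeoTrellis' spatial key and raster tile
// types -- hypothetical, not the real API.
case class SpatialKey(col: Int, row: Int)
case class Tile(cells: Array[Int])

object LayerAsRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("layer-sketch").setMaster("local[2]"))

    // A raster layer as a pair RDD: Spark can now shuffle, filter
    // and map tiles in parallel across the cluster.
    val layer = sc.parallelize(Seq(
      SpatialKey(0, 0) -> Tile(Array.fill(256 * 256)(1)),
      SpatialKey(1, 0) -> Tile(Array.fill(256 * 256)(2))
    ))

    // A trivial per-tile map operation, executed in parallel.
    val doubled = layer.mapValues(t => Tile(t.cells.map(_ * 2)))
    println(doubled.count())
    sc.stop()
  }
}
```

Once the layer is in this form, filtering by key (e.g. a bounding box of `SpatialKey`s) is what the backend-specific catalog code has to make efficient.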

3. The idea


Apache Cassandra is a zero-dependency, distributed, schema-less NoSQL database. It doesn't support referential integrity or joins like a relational database, and it has no spatial capabilities. With GeoTrellis on top of Cassandra, this high-performance data store could be used in massive big-data applications with spatio-temporal requirements. There is an active development community around the Spark Cassandra Connector, and the use of Cassandra data stores in the big-data framework Spark can almost be considered mainstream in production.
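To make this concrete, here is a hedged sketch of how tile rows might flow between Spark and Cassandra via the Spark Cassandra Connector. The keyspace, table and column names (`geotrellis`, `tiles`, `zindex`, `value`) are made up for illustration, and a real table would need its primary key laid out so the range predicate can be served efficiently.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object CassandraTilesSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-tiles")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Write (index, tile bytes) pairs into a hypothetical table.
    val tiles: Seq[(Long, Array[Byte])] = Seq(12L -> Array[Byte](1, 2, 3))
    sc.parallelize(tiles)
      .saveToCassandra("geotrellis", "tiles", SomeColumns("zindex", "value"))

    // Read back only a key range, pushing the predicate down to
    // Cassandra instead of filtering in Spark.
    val slice = sc.cassandraTable[(Long, Array[Byte])]("geotrellis", "tiles")
      .where("zindex >= ? AND zindex <= ?", 0L, 100L)
    println(slice.count())
    sc.stop()
  }
}
```

The interesting design question for this project is exactly what that `zindex` column contains, which is where the indexing scheme below comes in.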

GeoTrellis has its roots in high-performance raster data processing, and since rasters are basically arrays of values with spatial and non-spatial metadata, GeoTrellis has its own (metadata) catalog implementation and can flexibly support an arbitrary variety of data stores for array data. Recent developments are prototyping raster storage and retrieval with Cassandra, and vector support is planned. As Cassandra doesn't (as of now) support any spatial indexing of its own, a high-performance indexing scheme needs to be implemented to make spatial and spatio-temporal queries against Cassandra data as fast as possible. Here the vector- and raster-based indexing and filtering methods against Cassandra and the GeoTrellis catalog need to be optimised for vector data in general and for Cassandra in particular. Cassandra does support custom indexes on columns by referencing a (presumably Java-based) index class. Typical discrete spatial indexes such as R-trees, quadtrees or possibly space-filling curves, kept consistent with the GeoTrellis catalog, might be added directly to Cassandra to support spatial data indexing. The raster indexing/filtering depends on the GeoTrellis raster/array ingestion and referencing via the GeoTrellis catalog (documentation is a bit sparse, so I could also help document this functionality along the way).
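The space-filling-curve idea can be sketched in a few lines. The following Morton (Z-order) bit interleave is a minimal illustration of the principle, not GeoTrellis' actual indexing code: it folds a 2D grid coordinate into a single sortable number, so tiles that are close in space tend to land close together in Cassandra's one-dimensional key space, and a spatial query becomes a small set of key-range scans.

```scala
// Minimal Z-order (Morton) index sketch for 16-bit grid coordinates.
object ZCurve {
  // Spread the lower 16 bits of x so there is a zero bit between each.
  private def spread(x: Long): Long = {
    var v = x & 0xFFFFL
    v = (v | (v << 8)) & 0x00FF00FFL
    v = (v | (v << 4)) & 0x0F0F0F0FL
    v = (v | (v << 2)) & 0x33333333L
    v = (v | (v << 1)) & 0x55555555L
    v
  }

  // Interleave column and row bits into one sortable index.
  def index(col: Int, row: Int): Long =
    spread(col.toLong) | (spread(row.toLong) << 1)
}
```

For example, `ZCurve.index(2, 2)` interleaves the bits `10` and `10` into `1100`, i.e. 12. A time dimension could be handled analogously by interleaving a third coordinate, which is essentially what a spatio-temporal key scheme needs.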

4. Future ideas / How can your idea be expanded?


There are several other great GIS, hydro-climate and geo-science (FOSS4G) toolkits based on Java, and it would be great if they, or their functionality, could subsequently be exposed and used under the GeoTrellis framework (as JTS already is, or the alternative GeoTIFF reader), particularly with large-scale parallel processing capabilities on top of Apache Spark. This would allow for big-data applications (in business as well as environmental science) under an enormously powerful framework.

Many scientific codes have been, and are still being, developed in Fortran, mainly because of its fast and efficient matrix-manipulation facilities. Yet parallelising Fortran is hard, and in practice only viable on actual HPC facilities. With Scala, Akka and Breeze (or directly Spark and Java-native BLAS, etc.), parallelised scientific codes should scale nicely across commodity cloud servers. It might be worth evaluating at some point how to port popular Fortran codes to Scala/Spark and run them with unprecedented simplicity in mainstream IT/server/cloud deployments at a very comparable level of performance.
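As a toy illustration of the kind of parallelism meant here (a sketch only, using plain Scala futures rather than Breeze or Spark): the rows of a Fortran-style matrix-vector product are independent dot products, so they can be computed concurrently on a thread pool on a single machine, or distributed across a cluster with the same structure.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParMatVec {
  // Matrix-vector product with one concurrent task per row.
  def multiply(m: Array[Array[Double]], v: Array[Double]): Array[Double] = {
    val rowTasks: Seq[Future[Double]] = m.toSeq.map { row =>
      Future(row.zip(v).map { case (a, b) => a * b }.sum)
    }
    Await.result(Future.sequence(rowTasks), 30.seconds).toArray
  }
}
```

For example, multiplying `[[1, 0], [0, 2]]` by `[3, 4]` yields `[3, 8]`. Replacing the futures with an RDD of rows would give the Spark version of the same decomposition.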



About Google Summer of Code

Google Summer of Code (GSoC) is a global programme that offers stipends to students to write code for open source projects. Google works with the open source community to identify and fund exciting projects. Alexander Kmoch, a student supported by the SMART Aquifer Characterisation (SAC) programme, has been accepted as one of 1051 students in this year's Google Summer of Code. Alex will help improve the GeoTrellis database implementation for the distributed database Cassandra, allowing raster layers and vector data to be processed via Apache Spark, the fast cluster engine for large-scale data processing. The GeoTrellis software will also be incorporated into the groundwater data portal that is being developed through the SMART project.