<snip>
Congratulations! Your proposal 'GeoTrellis: Cassandra Backend to GeoTrellis-Spark' submitted to The Eclipse Foundation has been accepted for Google Summer of Code 2015.
Welcome to Google Summer of Code 2015! We look forward to having you with us.
With best regards,
The Google Summer of Code Program Administration Team
</snip>
So, what is this about? (Full proposal)
Cassandra Backend to GeoTrellis-Spark
1. Introduction
GeoTrellis's recent integration with Apache Spark currently supports Hadoop HDFS and Accumulo as backends for storing and retrieving raster data across a cluster. Cassandra is another distributed data store that could provide a rich set of features and performance opportunities to GeoTrellis running on top of Spark. It is also a popular distributed data store that a number of people interested in large-scale geospatial computations are already using. A prototype GeoTrellis catalog implementation for raster data in Cassandra is in development, but it does not yet filter in the way we need.
This project would improve the GeoTrellis catalog implementation for Cassandra, which allows us to save and load raster layers as Spark RDDs, along with their metadata. An important factor that distinguishes GeoTrellis from other geospatial libraries is its focus on performance. A performance-oriented indexing scheme needs to be integrated so that spatial and spatio-temporal queries against Cassandra data run as fast as possible. Eventually we will also be storing vector data in these data stores, and this project should support efficient storing, indexing and retrieving of vector data with high-performance spatial and spatio-temporal filtering as well.
2. Background
To efficiently read, ingest, process and write data on Spark, data sources need to be exposed as Spark RDDs (Resilient Distributed Datasets). This makes it possible to use Spark's native algorithms to parallelise data processing and manipulation widely. GeoTrellis currently supports the Hadoop HDFS filesystem for distributed file access and storage, and the Accumulo database for the GeoTrellis catalog and advanced raster array access. A Cassandra raster integration is in an early stage of development, and its filtering and indexing still need to be improved.
3. The idea
GeoTrellis has its roots in high-performance raster data processing. As rasters are essentially arrays of values with spatial and non-spatial metadata, GeoTrellis has its own (metadata) catalog implementation and can flexibly support a variety of data stores for array data. Recent developments are prototyping raster storage and retrieval with Cassandra, and vector support is planned. As Cassandra doesn't support any spatial indexing on its own (as of now), a high-performance indexing scheme needs to be implemented so that spatial and spatio-temporal queries against Cassandra data run as fast as possible. The vector- and raster-based indexing and filtering methods against Cassandra and the GeoTrellis catalog need to be optimized for vector data in general and for Cassandra in particular. Cassandra does support custom indexes on columns by reference to an index class (presumably Java-based). Typical discrete spatial indexes such as R-Trees, QuadTrees or possibly Space Filling Curves might therefore be added directly to Cassandra, in line with the GeoTrellis catalog, to support spatial data indexing. The raster indexing/filtering depends on the GeoTrellis raster/array ingestion and referencing via the GeoTrellis catalog (documentation is a bit sparse, so I could also help document that functionality along the way).
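To make the space filling curve option more concrete, here is a minimal sketch of a Z-order (Morton) index over tile coordinates, written in Python for brevity. It is an illustration only, not GeoTrellis's actual indexing code: the function names `interleave_bits` and `z_ranges` and the 16-bit coordinate width are assumptions made for this example. The idea is that a 2D tile position (column, row) is mapped to a single integer by interleaving the bits, and a bounding-box query is turned into a small set of contiguous index ranges.

```python
from typing import List, Tuple


def interleave_bits(col: int, row: int, bits: int = 16) -> int:
    """Z-order index: bits of `col` go to even positions, bits of `row` to odd ones."""
    z = 0
    for i in range(bits):
        z |= ((col >> i) & 1) << (2 * i)
        z |= ((row >> i) & 1) << (2 * i + 1)
    return z


def z_ranges(col_min: int, row_min: int, col_max: int, row_max: int,
             bits: int = 16) -> List[Tuple[int, int]]:
    """Cover an inclusive tile bounding box with merged, inclusive Z-index ranges.

    Brute force over all tiles in the box: fine for a sketch, too slow for
    very large query windows.
    """
    indices = sorted(
        interleave_bits(c, r, bits)
        for c in range(col_min, col_max + 1)
        for r in range(row_min, row_max + 1)
    )
    ranges: List[List[int]] = []
    for z in indices:
        if ranges and z == ranges[-1][1] + 1:
            ranges[-1][1] = z          # extend the current contiguous run
        else:
            ranges.append([z, z])      # start a new run
    return [(start, end) for start, end in ranges]


if __name__ == "__main__":
    # An aligned 2x2 block collapses to one range; an unaligned one does not.
    print(z_ranges(0, 0, 1, 1))
    print(z_ranges(1, 0, 2, 1))
```

If the Z index were stored as a sortable part of the row key (an assumption about the table layout, not something the current prototype is known to do), each resulting range could be served by one range scan, and tiles that are close in space would tend to be close in the key order.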
4. Future ideas / How can your idea be expanded?
Many scientific codes have been and still are developed in Fortran, mainly because of its fast and efficient matrix manipulation functionality. Yet parallelising Fortran is hard, and in practice only viable on actual HPC facilities. With Scala, Akka and Breeze (or directly with Spark and native BLAS via Java, etc.), parallelised scientific codes should scale nicely across commodity cloud servers. At some point it could be evaluated how easily popular Fortran codes can be ported to Scala/Spark and run with unprecedented simplicity in mainstream IT/server/cloud deployments at a comparable level of performance.
About Google Summer of Code