Banihirwe, A., Paul, K., & Del Vento, D. (2018). PySpark for "big" atmospheric data analysis. In Eighth Symposium on Advances in Modeling and Analysis Using Python. American Meteorological Society: Austin, TX, US.
Using NCAR's high-performance computing systems, scientists perform many kinds of atmospheric data analysis with a variety of tools and workflows. Some of these, such as climate data analysis, are time intensive from both a human and a computational point of view. These analyses are often "embarrassingly parallel," yet many traditional approaches are either not parallel or excessively complex for this kind of work. This research project therefore explores an alternative approach to parallelizing them. We used PySpark, the Python interface to Apache Spark, a modern framework for fast distributed computing on Big Data. We have successfully installed, configured, and used PySpark on NCAR's HPC platforms, such as Yellowstone and Cheyenne. To this end, we designed and developed a Python package (spark-xarray) that bridges the I/O gap between Spark and scientific data stored in netCDF format. We applied PySpark to several atmospheric data analysis use cases, including bias correction and per-county computation of atmospheric statistics such as rainfall and temperature. In this presentation, we will show the results of using PySpark on these cases, comparing it with more traditional approaches in terms of both performance and programming flexibility, including numerical details such as timing, scalability, and code examples.