From Climate Simulations to Statistics – Introducing the wux Package

We present the R package wux, a toolbox to analyze projected climate change signals by numerical climate model simulations and the associated uncertainties. The focus of this package is to automatically process big amounts of climate model data from multi-model ensembles in a user-friendly and flexible way. For this purpose, climate model output in common binary format (NetCDF) is read in and stored in a data frame, after first being aggregated to a desired temporal resolution and then being averaged over spatial domains of interest. The data processing can be performed for any number of meteorological parameters at one go, which allows multivariate statistical analysis of the climate model ensemble.


Introduction
The human influence on the climate system is often assessed using numerical climate simulations (General Circulation models or GCMs).These are models representing physical processes in the atmosphere, ocean, cryosphere and land surface.However, due to their relative coarse resolution in the order of several hundred kilometers, they are not able to cover important processes at smaller spatial scales.For a more regional analysis, the models can be refined using regional climate models (RCMs) (Giorgi and Mearns 1991) and statistical downscaling techniques (Maraun, Wetterhall, Ireson, Chandler, Kendon, Widmann, Brienen, Rust, Sauter, Themeßl, Venema, Chun, Goodess, Jones, Onof, Vrac, and Thiele-Eich 2010).Under certain assumptions of future greenhouse gas (GHG) emissions, those models can project climate into future periods.We define a climate change signal of a particular meteorological parameter (from climate simulations) as the measure of change between a future climate projection and the past climate.
However, those projected climates are subject to different sources of uncertainty stemming from the natural variability of the climate system, unknown future GHG emissions, and errors and simplifications in GCMs and from regionalization methods.The resulting uncertainties can be partly assessed by analyzing so-called multi-model ensembles (i.e.climate projections which are generated by various GCMs and RCMs), which aim to sample the various sources of uncertainty.However, those ensembles do not systematically sample components of model Introducing the wux Package uncertainty (e.g.physical parametrizations), and thus do not stem from an experimental design in a statistical sense (Knutti, Furrer, Tebaldi, Cermak, and Meehl 2010).They cannot be expected to represent unbiased distributions of possible future climate states.Also, interdependence between GCMs may induce additional biases in the sample, which makes a proper statistical analysis even more difficult.Several publications address those problems, for example Tebaldi and Knutti (2007); Smith, Tebaldi, Nychka, and Mearns (2009); Pirtle, Meyer, and Hamilton (2010); Bishop and Abramowitz (2012); Collins, Chandler, Cox, Huthnance, Rougier, and Stephenson (2012) ;Fischer, Weigel, Buser, Knutti, Künsch, Liniger, Schür, and Appenzeller (2012); Kang, Cressie, and Sain (2012); Stephenson, Collins, Rougier, and Chandler (2012); Chandler (2013); Rougier, Goldstein, and House (2013); Stephenson et al. (2012); Mendlik and Gobiet (2015).
This paper introduces the R package wux (Wegener Center Climate Uncertainty Explorer) (Mendlik, Heinrich, and Leuprecht 2015), a toolbox which enables multi-model handling for statistical analysis of climate scenarios.It is intended to be used to interpret climate model output and provide uncertainty information for the end-user of the climate simulations.Having in mind the heterogeneous target audience, we want this tool to perform following tasks: 1. Enable easy statistical descriptive analysis of user-defined climate model ensembles.
2. Be expandable to any kind of statistical analysis (to push the development of new statistical methods for climate multi-model analysis).
3. Easily process climate simulations to a common data format usable for statistical analysis.This enables reproducing data for any analysis needed.
Descriptive statistics of climatic changes from ensembles (point 1) are crucial to understand the underlying data.In practice people sometimes tend to forget this important step and prefer to directly address their complex research questions without having an overview of the data beforehand.A lot of valuable information lies in this analysis.Having some ready-touse tools already implemented in wux should encourage users to perform this sort of analysis more often.
However, such a tool should not restrict the user to a pre-defined set of standard methods, on the contrary, development of new methods for statistical inference on climate simulations should be strongly supported, as this is still ongoing research (Knutti et al. 2010).Having set up this tool directly in R, allows to explore an extremely broad pool of ready-to-use methods, also from other disciplines using different approaches (point 2).
One of the most time consuming and frustrating tasks when analyzing climate simulations can be the step of processing data (point 3).The user of this tremendously big amount of datasets will find him-/herself challenged, when trying to aggregate them to the desired format (typically some sort of data frame) or get the desired statistics of the ensemble for certain geographical regions of interest.The challenge here is definitely a technical one: Processing ensembles of data in a binary-format usually requires dedicated programming work.The upside is that the data comes in the handy NetCDF file format 1 , where a lot of meta-information about the data is stored in its header, however, life is more complicated in practice.Quite often it happens that meta-information between individual climate simulation output files differ substantially.For this reason it quickly becomes a nuisance when treating large samples of these files in an automated way.Up to now, no such tool is available which processes userdefined climate simulations in an automated way and which allows sophisticated statistical analysis.Furthermore, it is very difficult to reproduce statistical analysis from the scientific community when either the data set from the publication is not available, or the user wishes to apply the method with his/her own climate data.Providing a software which takes this burden, allows the user to solely focus on the interpretation of the climate model output without spending too many resources on technicalities.We consider it a great strength of this package to perform this task in an automated way.
Several powerful tools already exist to process climate model outputs, such as CDO2 , NCO (Zender 2008) The structure of this paper is as follows.Section 2 gives an overview of the functionalities of the wux package.Section 3 describes how climate data are being processed to a suitable data frame step-by-step.We introduce the statistical functionalities implemented in the package in Section 4 and provide an example application in Section 5 to show possible extensions of the implemented statistical functionalities.We conclude in Section 6.

Package overview
wux is meant to be an interfacing toolbox for scientists performing statistical analysis on climate models.Its focus is to provide a simple data frame for the user to make statistical inference on the ensemble.In particular, this package performs following actions, which are depicted in Figure 1 and described

Climate data processing
The central role of the wux package is to automatically read in binary climate model output data from NetCDF files and process them to a data frame for statistical analysis.This task is performed by the function models2wux.The resulting data frame (further called wux.df, as it is technically a wux.df object) contains the climate change signals for user-specified periods, regions, seasons, and parameters for each of the climate models.One example wux.df is shown at the end of Section 3.1.Alternatively, also time series data can be obtained.

From climate model output to wux data frame
This is what models2wux is doing for each specified climate model: 1. Read in a three dimensional array (longitude, latitude, time) from binary climate model output.
2. Temporal aggregation of the fields according to user-specified climate periods and seasons.Aggregation statistics can also be specified by the user.Temporal aggregation can be performed several times serially, going from fine temporal resolution to coarser resolution, each time using another statistic for aggregation.For example, daily temperature of a climate model output could first be aggregated to monthly resolution using the mean function and as a second step the warmest month in the year can be calculated with max.This would result in a climate change signal of the warmest monthly averages.We can thus calculate a vast amount of sufficient statistics to explore the climate data.Also, the user has the possibility to retrieve the full time-series of the climate model instead of the Introducing the wux Package climate change signal.This can, however, result in quite a large data frame.The lowest time resolution currently implemented for time-series data is on monthly basis.
Being able to flexibly perform spatial aggregation over a specified domain is one of the key strengths of this program.Several ways exist for the user to identify the region of interest.For example a rectangular region defined by the longitude-latitude corners can be specified.
For more flexibility, polygons can be defined using ESRI shapefiles5 to cut out and aggregate over the desired subregion domain.The spatial aggregation is always performed using the arithmetic mean over geographical regions of any complexity.However, this process is not as trivial as it first may seem.One problem lies in the geographical projection of the climate model.Averaging over pixels of a model on a Mercator projection (angle preserving) will result in a different value than averaging over pixels in an area-preserving projection.GCMs usually do not come on an area-preserving projection.Therefore, the pixels should be weighted by the cosine of their latitudes, otherwise areas near the poles would gain much more weight then areas near the equator.When aggregating over a certain subregion, another problem arises from the gridpoints which are associated with the subregion.Instead of either considering a gridpoint to be within a region or not (0 and 1 weight), we may want to weight all the model cells that contribute even partly to the considered subregion, i.e. seize the fraction of the cell corresponding to the area covered by the subregion.

Setting up models2wux
To process a climate multi-model ensemble of your choice, models2wux needs two input arguments userinput and modelinput, each being a named list object or a file containing a named list.
modelinput stores general information about your climate data, i.e. the locations of the NetCDF files and their filenames.It also saves certain meta-information for the specific climate simulations (e.g. a unique acronym for the simulation, the developing institution, the radiative forcing).Usually the modelinput information should be stored in a single file on your system and should be updated when new climate simulations come in.It is advisable to share this file with your colleagues if you work with the same NetCDF files on a shared IT infrastructure.
The second input argument, userinput, defines which meteorological parameters of which climate simulations defined in modelinput should be analyzed.This is simply done by calling the models acronym, as all meta-information is already stored in the modelinput file.Also the geographical regions of interest and the temporal statistics are specified in this file.This file typically changes depending on the type of analysis performed.

Getting started
We explain models2wux in more detail by considering an example of a typical workflow for climate data processing.We start with downloading a couple of global climate simulations (GCMs) from the CMIP5 project (Taylor, Stouffer, and Meehl 2012), then we specify their meta-information and the output statistics and finally we run models2wux to process the binary data to an object of class wux.df.
To obtain CMIP5 climate simulations you can get started with downloading some example NetCDF files directly from an ESGF (Earth System Grid Federation) node6 or using the CMIP5fromESGF function from the wux package (Linux only).
In order to run models2wux, you need to specify the two input arguments explained above: A modelinput file to define which climate simulations you have on your hard-disk and a userinput file which controls models2wux itself.An example for the model specification can be obtained in the package itself: : chr " monthly " $ NorESM1 -M -r1i1p1 _ rcp85 : List of 11 ... This input specifies the simulations which have just been downloaded.It is a named list with the name being an unique acronym of the climate simulation.The example input here specifies two simulations, but for the sake of brevity we only display the first one, being the CanESM2-r1i1p1_rcp85 model.As this is a GCM, the rcm tag has no entry.The other tags specify the model in more detail: This simulation is run number 1 of the GCM CanESM2 and has been developed by the CCCma institution7 .The corresponding anthropogenic forcing is rcp85.file.path.altdefines the file locations for both temperature and precipitation files as well as for historical runs and future scenario projections.In this case the historical and the future scenario runs are located in different directories, whereas both meteorological parameters are saved in the same path.file.namegives information for the corresponding file names.The files which are necessary to define the geographical longitude and latitude information are specified in gridfile.pathand gridfile.filename.The data is on a monthly timescale, which is defined in what.timesteps and the horizontal resolution is not specified here as it is optional.
It is advisable to store this list as a single file on your system.You should share this file with colleagues using the same IT infrastructure to use synergies.Such a file can also be Introducing the wux Package created in an automated way using the function CMIP5toModelinput, for data obtained with CMIP5fromESGF (see the manual for more details).Next, we want to tell models2wux to get climate change signals of both simulations we just defined above.In this example we are specifically interested in the temperature changes for the Alpine area at the end of the 21st century.Therefore we specify a user input file which contains a named list with all the necessary information: The userinput argument tells models2wux to process air_temperature (parameter.names) for both models CanESM2-r1i1p1_rcp85 and NorESM1-M-r1i1p1_rcp85 (climate.modelstag).We define our base period (tag reference.period)to be 1971-2000 and the projected future period of interest (tag scenario.period)for the climatic change to be 2071-2100.We want the data to be aggregated to seasons summer (June, July, August: JJA), autumn (SON), winter (DJF) and spring (MAM).For each of those seasons models2wux returns the climate change signal defined by the user by calculating scenario.periodminus reference.period(for precipitation, changes are in addition calculated relative to reference.period).When setting the attribute time.series to TRUE, the output would be a transient time series instead of climate change.
We want to aggregate over the spatial extend of the Alpine area (AL, see Christensen and Christensen (2007)), which is defined in the subregions tag.Here it is a named vector of longitude and latitude coordinates and it defines a rectangular region (western, eastern, northern and southern coordinates of the corners).There are plenty of other ways to define a subregion, like reading in shapefiles.To analyze which model grid cells lie within the specified region, we can specify plot.subregion(see Figure 2).We usually want to aggregate all model cells which lie within the specified region, however, sometimes we would like to down-weight those cells which only partly contribute to the considered region.Setting area.fraction as TRUE weights the cells corresponding to the area covered by the subregion (Figure 2).Furthermore, area.fraction=TRUE is necessary, if the size of the subregion is in the same order of magnitude as the grid cell.Such cases should be handled with care, since the grid point interpretation of climate models is problematic.In most cases, the analyzed subregions should be much larger than the grid size of the models and the error produced by setting area.fraction to FALSE is negligible and processing gains a massive speed up.The data frame will also be saved as a comma-separated file to /tmp/wuxexample.
The CLIMATE MODEL STATISTICS output shows a descriptive analysis of the continuous variables in the data set based on all n = 22 climate simulations available.In this case the continuous variables are the relative change of precipitation (perc.delta.precipitation_amount) in percent and the absolute change of temperature (delta.air_temperature) in • C. The precipitation change in the GAR is not significant for either season, but there is a tendency in DJF for a slight increase of total precipitation.In contrast to that, the change signal for temperature is significant for all seasons showing quite an uniform warming, where MAM seems to have the smallest trend.
Also, functions for a graphical overview of the climate model ensemble are available in wux.
The method plot for a wux.df object draws one or more scatterplots containing climate change signals of selected meteorological parameters.This draws a simple scatterplot which accounts for certain meta-information of the climate change data frame and allows to highlight certain models.One of the scatterplots produced by this call is shown on the left side of Figure 3.This is a very useful plot as it gives a good overview on the model behavior and the climate change uncertainty.In our example, some models project an increase in precipitation change, whereas some project a decline.No correlation between temperature and precipitation change is visible on this small spatial scale.

Data reconstruction methods
Due to limited computational capacities, even in large-scale climate modeling projects such as CMIP5 or CORDEX (Jacob, Petersen, Eggert, Alias, Christensen, Bouwer, Braun, Colette, Déqué, Georgievski, Georgopoulou, Gobiet, Menut, Nikulin, Haensler, Hempelmann, Jones, Keuler, Kovats, Kröner, Kotlarski, Kriegsmann, Martin, Meijgaard, Moseley, Pfeifer, Preuschmann, Radermacher, Radtke, Rechid, Rounsevell, Samuelsson, Somot, Soussana, Teichmann, Valentini, Vautard, Weber, and Yiou 2013) only a limited number of climate simulations can be realized and it is a question of the experimental design which uncertainty components are primarily tackled within the ensemble.Therefore, missing realizations within climate projection ensembles are a common problem and even simple ensemble estimates such as mean and variability for e.g.temperature changes are potentially biased due to unequal sampling of the uncertainty components.In order to avoid such biases, Déqué, Rowell, Lüthi, Giorgi, Christensen, Rockel, Jacob, Kjellström, Castro, and Hurk (2007) introduced an iterative data reconstruction method which assumes additivity between uncertainty components in order to estimate the missing climate change signals.This reconstruction method was further applied in several studies in order to obtain a balanced design for the analysis of variance components (Déqué et al. 2007;Heinrich, Gobiet, and Mendlik 2014;Prein, Gobiet, and Truhetz 2011;Déqué, Somot, Sanchez-Gomez, Goodess, Jacob, Lenderink, and Christensen 2011;Mendlik and Gobiet 2015).In wux, we implemented the method of Déqué et al. (2007) for a two-factorial design (reconstruct) such as realized in the ENSEMBLES project (van der Linden and Mitchell 2009).In ENSEMBLES, a set of 21 high resolution RCM simulations with a horizontal grid spacing of about 25 km was produced.The ensemble consists of 8 GCMs and 16 RCMs only forced by the A1B emission scenario, but due to limited computational resources, only a small fraction (16.4 % of the possible GCM-RCM combinations) could be realized.The result of such a reconstruction is shown in Figure 3.In that case, filling up the missing GCM-RCM combinations does not alter the distribution of temperature and precipitation change.However, as the method relies on an implicit formulation of the uncertainty components, it cannot be used to extend the ensemble to GCMs that have not been used as driver for any RCM in the ensemble.Further reconstruction methods which are able to extend the ensemble to GCMs outside of the original design are investigated in Heinrich et al. (2014).

Example: Further statistical analysis
It is one of the key strengths of this package to be directly implemented in R and for that reason to have direct access to a huge magnitude of statistical methods to analyze climate data.We provide an example application in this section to show possible extensions based fully on the wux.df.We use a linear mixed effects model from the lme4 package (Bates, Mächler, Bolker, and Walker 2015) to estimate the average summer temperature trend over the Greater Alpine Region based on individual time-series of 16 GCMs from the CMIP5 ensemble under a moderate stabilization scenario (RCP 4.5).
To generate the appropriate wux.df, the timeseries tag in the userinput file was set TRUE (see Section 3).The aim here is to get an average linear trend while accounting for the unbalanced model design.Several of the GCMs were run a couple of times (up to 10 times) with different initial conditions, which induces a dependency structure in the data set.We assess for this dependency by putting random effects in the linear model: where Y ijk is the average summer temperature projected by i = 1, . . ., 16 GCMs with j = 1, . . ., n i runs per GCM and k = 1, . . ., 130 yearly time steps.The random effects are defined as We use the lmer function from the lme4 package for our analysis to estimate the fixed effects β0 , β1 and to predict the individual random effects b0 = ( b0,1 , . . ., b0,16 ) , b1 = ( b1,1 , . . ., b1,16 ) .
The time-series data and the trends are shown in Figure 4 plotted with the lattice package (Sarkar 2008).

Conclusion
It is crucial in climate research not only to analyze outcomes of single climate models, but to consider entire multi-model ensembles, as it is virtually demanded in every climate impact related study to assess the associated uncertainties of the projected changes.There is, however, definitely a technical challenge to process large amount of climate simulations at once and not many tools exist to assess this problem.Another more general problem arises from the measure of uncertainty in multi-model ensembles.It is somewhat uncomfortable to make statistical inference on multi-model ensembles, as they do not stem from a designed experiment (Knutti et al. 2010), are utterly unbalanced (Déqué et al. 2007), and are known to be biased (Maraun et al. 2010;Themeßl, Gobiet, and Leuprecht 2011).
The focus here is not to show solutions for sophisticated statistical analyses of climate datasets, but merely to present a flexible and easy-to-use tool which is able to pre-process the datasets for further statistical analysis.This way, the user can focus on solving the grand challenges of statistical inference of multi-model datasets and does not need to spend valuable resources on technical data issues.However, this package also provides some functions for a first exploratory data analysis, as e.g. a summary function and some plotting routines.Such simple analysis provide very valuable information on the multi-model ensemble.In addition, we also provide a couple of methods to address the issue of unbalanced experiment designs.Several methods from literature are implemented to fill up the incomplete data matrix (Déqué et al. 2007;Heinrich et al. 2014).
It should be kept in mind, that also other software packages exist which partly fulfill similar tasks (e.g.climate explorer, CDO, NCL).The climate explorer can be a very convenient way to have a quick descriptive analysis of a multi-model ensemble.It is easy to use, but it is also restricted to a non-programming environment.Also, one can analyze only models which are implemented in the system, and the statistical methods are restricted as well.It should be noted, that no spatial analysis is currently possible within wux, as the emphasize lies on averaged domains.For spatial maps, tools as CDO or NCL are far better suited.Another limitation can be the hardware needed to process large datasets.R is not the most memoryefficient environment and one can run into trouble when reading climate simulations with a very high spatial resolution.
To sum it up, wux is a very flexible tool dealing with different aspects of climate model uncertainty in climate change impact investigations and enables a quick analysis of climate scenario uncertainty, which typically demands a considerable technical effort as well as fundamental knowledge about climate modeling.It can be used to achieve a quick overview on the involved uncertainties to identify the most important sources of uncertainty or to select representative sub-ensembles to be used as input for impact studies.wux is fully flexible regarding the meteorological parameter and region under consideration and is able to assess uncertainties based on multiple user-defined parameters.

3.
Spatial aggregation (arithmetic mean) over geographical domain.4. Computing climate change signal for specified periods.The resulting climate change signals for each climate model are returned to a data frame.

Figure 2 :
Figure 2: Grid cells of the NorESM1-M climate model being aggregated.On the left side area.fraction is switched off, taking all cells with their centroids lying within the AL region and weight them equally.The right figure has area.fractionon: The smaller the circles, the smaller the coverage of the model cells and the smaller their weight.

Figure 3 :
Figure 3: Projected changes of summer precipitation and temperature of the ENSEMBLES models from 1961-1990 to 2021-2050 in the Greater Alpine Region.The left plot shows the originally available 22 RCMs, whereas the right plot depicts a reconstructed dataset filled up with the function reconstruct.

Figure 4 :
Figure 4: Time-series of GCMs from the CMIP5 ensemble for summer temperature in the Alpine region.The estimated average trend β1 is shown as a bold line, the predicted random effects trends are shown as a dashed line.The simulations are ordered from low trend (lower left panel) to high trend (upper right panel).
The function models2wux fulfills exactly this task by processing magnitudes of binary climate model data to a R data frame of climate change signals.Subse-Introducing the wux Package quently, the user can take advantage of the vast amount of methods available in R, to analyze this data set.
Ulden, Haarsma, Sterl, Severijns, Hazeleger, and Dijkstra 2009)sma, Sterl, Severijns, Hazeleger, and Dijkstra 2009)and NCL 4 .All of those tools are designed to perform some sort of descriptive analysis and/or process the data to a desired format, however, none of those tools combines both easy multi-model handling and flexibility in statistical analysis. Fr example the climate explorer allows very straight forward processing of multi-model ensembles without any programming work. Te user specifies what climate models to analyze simply by clicking on their names and the desired statistics.Such web-based tools however, being simple to use, lack of flexibility for a real programming interface.In addition it is not possible to extend those tools for own climate simulations which are not implemented.Also, statistical analysis is restricted to available methods.More programming-oriented tools like CDO and NCO also provide possibilities to analyze ensembles of climate simulations.However, the user has to specify the location of the data each time when calling a function and the data have to be pre-formatted for the program to understand its meaning.Changing local NetCDF files too much is a restriction to reproducible research.Even though programming is possible, we are restricted to pre-defined CDO statistics operators.The main difference of wux compared to those tools is the easy way it can read in a multitude of climate simulations and simply the fact that this tool is embedded in R, which allows to apply a very broad range of sophisticated statistical tools and is not restricted only by methods implemented in the toolbox itself.

Table 1 :
in Table 1: Most important functionalities of the wux package.