2. Basics of accessing data

2.1. Synopsis

  • How to open a netcdf file or remote dataset with netCDF4
  • List variables
  • Query variable shape
  • Read variable data

2.2. About netcdf

Most data used in the ESM world is gridded and stored using the netCDF format. Unfortunately, weather models use other formats, which we will simply have to ignore here. The netCDF format is a convention for storing data in files that are not human readable, but which can be read on any computer with the appropriate (freely available) tools.

The most widely used python package for reading and writing netCDF files is “netCDF4” (obtained with conda install netcdf4). There are other packages, but netCDF4 seems to be the most versatile and is compatible with all netCDF formats. The “scipy.io.netcdf” package is included in scipy but only handles version 3 files; version 4 files use HDF5 under the hood, which netCDF4 handles well.

Access the netCDF4 package using the following (usually at the top of your notebook):

[1]:
import netCDF4

Here we will walk through some specific operations using netCDF4.Dataset. To find out more about the netCDF4 package you can do

help(netCDF4)

to get extensive information.

2.3. Opening a dataset

Before looking at model output, let’s look at some regularly gridded data. We’ll use data served by the Lamont-Doherty Earth Observatory, which hosts a good collection of products and gridded data. You will need to be connected to the Internet to do the following:

[2]:
nc = netCDF4.Dataset('http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NODC/.WOA05/.Grid-5x5/.Annual/.mn/.temperature/dods')

Several things to note:

  • The argument to netCDF4.Dataset() is a URL, rather than a file name. Often you will be working with local files, in which case the argument would be a path to the file(s) of interest (a minimal sketch follows this list).
  • The cell should have executed quickly (maybe with no delay at all). This is because netCDF4.Dataset() fetches/reads only the meta-data and not the actual data. Usually, the meta-data is small compared to the actual data.
  • We stored the results of netCDF4.Dataset() in the variable nc. The name of this variable is arbitrary - we could just as easily have used fred = netCDF4.Dataset(...). Here, I chose nc to be short for netCDF container.
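
For reference, opening a local file looks the same, with a path in place of the URL. A minimal sketch (the path below is hypothetical; point it at any netCDF file on your system):

nc_local = netCDF4.Dataset('/path/to/some_file.nc', 'r')   # 'r' (read-only) is the default mode
print(nc_local)
nc_local.close()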

Let’s look at what netCDF4.Dataset() returned.

[3]:
print(nc)
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF3_CLASSIC data model, file format DAP2):
    Conventions: IRIDL
    dimensions(sizes): X(72), Y(36), Z(33)
    variables(dimensions): float32 Y(Y), float32 Z(Z), float32 X(X), float32 temperature(Z,Y,X)
    groups:

netCDF4.Dataset() returned an object that is a “class” defined by the netCDF4 package. We can see the following details:

  • The data model is the older NETCDF3 one, so scipy.io.netcdf would also work on this data.
  • The conventions used in the file are “IRIDL” meaning IRI Data Library. More on conventions later.
  • The data could be up to three dimensional, with the dimensions being “X”, “Y”, and “Z”, with lengths 72, 36 and 33, respectively.
  • There are four variables “X”, “Y”, “Z” and “temperature”.
  • “X”, “Y”, and “Z” are one-dimensional with the lengths of their namesake dimensions, i.e. variable X has the length of dimension X, namely 72. This is part of the IRIDL convention and is shared with other conventions such as CF. Such variables, which share a name with a dimension and have the length of that dimension, are referred to as “dimension variables”. They are special and are convenient to use as coordinates when the coordinate system is orthogonal and Cartesian (see the sketch after this list).
  • “temperature” has dimensions “Z,Y,X” so is three dimensional. The order of dimensions matters.
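
As hinted above, the dimensions can also be inspected directly through nc.dimensions, a dictionary of netCDF4.Dimension objects, and each has a matching dimension variable of the same name. A minimal sketch (the lengths are those reported in the printout above):

for name, dim in nc.dimensions.items():
    print(name, len(dim), nc.variables[name].shape)
# expect something along the lines of:  X 72 (72,)   Y 36 (36,)   Z 33 (33,)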

2.4. Variables

Here’s another way to see what variables are in the dataset:

[4]:
print(nc.variables)
OrderedDict([('Y', <class 'netCDF4._netCDF4.Variable'>
float32 Y(Y)
    pointwidth: 5.0
    standard_name: latitude
    gridtype: 0
    units: degree_north
unlimited dimensions:
current shape = (36,)
filling off
), ('Z', <class 'netCDF4._netCDF4.Variable'>
float32 Z(Z)
    positive: down
    gridtype: 0
    units: m
unlimited dimensions:
current shape = (33,)
filling off
), ('X', <class 'netCDF4._netCDF4.Variable'>
float32 X(X)
    pointwidth: 5.0
    standard_name: longitude
    gridtype: 1
    units: degree_east
unlimited dimensions:
current shape = (72,)
filling off
), ('temperature', <class 'netCDF4._netCDF4.Variable'>
float32 temperature(Z, Y, X)
    units: Celsius_scale
    missing_value: -99.9999
    long_name: Temperature
unlimited dimensions:
current shape = (33, 36, 72)
filling off
)])

We see that nc.variables is a python dictionary (OrderedDict) with keys “X”, “Y”, “Z” and “temperature”, each pointing to a netCDF4.Variable object. Those objects contain information such as units, data shape, and so on, which is not displayed very cleanly.

Knowing that nc.variables is a python dictionary we can get a more concise summary with:

[5]:
print(nc.variables.keys())
odict_keys(['Y', 'Z', 'X', 'temperature'])

or

[6]:
for v in nc.variables:
    print(v)
Y
Z
X
temperature

To access one entry of a python dictionary use the dict[key] syntax. For nc.variables the keys are ‘X’, ‘Y’, ‘Z’ and ‘temperature’.

Let’s examine the ‘Z’ object:

[7]:
print(nc.variables['Z'])
<class 'netCDF4._netCDF4.Variable'>
float32 Z(Z)
    positive: down
    gridtype: 0
    units: m
unlimited dimensions:
current shape = (33,)
filling off

This is what we saw before when we looked at nc.variables but now we have a more manageable volume of information (just the one variable).

  • The data type is 32-bit float (or real*4 in Fortran).
  • Positive Z means downward (Z is more like depth!).
  • Z has units of meters (“m”).
  • There are 33 Z values.
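
These attributes can also be read programmatically: netCDF4 exposes each attribute as a python attribute of the Variable object, and .ncattrs() lists their names. For example (the values in the comments are the ones printed above):

Z = nc.variables['Z']
print(Z.ncattrs())   # ['positive', 'gridtype', 'units']
print(Z.units)       # m
print(Z.positive)    # down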

2.5. Reading all or part of a 1-d data array

So far we have only seen meta-data, either about the file or about variables within the file.

To see actual values of a variable we use python slices (i.e. things like [:] or [:2]). Remember that when using the [s:e] notation, the selection starts at index “s” but stops before index “e”. A missing index implies the full extent.

For a one-dimensional array or list, A, with size n:

  • A[s:e] gives elements s, s+1, …, e-2, e-1
  • A[:e] gives elements 0, 1, …, e-2, e-1
  • A[s:] gives elements s, s+1, …, n-2, n-1
  • A[:] gives elements 0, 1, …, n-2, n-1
  • A[:-1] gives elements 0, 1, …, n-3, n-2
  • A[-3:] gives elements n-3, n-2, n-1
  • A[-4:-1] gives elements n-4, n-3, n-2
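
These rules are plain python and can be checked on an ordinary list before applying them to netCDF variables:

A = [10, 20, 30, 40, 50]   # n = 5
print(A[1:3])    # [20, 30]          elements 1 and 2
print(A[:2])     # [10, 20]          elements 0 and 1
print(A[3:])     # [40, 50]          elements 3 to n-1
print(A[:-1])    # [10, 20, 30, 40]  everything except the last element
print(A[-3:])    # [30, 40, 50]      the last three elements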

Only now, when we specify a slice of data, will the data values actually be read by python:

[8]:
nc.variables['Z'][:]
[8]:
masked_array(data=[   0.,   10.,   20.,   30.,   50.,   75.,  100.,  125.,
                    150.,  200.,  250.,  300.,  400.,  500.,  600.,  700.,
                    800.,  900., 1000., 1100., 1200., 1300., 1400., 1500.,
                   1750., 2000., 2500., 3000., 3500., 4000., 4500., 5000.,
                   5500.],
             mask=False,
       fill_value=1e+20,
            dtype=float32)

The returned data is in the form of a “masked_array” which is a numpy class (https://docs.scipy.org/doc/numpy/reference/maskedarray.html). This means the file/variable can support missing data. For coordinate data you more often get a simple “array” (https://docs.scipy.org/doc/numpy/reference/arrays.html).

Masked arrays have a .mask attribute which is usually a boolean array. Where True, the data is masked out or missing.

The fill_value is the value the array will appear to have where it is missing. You should never need to know this, but if a plot of the data ever shows a scale running up to “1e+20” then the masking isn’t working properly.
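
To see how the mask and fill_value interact, here is a small self-contained numpy sketch (unrelated to the remote dataset, just illustrating masked_array behaviour):

import numpy as np
a = np.ma.masked_array([5.0, -99.9999, 7.0],
                       mask=[False, True, False],
                       fill_value=-99.9999)
print(a.mask)      # [False  True False]
print(a.filled())  # the masked entry is replaced by fill_value: [  5.  -99.9999   7. ]
print(a.mean())    # 6.0 -- masked entries are ignored in computations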

Note that I omitted the ``print()`` command. This is a convenient feature, but be warned that Jupyter only displays the result of the last command in a cell. To be sure of seeing output it is generally safer to always use print(). However, the output sometimes looks different, as we can see here:

[9]:
print(nc.variables['Z'][:])
[   0.   10.   20.   30.   50.   75.  100.  125.  150.  200.  250.  300.
  400.  500.  600.  700.  800.  900. 1000. 1100. 1200. 1300. 1400. 1500.
 1750. 2000. 2500. 3000. 3500. 4000. 4500. 5000. 5500.]

Aside: The display of ``nc.variables['Z'][:]`` depends on context: in a print() or ``__str__()`` context just the values are shown, while the ``__repr__()`` context describes the underlying object. This is a detail of interest only if you are building/extending classes and is likely of no use to you other than as an explanation of why ``print(nc.variables['Z'][:])`` and ``nc.variables['Z'][:]`` give different output.

To read only the first five depths:

[10]:
print(nc.variables['Z'][:5])
[ 0. 10. 20. 30. 50.]

The last value can be found using negative indexing:

[11]:
print(nc.variables['Z'][-1])
5500.0

Python uses C-conventions so the first value is obtained with index 0:

[12]:
print(nc.variables['Z'][0])
0.0

2.6. Shape of data

numpy arrays have a shape, and the .shape attribute returns a tuple giving the length of each dimension of the variable or data. .shape also works on netCDF4.Variable objects without reading the data. Here are some examples:

[13]:
print( "nc.variables['Z'].shape =", nc.variables['Z'].shape )
print( "nc.variables['Z'][:].shape =", nc.variables['Z'][:].shape )
print( "nc.variables['Z'][:5].shape =", nc.variables['Z'][:5].shape )
print( "nc.variables['Z'][2:5].shape =", nc.variables['Z'][2:5].shape )
nc.variables['Z'].shape = (33,)
nc.variables['Z'][:].shape = (33,)
nc.variables['Z'][:5].shape = (5,)
nc.variables['Z'][2:5].shape = (3,)

2.7. Reading multi-dimensional data

Now we’ll look at the shape of slices from the three dimensional variable:

[14]:
nc.variables['temperature'].shape
[14]:
(33, 36, 72)

It is generally good practice not to load data unnecessarily. The above command returned the shape without reading the data into memory. The following reads all the data and then returns the shape:

[15]:
nc.variables['temperature'][:,:,:].shape
[15]:
(33, 36, 72)

The multiple colons are explicitly saying “all elements” along each dimension for which there is a colon (three dimensions in this case).

The same can be managed with implicit ranges simply by omission:

[16]:
nc.variables['temperature'][:,:].shape
[16]:
(33, 36, 72)
[17]:
nc.variables['temperature'][:].shape
[17]:
(33, 36, 72)

So not specifying a slice for a dimension defaults to the whole of that dimension.

Now suppose we want a multi-dimensional subset of the data. The following returns the shape of surface temperature because:

  • the first dimension for “temperature” is Z;
  • Z points downward so the first value is the shallowest;
  • the first index in a dimension is 0.
[18]:
nc.variables['temperature'][0,:,:].shape
[18]:
(36, 72)

Because of implicit slices on omitted dimensions, the following give the same result:

[19]:
nc.variables['temperature'][0,:].shape
[19]:
(36, 72)
[20]:
nc.variables['temperature'][0].shape
[20]:
(36, 72)

We can do the same along other dimensions. To sample all data corresponding to the 5th Y element (index 4):

[21]:
nc.variables['temperature'][:,4,:].shape
[21]:
(33, 72)
[22]:
nc.variables['temperature'][:,4].shape
[22]:
(33, 72)

As an extra, if you want an implicit “:” for all but the last dimension, then the following works:

[23]:
nc.variables['temperature'][...,0].shape
[23]:
(33, 36)

2.8. Summary

  • Open a file with nc = netCDF4.Dataset(filename) or a remote dataset with nc = netCDF4.Dataset(url)
  • Look at the list of variables with print(nc.variables) or print(nc.variables.keys())
  • Access a variable with var = nc.variables['var']
  • Look at the shape of a variable with var.shape or nc.variables['var'].shape
  • Read the variable data with var[:] or nc.variables['var'][2,:]
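
Putting these steps together, a minimal end-to-end sketch (the file name and variable name below are hypothetical; substitute your own file or URL):

import netCDF4

nc = netCDF4.Dataset('some_file.nc')      # or a URL for a remote dataset
print(nc.variables.keys())                # list the variables
var = nc.variables['temperature']         # assumes a variable named 'temperature'
print(var.shape)                          # shape comes from the meta-data only
data = var[0, :, :]                       # reading a slice pulls the actual values
nc.close()                                # close the file/connection when finished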