Development environment
Most useful Python packages
Data processing
This section lists the packages used for data reading, statistics, and big-data processing.
Pandas
https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html
Pandas supports many operations:
Read CSV files into DataFrames
Resample time series easily
Compute statistics
Plot data
conda install -c conda-forge pandas
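A minimal sketch of these four operations, assuming a hypothetical CSV file hourly_data.csv with a time column:

import pandas as pd

# Read a CSV file into a DataFrame ("hourly_data.csv" and its "time" column are hypothetical)
df = pd.read_csv("hourly_data.csv", parse_dates=["time"], index_col="time")

# Resample the time series to daily means
daily = df.resample("D").mean()

# Compute summary statistics
print(daily.describe())

# Plot the data
daily.plot()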
Xarray
https://xarray.pydata.org/en/stable/
Xarray is a package that handles labelled multi-dimensional arrays the way Pandas handles time series:
Open/read/write NetCDF files
Label arrays
Easy management of indexes and dates
Resample, interpolate
Plot data easily
conda install -c conda-forge xarray
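A minimal sketch of a typical workflow, assuming a hypothetical NetCDF file ocean.nc containing an sst variable with time, lat and lon dimensions:

import xarray as xr

# Open a NetCDF file as a labelled Dataset ("ocean.nc" and "sst" are hypothetical)
ds = xr.open_dataset("ocean.nc")

# Select by label rather than by integer position
january = ds["sst"].sel(time="2020-01")

# Resample to monthly means and interpolate onto new coordinates
monthly = ds["sst"].resample(time="1MS").mean()
profile = ds["sst"].interp(lat=45.0, lon=-4.0)

# Plot one time step and write the result back to NetCDF
monthly.isel(time=0).plot()
monthly.to_netcdf("sst_monthly.nc")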
XOA
https://xoa.readthedocs.io/en/latest/index.html
XOA is a package that helps read and manipulate observed and simulated ocean data. It is heavily based on xarray. For those who know it, it is the successor of vacumm.
conda install -c conda-forge xoa
DASK
Dask is a package that lets you easily parallelize your data processing, especially when dealing with large data volumes.
conda install dask -c conda-forge
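A minimal sketch with dask.array; the array here is synthetic and exists only to show the lazy, chunked execution model:

import dask.array as da

# A large array split into chunks; each chunk can be processed in parallel
x = da.random.random((100_000, 10_000), chunks=(10_000, 10_000))

# Operations only build a lazy task graph...
mean = x.mean(axis=0)

# ...which is executed, in parallel, on compute()
result = mean.compute()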
Xskillscore
https://xskillscore.readthedocs.io/en/latest/
It is a package that computes metrics from xarray structures and returns xarray data.
conda install -c conda-forge xskillscore
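A minimal sketch comparing a forecast to observations; both arrays are synthetic:

import numpy as np
import xarray as xr
import xskillscore as xs

# Synthetic observation and forecast fields on the same grid
obs = xr.DataArray(np.random.rand(3, 4), dims=["lat", "lon"])
fct = xr.DataArray(np.random.rand(3, 4), dims=["lat", "lon"])

# RMSE over the spatial dimensions; the result is again an xarray object
rmse = xs.rmse(obs, fct, dim=["lat", "lon"])
print(float(rmse))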
Plot and figures
Here are the packages used to do all the plots.
Matplotlib
https://matplotlib.org/stable/index.html
The most popular library for making plots and maps.
conda install -c conda-forge matplotlib
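A minimal sketch of a basic plot:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
plt.show()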
Cartopy
https://scitools.org.uk/cartopy/docs/latest/gallery/index.html
Cartopy is an extension to Matplotlib dedicated to producing maps from geospatial data.
It handles various projections.
conda install -c conda-forge cartopy
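A minimal sketch of a global map with coastlines in the Plate Carrée projection:

import matplotlib.pyplot as plt
import cartopy.crs as ccrs

# Create an axes with a map projection, then add geographic features
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()
ax.gridlines(draw_labels=True)
plt.show()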
Seaborn
Seaborn is a useful package for plotting statistical data stored in Pandas DataFrames. It is used here in the CORA part.
conda install seaborn -c conda-forge
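A minimal sketch on a small synthetic DataFrame in the tidy format seaborn expects:

import pandas as pd
import seaborn as sns

# Synthetic data: one categorical column, one numeric column
df = pd.DataFrame({"group": ["A", "A", "B", "B", "B"],
                   "value": [1.2, 0.8, 2.5, 3.1, 2.9]})

# Distribution of "value" per "group"
sns.boxplot(data=df, x="group", y="value")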
Python on Datarmor
What tool?
On Datarmor there is a platform for using Jupyter notebooks directly on the compute nodes.
First, connect to the following link using your intranet login:
https://datarmor-jupyterhub.ifremer.fr
Note
You can log in once connected to the Ifremer network (on site or via the VPN).
The documentation is available here:
Once connected, choose a job profile:
Choose one core if you do not need Dask, or if you want to use Dask with distributed jobs (see the next section)
Choose 28 cores if you want to use Dask on the 28 cores of a single node
Then choose your conda environment.
Warning
For JupyterLab to work, you must install the following packages in the conda environment:
conda install ipykernel jupyterhub jupyterlab nodejs -c conda-forge -y
pip install dask_labextension
Distributed work with Dask on Datarmor
A specificity of Dask is that it can distribute tasks over separate PBS jobs instead of keeping them on the local node you are connected to.
First, install the dask-jobqueue package:
conda install dask-jobqueue -c conda-forge
To see how it works, consult the following page:
https://jobqueue.dask.org/en/latest/howitworks.html
You will therefore sometimes see jobqueue directives for the computation at the top of scripts.
The only thing to do is to give dask-jobqueue the Datarmor configuration:
https://jobqueue.dask.org/en/latest/configuration.html
Here is the configuration file to use; create it at the following location:
vi ~/.config/dask/jobqueue.yaml

jobqueue:
  pbs:
    name: dask-worker

    # Dask worker options
    # number of processes and cores have to be equal to avoid using multiple
    # threads in a single dask worker. Using threads can generate netcdf file
    # access errors.
    cores: 28
    processes: 6

    # this is using all the memory of a single node and corresponds to about
    # 4GB / dask worker. If you need more memory than this you have to decrease
    # cores and processes above
    memory: 120GB
    interface: ib0
    death-timeout: 900

    # This should be a local disk attached to your worker node and not a
    # network mounted disk. See
    # https://jobqueue.dask.org/en/latest/configuration-setup.html#local-storage
    # for more details.
    local-directory: $TMPDIR

    # PBS resource manager options
    queue: mpi_1
    project: myPROJ
    walltime: '04:00:00'
    resource-spec: select=1:ncpus=28:mem=120GB
    # disable email
    job-extra: ['-m n']

distributed:
  scheduler:
    bandwidth: 1000000000  # 1 GB/s estimated worker-worker bandwidth
  worker:
    memory:
      target: 0.90     # Avoid spilling to disk
      spill: False     # Avoid spilling to disk
      pause: 0.80      # fraction at which we pause worker threads
      terminate: 0.95  # fraction at which we terminate the worker
  comm:
    compression: null
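With this file in place, a minimal sketch of starting a distributed cluster from a notebook; the number of jobs here is only an example:

from dask.distributed import Client
from dask_jobqueue import PBSCluster

# PBSCluster picks up its settings from ~/.config/dask/jobqueue.yaml
cluster = PBSCluster()

# Submit, for example, two PBS jobs, each starting the configured workers
cluster.scale(jobs=2)

# Connect a client; subsequent dask computations run on the PBS workers
client = Client(cluster)
print(client)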