Advanced data analysis on Panthaion

Updated April 10, 2026

This guide is for researchers and analysts who are comfortable with the Panthaion workspace environment and want to build production-grade pipelines, apply statistical and machine learning methods to climate and environmental data, and produce analyses that are fully reproducible and citable.

Step 1: Structure your workspace as a pipeline

At the advanced level, a Panthaion workspace is not a scratchpad — it is a documented pipeline that someone else should be able to run top to bottom, on a different machine, and get the same result. Structure it into clear stages: configuration, imports, data loading, cleaning, analysis, visualisation, and outputs. All parameters go in the configuration cell at the top. No hardcoded values anywhere else.
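A configuration cell of this kind might look like the sketch below. Every name in it (DATA_DIR, the year bounds, the target column) is illustrative, not a Panthaion convention — the point is that downstream cells read from CONFIG rather than hardcoding values.

```python
# Configuration cell — every tunable parameter lives here; nothing
# downstream hardcodes a path, a year range, or a seed.
from pathlib import Path

CONFIG = {
    "data_dir": Path("data"),          # where raw datasets live
    "start_year": 1990,                # first year included in the analysis
    "end_year": 2020,                  # last year included in the analysis
    "target_column": "temperature_c",  # variable the pipeline analyses
    "random_state": 42,                # single seed reused everywhere
}
```

Later cells then refer to `CONFIG["start_year"]` and friends, so changing a parameter means editing exactly one cell.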

Use assert statements to catch problems early:

assert "temperature_c" in df.columns                   # expected columns exist
assert len(df_merged) == len(df_primary)            # merge didn't drop rows
Tip

Add a provenance cell near the top recording the dataset DOIs, the date last run, and key library versions with pd.__version__. This is essential for reproducibility and makes citation straightforward.
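A minimal provenance cell could be sketched like this. The DOI string is a placeholder, not a real Panthaion identifier — substitute the actual dataset DOI.

```python
# Provenance cell — records what this workspace depends on so a reader
# can cite the data and reproduce the environment.
from datetime import date
import pandas as pd

provenance = {
    "dataset_doi": "10.0000/placeholder-doi",  # placeholder — use the real DOI
    "last_run": date.today().isoformat(),
    "pandas_version": pd.__version__,
}
for key, value in provenance.items():
    print(f"{key}: {value}")
```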

Step 2: Work efficiently with large datasets

The Panthaion Ecosystem includes datasets that span decades and cover global spatial extents. Load only what you need:

pd.read_parquet("global_ocean.parquet", columns=["date", "region", "sst_c"])

import pyarrow.parquet as pq
pq.read_table("data.parquet", filters=[("year", ">=", 2000)])  # filter before loading

For multi-dimensional climate data in NetCDF format, use xarray rather than pandas — it is built for labelled arrays with time, latitude, and longitude dimensions and integrates cleanly with Panthaion's atmospheric and ocean datasets.
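A small sketch of the xarray workflow, using a synthetic labelled array rather than a real Panthaion dataset — the variable and coordinate names here are illustrative, not a Panthaion schema:

```python
import numpy as np
import pandas as pd
import xarray as xr

# A small labelled array: 24 months × 3 latitudes × 4 longitudes.
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"sst_c": (("time", "lat", "lon"), rng.normal(15, 2, (24, 3, 4)))},
    coords={
        "time": pd.date_range("2020-01-01", periods=24, freq="MS"),
        "lat": [-10.0, 0.0, 10.0],
        "lon": [0.0, 90.0, 180.0, 270.0],
    },
)

subset = ds.sel(time=slice("2020-06", "2020-12"))    # label-based slicing, inclusive
spatial_mean = ds["sst_c"].mean(dim=["lat", "lon"])  # one value per month
```

Selecting by label (`.sel`) rather than by integer position is what makes xarray code readable for climate data — the slice above says "June through December 2020" directly.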

Tip

Use df.memory_usage(deep=True).sum() / 1e6 to check how many megabytes your dataframe occupies. Casting float64 columns to float32 where full precision isn't needed can cut memory use in half.
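The effect of downcasting can be checked directly. The sketch below uses a synthetic single-column dataframe; on real data, downcast only columns where float32's roughly seven significant digits are enough.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sst_c": np.random.default_rng(0).normal(15, 2, 100_000)})
before = df.memory_usage(deep=True).sum() / 1e6  # megabytes at float64

# Sea-surface temperatures don't need float64 precision.
df["sst_c"] = df["sst_c"].astype("float32")
after = df.memory_usage(deep=True).sum() / 1e6

print(f"{before:.1f} MB -> {after:.1f} MB")
```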

Step 3: Apply statistical analysis

Descriptive statistics are a starting point, not a conclusion. Use inferential and time series methods to draw defensible conclusions:

from scipy import stats

slope, intercept, r, p, se = stats.linregress(df["year"], df["temperature_c"])

stats.ttest_ind(df_pre["temperature_c"], df_post["temperature_c"])

stats.mannwhitneyu(a, b, alternative="two-sided")  # for non-normal data

from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(df.set_index("date")["temperature_c"], model="additive", period=12)

Report slope, p-value, and confidence interval together — never slope alone.
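One way to report all three together, sketched on synthetic data (a fabricated warming trend, used only so the example runs standalone): linregress returns a standard error, from which a 95% confidence interval for the slope follows via the t distribution with n − 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
years = np.arange(1990, 2021)
temps = 14.0 + 0.02 * (years - 1990) + rng.normal(0, 0.1, years.size)  # synthetic trend

res = stats.linregress(years, temps)
# 95% CI for the slope: slope ± t_crit * standard error
t_crit = stats.t.ppf(0.975, df=years.size - 2)
ci = (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)

print(f"slope = {res.slope:.4f} °C/yr, p = {res.pvalue:.2e}, "
      f"95% CI = [{ci[0]:.4f}, {ci[1]:.4f}]")
```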

Tip

Always state your null hypothesis before running a test, not after. Choosing a test because it gives you a significant result is p-hacking. Choose the test because it matches the structure of your data.

Step 4: Build and evaluate a predictive model

Start with a clean train-test split to avoid data leakage, then fit a baseline before trying anything more complex:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from sklearn.metrics import mean_absolute_error, r2_score

# For time series — never shuffle, test set must be most recent
train = df[df["date"] < "2022-01-01"]
test  = df[df["date"] >= "2022-01-01"]

Evaluate with MAE, RMSE, and R-squared together. A model with high R-squared but poor MAE on held-out data is overfitting.
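A baseline-first workflow might look like the sketch below, on synthetic data so it runs standalone. Scikit-learn's DummyRegressor predicts the training mean; any real model must beat it on held-out MAE to justify its complexity.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (500, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1, 500)  # synthetic linear signal plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

mae_baseline = mean_absolute_error(y_test, baseline.predict(X_test))
mae_model = mean_absolute_error(y_test, model.predict(X_test))
print(f"baseline MAE = {mae_baseline:.2f}, model MAE = {mae_model:.2f}")
```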

Tip

Use cross-validation rather than a single train-test split wherever possible. cross_val_score gives you a distribution of performance estimates — far more informative when publishing results.
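A sketch of cross_val_score on the same kind of synthetic data: five folds give five out-of-sample MAE estimates, so you can report a mean and spread rather than a single number. (For time series, pass a TimeSeriesSplit as cv instead of the default, so folds respect temporal order.)

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1, 200)  # synthetic linear signal

# scoring is negated so higher is better; flip the sign to get MAE back.
scores = -cross_val_score(LinearRegression(), X, y,
                          cv=5, scoring="neg_mean_absolute_error")
print(f"MAE per fold: {np.round(scores, 2)}")
print(f"mean = {scores.mean():.2f} ± {scores.std():.2f}")
```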

Step 5: Build modular, importable code

Once you have functions used across multiple analyses, move them into a module. Create panthaion_utils.py in the same directory as your workspaces and import from it:

from panthaion_utils import monthly_mean, plot_trend

# For larger projects, organise into submodules
from utils.stats import detect_trend
Tip

Write at least one test for every function you move into a module. Run !pytest utils/ from a terminal cell. Catching a bug in a shared function before it propagates across five workspaces saves significant time.
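A test for a hypothetical monthly_mean helper might look like the sketch below. The function is defined inline here only so the example is self-contained; in practice it would live in panthaion_utils.py and the test in a file under utils/ that pytest discovers.

```python
import pandas as pd

def monthly_mean(df, value_col):
    """Mean of value_col per calendar month; assumes a 'date' column."""
    return df.groupby(df["date"].dt.to_period("M"))[value_col].mean()

def test_monthly_mean_groups_by_month():
    df = pd.DataFrame({
        "date": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-10"]),
        "temperature_c": [10.0, 14.0, 8.0],
    })
    result = monthly_mean(df, "temperature_c")
    assert result.loc[pd.Period("2021-01", freq="M")] == 12.0  # (10 + 14) / 2
    assert result.loc[pd.Period("2021-02", freq="M")] == 8.0

test_monthly_mean_groups_by_month()  # pytest would discover and run this automatically
```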

Step 6: Create publication-ready visualisations

Set a consistent style at the top of your workspace to remove chart junk without effort on individual plots:

plt.rcParams.update({"font.size": 11, "axes.spines.top": False,
                     "axes.spines.right": False, "figure.dpi": 150})

plt.fill_between(x, lower, upper, alpha=0.2)  # always show uncertainty

fig.savefig("fig1_temperature_trend.pdf", bbox_inches="tight", dpi=300)

For spatial data, use cartopy to project Panthaion's geographic datasets onto proper map projections. A flat scatter plot of lat-long coordinates is not a map.

Step 7: Automate quality checks before publishing

Build a validation function that runs as the final cell of every workspace before submitting to the Panthaion Ecosystem:

assert df["sst_c"].between(-2, 35).all(), "SST values outside plausible ocean range"
assert not df.duplicated(subset=["date", "region"]).any(), "Duplicate rows detected"
Tip

Log validation output to a markdown cell using IPython.display.Markdown. This creates a human-readable sign-off at the end of the workspace that confirms it is ready for Panthaion submission.
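One way to sketch this: gather the checks into a function that returns a markdown string, then render it. The check names, column names, and toy dataframe below are illustrative; in a live workspace you would pass the string to IPython.display.Markdown rather than print it.

```python
import pandas as pd

def validation_report(df):
    """Run final checks and return a markdown summary for sign-off."""
    checks = {
        "no duplicate date/region rows": not df.duplicated(subset=["date", "region"]).any(),
        "SST within plausible ocean range": df["sst_c"].between(-2, 35).all(),
    }
    lines = ["## Validation report"]
    for name, passed in checks.items():
        lines.append(f"- {'PASS' if passed else 'FAIL'}: {name}")
    return "\n".join(lines)

df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-01"]),
    "region": ["north", "south"],
    "sst_c": [4.5, 18.2],
})
report = validation_report(df)
print(report)  # in a workspace: display(Markdown(report))
```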

Troubleshooting

My pipeline produces different results each time I run it.
Check for any random operations that are not seeded. Set random_state=42 in all scikit-learn calls, and seed the global generators at the top of your workspace with import random; random.seed(42) and import numpy as np; np.random.seed(42).
 
My model performs well during training but poorly on test data.
This is overfitting. Reduce model complexity, add regularisation, or increase training data. For time series, check that no future data is leaking into your training set through feature engineering.
 
My xarray operations are running very slowly.
Make sure your data is chunked appropriately for the operation you are running. Loading a global NetCDF as a single chunk and then slicing it is much slower than loading it with chunks aligned to your slice dimensions.