Advanced data analysis on Panthaion
This guide is for researchers and analysts who are comfortable with the Panthaion workspace environment and want to build production-grade pipelines, apply statistical and machine learning methods to climate and environmental data, and produce analyses that are fully reproducible and citable.
Step 1: Structure your workspace as a pipeline
At the advanced level, a Panthaion workspace is not a scratchpad — it is a documented pipeline that someone else should be able to run top to bottom, on a different machine, and get the same result. Structure it into clear stages: configuration, imports, data loading, cleaning, analysis, visualisation, and outputs. All parameters go in the configuration cell at the top. No hardcoded values anywhere else.
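A minimal configuration cell might look like the sketch below. The paths and parameter names are illustrative, not part of Panthaion's API:

```python
# --- Configuration: every tunable parameter lives here, nowhere else ---
CONFIG = {
    "data_path": "data/temperature.parquet",   # illustrative path
    "start_year": 2000,
    "baseline_period": (1961, 1990),           # reference period for anomalies
    "output_dir": "outputs/",
    "random_seed": 42,
}
```

Downstream cells read from `CONFIG` rather than repeating literals, so changing a parameter means editing exactly one place.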
Use assert statements to catch problems early:
assert len(df_merged) == len(df_primary), "merge changed the row count"
Add a provenance cell near the top recording the dataset DOIs, the date last run, and key library versions with pd.__version__. This is essential for reproducibility and makes citation straightforward.
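A provenance cell can be as simple as printing the identifiers and versions the analysis depends on. The DOI below is a placeholder to be replaced with the real one:

```python
import sys
import datetime
import pandas as pd
import numpy as np

print("Dataset DOI: 10.xxxx/placeholder")  # replace with the dataset's real DOI
print("Last run:", datetime.date.today().isoformat())
print("Python:", sys.version.split()[0])
print("pandas:", pd.__version__, "| numpy:", np.__version__)
```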
Step 2: Work efficiently with large datasets
The Panthaion Ecosystem includes datasets that span decades and cover global spatial extents. Load only what you need:
import pyarrow.parquet as pq
df = pq.read_table("data.parquet", filters=[("year", ">=", 2000)]).to_pandas()  # filter before loading
For multi-dimensional climate data in NetCDF format, use xarray rather than pandas — it is built for labelled arrays with time, latitude, and longitude dimensions and integrates cleanly with Panthaion's atmospheric and ocean datasets.
Use df.memory_usage(deep=True).sum() / 1e6 to check how many megabytes your dataframe occupies. Casting float64 columns to float32 where full precision isn't needed can cut memory use in half.
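A sketch of the downcasting check on a small synthetic frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"temperature_c": np.random.rand(100_000)})  # float64 by default
before_mb = df.memory_usage(deep=True).sum() / 1e6

# Cast to float32 where ~7 significant digits are enough
df["temperature_c"] = df["temperature_c"].astype("float32")
after_mb = df.memory_usage(deep=True).sum() / 1e6

print(f"{before_mb:.2f} MB -> {after_mb:.2f} MB")  # roughly halved
```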
Step 3: Apply statistical analysis
Descriptive statistics are a starting point, not a conclusion. Use inferential and time series methods to draw defensible conclusions:
from scipy import stats
from statsmodels.tsa.seasonal import seasonal_decompose

slope, intercept, r, p, se = stats.linregress(df["year"], df["temperature_c"])
stats.ttest_ind(df_pre["temperature_c"], df_post["temperature_c"])
stats.mannwhitneyu(a, b, alternative="two-sided")  # for non-normal data
result = seasonal_decompose(df.set_index("date")["temperature_c"], model="additive", period=12)
Report slope, p-value, and confidence interval together — never slope alone.
Always state your null hypothesis before running a test, not after. Choosing a test because it gives you a significant result is p-hacking. Choose the test because it matches the structure of your data.
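One way to report slope, p-value, and confidence interval together, using the standard error that `linregress` returns. The data here is synthetic, and the ±1.96·SE interval assumes approximately normal errors:

```python
import numpy as np
from scipy import stats

years = np.arange(1980, 2024)
temps = 0.02 * (years - 1980) + 14 + np.random.default_rng(0).normal(0, 0.1, len(years))

res = stats.linregress(years, temps)
ci_low = res.slope - 1.96 * res.stderr   # approximate 95% confidence interval
ci_high = res.slope + 1.96 * res.stderr
print(f"slope = {res.slope:.4f} degC/yr, p = {res.pvalue:.2e}, "
      f"95% CI [{ci_low:.4f}, {ci_high:.4f}]")
```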
Step 4: Build and evaluate a predictive model
Start with a clean train-test split to avoid data leakage, then fit a baseline before trying anything more complex:
from sklearn.metrics import mean_absolute_error, r2_score
# For time series — never shuffle, test set must be most recent
train = df[df["date"] < "2022-01-01"]
test = df[df["date"] >= "2022-01-01"]
Evaluate with MAE, RMSE, and R-squared together. A model with high R-squared but poor MAE on held-out data is overfitting.
Use cross-validation rather than a single train-test split wherever possible. cross_val_score gives you a distribution of performance estimates — far more informative when publishing results.
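For time series, one way to do this is to pair `cross_val_score` with `TimeSeriesSplit`, so every validation fold is strictly later than its training data. The data and feature here are synthetic and illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(42)
X = np.arange(200).reshape(-1, 1)              # time index as the only feature
y = 0.05 * X.ravel() + rng.normal(0, 1, 200)   # trend + noise

scores = cross_val_score(LinearRegression(), X, y,
                         cv=TimeSeriesSplit(n_splits=5),
                         scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)                # a distribution, not a single number
```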
Step 5: Build modular, importable code
Once you have functions used across multiple analyses, move them into a module. Create panthaion_utils.py in the same directory as your workspaces and import from it:
# For larger projects, organise into submodules
from utils.stats import detect_trend
Write at least one test for every function you move into a module. Run !pytest utils/ from a terminal cell. Catching a bug in a shared function before it propagates across five workspaces saves significant time.
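A minimal test for a `detect_trend` helper might look like this. The function body below is an illustrative stand-in, not Panthaion's actual utility:

```python
import numpy as np
from scipy import stats

def detect_trend(years, values, alpha=0.05):
    """Return (slope, significant) for a simple linear trend test."""
    res = stats.linregress(years, values)
    return res.slope, res.pvalue < alpha

def test_detect_trend_finds_warming():
    years = np.arange(1980, 2020)
    # Clean 0.03 degC/yr trend plus small noise
    values = 0.03 * (years - 1980) + np.random.default_rng(1).normal(0, 0.01, len(years))
    slope, significant = detect_trend(years, values)
    assert significant
    assert abs(slope - 0.03) < 0.01
```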
Step 6: Create publication-ready visualisations
Set a consistent style at the top of your workspace to remove chart junk without effort on individual plots:
"axes.spines.right": False, "figure.dpi": 150})
plt.fill_between(x, lower, upper, alpha=0.2) # always show uncertainty
fig.savefig("fig1_temperature_trend.pdf", bbox_inches="tight", dpi=300)
For spatial data, use cartopy to project Panthaion's geographic datasets onto proper map projections. A flat scatter plot of lat-long coordinates is not a map.
Step 7: Automate quality checks before publishing
Build a validation function that runs as the final cell of every workspace before submitting to the Panthaion Ecosystem:
assert not df.duplicated(subset=["date", "region"]).any(), "Duplicate rows detected"
Log validation output to a markdown cell using IPython.display.Markdown. This creates a human-readable sign-off at the end of the workspace that confirms it is ready for Panthaion submission.
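A sketch of such a validation cell. The column names and plausibility thresholds are illustrative:

```python
import pandas as pd

def validate(df):
    """Run final quality checks; raises AssertionError on the first failure."""
    checks = {
        "no duplicate date/region rows": not df.duplicated(subset=["date", "region"]).any(),
        "no missing temperatures": df["temperature_c"].notna().all(),
        "temperatures physically plausible": df["temperature_c"].between(-90, 60).all(),
    }
    for name, passed in checks.items():
        assert passed, f"Validation failed: {name}"
    return "All checks passed - ready for Panthaion submission"

df = pd.DataFrame({"date": ["2023-01-01", "2023-01-02"],
                   "region": ["north", "north"],
                   "temperature_c": [1.5, 2.1]})
print(validate(df))
```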
Troubleshooting
If repeated runs of the same workspace give different results, pin the randomness: set random_state=42 in all scikit-learn calls and add import random; random.seed(42) at the top of your workspace.
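A seeding cell covering the sources of randomness most analyses touch:

```python
import random
import numpy as np

SEED = 42
random.seed(SEED)                  # Python's built-in RNG
np.random.seed(SEED)               # legacy NumPy global RNG (used by some libraries internally)
rng = np.random.default_rng(SEED)  # preferred modern NumPy generator for your own draws

print(rng.integers(0, 100, 3))     # same three numbers on every run
```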