Intermediate data analysis on Panthaion

Updated April 10, 2026

Once you are comfortable running cells and loading data in your Panthaion notebook, you are ready to go deeper. This guide covers the techniques that turn a basic script into a repeatable, shareable analysis — from reshaping messy datasets to building multi-panel charts and writing cleaner code.

Raw datasets from the Panthaion Ecosystem are often close to analysis-ready, but real work still requires cleaning, reshaping, and combining data before you can draw conclusions.

Step 1: Reshape and clean your data

The most common tasks are renaming columns, handling missing values, filtering rows, and converting data types.

import pandas as pd

df.columns = ["date", "temperature_c", "humidity_pct"]  # rename columns
df = df.dropna(subset=["temperature_c"])                # drop rows with missing key fields
df["humidity_pct"] = df["humidity_pct"].fillna(df["humidity_pct"].mean())  # fill gaps with column mean
df["date"] = pd.to_datetime(df["date"])                 # parse string to datetime

Tip

Run df.info() and df.describe() at the start of every analysis. These two commands give you a full picture of column types, missing value counts, and basic statistics before you write a single line of cleaning code.
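
As a minimal sketch, the cleaning steps above can be run end to end on a small made-up table. The raw frame, its original column names, and its values are hypothetical and stand in for whatever dataset you have loaded.

```python
import pandas as pd

# Hypothetical raw data standing in for a loaded dataset.
raw = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
    "Temp": [4.5, None, 6.0, 5.5],
    "Hum": [80.0, 75.0, None, 85.0],
})

df = raw.copy()
df.columns = ["date", "temperature_c", "humidity_pct"]

# Count missing values per column before cleaning.
missing_before = df.isna().sum()

# Drop rows without a temperature, then fill humidity gaps with the column mean.
df = df.dropna(subset=["temperature_c"])
df["humidity_pct"] = df["humidity_pct"].fillna(df["humidity_pct"].mean())

# Parse the date strings into real datetimes.
df["date"] = pd.to_datetime(df["date"])

# Count again to confirm nothing is left to clean.
missing_after = df.isna().sum()
```

Comparing missing_before with missing_after is a quick way to confirm that each cleaning step did what you expected.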

Step 2: Filter, group, and aggregate

Filtering and grouping are the core of most climate and environmental analyses. Select only the rows you need, then summarise:

df_recent = df[df["date"] >= "2020-01-01"]

df.groupby("region")["temperature_c"].mean()

df.groupby("region").agg(
  mean_temp=("temperature_c", "mean"),
  max_temp=("temperature_c", "max"),
)

df.set_index("date").resample("M")["temperature_c"].mean() # monthly averages; pandas 2.2+ prefers "ME"

Tip

Chain operations rather than creating a new variable at every step. df.dropna().groupby("region").mean() is easier to follow and leaves fewer throwaway variables in your workspace.
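
One way to combine the filter and the named aggregation in a single chain is sketched below. The sample frame and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data with two regions.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2019-06-01", "2020-01-15", "2020-02-15", "2020-01-20", "2020-02-10"]
    ),
    "region": ["north", "north", "north", "south", "south"],
    "temperature_c": [10.0, 2.0, 4.0, 12.0, 16.0],
})

# Filter to recent rows, group by region, and compute named aggregates
# in one chained expression instead of three intermediate variables.
summary = (
    df[df["date"] >= "2020-01-01"]
    .groupby("region")
    .agg(
        mean_temp=("temperature_c", "mean"),
        max_temp=("temperature_c", "max"),
    )
)
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps long chains readable.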

Step 3: Merge and join datasets

One of the most powerful features of the Panthaion Ecosystem is the ability to combine datasets from different sources. Once you have two dataframes loaded, merge them on a shared column:

pd.merge(df_climate, df_land, on="region_id", how="left")

Use how="left" to keep all rows from your primary dataset, how="inner" for only matching rows, and how="outer" to keep everything and investigate gaps. After merging, always check the row count and look for unexpected nulls. Duplicate keys in the joined dataset silently multiply rows, which is one of the most common sources of errors in data analysis.
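
The post-merge checks can be sketched as follows, assuming two small hypothetical frames. The validate and indicator arguments are standard pandas merge options: the first raises an error if the key is not unique where you expect it to be, and the second records where each row came from.

```python
import pandas as pd

# Hypothetical frames; region 3 has no land-cover record.
df_climate = pd.DataFrame(
    {"region_id": [1, 2, 3], "temperature_c": [11.2, 14.8, 9.5]}
)
df_land = pd.DataFrame({"region_id": [1, 2], "forest_pct": [42.0, 17.5]})

merged = pd.merge(
    df_climate,
    df_land,
    on="region_id",
    how="left",
    validate="many_to_one",  # raises MergeError if region_id repeats in df_land
    indicator=True,          # adds a _merge column: both / left_only / right_only
)

# A left join on a unique right-hand key should preserve the row count.
rows_match = len(merged) == len(df_climate)

# Rows from the primary dataset that found no partner.
unmatched = merged[merged["_merge"] == "left_only"]
```

Inspecting unmatched shows exactly which keys produced the unexpected nulls, which is usually faster than scanning the merged table by eye.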

Step 4: Build multi-panel charts

Single charts are fine for exploration, but publication-ready analyses usually need several panels. Use matplotlib subplots to arrange them:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))  # 1 row, 2 columns
df.plot(ax=axes[0])
df2.plot(ax=axes[1])
fig.suptitle("Regional temperature comparison")
plt.tight_layout()

For interactive multi-panel charts, use Plotly with facet columns — each region gets its own panel with hover, zoom, and comparison built in:

import plotly.express as px

px.line(df, x="date", y="temperature_c", facet_col="region")

Tip

Always label axes with units. A chart that says "temperature" is ambiguous. One that says "temperature (°C)" is citable.
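
A two-panel matplotlib figure with unit-labelled axes might look like the sketch below. The two frames, the panel names, and the values are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical daily temperatures for two regions.
dates = pd.date_range("2024-01-01", periods=6, freq="D")
df_north = pd.DataFrame(
    {"date": dates, "temperature_c": [1.0, 2.5, 2.0, 3.5, 4.0, 3.0]}
)
df_south = pd.DataFrame(
    {"date": dates, "temperature_c": [12.0, 13.5, 13.0, 14.5, 15.0, 14.0]}
)

# One row, two columns, sharing the y-axis so the panels are comparable.
fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
for ax, (name, frame) in zip(axes, [("North", df_north), ("South", df_south)]):
    ax.plot(frame["date"], frame["temperature_c"])
    ax.set_title(name)
    ax.set_xlabel("Date")
    ax.tick_params(axis="x", labelrotation=45)  # avoid overlapping date labels

# With a shared y-axis, one unit-bearing label on the left panel is enough.
axes[0].set_ylabel("Temperature (°C)")
fig.suptitle("Regional temperature comparison")
fig.tight_layout()
```

Sharing the y-axis is a design choice: it makes differences between regions visible at a glance, at the cost of flattening a panel whose range is much smaller than the other's.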

Step 5: Write reusable functions

Once you find yourself repeating the same cleaning or plotting steps across multiple cells, wrap them in a function. Define it once at the top of your workspace and call it anywhere:

def monthly_mean(df, col):
  """Return monthly mean for a given column."""
  return df.set_index("date").resample("M")[col].mean()  # pandas 2.2+ prefers "ME"

monthly_mean(df_ocean, "salinity_ppt")

Keep functions short and focused on one task. If a function is doing three things, split it into three functions.
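
As a sketch of a slightly more defensive variant, the function below takes the date column as a parameter and fails early with a clear message if that column has not been parsed. It groups by calendar month with to_period instead of resample, so months with no data are simply omitted rather than returned as NaN. The date_col parameter and the sample frame are assumptions made for illustration.

```python
import pandas as pd


def monthly_mean(df, col, date_col="date"):
    """Return the mean of `col` for each calendar month found in `date_col`."""
    if not pd.api.types.is_datetime64_any_dtype(df[date_col]):
        raise TypeError(
            f"{date_col!r} must be a datetime column; run pd.to_datetime first"
        )
    # Convert each timestamp to its calendar month, then average within months.
    months = df[date_col].dt.to_period("M")
    return df.groupby(months)[col].mean()


# Hypothetical ocean data.
df_ocean = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-10"]),
    "salinity_ppt": [35.0, 36.0, 34.5],
})

result = monthly_mean(df_ocean, "salinity_ppt")
```

The function still does one thing, computing a monthly mean, and the type check only guards its input.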

Step 6: Add narrative with markdown

A workspace that is only code is hard to share and harder to cite. Use markdown cells to turn your analysis into a readable document — covering what question each section answers, where the data comes from, and what the output shows. Aim for at least one markdown cell before each major code block.

Step 7: Parameterise and re-run cleanly

Hardcoded values scattered through a workspace make it fragile. Define all parameters in a single cell near the top:

START_DATE = "2015-01-01"
END_DATE   = "2024-12-31"
REGION     = "North Atlantic"
DATA_FILE  = "ocean_temp_v2.parquet"

Every cell below reads from these variables. To re-run for a different region or time window, you change one cell and hit Run all.
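
A downstream cell that reads only from the parameter cell could look like this sketch. The in-memory frame is hypothetical and takes the place of loading DATA_FILE, and the column names are assumed.

```python
import pandas as pd

# Parameter cell: the only place these values are written.
START_DATE = "2015-01-01"
END_DATE = "2024-12-31"
REGION = "North Atlantic"

# Hypothetical data standing in for the file named by DATA_FILE.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2014-06-01", "2016-03-01", "2020-09-01", "2025-01-01", "2018-05-01"]
    ),
    "region": [
        "North Atlantic", "North Atlantic", "North Atlantic",
        "North Atlantic", "South Pacific",
    ],
    "temperature_c": [11.0, 12.0, 13.0, 14.0, 25.0],
})

# Downstream cell: no literal dates or region names, only the parameters.
in_window = df["date"].between(START_DATE, END_DATE)  # inclusive on both ends
subset = df[in_window & (df["region"] == REGION)]
```

Because the filter never repeats a literal value, changing REGION or the date window in the first cell is enough to redirect the whole analysis.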

Troubleshooting

My merge produced far more rows than expected.
This usually means a many-to-many join on a non-unique key. Check for duplicates with df["region_id"].duplicated().sum() before merging.
 
My resample is returning NaN for most periods.
Make sure the date column is set as the index and is a proper datetime type. Run df.index.dtype to confirm.
 
My chart labels are overlapping.
Try plt.tight_layout() after your plot commands, or rotate labels with plt.xticks(rotation=45).
 
My function is not updating when I edit it.
Re-run the cell that defines the function. Panthaion only uses the most recently run version, not the most recently written one.
 
My workspace runs on my screen but fails for a colleague.
The most likely cause is an absolute file path or a mismatched library version. Use relative paths and add a %pip install cell at the top for any non-standard libraries.
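
The first two checks in this list can be wrapped in small helpers, sketched below. The helper names and sample frames are hypothetical.

```python
import pandas as pd


def duplicate_keys(df, key):
    """Return every row whose `key` value appears more than once."""
    return df[df[key].duplicated(keep=False)]


def has_datetime_index(df):
    """Return True if the index is a DatetimeIndex, ready for date-based resampling."""
    return isinstance(df.index, pd.DatetimeIndex)


# Hypothetical lookup table with a repeated key.
df_land = pd.DataFrame(
    {"region_id": [1, 2, 2], "forest_pct": [42.0, 17.5, 18.0]}
)
dupes = duplicate_keys(df_land, "region_id")

# Hypothetical series whose dates are still strings.
df_raw = pd.DataFrame(
    {"date": ["2024-01-01", "2024-01-02"], "temperature_c": [4.5, 5.0]}
)
before = has_datetime_index(df_raw.set_index("date"))

# After parsing, the same column produces a proper DatetimeIndex.
df_parsed = df_raw.assign(date=pd.to_datetime(df_raw["date"]))
after = has_datetime_index(df_parsed.set_index("date"))
```

Running duplicate_keys on each side of a merge before joining shows which rows would multiply, rather than just how many.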