Python ML pipeline mistakes — what I wish I knew

The first machine learning pipeline I wrote had a test accuracy of 97%.

It was completely wrong.

Not wrong in a subtle way. Wrong in the way that meant the model had seen the answers before the exam. I had preprocessed the full dataset before splitting it — a classic data leakage mistake — and spent two weeks being quietly proud of a number that meant nothing.

Here is what I wish someone had told me before I started.

1. The split comes first, always

Before you touch the data — before you impute, scale, encode, or do anything else — split it. Fit your transformers only on the training set. Apply them to the test set. This is not a best practice. It is the only practice.

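A scikit-learn Pipeline makes this hard to get wrong: calling fit on the pipeline fits every transformer on exactly the data you pass in, and predict or score only applies those fitted transformers. It cannot protect you if you hand it the full dataset, so you still split first and pass only the training set to fit. Use it from the start, not as an afterthought.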

python

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)

The scaler sees only X_train. X_test is transformed using those training statistics. That is the correct behaviour.
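If you want to convince yourself of this rather than take it on trust, you can inspect the fitted scaler. The sketch below uses a synthetic dataset from make_classification as a stand-in (the original data is not shown here, so the sizes and seed are my assumptions) and checks that the scaler's stored means match the training rows, not the full dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: sizes and seed are illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Split first, before any statistic is computed from the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)

# The fitted scaler keeps the per-feature means it learned during fit.
scaler = pipe.named_steps["scaler"]
fitted_on_train_only = np.allclose(scaler.mean_, X_train.mean(axis=0))
matches_full_data = np.allclose(scaler.mean_, X.mean(axis=0))

print(f"Scaler means match the training rows: {fitted_on_train_only}")
print(f"Scaler means match the full dataset:  {matches_full_data}")
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
```

If the second check ever comes back true on real data, something upstream fitted on rows it should not have seen.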

2. Silent bugs in preprocessing will ruin you

Pandas does not warn you when a merge silently drops rows, or duplicates them because the join key is not unique. It does not warn you when a fillna changes the distribution of a column in ways that will not generalise. It does not warn you when a dtype mismatch causes a column to be ignored.

Add shape assertions after every major transformation step:

python

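expected_rows = len(df)  # row count before the merge
df = df.merge(metadata, on='id', how='left')
# A left merge never drops rows, but duplicate ids in metadata will add them.
assert df.shape[0] == expected_rows, f"Row count changed: {df.shape[0]}"

# df is assumed to be the training split, so the median does not leak test data.
df['feature'] = df['feature'].fillna(df['feature'].median())
assert df['feature'].isna().sum() == 0, "Missing values remain after fillna"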

These two lines have saved me hours. They catch a category of error that unit tests miss.
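Pandas can also do part of this checking for you. As a minimal sketch: the validate argument of merge raises an error when the join keys are not unique in the way you expect, and indicator=True adds a _merge column showing which rows found a match. The two small frames below are made up for illustration.

```python
import pandas as pd

# Hypothetical example frames: id 2 is duplicated in metadata, id 3 is missing.
df = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, None, 30.0]})
metadata = pd.DataFrame({"id": [1, 2, 2], "group": ["a", "b", "c"]})

# validate="many_to_one" requires the keys on the right side to be unique.
try:
    df.merge(metadata, on="id", how="left", validate="many_to_one")
    duplicate_keys_caught = False
except pd.errors.MergeError:
    duplicate_keys_caught = True

# After deduplicating the keys, the merge passes validation.
# indicator=True records whether each row matched ("both") or not ("left_only").
merged = df.merge(
    metadata.drop_duplicates(subset="id"),
    on="id",
    how="left",
    validate="many_to_one",
    indicator=True,
)
unmatched = int((merged["_merge"] == "left_only").sum())

print(f"Duplicate keys caught: {duplicate_keys_caught}")
print(f"Rows after merge: {len(merged)}, unmatched rows: {unmatched}")
```

I would still keep the plain row-count assertion next to it. The two checks fail for different reasons, and the error message tells you which one you hit.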

3. Random seeds are not optional

Your results need to be reproducible. Set seeds early, set them for every library that draws random numbers, and document them. This matters for your own sanity, and even more for anyone who has to reproduce your work, including future you.

python

import random
import numpy as np
import torch

SEED = 42

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

If you use scikit-learn estimators that have randomness, pass random_state=SEED explicitly. Do not rely on global state.
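Here is a small sketch of what passing random_state explicitly looks like, and a cheap way to check that it worked: run the same code twice and compare the outputs. The run_once helper, the synthetic data, and the choice of RandomForestClassifier are my own illustration, not part of the original pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42


def run_once(seed):
    """Hypothetical helper: one full run with every random step seeded explicitly."""
    X, y = make_classification(n_samples=300, n_features=8, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=25, random_state=seed)
    model.fit(X_train, y_train)
    return model.predict_proba(X_test)


first = run_once(SEED)
second = run_once(SEED)

# With the same seed passed everywhere, the two runs should agree exactly.
runs_match = np.array_equal(first, second)
print(f"Identical predictions across runs: {runs_match}")
```

A check like this is worth running once per project. If it fails, some step is still drawing from unseeded global state.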

4. Your baseline is more important than your model

Before you try any model, establish a dumb baseline. A majority-class classifier. A mean predictor. A simple rule based on the most obvious feature.

python

from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(f"Baseline accuracy: {dummy.score(X_test, y_test):.3f}")

If your gradient boosting model cannot convincingly beat this, the problem is almost certainly in the data, not the algorithm. Most beginner pipelines skip this step and spend days tuning a model that is solving the wrong problem.
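To make "convincingly beat" concrete, I find it helps to print the baseline and the model side by side. The sketch below assumes an imbalanced synthetic dataset (roughly 90% of one class, chosen by me for illustration) where the majority-class baseline already looks impressive on its own.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: about 90% majority class).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

baseline_acc = dummy.score(X_test, y_test)
model_acc = model.score(X_test, y_test)
margin = model_acc - baseline_acc

print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Model accuracy:    {model_acc:.3f}")
print(f"Margin:            {margin:+.3f}")
```

The number to look at is the margin, not the model accuracy. On data like this, a model that predicts nothing useful can still report a score near 90%.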

5. Logging beats print statements

Use Python's logging module from the beginning. When your pipeline runs for four hours and something goes wrong at hour three, you will want structured log output with timestamps — not a long list of print statements you forgot to remove.

python

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s — %(levelname)s — %(message)s'
)
log = logging.getLogger(__name__)

log.info("Starting training run")
log.info("Training samples: %d", X_train.shape[0])
log.warning("High null rate in feature_x — check imputation")

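The example hard-codes the level as INFO. If you read the level from configuration instead, such as an environment variable, you can run at DEBUG during development and INFO in production without changing your code.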
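One way to switch levels without editing the script is to read the level from an environment variable. This is a minimal sketch: the LOG_LEVEL variable name, the resolve_log_level helper, and the "pipeline" logger name are my own choices, not a convention of the logging module.

```python
import logging
import os

# Names accepted from the environment, mapped to the standard level constants.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def resolve_log_level(name, default=logging.INFO):
    """Hypothetical helper: turn a level name into a logging constant.

    Unknown or missing names fall back to the default instead of raising.
    """
    if not name:
        return default
    return LEVELS.get(name.strip().upper(), default)


# LOG_LEVEL is an assumed variable name; use whatever your deployment provides.
level = resolve_log_level(os.environ.get("LOG_LEVEL"))

logging.basicConfig(
    level=level,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("pipeline")

log.debug("Only shown when LOG_LEVEL=DEBUG")
log.info("Starting training run")
```

Running the script with LOG_LEVEL=DEBUG set in the shell then shows the debug line, and leaving the variable unset keeps the output at INFO.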

A note on 97% accuracy

That first pipeline result eventually got fixed. The real accuracy, after a proper train/test split, was 71%. Still decent for the problem. But the two weeks I spent thinking I had built something extraordinary taught me more than the model itself did.

The mistakes are the curriculum. Write them down.
