The first machine learning pipeline I wrote had a test accuracy of 97%.
It was completely wrong.
Not wrong in a subtle way. Wrong in the way that meant the model had seen the answers before the exam. I had preprocessed the full dataset before splitting it — a classic data leakage mistake — and spent two weeks being quietly proud of a number that meant nothing.
Here is what I wish someone had told me before I started.
1. The split comes first, always
Before you touch the data — before you impute, scale, encode, or do anything else — split it. Fit your transformers only on the training set. Apply them to the test set. This is not a best practice. It is the only practice.
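A scikit-learn Pipeline handles this for every step inside it: calling fit on the training data fits each transformer on that data only, and predict or score reuses those fitted statistics. It cannot protect you from preprocessing done outside the pipeline, so split first and use it from the start, not as an afterthought.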
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
The scaler sees only X_train. X_test is transformed using those training statistics. That is the correct behaviour.
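If you want to convince yourself of this, you can inspect the fitted scaler. The sketch below is only an illustration: the data is synthetic (from make_classification), and the sizes and seed are arbitrary assumptions rather than anything from my original project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; sizes and seed are arbitrary assumptions.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Split before any preprocessing touches the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)

# The fitted scaler's mean should match the training rows, not the full dataset.
scaler_mean = pipe.named_steps["scaler"].mean_
uses_train_only = np.allclose(scaler_mean, X_train.mean(axis=0))
matches_full_data = np.allclose(scaler_mean, X.mean(axis=0))

test_accuracy = pipe.score(X_test, y_test)
print(f"Scaler mean matches training rows: {uses_train_only}")
print(f"Scaler mean matches full dataset: {matches_full_data}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The same pipeline object can be handed to cross_val_score, which refits the scaler inside each fold, so the guarantee carries over to cross-validation.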
2. Silent bugs in preprocessing will ruin you
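Pandas does not warn you when a merge silently drops rows (an inner join with unmatched keys) or duplicates them (a left join against non-unique keys). It does not warn you when a fillna changes the distribution of a column in ways that will not generalise. It does not warn you when a dtype mismatch, such as a numeric column read in as strings, causes a column to be skipped by numeric-only operations.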
Add shape assertions after every major transformation step:
expected_rows = df.shape[0]  # a left merge on a unique key should keep the row count
df = df.merge(metadata, on='id', how='left')
assert df.shape[0] == expected_rows, f"Row count changed: {df.shape[0]}"
df['feature'] = df['feature'].fillna(df['feature'].median())
assert df['feature'].isna().sum() == 0, "Missing values remain after fillna"
These two assertions have saved me hours. They catch a category of error that unit tests usually miss, because the code runs without raising while the data is quietly wrong.
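Pandas can also do part of this checking for you. The merge method accepts a validate argument, which raises if the join keys are not unique where you expect them to be, and an indicator argument, which records where each row came from. A minimal sketch follows; checked_left_merge is a hypothetical helper name and the two small frames are made-up toy data.

```python
import pandas as pd


def checked_left_merge(left, right, key):
    """Left-merge that fails on duplicated right-hand keys and counts unmatched rows.

    Hypothetical helper for illustration only.
    """
    merged = left.merge(
        right,
        on=key,
        how="left",
        validate="many_to_one",  # raises MergeError if `right` has duplicate keys
        indicator=True,          # adds a `_merge` column: left_only / right_only / both
    )
    assert len(merged) == len(left), f"Row count changed: {len(left)} -> {len(merged)}"
    unmatched = int((merged["_merge"] == "left_only").sum())
    return merged.drop(columns="_merge"), unmatched


# Toy data: id 3 has no metadata row.
df = pd.DataFrame({"id": [1, 2, 3], "feature": [0.5, None, 1.5]})
metadata = pd.DataFrame({"id": [1, 2], "group": ["a", "b"]})

merged, unmatched = checked_left_merge(df, metadata, "id")
print(f"Rows: {len(merged)}, rows without metadata: {unmatched}")
```

Unmatched rows are a different failure from duplicated ones: the row count stays the same, but the new columns are filled with missing values, so it is worth counting them separately.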
3. Random seeds are not optional
Your results need to be reproducible. Set seeds early, set one for every library that draws random numbers, and document them. This matters less for your own sanity and more for anyone who reads your work, including future you.
import random
import numpy as np
import torch
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
If you use scikit-learn estimators that have randomness, pass random_state=SEED explicitly. Do not rely on global state.
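As a sketch of what passing the seed explicitly looks like, the example below seeds both the split and the estimator and then checks that two runs agree. The data is synthetic, run_once is a hypothetical helper name, and the random forest is just a convenient example of an estimator with internal randomness.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42


def run_once(seed):
    """Split, fit, and predict with every source of randomness seeded explicitly."""
    # Synthetic stand-in data with a fixed seed of its own.
    X, y = make_classification(n_samples=300, n_features=6, random_state=0)
    X_train, X_test, y_train, _ = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    model.fit(X_train, y_train)
    return model.predict(X_test)


first = run_once(SEED)
second = run_once(SEED)
print("Identical predictions across runs:", np.array_equal(first, second))
```

Because nothing here reads the global NumPy state, the result does not depend on what other code ran earlier in the same process.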
4. Your baseline is more important than your model
Before you try any model, establish a dumb baseline. A majority-class classifier. A mean predictor. A simple rule based on the most obvious feature.
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(f"Baseline accuracy: {dummy.score(X_test, y_test):.3f}")
If your gradient boosting model cannot convincingly beat this, the problem is far more likely to be in the data or the problem framing than in the algorithm. It is easy to skip this step and spend days tuning a model that is solving the wrong problem.
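To make "convincingly beat" concrete, it helps to print the baseline and the model side by side. The sketch below uses synthetic, deliberately imbalanced data (roughly 90% of one class, an assumption for illustration), where a majority-class predictor already looks respectable on accuracy alone.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data; the 90/10 weighting is an assumption.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

baseline_acc = dummy.score(X_test, y_test)
model_acc = model.score(X_test, y_test)
margin = model_acc - baseline_acc  # how much the model adds over doing nothing clever

print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Model accuracy:    {model_acc:.3f}")
print(f"Margin:            {margin:+.3f}")
```

The number to report is the margin, not the model's accuracy on its own. On imbalanced data a high accuracy can be almost entirely the baseline.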
5. Logging beats print statements
Use Python's logging module from the beginning. When your pipeline runs for four hours and something goes wrong at hour three, you will want structured log output with timestamps — not a long list of print statements you forgot to remove.
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s — %(levelname)s — %(message)s'
)
log = logging.getLogger(__name__)
log.info("Starting training run")
log.info("Training samples: %d", X_train.shape[0])
log.warning("High null rate in feature_x — check imputation")
As written, the level is hard-coded. If you read it from configuration instead, such as an environment variable, you can run at DEBUG during development and INFO in production without changing your code.
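One way to do that, sketched under a few assumptions: the environment variable name LOG_LEVEL and the helper configure_logging are my own choices for illustration, not a convention of the logging module.

```python
import logging
import os

VALID_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def configure_logging(env_var="LOG_LEVEL", default="INFO"):
    """Configure root logging from an environment variable (hypothetical helper).

    Falls back to `default` when the variable is unset or not a known level name.
    """
    level_name = os.environ.get(env_var, default).upper()
    if level_name not in VALID_LEVELS:
        level_name = default
    level = getattr(logging, level_name)
    logging.basicConfig(
        level=level,
        format="%(asctime)s - %(levelname)s - %(message)s",
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )
    return level


configure_logging()
log = logging.getLogger(__name__)
log.debug("Only shown when LOG_LEVEL=DEBUG")
log.info("Starting training run")
```

Running the script with LOG_LEVEL=DEBUG set in the shell then turns on the debug output, and leaving it unset keeps the quieter default.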
A note on 97% accuracy
That first pipeline result eventually got fixed. The real accuracy, after a proper train/test split, was 71%. Still decent for the problem. But the two weeks I spent thinking I had built something extraordinary taught me more than the model itself did.
The mistakes are the curriculum. Write them down.