Our First ML Model: Decision Tree from Scratch
8/23/2025
decision-tree · tutorial · supervised-learning · bigml · BigML Analyst Certification I
Hands-on tutorial • 8-10 min read • 30-60 min to run
TL;DR: Decision trees split data by feature thresholds to create easy-to-interpret rules. We’ll train a small tree on a toy dataset and evaluate accuracy.
Why decision trees
- Highly interpretable rules.
- Work with mixed data types without heavy preprocessing.
- Fast to train for small-to-medium datasets.
Minimal working example (Python + scikit-learn)
Prerequisites
```powershell
# Windows PowerShell
python -m venv .venv; .\.venv\Scripts\Activate.ps1; pip install scikit-learn pandas
```
Code (minimal)
```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy dataset
X = pd.DataFrame({
    'age':    [22, 25, 47, 52, 46, 56, 55, 60],
    'income': [20000, 30000, 50000, 60000, 52000, 80000, 70000, 90000],
})
y = [0, 0, 1, 1, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, preds))
```
Interpreting splits
- Each internal node tests a feature threshold.
- Leaves contain predicted labels and class probabilities.
- Use `export_text` or `plot_tree` (sklearn) to visualize rules.
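To see these rules in text form, a minimal sketch using scikit-learn's `export_text` on the same toy data as above:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Same toy data as in the minimal example above.
X = pd.DataFrame({
    'age':    [22, 25, 47, 52, 46, 56, 55, 60],
    'income': [20000, 30000, 50000, 60000, 52000, 80000, 70000, 90000],
})
y = [0, 0, 1, 1, 1, 1, 1, 1]

model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# export_text prints each split as an indented "feature <= threshold" rule,
# with "class: ..." lines at the leaves.
rules = export_text(model, feature_names=list(X.columns))
print(rules)
```

Each indented line is one node test; following a chain of tests from the top down to a `class:` line reproduces the prediction path for a sample.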
Visual: decision tree example
Replace with an SVG showing a small tree (root → branches → leaves) and one highlighted prediction path.
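Until that figure is in place, you can generate the diagram yourself. A sketch using scikit-learn's `export_graphviz` (the class labels `no`/`yes` are illustrative, not from the dataset); render the resulting `.dot` file to SVG with Graphviz, e.g. `dot -Tsvg tree.dot -o tree.svg`:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Same toy data as in the minimal example above.
X = pd.DataFrame({
    'age':    [22, 25, 47, 52, 46, 56, 55, 60],
    'income': [20000, 30000, 50000, 60000, 52000, 80000, 70000, 90000],
})
y = [0, 0, 1, 1, 1, 1, 1, 1]

model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# With out_file=None, export_graphviz returns the Graphviz source as a string.
dot_source = export_graphviz(
    model,
    out_file=None,
    feature_names=list(X.columns),
    class_names=['no', 'yes'],  # illustrative labels for classes 0 and 1
    filled=True,
    rounded=True,
)
with open('tree.dot', 'w') as f:
    f.write(dot_source)
```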
Production notes
- Trees can overfit; prefer pruning or ensembles for robust performance.
- Save the preprocessing pipeline with the model.
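The pruning point above can be sketched with scikit-learn's minimal cost-complexity pruning: `cost_complexity_pruning_path` enumerates candidate `ccp_alpha` values, and larger alphas prune more aggressively. The toy data is reused purely for illustration:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Same toy data as in the minimal example above.
X = pd.DataFrame({
    'age':    [22, 25, 47, 52, 46, 56, 55, 60],
    'income': [20000, 30000, 50000, 60000, 52000, 80000, 70000, 90000],
})
y = [0, 0, 1, 1, 1, 1, 1, 1]

# Enumerate the effective alphas for this data; each one corresponds to a
# subtree of the fully grown tree.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X, y)

for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X, y)
    print(f"ccp_alpha={alpha:.4f} -> {tree.get_n_leaves()} leaves")
```

In practice, choose `ccp_alpha` by cross-validation on a realistic dataset; for the ensemble route mentioned above, `RandomForestClassifier` is the usual drop-in.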
Next article
We’ll explain classification vs. regression and when to use each.
Appendix: BigML decision-tree quickstart (from transcript)
Reproduce the transcript demo steps quickly on BigML:
- Create a source
  - Upload `diabetes.csv` (or create from a public URL). Let BigML infer field types; adjust locale/decimal separator if needed.
- Inspect dataset
  - Confirm instance count (e.g., 768) and class balance for the target field. Hover histograms to see mean/median/std.
- Choose objective
  - Ensure `diabetes` (or your target) is selected as the objective (BigML defaults to the last column).
- Build a decision tree
  - From the dataset view, open the actions menu (cloud icon) → Model. One click builds a default tree.
  - Use the gears icon to configure max_depth, pruning, and whether to treat missing as a value.
- Explore predictions
  - Click any node to view the prediction, confidence, and prediction path (the sequence of splits producing the prediction).
  - Use alternative visualizations: tree branches, sunburst, and 2D PDP (plasma glucose vs BMI heatmap).
- Export / programmatic predictions
  - Download sample Python code from the model's actions menu to call the prediction endpoint or embed the model.
Notes
- Confidence in BigML’s decision trees reflects node purity and instance counts (higher instance backing → higher confidence).
- You can filter the UI to show only branches that predict a particular class for quick rule extraction.
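BigML documents its tree confidence as the lower bound of a Wilson score interval over a node's class counts, which formalizes the first note above: the same observed proportion earns more confidence when backed by more instances. A minimal sketch of that calculation (standard 95% Wilson lower bound; an approximation of the idea, not BigML's exact code):

```python
import math

def wilson_lower_bound(successes, n, z=1.96):
    """Lower bound of the Wilson score interval: a pessimistic estimate
    of the true class proportion given `successes` out of `n` instances."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / denom

# A pure leaf backed by 100 instances earns a much higher bound
# than an equally pure leaf backed by only 10.
print(wilson_lower_bound(10, 10))
print(wilson_lower_bound(100, 100))
```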