train_test_split: Random Train/Test Data Splitter
The train_test_split function splits a dataset into random training and testing subsets. It accepts NumPy arrays, pandas DataFrame and Series objects, and lists or tuples. Holding out a test set lets you evaluate a machine learning model on data it did not see during training, which makes this function a core utility in most ML workflows.
Overview
Splitting your data into training and testing sets is a fundamental step in machine learning. The train_test_split function allows you to:
- Randomly partition your data into train and test sets.
- Specify the proportion or absolute number of test samples.
- Shuffle your data for unbiased splitting.
- Use a random seed for reproducibility.
- Split both features (`X`) and targets (`y`) in a consistent manner.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `X` | array-like | required | Feature data to be split. Accepts NumPy arrays or pandas DataFrame. Must be indexable and of consistent length. |
| `y` | array-like or None | None | Target data to be split alongside `X`. Accepts NumPy arrays or pandas Series/DataFrame column. Must be the same length as `X`. |
| `test_size` | float or int | 0.25 | If float, fraction of data for the test set (0.0 < test_size < 1.0). If int, absolute number of test samples. |
| `shuffle` | bool | True | Whether to shuffle the data before splitting. |
| `random_seed` | int or None | None | Seed that controls the shuffling, for reproducibility. |
Returns
- `X_train`, `X_test`: `np.ndarray`
  Train-test split of `X`.
- `y_train`, `y_test`: `np.ndarray` or `None`
  Train-test split of `y`. If `y` is `None`, these will also be `None`.
Raises
- `ValueError`
  If the inputs are invalid or `test_size` is outside its valid range.
- `TypeError`
  If `test_size` is not a float or int.
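The checks behind these two exceptions can be sketched in plain Python, using the ranges this page gives for `test_size`. The helper name `resolve_test_count` is hypothetical, and rounding a fractional `test_size` up to a whole sample is an assumption; the library may round differently and word its error messages differently.

```python
import math


def resolve_test_count(test_size, n_samples):
    """Hypothetical helper: convert test_size into a number of test samples."""
    if not isinstance(test_size, (int, float)):
        raise TypeError("test_size must be a float or an int")

    if isinstance(test_size, float):
        # Floats are fractions and must lie strictly between 0.0 and 1.0
        if not 0.0 < test_size < 1.0:
            raise ValueError("float test_size must be between 0.0 and 1.0 (exclusive)")
        # Assumption: round up to a whole sample; the real rounding rule may differ
        n_test = math.ceil(test_size * n_samples)
    else:
        # Ints are taken as an absolute number of test samples
        n_test = test_size

    # Both subsets need at least one sample: 1 <= n_test <= n_samples - 1
    if not 1 <= n_test <= n_samples - 1:
        raise ValueError("test_size must be between 1 and len(X) - 1 samples")
    return n_test


print(resolve_test_count(0.3, 10))  # fraction of 10 samples
print(resolve_test_count(4, 10))    # absolute count
```

Keeping the type check separate from the range check mirrors the split between `TypeError` and `ValueError` described above.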
Example Usage
import numpy as np
from machinegnostics.models import train_test_split
# Create sample data
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, shuffle=True, random_seed=42
)
print("X_train:", X_train)
print("X_test:", X_test)
print("y_train:", y_train)
print("y_test:", y_test)
Notes
- If `y` is not provided, only `X` will be split and `y_train`, `y_test` will be `None`.
- If `test_size` is a float, it must be between 0.0 and 1.0 (exclusive).
- If `test_size` is an int, it must be between 1 and `len(X) - 1`.
- Setting `shuffle=False` will split the data in order, without randomization.
- Use `random_seed` for reproducible splits.
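
The behaviours in the notes above can be pictured with a minimal pure-Python sketch. It is not the machinegnostics implementation: `sketch_split` and its `n_test` argument are hypothetical, it works on lists rather than NumPy arrays, and taking the test samples from the end of the index order is an assumption.

```python
import random


def sketch_split(X, y=None, n_test=1, shuffle=True, random_seed=None):
    """Illustrative split on lists; not the library's code."""
    if y is not None and len(y) != len(X):
        raise ValueError("X and y must have the same length")

    indices = list(range(len(X)))
    if shuffle:
        # A dedicated seeded generator makes the permutation repeatable
        random.Random(random_seed).shuffle(indices)

    # Assumption: the last n_test positions in index order form the test set
    train_idx, test_idx = indices[:-n_test], indices[-n_test:]

    # The same indices are applied to X and y, so each row keeps its label
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    if y is None:
        return X_train, X_test, None, None
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


X = [[2 * i, 2 * i + 1] for i in range(10)]
y = list(range(10))

# Same seed, same split
a = sketch_split(X, y, n_test=3, random_seed=42)
b = sketch_split(X, y, n_test=3, random_seed=42)
print(a == b)  # True

X_train, X_test, y_train, y_test = a

# shuffle=False keeps the original order
ordered = sketch_split(X, y, n_test=3, shuffle=False)
print(ordered[3])  # [7, 8, 9]
```

Shuffling one list of indices and reusing it for both inputs is what keeps `X` and `y` aligned after the split.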
Author: Nirmal Parmar
Date: 2025-05-01