make_starwars_check_data: Star Wars Characters Dataset

The make_starwars_check_data function generates a synthetic Star Wars-like dataset describing 87 characters by default: height, mass, species, and a placeholder name for each. It is modeled on the starwars dataset that ships with R's dplyr package and is well suited to practicing categorical analysis, grouping operations, and basic data exploration.


Overview

This utility creates a character dataset similar to the original R dataset:

  • Structure: 87 observations (characters) by default; the count is set by n.
  • Variables: Height, Mass, Species, and Character Names (synthetic).
  • Characteristics:
    • Species distribution is skewed towards 'Human' (approx. 55%).
    • Physical traits like height and mass are statistically distinct between species (e.g., Wookiees are taller/heavier, Hutts are very heavy).
  • Purpose: Ideal for data manipulation tasks (filtering, grouping), joining tables, and categorical visualization.
  • Reproducibility: Uses a fixed seed (default 42).
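
The skewed species distribution and the fixed seed described above can be illustrated with a small stand-in sampler. This is only a sketch, not the library's implementation: the function name sketch_species is hypothetical, and apart from the roughly 55% 'Human' share stated above, the labels and weights are placeholder assumptions.

```python
import random

# Hypothetical weights: only the ~55% 'Human' share comes from the text above;
# every other label and weight is a placeholder chosen for illustration.
SPECIES_WEIGHTS = {
    "Human": 0.55,
    "Droid": 0.15,
    "Wookiee": 0.10,
    "Hutt": 0.05,
    "Other": 0.15,
}


def sketch_species(n=87, seed=42):
    """Stand-in sampler (NOT the library's code): draw n weighted species labels."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    labels = list(SPECIES_WEIGHTS)
    weights = list(SPECIES_WEIGHTS.values())
    return rng.choices(labels, weights=weights, k=n)


first = sketch_species()
second = sketch_species()

# The same seed reproduces the same sequence of labels.
print(first == second)

# The observed share fluctuates around the nominal 55% for a sample this small.
human_share = first.count("Human") / len(first)
print(f"Human share: {human_share:.0%}")
```

Using a dedicated random.Random instance rather than the module-level functions keeps the draw reproducible even when other code consumes random numbers in between.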

Parameters

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| n | int | Number of characters to generate. | 87 |
| seed | int | Random seed for reproducibility. | 42 |

Returns

| Return | Type | Description |
|--------|------|-------------|
| height_cm | numpy.ndarray | Character heights in cm. Shape (n,). |
| mass_kg | numpy.ndarray | Character masses in kg. Shape (n,). |
| species | list[str] | Species label for each entry (e.g., 'Human', 'Wookiee', 'Droid'). |
| names | list[str] | Placeholder character names (e.g., 'Character 1'). |

Example Usage

from machinegnostics.data import make_starwars_check_data
import pandas as pd

# Generate character data
h, m, s, names = make_starwars_check_data()

# Create a DataFrame for easy viewing
df = pd.DataFrame({
    'Name': names,
    'Species': s,
    'Height': h,
    'Mass': m
})

print(df.head())
# Output (approx.; exact values depend on the seed):
#           Name Species      Height       Mass
# 0  Character 1   Human  176.452312  81.231231
# 1  Character 2   Droid  168.123123  84.512341
# ...

# Find the average mass of Humans (row mask and column selected in one .loc call)
human_mass = df.loc[df['Species'] == 'Human', 'Mass'].mean()
print(f"Avg Human Mass: {human_mass:.2f} kg")
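
The single-species filter above generalizes to a per-species summary with a pandas groupby. The sketch below uses a few hand-written stand-in rows (not output of make_starwars_check_data) with the same column names as the example DataFrame, so that it runs on its own; the variable names demo and summary are illustrative.

```python
import pandas as pd

# Hand-written stand-in rows for illustration only; in practice, use the
# DataFrame built from the function's return values.
demo = pd.DataFrame({
    "Name": [f"Character {i}" for i in range(1, 7)],
    "Species": ["Human", "Human", "Droid", "Wookiee", "Human", "Droid"],
    "Height": [170.0, 180.0, 100.0, 228.0, 175.0, 160.0],
    "Mass": [70.0, 80.0, 35.0, 112.0, 75.0, 75.0],
})

# One row per species: how many characters, plus mean height and mass.
summary = (
    demo.groupby("Species")
        .agg(
            count=("Name", "size"),
            mean_height=("Height", "mean"),
            mean_mass=("Mass", "mean"),
        )
        .sort_values("count", ascending=False)
)

print(summary)
```

Sorting by count puts the most common species first, which makes a skew towards 'Human' easy to spot in the real dataset.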