Baseball data mapping

An introduction to mapping baseball data with ReNom TDA.

In this tutorial, we visualize baseball data using the ReNom TDA module. You will learn the following:

  • How to analyse topology.

Requirements

In [1]:
import numpy as np

import pandas as pd

from renom_tda.topology import Topology
from renom_tda.lens import PCA

Import baseball data

We use 2016 hitter statistics from https://github.com/nyk510/baseball_dataset/tree/master/data
and compute the following sabermetrics measures from them.
  • OPS(On-base Plus Slugging)
OPS = OBP + SLG
OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
SLG = (1B + 2 × 2B + 3 × 3B + 4 × HR) / AB
  • IsoP(Isolated Power)

IsoP = SLG - AVG

  • BABIP(Batting Average on Balls In Play)

BABIP = (H - HR) / (AB - K - HR + SF)

  • BB/K
  • PA/K
  • AB/HR
  • SecA(Secondary average)

SecA = (TB - H + BB + SB - CS) / AB

  • TA(Total Average)

TA = ( TB + BB + HBP + SB - CS ) / ( AB - H + CS + DP )

  • PS(Power-Speed-Number)

PS = ( HR × SB × 2) / ( HR + SB )

  • RC27(Runs Created per 27 outs)
RC = ( 2.4 × C + A ) × ( 3 × C + B ) ÷ (9 × C) - 0.9 × C
A = H + BB + HBP - CS - DP
B = TB + 0.26 ×(BB + HBP) + 0.53 × SF + 0.64 × SB - 0.03 × K
C = AB + BB + HBP + SF
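
As a worked example of the formulas above, the following sketch computes most of the listed metrics for a single batting line in plain Python. The stat line is invented for illustration and is not taken from the dataset, and the dictionary keys (`AB`, `H`, `2B`, ...) are assumed names rather than the CSV's actual column headers. PA/K is left out because PA is not defined in this section, and returning 0 for PS when HR + SB is 0 is a convention chosen here.

```python
def sabermetrics(s):
    """Compute the metrics defined above from a dict of counting stats.

    Assumes AB, K and HR are nonzero (no division-by-zero handling for them).
    """
    singles = s["H"] - s["2B"] - s["3B"] - s["HR"]
    tb = singles + 2 * s["2B"] + 3 * s["3B"] + 4 * s["HR"]  # total bases
    avg = s["H"] / s["AB"]
    obp = (s["H"] + s["BB"] + s["HBP"]) / (s["AB"] + s["BB"] + s["HBP"] + s["SF"])
    slg = tb / s["AB"]

    # Runs Created terms A, B and C, as defined above.
    a = s["H"] + s["BB"] + s["HBP"] - s["CS"] - s["DP"]
    b = (tb + 0.26 * (s["BB"] + s["HBP"]) + 0.53 * s["SF"]
         + 0.64 * s["SB"] - 0.03 * s["K"])
    c = s["AB"] + s["BB"] + s["HBP"] + s["SF"]

    hr_sb = s["HR"] + s["SB"]
    return {
        "OPS": obp + slg,
        "IsoP": slg - avg,
        "BABIP": (s["H"] - s["HR"]) / (s["AB"] - s["K"] - s["HR"] + s["SF"]),
        "BB/K": s["BB"] / s["K"],
        "AB/HR": s["AB"] / s["HR"],
        "SecA": (tb - s["H"] + s["BB"] + s["SB"] - s["CS"]) / s["AB"],
        "TA": (tb + s["BB"] + s["HBP"] + s["SB"] - s["CS"])
              / (s["AB"] - s["H"] + s["CS"] + s["DP"]),
        "PS": 2 * s["HR"] * s["SB"] / hr_sb if hr_sb else 0.0,
        "RC": (2.4 * c + a) * (3 * c + b) / (9 * c) - 0.9 * c,
    }


# Hypothetical stat line, for illustration only.
example = {"AB": 500, "H": 150, "2B": 30, "3B": 2, "HR": 20, "BB": 60,
           "HBP": 5, "SF": 5, "K": 100, "SB": 10, "CS": 4, "DP": 10}
metrics = sabermetrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```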
In [2]:
file_path = "hitter_metrics.csv"
pdata = pd.read_csv(file_path).dropna()

Extract text data & number data

We split the table into text columns (such as team name and player name) and numeric columns.

In [3]:
# Text columns (team name, player name, ...).
text_data = np.array(pdata.select_dtypes(include="object"))
# Numeric columns (the metrics), whatever their integer or float width.
number_data = np.array(pdata.select_dtypes(include="number"))

Create topology instance

In [4]:
topology = Topology()

Load data

To standardize the data, set the standardize argument to True.

In [5]:
topology.load_data(number_data, text_data=text_data, standardize=True)
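
To illustrate what standardization does to the numeric columns, here is a minimal sketch in plain Python. It assumes that standardizing means rescaling each column to zero mean and unit variance (z-scores); whether ReNom TDA uses exactly this definition internally is not confirmed here. The function name `standardize_columns` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Return rows with every column rescaled to zero mean and unit variance."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    # A constant column has zero deviation; divide by 1 instead to avoid an error.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [[(v - mu) / sd for v, mu, sd in zip(row, mus, sigmas)]
            for row in rows]


# Two columns on very different scales (for example OPS and a counting stat).
rows = [[0.700, 10.0], [0.800, 20.0], [0.900, 30.0]]
scaled = standardize_columns(rows)
for row in scaled:
    print([round(v, 3) for v in row])
```

After rescaling, both columns contribute on a comparable scale to distance-based steps such as PCA and clustering, which is the usual reason to standardize metrics with different units.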

Create point cloud

In [6]:
metric = None
lens = [PCA(components=[0,1])]
topology.fit_transform(metric=metric, lens=lens)
projected by PCA.

Mapping to Topological Space

In [7]:
topology.map(resolution=25, overlap=0.7, eps=0.3, min_samples=1)
created 145 nodes.
created 457 edges.

Colorize & show

Next, we color the nodes of the topology and display it.

In [8]:
target = topology.number_data[:, 0]
topology.color(target, color_method="mean", color_type="rgb")
topology.show(fig_size=(10,10), node_size=10, edge_width=0.5)
[Figure: output of topology.show()]
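
The idea behind color_method="mean" can be sketched as follows: each node groups several rows of the input data, and the node's color value is taken as the average of the target over those rows. This is an interpretation of the parameter name, not a confirmed description of ReNom TDA's internals, and the node memberships and target values below are made up.

```python
def node_means(nodes, target):
    """Average the target value over the rows that belong to each node."""
    return [sum(target[i] for i in members) / len(members) for members in nodes]


# Hypothetical nodes (sets of row indexes) and target values (e.g. OPS per player).
nodes = [{0, 1}, {1, 2, 3}]
ops = [0.60, 0.80, 0.90, 1.00]
colors = node_means(nodes, ops)
print([round(c, 3) for c in colors])
```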

Search for a player by name

In [9]:
search_dicts = [{
    "data_type": "text",
    "operator": "like",
    "column": 1,
    "value": "大谷"  # Ohtani (player name as written in the dataset)
}]

target = topology.number_data[:, 0]
topology.color(target, color_method="mean", color_type="rgb")
node_index = topology.search_from_values(search_dicts=search_dicts, target=None, search_type="index")
topology.show(fig_size=(10,10), node_size=10, edge_width=0.5)
[Figure: output of topology.show()]

Search for a team

In [10]:
search_dicts = [{
    "data_type": "text",
    "operator": "like",
    "column": 0,
    "value": "ヤクルト"  # Yakult (team name as written in the dataset)
}]

target = topology.number_data[:, 0]
topology.color(target, color_method="mean", color_type="rgb")
node_index = topology.search_from_values(search_dicts=search_dicts, target=None, search_type="index")
topology.show(fig_size=(10,10), node_size=10, edge_width=0.5)
[Figure: output of topology.show()]

Search by input data value

In [11]:
search_dicts = [{
    "data_type": "number",
    "operator": ">",
    "column": 0,
    "value": 0.9
}]

target = topology.number_data[:, 0]
topology.color(target, color_method="mean", color_type="rgb")
node_index = topology.search_from_values(search_dicts=search_dicts, target=None, search_type="index")
topology.show(fig_size=(10,10), node_size=10, edge_width=0.5)
[Figure: output of topology.show()]
The colored nodes contain players whose OPS is greater than 0.9.
Because search_from_values returns node indexes, you can display the IDs of the matching nodes.
In [12]:
node_index
Out[12]:
[137, 138, 139, 140, 141, 142, 143, 144]
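
The search dictionaries used above can be read as simple row filters. The sketch below shows one plausible way such a filter could map to node indexes: find the rows that satisfy the condition, then return every node containing at least one of them. It is a hedged illustration, not ReNom TDA's code; `search_nodes` and the sample data are hypothetical, and only the two operators that appear in this tutorial (">" and "like") are covered.

```python
import operator

# Only the two operators used in this tutorial are sketched here.
OPERATORS = {
    ">": operator.gt,
    "like": lambda cell, value: value in cell,  # substring match
}


def search_nodes(nodes, text_rows, number_rows, search_dict):
    """Return indexes of nodes that contain at least one matching row."""
    rows = text_rows if search_dict["data_type"] == "text" else number_rows
    test = OPERATORS[search_dict["operator"]]
    col, value = search_dict["column"], search_dict["value"]
    hits = {i for i, row in enumerate(rows) if test(row[col], value)}
    return [n for n, members in enumerate(nodes) if hits & set(members)]


# Hypothetical data: three players, three nodes given as sets of row indexes.
text_rows = [["Team A", "Player 1"], ["Team B", "Player 2"], ["Team A", "Player 3"]]
number_rows = [[0.95], [0.70], [0.85]]
nodes = [{0}, {1, 2}, {0, 2}]

high_ops = search_nodes(nodes, text_rows, number_rows,
                        {"data_type": "number", "operator": ">",
                         "column": 0, "value": 0.9})
team_a = search_nodes(nodes, text_rows, number_rows,
                      {"data_type": "text", "operator": "like",
                       "column": 0, "value": "Team A"})
print(high_ops, team_a)
```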

Output CSV file

A Topology instance can write the data for the given node indexes to a CSV file.
If text_data_columns and number_data_columns are not None, passing skip_header=False includes a header row in the output CSV.
In [13]:
topology.output_csv_from_node_ids("output.csv", node_ids=node_index, skip_header=True)
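
For readers who want to see what such an export amounts to, the following sketch writes the rows belonging to a set of nodes with Python's csv module. It is an assumption-laden illustration rather than the library's behaviour: `write_node_rows` is a hypothetical helper, the data is made up, and writing each shared row only once is a choice made here.

```python
import csv
import io


def write_node_rows(stream, nodes, node_ids, text_rows, number_rows, header=None):
    """Write one CSV line per data row belonging to the selected nodes."""
    writer = csv.writer(stream, lineterminator="\n")
    if header is not None:
        writer.writerow(header)
    # Collect each row index once, even if several selected nodes share it.
    row_ids = sorted(set().union(*(nodes[n] for n in node_ids)))
    for i in row_ids:
        writer.writerow(list(text_rows[i]) + list(number_rows[i]))


# Hypothetical data: three players, three nodes given as sets of row indexes.
text_rows = [["Team A", "Player 1"], ["Team B", "Player 2"], ["Team A", "Player 3"]]
number_rows = [[0.95], [0.70], [0.85]]
nodes = [{0}, {1, 2}, {0, 2}]

buffer = io.StringIO()
write_node_rows(buffer, nodes, node_ids=[0, 2], text_rows=text_rows,
                number_rows=number_rows, header=["team", "player", "OPS"])
csv_text = buffer.getvalue()
print(csv_text)
```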