Ph.D. in Biotechnology who has joined an eCommerce startup in Hong Kong
GM Billy Beane used data analytics to pick undervalued players for the Oakland A's 2002 draft
When Data Analytics Met Baseball

Many have tried to explain baseball to me. I still struggle to understand the game; on-base percentage (OBP) and on-base plus slugging (OPS) confuse me. That changed when I watched Billy Beane, the general manager of the Oakland A's, in the movie Moneyball. The opening scene shows the Oakland A's losing the Division Series to the New York Yankees. Pissed at losing, Beane turned to sabermetrics (data analytics), looking almost exclusively at players' OBP to build the 2002 draft around undervalued players, a departure from traditional baseball scouting. (read more) The Oakland A's didn't win the Major League championship using sabermetrics, but they did put together a record 20-game winning streak.
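For anyone else puzzled by those two statistics: they are just simple ratios. Here is a quick sketch using their standard MLB definitions (the function names and sample numbers below are mine, purely for illustration):

```python
def obp(hits, walks, hbp, at_bats, sac_flies):
    """On-base percentage: times reaching base per plate appearance
    (at-bats + walks + hit-by-pitch + sacrifice flies)."""
    return (hits + walks + hbp) / (at_bats + walks + hbp + sac_flies)

def slg(singles, doubles, triples, homers, at_bats):
    """Slugging percentage: total bases per at-bat."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * homers
    return total_bases / at_bats

def ops(obp_value, slg_value):
    """On-base plus slugging is literally the sum of the two."""
    return obp_value + slg_value
```

So a hitter who reaches base 225 times in 620 qualifying plate appearances has an OBP of about .363, and OPS simply adds the slugging number on top.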
Moneyball: The Page Turner

I have always enjoyed Michael Lewis' books, some of which have become movie hits, including "The Big Short: Inside the Doomsday Machine" and my personal all-time favorite, Flash Boys, a New York Times bestseller for 4 weeks, with Sony Pictures picking up the movie rights. In "Moneyball: The Art of Winning an Unfair Game", Lewis, a former bond salesman, writes with acerbic wit and an uncanny ability to make the complicated seem simple, even ordinary. A quantitative general manager for baseball didn't exist in 2002; today it's common practice (read more).
Abstract: We explore whether free agents in Major League Baseball meet the expectations set forth by newly signed contracts. The value and duration of these contracts are negotiated between the player (and his agent) and the signing team and are based primarily on the player's performance to date, projected future performance, and potential marketing value to the team. We develop two classes of models to explore this problem using a variety of regression‐ and tree‐based machine learning algorithms. The market model uses player and team data to predict the market value of a player's performance (i.e., average contract salary). The performance model uses the same data to predict wins above replacement as a surrogate for overall player performance. We translate this measure into dollars using position‐based conversion factors. Analysis of these models demonstrates that the performance model more consistently predicts and assesses player value with respect to their free agent contracts. Together, these models can be used to target or avoid free agents (or other players) whose performance‐based value differs significantly from their market value. © 2016 Wiley Periodicals, Inc. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2016
Pub.: 29 Jun '16, Pinned: 16 Apr '17
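The paper's core comparison can be sketched in a few lines. This is not the authors' actual model, just the idea: convert predicted wins above replacement (WAR) to dollars with a position-based $/WAR factor, then compare that performance-based value against the contract's average annual salary. The conversion figures and threshold below are hypothetical placeholders.

```python
# Illustrative $/WAR conversion factors by position -- NOT the paper's values.
DOLLARS_PER_WAR = {
    "C": 9.0e6, "SS": 8.5e6, "OF": 8.0e6, "1B": 7.5e6, "SP": 8.0e6,
}

def performance_value(predicted_war, position):
    """Translate projected WAR into dollars via a position-based factor."""
    return predicted_war * DOLLARS_PER_WAR[position]

def assess(predicted_war, position, avg_contract_salary, tolerance=0.15):
    """Flag a free agent as 'target', 'avoid', or 'fair' by comparing
    performance-based value to the market price (average salary)."""
    value = performance_value(predicted_war, position)
    if value > avg_contract_salary * (1 + tolerance):
        return "target"   # market is paying less than the performance value
    if value < avg_contract_salary * (1 - tolerance):
        return "avoid"    # market is paying more than the performance value
    return "fair"
```

For example, an outfielder projected at 4.0 WAR but signed at $20M/year would be flagged as a target under these (made-up) factors, since his performance-based value works out to $32M.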
Abstract: Recursive partitioning methods producing tree-like models are a long standing staple of predictive modeling. However, a fundamental flaw in the partitioning (or splitting) rule of commonly used tree building methods precludes them from treating different types of variables equally. This most clearly manifests in these methods' inability to properly utilize categorical variables with a large number of categories, which are ubiquitous in the new age of big data. We propose a framework to splitting using leave-one-out (LOO) cross validation (CV) for selecting the splitting variable, then performing a regular split (in our case, following CART's approach) for the selected variable. The most important consequence of our approach is that categorical variables with many categories can be safely used in tree building and are only chosen if they contribute to predictive power. We demonstrate in extensive simulation and real data analysis that our splitting approach significantly improves the performance of both single tree models and ensemble methods that utilize trees. Importantly, we design an algorithm for LOO splitting variable selection which under reasonable assumptions does not substantially increase the overall computational complexity compared to CART for two-class classification.
Pub.: 14 Dec '16, Pinned: 16 Apr '17
Abstract: Injuries cost Major League Baseball teams over 1 billion dollars in 2014; that is enough to buy all but 4 of the 30 Major League Baseball teams outright.1 Improving performance and saving money by preventing injury in the players is a high priority, and one justification for preseason physical screening. On the surface, this seems sensible, but so did the player scouting practices of the last century, which the statisticians subsequently thoroughly debunked. In the Moneyball age,2 baseball players are bought and sold on fractional differences in performance statistics—there is little room for unfounded hunches. How does screening stand up in the Moneyball age? I argue that screening as we now do it is the same as player evaluation was years ago—it sounds like a good idea, but we are kidding ourselves if we think it is preventing injury. Dogma—what happens now and why...
Pub.: 30 Jun '16, Pinned: 13 Apr '17
Abstract: Athletes in most sports now have access to an abundance of information about the situational tendencies of their opponent(s) but it is currently unclear how effectively this information can be used or how to best present it. Three different methods for presenting situational information about a baseball pitcher were compared for college baseball players hitting in a batting simulator: Build-Up (shown cumulative pitch distributions), Full (shown complete distributions) and Control (shown no distributions). Initially, both the Build-Up and Full groups had significantly higher batting averages than the control group, however, the Full group had significantly lower batting performance when the pitcher was changed. Providing situational probability information gives a significant advantage to a batter, however, there is a trade-off (between short-term effectiveness and negative transfer) which depends on how the information is presented.
Pub.: 27 Oct '15, Pinned: 13 Apr '17