Dr. Chris Edwards’ Data Science Page
Disclaimer: Naturally, this page is always
under construction.
An analysis of Baseball Data, a Data Science exercise
The website retrosheet.org offers downloads of
play-by-play data for over 100 years of baseball seasons. The
data can be used to learn about the game of baseball, at least as it is
played at the Major League level. My interest in the data is in building a
method to simulate games. There are many table-top baseball games
(Strat-O-Matic, APBA, and Statis Pro, among others) that allow fans to recreate
games or seasons, or even to stage hypothetical matchups, such as Babe Ruth
against Nolan Ryan. Baseball is rich in record keeping, and
this is much of its appeal to many fans.
Other uses of this data could include learning
strategies to win more games. Do managers use the proper tactics? Could we
improve how the game is managed by learning which past actions produced
success? Because we cannot run controlled experiments, we are forced to rely on
the observational record. However, if we can discover patterns in it, they
may help answer some of these questions about strategy.
A webscraping example
Many websites display information in HTML
tables. This is wonderful for looking up individual facts, but it makes the
information hard to save when there are more than a few
entries, or when we plan to access many different tables. If we can read the HTML
source of a page and determine where the information we want sits in the
code (perhaps by finding a distinctive tag or attribute that marks the desired
data), we can efficiently “scrape” the data from the site. This skill is
an important part of being a data scientist, and this blog details one such
effort on my part.
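As a minimal sketch of the idea, and not the code used for the effort described here, the following Python example pulls the rows out of an HTML table using only the standard library's html.parser module. The class name TableScraper, the helper scrape_table, and the sample page are all made up for illustration. A real page would first have to be downloaded, and its table would likely need to be singled out by some distinctive tag or attribute.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return all table rows found in an HTML string as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample page; the teams and numbers are invented.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td>Team A</td><td>10</td></tr>
  <tr><td>Team B</td><td>7</td></tr>
</table>
</body></html>
"""

rows = scrape_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

Once the rows are plain Python lists, they can be written to a CSV file or loaded into whatever analysis tool is convenient.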
Last updated: October 30, 2020