Dr. Chris Edwards’ Data Science Page

Disclaimer: Naturally, this page is always under construction.

An analysis of Baseball Data, a Data Science exercise

The website retrosheet.org offers downloads of play-by-play information for over 100 years of baseball seasons. The information can be used to learn about the game of baseball, at least as it is played at the Major League level. My interest in the data is in creating a method to simulate how games play out. There are many table-top baseball games (Strat-O-Matic, APBA, and Statis Pro, among others) that allow fans to recreate games or seasons, or even to stage hypothetical matchups, such as Babe Ruth against Nolan Ryan. The game of baseball is rich in record keeping, and that is the appeal for many fans.

Other uses of this data could include learning strategies to win more games. Do managers use the proper tactics? Could we improve how the game is managed by learning which past actions produced success? Because we cannot run proper experiments, we are forced to rely on the record of observational data. However, if we can discover patterns, they may help answer some of these questions about strategy.

Full article

A web scraping example

Many websites display information in tables using HTML. This is wonderful for looking up individual facts, but it makes the information hard to save when there are more than a few entries, or when we plan to access many different tables. If we can read the HTML code from a website and determine where the information we want sits in the lines of code (perhaps by finding a unique piece of markup that surrounds the desired data), we can efficiently “scrape” the data from the site. This skill is an important part of being a data scientist, and this post details one such effort on my part.
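As a rough illustration of the idea, here is a minimal sketch in Python that pulls the cells out of an HTML table using only the standard library. It is not the code from the full article. The class name TableScraper, the helper scrape_table, and the sample table (with placeholder players and numbers) are all made up for this example, and a real page would first have to be downloaded and would likely need more careful handling.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row.

    Hypothetical helper for illustration; it ignores nested tables.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>HR</th></tr>
  <tr><td>Player A</td><td>12</td></tr>
  <tr><td>Player B</td><td>7</td></tr>
</table>
"""

rows = scrape_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

Here the "unique code that highlights the desired data" is simply the tr, td, and th tags. On a real site the marker might instead be a particular id or class attribute, which handle_starttag could check through its attrs argument.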

Full article

Last updated: October 30, 2020