Bash for Data Analysis: Who Bashed Most Home Runs in a Season?

Learning MLB Bashing Stats with BASH

 

If you are wondering what Bash is, this link has a solid introduction and a good all-around tutorial: https://itnext.io/bash-scripting-everything-you-need-to-know-about-bash-shell-programming-cd08595f2fba

Bash is essentially a shell program. 

 

A shell program is typically an executable binary that takes the commands you type and, once you hit Return, translates them (ultimately) into system calls to the operating system API. Examples of using Bash are the Terminal on macOS, the command line on Linux, or the Windows Subsystem for Linux.

 

If you are not on a Linux or Mac machine, set up the Windows Subsystem for Linux. Instructions are here: https://docs.microsoft.com/en-us/windows/wsl/install-win10

Bash scripts can be used to quickly analyze data files and find key insights without having to open the file in Excel or loop through it with a Python program.

 

It’s an incredibly powerful language with simple(ish) syntax, all from the command line. And I'm going to use it to answer a question I don't know the answer to.

 

WHICH PLAYERS HAD A TOP 30 SEASON FOR TOTAL HOME RUNS HIT?

Basically, who has bashed the most home runs within a season?

Let’s find out.

 

Complex data can be gathered and analyzed from just the command line, with no need to open any Microsoft software tools or Tableau.

You can download a free database of MLB stats from the command line by running this command:

curl -LO https://github.com/chadwickbureau/baseballdatabank/archive/master.zip

Then unzip the download

unzip master.zip

 

List the contents of the directory, including when each item was last modified

ls -l

Delete the .zip file

rm master.zip

Change into the baseballdatabank-master directory (type cd bas, then hit TAB to autocomplete the directory name)

cd baseballdatabank-master

Run another ls to see what is in the main directory, then change into the ./core directory

cd ./core

The ./ prefix means the current directory, so this is the same as cd core.

Now you can use ls to see all the new files.

So let's investigate the data, looking for stats on the people and on their hitting. Use head on the Batting.csv file and the People.csv file to preview the files you will be examining. You can head multiple files at the same time using the syntax below:

head Batting.csv People.csv

When you look at these two files, you see that playerID is the key that links the People table to the Batting table (a primary key in People, referenced by each row in Batting), so you can connect the batting stats to the player.

 

Examining the output of the head command, you also see the HR column in the Batting.csv file. What you want to do is sort the file by that value and see who has the most HRs.

Use the cut command to work out which field number HR is.

For example, getting just the first column and piping that result into a head command looks like this:

cut -d "," -f 1 Batting.csv | head

Here you set the delimiter to “,” (a comma) and ask for the first field, which holds the playerID. Now you can count across the header to see which field is HR.

After counting, HR is the 12th field, so let's see it

cut -d "," -f 12 Batting.csv | head

And you get the home runs field
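Counting fields by eye gets error-prone on wide files, so here is a minimal sketch of letting the shell do the counting. The header string below is a hand-typed stand-in for the first line of Batting.csv (an assumption for illustration); on the real file you would feed in head -n1 Batting.csv instead.

```shell
# Stand-in for the first line of Batting.csv (assumed sample, truncated after RBI)
header='playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI'

# Put each field name on its own line, then ask grep for the line number
# of the line that is exactly "HR"; cut keeps just that number
hr_col=$(printf '%s\n' "$header" | tr ',' '\n' | grep -nx 'HR' | cut -d: -f1)

echo "HR is field $hr_col"
```

The number it reports is the value to pass to cut -f and sort -k.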

Now that you have found the data you want to collect from Batting, let's sort it

sort -t, -k12,12 -nr Batting.csv | head -n20

This shows the top 20 rows (player IDs and batting info) sorted by the 12th column. The -t argument sets the delimiter that separates columns, -k12,12 restricts the sort key to field 12, -n sorts numerically, and -r reverses the order, so you get the most home runs first instead of last.
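To see those sort flags in isolation, here is a small sketch on a made-up three-column sample, so HR is field 3 here rather than field 12. The temporary file and its layout are assumptions for illustration only.

```shell
# Cut-down sample with a header row; HR is the 3rd field in this toy file
sample=$(mktemp)
cat > "$sample" <<'EOF'
playerID,yearID,HR
ruthba01,1927,60
bondsba01,2001,73
marisro01,1961,61
EOF

# -t, splits on commas, -k3,3 sorts on field 3 only, -n numeric, -r descending.
# The header's "HR" is not a number, so it counts as 0 and sinks to the bottom.
top=$(sort -t, -k3,3 -nr "$sample" | head -n 2)

echo "$top"
rm -f "$sample"
```

This is also why the header line of Batting.csv does not show up among the top rows of the real sort.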

sort -t, -k12,12 -nr Batting.csv | cut -d "," -f1 | head -n30 > ids.csv

Now you combine a few commands from before: get the sorted list, cut the first field, take the top 30 lines of the result, and put that into a new CSV file containing the player IDs of the top 30 season HR performances.

 

To get the first and last name of each player out of the People.csv file, use the grep command with the file option (-f reads the patterns from ids.csv, -F treats them as fixed strings, and -w matches whole words only)

grep -Fwf ids.csv People.csv | cut -d "," -f 14,15

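Now you get the unique list of all the people who had a top 30 season in terms of total home runs. A player with more than one top 30 season appears several times in ids.csv, but grep prints each matching line of People.csv only once.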
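The way repeated IDs collapse into one name each can be sketched on toy data. Both files below are made-up, cut-down stand-ins for ids.csv and People.csv (the names sit in fields 2 and 3 here, not 14 and 15).

```shell
workdir=$(mktemp -d)

# Stand-in for ids.csv: one player appears twice
printf '%s\n' bondsba01 mcgwima01 mcgwima01 > "$workdir/ids.csv"

# Stand-in for People.csv with only three columns
cat > "$workdir/people.csv" <<'EOF'
playerID,nameFirst,nameLast
bondsba01,Barry,Bonds
mcgwima01,Mark,McGwire
ruthba01,Babe,Ruth
EOF

# grep walks people.csv once and prints a line if any ID matches it,
# so duplicate IDs in the pattern file do not produce duplicate names
names=$(grep -Fwf "$workdir/ids.csv" "$workdir/people.csv" | cut -d, -f2,3)

echo "$names"
rm -rf "$workdir"
```

The output follows the order of the people file, not the order of the IDs.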

Now you can put all of this into a Bash script, in case you want to save the steps and rerun the analysis when the data is updated in 2020, changing only the download link if needed.

 

To save it, let's create a new file with vim or vi (learn how to use vim/vi: https://www.linux.com/training-tutorials/vim-101-beginners-guide-vim/ )

vim top30.sh

Once in vim, press “i” to start inserting and editing the file.

First add #!/bin/bash as the first line of the file, so the system knows to run it with Bash

 

Now let's add the steps you just did above.

 

top30.sh

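#!/bin/bash
# Download the database archive
curl -LO https://github.com/chadwickbureau/baseballdatabank/archive/master.zip

# Unpack it, then delete the archive
unzip master.zip

rm master.zip

# Move into the directory that holds the CSV files
cd ./baseballdatabank-master/core

# Player IDs for the 30 highest single-season HR totals
sort -t, -k12,12 -nr Batting.csv | cut -d "," -f1 | head -n30 > ids.csv

# Look up first and last names and save them two directories up
grep -Fwf ids.csv People.csv | cut -d "," -f 14,15 > ../../FinalResults.txt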

The only thing added is that the output of the last command is saved to a file called FinalResults.txt, two directories up from core (the directory you ran the script from).

After you add those steps, save the file and quit vim (press Esc, then type :wq and hit Enter). Then you just have to give the file the proper permissions so you can execute it.

Run this command on the newly created shell script once done:

chmod u+x top30.sh

And you can now execute it

./top30.sh

And in the FinalResults.txt file you will have a unique list, in alphabetical order, of the players who had a top 30 season in total HR:

Pete,Alonso
Jose,Bautista
Barry,Bonds
Chris,Davis
Jimmie,Foxx
Luis,Gonzalez
Hank,Greenberg
Ken,Griffey
Ryan,Howard
Ralph,Kiner
Mickey,Mantle
Roger,Maris
Mark,McGwire
David,Ortiz
Alex,Rodriguez
Babe,Ruth
Sammy,Sosa
Giancarlo,Stanton
Jim,Thome
Hack,Wilson

 

 

That’s how you use Bash to see who bashed the most home runs.

 


Thanks for reading!

Data Source: http://www.seanlahman.com/baseball-archive/statistics/ 



- Dom

July 2, 2020, 7:14 a.m.
