Learning MLB Bashing Stats with BASH
For those wondering what Bash is, this article is a solid introduction and a good all-around tutorial: https://itnext.io/bash-scripting-everything-you-need-to-know-about-bash-shell-programming-cd08595f2fba
Bash is essentially a shell program.
A shell program is typically an executable binary that takes the commands you type and, once you hit return, translates them (ultimately) into system calls to the operating system API. You can access Bash through the Terminal on macOS, the command line on Linux, or the Windows Subsystem for Linux.
If you are not using a Linux or Mac machine, set up the Windows Subsystem for Linux. Instructions are here: https://docs.microsoft.com/en-us/windows/wsl/install-win10
Bash scripts can be used to quickly analyze data files and find key insights without having to open the file in Excel or loop through it with a Python program.
It’s an incredibly powerful language with simple(ish) syntax, all from the command line. And I'm going to use it to answer a question I don't know the answer to.
WHAT PLAYERS HAD A TOP 30 SEASON PERFORMANCE FOR TOTAL HR HITTING DURING A SEASON?
Basically, who has bashed the most home runs within a season?
Let’s find out.
Complex data can be gathered and analyzed from just the command line, with no need to open any Microsoft software tools or Tableau.
You can download a free database of MLB stats from the command line by running this command:
curl -LO https://github.com/chadwickbureau/baseballdatabank/archive/master.zip
Then unzip the download:
unzip master.zip
List the contents of the directory in long format to see when they were modified:
ls -l
Delete the .zip file:
rm master.zip
Change directories into the baseballdatabank-master directory. Type cd bas, then hit Tab to autocomplete the directory name:
cd baseballdatabank-master
Run ls again to see what is in the main directory:
ls
Then change into the core directory (the ./ prefix means "starting from the current directory"):
cd ./core
Now you can use ls to see all the new files:
ls
Now let's investigate the data, looking for the players and their hitting stats. Run head on the Batting.csv and People.csv files to get a preview of the files you will be examining. You can head multiple files at the same time using the syntax below:
head Batting.csv People.csv
When you compare these two files, you see that playerID is the primary key of the People table and also appears in the Batting table, so you can connect each line of batting stats to a player.
Looking at the output of the head command, you also see the HR column in the Batting.csv file. What you want to do is sort the file by that value and see who has the most HRs.
Use the cut command to work out which field number HR is.
For example, to get just the first column and pipe that result into a head command, it would look like this:
cut -d "," -f 1 Batting.csv | head
Here the delimiter is set to "," (a comma) and you are asking for the first field, which holds the playerID. Now you can count across the header to see which field holds HR.
After counting, it's the 12th field, so let's see it:
cut -d "," -f 12 Batting.csv | head
And you get the home runs field.
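Counting commas by eye is easy to get wrong on a wide file. As a rough sketch, the header row can be split into one field per line and numbered with grep -n. The header below is a made-up sample written to a temporary file (an assumption for illustration, not the real Batting.csv), and hr_col is just a name chosen here:

```shell
# Hypothetical sample: a header row only, with HR in the 12th position.
workdir=$(mktemp -d)
printf 'playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI\n' > "$workdir/sample_header.csv"

# Take the header, put each field on its own line, then keep the line
# number of the field that is exactly "HR".
hr_col=$(head -n 1 "$workdir/sample_header.csv" | tr ',' '\n' | grep -n -x 'HR' | cut -d ':' -f 1)
echo "HR is field $hr_col"
```

Pointing the same pipeline at Batting.csv should report the field number to pass to cut -f and sort -k.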
Now that you have found the data you want from Batting, let's sort it:
sort -t, -k12,12 -nr Batting.csv | head -n20
This shows the top 20 rows (player IDs and batting info) sorted by the 12th column. The -t argument sets the delimiter that separates the columns, -k12,12 sorts on the 12th field only, -n sorts numerically, and -r reverses the order, so you get the most home runs first instead of last.
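To see the sort flags in isolation, here is a small sketch on a made-up three-column file (the player IDs and numbers are invented for illustration, and HR is field 3 here rather than field 12):

```shell
# Hypothetical sample: playerID, yearID, HR, with a header row.
workdir=$(mktemp -d)
printf 'playerID,yearID,HR\nplayera01,1999,12\nplayerb01,2001,73\nplayerc01,1998,9\nplayerd01,1998,70\n' > "$workdir/sample.csv"

# -t,    fields are separated by commas
# -k3,3  sort on field 3 only
# -n     compare as numbers; -r puts the largest first
top2=$(sort -t, -k3,3 -nr "$workdir/sample.csv" | head -n 2)
echo "$top2"
```

The header row does not get in the way: under -n, text that is not a number counts as 0, so the header sinks to the bottom of a reversed numeric sort.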
sort -t, -k12,12 -nr Batting.csv | cut -d "," -f1 | head -n30 > ids.csv
Now you combine a few commands from before: you get the sorted list, cut the first field, take the top 30 lines of the result, and put them into a new CSV file containing the player IDs of the top 30 season HR performers.
To get the first and last name of each player out of the People.csv file, use the grep command with the -f (file) option:
grep -Fwf ids.csv People.csv | cut -d "," -f 14,15
Now you get the unique list of all the people who had a top 30 season in terms of total home runs.
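The list is unique even though ids.csv can contain the same player ID several times (one player can have more than one top season), because grep prints each matching line of People.csv once. A small sketch with invented IDs and names, using a three-column people file instead of the real layout:

```shell
workdir=$(mktemp -d)
# Hypothetical IDs file with a repeated ID.
printf 'playerb01\nplayerd01\nplayerb01\n' > "$workdir/ids.csv"
# Hypothetical people file: playerID, first name, last name.
printf 'playerID,nameFirst,nameLast\nplayera01,Al,Alpha\nplayerb01,Bo,Bravo\nplayerd01,Di,Delta\n' > "$workdir/people.csv"

# -F  treat each pattern as a fixed string, not a regular expression
# -w  match whole words only
# -f  read the patterns from a file, one per line
names=$(grep -Fwf "$workdir/ids.csv" "$workdir/people.csv" | cut -d ',' -f 2,3)
echo "$names"
```

Note that grep looks for the IDs anywhere on the line, not only in the first field, and the rows come back in the order they appear in the people file rather than in home run order.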
Now you can put all of this into a Bash script, in case you want to save the steps for a future analysis when the data gets updated in 2020, and just update the download link.
To save it, let's create a new file with vim or vi (learn how to use vim/vi: https://www.linux.com/training-tutorials/vim-101-beginners-guide-vim/ )
Once in vim, press “i” to start inserting and editing the file.
First add #!/bin/bash as the first line of the file, so the system knows to run the script with Bash.
Now let's add the steps you just did above.
#!/bin/bash
curl -LO https://github.com/chadwickbureau/baseballdatabank/archive/master.zip
unzip master.zip
rm master.zip
cd ./baseballdatabank-master/core
sort -t, -k12,12 -nr Batting.csv | cut -d "," -f1 | head -n30 > ids.csv
grep -Fwf ids.csv People.csv | cut -d "," -f 14,15 > ../../FinalResults.txt
The only thing added is that the output of the last command is saved to a file called FinalResults.txt.
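If you expect to rerun the analysis with a different cutoff, one possible variation is to wrap the sort and grep steps in a function. This is only a sketch: top_hr_names is a hypothetical name, and it assumes, as in the steps above, that HR is field 12 of the batting file and that the first and last names are fields 14 and 15 of the people file.

```shell
# Hypothetical helper: print "first,last" for players with a top-N HR season.
# Usage: top_hr_names BATTING_FILE PEOPLE_FILE [N]   (N defaults to 30)
top_hr_names() {
  local batting=$1
  local people=$2
  local n=${3:-30}
  local ids
  ids=$(mktemp)   # temporary file instead of leaving ids.csv behind

  # Same pipeline as the script: sort by HR, keep the player IDs of the top N rows.
  sort -t, -k12,12 -nr "$batting" | cut -d ',' -f 1 | head -n "$n" > "$ids"

  # Look those IDs up in the people file and keep the name fields.
  grep -Fwf "$ids" "$people" | cut -d ',' -f 14,15

  rm -f "$ids"
}
```

From inside the core directory, it could then be called as top_hr_names Batting.csv People.csv 30 > ../../FinalResults.txt.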
After you add those steps, you just have to give the file the proper permissions so you can execute it.
Run this command on the newly created shell script once done:
chmod u+x top30.sh
And you can now execute it with ./top30.sh
And in the FinalResults.txt file you will have a unique list, in alphabetical order, of the players who had a top 30 season in total HR:
Pete,Alonso
Jose,Bautista
Barry,Bonds
Chris,Davis
Jimmie,Foxx
Luis,Gonzalez
Hank,Greenberg
Ken,Griffey
Ryan,Howard
Ralph,Kiner
Mickey,Mantle
Roger,Maris
Mark,McGwire
David,Ortiz
Alex,Rodriguez
Babe,Ruth
Sammy,Sosa
Giancarlo,Stanton
Jim,Thome
Hack,Wilson
That’s how you use Bash to see who bashed the most home runs.
Thanks for reading!
Data Source: http://www.seanlahman.com/baseball-archive/statistics/
July 2, 2020, 7:14 a.m.