
Using Bash for Data Pipelines

Using bash scripts to create data pipelines is incredibly useful as a data scientist. The possibilities with these scripts are almost endless, but here I will walk through a tutorial on a very basic bash script that downloads data and counts the number of rows and columns in a dataset.


Once you get the hang of using bash scripts, you have the basics for creating IoT devices and much, much more, since all of this works on a Raspberry Pi. One cool project would be to download all of your Twitter messages using the Twitter API and then predict whether a given message is spam. It could run on a Raspberry Pi server in your room! That is a little outside the scope of this tutorial, though, so we will begin by looking at a dataset of speed limits in San Francisco!

In addition, pulling live data and being able to replicate results are very common necessities in data science. In this tutorial, I will show you how to create a data pipeline using bash and Unix commands, which you can then use in your work as a data scientist.


Prerequisites

  1. Familiarity with the command line (know how to make a directory, change directories, and create new files)
  2. A Linux or macOS computer
  3. An internet connection

Goals: Create a bash script that does the following

  • Download data from an online source
  • Count the number of rows in the data
  • Record the names of the columns in the data
  • Iterate through multiple files and report all of this information for each (see the loop sketch at the end of Part 2)

Part 1: Downloading the data files

For the first part of this tutorial, let's say that we are working with the San Francisco Speed Limit Data, and we want to create our entire pipeline through the command line. We will need to start by creating a folder for this work, which I called cars_pipeline.

To do this in Unix, you need to execute these commands on the command line:

  • Make the directory

mkdir cars_pipeline

  • Move into the new directory

cd cars_pipeline


Once we are in here, we are going to create a new bash script, which I called download_data.sh. You can name yours whatever you would like, as long as it ends with the extension .sh.

  • The command for creating the file is:

touch download_data.sh


Now that we have created this file we are going to download the data into the folder we are working in. To do this, we will use the text editor called nano.

  • To open our file in nano we execute the following command in the command line:

nano download_data.sh

Once you have this open, you will create your first bash script by pasting the following code:

  • Any text that leads with a # is a comment, except for #!/bin/bash, which is called a shebang and should be the first line of every bash script
#!/bin/bash

# Command below is to download the data
curl -o speed.csv https://data.sfgov.org/api/views/wytw-dqq4/rows.csv?accessType=DOWNLOAD

To save:

  • control + o
  • enter
  • control + x

Now that we have saved our file, let's explore this large command we are using to download our data:

curl -o speed.csv https://data.sfgov.org/api/views/wytw-dqq4/rows.csv?accessType=DOWNLOAD

Here, curl transfers data from the URL at the end of the command, and the -o flag tells it to save the response into the file named right after it (speed.csv) instead of printing everything to the screen.

Now, to download the data, we just need to run our bash script with the following command in the terminal:

bash download_data.sh

If you now look in that folder, you will see that you have downloaded the file speed.csv!

Please ask if you have any questions about this, and remember you can always download a different file by replacing the URL. You could also pass a command line argument in to the script so that you can use it to download any data that you would like, as sketched below.
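For example, here is a minimal sketch of that idea (my addition, not part of the original tutorial): it takes the URL as the first command line argument and an optional output filename as the second.

#!/bin/bash
# Hypothetical download_any.sh: download whatever URL is passed in
# Usage: bash download_any.sh <url> [output-file]
url=$1
outfile=${2:-data.csv}   # fall back to data.csv if no second argument is given
curl -o "$outfile" "$url"

You would run it with something like bash download_any.sh "https://data.sfgov.org/api/views/wytw-dqq4/rows.csv?accessType=DOWNLOAD" speed.csv (with quotes around the URL, since it contains characters like ? that the shell would otherwise try to interpret).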

This script is pretty basic though, so let's move on to one that is slightly more difficult, which will tell us the number of rows and columns in our dataset.


Part 2: Parsing the files for the required information

Here we are going to create a more complicated, more in-depth script. I will go through each line, explaining what it does, and then put it all together as a script at the end.

I named this second script process_data.sh, where once again you can name it whatever you would like as long as it ends with the .sh extension.

You will use the same process as above to edit and save your bash script. You can follow along with the article, or scroll down and copy the entire script in using nano.

Line 1: The shebang

#!/bin/bash

We are going to pass the name of the file in as a command line argument and save out a text file with our results. To do this we will use a command called echo, which prints out the value following it. To access the arguments, we enumerate them: the first is $1, the second is $2, and so on…
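To make this concrete, here is a tiny throwaway script (I'll call it show_args.sh; the name is just for illustration) that does nothing but print its first two arguments:

#!/bin/bash
# Illustrative only: print the first two command line arguments
echo "The first argument is:" $1
echo "The second argument is:" $2

Running bash show_args.sh speed.csv results prints each value back on its own line.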

We want it to save the name of the file so we will use the command:

echo "The name of the file is:" $1

Now we will use the following command to count the number of rows in the csv:

lines=$(wc -l < $1)
echo "The file has" $lines "lines"
  • We create the variable lines, which holds the number of lines in the file we passed in as the command line argument! For more information on the wc command, run man wc (and see the note just below this list)
  • Then we print out the number of lines the file has!
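One subtlety worth knowing: wc -l counts every line, including the header row of the csv. If what you actually want is the number of data rows, a small variation (my tweak, not part of the original script) subtracts one:

lines=$(wc -l < $1)
datarows=$((lines - 1))   # subtract one to account for the header row
echo "The file has" $datarows "data rows"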

Next up, we are going to get the column names from our csv file! To do this we will use the following command in our bash script.

colnames=$(head -n 1 < $1)

This creates a variable that holds just the first line from our csv! Putting all of this together (plus a little more that I added so that the date auto-populates into the text file), we get the following script:

#!/bin/bash
echo "Data Processed by Elliott Saslow"

# Auto-populate today's date into the report
DATE=$(date +%Y-%m-%d)
echo "Date is: "$DATE
echo ""
echo "The name of the file is:" $1
echo ""

# Count the number of lines in the file
lines=$(wc -l < $1)
echo "The file has" $lines "lines"
echo ""

# Grab the header line, which holds the column names
colnames=$(head -n 1 < $1)
echo "Column names are: "
echo ""
echo $colnames
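As a side note, $colnames holds the column names as one long comma-separated string. If you would rather print each name on its own line, one option (an addition of mine, which assumes no column name itself contains a comma) is to pipe the header through tr:

head -n 1 $1 | tr ',' '\n'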

Now to run the script, we save it like we did above with nano, and run it in the command line with the following command:

bash process_data.sh speed.csv > text.txt

This command does the following:

  • Calls the script
  • Passes in the file that we are looking at (speed.csv)
  • Redirects the output into a text file called text.txt
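One thing to keep in mind: the single > overwrites text.txt every time you run the script. If you would rather keep a running history of reports, use >> instead, which appends to the file:

bash process_data.sh speed.csv >> text.txt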

If you run this, and have done everything else correctly, you will have a text file in your folder containing the beginning of quality control checks that you can use for your data pipeline!
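That leaves the last goal from our list at the top: iterating through multiple files. A simple way to do it (a sketch of my own, assuming all of your csv files sit in the current folder) is to wrap the processing script in a for loop:

#!/bin/bash
# Run process_data.sh on every csv in this folder,
# appending each report to one combined results file.
for f in *.csv; do
    bash process_data.sh "$f" >> all_results.txt
done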

Let me know in the comments where you get stuck and how I can help!

Cheers

For help with Python, Unix or anything Computer Science, book a time with me on EXL skills
