Supercharge Your Data Analytics with gspread and Pandas: A Hands-On Tutorial (Part 2)
In Part 1, we completed our gspread installation and authentication We successfully downloaded the JSON file with our credentials. In this article, we will configure gspread for use with a service account and then we will torture our data for some insights using pandas.
Please read Part 1 before continuing.
If you have already read Part 1, Lezzzz go!
To properly configure gspread for use with a service account, the downloaded credentials file should be moved to the appropriate directory.
To move the downloaded file to the correct directory on Linux and macOS, you can use the following command:
mv /path/to/downloaded/file.json ~/.config/gspread/service_account.json
For Windows users, you can use the following command:
move \path\to\downloaded\file.json %APPDATA%\gspread\service_account.json
By moving the credentials file to the appropriate directory, gspread will be able to locate the file and use it to authenticate with the Google Sheets API.
If the directory does not exist, you can create it with the following command:
mkdir -p ~/.config/gspread/
We’re almost there….I know you want to see some action but hang in tight a little bit.
In order to access a Google Sheets spreadsheet from your application or script using gspread, it is very important that we share the spreadsheet with the client_email that can be found in the JSON file. This can be done in the same way that you would share a Google Sheets spreadsheet with any other Google account so head over to your Google Sheets spreadsheet and share it with the client_email.
Here’s how you can do that
1. Go to your google sheet
2. Click on Share
3. Enter the client_email value here
4. Click on Done
Finally! we can start developing our python script.
Open any python client and run the following code
import gspread
goocleclient = gspread.service_account()
sheet = goocleclient.open(sheet_name).worksheet(page)
print(sheet.get('A1'))
sheet_name is the title of the spreadsheet
page is the name of the sheet found on the bottom
Replace sheet_name and page with the appropriate values and the script should return a value.
But hey! what just happened?
The code imports the gspread
library and creates a Google client using gspread.service_account()
, then accesses a specific Google Sheet and worksheet using the open()
and worksheet()
methods respectively.
Finally, it prints the value of cell A1 using the get()
method. Since there were no errors raised, we can assume that the connection to the Google Sheet was successful and the value of cell A1 was printed to the console.
You did it!
It is now time to torture our data. I will be using a google sheet that contains product data. Feel free to use it for all your escapades.
Click here to get access to the google sheet I am using for this tutorial.
Content of the google sheet used for this tutorial
Order ID
- An Order ID is the number system that Amazon uses exclusively to keep track of orders.Product
- The product that has been sold.Quantity Ordered
- Ordered Quantity is the total item quantity ordered in the initial order (without any changes).Price Each
- The price of each product.Order Date
- This is the date the customer is requesting the order be shipped.Purchase Address
- The purchase order is prepared by the buyer, often through a purchasing department.
How do we introduce pandas?
The simplest way to get data from a sheet to a pandas DataFrame is with get_all_records()
:
import pandas as pd
df = pd.DataFrame(sheet.get_all_records())
Now that we have our data in pandas, let us torture it so that it can confess some insights to us.
Let us first convert the Order Date field to a date time
# Convert Order Date to datetime
df['Order Date'] = pd.to_datetime(df['Order Date'])
Insight 1: Total revenue
To get the total revenue, we can multiply the Quantity Ordered
by the Price Each
for each row and then sum the result:
# Add a Total Sales column
df['Total Sales'] = df['Quantity Ordered'] * df['Price Each']
total_revenue = df['Total Sales'].sum()
print(f"Total Revenue: ${total_revenue:.2f}")
Insight 2: Best-selling product
To find the best-selling product, we can group the data by Product
and sum the Quantity Ordered
for each group, then sort the result in descending order:
# Insight 2: Best Selling Product
best_selling_product = df.groupby('Product')['Quantity Ordered'].sum().idxmax()
print(f"Best Selling Product: {best_selling_product}")
# Visualization
plt.figure(figsize=(10, 6))
sns.barplot(x="Product", y="Quantity Ordered", data=df, order=df.groupby('Product')['Quantity Ordered'].sum().sort_values(ascending=False).index)
plt.xticks(rotation=90)
plt.title('Best Selling Product')
plt.show()
Insight 3: Most popular city
To find the most popular city for sales, we can extract the city name from the Purchase Address
column using the str.split()
and str.get()
methods, then group the data by city and sum the Revenue
for each group, and finally sort the result in descending order:
# Insight 3: Most Popular City
df['City'] = df['Purchase Address'].apply(lambda x: x.split(',')[1])
most_popular_city = df.groupby('City')['Order ID'].count().idxmax()
print(f"Most Popular City: {most_popular_city}")
# Visualization
plt.figure(figsize=(10, 6))
sns.countplot(x='City', data=df, order=df['City'].value_counts().index)
plt.xticks(rotation=90)
plt.title('Most Popular City')
plt.show()
Our data has confessed enough!
And there you have it! With gspread and pandas, you can access and analyze your Google Sheets data like a boss. Who knows, you might even discover that the best-selling product in your store is a set of headphones that only play baby shark music. But don’t blame us if that becomes a bestseller! Now go forth and conquer your data, one spreadsheet at a time!
I hope you enjoyed this read. Thanks for your time.
You can find the full code on my GitHub.
Connect with me via Linkedin and Twitter.