


Google Cloud Data Pipeline
A Simple Data Architecture Pipeline for Your Sales Data Using GCP
Posted by Maya Sandler on May 8, 2025
Intent
As a Data Engineer consultant I am constantly on a project: learning the client's use cases, needs, and challenges, finding the solution that will help them get to the next level, building the 'data product' best suited to them, and educating them so they can run it on their own. I thrive on rapidly understanding client needs to architect and implement the right data solutions, and I love making teams and businesses grow - this work is very fulfilling when I see its outcomes!
My project-based work usually means contributing to specific stages of a data pipeline, never creating a full-stack pipeline from the raw data files all the way to visualization. Recognizing that, I initiated this personal project to develop comprehensive, full-stack expertise.
This project addresses a common e-commerce requirement:
- Constructing an end-to-end pipeline to ingest raw sales data arriving from multiple stores on multiple dates.
- Performing analysis in a view.
- Generating visualizations for operational insights and KPI tracking by the CEO and relevant teams.
As the core objective of this project was to build a complete data pipeline from start to finish, the emphasis was placed on the cloud data engineering aspects rather than in-depth analytics and visualization storytelling. However, the resulting architecture offers a solid and efficient framework for future expansion into a secure, reliable, and cost-effective cloud environment. I enjoyed working on this project and I assure you it doesn't take a long time to build!
​
E-commerce data was downloaded from Kaggle, split into multiple files via a Python script, and loaded into Google Cloud Storage. A Google Cloud Run Function written in Python creates a raw data table in Google BigQuery and inserts new records into it, triggered whenever the upload of a new file into the Google Cloud Storage bucket is finalized. A curated view in Google BigQuery contains the cleaned and transformed data, which is automatically pulled and visualized in Google Looker.

Description of the Data
The dataset (.csv) was downloaded from Kaggle and contained transactional data from a UK-based, non-store online retailer, spanning the period between December 1st, 2010, and December 9th, 2011. This dataset captures sales occurring globally, offering a comprehensive view of the retailer's online transactions during that timeframe. The dataset contains 541,909 records.
​
The dataset contains the following fields:
- InvoiceNo - a string data type, to accommodate both standard invoice numbers and cancellations. Cancellation records are uniquely identified by a "C" prefix appended to the original invoice number, allowing clear distinction and tracking of order modifications.
- StockCode - a string data type detailing the item code in the store.
- Description - a string data type detailing the name of each item sold in the invoice.
- Quantity - an integer data type detailing the quantity of a single item sold in the invoice.
- InvoiceDate - a datetime data type in the format "%Y-%m-%d %H:%M:%S UTC".
- UnitPrice - a numeric data type for the price of a single item sold in the invoice. Amounts are in USD.
- CustomerID - an integer data type, unique per customer.
- Country - a string data type detailing the country of the customer making the online purchase.
Preprocessing
Initial Processing of the Raw Data
The original file was split into multiple files to simulate real-life stores uploading data on multiple dates over time. I know this data belongs to a single online retailer, but it was good practice even just to experience the automatic aggregation of raw data coming from multiple stores. This was done via a short Python script that split the data into batches of 2,000 records each:
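A minimal sketch of that splitting logic (the input file name here is illustrative, and I used pandas for convenience):

```python
# Sketch of the splitting step; "online_retail.csv" is an illustrative input name.
import pandas as pd

BATCH_SIZE = 2000
df = pd.read_csv("online_retail.csv")  # the full Kaggle export

# Write consecutive slices of 2,000 records to separate batch files,
# keeping the header row in every file.
for i, start in enumerate(range(0, len(df), BATCH_SIZE), start=1):
    batch = df.iloc[start:start + BATCH_SIZE]
    batch.to_csv(f"batch{i}_sales.csv", index=False)
```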
I then changed the first two CSV files to hold 9 and 37 records (by manually moving the rest of the records into the batch3 file), as I was going to use them for testing the Google Cloud Run Function and wanted to save time and cost.
Creating objects in Google Cloud
For this project I needed to create several objects in Google Cloud:
- A new project.
- A bucket in Cloud Storage to hold the multiple files coming from the multiple stores at different times.
- A dataset in BigQuery to hold the raw data coming from the bucket.
- A Cloud Run Function triggered by the finalized upload of a new file in Cloud Storage.
- A reporting view in BigQuery.
Now let's discuss each step.
​​
Creating a New Google Project
In my Google Cloud account, I make sure to create a separate project for each initiative. It might seem simpler to just pile everything into one Google Cloud project, but creating separate projects for different endeavors offers some significant advantages:
- Isolation & Organization: Each project acts as a logical container for my resources, which makes it much easier to organize, manage, and understand what belongs to which application and environment.
- Security Boundaries: Google Cloud Identity and Access Management (IAM) policies are defined at the project level. With separate projects, you can grant different levels of access to different teams or individuals for specific projects.
- Billing & Lifecycle Management: Costs are tracked per project under your billing account, which makes it straightforward to see what each project costs; and once you are done with a project, you can shut it down to stop incurring charges.
- Resource Limits: Google Cloud imposes quotas on various resources per project, so separating your projects helps prevent one resource-intensive application from impacting other, unrelated workloads.
Creating a New Bucket in Google Cloud Storage
I am creating a new bucket called 'sales_example_11' that will hold all of the data files coming from the different stores. It must be isolated from other buckets, as I intend all the files in this bucket to hold sales data only. This way I will be able to create an automated function later that identifies when a new file has been loaded into this bucket and loads the records from that file into the data table in BigQuery.
As this is my own project, I am trying to reduce cost and don't really need backups for my storage. Therefore, the parameters for this bucket (see the sketch after this list) are:
- Regional - as I am in the San Francisco area and want to reduce latency, I chose the us-west1 region.
- Standard storage class - as this bucket is needed for both writing and reading. An optimization that could be added here: after x days, allowing enough time for processing, move files to an archive class, as they are no longer needed in this costlier tier.
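I created the bucket in the console, but a roughly equivalent setup with the google-cloud-storage Python client would look like this (the 30-day archive rule is only an illustrative value, not something I configured):

```python
# Rough equivalent of the console steps using the google-cloud-storage client.
from google.cloud import storage

client = storage.Client()
bucket = storage.Bucket(client, name="sales_example_11")
bucket.storage_class = "STANDARD"  # read/write workload
bucket = client.create_bucket(bucket, location="us-west1")  # regional, low latency from SF

# Optional optimization: move processed files to a cheaper tier after some days.
# The 30-day age is illustrative.
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=30)
bucket.patch()
```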

Creating a New Data Set in Google BigQuery
As I am simulating sales data coming from multiple stores, it should have its own dataset dedicated to the company's sales. To create this dataset, called 'sales', I used these parameters:
- Regional - again, to reduce latency I chose the us-west1 region. I am trying to reduce cost, so I am not selecting the multi-region option.
- Encryption - I chose the simple and easy option of a Google-managed encryption key.
- Time travel of 7 days (the default), so I can recover from any deletions or unwanted changes to the dataset.

When I check that the dataset has been created, I can see its name listed. I don't need to create a table, as the Cloud Function I am going to write next will create the table for me.
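For reference, the same dataset setup can be scripted with the google-cloud-bigquery client. This is a sketch, not what I ran: the project ID is a placeholder, and max_time_travel_hours requires a recent version of the client library.

```python
# Rough equivalent of the console setup; "my-project" is a placeholder project ID.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
dataset = bigquery.Dataset("my-project.sales")
dataset.location = "us-west1"          # single region to keep cost and latency down
dataset.max_time_travel_hours = 168    # 7-day time travel window (the default)
# No encryption settings needed: Google-managed keys are used when none are specified.
dataset = client.create_dataset(dataset, exists_ok=True)
print(f"Created dataset {dataset.full_dataset_id}")
```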
Creating a New Google Cloud Run Function
A Cloud Run Function is a simple, serverless, event-driven compute service, and it is cost-effective because you pay only when it executes. Since the batch files I created are small, each run finishes quickly, which makes a function a good fit. It is similar to an AWS Lambda function.
This Cloud Run Function should be able to do multiple things:
- Treat the finalized upload of a file into the 'sales_example_11' Google Cloud Storage bucket as the trigger for execution. Only a .csv file should trigger the operation, as the script is written specifically for CSV files.
- Read the data in the .csv file, while ignoring the table headers in the file.
- Automatically detect the schema of the table in the file.
- If the BigQuery table 'orders' does not exist yet (the first time a file is read), create it using the schema detected from the file.
- Append the new records from the sales file to the 'orders' table in BigQuery.
Therefore, I first configure the function itself:
- Function - Cloud Run can be used for multiple purposes, so we need to specify that this is a function.
- Regional - again, to reduce latency I chose the us-west1 region.
- Runtime - Python 3.10.
- Trigger & IAM - the 'sales_example_11' Google Cloud Storage bucket:
  - Allow the trigger to fire on this bucket when files are finalized (meaning they have finished uploading).
  - The console prompts for the IAM roles the function needs to run smoothly.
  - Connect to Cloud Pub/Sub and the Load Balancer to allow a queue of files to be loaded, receive events, and publish tasks.
  - Enable the Identity-Aware Proxy (IAP) API - IAP acts as a central authorization layer and lets you manage access to your web applications and APIs running on Google Cloud while keeping them private and off the public internet (read more here).
- Script written in the editor - for code versioning it is usually better to use GitHub, but I wanted a quick project with less fuss. Perhaps I will change this in the future if I do anything further with this project.
When created, it automatically produces a basic function that lets you test the trigger and print some values. This is the script it creates:

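The screenshot of that starter code is not reproduced here, but the auto-generated template is essentially the following:

```python
import functions_framework


# Triggered by a change in a Cloud Storage bucket.
@functions_framework.cloud_event
def hello_gcs(cloud_event):
    data = cloud_event.data

    event_id = cloud_event["id"]
    event_type = cloud_event["type"]

    bucket = data["bucket"]
    name = data["name"]
    metageneration = data["metageneration"]
    timeCreated = data["timeCreated"]
    updated = data["updated"]

    print(f"Event ID: {event_id}")
    print(f"Event type: {event_type}")
    print(f"Bucket: {bucket}")
    print(f"File: {name}")
    print(f"Metageneration: {metageneration}")
    print(f"Created: {timeCreated}")
    print(f"Updated: {updated}")
```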
This automated Cloud Function is designed to:
- Be triggered automatically whenever a change occurs in a specific Google Cloud Storage bucket, for a specific event type (google.cloud.storage.object.v1.finalized).
- Receive detailed information about the event as a cloud_event object.
- Extract key details from the event data, such as the bucket name, file name, and timestamps.
- Print these details to the Cloud Functions logs, which lets you see which event triggered the function and the specifics of that event.
A breakdown of this simple Python script:
- The functions_framework library - essential for writing Google Cloud Functions in Python; it allows the Python code to be correctly invoked by the Cloud Functions runtime.
- The cloud_event decorator - tells the Cloud Functions runtime that the hello_gcs function should be treated as a CloudEvent handler, meaning it will receive event data in the CloudEvent format:
  - ID - every CloudEvent has a unique ID.
  - Type - the type of the event (google.cloud.storage.object.v1.finalized when a new object is created or an existing one is overwritten).
- The hello_gcs function:
  - Receives one argument (cloud_event) that contains all the information about the event that triggered the function (in this case, the GCS bucket change).
  - cloud_event.data is a dictionary containing the specific details of the GCS event that occurred:
    - data["bucket"] - the name of the bucket.
    - data["name"] - the file path within the bucket that was affected by the event.
    - data["metageneration"] - the metageneration number of the object, which increases every time the object's metadata is updated.
    - data["timeCreated"] - the timestamp when the object was created.
    - data["updated"] - the timestamp when the object was last updated.
  - Prints values from the bucket and file.
Testing the automatic Google Cloud Run Function
A Google Cloud Run Function has logs that we can use to understand the success or failure of the different stages of the function, as well as to see specific error messages (extremely useful for debugging!). I uploaded a dummy .xlsx file to the 'sales_example_11' bucket and looked at the logs to confirm that the Google Cloud Run Function was triggered and produced the print output I expected. It looked good. I also checked what happens if I upload this file to a different bucket - and nothing happened, as it should! Great :)
Changing the script of the Google Cloud Run Function
After this test, I changed the script to do what I intended it to do:
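The updated code itself was shown as a screenshot; the sketch below reconstructs it from the breakdown that follows (bucket 'sales_example_11', dataset 'sales', table 'orders'), so treat it as an approximation rather than the exact script:

```python
import functions_framework
from google.cloud import bigquery

bucket_name = "sales_example_11"
dataset_name = "sales"
table_name = "orders"


def append_to_bigquery_on_gcs_upload(file_name):
    # Process only CSV files; anything else is logged and skipped.
    if not file_name.endswith(".csv"):
        print(f"Skipping non-CSV file: {file_name}")
        return

    client = bigquery.Client()
    table_ref = client.dataset(dataset_name).table(table_name)

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,               # ignore the header row
        autodetect=True,                   # infer the schema from the file
        write_disposition="WRITE_APPEND",  # append to (or create) the 'orders' table
    )

    uri = f"gs://{bucket_name}/{file_name}"
    try:
        load_job = client.load_table_from_uri(uri, table_ref, job_config=job_config)
        load_job.result()  # wait for the load job to complete
        print(f"Loaded {load_job.output_rows} rows from {uri} into {dataset_name}.{table_name}")
    except Exception as e:
        print(f"Error loading {uri} into BigQuery: {e}")


@functions_framework.cloud_event
def hello_gcs(cloud_event):
    data = cloud_event.data
    print(f"Bucket: {data['bucket']}, file: {data['name']}, created: {data['timeCreated']}")
    append_to_bigquery_on_gcs_upload(data["name"])
```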
A breakdown of the script changes I made:
- Cloud Function trigger:
  - The @functions_framework.cloud_event decorator specifies that this function is triggered by events from Cloud Storage - specifically, when a new object (file) is created or an existing object is overwritten (finalized) in a Cloud Storage bucket.
- Function hello_gcs(cloud_event):
  - This is the entry point for the Cloud Function.
  - It takes a cloud_event argument, which contains information about the event that triggered the function.
  - It extracts relevant information from the cloud_event, such as the bucket name, file name, and event timestamps.
  - It then calls the append_to_bigquery_on_gcs_upload function to handle loading the data into BigQuery.
- Function append_to_bigquery_on_gcs_upload(file_name):
  - This function takes the name of the file that was uploaded to Cloud Storage.
  - File filtering - it first checks whether the file name ends with ".csv". If not, it logs a message and exits, so only CSV files are processed.
  - It creates a bigquery.Client object, which is used to interact with the BigQuery service.
  - It gets a reference to the BigQuery table where the data will be loaded using client.dataset(dataset_name).table(table_name).
  - It creates a bigquery.LoadJobConfig object to configure how the data should be loaded into BigQuery:
    - source_format=bigquery.SourceFormat.CSV specifies that the data is in CSV format.
    - skip_leading_rows=1 skips the first (header) row of the CSV file.
    - autodetect=True automatically detects the schema (column data types) from the CSV file.
    - write_disposition="WRITE_APPEND" appends the data to the existing BigQuery table.
  - It constructs the URI (Uniform Resource Identifier) of the CSV file in Cloud Storage.
  - It starts a BigQuery load job using client.load_table_from_uri(), which takes the URI of the data, a reference to the destination table, and the load job configuration.
  - load_job.result() waits for the load job to complete (instead of assuming a fixed timeframe); the function pauses here until the data is loaded into BigQuery.
  - Error handling - a try...except block handles potential errors during the BigQuery load process; if an error occurs, it prints an error message.
  - Success message - if the load job is successful, it gets the number of rows loaded and prints a success message.
In addition, the requirements.txt file in the same directory as the Cloud Run Function needed to change: it only contained 'functions-framework' and needed to include the BigQuery client library as well:
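The updated requirements.txt is roughly the following (pin versions as you prefer):

```
functions-framework==3.*
google-cloud-bigquery
```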
Testing My Google Cloud Run Function
After saving and redeploying my code, I tested it to see that everything runs well.
1. Added a dummy .csv file to a different bucket and looked at the Cloud Run Function logs to see whether it was triggered. It was not, which is good: it should execute only when .csv files are loaded into the specific 'sales_example_11' bucket.

2. Added a dummy .xlsx file to the 'sales_example_11' bucket and looked at the logs. The function was not triggered, which is good, as it should be triggered only by .csv file uploads.

3. Added the real batch1_sales.csv file to the 'sales_example_11' bucket (I kept only 9 records in this file and moved the other records to the batch2 file, as this was a test). It triggered the Cloud Run Function, created the 'orders' table in the 'sales' dataset in BigQuery, and added the first 9 rows to the table, as can be seen in the logs:

4. Checked in BigQuery that the new 'orders' table was actually created under the 'sales' dataset, and that the schema was auto-detected correctly:

5. Previewed the table to check that all the records are in the table with the correct values:

6. A second upload of a data file (37 records) to the 'sales_example_11' bucket again triggered the Cloud Run Function, and the logs indicated that all the records from the file were inserted into the 'orders' table. I then queried the 'orders' table in BigQuery to confirm that all the records were actually inserted:
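The query I used was shown as a screenshot; a simple row count along these lines does the check (the project ID is a placeholder; 9 + 37 = 46 rows expected at this point):

```python
# Quick sanity check from a notebook or Cloud Shell; "my-project" is a placeholder.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
query = "SELECT COUNT(*) AS row_count FROM `my-project.sales.orders`"
row_count = list(client.query(query).result())[0].row_count
print(row_count)  # expected 46 after the 9-record and 37-record batches
```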



IT WORKED!
Creating a Reporting View in BigQuery
I created a simple view with a few transformations that will later be pulled into the Looker visualization app. As this is not the focus of this project, I will share the logic of the view but will not discuss it further:
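The view itself was defined in the BigQuery console; the sketch below is only illustrative of this kind of view logic - the view name, the placeholder project ID, and the specific transformations shown are examples, not necessarily the exact SQL I used:

```python
# Illustrative only: a representative reporting view over the raw 'orders' table.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
view_sql = """
CREATE OR REPLACE VIEW `my-project.sales.orders_reporting` AS
SELECT
  InvoiceNo,
  StockCode,
  Description,
  Quantity,
  UnitPrice,
  Quantity * UnitPrice AS LineTotal,
  DATE(InvoiceDate) AS InvoiceDay,
  CustomerID,
  Country
FROM `my-project.sales.orders`
WHERE NOT STARTS_WITH(InvoiceNo, 'C')  -- drop cancellations (assumes InvoiceNo is a STRING)
  AND Quantity > 0
"""
client.query(view_sql).result()
```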
Creating a view in BigQuery takes longer than in other SQL-based applications: instead of writing one statement to create or update a view, there are multiple mouse clicks involved, which I like less, but it doesn't take too long to do. You can see the view created on the left-hand side, under the dataset (screenshot not included).
Note: When filling in the properties of the view, you need to specify the project it belongs to, so using the same view across multiple projects requires authorizing the view to enable cross-project data sharing, which is a good security measure.
After executing the view:

Creating a Looker Report
Now that I have created a view that transforms the data in the 'orders' table, I can pull it into a visualization tool. Looker Studio is free (Google also offers a much more capable paid version of Looker, but the free Looker Studio works great for my needs), and I am all about cost-effectiveness in this project.
Create a New Report and Select a Data Source
A good thing about using Google Cloud and Looker is that it prompts you for the IAM roles and other authorizations it needs for an action; working with AWS, you usually have to know which roles and authorizations you need for a project. This help is appreciated! When selecting BigQuery, it asks you to authorize Looker's access to BigQuery and establishes the connection very quickly. I then selected the project, dataset, and table/view.

Create visualizations
I then moved on to creating the visualizations themselves: KPIs, slicers for the user, and calculated fields used in this report. As this is not the focus of this project, not much work went into this part. Yes, I know it's ugly - but it is the pipeline that counts ;)


Viz properties:
I added some features to make this report more user-friendly.
- The report is updated automatically every 30 minutes, or by manual refresh. This might be too frequent for other uses and can be changed in the properties.
- The report can also be sent as an email, for example to the operations team or to the CEO every morning (see more instructions here).
Conclusions
This personal project served as an invaluable opportunity to cultivate a holistic understanding of the end-to-end data engineering pipeline, a perspective often fragmented within the scope of individual client engagements.
In this project I architected and implemented a common e-commerce requirement: constructing a complete pipeline to ingest raw sales data from multiple stores and dates, performing analysis in a view, and generating visualizations for operational insights and KPI tracking by the CEO and relevant teams.
​
I connected multiple tools and capabilities:
- I created a bucket in Google Cloud Storage for store owners to upload their sales data. To simulate this, I split e-commerce data downloaded from Kaggle into multiple files via a Python script.
- I used a Google Cloud Run Function written in Python to create a raw data table in Google BigQuery and insert new records into it, triggered by the finalization of a new file upload to the Google Cloud Storage bucket.
- Subsequently, I created a curated view in Google BigQuery containing the cleaned and transformed data.
- The curated view is automatically pulled, refreshed every 30 minutes, and visualized in Google Looker.
By creating this full-stack pipeline, I have significantly increased my confidence in orchestrating comprehensive data solutions. This hands-on experience has not only solidified my existing cloud data engineering skills in areas like serverless functions, data warehousing with BigQuery, and data visualization, but has also provided crucial insights into the seamless integration of these components.
Moving forward, this practical knowledge will undoubtedly enhance my ability to rapidly assess client needs, design robust architectures, and ultimately deliver even more impactful and self-sustaining data products, further fueling my passion for empowering teams and driving business growth through the strategic application of data.​