Daniel Johnson: an exercise in SQL

Thank you for taking time to explore my project! I have created a minimal Linux/MariaDB simulator at the end of this brief demonstration where you can input basic commands to manipulate the data found in the tables included here.

Examples of SQL capabilities:

Using Google Cloud’s BigQuery, various datasets were used to perform various cleanup/transformations

Case 1:
Description:
Compared to New York Citibike data, San Francisco’s data was segmented into three separate tables. I used JOIN to combine the San Francisco bike share datasets to create a clean, analysis-ready table comparable to the New York Citibike dataset.

  • Data Cleaning: Removed incomplete or invalid trip records (null values in trip duration, start/end times, or station IDs).
  • Schema Standardization: Renamed columns and cast types to match the New York dataset, enabling cross-city comparisons.
  • Data Enrichment: Joined trip records with station metadata to add station names, latitude/longitude, and docking capacity for both start and end stations.
  • QA and Validation: Ensured all trips were linked to valid stations and checked for missing or inconsistent data.
  • Output: Created a single, optimized table in BigQuery containing cleaned and enriched trip records, ready for analytics, visualization, or further data exploration.

CREATE OR REPLACE TABLE `sql-portfolio-475719.citibike.san_francisco_cleaned` AS
SELECT
  t.duration_sec AS tripduration,
  t.start_date AS starttime,
  t.end_date AS stoptime,
  
  t.start_station_id,
  s_start.name AS start_station_name,
  s_start.latitude AS start_station_latitude,
  s_start.longitude AS start_station_longitude,
  
  t.end_station_id,
  s_end.name AS end_station_name,
  s_end.latitude AS end_station_latitude,
  s_end.longitude AS end_station_longitude,
  
  t.bike_number AS bikeid,
  t.subscriber_type AS usertype,
  NULL AS birth_year,   -- SF dataset does not have birth year
  NULL AS gender,       -- SF dataset does not have gender
  NULL AS customer_plan -- SF dataset does not have customer plan

FROM
  `bigquery-public-data.san_francisco.bikeshare_trips` AS t

LEFT JOIN
  `bigquery-public-data.san_francisco.bikeshare_stations` AS s_start
ON
  t.start_station_id = s_start.station_id

LEFT JOIN
  `bigquery-public-data.san_francisco.bikeshare_stations` AS s_end
ON
  t.end_station_id = s_end.station_id

WHERE
  t.duration_sec IS NOT NULL
  AND t.start_date IS NOT NULL
  AND t.end_date IS NOT NULL
  AND t.start_station_id IS NOT NULL
  AND t.end_station_id IS NOT NULL
  AND t.subscriber_type IS NOT NULL;

The resulting table:

san_francisco_trips_filtered

From an analytical perspective, I was curious to see how often bikes were returned to the same station versus a different station, which can be easily determined by utilizing CASE WHEN, COUNT, and GROUP BY functions:

CREATE OR REPLACE TABLE `sql-portfolio-475719.citibike.san_francisco_trips_summary` AS
SELECT
  CASE 
    WHEN start_station_id = end_station_id THEN 'Same Station'
    ELSE 'Different Station'
  END AS trip_type,
  COUNT(*) AS trip_count
FROM
  `sql-portfolio-475719.citibike.san_francisco_cleaned`
WHERE
  start_station_id IS NOT NULL
  AND end_station_id IS NOT NULL
GROUP BY
  trip_type;

resulting in the following table:

bikeshare_nums

I then filtered the data to only display the trips that started and ended at the same station simply by using WHERE and =. I also filtered out some redundant and irrelevant data through the SELECT function and used CONCAT to consolidate the start and end station columns:

CREATE OR REPLACE TABLE `sql-portfolio-475719.citibike.san_francisco_trips_filtered` AS
SELECT
 
  CONCAT(CAST(start_station_id AS STRING), ', ', CAST(end_station_id AS STRING)) AS start_and_end,
  bikeid,
  usertype
FROM
  `sql-portfolio-475719.citibike.san_francisco_cleaned`
WHERE
  start_station_id IS NOT NULL
  AND end_station_id IS NOT NULL
  AND start_station_id = end_station_id;

The resulting table:

san_francisco_trips

What if you wanted to use MariaDB instead?

In this project, I used SELECT to choose only the necessary columns, focusing on start and end station IDs, bike IDs, and user types, while in MariaDB the syntax would remain largely the same but fully-qualified BigQuery table references would be replaced with standard database.table notation.

I applied WHERE filters to exclude NULL values and to separate trips that started and ended at the same station versus different stations; MariaDB handles IS NOT NULL identically, though indexing may be needed for performance on large tables.

I combined numeric columns into a single text field using CONCAT and converted numbers to strings with CAST, noting that in MariaDB numbers are cast to CHAR or VARCHAR instead of BigQuery’s STRING.

I categorized trips using CASE WHEN … THEN … ELSE … END, which is compatible in MariaDB with the same logic, though BigQuery allows more advanced nested operations.

To generate summary metrics, I used COUNT(*) with GROUP BY, and while MariaDB supports this syntax, large datasets may require indexing or partitioning to optimize aggregation.

The transformed data was saved using CREATE OR REPLACE TABLE, whereas in MariaDB this would require DROP TABLE IF EXISTS followed by CREATE TABLE because OR REPLACE is not natively supported.

Finally, column names were aliased with AS, which functions identically in both systems. Overall, while the core SQL logic remains consistent, MariaDB requires attention to data type casting and table replacement mechanics, and lacks BigQuery’s fully-qualified dataset references and automatic distributed scaling.

Linux/MariaDB simulator

*Please note, this is a javascript-based simulator and is not yet fully comprehensive as I am still learning how to translate SQL functionality to JS. Some functions may not work properly.

$