
You can see the entire code here.
Comparing Station Pairs
How can we identify which stations have similar ridership patterns? My goal was to compare more than 1,000 station pairs, track their ridership patterns over time, and find pairs whose patterns are similar enough that a forecast for one reasonably predicts the other. I did this with a time series cluster analysis covering seven years of data, using Python's pandas, NumPy, and scikit-learn.
Prepping the Data

I originally had a dataset with cluster inputs for every representative day over seven years. I needed to average those inputs across all seven years, grouped by station pair, season, and day of week, so that each station pair had just 42 cluster inputs. Here are some code snippets of me preparing the data:
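A minimal sketch of this averaging step on synthetic data might look like the following. The column names (`station_pair`, `date`, `cluster_input`) and the four-season mapping are assumptions for illustration, not the names or definitions in the original dataset. The 42 inputs per pair mentioned above imply a season grouping that isn't shown here, so this stand-in yields 28 rows per pair instead.

```python
import numpy as np
import pandas as pd

# Synthetic daily data for three hypothetical station pairs over seven years.
rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-01", "2021-12-31", freq="D")
pairs = ["A-B", "A-C", "B-C"]
daily = pd.DataFrame({
    "station_pair": np.repeat(pairs, len(dates)),
    "date": np.tile(dates.to_numpy(), len(pairs)),
})
daily["cluster_input"] = rng.random(len(daily))

# Derive the grouping keys. The season mapping is an assumption:
# Dec-Feb = winter, Mar-May = spring, Jun-Aug = summer, Sep-Nov = fall.
season_names = {0: "winter", 1: "spring", 2: "summer", 3: "fall"}
daily["season"] = ((daily["date"].dt.month % 12) // 3).map(season_names)
daily["day_of_week"] = daily["date"].dt.day_name()

# Average the cluster input across all years for each
# (station pair, season, day of week) combination.
avg_inputs = (
    daily.groupby(["station_pair", "season", "day_of_week"], as_index=False)["cluster_input"]
    .mean()
)
print(avg_inputs.head())
```

Each station pair ends up with one averaged value per season and day-of-week combination, which is the shape the clustering step needs.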



Once I had the averaged cluster inputs, I built a pivot table with one row per station pair to feed into the k-means cluster analysis.
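As a rough illustration, the pivot could be built with pandas `pivot_table`, giving one row per station pair and one column per season and day-of-week combination. The data and column names below are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical averaged inputs in long format: one row per
# (station pair, season, day of week) combination.
pairs = ["A-B", "A-C", "B-C"]
seasons = ["winter", "spring", "summer", "fall"]
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
index = pd.MultiIndex.from_product(
    [pairs, seasons, days], names=["station_pair", "season", "day_of_week"]
)
rng = np.random.default_rng(1)
avg_inputs = pd.DataFrame({"cluster_input": rng.random(len(index))}, index=index).reset_index()

# Wide format: rows are station pairs, columns are (season, day_of_week).
features = avg_inputs.pivot_table(
    index="station_pair",
    columns=["season", "day_of_week"],
    values="cluster_input",
)

# The plain NumPy matrix is what a clustering model would consume.
X = features.to_numpy()
print(features.shape)
```

Every row of `X` is then the full seasonal and weekly profile of a single station pair, so distances between rows compare whole patterns rather than single days.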

Performing the Analysis
I then ran a series of weighted k-means cluster analyses with Python's scikit-learn, varying the number of clusters from 4 to 20.
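A sketch of such a sweep is shown below. scikit-learn's `KMeans.fit` accepts a `sample_weight` argument, which is one way to run a weighted analysis. What the weights represent here (for example, total ridership per station pair) is an assumption, as are the synthetic feature matrix and the `n_init` and `random_state` settings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-ins: 200 station pairs with 42 features each,
# plus a hypothetical weight per station pair.
rng = np.random.default_rng(42)
X = rng.random((200, 42))
weights = rng.integers(1, 100, size=200).astype(float)

# Fit one weighted k-means model per cluster count and record its inertia.
inertias = {}
for k in range(4, 21):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X, sample_weight=weights)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k:2d}  inertia={inertia:.1f}")
```

Fixing `random_state` keeps the runs reproducible, which matters when comparing inertia across values of k.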

I was looking for the lowest inertia (the within-cluster sum of squared distances to each centroid) with the fewest clusters. As the graph below shows, the inertia curve begins to level out at around 18 clusters, so that is the number of clusters I chose.
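An elbow plot of this kind can be sketched as follows. The data comes from scikit-learn's `make_blobs` and is purely synthetic, so the elbow it shows has nothing to do with the 18 clusters found in the ridership data.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with a known number of groups (illustration only).
X, _ = make_blobs(n_samples=300, centers=6, n_features=5, random_state=0)

ks = list(range(2, 11))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

# How much inertia each additional cluster removes; the elbow is
# where these gains become small.
drops = -np.diff(inertias)

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(ks, inertias, marker="o")
ax.set_xlabel("Number of clusters (k)")
ax.set_ylabel("Inertia")
ax.set_title("Elbow plot")
fig.tight_layout()
# Call plt.show() or fig.savefig("elbow.png") to view the figure.
```

Reading the elbow is a judgment call: printing `drops` alongside the plot makes it easier to see where adding another cluster stops paying off.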

Results
After conducting the analysis, I used Matplotlib to visualize each cluster, showing its centroid and how the station pairs in it relate to one another. Here's an example of the code for one of the clusters:
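A minimal sketch of such a plot, on synthetic data, could look like this. Each member station pair is drawn as a light gray line and the centroid as a thick black line. The feature matrix, cluster count, and styling are assumptions, not the original figure code.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in: 60 station pairs with 42 cluster inputs each.
rng = np.random.default_rng(7)
X = rng.random((60, 42))
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Pick one cluster and pull out its members and centroid.
cluster_id = 0
members = X[model.labels_ == cluster_id]
centroid = model.cluster_centers_[cluster_id]

fig, ax = plt.subplots(figsize=(10, 4))
# Transposing makes each column one station pair, so each pair gets its own line.
ax.plot(members.T, color="lightgray", linewidth=0.8)
ax.plot(centroid, color="black", linewidth=2.5, label="Centroid")
ax.set_xlabel("Cluster input (season and day-of-week slot)")
ax.set_ylabel("Value")
ax.set_title(f"Cluster {cluster_id}: {len(members)} station pairs")
ax.legend()
fig.tight_layout()
# Call plt.show() or fig.savefig("cluster_0.png") to view the figure.
```

Looping `cluster_id` over `range(model.n_clusters)` would produce one figure per cluster.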

The clusters clearly capture shared patterns between stations over time, so we can use them to create forecasts for an entire cluster rather than for every individual station. Here are some of the resulting clusters, with the centroids highlighted in black:


