Simulation of a real-time feed using historical data
Before you run this example, download some Yellow Cab trip data CSV files from the aforementioned website at nyc.gov. At the time of writing, this example is compatible with the data format used in the CSV files between 2015-01 and 2016-06. Let's say you have chosen 2016-01 and saved the data as yellow_tripdata_2016-01.csv.
We want to simulate a real-time feed. The trip data files, however, are wildly unordered, whereas a real-time feed usually contains only a small amount of out-of-order data. To get closer to that, we first sort the data by timestamp and then reintroduce some random deviation.
So, let's sort the data by timestamp:
bash> sort -t, -k2 yellow_tripdata_2016-01.csv > yellow_tripdata_sorted_2016-01.csv
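For readers who want to see the sorting step spelled out, here is a minimal Python sketch of the same idea. It is an illustration, not part of the original example: the function name sort_by_timestamp and its parameters are hypothetical, and it assumes that the first line of the file is a header, that the timestamp is the second column, and that timestamps are written so that text order matches time order (for example 2016-01-01 00:05:00). Unlike the sort command above, it keeps the header as the first line.

```python
import csv
import io


def sort_by_timestamp(src, dst, ts_column=1, has_header=True):
    """Copy CSV rows from src to dst, ordered by the timestamp column.

    Assumption: timestamps compare correctly as plain strings.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    rows = [row for row in reader if row]  # skip blank lines
    if has_header and rows:
        writer.writerow(rows[0])  # keep the header on top
        rows = rows[1:]
    rows.sort(key=lambda row: row[ts_column])  # stable sort on one column
    writer.writerows(rows)


# Tiny in-memory demonstration with made-up rows.
sample = io.StringIO(
    "VendorID,pickup\n"
    "2,2016-01-01 00:05:00\n"
    "1,2016-01-01 00:01:00\n"
)
out = io.StringIO()
sort_by_timestamp(sample, out)
print(out.getvalue())
```

This sketch loads every row into memory, so for a full month of trip data the sort command is likely the more practical choice.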
Next, add some random deviation to the sorted data:
bash> cat yellow_tripdata_sorted_2016-01.csv | perl -e '@lines = (); while (<>) { if (@lines && rand(10) < 1) { print shift @lines; } if (rand(20) < 1) { push @lines...
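The idea behind this step can be sketched in Python as follows. This is not a port of the Perl command, which is only partly shown here. The visible part suggests a 1-in-10 chance of releasing a previously held line and a 1-in-20 chance of setting the current line aside; everything else in the sketch is an assumption, namely that a line that is not set aside is written immediately and that any lines still held at the end are flushed. The names add_jitter, hold_prob, release_prob, and seed are hypothetical.

```python
import random


def add_jitter(lines, hold_prob=1 / 20, release_prob=1 / 10, seed=None):
    """Yield lines mostly in order, delaying a random few of them."""
    rng = random.Random(seed)
    held = []  # lines set aside to be emitted later
    for line in lines:
        # Occasionally release the oldest held line.
        if held and rng.random() < release_prob:
            yield held.pop(0)
        # Occasionally set the current line aside, otherwise pass it on.
        if rng.random() < hold_prob:
            held.append(line)
        else:
            yield line
    # Assumption: flush whatever is still held so that no line is lost.
    yield from held


# Small demonstration on made-up rows instead of the real CSV file.
ordered = [f"row {i}" for i in range(1000)]
shuffled = list(add_jitter(ordered, seed=42))
print(len(shuffled) == len(ordered))
print(sorted(shuffled) == sorted(ordered))
```

Because a held line is released after roughly ten further lines on average, each delayed record ends up a short distance from its sorted position, which resembles the mild disorder of a live feed. To apply the sketch to a file, pass the open file object as lines and write the yielded lines to the output file.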