IoT and big data have a two-way relationship. IoT is a main producer of big data, and as such it is an important target for big data analytics, which in turn improve the processes and services of IoT. However, the two are not the same.
Big data is conventionally classified by the 3V's: volume, velocity, and variety. Techniques built for big data analytics on that basis are not sufficient to analyze the kind of data generated by IoT devices. For instance, autonomous cars need to make fast decisions on driving actions such as lane or speed changes. These decisions should be supported by fast analytics over data streaming from multiple sources (e.g., cameras, radars, left/right signals, traffic lights). This extends the classification of IoT big data to 6V's.
Despite recent advances in deep learning (DL) for big data, significant challenges must still be addressed before the technology matures. Each of the six characteristics of IoT big data poses a challenge for DL techniques. One common denominator is the lack of available IoT big data datasets.
Deep learning methods have shown promise, with state-of-the-art results in several areas such as signal processing, natural language processing, and image recognition, and their use in IoT verticals is growing as well. Datasets play a major role in improving IoT analytics: more real-world data generally improves the accuracy of DL algorithms. However, the lack of large real-world datasets for IoT applications is a major hurdle for incorporating DL models in IoT. This shortage acts as a barrier to the deployment and acceptance of DL-based IoT analytics, since such systems must be empirically validated and shown to perform well in the real world. The lack of availability is mainly because:
While much ground remains to be covered in making IoT datasets available, the following commonly used datasets are suitable for building deep learning applications in IoT.
Dataset Name | Domain | Provider | Notes | Address/Link |
CGIAR dataset | Agriculture, Climate | CCAFS | High-resolution climate datasets for a variety of fields including agriculture | http://www.ccafs-climate.org/ |
Educational Process Mining | Education | University of Genova | Recordings of 115 subjects’ activities through a logging application while learning with an educational simulator | http://archive.ics.uci.edu/ml/datasets/Educational+Process+Mining+%28EPM%29%3A+A+Learning+Analytics+Data+Set |
Commercial Building Energy Dataset | Energy, Smart Building | IIITD | Energy-related dataset from a commercial building where data is sampled more than once a minute. | http://combed.github.io/ |
Individual household electric power consumption | Energy, Smart home | EDF R&D, Clamart, France | One-minute sampling rate over a period of almost 4 years | http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption |
AMPds dataset | Energy, Smart home | S. Makonin | AMPds contains electricity, water, and natural gas measurements at one-minute intervals for 2 years of monitoring | http://ampds.org/ |
UK Domestic Appliance-Level Electricity | Energy, Smart Home | Kelly and Knottenbelt | Power demand from five houses. In each house, both the whole-house mains power demand and the power demand from individual appliances are recorded. | http://www.doc.ic.ac.uk/~dk3810/data/ |
PhysioBank databases | Healthcare | PhysioNet | Archive of over 80 physiological datasets. | https://physionet.org/physiobank/database/ |
Saarbruecken Voice Database | Healthcare | Universität des Saarlandes | A collection of voice recordings from more than 2000 persons for pathological voice detection. | http://www.stimmdatebank.coli.uni-saarland.de/help_en.php4 |
T-LESS | Industry | CMP at Czech Technical University | An RGB-D dataset and evaluation methodology for detection and 6D pose estimation of texture-less objects | http://cmp.felk.cvut.cz/t-less/ |
CityPulse Dataset Collection | Smart City | CityPulse EU FP7 project | Road Traffic Data, Pollution Data, Weather, Parking | http://iot.ee.surrey.ac.uk:8080/datasets.html |
Open Data Institute - node Trento | Smart City | Telecom Italia | Weather, Air quality, Electricity, Telecommunication | http://theodi.fbk.eu/openbigdata/ |
Malaga datasets | Smart City | City of Malaga | A broad range of categories such as energy, ITS, weather, Industry, Sport, etc. | http://datosabiertos.malaga.eu/dataset |
Gas sensors for home activity monitoring | Smart home | Univ. of California San Diego | Recordings of 8 gas sensors under three conditions including background, wine and banana presentations. | http://archive.ics.uci.edu/ml/datasets/Gas+sensors+for+home+activity+monitoring |
CASAS datasets for activities of daily living | Smart home | Washington State University | Several public datasets related to Activities of Daily Living (ADL) performance in a two-story home, an apartment, and an office setting. | http://ailab.wsu.edu/casas/datasets.html |
ARAS Human Activity Dataset | Smart home | Bogazici University | Human activity recognition datasets collected from two real houses with multiple residents during two months. | https://www.cmpe.boun.edu.tr/aras/ |
MERLSense Data | Smart home, building | Mitsubishi Electric Research Labs | Motion sensor data of residual traces from a network of over 200 sensors for two years, containing over 50 million records. | http://www.merl.com/wmd |
SportVU | Sport | Stats LLC | Video of basketball and soccer games captured from 6 cameras. | http://go.stats.com/sportvu |
RealDisp | Sport | O. Banos | Includes a wide range of physical activities (warm up, cool down and fitness exercises). | http://orestibanos.com/datasets.htm |
Taxi Service Trajectory | Transportation | Prediction Challenge, ECML PKDD 2015 | Trajectories performed by all the 442 taxis running in the city of Porto, in Portugal. | http://www.geolink.pt/ecmlpkdd2015-challenge/dataset.html |
GeoLife GPS Trajectories | Transportation | Microsoft | GPS trajectories, each represented by a sequence of time-stamped points | https://www.microsoft.com/en-us/download/details.aspx?id=52367 |
T-Drive trajectory data | Transportation | Microsoft | Contains one-week trajectories of 10,357 taxis | https://www.microsoft.com/en-us/research/publication/t-drive-trajectory-data-sample/ |
Chicago Bus Traces data | Transportation | M. Doering | Bus traces from the Chicago Transport Authority for 18 days with a rate between 20 and 40 seconds. | http://www.ibr.cs.tu-bs.de/users/mdoering/bustraces/ |
Uber trip data | Transportation | FiveThirtyEight | About 20 million Uber pickups in New York City during 12 months. | https://github.com/fivethirtyeight/uber-tlc-foil-response |
Traffic Sign Recognition | Transportation | K. Lim | Three datasets: Korean daytime, Korean nighttime, and German daytime traffic signs based on Vienna traffic rules. | https://figshare.com/articles/Traffic_Sign_Recognition_Testsets/4597795 |
DDD17 | Transportation | J. Binas | End-To-End DAVIS Driving Dataset. | http://sensors.ini.uzh.ch/databases.html |
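
As a rough illustration of how one of the time-series datasets listed above might be prepared for a DL model, the sketch below parses a few rows and slices them into fixed-width windows. It assumes the semicolon-separated layout used by the Individual household electric power consumption file (`Date` and `Time` columns followed by measurements, with `?` marking missing values); check the dataset's own documentation before relying on that. The sample rows, the helper names `read_power` and `make_windows`, and the window width are all hypothetical and chosen only for illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample rows in the assumed layout; the values are made up.
SAMPLE = """Date;Time;Global_active_power;Voltage
01/01/2020;00:00:00;1.0;230.0
01/01/2020;00:01:00;2.0;231.0
01/01/2020;00:02:00;?;?
01/01/2020;00:03:00;3.0;232.0
01/01/2020;00:04:00;4.0;233.0
"""


def read_power(text, column="Global_active_power"):
    """Return (timestamp, value) pairs, skipping rows whose value is missing."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text), delimiter=";"):
        raw = rec[column]
        if raw in ("", "?"):
            continue  # '?' is the assumed missing-value marker
        stamp = datetime.strptime(
            rec["Date"] + " " + rec["Time"], "%d/%m/%Y %H:%M:%S"
        )
        rows.append((stamp, float(raw)))
    return rows


def make_windows(values, width, horizon=1):
    """Slice a series into (input window, target) pairs for a sequence model."""
    pairs = []
    for start in range(len(values) - width - horizon + 1):
        window = values[start:start + width]
        target = values[start + width + horizon - 1]
        pairs.append((window, target))
    return pairs


readings = read_power(SAMPLE)
series = [value for _, value in readings]
windows = make_windows(series, width=2)
print(len(readings), "readings,", len(windows), "training windows")
```

Note that this sketch simply drops missing rows, so a window can span a gap in time. A real pipeline would likely interpolate or split the series at gaps instead, and would read the full file from disk rather than from an in-memory string.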