A Graph Neural Network (GNN) relates 14 parameters, covering time, location, and weather details such as humidity, rainfall, dew, and snowfall, to predict how accident prone a given road is. A dataset relating vehicles travelling on given roads, the alerts raised by each vehicle, and the corresponding geocoordinates, timestamps, and weather was engineered to define a safety index for each road, which then served as the labels for the GNN.
How did this project come to be?
This is one of the most novel projects I have worked on. It was part of the Intel AI Tech Challenge at PES University, which I took part in with two other friends. The challenge was unique and lasted six months. The topic we picked had to do with data engineering, analysis, and ML.
We were provided a dataset of Kerala inter-state buses and their routes, along with geocoordinates, timestamps, and alerts. The alerts were of various types and were generated by devices placed in all the buses: there was an alert when a pedestrian crossed the road in front of a bus, an alert for monitoring headway, and so on. The task was to use this dataset and come up with some unique or useful application!
Initial brainstorming with the given dataset lasted around a month. We first tried to do something related to fuel efficiency across the various routes, but that was infeasible because we couldn't get accurate data for it. We gradually moved beyond the dataset, having realized there was not much that could be done with the existing features (or maybe there is and it just didn't strike us!). We used the latitude and longitude coordinates to work out the altitude variance along each of the routes using Google Earth. This gave a little more insight into the data distribution, given that Kerala has a lot of uneven terrain and spans a wide range of altitudes. We then wanted to extend this analysis to see how roads differed (beyond road-specific properties like potholes, road width, etc.) based on these geocoordinates (including altitude). We figured that insight into road conditions, such as the amount of rainfall and the visibility it allows, would prove useful, and thus started on a quest to extend the dataset beyond the feature space provided. We used the OpenStreetMap API with the geocoordinates to pull weather data, extending our dataset by 24 weather parameters, ranging from the maximum recorded temperature on a given day, to rainfall, to humidity, and so on.
All Data Science is about building the confidence that if you stare at a potentially useless dataset for long enough, you can make something useful out of it
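For a sense of what this enrichment step looked like, here is a minimal sketch of querying a weather endpoint per geocoordinate and date. The endpoint URL and field names below are placeholders, not the exact API or schema we called; the sketch only illustrates the enrichment pattern.

```python
import requests

# Minimal sketch of the enrichment step. The endpoint URL and field names are
# placeholders (assumptions), not the exact API or schema we used.
def fetch_weather(lat, lon, date, base_url="https://weather.example.com/v1/history"):
    resp = requests.get(base_url, params={"lat": lat, "lon": lon, "date": date})
    resp.raise_for_status()
    record = resp.json()
    # keep the weather parameters of interest (24 of them in our case)
    keys = ["temp_max", "temp_min", "rainfall", "humidity", "dew_point", "snowfall"]
    return {k: record.get(k) for k in keys}

# usage: enrich each (lat, lon, date) row of the alert dataset
# weather = fetch_weather(9.9312, 76.2673, "2019-08-14")
```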
We then found a direct use case for this new dataset: predicting how safe a road is based on the current weather. The idea was that it would be like Google Maps, except that instead of suggesting alternate routes based on traffic, we could suggest alternate SAFER routes given the current weather conditions!
What next?
Once we had the dataset and a rough plan of what we wanted to do, we dove deep into exploratory data analysis. We constructed graphs based on routes and buses and compared them, to manually spot trends in alerts and see which roads were the more notorious ones. We also normalized by road length and number of alerts, to get a more objective sense of relative road safety. At this point we realized it was important to define something called a safety index, because there are different alert types and we needed a single parameter relating all of them. We defined the Safety Index based on the number of alerts within a 1 km radius on each of the roads in our dataset, and set it to values between 1 and 5, 1 being the safest and 5 being the most accident-prone.
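As a rough illustration of the binning, here is a small sketch that maps normalized alert concentrations to a 1-5 index. The quantile-based bin edges are an assumption made for illustration; the exact thresholds came from our normalized alert distributions.

```python
import numpy as np

def safety_index(alerts_per_km: np.ndarray, n_levels: int = 5) -> np.ndarray:
    """Map normalized alert concentrations (alerts per km of road) to a 1-5
    Safety Index, 1 being safest and 5 most accident prone.
    Quantile bin edges are used here purely for illustration."""
    edges = np.quantile(alerts_per_km, np.linspace(0, 1, n_levels + 1))
    return np.digitize(alerts_per_km, edges[1:-1]) + 1  # values in 1..5
```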
This index would eventually serve as our labels for any ML approach we took on later. We also realized that 24 weather parameters was probably overkill, and decided to pick out only the ones that mattered. We performed a PCA to validate this, and zeroed in on a total of 14 parameters covering time, weather, and location (the TWL parameters). Completing a detailed analysis of the dataset, and subsequently formulating the problem statement formally, took nearly 3 months.
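For reference, the variance check behind this feature selection looked roughly like the sketch below; the 0.95 variance target and the idea of ranking raw parameters by their component loadings are illustrative assumptions, not the exact recipe we followed.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def rank_parameters(weather_matrix: np.ndarray, variance_target: float = 0.95):
    """weather_matrix: [n_events, 24] array of the raw weather parameters.
    Returns how many components cover the variance target, plus the raw
    parameters ranked by their loadings on those components (illustrative)."""
    scaled = StandardScaler().fit_transform(weather_matrix)
    pca = PCA().fit(scaled)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    n_components = int(np.searchsorted(cumulative, variance_target)) + 1
    loadings = np.abs(pca.components_[:n_components]).sum(axis=0)
    return n_components, np.argsort(loadings)[::-1]
```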
Let's talk ML
We started off the learning process by trying out various neural network approaches. We also gave traditional ML a chance with logistic regression, where the safety index label was a binary value rather than an index from 1 to 5; the binary value indicated whether or not the road was accident prone. Feed-forward neural networks of varying complexity, and also LSTMs (ordered by event time), were tried on this dataset. However, none of these approaches yielded good results in terms of accuracy and loss when predicting whether a road was accident prone (and how much so) from the 14 parameters.
It turns out that these models are too simplistic to capture the relations between all these different feature spaces.
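For context, one of the baselines we tried was a plain feed-forward classifier over the 14 TWL features; the layer sizes in this sketch are illustrative rather than the exact configuration we trained.

```python
import torch.nn as nn

# Illustrative feed-forward baseline over the 14 TWL features.
# Layer sizes are assumptions; we tried several depths and widths.
ffn_baseline = nn.Sequential(
    nn.Linear(14, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 5),   # 5 safety-index classes (or 2 for the binary variant)
)
criterion = nn.CrossEntropyLoss()
```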
Graph Neural Networks
This is when we came across GNNs, immersed ourselves in studying the architecture and its possibilities, and familiarized ourselves with PyTorch. We realized that a GNN-based approach would prove more fruitful, as it is uniquely suited to this application thanks to its ability to capture the complexities of the inherent nonlinear interlinking in a vast feature space. We then employed a GNN that models the TWL feature vectors as individual nodes interlinked via the edges of a graph. This model was verified to perform better than logistic regression, simple feed-forward neural networks, and even Long Short Term Memory (LSTM) neural networks!
Our Architecture
The GNN model we used follows a message passing procedure, from layer to layer, while keeping the structure of the graph (in terms of input and output shape) the same. The unidirectional approach used here follows a batch processing scheme, where each batch consists of 20 graphs. Each graph pertains to a particular bus, traveling on a particular road, on a particular day. The individual nodes of this graph are the individual alert events that occurred for that bus, on that road, on that day. Each event, in turn, comprises the 14-dimensional feature space of TWL parameters. The ordering of events across nodes is handled by the edges, which are time-series dependent and thus directional in nature, the natural order running from the oldest event of the day to the latest. Finally, a label derived from the collision concentrations is assigned to each graph, denoting its accident proneness on a scale of 1 to 5.
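In PyTorch Geometric terms, building one such graph per session looks roughly like this sketch (the helper name and the assumption that events arrive pre-sorted are mine; the structure of nodes, directed chronological edges, and a graph-level label follows what is described above).

```python
import torch
from torch_geometric.data import Data

def session_to_graph(event_features, safety_index):
    """Build one graph for a (bus, road, day) session.

    event_features: list of 14-dim TWL vectors, sorted oldest -> latest.
    safety_index:   label in 1..5 (shifted to 0..4 for cross-entropy).
    """
    x = torch.tensor(event_features, dtype=torch.float)      # [num_events, 14]
    n = x.size(0)
    # directed edges follow the time order: event i -> event i + 1
    edge_index = torch.stack([torch.arange(n - 1), torch.arange(1, n)], dim=0)
    y = torch.tensor([safety_index - 1], dtype=torch.long)   # graph-level label
    return Data(x=x, edge_index=edge_index, y=y)
```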
The architecture is novel in that it uses a sequence of trainable graphs, each of which is attributed a safety index. The identity, or Session ID, is a string constructed from the device ID of the bus, the name of the road it traveled on, and the date (in standard format). Session IDs are thus unique to a bus traveling on a given road on a given date.
An Event is then defined using a vector squashing function. The squashing function takes the individual features selected as model inputs by the Principal Component Analysis, a total of 14 values, stacks them (logically) one above the other, and squashes them into a single 14-dimensional vector. This is treated as the Event Vector's feature space: each event is a 14-dimensional vector of the respective TWL parameters at a given time instant.
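A data-engineering sketch of these two steps (Session IDs plus event vectors) is below; the column names (device_id, road_name, date, timestamp) are assumptions about the raw table's schema, made purely for illustration.

```python
import pandas as pd

def build_sessions(df: pd.DataFrame, twl_columns: list) -> pd.DataFrame:
    """Attach Session IDs and squash each alert record into a 14-dim event vector.
    Column names here are assumed, not the exact schema of our dataset."""
    df = df.copy()
    df["session_id"] = (
        df["device_id"].astype(str) + "_"
        + df["road_name"] + "_"
        + df["date"].astype(str)
    )
    # time-order events within each session, then squash the 14 TWL columns
    # into a single event vector per record
    df = df.sort_values(["session_id", "timestamp"])
    df["event_vector"] = df[twl_columns].values.tolist()
    return df
```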
The Session IDs are then made to correspond to a time-series-ordered set of Events (those that happened on that given day). Directed edges following the time order of events are then used to fit this structure onto a graph.
Finally, a Label (binary or 1-5, depending on the model) is attributed to each Session ID, denoting the Accident Proneness or the Safety Index of the road respectively. This is analogous to the set of output labels in the training dataset of any traditional neural network. This entire construction is used to train a GNN to predict the accident proneness and safety index of a road from the 14 TWL features.
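To make the architecture concrete, here is a minimal PyTorch Geometric sketch of a graph-level classifier in this spirit; the specific layer type (GCNConv), hidden size, pooling choice, and training hyperparameters are assumptions for illustration rather than our exact model.

```python
import torch
import torch.nn.functional as F
from torch_geometric.loader import DataLoader
from torch_geometric.nn import GCNConv, global_mean_pool

class SafetyGNN(torch.nn.Module):
    """Message passing over event nodes, pooled to one prediction per session."""
    def __init__(self, in_dim=14, hidden=64, num_classes=5):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.lin = torch.nn.Linear(hidden, num_classes)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))   # message passing, layer 1
        x = F.relu(self.conv2(x, edge_index))   # message passing, layer 2
        x = global_mean_pool(x, batch)          # one embedding per graph (session)
        return self.lin(x)                      # logits over the safety-index classes

# usage sketch, with session_graphs built via session_to_graph(...):
# loader = DataLoader(session_graphs, batch_size=20, shuffle=True)  # 20 graphs per batch
# model = SafetyGNN()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# for batch in loader:
#     loss = F.cross_entropy(model(batch.x, batch.edge_index, batch.batch), batch.y)
```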
Conclusion
The results achieved through this GNN architecture, using the TWL input feature space, proved better than those of the other predictive models, reaching a peak accuracy of 65%. The prediction of road accidents carries a high degree of uncertainty, since human error accounts for a major fraction of their causes; what we predicted is the accident proneness of a road due to environmental and other non-human dynamic factors. The concentrations of collision warnings, normalized and developed into a "Safety Index", were used as an indication of a location's vulnerability to road accidents. The influence of features associated with time, weather, and location was quantified by the effect they had on the Safety Index and Accident Proneness. Multiple machine learning models were tested on the data to predict the Safety Index of a location based on its TWL characteristics, and the Graph Neural Network approach achieved the best results, with an accuracy of 65% on unseen data. Not achieving an even higher accuracy can be attributed to the randomness associated with human error in many accidents. Future models could also use a different algorithm for fitting the connections of this graph structure, to try to yield better results. One such approach would be to set the connections and ordering of events (nodes) in the graph, instead of following a purely time-series order, according to a 3D distance that includes latitude and longitude along with time as the parameters deciding the next node.
Thanks to Anish Reddy, Muvazima Mansoor, Suresh Jamadagni, and Asif Qamar for making this research possible. This project went on to win the "Most Innovative Project" award at the Intel 2020 Hackathon, and has also been presented at ICMLA (Miami).
Feel free to connect if you have done a similar analysis on accident proneness but with different factors, because the best way to implement the Google Maps variant for safe roads would be to use this approach in tandem with other ML solutions that relate accident proneness to road-specific conditions and human error; that way ALL the major factors would be considered!