Guest post by Khalid Azam, Director Of Data Services, Renovo
The newfound combination of electric vehicles hitting the road and breakthrough self-driving technology is unleashing a revolution. The consensus among auto industry experts is that car ownership will slowly be replaced by autonomous vehicles (AVs) and AV ride-hailing services as the primary means of urban transportation, because the latter will offer a much lower cost per mile.
The data produced by AVs is like nothing we’ve ever seen, with each sensor-laden AV generating a thousand times more data per day than Twitter does worldwide in the same amount of time. These vehicles are also outfitted with heavy-duty compute infrastructure that runs dozens of domain-specific applications, which process sensor data to enable safe and reliable autonomous driving.
Renovo has built the AWare™ software platform for AVs, which abstracts, among other things, the underlying mechanical and motion control infrastructure of an AV behind well-defined APIs. This enables third-party applications to run as containers on the AWare™ platform, subscribe to sensor data, process that data, and use AWare™ Motion Control APIs to drive the car. Third-party applications do not have to worry about the make or model of the underlying car.
A Renovo AWare™-equipped fleet can include thousands of vehicles. Each AWare™-equipped vehicle runs dozens of applications and services. Logs produced by these applications and services remain one of the main sources of data for troubleshooting. They are critical in identifying and preempting issues with complex software. Operators often look to logs to identify bugs, monitor service-level indicators, and triage issues. An issue may surface only infrequently, and on different vehicles across a fleet. Therefore, a holistic, single-pane view of logs across the vehicle fleet is essential.
On the Vehicle
We deploy Elasticsearch in the vehicle and leverage the computing capability of the vehicle to perform initial processing and indexing of log documents as well as continuous queries to identify problems. An on-board Elasticsearch-based dashboard allows safety drivers and other technicians to detect and debug issues in the car in real time.
Relative to the amount of data they generate, autonomous test vehicles are usually outfitted with a low-bandwidth 4G cellular connection with an unreliable uplink to the cloud. Because network bandwidth is the most precious resource on each vehicle, streaming raw log documents off the vehicle to the cloud is not feasible: the volume is too high and the connection may drop. Instead, log indices are shipped as compressed snapshots to an Amazon Web Services-backed repository for archiving and non-real-time triaging of issues. These logs are enriched with the source vehicle ID and integrated into the master aggregate Elasticsearch index, which allows us to run search queries across a fleet of vehicles, identify problems, and alert the appropriate stakeholders.
The security features of Elasticsearch allow us to deploy a secure-by-default indexer cluster to each vehicle without software licensing costs. Query and transport channels are encrypted. Authentication is required, and authorization restricts each user to the platform or customer log streams they are permitted to access.
The AWS Pipeline
AWS allows us to archive, restore, and reprocess logs as soon as they are uploaded no matter the volume. We leverage Amazon Elasticsearch Service to mitigate the operational complexity of a large, secure log warehouse.
An AWS S3 bucket is configured for server-side encryption, and object-creation events are generated whenever snapshot index files are uploaded to this bucket. These events are sent to the initial processing SQS queue. New messages are not visible for 15 minutes to allow snapshot actions to complete.
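As a rough illustration of the first hop, the sketch below extracts bucket and key pairs from an S3 event notification delivered as an SQS message body. The function name `uploaded_objects` and the `QUEUE_ATTRIBUTES` constant are hypothetical, and using the queue-level `DelaySeconds` attribute is only one possible way to get the 15-minute delay described above; the post does not say how it is configured.

```python
import json
from typing import List, Tuple
from urllib.parse import unquote_plus

# Assumption: the 15-minute delay is a queue-level delivery delay.
# 900 seconds is the maximum value SQS accepts for DelaySeconds.
QUEUE_ATTRIBUTES = {"DelaySeconds": "900"}


def uploaded_objects(sqs_message_body: str) -> List[Tuple[str, str]]:
    """Return (bucket, key) for every object-created record in an S3 event."""
    event = json.loads(sqs_message_body)
    objects = []
    # S3 test events carry no "Records" key, so default to an empty list.
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects
```

A Lambda handler triggered by the queue could call `uploaded_objects(record["body"])` for each SQS record it receives.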
An index file processor AWS Lambda function is triggered by new messages to the initial processing queue. This function is responsible for parsing the snapshot index, checking whether each snapshot has been restored yet, and writing snapshot restore tasks to the restore SQS queue. To communicate with the data warehouse Elasticsearch cluster, the function fetches and de-envelopes a TLS certificate and key from a secrets S3 bucket.
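The "has this been restored yet" check is what keeps the pipeline from restoring the same snapshot twice when an index file is re-uploaded. A minimal sketch of that planning step follows. It assumes the parsed index file exposes a `snapshots` list whose entries have a `name` field, and that the set of already-restored names comes from the pipeline state; both the layout and the function name `plan_restore_tasks` are assumptions, not details from the actual implementation.

```python
from typing import Dict, List, Set


def plan_restore_tasks(
    index_doc: dict, already_restored: Set[str], repository: str
) -> List[Dict[str, str]]:
    """Build one restore task per snapshot that has not been restored yet."""
    tasks = []
    # Assumed layout: {"snapshots": [{"name": "..."}, ...]}
    for snapshot in index_doc.get("snapshots", []):
        name = snapshot["name"]
        if name in already_restored:
            continue  # skip work that an earlier run already queued
        tasks.append({"repository": repository, "snapshot": name})
    return tasks
```

Each returned task would then be serialized and written as one message to the restore queue.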
Snapshots are restored to the log warehouse by a Lambda function that is triggered by new messages in the restore queue. Indices contained in the snapshot are restored to the cluster with vehicle-specific names. This worker enqueues the restore task with the cluster, writes the snapshot name and restore function task ID to the pipeline state, and writes a message to a reindex SQS queue.
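Restoring with vehicle-specific names maps naturally onto the `rename_pattern` and `rename_replacement` options of the Elasticsearch snapshot restore API. The sketch below only builds the request; the helper name `restore_request`, the `indices` default, and the `<vehicle id>-<original name>` naming scheme are illustrative assumptions rather than the names used in production.

```python
from typing import Dict, Tuple


def restore_request(
    repository: str, snapshot: str, vehicle_id: str, indices: str = "*"
) -> Tuple[str, Dict[str, str], dict]:
    """Return (path, query params, JSON body) for a non-blocking restore."""
    path = f"/_snapshot/{repository}/{snapshot}/_restore"
    # Do not hold the Lambda open while the cluster restores the data.
    params = {"wait_for_completion": "false"}
    body = {
        "indices": indices,
        "include_global_state": False,
        # Capture the whole original index name ...
        "rename_pattern": "(.+)",
        # ... and prefix it with the vehicle ID (index names must be lowercase).
        "rename_replacement": f"{vehicle_id.lower()}-$1",
    }
    return path, params, body
```

Prefixing the restored names keeps indices from different vehicles from colliding before they are merged into the global index.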
Reindexing is necessary to reduce cluster overhead. Vehicle-specific indices are reindexed into the global dated index and then deleted. A reindex Lambda function is triggered by new messages to the reindex queue. This function polls for the snapshot restore task state, enqueues a reindex task with the cluster, updates the snapshot restore and reindex task IDs in the pipeline state, and then writes a message to the deletion SQS queue. You can guess what happens from here. An index deletion function is triggered by new messages to the deletion queue. For each reindex task, the worker polls for that task's state and deletes the source index once the task is complete.
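The reindex and delete steps can be sketched as a request builder plus a guard that decides when the source index may be dropped. The guard reads the document returned by the Elasticsearch tasks API (`GET /_tasks/<task id>`). The `logs` prefix, the date format of the global index name, and all function names here are hypothetical.

```python
from datetime import date
from typing import Dict, Tuple


def global_index_for(day: date, prefix: str = "logs") -> str:
    """Name of the global dated index for a given day (assumed naming scheme)."""
    return f"{prefix}-{day:%Y.%m.%d}"


def reindex_request(source_index: str, dest_index: str) -> Tuple[str, Dict[str, str], dict]:
    """Return (path, query params, JSON body) for a background reindex."""
    body = {"source": {"index": source_index}, "dest": {"index": dest_index}}
    # wait_for_completion=false makes the cluster return a task ID to poll.
    return "/_reindex", {"wait_for_completion": "false"}, body


def safe_to_delete(task_status: dict) -> bool:
    """True only if the reindex task finished without an error or failures."""
    if not task_status.get("completed", False):
        return False  # still running: poll again later
    if "error" in task_status:
        return False  # the task itself failed
    failures = task_status.get("response", {}).get("failures", [])
    return not failures  # any per-document failure keeps the source index
```

Deleting the source index only when `safe_to_delete` returns true means a failed or partial reindex leaves the vehicle-specific index in place for a retry.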
Querying the Warehouse
Log documents are annotated with fleet and vehicle metadata when they are initially processed on-vehicle. This information is typically used to distinguish deployments within a common index. In addition, the metadata is used in conjunction with the user’s identity to provide document-level security for log warehouse searches. Each user identity carries permissions to fleet and/or vehicle log streams.
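One way to picture document-level security is as a query filter derived from the user's permissions, so that only documents from permitted fleets or vehicles ever match. In the sketch below, the field names `fleet_id` and `vehicle_id` and the function `log_stream_filter` are hypothetical; how such a filter is attached to a role or request depends on the security layer in use.

```python
from typing import Set


def log_stream_filter(fleets: Set[str], vehicles: Set[str]) -> dict:
    """Build an Elasticsearch query matching only the permitted log streams."""
    clauses = []
    if fleets:
        clauses.append({"terms": {"fleet_id": sorted(fleets)}})
    if vehicles:
        clauses.append({"terms": {"vehicle_id": sorted(vehicles)}})
    if not clauses:
        # No permissions at all: match nothing rather than everything.
        return {"match_none": {}}
    # A document is visible if it belongs to any permitted fleet or vehicle.
    return {"bool": {"should": clauses, "minimum_should_match": 1}}
```

For example, `log_stream_filter({"fleet-a"}, {"vehicle-42"})` admits every document from fleet `fleet-a` plus documents from `vehicle-42`, whichever fleet that vehicle belongs to.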
The provided Kibana deployment is the primary search interface for the log warehouse. It enables our developers and users to create dashboards and alerts to better monitor their services and the operation of one or many vehicles.
Amazon Web Services has enabled us to keep up with the logs generated by vehicles and support systems without over-provisioning processing capacity, which lets us focus on the problem space: operating a vehicle or fleet of vehicles.
We strive to continuously improve our solution. In the future, we would like to transition to Step Functions for workflow management and add alerting for unrestorable snapshots.