ChatGPT: Analyse 1Y of EV6 CAN Bus Data [Code Interpreter]

ChatGPT Code Interpreter Time Series CAN Bus Data

Need to analyse tons of CAN bus data, fast?

In this intro you'll learn how to use ChatGPT (GPT4) with Advanced Data Analysis (formerly Code Interpreter) to analyse 'big data' from the CAN bus of your vehicles/machinery.

In particular, we get practical via 12 awesome show cases of time series data analysis - and 10 tips for using ChatGPT to save 90%+ of time on your data processing.

By the end we expect you'll agree:

GPT4 with Advanced Data Analysis is a game changer.

Why use ChatGPT for your CAN data analysis?

Our end users face a challenge:

They manage to collect tons of CAN/LIN data using the CANedge - but now have to extract insights across several devices, thousands of log files and months/years of data.

Most of our users are engineers, but not data scientists. And even if they are, statistical data analysis is very time consuming.

But what if AI could help?

Specifically, we investigate the practical usefulness of ChatGPT (GPT4) with Code Interpreter. In short, this lets you upload data and ask it to analyse the data via Python.

12 showcases: Examples of ChatGPT analysis outputs

GPT4 Code Interpreter Big Data Time Series Analysis

Our CAN bus dataset

For this evaluation, we use CAN/UDS data from a Kia EV6 electric car. The data was recorded with a CANedge incl. GPS/IMU and includes information on the EV battery, temperatures, position, speed, acceleration and more (see the summary).

Warning: Before you use GPT4 on your own CAN bus data you should review your company's policy on this.

How to use Advanced Data Analysis

GPT4 (ChatGPT Plus) now natively lets you analyse data directly in the main window - no setup required.

How to prepare your CANedge CAN bus data for GPT4

The CANedge CAN bus data loggers record raw CAN/LIN data from vehicles/machinery in a binary file format (MF4).

The devices log data on an 8-32 GB SD card and support optional WiFi, 3G/4G and GPS/IMU.

To load this data in ChatGPT, we need to decode the CAN frames to physical values (degC, %, RPM, ...) and export the result to CSV. This requires a DBC file and suitable software/API tools.

With the CANedge software/APIs, you can follow one of the below methods:

1: asammdf GUI

Load MF4 file(s) in the asammdf GUI
Optionally concatenate the data
DBC decode the data
Export the decoded data to CSV

2: MF4 decoders

Download our MF4 decoders
Add your MF4 files and DBC file(s)
Drag & drop to produce CSV

3: Python API

Download and modify our Python script
Follow the README installation guide
Add your own MF4 files and DBC file(s)
Run the script to produce your CSV

For details on our data preparation and key considerations, see below.

Using asammdf or our MF4 decoders is the simplest way to get started with your own data. In our case, however, we must use the Python API because our data contains multi-frame UDS response data from the Kia EV6 (not currently supported by the asammdf GUI).

In theory it should be possible to allow GPT4 to perform the DBC decoding and data processing by providing it with the relevant Python packages in the form of *.whl files and some guidance. We did not test this particular step, however, as the purpose in this article is to focus on how competent GPT4 is when analysing a 'prepared' dataset.

Similarly, GPT4 can handle multiple CSV files rather than a single CSV file (as shown later). However, GPT4 achieves this by looping through the files and we expect it would increase the delay significantly between prompt responses, hindering the workflow. In addition, by performing the CSV concatenation prior to upload, we create a more consistent starting point for the analysis, while making it easier to double-check the results on our side.

When exporting DBC decoded CAN bus data to CSV, we typically distinguish between two types of data structures:

1: Signals in rows

Here, a 'Signal' column contains the name of each signal (e.g. Speed, RPM, ...). This lets you efficiently store signals with different timestamps and frequencies. This can be useful for CAN bus data where each CAN message generally has a unique timestamp/frequency. However, it also makes it cumbersome to e.g. calculate ratios between signals on-the-fly.

You can produce this CSV structure via our Python API.

2: Signals in columns

Here, each column represents a unique signal, all sharing the same timestamp (found in the index column). To represent signals across multiple CAN messages in this structure, one has to use resampling and/or extrapolation. For example, one could resample each signal to a shared 5-second frequency.

You can produce this structure via our Python API and the asammdf GUI.

Generally, the first method is superior from a technical perspective as you can minimize resampling/extrapolation, thus retaining the best data quality. For example, resampling a 100 Hz acceleration signal to a 5 second frequency results in this signal losing almost all meaning. Conversely, resampling an outdoor temperature to a 5 second frequency implies zero practical loss in quality. With the first data structure, you can address these considerations dynamically, tailoring your approach to each analysis. Further, for some use cases it is critical that data is analysed at the original frequency (e.g. in some automotive diagnostics use cases) and here this structure is recommended.

With that said, if you are going to analyse CAN data across e.g. 1 year, then you'll most likely need to do some level of resampling before exporting it to CSV (to avoid analysing e.g. 5-10 GB of data per device). For such a use case we generally recommend the second structure.

For our use case, we want to end up with a single CSV file (signals in columns) that has a consistent set of headers. This could be achieved by looping through all the log files and loading them into a single pandas dataframe, then saving this. Given that our data might exceed available RAM, this approach may not be scalable.

Instead, we append data to a CSV file that accumulates with each loop iteration. To do this, a list of signals needs to be defined up front which must be available in each dataframe. Without this, we cannot ensure that each value appended ends in the correct column of the CSV.

Basically, the signal list will be used to filter each dataframe (to ensure we do not get excess signals), as well as discard dataframes (if they do not include all required signals, e.g. because of a very short file, differences in configuration etc.).

In order to identify this list of signals, we take the following steps:

Identify a log file that should be 'representative' of the majority of the data
Extract and restructure the dataframe to have signals in columns
List the column names using list(df_phys) and subsequently sort the list
Use this list as the preliminary signals_required variable and filter all decoded logs
Next, loop through all the data and extract the list of column names for each file
If the column names of a file differs from the signals_required, print a warning and review the reason for the difference, then take action accordingly

Upon iterating, you might find that while 30% of the data aligns 95% with your 'signals_required', there are discrepancies, possibly due to 1-2 redundant signals. By storing the signals_extracted in a new list, you can quickly look for the unique elements between that and your signals_required via below line:

list(set(signals_required).symmetric_difference(set(signals_extracted)))

You then update your signals_required list to remove the redundant signals and repeat until you've arrived at a satisfactory signals_required list that also lets you work with e.g. 80-90% of your data. Of course, some use cases may require a different approach.

Before uploading the large CSV file to GPT4 it can be useful to do a quick review of the data.

For example, do some sanity checks by loading the file in Excel (if possible) - or create a smaller output file based on e.g. just 20 logs, then upload it to GPT4 for initial feedback.

For example, ChatGPT informed us that the signal VehicleSpeed had several missing values. This is a known problem with UDS responses from the Kia EV6 (and other Hyundai/Kia EVs) which is why we'd normally discard this parameter in favour of the Speed signal produced by the CANedge GPS module (which is consistently reliable). We therefore updated the output file to remove VehicleSpeed.

Having prepared our CAN bus data, we now have a 'decoded_ev6_data.csv' file of ~110 MB - ready for ChatGPT!

Get the CSV in our 'EV data pack'

Want to try out ChatGPT with the EV6 data yourself?

Our 100% free EV data pack includes the full ~110 MB decoded EV6 CSV file used in this article!

The pack also includes the 1 GB of raw CAN data, configuration file and DBC file for the EV6 - and other popular electric cars like the Tesla Model 3, Nissan Leaf, Enyaq and more!

Download now

12 awesome show cases of GPT4 + CAN bus data

Now comes the fun part: Analysing the EV6 CAN bus data with GPT4!

We start by uploading the CSV to get a summary:

As evident, GPT4 uses Code Interpreter to perform a high level summary of the data. This is not rocket science, but it's great for getting a quick overview and spotting potential issues. Note that all Python code can be viewed via the 'Show work' tabs.

In the following sections, we go through 12 show cases of how you can use ChatGPT with Code Interpreter.

In the following analyses, we adhere to the following restrictions:

We do not use any coding knowledge in any of our prompts
Each analysis was completed in <30 min (most in <10 min)
All outputs you'll see are directly from ChatGPT (with no edits by us)
We started a new 'chat session' for most of the examples to get a clean slate
The analyses were done in August 2023 (ChatGPT August 3 version)

Example #1: Creating 10 insightful data visualisations

First we will try two approaches to visualising our data.

1.1: High level request

To get a feel for the data, we start by creating some quick data visualisations with the shown prompt.

ChatGPT proposes visualisations, creates the code and produces the requested 2 x 5 grid of visuals. Pretty cool!

However, the plot also illustrates a key problem:

Our prompt is too high level.

While GPT4 'delivers' on our request, the selected signals and charts are not that insightful (detailed workflow below).

ChatGPT prompt data summary can bus data

Prompt 1: We start with a very high level request

ChatGPT time series data visualization minimum guidance

Result 1: As expected, the resulting plot is pretty random

Below is the full ChatGPT workflow (excl. the Python code generated), followed by the plot insights it provides.

As per the workflow, ChatGPT defines a list of tasks for each part of our prompt, then produces the code required to deliver the necessary output. The code it generates sometimes fails and ChatGPT then uses the error information to refine the code and try again. At one point we need to give it a 'nudge' to continue, but other than that it succeeds in finishing the task in the first attempt.

ChatGPT also provides a summary conclusion of each chart. The first iteration is generic and not useful. However, when we give ChatGPT this feedback it delivers far more relevant insights:

1.2: Specific request

In short, this 1st result was mediocre - but so was our prompt.

Let's try again with a more specific prompt as shown.

As seen in the resulting plot below, GPT4 does exactly what we ask - and the end result is far more useful.

Note that we only provide minor styling follow-ups - what you see is practically the 'first response' from ChatGPT.

GPT4 code interpreter big data visualisation with guidance

Prompt 2: We now get far more specific on what we want

ChatGPT Code Interpreter Time Series Data Plots Visuals

Result 2: ChatGPT produces exactly the plot we described in our prompt

Further, when prompting ChatGPT to provide 3 key takeaways it does a good job and e.g. highlights the relevant correlation between OutdoorTemperature and BatteryMaxTemperature (plots 2 and 3). It does, however, also hallucinate a bit as shown below.

Below is the full ChatGPT workflow followed by the plot insights it provides. As follow-up to our original prompt, we make a number of styling corrections and tweaks (e.g. converting Speed to SpeedKMH), but this is quickly handled by ChatGPT.

As shown, most of the insights make sense and are relevant.

Interestingly, however, GPT4 claims in the second insight that the EV6 StateOfHealth is deteriorating. That's a bold claim and seems inconsistent with the visual plot. After some probing, GPT4 admits that this was an error.

As we will see throughout other examples, this is an example of ChatGPT producing valid plots/numbers - but often invalid conclusions. Specifically, our hypothesis in this case is that ChatGPT 'solves' the insight aspect of our prompt based on its training data, rather than on the actual numbers - and it must be 'forced' to use the actual results in order to avoid this. In turn, this also begs the question as to whether the conclusions ChatGPT provided on the other plots are actually based on the data - or if they are just 'likely conclusions' given the context.

ChatGPT data visualization guided workflow 1

ChatGPT data visualization guided workflow 2

ChatGPT data visualization guided workflow 3

Example #2: Calculating kWh/100km

Next, let's test creating a new parameter.

When it comes to analysing the driving behaviour in electric cars, it is often of interest to look at kWh/100km as a measure of battery consumption. This is not directly available from the EV6 data, but has to be calculated.

As per our request, GPT4 calculates the battery consumption parameter correctly and visualises it. For fine-tuning, we ask GPT4 to remove outliers and exclude 1-hour periods that had less than 5 minutes of data in them.

This example may appear simple, but for CAN bus data analysis it is vital that ChatGPT is able to calculate custom signals on-the-fly. We'll use this ability frequently in subsequent analyses.

ChatGPT electric vehicle data battery consumption prompt

Prompt: We ask ChatGPT to calculate kWh/100km

Electric Car kWh/100km battery efficiency ChatGPT

Result: GPT4 correctly calculates the new BatterConsumption (kWh/100km) signal and creates 3 plots

Example #3: Identifying correlating drivers of battery consumption

Now that we have BatteryConsumption let's analyse correlations between this and key parameters.

3.1: Correlation vs. select signals

As per the prompt shown, we start by explicitly informing ChatGPT which parameters we wish to evaluate.

Note that ChatGPT re-uses BatteryConsumption from our previous example, but now also has to calculate AvgTirePressure from the individual tire pressure signals of each wheel.

In response, ChatGPT correctly calculates the correlation coefficients and creates neat scatter plots as per our request (plots and coefficients are correct).

Correct plots, incorrect conclusions

Curiously, however, ChatGPT also comments on the results as requested (in the same response as the plot) - yet refers to incorrect correlation coefficients (complete with compatible - and thus incorrect - conclusions).

We point this out, after which ChatGPT again refers to incorrect correlation coefficients (but new ones vs. before).

This again illustrates how ChatGPT can be prone to hallucination when asked to derive insights from its own analyses.

We eventually resolve this by asking it to redo the analysis and 'think step-by-step' (more on this later), after which it produces both valid plots and conclusions.

GPT4 prompt correlation analysis of time series data

Prompt 1: Visualize how battery consumption correlates with 5 signals

ChatGPT time series data correlation analysis

Result 1: GPT4 visualizes the data correctly - but struggles with conclusions

Below is the detailed conversation. Note how ChatGPT again struggles with basing its conclusions on the actual data, instead generating arbitrary correlation coefficients. Importantly, this should not detract from the quality and usefulness of the actual plot produced (which is valid and easily replicated based on the code). However, it stresses again the need for caution when asking ChatGPT to interpret its own results.

chatgpt-data-correlation-time-series-workflow-1

chatgpt-data-correlation-time-series-workflow-2

chatgpt-data-correlation-time-series-workflow-3

3.2: Top 3 correlating signals

The correlations we picked above might not be the most relevant. Let's try asking ChatGPT to identify the top 3 correlations across the dataset (with some conditions).

In response, ChatGPT correctly identifies the top 3 correlations.

Initially, it lists TireTemperatureFrontLeft, OutdoorTemperature and TireTemperatureBackRight. This is technically correct, but not what we intended. Instead, we ask it to calculate AvgTireTemperature (and ignore the individual tires).

Additionally, ChatGPT incorrectly displays the absolute form of each correlation coefficient (both in plots and conclusions). This happens because absolutes are used for the ranking - and GPT4 forgets to revert to non-absolutes for the remaining analysis.

With these tweaks, ChatGPT ends up with AvgTireTemperature, OutdoorTemperature and IndoorTemperature as shown in the plot - and it provides a sensible conclusion as well (see below).

ChatGPT using code interpreter to generate data insights

ChatGPT identify strong data correlations with Code Interpreter

Prompt 2: We ask GPT4 to find the top 3 correlations

ChatGPT data correlation top 3 time series

Result 2: After some corrections, we get the requested plot

Example #4: Visualising parameter relations in 3D surface plots

Next, let's try something simple and cool: 3D plots!

open the 3D plot

GPT4 uses plotly to create the 3D surface map - neat!

ChatGPT 3D surface plot of automotive CAN bus data

Prompt: Create a 3D plot with battery consumption

Result: ChatGPT easily produces this interactive 3D surface map

Example #5: Creating geographical plots of the car's GPS data

The dataset also contains GPS positions recorded by the CANedge GPS/IMU module. To better understand the vehicle dynamics it can be useful to plot these positions on a map.

5.1: GPS positions as route

Let's start with a very basic route plot.

As evident, this initial prompt was simple and ChatGPT manages to deliver (using folium) without any issues.

open the GPS plot

5.2: Speed-based geo heatmap

Let's try something harder:

Creating a geo-heatmap with speed-based coloring.

Here, ChatGPT breaks down our complex prompt into actionable subtasks and proceeds to execute each task.

Through this process, it corrects itself multiple times. At one point it also 'times out', but resumes when we ask it to 'continue'.

It successfully produces the requested plot, after which we follow-up with a few styling tweaks (e.g. asking it to remove circles with a radius beyond 50 km).

Note: GPT4 solved this task in <5 min!

open the GPS plot

GPT-4 Geographical Map Plot Code Interpreter GPS Positions Routes

Prompt 1: Plot the car's positions as a route

ChatGPT GPS position plot on interactive map CAN bus data

Result 1: An interactive map with visualised routes

LLM ChatGPT big data prompt GPS analysis map

Prompt 2: Create a geographical heatmap of positions and speed

Result 2: ChatGPT successfully produces a highly complex GPS based heatmap

Below is the full conversation. ChatGPT practically creates this in 'one go', except for the zoom level of the map.

Example #6: Identifying top 10 most frequently visited areas

Let us try another complex geospatial analysis.

A practical challenge encountered by many of our users is to evaluate the most frequently visited areas for a specific vehicle.

This is the most complex problem so far and importantly:

We initially don't know how to solve this problem.

6.1: Let ChatGPT figure it out

Since we do not have a clear way forward, we instead simply outline the intended end result - with zero guidance.

ChatGPT lays out a plan, but fails in defining areas.

Specifically, the chat bot attempts rather wild solutions such as DBSCAN clustering and KD-Trees. Both fail.

But ChatGPT then does something awesome:

It recognises that it's not getting anywhere, outlines two options - and asks for guidance!

This characterizes a great analyst - knowing when to ask for help and proposing viable routes forward.

Code Interpreter GPS data time series map plots

Prompt 1: Identify top 10 areas visited by the EV6

ChatGPT asks for help when unable to solve the problem

Result 1: ChatGPT initially fails, but proposes a grid-based solution

Below is the conversation showing how ChatGPT tries to solve the task - but fails with the DBSCAN and KD-Tree approaches:

6.2: Use GPT4's idea with guidance

We ask GPT4 to proceed with the grid based approach and it then produces a valid result. However, the road to this is rather iterative so to get a cleaner starting point we redefine our original prompt in a new session as shown in the 2nd prompt.

ChatGPT now manages to create the intended result almost out-the-gate with mainly a few styling corrections.

We do make one 'significant' correction on #hours:

GPT4 correctly identifies that most observations have 5 seconds between them (as we've resampled the data to 5 seconds), but it extrapolates this assumption to hold true across all observations - which is incorrect. For example, the delta between the last observation of trip X and the first observation of trip Y can be minutes, hours or even days.

In its defense, GPT4 identifies and fixes this mistake when we point out: 'The #hours values seem lower than expected?'. After fixing this, it produces the exact intended result.

open the GPS plot

ChatGPT prompt for identifying GPS based areas on a map

Prompt 2: Identify top 10 areas (with guidance)

Electric Car GPS Data Analysis by ChatGPT with Code Interpreter

Result 2 (1 of 2): Bar chat of top 10 visited areas for the EV6

Result 2 (2 of 2): Geo-plot of top 10 visited areas for the EV6

Below is the detailed conversation. This is a relatively rare example of ChatGPT making an erroneous assumption in the Python code, specifically assuming that the time between observations is always 5 seconds (an extrapolation it makes based on part of the data).

chatgpt-top-10-visited-areas-python-v2-1

chatgpt-top-10-visited-areas-python-v2-2

chatgpt-top-10-visited-areas-python-v2-3

chatgpt-top-10-visited-areas-python-v2-4

chatgpt-top-10-visited-areas-python-v2-5

One thing worth noting is that GPT4 struggles with adding e.g. city names, addresses or zip codes to each area. This is because it does not have access to geocoding packages and it is reluctant to use potentially outdated knowledge before the 2021 training data cutoff.

If pushed to provide a "best guess" as to the country of Area 1, it proposes Netherlands (the correct answer is Denmark, so it's not even close). However, this turns out to be an issue with GPT4 using only a single decimal in the latitude and longitude values.

After requesting that it leverages the full precision GPS coordinates, GPT4 does manage to map 8 of 10 areas to Denmark (Aarhus), which is correct - but fails to map two areas, specifically Denmark (Ebeltoft) and Austria (Kirchberg). For this reason we showcase the map without the partial geocoding result.

We expect Code Interpreter will at some point be able to fetch Python packages online, which will enable it to solve this type of task with ease.

Example #7: Visualising top 10 acceleration events

When analysing large amounts of data, you'll often be looking for specific events. To illustrate this, we will try to identify the top 10 acceleration events of the car as per the prompt shown.

Obviously, this is another complex request. We are deliberately asking ChatGPT to do a ton of stuff in one prompt as a challenge. Despite this it does pretty good on the first go.

As detailed in the full chat conversation below, the 1st prompt results mainly in events where the car starts after a standstill. In practice, we may be more interested in acceleration events during active driving, which motivates our follow-up prompt.

This follow-up prompt is not easy, but ChatGPT succeeds!

Fun fact: We don't prompt ChatGPT to add the dotted red 'event line' in Result 1 - it does so unprompted.

open the GPS plot

Prompt 1: Identify and visualize top 10 acceleration events

GPT-4 acceleration analysis automotive data logger

Prompt 2: Add complex conditioning to the event identification

car acceleration events including 60 seconds before/after

Result (1 of 3): Top 10 events, incl. acceleration and speed before/after the event

Result (2 of 3): GPS plot of each event incl. hover info (open the plot)

Top 10 acceleration events for electric car CAN bus data GPT

Result (3 of 3): Top 10 harsh acceleration events listed in a table structure

This conversation is fairly iterative with multiple adjustment prompts added in order to arrive at the final result.

In particular, we provide a few styling prompts and ask GPT4 to filter out events where ImuValid or SpeedValid equal 0. GPT4 struggles with labelling each event with a consistent 'event index' across all three outputs, but succeeds with some guidance. The results of one of the outputs is below (showing the event plot).

As evident from the results of the first prompt, many of the plots include missing observations because SpeedKMH is zero during part of the dataset. This is expected as the CANedge is configured to only record data when Speed exceeds zero. It also means that a lot of the acceleration events reflect the car starting up after a stand still.

To fix this, we make the follow-up prompt shown in 13 / 20 below. This follow-up prompt is non-trivial because of the iterative nature: ChatGPT has to both identify the top 10 acceleration events - while also ensuring that each event has 121 observations. It tries a few iterative approaches until it eventually decides on a rolling window methodology that achieves the task with sufficient efficiency.

As shown in the line charts, this method ensures in 9 of 10 plots that there is also 60 seconds of data before/after the actual event, even though that is not in principle a requirement in the initial filtration that GPT4 produces. As evident, one plot (Event 5) thus also lacks some data after the event, but seemingly makes the cut because it has sufficient data before the event to fulfil the requirement of 121 rolling observations.

After some minor styling prompts, we arrive at the final output - showing the top 10 harsh braking events along with the surrounding speed data.

chatgpt-can-bus-data-harsh-acceleration-1

chatgpt-can-bus-data-harsh-acceleration-2

chatgpt-can-bus-data-harsh-acceleration-3

chatgpt-can-bus-data-harsh-acceleration-4

chatgpt-can-bus-data-harsh-acceleration-5

chatgpt-can-bus-data-harsh-acceleration-6

chatgpt-can-bus-data-harsh-acceleration-7

chatgpt-can-bus-data-harsh-acceleration-8

chatgpt-can-bus-data-harsh-acceleration-9

chatgpt-can-bus-data-harsh-acceleration-10

Example #8: Daily distribution of consumed State of Charge (SoC%)

When working with CAN bus time series data you'll often need to work with the change in a variable, rather than the absolute value. However, doing so can be non-trivial for large amounts of data. Let's put ChatGPT to the test.

Our prompt for this task is very long, but with clear sub tasks.

In response, ChatGPT does a good job as per the below results - providing us with an overview of when the car is primarily used throughout the day and whether some hours of the day appear to enable more efficient driving than others.

Disclosure: As explained below, we made multiple attempts before arriving at this initial prompt.

For most of the examples in this article, we define our initial prompts in advance - and do not modify them through iteration. We do so to mimic real-life analysis, where you may not have time to perfectly engineer an initial prompt.

However, for this particular prompt we did go through about 3-4 iterations. Our original prompt was less specific on how to construct the calculated signals, e.g. not explicitly stating that the delta in DistanceTrip should exclude negative values. Without this guidance, ChatGPT would calculate a negative delta between the last observation of Trip A and the first observation of trip B, which is not our intent. Through iteration with ChatGPT we would identify unexpected values for the calculated signals and adjust our prompt accordingly.

Of course, this can be handled iteratively in the prompt conversation flow, but we decided to update the initial prompt instead to showcase how ChatGPT does succeed out-the-box if given sufficiently clear guidance on how to calculate each signal. This is also a reminder that even though ChatGPT is great, the user must do part of the thinking and scrutinise the results.

GPT-4 State of Charge analysis electric vehicle UDS data logger

Prompt: Plot the daily distribution of consumed SoC%

State of Charge Kia EV6 daily distribution of consumption and efficiency

Result: Daily distribution of distance travelled and consumed SoC in %-pts and per km

Example #9: Plotting State of Charge (SoC) over time excluding non-trips

A challenge when visualising data from vehicles over days or months is the fact that the data often consists of brief trips surrounded by a lot of nothing (when the vehicle is not driving).

To avoid wasting screen space on non-trip periods, let's try to cut them out as per our prompt.

ChatGPT produces the requested plot without any follow-ups required, providing a useful "zoomed in" perspective on the data that filters out all the non-trip gaps.

ChatGPT prompt plot state of charge excluding non-trip periods

Prompt: Plot SoC% over time, excluding non-trip periods

State of Charge SoC% Kia EV6 plotted over time including and excluding non-trip periods

Result: SoC% plotted over time including/excluding non-trip periods

Example #10: Reviewing the quality of the GPS/IMU data

In order to showcase the ability of ChatGPT to derive high level insights from the data, we'll use it to help identify the validity of the GPS/IMU data we've recorded throughout.

Note how ChatGPT is tasked with deciding on the analysis in this case - although we do specify that it should focus on the validity signals. GPT4 arrives at the expected conclusion: That the signals are valid throughout almost the entire dataset.

Below is the summary conclusion made by ChatGPT:

If we do not specify which signals ChatGPT should focus on, it will attempt to identify this itself. It succeeds to some extent in our tests, but it will to a larger extent look at the actual signals (such as Speed, Latitude etc) rather than the more specific validity signals. One can argue whether this is preferable, but the result is an analysis that is more 'all over the place' - even if it arrives at the same conclusion overall.

As in other tests, providing a bit of extra guidance up-front results in vastly better output.

GPT-4 analyse GPS data time series prompt

Prompt: Determine if the GPS/IMU data is accurate

Signal validity ChatGPT time series analysis

Result (1 of 3): % of valid data per signal

Result (2 of 3): Distribution of data validity

GPS validity over time in 5 min period intervals

Result (3 of 3): Validity over time (5-min periods)

Example #11: Who's the better driver - Martin or Josefine?

Towards the end, we will try using ChatGPT to perform an analysis of driver performance for the car.

Specifically, the EV6 is used by two drivers: Martin and Josefine. We want to determine who drives most efficiently to/from work.

11.1: Clustering trips by driver

First, we need to cluster trips by driver. We do so through geo-tagging trips using the respective work places as per the prompt.

As evident from the below results, ChatGPT is able to segment the dataset into trips based on a simple logic we provide. It is also easily able to group the trips by driver based on whether the car moves into predefined geographical areas. We do some light filtration to focus only on work commutes and ask ChatGPT to summarize the trip information and plot it on a map.

open the GPS plot

ChatGPT analyse driving behavior electric car via Code Interpreter

Prompt 1: Group data by trips and map drivers based on geofences

Result 1: The GPS plot enables us to quickly verify if the trip clustering is valid

ChatGPT manages to perform the fairly complex trip clustering with ease and also provides summary statistics of the trips.

gpt4-code-interpreter-time-series-big-data-vehicle-trip-clustering-1

gpt4-code-interpreter-time-series-big-data-vehicle-trip-clustering-2

gpt4-code-interpreter-time-series-big-data-vehicle-trip-clustering-3

gpt4-code-interpreter-time-series-big-data-vehicle-trip-clustering-4

gpt4-code-interpreter-time-series-big-data-vehicle-trip-clustering-5

11.2: Visualising power consumption

Next, we use GPT4 to create a histogram of consumed power to compare regeneration patterns between the two drivers.

As evident, Josefine has a higher share of observations that fall into the light regenerative area compared to Martin.

Next, we'll derive a definitive conclusion.

GPT compare driving behavior via regenative braking histogram

Prompt 2: Driver histograms of power consumption

EV regenerative braking pattern histograms via ChatGPT

Result 2: Power consumption histograms for evaluating regenerative braking patterns

Below is the workflow for generating the histogram. Note how ChatGPT wrongfully concludes that Martin is better at regenerative braking, which appears to contrast the results from the plot. More on this in the next section.

chatgpt-electric-car-trip-data-analysis-code-interpreter-1

chatgpt-electric-car-trip-data-analysis-code-interpreter-2

chatgpt-electric-car-trip-data-analysis-code-interpreter-3

chatgpt-electric-car-trip-data-analysis-code-interpreter-4

chatgpt-electric-car-trip-data-analysis-code-interpreter-5

11.3: Accumulated consumption per km

To further evaluate the respective driver efficiency, we separately accumulate the negative and positive power consumption values and standardise them based on distance travelled (as shown in the prompt).

In response, ChatGPT does exactly what is requested, with only minor styling prompts required (results below).

As evident, the resulting plots clearly show that Josefine is a more efficient driver than Martin, specifically due to her ability to use regenerative braking better. This is also evident if we instead get ChatGPT to calculate battery consumption in kWh/100 km, where Josefine is almost 20% more efficient.

Valid plots, invalid conclusions

To cap it off, we ask ChatGPT a simple question: Who is the better driver and why?

And here it gets weird:

For all its amazing capabilities, ChatGPT wrongfully claims that Martin is the better driver.

This is a consistent theme across our examples:

While ChatGPT creates valid plot and analyses 9/10 times, it makes valid conclusions 5/10 times at best.

We comment on this in the chat workflow below.

GPT4 analyse driver efficiency big data power consumption

Prompt 3A: Visualize the relative driving efficiency

follow-up prompt for ChatGPT to adjust output

Prompt 3B: Example of follow-up prompt

ChatGPT calculate battery consumption per driver

Prompt 3C: Calculate battery consumption per driver

Standardized power consumption by driver split by consumed vs regenerated

Result 3 (1 of 2): Standardized power - regenerated, consumed and net

Result 3 (2 of 2): Average battery consumption by driver

Below is the full workflow for the accumulated power consumption analysis.

As evident from the chat extract shown, ChatGPT appears to swap the names of the drivers when asked to provide a conclusion. Additionally, it only fixes this upon being informed of the error.

Interestingly, ChatGPT makes the same erroneous conclusion when reviewing the previous power consumption histograms, claiming that Martin was driving more efficiently. In fact, it even makes the same wrong conclusion when based on the extremely clear-cut battery consumption plot - generating new hallucinated consumption averages as part of its response.

As discussed previously, this appears to reflect GPT4 defaulting to its training data for conclusions, thus hallucinating 'likely' values for each driver instead of relying on the factual results it just produced.

If that is the case, then it is curious that ChatGPT consistently believes Martin is the better driver across sessions. Maybe this reflects an intrinsic bias in ChatGPT towards thinking men are better than women at driving. Or, maybe it thinks that the prompter is Martin (which is correct) and thus finds it likely that a conclusion that favorises the prompter is preferred. Regardless, it's surprising how frequently ChatGPT ignores the factual results and instead goes for hallucination.

When asked why it fails, ChatGPT appears to suggest that it's not leveraging the 'hard data' it produced in previous responses, but rather does a visual inspection of the plot.

It is unclear to us if ChatGPT actually tries to extract the numbers from the plot. We tested this ability in isolation, uploading the battery consumption plot in a new session and asking ChatGPT to extract the numbers and derive the insights. It explains that it will do so using Optical Character Recognition (OCR) and proceeds to extract five numbers (19.0, 17.5, 15.4, 15.0, 12.5 - i.e. including the two intended values). It fails in this test to make the intended conclusion as well. Based on this it appears GPT4 has some level of competence for plot based number extraction, but it is too sporadic to be reliable. As such, GPT4 may be telling the truth when it claims that it tries using the data from the plot - but it could also be that the numbers it uses are completely made up.

Regardless, when ChatGPT actually re-does the calculation to fetch the 'hard numbers' it correctly concludes that Josefine is the better driver.

chatgpt-vehicle-driver-benchmark-analysis-can-bus-1

chatgpt-vehicle-driver-benchmark-analysis-can-bus-2

chatgpt-vehicle-driver-benchmark-analysis-can-bus-3

chatgpt-vehicle-driver-benchmark-analysis-can-bus-4

chatgpt-vehicle-driver-benchmark-analysis-can-bus-5

chatgpt-vehicle-driver-benchmark-analysis-can-bus-6

chatgpt-vehicle-driver-benchmark-analysis-can-bus-7

Example #12: Analysing data from multiple vehicles

So far, we've used a single CSV from a single Kia EV6. In practice, our end users often need to extract insights across multiple vehicles, boats or machines - sometimes entire fleets.

12.1: Simulating fleet data

To mimic this, we first re-purpose our data to simulate a fleet of 5 cars (using GPT4, of course).

We originally intended to manually restructure the dataset by breaking it into 5 roughly equal periods and then re-baselining the timestamps to make it appear the data was recorded in parallel.

But then we of course came to our senses and asked ChatGPT to do it for us:

ChatGPT restructure electric car data set to fleet

ChatGPT manages to do this without any issues, producing 5 separate chunks of data and re-baselining it to start January 1, 2023. It adds the car identifier in a new column in each dataset, creates 5 CSV files and provides a download link for the zip containing these 5 CSV files. Neat!

One minor critique is that it did not compress the data in the zip (but then again, we did not ask it to).

Generally, you could take many approaches to assigning a vehicle/machine ID to a dataset. It could be provided via the file name, folder name, an extra column etc. We find it useful to have the ID in the column as it offers some flexibility for also mixing multiple assets within the same dataset file. From our tests you just need to make it sufficiently explicit and ChatGPT will understand the logic - and if not, you can explicitly tell it.

The result is a new zip file containing 5 individual CSV files, each structured as before - but with the CSV name and an extra column specifying the vehicle identifier of each CSV.

12.2: Visualising fleet data

Next, we ask ChatGPT to perform a simple analysis based on the CSV files. As shown below, ChatGPT easily works across the 5 individual files to calculate the new signals and plotting all of them in a single visual - great! This also shows that you can work with multiple datasets in parallel if preferred.

We also ask for a plot of battery consumption (a complex calculated variable), which it also does without issues.

ChatGPT prompt analyse vehicle fleet data

Prompt 1: Summarize and visualize data across 5 CSV files

ChatGPT prompt analyse battery consumption across electric vehicle fleet

Prompt 2: Visualize battery consumption across fleet

ChatGPT code interpreter multiple CSV files analysis

Result 1: Distance travelled across the 5 separate cars

Battery consumption over time per vehicle by ChatGPT code interpreter

Result 2: Battery consumption by car

12.3: 3D scatter plot

For some light action, we ask for a 3D scatter plot.

ChatGPT again delivers (using plotly), creating a cool 3D scatter plot of battery consumption vs. outdoor temperature split by car. Note how it correctly applies the cross-plot color-coding.

open the 3D plot

ChatGPT prompt 3D scatter plot across electric vehicle fleet

Prompt 3: Create a 3D scatter plot

Result 3: 3D scatter of battery consumption vs. outdoor temperature by car

12.4: Car ranking and distance split

We also ask GPT4 to create a visual ranking of the cars based on driving efficiency, which it delivers with minor styling edits.

And as a final challenge, we ask it to revert from the hourly aggregated data back to the 5-second resolution (in-session) to calculate the distance travelled by each car and to visualise this in a pie chart. It does so correctly (with a few tweaks), despite having worked at weekly / hourly averages for the previous plots.

Overall, ChatGPT fluently handles the combined challenge of multiple CSV files, calculated variables and different sampling frequencies.

Prompt 4: Rank the cars based on driving efficiency

ChatGPT prompt 3D scatter plot across electric vehicle fleet

Prompt 5: Create a pie chart of distance travelled

ChatGPT bar chart showing a sorted list of cars based on battery consumption

Result 4: Ranking of cars based on battery consumption

GPT4 pie chart code interpreter distance travelled by cars

Result 5: Pie chart showing breakdown of distance travelled by car

10 tips & tricks for using GPT4 with your CAN data

Below we outline practical tips and tricks:

GPT4 will often 'drop out' of the session you're in, in which case it will lose access to the uploaded data. Rather than starting up a new chat session, the simplest fix is to reupload the relevant data and ask it to retry.

GPT4 often hallucinates which variables should be used in a calculation or visualisation. It'll typically realise the mistake and prompt you to inform it about the correct variable name, but to save time you can inform it upfront.

If you know programming, it's useful to open the code produced by GPT4 in each step. Sometimes there are some odd decisions made by GPT4 in developing the code. In addition, GPT4 often remakes the code from scratch upon e.g. new attempts - so it can be difficult to keep track at times. We find it useful to request GPT4 to output the full consolidated code when relevant. If you do this, make sure to also ask it to output the result of the code in the same prompt. You can then save this and run it locally in case you need to reproduce the exact results later.

While GPT4 is fast, you quickly get lost in the rabbit hole and hours can pass. To minimise this, be very clear on what you need to ask before you start prompting. In fact, we recommend writing down your full initial prompt for each question, ensuring you detail it to the extent you believe necessary to provide GPT4 with a chance to 'get it right' in the first go.

In our tests, ChatGPT will struggle with large complex prompts. You'll notice that we therefore frequently use the phrase "Let's think step-by-step", which has been shown to enable the large language model (LLM) to more easily handle complex tasks. Without this phrase, ChatGPT tends to proceed more iteratively and plunge into the task. With the phrase, it typically outlines each step clearly up-front as a "plan" - and then proceeds to execute it, resulting in far fewer mistakes.

If you're going to use ChatGPT for similar analyses, you may want to sanitise your data up front. Specifically, you may consider removing GPS data from your dataset before uploading, as this will in practical terms make your data anonymous and may also be required for GDPR or CCPA purposes. This can be done easily via the asammdf GUI filters when exporting to CSV - or by using our Python API to filter out specific signals.

You'll often need to re-upload your data e.g. if a session expires. To speed this up you can zip the CSV file to reduce size by 80%+. This also allows you to easily upload multiple CSV files and/or larger CSV files than the normal file limit.

In many of the examples in this article we deliberately challenge ChatGPT by providing extensive prompts with multiple sub tasks and multiple outputs. In practice, it is often simpler to take it step by step, e.g. asking ChatGPT to produce one output at a time and wait for feedback before proceeding.

If you know certain variables need to be calculated for use in many prompts, do this in advance. This helps you get more consistent results between prompts and you avoid the risk of ChatGPT occasionally calculating a new variable incorrectly. Similarly, consider what sampling frequency and data structure minimises the effort from ChatGPT in order to perform your required analyses.

In our experience, working with GPT4 plus Code Interpreter is like having a junior analyst that produces output at an inhuman rate. Yet, like most junior analysts you'll need to provide clear guidance up front and diligently review any analyses produced.

This does not mean that you need to know Python code - but you should ideally have an understanding of how to work with time series data to get the most out of this tool.

Side comment

As we wrote the above tip, we wanted to exemplify this. Specifically, we believed ChatGPT would not proactively remove 'charging events' when calculating the 'consumed' state of charge over time. However, for the sake of good order we checked this - and we were wrong ...

ChatGPT prompt calculate consumed state of charge

ChatGPT correctly filters out charging events in consumed state of charge

GPT4 consumed state of charge electric car over time

Our verdict on GPT4 with Code Interpreter

In the spirit of this article, you'll find below both our verdict - and ChatGPT's verdict:

Our verdict

As should be obvious from the examples, our verdict is clear:

GPT4 with Code Interpreter is a game changer.

If an engineer with light/medium Python skills had to re-create all of the outputs in this article, we estimate it would take 20-30 hours of work. In comparison, the prompting process with ChatGPT took 2-3 hours - 90% less time.

More importantly: You may not know Python at all.

Here, ChatGPT lets you perform analyses that would otherwise be impossible. This makes it a key supplement to existing CANedge software for ad hoc analysis of medium/large datasets (across many files, months, years, devices, ...).

Also, we agree with the relevant pros and cons listed in the ChatGPT verdict [sigh, why are we even writing our own...].

ChatGPT's verdict

ChatGPT self-evaluation on CAN bus data analysis competences

To summarise:

GPT4 with Code Interpreter lets you produce amazing data analyses and visualisations with a ridiculous output/effort ratio. The Python based outputs are almost always valid - and can be easily reviewed. However, GPT4 is prone to hallucination when asked to comment on its own analyses - so make your own conclusions.

Closing remarks

We hope you've found this article insightful!

If you want to play around with ChatGPT and Code Interpreter, download the 100% free EV6 CSV dataset now.

Tip: We may add a section with 'user examples' - if you have a prompt + output, share it on LinkedIn or via our contact form!

Ready to use ChatGPT with your own CAN data?

Get your CANedge today!

Buy now Contact us

Recommended for you

CANedge3 - CAN Bus Data Logger with 3G/4G and GNSS/IMU

CANEDGE3
CAN/LIN LOGGER (3G/4G + GPS/IMU)

CAN TELEMATICS AT SCALE

PYTHON API - PROCESS YOUR CAN DATA

ChatGPT: Analyse 1Y of EV6 CAN Bus Data [Code Interpreter]

In this article

Why use ChatGPT for your CAN data analysis?

Our CAN bus dataset

How to use Advanced Data Analysis

How to prepare your CANedge CAN bus data for GPT4

1: asammdf GUI

2: MF4 decoders

3: Python API

1: Signals in rows

2: Signals in columns

Get the CSV in our 'EV data pack'

12 awesome show cases of GPT4 + CAN bus data

Example #1: Creating 10 insightful data visualisations

1.1: High level request

1.2: Specific request

Example #2: Calculating kWh/100km

Example #3: Identifying correlating drivers of battery consumption

3.1: Correlation vs. select signals

Correct plots, incorrect conclusions

3.2: Top 3 correlating signals

Example #4: Visualising parameter relations in 3D surface plots

Example #5: Creating geographical plots of the car's GPS data

5.1: GPS positions as route

5.2: Speed-based geo heatmap

Example #6: Identifying top 10 most frequently visited areas

6.1: Let ChatGPT figure it out

6.2: Use GPT4's idea with guidance

Example #7: Visualising top 10 acceleration events

Example #8: Daily distribution of consumed State of Charge (SoC%)

Example #9: Plotting State of Charge (SoC) over time excluding non-trips

Example #10: Reviewing the quality of the GPS/IMU data

Example #11: Who's the better driver - Martin or Josefine?

11.1: Clustering trips by driver

11.2: Visualising power consumption

11.3: Accumulated consumption per km

Valid plots, invalid conclusions

Example #12: Analysing data from multiple vehicles

12.1: Simulating fleet data

12.2: Visualising fleet data

12.3: 3D scatter plot

12.4: Car ranking and distance split

10 tips & tricks for using GPT4 with your CAN data

Side comment

Our verdict on GPT4 with Code Interpreter

Our verdict

ChatGPT's verdict

Closing remarks

Recommended for you