Validating your water data

  • Why sensor data is worth its weight in gold

  • How anomalies can cause chaos and confusion

  • The role data science plays in validating sensor data

Errors in your data?

No matter how much we rely on technology, it’s perhaps a natural human instinct to be suspicious of something you can’t see with your own eyes. And the truth is, surprising results in water management data are common. Data can drift, contain extreme values or, in some cases, suddenly flatline.

Sometimes this indicates a problem in your asset that needs addressing. Other times, it simply means you have to replace a sensor. But unless you know the difference between these two scenarios, confusion can reign over your water management.

The images below show some of the most common anomalies water managers may see, using the water levels in a sewer as an example.

In figure ‘A’ you can see a baseline shift. These can occur if a rain shower washes sand into the sewer pipes, for example, causing the water level to rise.

After a cleaning session, the pipe is once again empty and the water returns to a normal level. In this case, the baseline shift was caused by a real event, not a data error.

It’s also possible for sand to be deposited slowly in the sewer pipe, causing water levels to rise gradually over time. This ‘drift’ can be seen in figure ‘B’. Or, heavy rainfall could result in a short, sharp peak, as illustrated by the extreme value in figure ‘C’.

Finally, during a period of drought, you might get a flatline, as seen in figure ‘D’.


Figure A

Each of these instances would capture the attention of whoever is monitoring the data. And they all share one thing in common: they could either be caused by actual events, or they could be complete fabrications created by faulty sensors and erroneous data.

So, how do you tell which is which?

Data validation modelling

The key to assessing whether a data shift indicates a genuine incident or is instead the result of faulty tech comes down to incorporating other data alongside your sensor data.

If viewed in isolation, sensor data can be misleading. But by using data science to compare those anomalies with other factors, such as weather data, you can gain a clear picture of what you’re looking at, and whether or not it’s likely to have been caused by real-world events.

To illustrate, let’s look at our recent collaboration with Aquasuite, where we developed an approach for establishing the aeration frequency of a blower installation in a water purification environment.

If you look at the chart below, you’ll see the data shows a ‘baseline shift’ of around 250 Nm³/h after August 20th. So how do we identify whether this is a data error or a process error?


Figure B

We know that the main predictor for aeration is the influent of dirty water to the treatment plant. The dirtier the water, the more aeration has to be done to clean it.

Fortunately, we also have data on the water influent concentrations. Using this, we set up a Quantile Regression Neural Network to predict the expected amount of aeration for a given influent. This gives us the following results:


Figure C

The blue line in the figure above indicates the expected aeration given the influent, and the bands around it indicate the range of values it can normally assume. Data values that fall outside this region are referred to as ‘outliers’.
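
To make this concrete, here is a minimal sketch of how such a Quantile Regression Neural Network can be set up, assuming PyTorch and hypothetical variable names rather than the actual Aquasuite implementation. The network predicts a lower, median and upper quantile of aeration for a given influent, is trained with the pinball (quantile) loss, and flags a measurement as an outlier when it falls outside the predicted band.

```python
# A minimal sketch, assuming PyTorch and hypothetical names
# (illustrative only, not the production implementation).
import torch
import torch.nn as nn

QUANTILES = [0.05, 0.50, 0.95]  # lower bound, expected value, upper bound


class QuantileNet(nn.Module):
    """Small feed-forward net that outputs one prediction per quantile."""

    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, len(QUANTILES)),
        )

    def forward(self, x):
        return self.net(x)


def pinball_loss(pred, target):
    """Quantile (pinball) loss, averaged over all quantiles."""
    losses = []
    for i, q in enumerate(QUANTILES):
        err = target - pred[:, i]
        losses.append(torch.max(q * err, (q - 1) * err).mean())
    return torch.stack(losses).mean()


def train(model, x, y, epochs=200, lr=1e-3):
    """x: influent features, y: measured aeration (Nm³/h)."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimiser.zero_grad()
        pinball_loss(model(x), y).backward()
        optimiser.step()
    return model


def flag_outliers(model, x, y):
    """A measurement is an outlier when it falls outside the predicted band."""
    with torch.no_grad():
        pred = model(x)
    return (y < pred[:, 0]) | (y > pred[:, -1])
```

Because the outer quantiles define the normal bandwidth directly, the same check covers drifts, shifts, peaks and flatlines without separate rules for each.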

The graph shows that a number of values are indicated as outliers after August 20th. It also shows that the average aeration is above the forecast from that date. To better represent this, we can look at the results based on the daily averages.


Figure D

Again, values that fall outside of the first limit are indicated as outliers. You can see that the daily average starts to exceed the expected bandwidth after August 20th. Therefore, with this model, we can conclude that there is likely to be something wrong with the aeration measurement process.
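
As a sketch of that daily-average check, assume the measurements and the model’s quantile predictions live in a pandas DataFrame with a timestamp index (the column names here are hypothetical):

```python
# A sketch of the daily-average check, assuming a DataFrame `df` with a
# timestamp index and hypothetical columns for the measured aeration and
# the predicted quantile band.
import pandas as pd


def daily_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Resample to daily means and flag days that leave the expected bandwidth."""
    daily = df[["aeration", "pred_lower", "pred_upper"]].resample("D").mean()
    daily["outlier"] = (daily["aeration"] < daily["pred_lower"]) | (
        daily["aeration"] > daily["pred_upper"]
    )
    return daily


# A run of consecutive flagged days, rather than a single peak, is what points
# to a measurement problem instead of a real process event.
# daily = daily_outliers(df)
# daily[daily["outlier"]]
```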

Quantile Regression Neural Networks

The great thing about a Quantile Regression Neural Network is that it can detect all kinds of errors, from drifts and baseline shifts to extreme values and flatlines. The big advantage of this is that you don’t have to devise separate rules for each possible error type.

It’s also a highly scalable solution, as the model learns from historical data over time. As a result, nothing needs to be fine-tuned – it happens automatically.

Once we have trained a Quantile Regression Neural Network, we can not only flag which values are outliers, but also replace them with predicted values, bringing greater accuracy to the purification process.
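
A minimal sketch of that replacement step, reusing the hypothetical columns from the previous example: flagged outliers are swapped for the model’s median prediction, while validated measurements pass through unchanged.

```python
# Replace flagged outliers with the model's median prediction
# (hypothetical column names, illustrative only).
import pandas as pd


def impute_outliers(df: pd.DataFrame) -> pd.DataFrame:
    cleaned = df.copy()
    cleaned["aeration_validated"] = cleaned["aeration"].where(
        ~cleaned["outlier"], cleaned["pred_median"]
    )
    return cleaned
```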

This ensures that treatment plants can continue to run in a stable and data-driven manner, even if sensor data isn’t always reliable.

Domain knowledge meets data science

It’s very difficult to determine whether certain values are accurate or not using sensor readings alone. And unless you verify accuracy, your sensor data loses some of its value, and can even mislead decision making.

However, by combining domain knowledge with the right data science techniques, validation is easier, faulty sensors can be detected more quickly, and your insights can consistently and accurately identify real problems – and dictate the appropriate actions to take.

If you want to get started validating the data from your sensors, get in touch. We’d be happy to discuss the possibilities.