NMFk: Unmixing Contaminated Groundwater

NMFk is a code within our award-winning SmartTensors framework for unsupervised, supervised, and physics-informed (scientific) machine learning (ML) and artificial intelligence (AI) (web source).

NMFk performs Nonnegative Matrix Factorization with $k$-means clustering.

The example presented here demonstrates how NMFk can be applied to unmix groundwater.

Here, the dataset analyzed by NMFk includes a series of hydrogeochemical concentrations observed at monitoring wells.

This type of analysis can also be called blind source separation or feature extraction.

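In general, applying NMFk, we can automatically estimate the number of signals mixed in a dataset, the signature of each signal, and how the signals are mixed at each observation point.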

Setup

If NMFk is not installed, first execute in the Julia REPL:

import Pkg
Pkg.add("NMFk")
Pkg.add("Mads")
Pkg.add("Cairo")
Pkg.add("Fontconfig")
Pkg.add("Gadfly")

Let us assume that multiple wells (monitoring points) simultaneously detect multiple signals (here, contaminant plumes with different chemical signatures).

Our goal is to automatically estimate the number of signals (plumes) using unsupervised machine learning (ML).

In the example presented here, we assume that there are 20 wells (monitoring points), 2 sources (contaminant plumes), and 3 chemical species (e.g., nitrate, sulfate, chloride) detected at each well.

These types of unsupervised machine learning analyses are applicable in many other situations where the mixing of signals (volumes/mass) is subject to volumetric or weight constraints.

For example, these types of problems occur in the case of mass transport in fluids (e.g., atmosphere, oceans, watersheds, oil/gas/geothermal reservoirs, aquifers, etc.).

Due to volumetric constraints, the mixing coefficients of the different sources (plumes) at each well (observation point) must add up to 1.

The mixing matrix W defines 2 mixing values for each of the 20 observation points, so it has 20 rows and 2 columns.

The 2 mixing values define how the 2 sources (signals) are mixed at each well.

The mixing values for each well (along each row) add up to 1.
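The construction of W is not shown in the text. A minimal sketch, assuming the mixing values are drawn at random and then normalized per row (the seed and the variable names `nWells` and `nSources` are illustrative assumptions), could look like this:

```julia
using Random

Random.seed!(2020)   # arbitrary seed, used only to make this sketch repeatable

nWells = 20      # observation points
nSources = 2     # contaminant plumes

# Draw uniform random values in [0, 1) ...
W = rand(nWells, nSources)
# ... and divide every row by its sum so the mixing values at each well add up to 1
W ./= sum(W; dims=2)

rowsums = vec(sum(W; dims=2))
println("all rows sum to 1: ", all(isapprox.(rowsums, 1.0)))
```

Any other nonnegative matrix whose rows sum to 1 would serve equally well for the synthetic problem.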

Let us also define H as a matrix that specifies the concentrations of the 3 chemical species present in the 2 contaminant sources.

Here, the first source (row 1) has elevated concentration for the first chemical component (100).

The second source (row 2) has elevated concentrations for the second (10) and third (20) chemical species.
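A matrix consistent with this description can be written down directly. In the sketch below, the values 100, 10, and 20 come from the text; the remaining entries (0, 3, and 5) are illustrative assumptions chosen to be small relative to the elevated ones.

```julia
# Rows are sources, columns are chemical species.
# 100, 10 and 20 are the elevated values quoted in the text;
# the other entries (0, 3, 5) are assumed for illustration.
H = [100.0  0.0  3.0;
       5.0 10.0 20.0]

println("sources x species: ", size(H))
```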

Now, for this synthetic problem, we can compute the synthetic observed concentrations at each of the wells, X, as the matrix product of W and H.
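The mixing step is a single matrix product. The sketch below rebuilds W and H under the same assumptions as the rest of this example (random row-normalized W; H entries other than 100, 10, and 20 are assumed) and checks that each row of X is the weighted average of the two source signatures.

```julia
using Random

Random.seed!(2020)                      # arbitrary seed for repeatability
W = rand(20, 2)
W ./= sum(W; dims=2)                    # mixing values per well add up to 1
H = [100.0 0.0 3.0; 5.0 10.0 20.0]      # entries other than 100, 10, 20 are assumed

# Observed concentrations: 20 wells by 3 species
X = W * H

# Row 1 of X is the mix of the two source signatures at well 1
well1 = W[1, 1] .* H[1, :] .+ W[1, 2] .* H[2, :]
println("row 1 matches weighted average: ", isapprox(X[1, :], well1))
```

Because the mixing values in each row of W add up to 1, every concentration in X lies between the corresponding concentrations of the two sources.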

We can also make some of the observations "missing" by setting the respective matrix entries to NaN.
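As an illustration, the sketch below marks one entry of a small hypothetical 3-well dataset as missing. Which entries are missing is an arbitrary choice here; the text does not say which ones were used.

```julia
W = [0.2 0.8; 0.7 0.3; 0.5 0.5]         # hypothetical 3-well mixing matrix
H = [100.0 0.0 3.0; 5.0 10.0 20.0]      # entries other than 100, 10, 20 are assumed
X = W * H                                # Float64 matrix, so it can hold NaN

X[1, 1] = NaN                            # arbitrary choice of the missing observation

missing_mask = isnan.(X)                 # true where an observation is missing
println(count(missing_mask), " of ", length(X), " entries are missing")
```

The matrix must have a floating-point element type for this to work; assigning NaN into an integer matrix raises an error in Julia.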

Analysis

Now, assuming that only X is known, NMFk can estimate the unknown W and H.

NMFk automatically estimates the number of signals (features; here, contaminant sources) present in the analyzed dataset X.

The number of signals is equal to the number of columns in W and the number of rows in H.
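To make the factorization step concrete, here is a minimal sketch of a plain nonnegative matrix factorization using Lee-Seung multiplicative updates for a fixed number of signals k. This is not the NMFk implementation: it does not choose k, does not handle NaN entries, and does not enforce the sum-to-one constraint on the rows of W. The function name `nmf_sketch` and its parameters are hypothetical.

```julia
using Random, LinearAlgebra

# Plain multiplicative-update NMF for a fixed k; NOT the NMFk algorithm.
function nmf_sketch(X::AbstractMatrix, k::Integer; iterations::Integer=5000, seed::Integer=1)
    rng = MersenneTwister(seed)
    n, m = size(X)
    W = rand(rng, n, k) .+ 0.1           # strictly positive random initial guess
    H = rand(rng, k, m) .+ 0.1
    tiny = 1e-12                         # guards against division by zero
    for _ in 1:iterations
        H .*= (W' * X) ./ (W' * W * H .+ tiny)
        W .*= (X * H') ./ (W * H * H' .+ tiny)
    end
    return W, H
end

# Synthetic data under the same assumptions as the rest of this example
Random.seed!(2020)
Wtrue = rand(20, 2)
Wtrue ./= sum(Wtrue; dims=2)
Htrue = [100.0 0.0 3.0; 5.0 10.0 20.0]   # entries other than 100, 10, 20 are assumed
X = Wtrue * Htrue

West, Hest = nmf_sketch(X, 2)
relerr = norm(X - West * Hest) / norm(X)
println("size of West: ", size(West), ", size of Hest: ", size(Hest))
println("relative reconstruction error: ", relerr)
```

With k = 2, the estimated W has 2 columns and the estimated H has 2 rows, matching the statement above. Because the factors are only determined up to scaling and ordering in this plain sketch, `West` and `Hest` need not equal `Wtrue` and `Htrue` even when the product reproduces X well.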

NMFk computes the number of signals (features) based on silhouettes of the k-means clusters obtained from a series of NMF solutions using random initial guesses (see the papers and presentations at http://SmartTensors.com for more details).
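For readers unfamiliar with silhouettes, the sketch below computes the standard silhouette value of each point for a given clustering. It is a generic illustration, not NMFk code: the helper name `silhouette_values` is hypothetical, and the use of Euclidean distance is an assumption (the distance metric NMFk uses is not stated here).

```julia
using LinearAlgebra, Statistics

# Hypothetical helper: silhouette value of every column of `points`
# for the given integer cluster `labels` (at least 2 clusters required).
function silhouette_values(points::AbstractMatrix, labels::AbstractVector{<:Integer})
    n = size(points, 2)
    clusters = unique(labels)
    s = zeros(n)
    for i in 1:n
        d = [norm(points[:, i] - points[:, j]) for j in 1:n]
        own = [j for j in 1:n if labels[j] == labels[i] && j != i]
        isempty(own) && continue         # singleton cluster: value left at 0
        a = mean(d[own])                 # mean distance within the point's own cluster
        # smallest mean distance to any other cluster
        b = minimum(mean(d[labels .== c]) for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    end
    return s
end

# Two tight, well-separated groups of points (illustrative data)
points = [0.0 0.1 10.0 10.1;
          0.0 0.0  0.0  0.0]
labels = [1, 1, 2, 2]
s = silhouette_values(points, labels)
println("smallest silhouette: ", minimum(s))
```

Values close to 1 indicate compact, well-separated clusters. In the NMFk setting, the clustered objects are the signals recovered by repeated NMF runs, so high silhouettes suggest that the runs reproduce the same signals for that number of sources.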

NMFk execution returns estimates of W and H (We and He). These are vectors of matrices, one for each number of sources tried; for example, We[2], We[3], ... and He[2], He[3], ... .

The NMFk analyses are performed for the number of sources varying between 2 and 5 (2:5).

NMFk also returns the reconstruction error (fit); here, fit is the Frobenius norm of the difference between the known X and the estimated Xe, where Xe is computed as, for example, We[2] * He[2].
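The reconstruction error can be sketched in a few lines. The helper `frobenius_fit` below is hypothetical and not part of NMFk; skipping NaN (missing) entries is an assumption, since the text does not state how missing observations enter the fit.

```julia
# Hypothetical helper: Frobenius norm of X - We * He,
# ignoring entries where X is NaN (an assumption about missing data).
function frobenius_fit(X::AbstractMatrix, We::AbstractMatrix, He::AbstractMatrix)
    R = X .- We * He                     # residual; NaN wherever X is NaN
    return sqrt(sum(abs2, R[.!isnan.(R)]))
end

W = [0.2 0.8; 0.7 0.3; 0.5 0.5]         # hypothetical 3-well mixing matrix
H = [100.0 0.0 3.0; 5.0 10.0 20.0]      # entries other than 100, 10, 20 are assumed
X = W * H
X[1, 1] = NaN                            # one missing observation

fit_exact = frobenius_fit(X, W, H)       # factors that reproduce X
fit_shifted = frobenius_fit(X, W, H .+ 1.0)   # deliberately wrong H
println("fit with exact factors: ", fit_exact)
println("fit with shifted H: ", fit_shifted)
```

A smaller value means the product of the estimated factors is closer to the observed data.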

NMFk also provides the silhouettes of the k-means clusters (sil).

fit and sil are vectors with one estimate for each number of sources tried.

The automatically estimated number of features (plumes) kopt is returned as well.

In the example above, kopt is equal to 2, which matches the true number of sources (treated as unknown in the analysis).

The estimated concentrations of geochemical species for the 2 sources are stored in He[kopt].

The concentration estimates are very close to the actual values H (note that the matrix rows are not expected to be in the same order for estimated and true H matrices).

The estimated mixing coefficients at the 20 wells are stored in We[kopt].

They are also very close to the actual values (note that the matrix columns are not expected to be in the same order for estimated and true W matrices).
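The ordering caveat follows from the factorization itself: swapping the columns of W together with the matching rows of H leaves the product unchanged, so no analysis of X alone can recover the order. A small sketch with a hypothetical 3-well mixing matrix (H entries other than 100, 10, and 20 are assumed):

```julia
W = [0.2 0.8; 0.7 0.3; 0.5 0.5]         # hypothetical 3-well mixing matrix
H = [100.0 0.0 3.0; 5.0 10.0 20.0]      # entries other than 100, 10, 20 are assumed

perm = [2, 1]                            # swap the two sources
Wp = W[:, perm]                          # reorder the columns of W
Hp = H[perm, :]                          # reorder the rows of H the same way

println("same product after swapping: ", isapprox(W * H, Wp * Hp))
```

When comparing estimates with the true matrices, the rows of the estimated H (and the columns of the estimated W) therefore need to be matched to the true ones before the values are compared.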

These matrices can also be easily plotted.

In conclusion, NMFk successfully estimated the number of plumes, the original concentrations at the contaminant sources (before mixing in the aquifer), and the mixing coefficients at each well.

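The NMFk unsupervised ML analyses are unbiased, in that the number of sources is estimated from the data rather than prescribed, and robust, in that the estimate is based on a series of NMF solutions started from random initial guesses.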