Let us assume that there are multiple wells (monitoring points) detecting simultaneously multiple signals (here, contaminant plumes with different chemical signatures). Our goal is to estimate automatically the number of signals (plumes) using unsupervised machine learning (ML) Here, we apply our method called NMFk which uses NMF coupled with k-means clustering. For more information about NMFk and our unsupervised ML methods in general check http://tensors.lanl.gov.
In the example presented here, we assume that there are 20 wells (monitoring points), 2 sources (contaminant plumes), and 3 chemical species (e.g., nitrate, sulfate, chloride) detected at each well.
These types of unsupervised machine learning analyses are applicable in many other situations where the mixing of signals (volumes/mass) is constrained by volumetric or weight constraints. For example, these types of problems occur in the case of mass transport in fluids (e.g., atmosphere, oceans, watersheds, oil/gas/geothermal reservoirs, aquifers, etc.).
import NMFk
import Random
nWells = 20
nSources = 2
nSpecies = 3
Random.seed!(2015);
Due to volumetric constraints, mixing coefficients of the different sources (plumes) at each of well (observation point) have to add to 1.
The mixing matrix W
is defined as:
W = rand(nWells, nSources)
for i = 1:nWells
W[i, :] ./= sum(W[i, :])
end
display(W)
Here, W
defines 2 mixing values for each of the 20 observation points.
The 2 mixing values define how the 2 sources (signals) are mixed at each well.
The mixing values for each well (along each row) add up to 1.
Let us also define H
as a matrix which specifies the concentrations of the 3 chemical species present in the 2 contaminant sources:
H = [100 0 3; 5 10 20]
Here, the first source (row 1) has elevated concentration for the first chemical component (100).
The second source (row 2) has elevated concentrations for the second (10) and third (20) chemical species.
Now for this synthetic problem, we can compute the synthetic observed concentrations at each the wells X
as:
X = W * H
We can also make some of the observations ''missing'' by setting respective matrix entries equal to NaN:
X[1, 1] = NaN
display(X)
Now assuming that only X
is known, NMFk can estimate unknown W
and H
.
NMFk estimates automatically the number signals (features; here, contaminant sources) present in the analyzed dataset X
.
The number of signals is equal to the number of columns in W
and the number of rows in H
.
NMFk computes the number signals (features) based on silhouettes of the k-means clusters obtained from a series of NMF solutions using random initial guesses (see the papers and presentations at http://tensors.lanl.gov for more details).
NMFk execution produces:
We, He, fit, sil, aic, kopt = NMFk.execute(X, 2:10; mixture=:mixmatch);
Above, NMFk provides estimates of W
and H
(We
and He
).
These are vectors of matrices for a different number of sources. For example, We[2]
, We[3]
, ..., He[2]
, He[3]
, ... .
The NMFk analyses are performed for the number of sources varying between 2 and 10 (2:10
).
NMFk also returns the reconstruction error (fit
); here, fit
is a Frobenius norm between the known X
and estimated Xe
where Xe
is computed as, for example, We[2] * He[2]
.
NMFk also provides the silhouettes of the k-means clusters (sil
)
fit
and sil
are vectors with estimates for a different number of sources.
The automatically estimated number of features (plumes) kopt
is returned as well.
In the example above, kopt
is equal to 2 which matches the true unknown number of sources.
The estimated concentrations of the 3 sources are:
He[2]
The concentration estimates are very close to the actual values H
:
H
The estimated mixing coefficient at the 20 wells are:
We[2]
They are also very close to the actual values:
W
These matrices can be also easily ploted:
display(NMFk.plotmatrix(We[2]'))
display(NMFk.plotmatrix(W'))
display(NMFk.plotmatrix(He[2]))
display(NMFk.plotmatrix(H))
In conclusion, NMFk successful estimated the number of plumes, the original concentrations at the contaminant sources (before mixing in the aquifer), and the mixing coefficients at each well.
The NMFk unsupervised ML analyses are unbiased and robust.