After running two community competitions on predictive soil spectroscopy on Kaggle, we are pleased to announce the winners, selected based on the Private leaderboard results and the competitors’ willingness to publish reproducible code online.
In the near-infrared (NIR) + geocovariates competition for predicting soil organic carbon (SOC), the winners are:
- SAIL team, composed of Mahamad Salah Mahmoud and Nikos Tziolas from the University of Florida, USA. In their approach, Mahamad and Nikos proposed a dual-channel one-dimensional (1D) convolutional neural network (CNN) architecture that processes the spectra and the geocovariates in separate channels, which then feed a final artificial neural network (ANN) that predicts soil organic carbon (SOC). This framework achieved a final RMSE of 0.255 on the log-SOC scale and is currently being turned into a publication, with the code to be produced jointly by the OSSL team and the winners.
- David Schurman, Co-Founder and CTO at Perennial (a private digital soil carbon mapping startup), USA. In his approach, David applies first-derivative preprocessing to the spectra and fits clustered eXtreme Gradient Boosting (XGBoost) regressions. Training samples were clustered on their geocovariates with the k-means algorithm. The rationale is to use the geocovariates as contextual information for establishing which soil measurements are likely to be most similar to each other. This resulted in a final RMSE of 0.259 on the log-SOC scale.
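The model-localization idea in the second entry (cluster the training samples on their geocovariates, then fit one model per cluster on first-derivative spectra) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the winner's code: the derivative is a simple finite difference, the k-means is a basic Lloyd's loop, and `mean_regressor` is a hypothetical stand-in for the XGBoost regressors used in the actual entry. All function names are invented for this example.

```python
import random


def first_derivative(spectrum):
    """Finite-difference first derivative of one spectrum (a list of floats)."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers, allowed=None):
    """Index of the closest center, optionally restricted to `allowed` indices."""
    candidates = range(len(centers)) if allowed is None else allowed
    return min(candidates, key=lambda c: squared_distance(point, centers[c]))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; initial centers are k randomly sampled points."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, [nearest(p, centers) for p in points]


def mean_regressor(features, targets):
    """Hypothetical stand-in learner: ignores the features, predicts the mean.

    A real pipeline would plug a gradient-boosting regressor in here.
    """
    mean_y = sum(targets) / len(targets)
    return lambda x: mean_y


def fit_clustered(spectra, geocovariates, targets, k, fit=mean_regressor):
    """Cluster samples on geocovariates, then fit one local model per cluster."""
    features = [first_derivative(s) for s in spectra]
    centers, labels = kmeans(geocovariates, k)
    models = {}
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        if idx:
            models[c] = fit([features[i] for i in idx], [targets[i] for i in idx])
    return centers, models


def predict_clustered(spectrum, geocovariate, centers, models):
    """Route a new sample to the nearest cluster that has a fitted model."""
    cluster = nearest(geocovariate, centers, allowed=list(models))
    return models[cluster](first_derivative(spectrum))
```

At prediction time, a new sample is routed by its geocovariates alone, so samples from a similar environmental context share a local model. The `fit` argument is where a stronger learner would go.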
Based on the private leaderboard standing (Figure 1), combining geocovariates with the NIR spectra (as both winners did) reduces RMSE by ~25% compared with the current OSSL framework, which relies solely on spectral variation. Furthermore, the winners implemented more sophisticated approaches, such as the dual-channel convolutional neural network (SAIL team) and model localization by k-means clustering (David Schurman).
Figure 1. Final standing (based on the private category) of the near-infrared (NIR) + geocovariates community competition for predicting soil organic carbon (SOC): 1st SAIL team and 2nd David Schurman. The submission from the OSSL modeling framework is included for comparison.
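For readers who want to reproduce this kind of comparison, the sketch below computes an RMSE on a log-transformed SOC scale and the relative reduction between two scores. It assumes the log-SOC scale means a natural log of (1 + SOC), which the text above does not spell out; the function names are illustrative only.

```python
import math


def rmse_log_scale(observed_soc, predicted_soc):
    """RMSE after a log1p transform (assumed transform; SOC values must be >= 0)."""
    errors = [
        math.log1p(obs) - math.log1p(pred)
        for obs, pred in zip(observed_soc, predicted_soc)
    ]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def relative_reduction(baseline_rmse, new_rmse):
    """Percentage by which new_rmse improves on baseline_rmse."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse
```

With purely hypothetical scores, `relative_reduction(2.0, 1.5)` gives a 25% reduction; the same arithmetic applies to any pair of leaderboard RMSE values.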
In the mid-infrared (MIR) competition for predicting clay content across dissimilar instruments, the winners are:
- UTM Ecuador, composed of Jorge Párraga and Lizardo Reyna from Universidad Técnica de Manabí (UTM), Ecuador. In their solution, the winners prepared a golden training subset using samples with a precise summation of the texture components, applied a Savitzky-Golay smoothing filter, and used Partial Least Squares Regression as the learning algorithm. This resulted in a final RMSE of 8.325 (wt %).
- SoilAnalyzers, composed of Sharad Kumar Gupta from the Helmholtz Centre for Environmental Research (UFZ), Germany, and Ellur Rajath from the University of Agricultural Sciences Bangalore, India. In their framework, the team preprocesses the spectra with Savitzky-Golay smoothing and a first-derivative filter, and calibrates a model with a Gradient Boosting Regressor. This resulted in a final RMSE of 8.739 (wt %).
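The screening and preprocessing steps named in these two entries can be illustrated with a short, dependency-free sketch. It makes several assumptions that are not from the winners' code: the "precise summation" criterion is read as sand + silt + clay adding up to 100% within a tolerance, the Savitzky-Golay filter uses a fixed 5-point window with a quadratic fit (standard tabulated coefficients), and edge points are simply dropped. The dictionary keys and function names are hypothetical.

```python
# Savitzky-Golay convolution coefficients for a 5-point window, quadratic fit.
SMOOTH_5_2 = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]
DERIV1_5_2 = [-2 / 10, -1 / 10, 0.0, 1 / 10, 2 / 10]  # per unit band spacing


def apply_filter(spectrum, coefficients):
    """Slide the coefficient window over the spectrum; edge points are dropped."""
    half = len(coefficients) // 2
    return [
        sum(c * spectrum[i + j - half] for j, c in enumerate(coefficients))
        for i in range(half, len(spectrum) - half)
    ]


def golden_subset(samples, tolerance=0.5):
    """Keep samples whose texture fractions (wt %) sum to 100 within tolerance.

    Each sample is a dict with hypothetical keys "sand", "silt", and "clay".
    """
    return [
        s for s in samples
        if abs(s["sand"] + s["silt"] + s["clay"] - 100.0) <= tolerance
    ]


# Example: smooth a spectrum, then take its first derivative.
spectrum = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0]
smoothed = apply_filter(spectrum, SMOOTH_5_2)
derivative = apply_filter(spectrum, DERIV1_5_2)
```

In practice a library routine such as SciPy's `savgol_filter` would usually replace the hand-written convolution, and the filtered spectra would then be passed to the chosen learner (PLS regression or gradient boosting in the entries above).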
Final scores were calculated from the pooled predictions across all instruments (ALL category in Figure 2). In this competition, UTM Ecuador had the best-performing and most consistent model across dissimilar instruments, while the other entries, including the current OSSL modeling framework, still had issues with instrument #4.
The following FTIR MIR instruments composed the test set:
- INST1: Bruker Alpha.
- INST2: Perkin Elmer Spectrum 100.
- INST3: Bruker Invenio.
- INST4: Thermo Fisher Nicolet.
- INST5: Agilent 4300.
Figure 2. Final standing of the mid-infrared (MIR) community competition for predicting clay content across dissimilar instruments: 1st UTM Ecuador and 2nd SoilAnalyzers. The submissions from Henning Teickner and the OSSL modeling framework are included for comparison.
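To make the scoring concrete, the sketch below computes an RMSE per instrument alongside a pooled score over all predictions, mirroring the per-instrument and ALL categories in Figure 2. It is an illustration only: the record layout and function names are assumptions, and the competition's own scoring code may differ.

```python
import math
from collections import defaultdict


def rmse(observed, predicted):
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def score_by_instrument(records):
    """records: iterable of (instrument, observed_clay, predicted_clay) tuples."""
    records = list(records)
    groups = defaultdict(lambda: ([], []))
    for instrument, obs, pred in records:
        groups[instrument][0].append(obs)
        groups[instrument][1].append(pred)
    # One score per instrument, plus the pooled score over every prediction.
    scores = {inst: rmse(obs, pred) for inst, (obs, pred) in groups.items()}
    scores["ALL"] = rmse([r[1] for r in records], [r[2] for r in records])
    return scores
```

Because the pooled score averages squared errors over every sample, a single poorly handled instrument can raise the ALL score even when the other instruments are predicted well.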
The results from the MIR competition seem to highlight that relatively simple models can perform well with the right data screening and pre-processing.
All the winners will receive a PeerJ lifetime subscription, which allows up to 2 publications per year. We also greatly appreciate everyone who took the time to submit an entry, as well as the release of reproducible code by other participants (Daori Han and Henning Teickner).
Most of the code implementations and prepared datasets are provided in the Data and Code panels of the Kaggle platform. You can also find the original input datasets in the OSSL Database and learn more about the Ring Trial in this study.
And yes, that is a DALL-E-generated feature image, prompted with “soil spectroscopy data science competition”.