Derek Lowe’s blog on drug discovery and the pharma industry, "In the Pipeline", features an article published in ACS Omega, "Machine Learning C–N Couplings: Obstacles for a General-Purpose Reaction Yield Prediction".
In the paper, the authors, from the pharmaceutical company La Roche, deliberately take on a difficult synthetic challenge: predicting yields (and thus suggesting reaction conditions) for metal-catalyzed C–N couplings.
In his blog, Lowe, with his usual wit, amplifies the authors' conclusions on the importance of feeding the model "good" data. Notably, negative results are just as important as positive ones, but they rarely reach publication. "We now realize that these negative results are gold, not garbage, when it comes to training machine-learning models."
Additionally, results have to be in a comparable format. "We are, in the end, probably going to have to turn the robots loose and replicate big swaths of the literature under controlled conditions. The machines are going to force us to get our house in order."