Source Dossier Tier 2 Unknown Active

Google AI Blog

blog.research.google • AI

Observations

Tracked in signal graph

Last Seen

Mar 29, 2024

Recent source activity

Perspective Context

No political-bias profile stored yet. Fulqrum still tracks provenance and evidentiary context.

Provenance & Access

Domain group

Source family

Unknown

Parser

Atom

Provider

Unknown

Provenance

Unknown

Evidence weight

Unknown

Access

Unknown

Geography

Global

Language scope

Unknown

Update cadence

Unknown

Recent Activity

First seen Jan 23, 2024

Generative AI to quantify uncertainty in weather forecasting

<span class="byline-author">Posted by Lizao (Larry) Li, Software Engineer, and Rob Carver, Research Scientist, Google Research</span> <img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglI5U51vvhkA4cAuVvMLn0TbbL5pdlFL-LO1sNnqLyUieA6A88I5HrhJlszxR1GKQqSK5wsdlATDKSy6EC1BsNF7tzS6oVlFLtau13mVFLk954nFu85HDMP3PrQboG4eXExEtUjEuDRFpcrMqE_F0ikSwXiWBECAfJiLbjr6h6523DROJkbC284xX35zC7/s1000/image3.gif" style="display: none;" /> <p> Accurate weather forecasts can have a direct impact on people’s lives, from helping make routine decisions, like what to pack for a day’s activities, to informing urgent actions, for example, protecting people in the face of hazardous weather conditions. The importance of accurate and timely weather forecasts will only increase as the climate changes. Recognizing this, we at Google have been investing in weather and climate research to help ensure that the forecasting technology of tomorrow can meet the demand for reliable weather information. Some of our recent innovations include <a href="https://blog.research.google/2023/11/metnet-3-state-of-art-neural-weather.html">MetNet-3</a>, Google's high-resolution forecasts up to 24 hours into the future, and <a href="https://deepmind.google/discover/blog/graphcast-ai-model-for-faster-and-more-accurate-global-weather-forecasting/">GraphCast</a>, a weather model that can predict weather up to 10 days ahead. </p> <a name='more'></a> <p> Weather is inherently stochastic. To quantify the uncertainty, traditional methods rely on physics-based simulation to generate an ensemble of forecasts. However, generating an ensemble large enough to discern and accurately characterize rare and extreme weather events is computationally costly. 
</p> <p> With that in mind, we are excited to announce our latest innovation designed to accelerate progress in weather forecasting, <a href="https://www.science.org/doi/10.1126/sciadv.adk4489">Scalable Ensemble Envelope Diffusion Sampler</a> (SEEDS), recently published in <em><a href="https://www.science.org/journal/sciadv">Science Advances</a></em>. SEEDS is a generative AI model that can efficiently generate ensembles of weather forecasts <em>at scale </em>at a small fraction of the cost of traditional physics-based forecasting models. This technology opens up novel opportunities for weather and climate science, and it represents one of the first applications to weather and climate forecasting of probabilistic diffusion models, a generative AI technology behind recent advances in media generation. </p> <br /> <h2>The need for probabilistic forecasts: the butterfly effect</h2> <p> In December 1972, at the <a href="https://www.aaas.org/">American Association for the Advancement of Science</a> meeting in Washington, D.C., MIT meteorology professor <a href="https://en.wikipedia.org/wiki/Edward_Norton_Lorenz">Ed Lorenz</a> gave a talk entitled, “Does the Flap of a Butterfly's Wings in Brazil Set Off a Tornado in Texas?” which contributed to the term “<a href="https://en.wikipedia.org/wiki/Butterfly_effect">butterfly effect</a>”. He was building on his earlier, landmark 1963 paper where he examined the feasibility of “very-long-range weather prediction” and described how errors in initial conditions grow exponentially when integrated in time with numerical weather prediction models. This exponential error growth, known as chaos, results in a deterministic predictability limit that restricts the use of individual forecasts in decision making, because they do not quantify the inherent uncertainty of weather conditions. This is particularly problematic when forecasting extreme weather events, such as hurricanes, heatwaves, or floods. 
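</p> <p> The exponential error growth described above can be reproduced in a few lines. The following minimal sketch integrates the three-variable system from Lorenz's 1963 paper twice, from initial states that differ by 10<sup>-8</sup> in one coordinate, and records how far apart the two trajectories drift. The classic parameter values (σ = 10, ρ = 28, β = 8/3), the fixed-step fourth-order Runge-Kutta integrator, the step size, and all function names are illustrative choices, not part of SEEDS or of any operational weather model. </p>

```python
import math


def lorenz63(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Time derivative of the three-variable Lorenz (1963) system."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(state, dt):
    """Advance the state by one fixed-size fourth-order Runge-Kutta step."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = lorenz63(state)
    k2 = lorenz63(shifted(state, k1, dt / 2))
    k3 = lorenz63(shifted(state, k2, dt / 2))
    k4 = lorenz63(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def separation_over_time(perturbation=1e-8, dt=0.01, steps=2500, every=500):
    """Integrate two nearly identical states and record their distance."""
    a = (1.0, 1.0, 1.0)
    b = (1.0 + perturbation, 1.0, 1.0)  # tiny error in the initial condition
    history = [(0.0, math.dist(a, b))]
    for i in range(1, steps + 1):
        a, b = rk4_step(a, dt), rk4_step(b, dt)
        if i % every == 0:
            history.append((i * dt, math.dist(a, b)))
    return history


if __name__ == "__main__":
    for t, d in separation_over_time():
        print(f"t = {t:5.1f}  separation = {d:.3e}")
```

<p> Running the script prints the separation between the two runs at regular intervals. It typically grows by many orders of magnitude before levelling off at roughly the size of the attractor, which is why a single deterministic run says little about forecast uncertainty at long lead times. 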
</p> <p> Recognizing the limitations of deterministic forecasts, weather agencies around the world issue <em>probabilistic forecasts</em>. Such forecasts are based on ensembles of deterministic forecasts, each of which is generated by including synthetic noise in the initial conditions and stochasticity in the physical processes. Leveraging the fast error growth rate in weather models, the forecasts in an ensemble are purposefully different: the initial uncertainties are tuned to generate runs that are as different as possible and the stochastic processes in the weather model introduce additional differences during the model run. The error growth is mitigated by averaging all the forecasts in the ensemble and the variability in the ensemble of forecasts quantifies the uncertainty of the weather conditions. </p> <p> While effective, generating these probabilistic forecasts is computationally costly. They require running highly complex numerical weather models on massive supercomputers multiple times. Consequently, many operational weather forecasts can only afford to generate ~10–50 ensemble members for each forecast cycle. This is a problem for users concerned with the likelihood of rare but high-impact weather events, which typically require much larger ensembles to assess beyond a few days. For instance, one would need a 10,000-member ensemble to forecast the likelihood of events with 1% probability of occurrence with a relative error less than 10%. Quantifying the probability of such extreme events could be useful, for example, for emergency management preparation or for energy traders. </p> <br /> <h2>SEEDS: AI-enabled advances</h2> <p> In the aforementioned <a href="https://www.science.org/doi/10.1126/sciadv.adk4489">paper</a>, we present the Scalable Ensemble Envelope Diffusion Sampler (SEEDS), a generative AI technology for weather forecast ensemble generation. 
SEEDS is based on <a href="https://blog.research.google/2021/07/high-fidelity-image-generation-using.html">denoising diffusion probabilistic</a> models, a state-of-the-art generative AI method pioneered in part by Google Research. </p> <p> SEEDS can generate a large ensemble conditioned on as few as one or two forecasts from an operational numerical weather prediction system. The generated ensembles not only yield plausible real-weather–like forecasts but also match or exceed physics-based ensembles in skill metrics such as the <a href="https://www.jstor.org/stable/26201352">rank histogram</a>, the <a href="https://en.wikipedia.org/wiki/Root-mean-square_deviation">root-mean-squared error</a> (RMSE), and the <a href="https://www.tandfonline.com/doi/abs/10.1198/016214506000001437">continuous ranked probability score</a> (CRPS). In particular, the generated ensembles assign more accurate likelihoods to the tail of the forecast distribution, such as ±2σ and ±3σ weather events. Most importantly, the computational cost of the model is negligible when compared to the hours of computational time needed by supercomputers to make a forecast. It has a throughput of 256 ensemble members (at 2° resolution) per 3 minutes on Google Cloud TPUv3-32 instances and can easily scale to higher throughput by deploying more accelerators. 
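</p> <p> For readers unfamiliar with the skill metrics named above, the following minimal sketch computes them for a scalar forecast variable. It uses the textbook sample-based CRPS estimate, E|X − y| − ½·E|X − X′|, the rank of the observation within the ensemble (one count in a rank histogram), and the RMSE of the ensemble mean. The function names and the toy data are assumptions made for illustration; the evaluation code used in the paper may differ. </p>

```python
import random


def ensemble_crps(members, observation):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(members)
    accuracy = sum(abs(x - observation) for x in members) / n
    spread = sum(abs(xi - xj) for xi in members for xj in members) / (n * n)
    return accuracy - 0.5 * spread


def observation_rank(members, observation):
    """Number of members below the observation: one rank-histogram count."""
    return sum(1 for x in members if x < observation)


def ensemble_mean_rmse(ensembles, observations):
    """RMSE of the ensemble mean over a collection of forecast cases."""
    squared_errors = [
        (sum(members) / len(members) - y) ** 2
        for members, y in zip(ensembles, observations)
    ]
    return (sum(squared_errors) / len(squared_errors)) ** 0.5


if __name__ == "__main__":
    rng = random.Random(0)
    truth = 1.0
    centered = [rng.gauss(1.0, 0.5) for _ in range(200)]  # toy ensemble near truth
    biased = [rng.gauss(3.0, 0.5) for _ in range(200)]    # toy ensemble far from truth
    print("CRPS, centered ensemble:", round(ensemble_crps(centered, truth), 3))
    print("CRPS, biased ensemble:  ", round(ensemble_crps(biased, truth), 3))
    print("rank of truth in centered ensemble:", observation_rank(centered, truth))
```

<p> Lower CRPS is better, and for a single-member ensemble the estimate reduces to the absolute error. A flat rank histogram over many forecast cases indicates that the observation is statistically indistinguishable from the ensemble members. 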
</p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglI5U51vvhkA4cAuVvMLn0TbbL5pdlFL-LO1sNnqLyUieA6A88I5HrhJlszxR1GKQqSK5wsdlATDKSy6EC1BsNF7tzS6oVlFLtau13mVFLk954nFu85HDMP3PrQboG4eXExEtUjEuDRFpcrMqE_F0ikSwXiWBECAfJiLbjr6h6523DROJkbC284xX35zC7/s1000/image3.gif" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="470" data-original-width="1000" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglI5U51vvhkA4cAuVvMLn0TbbL5pdlFL-LO1sNnqLyUieA6A88I5HrhJlszxR1GKQqSK5wsdlATDKSy6EC1BsNF7tzS6oVlFLtau13mVFLk954nFu85HDMP3PrQboG4eXExEtUjEuDRFpcrMqE_F0ikSwXiWBECAfJiLbjr6h6523DROJkbC284xX35zC7/s16000/image3.gif" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">SEEDS generates an order of magnitude more samples to in-fill distributions of weather patterns.</td></tr></tbody></table> <div style="line-height: 40%;"> <br /> </div> <h2>Generating plausible weather forecasts</h2> <p> Generative AI is known for generating very detailed images and videos. This property is especially useful for generating ensemble forecasts that are consistent with plausible weather patterns, which ultimately result in the most added value for downstream applications. As Lorenz points out, “The [weather forecast] maps which they produce should look like real weather maps.” The figure below contrasts the forecasts from SEEDS with those from the operational U.S. weather prediction system (<a href="https://www.emc.ncep.noaa.gov/emc/pages/numerical_forecast_systems/gefs.php">Global Ensemble Forecast System</a>, GEFS) for a particular date during the <a href="https://en.wikipedia.org/wiki/2022_European_heatwaves">2022 European heat waves</a>. 
We also compare the results to the forecasts from a Gaussian model that predicts the univariate mean and standard deviation of each atmospheric field at each location, a common and computationally efficient but less sophisticated data-driven approach. This Gaussian model is meant to characterize the output of pointwise post-processing, which ignores correlations and treats each grid point as an independent random variable. In contrast, a real weather map would have detailed <em>correlational</em> structures. </p> <p> Because SEEDS directly models the joint distribution of the atmospheric state, it realistically captures both the spatial covariance and the correlation between mid-tropospheric geopotential and mean sea level pressure, both of which are closely related and are commonly used by weather forecasters for evaluation and verification of forecasts. Gradients in the mean sea level pressure are what drive winds at the surface, while gradients in mid-tropospheric geopotential create upper-level winds that move large-scale weather patterns. </p> <p> The generated samples from SEEDS shown in the figure below (frames Ca–Ch) display a geopotential trough west of Portugal with spatial structure similar to that found in the operational U.S. forecasts or the reanalysis based on observations. Although the Gaussian model predicts the marginal univariate distributions adequately, it fails to capture cross-field or spatial correlations. This hinders the assessment of the effects that these anomalies may have on hot air intrusions from North Africa, which can exacerbate heat waves over Europe. 
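</p> <p> The difference between pointwise and joint modeling can be seen in a toy setting. In the following minimal sketch, two correlated Gaussian variables stand in for a pair of related quantities (for example, values at two neighboring grid points). A "pointwise" model fits a mean and standard deviation to each variable separately and samples them independently. The setup, the correlation value, and the function names are assumptions for illustration only and do not reflect how SEEDS or the Gaussian baseline in the paper is implemented. </p>

```python
import random
import statistics


def sample_joint(n, rho, rng):
    """Draw n pairs from a standard bivariate Gaussian with correlation rho."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        pairs.append((z1, rho * z1 + (1.0 - rho ** 2) ** 0.5 * z2))
    return pairs


def sample_pointwise(pairs, n, rng):
    """Fit mean/std per variable, then sample each variable independently."""
    xs, ys = zip(*pairs)
    mx, sx = statistics.fmean(xs), statistics.stdev(xs)
    my, sy = statistics.fmean(ys), statistics.stdev(ys)
    return [(rng.gauss(mx, sx), rng.gauss(my, sy)) for _ in range(n)]


def correlation(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    rng = random.Random(0)
    joint = sample_joint(5000, 0.9, rng)
    pointwise = sample_pointwise(joint, 5000, rng)
    print("correlation, joint samples:    ", round(correlation(joint), 3))
    print("correlation, pointwise samples:", round(correlation(pointwise), 3))
```

<p> The pointwise samples reproduce each marginal distribution but lose the dependence between the two variables, which is the toy analogue of the missing spatial and cross-field structure discussed above. 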
</p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQE94TGK404COMAKKxaPwUO9bD8gIzQfu6A0u5c-5xbGKhlUtBW_0KAj-Ur8kpgt5_f-IjAuFzeecpRbbWVujZNQVExTsl0UuDRtOb84Y8uFWc4G1UYYZos6gLVtIHQ3AZ7ojRqoMSmt8IHdTOSx365AaoNyUfNMi1ksC0Wh_axeD_THB6sOmnZZHhrvHQ/s1999/image2.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1999" data-original-width="1675" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQE94TGK404COMAKKxaPwUO9bD8gIzQfu6A0u5c-5xbGKhlUtBW_0KAj-Ur8kpgt5_f-IjAuFzeecpRbbWVujZNQVExTsl0UuDRtOb84Y8uFWc4G1UYYZos6gLVtIHQ3AZ7ojRqoMSmt8IHdTOSx365AaoNyUfNMi1ksC0Wh_axeD_THB6sOmnZZHhrvHQ/s16000/image2.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Stamp maps over Europe on 2022/07/14 at 0:00 UTC. The contours are for the mean sea level pressure (dashed lines mark isobars below 1010 hPa) while the heatmap depicts the geopotential height at the 500 hPa pressure level. (A) The&nbsp;<a href="https://www.ecmwf.int/en/forecasts/dataset/ecmwf-reanalysis-v5">ERA5</a>&nbsp;reanalysis, a proxy for real observations. (Ba-Bb) 2 members from the 7-day U.S. operational forecasts used as seeds to our model. (Ca-Ch) 8 samples drawn from SEEDS. (Da-Dh) 8 non-seeding members from the 7-day U.S. operational ensemble forecast. (Ea-Ed) 4 samples from a pointwise Gaussian model parameterized by the mean and variance of the entire U.S. operational ensemble.</td></tr></tbody></table> <div style="line-height: 40%;"> <br /> </div> <h2>Covering extreme events more accurately </h2> <p> Below we show the joint distributions of temperature at 2 meters and total column water vapor near Lisbon during the extreme heat event on 2022/07/14, at 1:00 local time. We used the 7-day forecasts issued on 2022/07/07. 
For each plot, we generate 16,384-member ensembles with SEEDS. The observed weather event from ERA5 is denoted by the star. The operational ensemble is also shown, with squares denoting the forecasts used to seed the generated ensembles, and triangles denoting the rest of the ensemble members. </p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVbbmrrrJ5L1NVb_O7WPUD-d6ULlTJTSns6ZaqjxOqZ4YAi4zOiT72rfMBf8EGTe0kdofIrWAMESq1m2v9IBjnd_k6UAIDM7LvhbxdVr41FOQ0fqkKeERF_QqXbxs94qKLdMxR-A7Hbxkjd4zZn07AlldAsuvn7jsYCu-V3UVAatovY1ELbrcLQz5I1ppX/s1999/image1.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="941" data-original-width="1999" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVbbmrrrJ5L1NVb_O7WPUD-d6ULlTJTSns6ZaqjxOqZ4YAi4zOiT72rfMBf8EGTe0kdofIrWAMESq1m2v9IBjnd_k6UAIDM7LvhbxdVr41FOQ0fqkKeERF_QqXbxs94qKLdMxR-A7Hbxkjd4zZn07AlldAsuvn7jsYCu-V3UVAatovY1ELbrcLQz5I1ppX/s16000/image1.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">SEEDS provides better statistical coverage of the 2022/07/14 European extreme heat event, denoted by the brown star. Each plot shows the values of the total column-integrated water vapor (TCWV) vs. temperature over a grid point near Lisbon, Portugal from 16,384 samples generated by our models, shown as green dots, conditioned on 2 seeds (blue squares) taken from the 7-day U.S. operational ensemble forecasts (denoted by the sparser brown triangles). The valid forecast time is 1:00 local time. The solid contour levels correspond to iso-proportions of the kernel density of SEEDS, with the outermost one encircling 95% of the mass and 11.875% between each level.</td></tr></tbody></table> <br /> <p> According to the U.S. 
operational ensemble, the observed event was so unlikely seven days prior that none of its 31 members predicted near-surface temperatures as warm as those observed. Indeed, the event probability computed from a Gaussian kernel density estimate is lower than 1%, which means that ensembles with fewer than 100 members are unlikely to contain forecasts as extreme as this event. In contrast, the SEEDS ensembles are able to extrapolate from the two seeding forecasts, providing an envelope of possible weather states with much better statistical coverage of the event. This allows both quantifying the probability of the event taking place and sampling weather regimes under which it would occur. Specifically, our highly scalable generative approach enables the creation of very large ensembles that can characterize very rare events by providing samples of weather states exceeding a given threshold for any user-defined diagnostic. </p> <br /> <h2>Conclusion and future outlook</h2> <p> SEEDS leverages the power of generative AI to produce ensemble forecasts comparable to those from the operational U.S. forecast system, but at an accelerated pace. The results reported in this paper need only 2 seeding forecasts from the operational system, which generates 31 forecasts in its current version. This leads to a hybrid forecasting system where a few weather trajectories computed with a physics-based model are used to seed a diffusion model that can generate additional forecasts much more efficiently. This methodology provides an alternative to the current operational weather forecasting paradigm, where the computational resources saved by the statistical emulator could be allocated to increasing the resolution of the physics-based model or issuing forecasts more frequently. </p> <p> We believe that SEEDS represents just one of the many ways that AI will accelerate progress in operational numerical weather prediction in the coming years. 
We hope this demonstration of the utility of generative AI for weather forecast emulation and post-processing will spur its application in research areas such as climate risk assessment, where generating a large number of ensembles of climate projections is crucial to accurately quantifying the uncertainty about future climate. </p> <br /> <h2>Acknowledgements</h2> <p> <em>All SEEDS authors, Lizao Li, Rob Carver, Ignacio Lopez-Gomez, Fei Sha and John Anderson, co-authored this blog post, with Carla Bromberg as Program Lead. We also thank Tom Small, who designed the animation. Our colleagues at Google Research have provided invaluable advice on the SEEDS work. Among them, we thank Leonardo Zepeda-Núñez, Zhong Yi Wan, Stephan Rasp, Stephan Hoyer, and Tapio Schneider for their input and useful discussions. We thank Tyler Russell for additional technical program management, as well as Alex Merose for data coordination and support. We also thank Cenk Gazen, Shreya Agrawal, and Jason Hickey for discussions in the early stage of the SEEDS work. </em> </p>
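<p> The ensemble-size figures quoted in this post (a 10,000-member ensemble for a 1% event at 10% relative error, and the observation that ensembles with fewer than 100 members are unlikely to contain such an event) can be checked with a short calculation. The following minimal sketch assumes that ensemble members behave like independent draws, so that the number of members realizing an event is binomial and the relative standard error of the estimated probability is sqrt((1 − p) / (N·p)). The post does not state how its figures were derived; this is one standard calculation that is consistent with them, and the function names are illustrative. </p>

```python
def relative_error(p, n_members):
    """Relative standard error of an event frequency from n independent members."""
    return ((1.0 - p) / (n_members * p)) ** 0.5


def members_needed(p, target_relative_error):
    """Ensemble size at which the relative standard error equals the target."""
    return (1.0 - p) / (p * target_relative_error ** 2)


def prob_no_member_hits(p, n_members):
    """Chance that none of n independent members realizes an event of probability p."""
    return (1.0 - p) ** n_members


if __name__ == "__main__":
    p = 0.01  # event with 1% probability of occurrence
    for n in (31, 100, 10_000, 16_384):
        print(
            f"n = {n:6d}  relative error = {relative_error(p, n):.3f}  "
            f"P(no member hits) = {prob_no_member_hits(p, n):.3f}"
        )
    print(f"members needed for 10% relative error: {members_needed(p, 0.1):.0f}")
```

<p> Under these assumptions, a 1% event needs roughly 9,900 members to reach 10% relative error, and a 31-member ensemble misses such an event entirely about 73% of the time. </p>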

Mar 29, 2024

AutoBNN: Probabilistic time series forecasting with compositional Bayesian neural networks

<span class="byline-author">Posted by Urs Köster, Software Engineer, Google Research</span> <img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgd5Wc54p1HvgIokpazxDsMo1u6i9wg3ovpNOiFc4-wYwebETvjs9-hm2wxZ4osNbBAxhet8To3hwGg-whFScksHQB_BP1kS4Z8Cu7FQT2bjVtJl4trPid-OxCyYocwyRTN66tuvAedu9z0FepBg4zZvmLbLxY6uuib8p5jVH2kfb3RxT_HMABsKMXuSFXr/s320/AutoBNN.jpg" style="display: none;" /> <p> <a href="https://en.wikipedia.org/wiki/Time_series">Time series</a> problems are ubiquitous, from forecasting weather and traffic patterns to understanding economic trends. <a href="https://en.wikipedia.org/wiki/Bayesian_inference">Bayesian</a> approaches start with an assumption about the data's patterns (the prior probability), collect evidence (e.g., new time series data), and continuously update that assumption to form a posterior probability distribution. Traditional Bayesian approaches like <a href="https://gaussianprocess.org/gpml/">Gaussian processes</a> (GPs) and <a href="https://blog.tensorflow.org/2019/03/structural-time-series-modeling-in.html">Structural Time Series</a> are extensively used for modeling time series data, e.g., the commonly used <a href="https://gml.noaa.gov/ccgg/trends/">Mauna Loa CO<sub>2</sub></a> dataset. However, they often rely on domain experts to painstakingly select appropriate model components and may be computationally expensive. Alternatives such as neural networks lack interpretability, making it difficult to understand how they generate forecasts, and don't produce reliable confidence intervals. </p> <a name='more'></a> <p> To address these limitations, we introduce <a href="https://github.com/tensorflow/probability/tree/main/spinoffs/autobnn">AutoBNN</a>, a new open-source package written in <a href="https://github.com/google/jax">JAX</a>. AutoBNN automates the discovery of interpretable time series forecasting models, provides high-quality uncertainty estimates, and scales effectively for use on large datasets. 
We describe how AutoBNN combines the interpretability of traditional probabilistic approaches with the scalability and flexibility of neural networks. </p> <div style="line-height: 40%;"> <br /> </div> <h2>AutoBNN</h2> <p> AutoBNN is based on a <a href="https://proceedings.mlr.press/v28/duvenaud13.html">line</a> <a href="https://royalsocietypublishing.org/doi/10.1098/rsta.2011.0550">of</a> <a href="https://proceedings.mlr.press/v202/saad23a.html">research</a> that over the past decade has yielded improved predictive accuracy by modeling time series using GPs with learned <a href="https://www.cs.toronto.edu/~duvenaud/cookbook/">kernel</a> structures. The kernel function of a GP encodes assumptions about the function being modeled, such as the presence of trends, periodicity or noise. With learned GP kernels, the kernel function is defined compositionally: it is either a base kernel (such as <code>Linear</code>, <code>Quadratic</code>, <code>Periodic</code>, <code><a href="https://en.wikipedia.org/wiki/Mat%C3%A9rn_covariance_function">Matérn</a></code> or <code>ExponentiatedQuadratic</code>) or a composite that combines two or more kernel functions using operators such as <code>Addition</code>, <code>Multiplication</code>, or <code><a href="https://icml.cc/Conferences/2010/papers/170.pdf">ChangePoint</a></code>. This compositional kernel structure serves two related purposes. First, it is simple enough that a user who is an expert about their data, but not necessarily about GPs, can construct a reasonable prior for their time series. 
Second, techniques like <a href="https://www.stats.ox.ac.uk/~doucet/doucet_defreitas_gordon_smcbookintro.pdf">Sequential Monte Carlo</a> can be used for discrete searches over small structures and can output interpretable results.</p> <p> AutoBNN improves upon these ideas, replacing the GP with <a href="https://www.cs.toronto.edu/~duvenaud/distill_bayes_net/public/">Bayesian neural networks</a> (BNNs) while retaining the compositional kernel structure. A BNN is a neural network with a probability distribution over weights rather than a fixed set of weights. This induces a distribution over outputs, capturing uncertainty in the predictions. BNNs bring the following advantages over GPs: First, training large GPs is computationally expensive, and traditional training algorithms scale as the cube of the number of data points in the time series. In contrast, for a fixed width, training a BNN will often be approximately linear in the number of data points. Second, BNNs lend themselves better to GPU and <a href="https://cloud.google.com/tpu?hl=en">TPU</a> hardware acceleration than GP training operations. Third, compositional BNNs can be easily combined with <a href="https://arxiv.org/abs/2007.06823">traditional deep BNNs</a>, which have the ability to do feature discovery. One could imagine "hybrid" architectures, in which users specify a top-level structure of <code>Add</code>(<code>Linear</code>, <code>Periodic</code>, <code>Deep</code>), and the deep BNN is left to learn the contributions from potentially high-dimensional covariate information. </p> <p> How might one translate a GP with compositional kernels into a BNN then? A single layer neural network will typically converge to a GP as the number of neurons (or "width") <a href="https://link.springer.com/chapter/10.1007/978-1-4612-0745-0_2">goes to infinity</a>. 
More recently, researchers have <a href="https://openreview.net/forum?id=gRwh5HkdaTm">discovered</a> a correspondence in the other direction — many popular GP <a href="https://www.cs.toronto.edu/~duvenaud/cookbook/">kernels</a> (such as <code>Matern</code>, <code>ExponentiatedQuadratic</code>, <code>Polynomial</code> or <code>Periodic</code>) can be obtained as infinite-width BNNs with appropriately chosen activation functions and weight distributions. Furthermore, these BNNs remain close to the corresponding GP even when the width is very much less than infinite. For example, the figures below show the difference in the <a href="https://en.wikipedia.org/wiki/Covariance_matrix#:~:text=In%20probability%20theory%20and%20statistics,of%20a%20given%20random%20vector">covariance</a> between pairs of observations, and <a href="https://en.wikipedia.org/wiki/Kriging">regression</a> results of the true GPs and their corresponding width-10 neural network versions. </p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHJ7hHI33S76Id3RrWCYezQKky9oELeuWf_CTm7GYadxpV7-B9GSQKCZgTmVQABi9zpWcEK8uvTYITyX2_jcbv_qF-eGv2C1QkU9oDCAS09FfoCne81yEAqC5moTNIqsn05aHfWNr8uy48N3UfV_tRGOyGrrQvB8l7RegzAq5_LNK2W8_Y_gSavdfi5aDI/s1350/image3.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="598" data-original-width="1350" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHJ7hHI33S76Id3RrWCYezQKky9oELeuWf_CTm7GYadxpV7-B9GSQKCZgTmVQABi9zpWcEK8uvTYITyX2_jcbv_qF-eGv2C1QkU9oDCAS09FfoCne81yEAqC5moTNIqsn05aHfWNr8uy48N3UfV_tRGOyGrrQvB8l7RegzAq5_LNK2W8_Y_gSavdfi5aDI/s16000/image3.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Comparison of <a href="https://en.wikipedia.org/wiki/Gram_matrix">Gram matrices</a> between true GP kernels (top 
row) and their width-10 neural network approximations (bottom row).</td></tr></tbody></table> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoidYqlAK2J1n4y71Qn-WuIcmaxGI9ynwSjtHAvyukuY_q5QcX4pVEheX2pwMxIhkAu7_OZR-0s7N7e-cU-caromj1wntP7E1txZfxHqh2yeTedusA90k9hFZ2yvzEZmC2QlPyR7trgVuMro-MoicBxpAbrkQXs2F9h1uux3AXzUENmJ0NA8Ch9dyICT15/s1328/image4.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="586" data-original-width="1328" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoidYqlAK2J1n4y71Qn-WuIcmaxGI9ynwSjtHAvyukuY_q5QcX4pVEheX2pwMxIhkAu7_OZR-0s7N7e-cU-caromj1wntP7E1txZfxHqh2yeTedusA90k9hFZ2yvzEZmC2QlPyR7trgVuMro-MoicBxpAbrkQXs2F9h1uux3AXzUENmJ0NA8Ch9dyICT15/s16000/image4.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Comparison of regression results between true GP kernels (top row) and their width-10 neural network approximations (bottom row).</td></tr></tbody></table> <p> Finally, the translation is completed with <a href="https://arxiv.org/abs/1905.06076">BNN analogues</a> of the <code>Addition</code> and <code>Multiplication</code> operators over GPs, and input warping to produce periodic kernels. BNN addition simply adds the outputs of the component BNNs. BNN multiplication is achieved by multiplying the activations of the hidden layers of the BNNs and then applying a shared dense layer. We can therefore only multiply BNNs that have the same hidden width. </p> <div style="line-height: 40%;"> <br /> </div> <h2>Using AutoBNN</h2> <p> The AutoBNN <a href="https://github.com/tensorflow/probability/tree/main/spinoffs/autobnn">package</a> is available within <a href="https://www.tensorflow.org/probability">TensorFlow Probability</a>. 
It is implemented in <a href="https://github.com/google/jax">JAX</a> and uses the <a href="https://github.com/google/flax">flax.linen</a> neural network library. It implements all of the base kernels and operators discussed so far (<code>Linear</code>, <code>Quadratic</code>, <code>Matern</code>, <code>ExponentiatedQuadratic</code>, <code>Periodic</code>, <code>Addition</code>, <code>Multiplication</code>) plus one new kernel and three new operators: </p> <ul> <li>a <code>OneLayer</code> kernel, a single hidden layer <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">ReLU</a> BNN, </li><li>a <code><a href="https://icml.cc/Conferences/2010/papers/170.pdf">ChangePoint</a></code> operator that allows smoothly switching between two kernels, </li><li>a <code>LearnableChangePoint</code> operator which is the same as <code>ChangePoint</code> except position and slope are given prior distributions and can be learnt from the data, and </li><li>a <code>WeightedSum</code> operator. </li> </ul> <p> <code>WeightedSum</code> combines two or more BNNs with learnable mixing weights, where the learnable weights follow a <a href="https://en.wikipedia.org/wiki/Dirichlet_distribution">Dirichlet prior</a>. By default, a flat Dirichlet distribution with concentration 1.0 is used. </p> <p> <code>WeightedSums</code> allow a "soft" version of structure discovery, i.e., training a linear combination of many possible models at once. In contrast to structure discovery with discrete structures, such as in <a href="https://proceedings.mlr.press/v202/saad23a.html">AutoGP</a>, this allows us to use standard gradient methods to learn structures, rather than using expensive discrete optimization. Instead of evaluating potential combinatorial structures in series, WeightedSum allows us to evaluate them in parallel. 
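</p> <p> The idea behind a weighted sum with Dirichlet-distributed mixing weights can be illustrated without any of AutoBNN's machinery. The following minimal sketch draws weights from a flat Dirichlet prior (concentration 1.0) by normalizing independent Gamma variates and uses them to mix three simple functions. The component functions are hypothetical stand-ins for the outputs of component BNNs, and none of the names below belong to the AutoBNN API. </p>

```python
import math
import random


def sample_dirichlet(concentrations, rng):
    """Draw mixing weights by normalizing independent Gamma(alpha_i, 1) variates."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in concentrations]
    total = sum(draws)
    return [d / total for d in draws]


def weighted_sum(components, weights):
    """Return f(t) = sum_i w_i * f_i(t) for component functions f_i."""
    def combined(t):
        return sum(w * f(t) for w, f in zip(weights, components))
    return combined


# Hypothetical stand-ins for the outputs of three component models.
components = [
    lambda t: 0.5 * t,                       # linear trend
    lambda t: math.sin(2.0 * math.pi * t),   # periodic pattern
    lambda t: t * t,                         # quadratic trend
]


if __name__ == "__main__":
    rng = random.Random(0)
    # Flat Dirichlet prior: every concentration parameter equals 1.0.
    weights = sample_dirichlet([1.0, 1.0, 1.0], rng)
    model = weighted_sum(components, weights)
    print("weights:", [round(w, 3) for w in weights])
    print("f(0.25) =", round(model(0.25), 3))
```

<p> Here the weights are only sampled from the prior. In the setting described above they are learned jointly with the component models by gradient methods, so a component whose weight shrinks toward zero is effectively dropped from the structure. 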
</p> <p> To easily enable exploration, AutoBNN defines a <a href="https://github.com/tensorflow/probability/blob/main/spinoffs/autobnn/autobnn/models.py">number of model structures</a> that contain either top-level or internal <code>WeightedSums</code>. The names of these models can be used as the first parameter in any of the <a href="https://github.com/tensorflow/probability/blob/main/spinoffs/autobnn/autobnn/estimators.py">estimator</a> constructors, and include things like <code><a href="https://github.com/tensorflow/probability/blob/main/spinoffs/autobnn/autobnn/models.py#L133">sum_of_stumps</a></code> (the <code>WeightedSum</code> over all the base kernels) and <code>sum_of_shallow</code> (which adds all possible combinations of base kernels with all operators).</p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNmWFuh7tVRkaF9o4nr3Fu7B2CNmXpDkGx8_9fMASh2olAfjlSdBXLj-0cgh7UIVWs6fHlNyyCvRPA_vc4eq-3lixkC2VXzCeSCZBFDHIc1qYfK53EwEdngf1KykzCfpPiIg3YoN46AZkBSSmCLrgPXX84PaZp_cxLrNnmojz2S6pLOCmTTT2niRi8Qfe5/s1389/image2.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="255" data-original-width="1389" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNmWFuh7tVRkaF9o4nr3Fu7B2CNmXpDkGx8_9fMASh2olAfjlSdBXLj-0cgh7UIVWs6fHlNyyCvRPA_vc4eq-3lixkC2VXzCeSCZBFDHIc1qYfK53EwEdngf1KykzCfpPiIg3YoN46AZkBSSmCLrgPXX84PaZp_cxLrNnmojz2S6pLOCmTTT2niRi8Qfe5/s16000/image2.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Illustration of the <code>sum_of_stumps</code> model. The bars in the top row show the amount by which each base kernel contributes, and the bottom row shows the function represented by the base kernel. 
The resulting weighted sum is shown on the right.</td></tr></tbody></table> <p> The figure below demonstrates the technique of structure discovery on the N374 series (a time series of yearly financial data starting from 1949) from the <a href="https://forecasters.org/resources/time-series-data/m3-competition/">M3</a> dataset. The six base structures were <code>ExponentiatedQuadratic</code> (which is the same as the Radial Basis Function kernel, or <a href="https://en.wikipedia.org/wiki/Radial_basis_function_kernel">RBF</a> for short), <code>Matern</code>, <code>Linear</code>, <code>Quadratic</code>, <code>OneLayer</code> and <code>Periodic</code> kernels. The figure shows the MAP estimates of their weights over an ensemble of 32 particles. All of the high-likelihood particles gave a large weight to the <code>Periodic</code> component, low weights to <code>Linear</code>, <code>Quadratic</code> and <code>OneLayer</code>, and a large weight to either <code>RBF</code> or <code>Matern</code>. </p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_5mU3VknB1oyCwNdCQj9kWTVV5J0BuylHB8W2LUK4sT6JpkOWdluZwh8_fKvRN5eSo2xBbQ0pRxDYa86IqML9H2-JZOmxxRJSm9ExG_PUr6U7iFl8nyp4lEaNpG3guYov3hPP3l9zifdu_iv_5aeP05OftccGqwJ7D0WAeMox_aWMGm3hN5nOkrj4BPxU/s868/image5.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="542" data-original-width="868" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_5mU3VknB1oyCwNdCQj9kWTVV5J0BuylHB8W2LUK4sT6JpkOWdluZwh8_fKvRN5eSo2xBbQ0pRxDYa86IqML9H2-JZOmxxRJSm9ExG_PUr6U7iFl8nyp4lEaNpG3guYov3hPP3l9zifdu_iv_5aeP05OftccGqwJ7D0WAeMox_aWMGm3hN5nOkrj4BPxU/s16000/image5.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Parallel coordinates plot of the <a 
href="https://www.probabilitycourse.com/chapter9/9_1_2_MAP_estimation.php">MAP</a> estimates of the base kernel weights over 32 particles. The <code>sum_of_stumps</code> model was trained on the N374 series from the M3 dataset (insert in blue). Darker lines correspond to particles with higher likelihoods.</td></tr></tbody></table> <p> By using <code>WeightedSums</code> as the inputs to other operators, it is possible to express rich combinatorial structures, while keeping models compact and the number of learnable weights small. As an example, we include the <code>sum_of_products</code> model (illustrated in the figure below) which first creates a pairwise product of two <code>WeightedSums</code>, and then a sum of the two products. By setting some of the weights to zero, we can create many different discrete structures. The total number of possible structures in this model is 2<sup>16</sup>, since there are 16 base kernels that can be turned on or off. All these structures are explored implicitly by training just this one model. 
</p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9VhSV6af55mkKxUKzpJJrqQiAV6WUWJ8HY9Q-5qcPB_mr8_P0lvrcGGkEUNe_-UB6Ri5VgWFkdHvRwEe7snZucQtvzMR_548jt4h2lbTzfnp7ZUeYFDmas7LwKc_9UAzdLE4gr8g9pVVkMXy9GU8qMUzrKfd9tjDEc2C4Ub6aXDzjHf2FjCryg_pWu39E/s1754/AutoBNN%20illustration.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="640" data-original-width="1754" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9VhSV6af55mkKxUKzpJJrqQiAV6WUWJ8HY9Q-5qcPB_mr8_P0lvrcGGkEUNe_-UB6Ri5VgWFkdHvRwEe7snZucQtvzMR_548jt4h2lbTzfnp7ZUeYFDmas7LwKc_9UAzdLE4gr8g9pVVkMXy9GU8qMUzrKfd9tjDEc2C4Ub6aXDzjHf2FjCryg_pWu39E/s16000/AutoBNN%20illustration.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Illustration of the "sum_of_products" model. Each of the four WeightedSums have the same structure as the "sum_of_stumps" model.</td></tr></tbody></table> <p> We have found, however, that certain combinations of kernels (e.g., the product of <code>Periodic</code> and either the <code>Matern</code> or <code>ExponentiatedQuadratic</code>) lead to overfitting on many datasets. To prevent this, we have defined model classes like <code>sum_of_safe_shallow</code> that exclude such products when performing structure discovery with <code>WeightedSums</code>. </p> <p> For training, AutoBNN provides <code>AutoBnnMapEstimator</code> and <code>AutoBnnMCMCEstimator</code> to perform MAP and MCMC inference, respectively. 
Either estimator can be combined with any of the six <a href="https://github.com/tensorflow/probability/blob/main/spinoffs/autobnn/autobnn/likelihoods.py">likelihood functions</a>, including four based on normal distributions with different noise characteristics for continuous data and two based on the negative binomial distribution for count data. </p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVzVWT-e-lcT53h75r2QJpR7iH9FAgCkpQY_oBNq7o1YoO4TkJ2GVpXLYcyY3RjOfgaXRM2LRII_jK31PbxTQF29yH1cTJRdI-XkXmnZMR_imlFv0uOuIPni3nW_vb1ercfuJuKHbrbuIA4bVR5EuGTs5iUHRXs-4WaA9wFEX54RwOJQt0BGMGfkNW4kxn/s1076/image1.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="280" data-original-width="1076" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVzVWT-e-lcT53h75r2QJpR7iH9FAgCkpQY_oBNq7o1YoO4TkJ2GVpXLYcyY3RjOfgaXRM2LRII_jK31PbxTQF29yH1cTJRdI-XkXmnZMR_imlFv0uOuIPni3nW_vb1ercfuJuKHbrbuIA4bVR5EuGTs5iUHRXs-4WaA9wFEX54RwOJQt0BGMGfkNW4kxn/s16000/image1.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Result from running AutoBNN on the <a href="https://gml.noaa.gov/ccgg/trends/">Mauna Loa CO2</a> dataset in our example <a href="https://github.com/tensorflow/probability/blob/main/discussion/examples/Forecasting_With_AutoBNN.ipynb">colab</a>. The model captures the trend and seasonal component in the data. 
Extrapolating into the future, the mean prediction slightly underestimates the actual trend, while the 95% confidence interval gradually widens.</td></tr></tbody></table> <p> To fit a model like the one in the figure above, all it takes is roughly 10 lines of code, using the <a href="https://scikit-learn.org/stable/">scikit-learn</a>–inspired estimator interface:</p> <pre class="prettyprint">import jax  # needed for the PRNG key passed to the estimator
import autobnn as ab

# Add combines the periodic, linear and Matern BNN components.
model = ab.operators.Add(
    bnns=(ab.kernels.PeriodicBNN(width=50),
          ab.kernels.LinearBNN(width=50),
          ab.kernels.MaternBNN(width=50)))

estimator = ab.estimators.AutoBnnMapEstimator(
    model, 'normal_likelihood_logistic_noise', jax.random.PRNGKey(42), periods=[12])

estimator.fit(my_training_data_xs, my_training_data_ys)
low, mid, high = estimator.predict_quantiles(my_training_data_xs)
</pre> <br /> <div style="line-height: 40%;"> <br /> </div> <h2>Conclusion</h2> <p> <a href="https://github.com/tensorflow/probability/tree/main/spinoffs/autobnn">AutoBNN</a> provides a powerful and flexible framework for building sophisticated time series prediction models. By combining the strengths of BNNs and GPs with compositional kernels, AutoBNN opens a world of possibilities for understanding and forecasting complex data. We invite the community to try the&nbsp;<a href="https://github.com/tensorflow/probability/blob/main/discussion/examples/Forecasting_With_AutoBNN.ipynb" target="_blank">colab</a>, and leverage this library to innovate and solve real-world challenges. </p> <div style="line-height: 40%;"> <br /> </div> <h2>Acknowledgements</h2> <p> <em>AutoBNN was written by Colin Carroll, Thomas Colthurst, Urs Köster and Srinivas Vasudevan. We would like to thank Kevin Murphy, Brian Patton and Feras Saad for their advice and feedback.</em> </p><p></p>
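<p> The weighted-sum idea and the on/off counting argument from the post above can be illustrated with a small NumPy sketch. It does not use the AutoBNN API. The helper names (<code>weighted_sum</code>, <code>active_structure</code>, <code>count_structures</code>), the toy components, the normalization of weights to sum to one, and the 0.05 cutoff are all assumptions made for illustration only. </p>

```python
import numpy as np


def weighted_sum(weights, components):
    """Combine component outputs using weights normalized to sum to one.

    Assumption: weights are non-negative; this is an illustrative stand-in,
    not the AutoBNN WeightedSum implementation.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return np.tensordot(weights, np.asarray(components, dtype=float), axes=1)


def active_structure(weights, names, threshold=0.05):
    """Names of components whose normalized weight exceeds a (hypothetical) cutoff."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return tuple(name for name, w in zip(names, weights) if w > threshold)


def count_structures(num_components):
    """Number of on/off patterns when each component can be kept or dropped."""
    return 2 ** num_components


# Toy components evaluated on a small grid (not real kernels or BNNs).
x = np.linspace(0.0, 4.0, 9)
names = ("linear", "quadratic", "periodic")
components = [x, x ** 2, np.sin(2 * np.pi * x)]
weights = [0.02, 0.08, 0.90]

y = weighted_sum(weights, components)
structure = active_structure(weights, names)
num_structures = count_structures(16)  # 16 base kernels, each on or off
```

<p> Weights that fall below the cutoff effectively switch a component off, which is how a single continuous model can stand in for many discrete structures. With 16 components, <code>count_structures(16)</code> gives the 2<sup>16</sup> figure quoted for <code>sum_of_products</code>. </p>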

Mar 28, 2024

Computer-aided diagnosis for lung cancer screening

<span class="byline-author">Posted by Atilla Kiraly, Software Engineer, and Rory Pilgrim, Product Manager, Google Research </span> <img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFpuCd82OUmuS2oG2cVir_ZgeOyUpFndr-kCq8V4pDv6fzxeyViBJymfVt5FFUqgkM_X57msxNv84XBtaXs2FsD7R8_tNqtH6D8X_KiMtZRaJ37JphQsvM35_gIk-4Tn2eEYvrInjMLV5ouwhRJv3Oqb30Z71P546NszeURINBoJnlWnzgASn-6D9YFwZo/s320/PULMA%20hero.jpg" style="display: none;" /> <p> Lung cancer is the leading cause of cancer-related deaths globally with <a href="https://www.who.int/news-room/fact-sheets/detail/cancer#:~:text=The%20most%20common%20causes%20of,rectum%20(916%20000%20deaths)%3B">1.8 million deaths</a> reported in 2020. Late diagnosis dramatically reduces the chances of survival. <a href="https://www.cdc.gov/cancer/lung/basic_info/screening.htm">Lung cancer screening</a> via <a href="https://www.cancer.gov/about-cancer/diagnosis-staging/ct-scans-fact-sheet#:~:text=indicate%20real%20problems.-,Lung%20cancer,-Low%2Ddose%20CT">computed tomography</a> (CT), which provides a detailed 3D image of the lungs, has been shown to reduce mortality in high-risk populations by at least 20% by detecting potential signs of cancers earlier. In the US, screening involves annual scans, with some countries or cases recommending more or less frequent scans. </p> <a name='more'></a> <p> The <a href="https://www.uspreventiveservicestaskforce.org/uspstf/recommendation/lung-cancer-screening">United States Preventive Services Task Force</a> recently expanded lung cancer screening recommendations by <a href="https://pubmed.ncbi.nlm.nih.gov/34636916/">roughly 80%</a>, which is expected to increase screening access for women and racial and ethnic minority groups. However, false positives (i.e., incorrectly reporting a potential cancer in a cancer-free patient) can cause anxiety and lead to unnecessary procedures for patients while increasing costs for the healthcare system. 
Moreover, efficiency in screening a large number of individuals can be challenging depending on healthcare infrastructure and radiologist availability. </p> <p> At Google we have previously developed <a href="https://blog.google/technology/health/lung-cancer-prediction/">machine learning (ML) models for lung cancer detection</a>, and have evaluated their ability to automatically detect and classify regions that show signs of potential cancer. Performance has been shown to be comparable to that of specialists in detecting possible cancer. While these models have achieved high performance, effectively communicating their findings in realistic environments is necessary to realize their full potential. </p> <p> To that end, in “<a href="https://pubs.rsna.org/doi/10.1148/ryai.230079">Assistive AI in Lung Cancer Screening: A Retrospective Multinational Study in the US and Japan</a>”, published in <em><a href="https://pubs.rsna.org/journal/ai">Radiology AI</a></em>, we investigate how ML models can effectively communicate findings to radiologists. We also introduce a generalizable user-centric interface to help radiologists leverage such models for lung cancer screening. The system takes CT imaging as input and outputs a cancer suspicion rating using four categories (no suspicion, probably benign, suspicious, highly suspicious) along with the corresponding regions of interest. We evaluate the system’s utility in improving clinician performance through randomized reader studies in both the US and Japan, using the local cancer scoring systems (<a href="https://www.acr.org/-/media/ACR/Files/RADS/Lung-RADS/LungRADSAssessmentCategoriesv1-1.pdf">Lung-RADS v1.1</a> and <a href="https://www.jscts.org/pdf/guideline/gls3rdfig_english130621.pdf">Sendai Score</a>) and image viewers that mimic realistic settings. We found that reader specificity increases with model assistance in both reader studies. 
To accelerate progress in conducting similar studies with ML models, we have <a href="https://github.com/Google-Health/google-health/tree/master/ct_dicom">open-sourced code</a> to process CT images and generate images compatible with the <a href="https://en.wikipedia.org/wiki/Picture_archiving_and_communication_system">picture archiving and communication system</a> (PACS) used by radiologists. </p> <div style="line-height: 40%;"> <br /> </div> <h2>Developing an interface to communicate model results</h2> <p> Integrating ML models into radiologist workflows involves understanding the nuances and goals of their tasks to meaningfully support them. In the case of lung cancer screening, hospitals follow various country-specific guidelines that are regularly updated. For example, in the US, Lung-RADS v1.1 assigns an <a href="https://www.acr.org/-/media/ACR/Files/RADS/Lung-RADS/LungRADSAssessmentCategoriesv1-1.pdf">alphanumeric score</a> to indicate the lung cancer risk and follow-up recommendations. When assessing patients, radiologists load the CT in their workstation to read the case, find lung nodules or lesions, and apply set guidelines to determine follow-up decisions. </p> <p> Our first step was to improve the <a href="https://blog.google/technology/health/lung-cancer-prediction/">previously developed ML models</a> through additional training data and architectural improvements, including <a href="https://research.google/pubs/attention-is-all-you-need/">self-attention</a>. Then, instead of targeting specific guidelines, we experimented with a complementary way of communicating AI results independent of guidelines or their particular versions. Specifically, the system output offers a suspicion rating and localization (regions of interest) for the user to consider in conjunction with their own specific guidelines. The interface produces output images directly associated with the CT study, requiring no changes to the user’s workstation. 
The radiologist only needs to review a small set of additional images. There is no other change to their system or interaction with the system. </p> <p> </p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiChGqKLOWQAzrIzk294q6i6XuUoR1ul0qoTAR8RHQw-bZT-ulyruug-HNY8f2em7ZgzHE1UP6yQbe4plM0gkmXu6KwcTmsNogbr6FjTGzSDrBEDFhVLQ4TdbxVp_bbB21gA_jR84-1r9ly-O5HXqOzuZERgJyjFSYtZty7h6J3UErWsP0-DoQ1pFZtyjiw/s857/image1.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="436" data-original-width="857" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiChGqKLOWQAzrIzk294q6i6XuUoR1ul0qoTAR8RHQw-bZT-ulyruug-HNY8f2em7ZgzHE1UP6yQbe4plM0gkmXu6KwcTmsNogbr6FjTGzSDrBEDFhVLQ4TdbxVp_bbB21gA_jR84-1r9ly-O5HXqOzuZERgJyjFSYtZty7h6J3UErWsP0-DoQ1pFZtyjiw/s16000/image1.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Example of the assistive lung cancer screening system outputs. Results for the radiologist’s evaluation are visualized on the location of the CT volume where the suspicious lesion is found. The overall suspicion is displayed at the top of the CT images. Circles highlight the suspicious lesions while squares show a rendering of the same lesion from a different perspective, called a sagittal view.</td></tr></tbody></table> <p> The assistive lung cancer screening system comprises 13 models and has a high-level architecture similar to the end-to-end system used in <a href="https://blog.google/technology/health/lung-cancer-prediction/">prior work</a>. The models coordinate with each other to first segment the lungs, obtain an overall assessment, locate three suspicious regions, then use the information to assign a suspicion rating to each region. 
The system was deployed on Google Cloud using a <a href="https://cloud.google.com/kubernetes-engine">Google Kubernetes Engine</a> (GKE) that pulled the images, ran the ML models, and provided results. This allows scalability and directly connects to servers where the images are stored in <a href="https://cloud.google.com/healthcare-api/docs/concepts/dicom">DICOM stores</a>. </p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlQLk7XcQtSX367ubw0D0TtTqZQg-H69p63qtVrGir3UfJcYUyys0n_Nks-YqURRklRWllhSKdH-FFjRvfkb9mGxEmL191sfpAclKD085x-u20FJS9BWJGULyLk0foVGKfq5T5F7_hx7Z4xHu1ZeHPLM63HUCaiCrkt8BThhiImts9epWqqCE2s0BLeoWU/s646/image4.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="394" data-original-width="646" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlQLk7XcQtSX367ubw0D0TtTqZQg-H69p63qtVrGir3UfJcYUyys0n_Nks-YqURRklRWllhSKdH-FFjRvfkb9mGxEmL191sfpAclKD085x-u20FJS9BWJGULyLk0foVGKfq5T5F7_hx7Z4xHu1ZeHPLM63HUCaiCrkt8BThhiImts9epWqqCE2s0BLeoWU/s16000/image4.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Outline of the Google Cloud deployment of the assistive lung cancer screening system and the directional calling flow for the individual components that serve the images and compute results. Images are served to the viewer and to the system using Google Cloud services. 
The system is run on a Google Kubernetes Engine that pulls the images, processes them, and writes them back into the DICOM store.</td></tr></tbody></table> <br /> <div style="line-height: 40%;"> <br /> </div> <h2>Reader studies </h2> <p> To evaluate the system’s utility in improving clinical performance, we conducted two reader studies (i.e., experiments designed to assess clinical performance comparing expert performance with and without the aid of a technology) with 12 radiologists using pre-existing, de-identified CT scans. We presented 627 challenging cases to 6 US-based and 6 Japan-based radiologists. In the experimental setup, readers were divided into two groups that read each case twice, with and without assistance from the model. Readers were asked to apply scoring guidelines they typically use in their clinical practice and report their overall suspicion of cancer for each case. We then compared the readers’ responses to measure the impact of the model on their workflow and decisions. The score and suspicion level were judged against the actual cancer outcomes of the individuals to measure sensitivity, specificity, and <a href="https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc#:~:text=AUC%20stands%20for%20%22Area%20under,across%20all%20possible%20classification%20thresholds.">area under the ROC curve</a> (AUC) values. These were compared with and without assistance. 
</p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmiP7GWIMf_TKezxSK0sM8EOtfm2M3QoZtgvYfcjacMm2atdilirD93ftlu_QlyusIu_ocC6R0iHX1eXtHrU6g1yLUWnZ1Bq0FJ0nXEjTezptuSxGbpwDFIkQGeZrFPmwXV3IYvyzJYPCEhp4etRNzhGmHbbfQAwntOm4ZhQNpuXbei5sfN6MqsQXJctVH/s794/image3.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="297" data-original-width="794" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmiP7GWIMf_TKezxSK0sM8EOtfm2M3QoZtgvYfcjacMm2atdilirD93ftlu_QlyusIu_ocC6R0iHX1eXtHrU6g1yLUWnZ1Bq0FJ0nXEjTezptuSxGbpwDFIkQGeZrFPmwXV3IYvyzJYPCEhp4etRNzhGmHbbfQAwntOm4ZhQNpuXbei5sfN6MqsQXJctVH/s16000/image3.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">A multi-case multi-reader study involves each case being reviewed by each reader twice, once with ML system assistance and once without. In this visualization one reader first reviews Set A without assistance (<strong>blue</strong>) and then with assistance (<strong>orange</strong>) after a wash-out period. A second reader group follows the opposite path by reading the same set of cases Set A with assistance first. Readers are randomized to these groups to remove the effect of ordering.</td></tr></tbody></table> <p> The ability to conduct these studies using the same interface highlights its generalizability to completely different cancer scoring systems, and the generalization of the model and assistive capability to different patient populations. Our study results demonstrated that when radiologists used the system in their clinical evaluation, they had an increased ability to correctly identify lung images without actionable lung cancer findings (i.e., <em>specificity</em>) by an absolute 5–7% compared to when they didn’t use the assistive system. 
This potentially means that for every 15–20 patients screened, one may be able to avoid unnecessary follow-up procedures, thus reducing their anxiety and the burden on the health care system. This can, in turn, help improve the sustainability of lung cancer screening programs, particularly as <a href="https://pubmed.ncbi.nlm.nih.gov/34636916/">more people become eligible for screening</a>. </p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDMKrqRR9njVuYSLV0Nzb7-MXdpyJTSofvvxFhyendGwnM9pddFyy48MVBWKsadYMUp1RGQBNL77vC0gCvjZ_fIsIQ8ZhGHZmy52srebu49xIL4wYkuvyftssXzvohoSoBKt9C2uwua6gz4ReO4LQvfMbhdrgtXvcYb3JruZAchta2n5MhU41pTpJLyMJI/s1999/image2.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="824" data-original-width="1999" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDMKrqRR9njVuYSLV0Nzb7-MXdpyJTSofvvxFhyendGwnM9pddFyy48MVBWKsadYMUp1RGQBNL77vC0gCvjZ_fIsIQ8ZhGHZmy52srebu49xIL4wYkuvyftssXzvohoSoBKt9C2uwua6gz4ReO4LQvfMbhdrgtXvcYb3JruZAchta2n5MhU41pTpJLyMJI/s16000/image2.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Reader specificity increases with ML model assistance in both the US-based and Japan-based reader studies. Specificity values were derived from reader scores from actionable findings (something suspicious was found) versus no actionable findings, compared against the true cancer outcome of the individual. Under model assistance, readers flagged fewer cancer-negative individuals for follow-up visits. 
Sensitivity for cancer-positive individuals remained the same.</td></tr></tbody></table> <div style="line-height: 40%;"> <br /> </div> <h2>Translating this into real-world impact through partnership </h2> <p> The system results demonstrate the potential for fewer follow-up visits, reduced anxiety, as well as lower overall costs for lung cancer screening. In an effort to translate this research into real-world clinical impact, we are working with <a href="https://deephealth.com/">DeepHealth</a>, a leading AI-powered health informatics provider, and <a href="https://apolloradiologyintl.com/">Apollo Radiology International</a>, a leading provider of radiology services in India, to explore paths for incorporating this system into future products. In addition, we are looking to help other researchers studying how best to integrate ML model results into clinical workflows by <a href="https://github.com/Google-Health/google-health/tree/master/ct_dicom">open sourcing code</a> used for the reader study and incorporating the insights described in this blog. We hope that this will help accelerate medical imaging researchers looking to conduct reader studies for their AI models, and catalyze translational research in the field. </p> <div style="line-height: 40%;"> <br /> </div> <h2>Acknowledgements</h2> <p> <em>Key contributors to this project include Corbin Cunningham, Zaid Nabulsi, Ryan Najafi, Jie Yang, Charles Lau, Joseph R. Ledsam, Wenxing Ye, Diego Ardila, Scott M. McKinney, Rory Pilgrim, Hiroaki Saito, Yasuteru Shimamura, Mozziyar Etemadi, Yun Liu, David Melnick, Sunny Jansen, Nadia Harhen, David P. Nadich, Mikhail Fomitchev, Ziyad Helali, Shabir Adeel, Greg S. Corrado, Lily Peng, Daniel Tse, Shravya Shetty, Shruthi Prabhakara, Neeral Beladia, and Krish Eswaran. Thanks to Arnav Agharwal and Andrew Sellergren for their open sourcing support and Vivek Natarajan and Michael D. Howell for their feedback. 
Sincere appreciation also goes to the radiologists who enabled this work with their image interpretation and annotation efforts throughout the study, and Jonny Wong and Carli Sampson for coordinating the reader studies.</em> </p><p></p>
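<p> The reader-study metrics discussed in the post above can be made concrete with a short Python sketch. It shows how sensitivity and specificity follow from reader calls and true outcomes, and how an absolute specificity gain of 5–7% maps to roughly one avoided follow-up per 15–20 patients. The function names and the toy labels are hypothetical and are not study data. The conversion also assumes that nearly all screened patients are cancer-free, so that specificity applies to almost everyone screened. </p>

```python
def sensitivity_specificity(flagged, has_cancer):
    """Return (sensitivity, specificity) from paired boolean sequences.

    flagged[i] is True when the reader reported an actionable finding,
    has_cancer[i] is the true outcome. Assumes at least one cancer-positive
    and one cancer-negative case, otherwise a ratio is undefined.
    """
    pairs = list(zip(flagged, has_cancer))
    tp = sum(1 for f, c in pairs if f and c)
    fn = sum(1 for f, c in pairs if not f and c)
    tn = sum(1 for f, c in pairs if not f and not c)
    fp = sum(1 for f, c in pairs if f and not c)
    return tp / (tp + fn), tn / (tn + fp)


def patients_per_avoided_follow_up(specificity_gain):
    """Cancer-free patients screened per one avoided false positive."""
    return 1.0 / specificity_gain


# Toy labels for illustration only (not results from the reader studies).
has_cancer = [True, False, False, False, True, False]
flagged_unaided = [True, True, True, False, True, False]
flagged_assisted = [True, True, False, False, True, False]

sens_unaided, spec_unaided = sensitivity_specificity(flagged_unaided, has_cancer)
sens_assisted, spec_assisted = sensitivity_specificity(flagged_assisted, has_cancer)

# Absolute gains of 5% and 7% give about 20 and about 14 patients respectively.
per_follow_up_low_gain = patients_per_avoided_follow_up(0.05)
per_follow_up_high_gain = patients_per_avoided_follow_up(0.07)
```

<p> In the toy data, assistance removes one false positive while every cancer case is still flagged, mirroring the pattern reported above: higher specificity with unchanged sensitivity. </p>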

Mar 20, 2024

Using AI to expand global access to reliable flood forecasts

<span class="byline-author">Posted by Yossi Matias, VP Engineering &amp; Research, and Grey Nearing, Research Scientist, Google Research</span> <img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgABDUlqCHMxNY-QfEftM_9yPy1z4jr1odB-_kSP79yjk6igtpPJNFIocQOKDRnZ3VLmqrI9tqX-dCHpcYtnSx96y9X9V9knp1CiAREvfgZX71D0XpWZNgPdZOI7aMW3POigHJ2rLeA1G1asaAPO3KIB3j0WzUr5C707I7p0L_itspYYEhYDhDTzd39tNUD/s320/Flood%20forecasting%20hero%20image.jpg" style="display: none;" /> <p> Floods are the <a href="https://openknowledge.worldbank.org/server/api/core/bitstreams/e218989e-8b3b-5f8c-944c-06e9812215aa/content">most common natural disaster</a>, and are responsible for roughly <a href="https://www.swissre.com/risk-knowledge/mitigating-climate-risk/floods.html">$50 billion</a> in annual financial damages worldwide. The <a href="https://library.wmo.int/records/item/57630-2021-state-of-climate-services-water?offset=1#:~:text=WMO%2DNo.,1278&amp;text=More%20than%202%20billion%20people,for%20the%20past%2020%20years.">rate of flood-related disasters has more than doubled</a> since the year 2000 partly <a href="https://www.nature.com/articles/s41598-020-70816-2">due to climate change</a>. Nearly <a href="https://openknowledge.worldbank.org/server/api/core/bitstreams/e218989e-8b3b-5f8c-944c-06e9812215aa/content">1.5 billion people</a>, making up 19% of the world’s population, are exposed to substantial risks from severe flood events. Upgrading early warning systems to make accurate and timely information accessible to these populations <a href="https://elibrary.worldbank.org/doi/abs/10.1596/1813-9450-6058">can save thousands of lives per year</a>. </p> <a name='more'></a> <p> Driven by the potential impact of reliable flood forecasting on people’s lives globally, we started our flood forecasting effort in 2017. 
Over this <a href="https://blog.google/technology/ai/google-ai-global-flood-forecasting/">multi-year journey</a>, we advanced research hand-in-hand with building a real-time operational flood forecasting system that <a href="https://blog.google/technology/ai/expanding-our-ml-based-flood-forecasting/">provides alerts</a> on Google Search, Maps, Android notifications and through the <a href="http://g.co/floodhub">Flood Hub</a>. However, in order to <a href="https://blog.google/outreach-initiatives/sustainability/flood-hub-ai-flood-forecasting-more-countries/">scale globally</a>, especially in places where accurate local data is not available, more research advances were required. </p> <p> In “<a href="https://www.nature.com/articles/s41586-024-07145-1">Global prediction of extreme floods in ungauged watersheds</a>”, published in <em><a href="https://www.nature.com/">Nature</a></em>, we demonstrate how machine learning (ML) technologies can significantly improve global-scale <a href="https://sites.research.google/floodforecasting/">flood forecasting</a> relative to the current state-of-the-art for countries where flood-related data is scarce. With these AI-based technologies we extended the reliability of currently available global nowcasts, on average, from zero to five days, and improved forecasts across regions in Africa and Asia to be similar to those currently available in Europe. The evaluation of the models was conducted in collaboration with the European Centre for Medium-Range Weather Forecasts (<a href="https://www.ecmwf.int/">ECMWF</a>). </p> <p> These technologies also enable <a href="http://g.co/floodhub">Flood Hub</a> to provide real-time river forecasts up to seven days in advance, <a href="https://blog.google/outreach-initiatives/sustainability/flood-hub-ai-flood-forecasting-more-countries/">covering</a> river reaches across over 80 countries. 
This information can be used by people, communities, governments and international organizations to take anticipatory action to help protect vulnerable populations. </p> <br /> <div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" frameborder="0" height="360" src="https://www.youtube.com/embed/ET04pDj-RvM?si=WJJXEtwJqtyMRuC_?rel=0&amp;" width="640" youtube-src-id="[ET04pDj-RvM]"></iframe></div> <br /> <div style="line-height: 40%;"> <br /> </div> <h2>Flood forecasting at Google </h2> <p> The ML models that power the Flood Hub tool are the product of many years of research, conducted in collaboration with several partners, including academics, governments, international organizations, and NGOs. </p> <p> In 2018, we <a href="https://blog.google/products/search/helping-keep-people-safe-ai-enabled-flood-forecasting/">launched a pilot</a> early warning system in the Ganges-Brahmaputra river basin in India, with the <a href="https://arxiv.org/abs/1901.09583">hypothesis</a> that ML could help address the challenging problem of reliable flood forecasting at scale. The pilot was further <a href="https://blog.google/technology/ai/tracking-our-progress-on-flood-forecasting/">expanded</a> the following year <a href="https://ai.googleblog.com/2019/09/an-inside-look-at-flood-forecasting.html">via the combination</a> of an inundation model, real-time water level measurements, the creation of an elevation map and hydrologic modeling. 
</p> <p> In <a href="https://ai.googleblog.com/2019/03/a-summary-of-google-flood-forecasting.html">collaboration</a> with academics, and, in particular, with the <a href="https://www.jku.at/en/institute-for-machine-learning/">JKU Institute for Machine Learning</a> we explored ML-based hydrologic models, showing that <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">LSTM</a>-based models could <a href="https://hess.copernicus.org/articles/23/5089/2019/">produce more accurate simulations</a> than traditional conceptual and physics-based <a href="https://en.wikipedia.org/wiki/Hydrological_model">hydrology models</a>. This research led to <a href="https://blog.research.google/2020/09/the-technology-behind-our-recent.html">flood forecasting improvements</a> that enabled the <a href="https://blog.google/technology/ai/flood-forecasts-india-bangladesh/">expansion</a> of our forecasting coverage to include all of India and Bangladesh. We also worked with researchers at Yale University to test technological interventions that increase the <a href="https://egc.yale.edu/about/perspectives/pande-and-coauthors-using-technology-save-lives-during-indias-monsoon-season">reach and impact</a> of flood warnings. </p> <p> Our hydrological models predict river floods by processing publicly available weather data like precipitation and physical watershed information. Such models must be calibrated to long data records from <a href="https://en.wikipedia.org/wiki/Stream_gauge">streamflow gauging stations</a> in individual rivers. A low percentage of global river watersheds (basins) have streamflow gauges, which are expensive but necessary to supply relevant data, and it’s challenging for hydrological simulation and forecasting to provide <a href="https://www.tandfonline.com/doi/full/10.1080/02626667.2013.803183">predictions in basins</a> that lack this infrastructure. 
Lower <a href="https://www.pnas.org/doi/full/10.1073/pnas.1414439112">gross domestic product</a> (GDP) is correlated with increased <a href="https://www.pnas.org/doi/full/10.1073/pnas.1414439112">vulnerability to flood risks</a>, and there is an inverse correlation between national GDP and the amount of publicly available data in a country. ML helps to address this problem by allowing a <a href="https://www.pnas.org/doi/full/10.1073/pnas.1414439112">single model to be trained on all available river data</a> and to be applied to ungauged basins where <a href="https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2020wr028091">no data are available</a>. In this way, models can be trained globally, and can make predictions for any river location. </p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxQUgMZAg0tVPN5LrxYbhpn3dukUCVogsWPgynrYNjFfbXpwK0RF79rYvK9kyehrha0F-vMLZR2eqBWdKCuGter6VoZrbCKnROTNn_hmOXBDxWmOFhFRvyg36ghO0B08fsQv7cqXdyngtfgCAgF5LhONs5VDzyvYjxzEYejVN3FxvzRs8w9Q5EeGJJTr3O/s1051/Streamflow%20data%20from%20the%20Global%20Runoff%20Data%20Center.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="788" data-original-width="1051" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxQUgMZAg0tVPN5LrxYbhpn3dukUCVogsWPgynrYNjFfbXpwK0RF79rYvK9kyehrha0F-vMLZR2eqBWdKCuGter6VoZrbCKnROTNn_hmOXBDxWmOFhFRvyg36ghO0B08fsQv7cqXdyngtfgCAgF5LhONs5VDzyvYjxzEYejVN3FxvzRs8w9Q5EeGJJTr3O/s16000/Streamflow%20data%20from%20the%20Global%20Runoff%20Data%20Center.jpg" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">There is an inverse (log-log) correlation between the amount of publicly available streamflow data in a country and national GDP. 
Streamflow data from the <a href="https://www.bafg.de/GRDC/EN/Home/homepage_node.html">Global Runoff Data Center</a>.</td></tr></tbody></table> <p> Our academic collaborations led to ML research that developed methods to <a href="https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2020wr028091">estimate uncertainty in river forecasts</a> and showed how ML river forecast models <a href="https://hess.copernicus.org/articles/25/2685/2021/hess-25-2685-2021-relations.html">synthesize information from multiple data sources</a>. They demonstrated that these models can <a href="https://hess.copernicus.org/articles/26/3377/2022/hess-26-3377-2022.html">simulate extreme events reliably</a>, even when those events are not part of the training data. In an effort to <a href="https://blog.research.google/2023/04/directing-ml-toward-natural-hazard.html">contribute</a> to open science, in 2023 we open-sourced a community-driven dataset for large-sample hydrology in <em><a href="https://www.nature.com/articles/s41597-023-01975-w">Nature Scientific Data</a></em>. </p> <div style="line-height: 40%;"> <br /> </div> <h2>The river forecast model</h2> <p> Most hydrology models used by national and international agencies for flood forecasting and river modeling are state-space models, which depend only on daily inputs (e.g., precipitation, temperature, etc.) and the current state of the system (e.g., soil moisture, snowpack, etc.). LSTMs are a variant of state-space models and work by defining a neural network that represents a single time step, where input data (such as current weather conditions) are processed to produce updated state information and output values (streamflow) for that time step. LSTMs are applied sequentially to make time-series predictions, and in this sense, behave similarly to how scientists typically conceptualize hydrologic systems. 
Empirically, we have found that <a href="https://hess.copernicus.org/articles/23/5089/2019/">LSTMs perform well</a> on the task of river forecasting. </p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMfiw33NkHO8CQsYGWSZ91xhPx0iDONFLe8WZuRWDsoi8RRv7pHlF6M8eDLEWpO8lZECUfGi59_NsMXO8ASDZQ9xxrB87mupNTPpioKT0wRgSSc1FwYDmfCUWyooGGZmvMhZv0RDcWJVslQOPvRNOK_B6dXUGsnijSl-W-lICOIbALAwNC2PNEmqqXhv6g/s960/image1.gif" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="540" data-original-width="960" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMfiw33NkHO8CQsYGWSZ91xhPx0iDONFLe8WZuRWDsoi8RRv7pHlF6M8eDLEWpO8lZECUfGi59_NsMXO8ASDZQ9xxrB87mupNTPpioKT0wRgSSc1FwYDmfCUWyooGGZmvMhZv0RDcWJVslQOPvRNOK_B6dXUGsnijSl-W-lICOIbALAwNC2PNEmqqXhv6g/s16000/image1.gif" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">A diagram of the LSTM, which is a neural network that operates sequentially in time. An accessible primer can be found <a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/">here</a>.</td></tr></tbody></table> <p> Our river forecast model uses two LSTMs applied sequentially: (1) a “hindcast” LSTM ingests historical weather data (dynamic hindcast features) up to the present time (or rather, the issue time of a forecast), and (2) a “forecast” LSTM ingests states from the hindcast LSTM along with forecasted weather data (dynamic forecast features) to make future predictions. One year of historical weather data are input into the hindcast LSTM, and seven days of forecasted weather data are input into the forecast LSTM. 
Static features include geographical and geophysical characteristics of watersheds that are input into both the hindcast and forecast LSTMs and allow the model to learn different hydrological behaviors and responses in various types of watersheds. </p> <p> Output from the forecast LSTM is fed into a “head” layer that uses <a href="https://publications.aston.ac.uk/id/eprint/373/1/NCRG_94_004.pdf">mixture density networks</a> to produce a probabilistic forecast (i.e., predicted parameters of a probability distribution over streamflow). Specifically, the model predicts the parameters of a mixture of heavy-tailed probability density functions, called <a href="https://en.wikipedia.org/wiki/Asymmetric_Laplace_distribution">asymmetric Laplacian distributions</a>, at each forecast time step. The result is a mixture density function, called a <a href="https://proceedings.neurips.cc/paper_files/paper/2019/file/d80126524c1e9641333502c664fc6ca1-Paper.pdf">Countable Mixture of Asymmetric Laplacians</a> (CMAL) distribution, which represents a probabilistic prediction of the volumetric flow rate in a particular river at a particular time. 
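</p> <p> To make the CMAL head concrete, here is a minimal sketch of how such a mixture density can be evaluated once a network has produced its parameters. It assumes the quantile-style parameterization of the asymmetric Laplacian (location <code>mu</code>, scale <code>b</code>, asymmetry <code>tau</code>); the function names, the parameter values, and the flood threshold are hypothetical and are not taken from the model described in this post. </p>

```python
import math


def ald_pdf(y, mu, b, tau):
    """Density of one asymmetric Laplacian: location mu, scale b > 0, asymmetry 0 < tau < 1."""
    u = y - mu
    # "Check" (pinball) function: weights positive and negative residuals differently.
    check = u * (tau - (1.0 if u < 0 else 0.0))
    return tau * (1.0 - tau) / b * math.exp(-check / b)


def ald_cdf(y, mu, b, tau):
    """Cumulative distribution of the same component; equals tau at y == mu."""
    if y < mu:
        return tau * math.exp(-(1.0 - tau) * (mu - y) / b)
    return 1.0 - (1.0 - tau) * math.exp(-tau * (y - mu) / b)


def cmal_pdf(y, weights, mus, bs, taus):
    """Mixture density: weighted sum of component densities (weights sum to 1)."""
    return sum(w * ald_pdf(y, m, b, t) for w, m, b, t in zip(weights, mus, bs, taus))


def cmal_exceedance(threshold, weights, mus, bs, taus):
    """Probability that streamflow exceeds `threshold` under the mixture."""
    return sum(
        w * (1.0 - ald_cdf(threshold, m, b, t))
        for w, m, b, t in zip(weights, mus, bs, taus)
    )


# Hypothetical head outputs for a single forecast time step (units: m^3/s).
weights = [0.6, 0.4]
mus = [120.0, 300.0]
bs = [10.0, 40.0]
taus = [0.5, 0.3]

print(cmal_pdf(150.0, weights, mus, bs, taus))
print(cmal_exceedance(400.0, weights, mus, bs, taus))
```

<p> Because each component has a closed-form cumulative distribution, the probability of exceeding a flood threshold can be read directly from the predicted parameters without sampling. 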
</p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVPR4LA0EbJyAesDg4HvrMdxgG_0wiyLqJveir2Ryy06qDNVshkM2-zHvMj_y1LEBXOSm7ajMx2qzYCLNQrQ3dm8TRicy_wkTVtM4Xio_mhQPsgaSiN3sm3J8BBNYNpxWQbSm_aTSMyRW9UyIEWAAT9secPekdYNzyKRrXwgm10-ksyeUzTFRydXnt_Wai/s960/LSTM-based%20river%20forecast%20model.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="480" data-original-width="960" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVPR4LA0EbJyAesDg4HvrMdxgG_0wiyLqJveir2Ryy06qDNVshkM2-zHvMj_y1LEBXOSm7ajMx2qzYCLNQrQ3dm8TRicy_wkTVtM4Xio_mhQPsgaSiN3sm3J8BBNYNpxWQbSm_aTSMyRW9UyIEWAAT9secPekdYNzyKRrXwgm10-ksyeUzTFRydXnt_Wai/s16000/LSTM-based%20river%20forecast%20model.jpeg" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">LSTM-based river forecast model architecture. Two LSTMs are applied in sequence, one ingesting historical weather data and one ingesting forecasted weather data. The model outputs are the parameters of a probability distribution over streamflow at each forecasted timestep.</td></tr></tbody></table> <br /> <div style="line-height: 40%;"> <br /> </div> <h2>Input and training data</h2> <p> The model uses three types of publicly available data inputs, mostly from governmental sources: </p> <ol> <li><em>Static watershed attributes representing geographical and geophysical variables:</em> From the <a href="https://www.hydrosheds.org/hydroatlas">HydroATLAS project</a>, including data like long-term climate indexes (precipitation, temperature, snow fractions), land cover, and anthropogenic attributes (e.g., a nighttime lights index as a proxy for human development). </li><li><em>Historical meteorological time-series data</em>: Used to spin up the model for one year prior to the issue time of a forecast. 
The data come from <a href="https://gpm.nasa.gov/data/imerg">NASA IMERG</a>, <a href="https://psl.noaa.gov/data/gridded/data.cpc.globalprecip.html">NOAA CPC Global Unified Gauge-Based Analysis of Daily Precipitation</a>, and the <a href="https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-land?tab=overview">ECMWF ERA5-land reanalysis</a>. Variables include daily total precipitation, air temperature, solar and thermal radiation, snowfall, and surface pressure. </li><li><em>Forecasted meteorological time series over a seven-day forecast horizon</em>: Used as input for the forecast LSTM. These data are the same meteorological variables listed above, and come from the <a href="https://www.ecmwf.int/en/forecasts/datasets/set-i">ECMWF HRES atmospheric model</a>. </li> </ol> <p> Training data are daily streamflow values from the <a href="https://www.bafg.de/GRDC/EN/Home/homepage_node.html">Global Runoff Data Center</a> over the time period 1980–2023. A single streamflow forecast model is trained using data from 5,680 diverse watershed streamflow gauges (shown below) to improve <a href="https://eartharxiv.org/repository/view/6363/">accuracy</a>. 
</p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJZa8BMczHa_WiWNB1FJvPgEcw5O6U_IumoXBvI3gB_cIqrbte2SZKu_Msr1MudCVPv3YF6L3BweAC0hhMkET634isx6xzUswrYfDwp8oueoWJ7c3hf0os-RIsaNrdgAboc7HUly0rGtuBt6OVQ-MnY5P44DKOXSHKYl_T-gMz5z0ek8CHk0lIx45fnZYU/s1417/gauge_locations_map(1).jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="689" data-original-width="1417" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJZa8BMczHa_WiWNB1FJvPgEcw5O6U_IumoXBvI3gB_cIqrbte2SZKu_Msr1MudCVPv3YF6L3BweAC0hhMkET634isx6xzUswrYfDwp8oueoWJ7c3hf0os-RIsaNrdgAboc7HUly0rGtuBt6OVQ-MnY5P44DKOXSHKYl_T-gMz5z0ek8CHk0lIx45fnZYU/s16000/gauge_locations_map(1).jpg" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Location of 5,680 streamflow gauges that supply training data for the river forecast model from the <a href="https://www.bafg.de/GRDC/EN/Home/homepage_node.html">Global Runoff Data Center</a>.</td></tr></tbody></table> <br /> <div style="line-height: 40%;"> <br /> </div> <h2>Improving on the current state-of-the-art</h2> <p> We compared our river forecast model with <a href="https://www.globalfloods.eu/">GloFAS version 4</a>, the current state-of-the-art global flood forecasting system. These experiments showed that ML can provide accurate warnings earlier and over larger and more impactful events. </p> <p> The figure below shows the distribution of <a href="https://en.wikipedia.org/wiki/F-score">F1 scores</a> when predicting different severity events at river locations around the world, with plus or minus 1 day accuracy. 
The F1 score is the harmonic mean of precision and recall, and event severity is measured by <a href="https://en.wikipedia.org/wiki/Return_period#:~:text=A%20return%20period%2C%20also%20known,river%20discharge%20flows%20to%20occur.">return period</a>. For example, a 2-year return period event is a volume of streamflow that is expected to be exceeded on average once every two years. Our model achieves reliability scores at up to 4-day or 5-day lead times that are similar to or better, on average, than the reliability of GloFAS nowcasts (0-day lead time). </p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjwzwV6QYl4yIlWs1xdHz2HRiNi2I8WUaTGBVlVvA4guppIGpJ3RMj8ypE7chWz8sV5KJuS4dPe9PUd6TqWe46W8Yelga1Nq28Mts72zqJhLJXDgMjSa6VCHlb9ZH3eo8XETWSqj8lNraejCAezFpkGpfJrPIl4xMhRPHSdO1WX7bZmVSLDFMZOwMfarb5/s3908/Distributions%20of%20F1%20scores%20over%202-year%20.jpeg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1844" data-original-width="3908" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjwzwV6QYl4yIlWs1xdHz2HRiNi2I8WUaTGBVlVvA4guppIGpJ3RMj8ypE7chWz8sV5KJuS4dPe9PUd6TqWe46W8Yelga1Nq28Mts72zqJhLJXDgMjSa6VCHlb9ZH3eo8XETWSqj8lNraejCAezFpkGpfJrPIl4xMhRPHSdO1WX7bZmVSLDFMZOwMfarb5/s16000/Distributions%20of%20F1%20scores%20over%202-year%20.jpeg" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Distributions of <a href="https://en.wikipedia.org/wiki/F-score">F1 scores</a> over 2-year return period events in 2,092 watersheds globally during the time period 2014-2023 from GloFAS (<strong>blue</strong>) and our model (<strong>orange</strong>) at different lead times. 
On average, our model is statistically as accurate as GloFAS nowcasts (0–day lead time) up to 5 days in advance over 2-year (shown) and 1-year, 5-year, and 10-year events (not shown).</td></tr></tbody></table> <p> Additionally (not shown), our model achieves accuracies over larger and rarer extreme events, with precision and recall scores over 5-year return period events that are similar to or better than GloFAS accuracies over 1-year return period events. See the <a href="https://www.nature.com/articles/s41586-024-07145-1">paper</a> for more information. </p> <div style="line-height: 40%;"> <br /> </div> <h2>Looking into the future</h2> <p> The flood forecasting initiative is part of our <a href="https://blog.google/outreach-initiatives/sustainability/google-ai-climate-change-solutions/">Adaptation and Resilience efforts</a> and reflects Google's commitment&nbsp;<a href="https://research.google/teams/climate-and-sustainability/">to address climate change</a> while helping global communities become more resilient. We believe that AI and ML will continue to play a critical role in helping advance science and research towards climate action. </p> <p> We actively <a href="https://blog.google/outreach-initiatives/sustainability/4-flood-forecasting-collaboration-case-studies-show-how-ai-can-help-communities-in-need/">collaborate</a> with several international aid organizations (e.g., the Centre for Humanitarian Data and the Red Cross) to provide actionable flood forecasts. Additionally, in an ongoing collaboration with the <a href="https://wmo.int/">World Meteorological Organization</a> (WMO) to <a href="https://blog.google/outreach-initiatives/sustainability/early-warning-system-wmo-google/">support early warning systems</a> for climate hazards, we are conducting a study to help understand how AI can help address real-world challenges faced by national flood forecasting agencies. 
</p> <p> While the work presented here demonstrates a significant step forward in flood forecasting, future work is needed to further expand flood forecasting coverage to more locations globally and other types of flood-related events and disasters, including flash floods and urban floods. We are looking forward to continuing collaborations with our partners in the academic and expert communities, local governments and the industry to reach these goals. </p>
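<p> For readers who want to reproduce the kind of comparison described above, the event-based F1 score with a plus or minus 1 day tolerance can be sketched as follows. The matching rule used here (a predicted event counts as a hit if any observed event falls within the tolerance, and vice versa) is an assumption made for illustration; see the paper for the exact evaluation protocol. The function name and the event days, given as integer day indices, are hypothetical. </p>

```python
def f1_with_tolerance(observed_days, predicted_days, tolerance=1):
    """F1 score for event detection where timing may be off by `tolerance` days."""
    observed = sorted(set(observed_days))
    predicted = sorted(set(predicted_days))

    # Precision: share of predicted events that have an observed event nearby.
    predicted_hits = sum(
        any(abs(p - o) <= tolerance for o in observed) for p in predicted
    )
    # Recall: share of observed events that have a predicted event nearby.
    observed_hits = sum(
        any(abs(o - p) <= tolerance for p in predicted) for o in observed
    )

    precision = predicted_hits / len(predicted) if predicted else 0.0
    recall = observed_hits / len(observed) if observed else 0.0
    if precision + recall == 0.0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical days on which streamflow exceeded a 2-year return period threshold.
observed_events = [10, 50, 90]
predicted_events = [11, 52, 200]
print(f1_with_tolerance(observed_events, predicted_events))
```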

Mar 20, 2024

ScreenAI: A visual language model for UI and visually-situated language understanding

<span class="byline-author">Posted by Srinivas Sunkara and Gilles Baechler, Software Engineers, Google Research</span> <img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoXlMR7pAKRRnyKZT8C40i6mPX0KKNGT6AFNvFOFIhZ7BD0rXaU3NS_aqISTGq9S_d0zozgcO0HR_v3R6Msm4uUDkaBFsFVx-miaDL6L0UhSz1Is8_L_iFjtvNE5OX9HX98t92b3r-rLQfJG1RrzVW354NdVUlIJVRLdQ_l4dFYa1773J-tJligdvh7QsX/s320/ScreenAI%20-%20hero.jpeg" style="display: none;" /> <p> Screen user interfaces (UIs) and infographics, such as charts, diagrams and tables, play important roles in human communication and human-machine interaction as they facilitate rich and interactive user experiences. UIs and infographics share similar design principles and visual language (e.g., icons and layouts), which offers an opportunity to build a single model that can understand, reason, and interact with these interfaces. However, because of their complexity and varied presentation formats, infographics and UIs present a unique modeling challenge. </p> <a name='more'></a> <p> To that end, we introduce “<a href="https://arxiv.org/abs/2402.04615">ScreenAI: A Vision-Language Model for UI and Infographics Understanding</a>”. ScreenAI improves upon the <a href="https://arxiv.org/abs/2305.18565">PaLI architecture</a> with the flexible patching strategy from <a href="https://arxiv.org/abs/2210.03347">pix2struct</a>. We train ScreenAI on a unique mixture of datasets and tasks, including a novel Screen Annotation task that requires the model to identify UI element information (i.e., type, location and description) on a screen. These text annotations provide large language models (LLMs) with screen descriptions, enabling them to automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. 
At only 5B parameters, ScreenAI achieves state-of-the-art results on UI- and infographic-based tasks (<a href="https://x-lance.github.io/WebSRC/">WebSRC</a> and <a href="https://github.com/aburns4/MoTIF">MoTIF</a>), and best-in-class performance on <a href="https://github.com/vis-nlp/ChartQA">Chart QA</a>, <a href="https://rrc.cvc.uab.es/?ch=17&amp;com=evaluation&amp;task=1">DocVQA</a>, and <a href="https://arxiv.org/abs/2104.12756">InfographicVQA</a> compared to models of similar size. We are also releasing three new datasets: <a href="https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#screen-annotation-dataset-details">Screen Annotation</a> to evaluate the layout understanding capability of the model, as well as <a href="https://github.com/google-research-datasets/screen_qa/tree/main?tab=readme-ov-file#short_answers-directory">ScreenQA Short</a> and <a href="https://github.com/google-research-datasets/screen_qa?tab=readme-ov-file#complexqa" target="_blank">Complex ScreenQA</a> for a more comprehensive evaluation of its QA capability. </p> <div style="line-height: 40%;"> <br /> </div> <h2>ScreenAI</h2> <p> ScreenAI’s architecture is based on <a href="https://arxiv.org/abs/2209.06794">PaLI</a>, composed of a multimodal encoder block and an autoregressive decoder. The PaLI encoder uses a <a href="https://arxiv.org/abs/2010.11929">vision transformer</a> (ViT) that creates image embeddings and a multimodal encoder that takes the concatenation of the image and text embeddings as input. This flexible architecture allows ScreenAI to solve vision tasks that can be recast as text+image-to-text problems. </p> <p> On top of the PaLI architecture, we employ a flexible patching strategy introduced in pix2struct. Instead of using a fixed-grid pattern, the grid dimensions are selected such that they preserve the native aspect ratio of the input image. This enables ScreenAI to work well across images of various aspect ratios. 
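</p> <p> As an illustration of aspect-ratio-preserving patching, the sketch below picks a patch grid whose shape follows the input image while staying within a patch budget. It is written in the spirit of the pix2struct strategy rather than copied from ScreenAI; the function name <code>choose_patch_grid</code>, the 16-pixel patch size, and the budget of 1,024 patches are assumptions made for the example. </p>

```python
import math


def choose_patch_grid(height, width, patch_size=16, max_patches=1024):
    """Return (rows, cols) of a patch grid that keeps the image's aspect ratio.

    The image would then be resized to (rows * patch_size, cols * patch_size),
    so that rows * cols stays within max_patches.
    """
    # Scale factor at which the resized image would exactly fill the patch budget.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    rows = max(min(math.floor(scale * height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch_size), max_patches), 1)
    return rows, cols


# Hypothetical screenshot sizes given as (height, width) in pixels.
for name, (h, w) in {"phone": (2400, 1080), "desktop": (1080, 1920)}.items():
    rows, cols = choose_patch_grid(h, w)
    print(name, rows, cols, rows * cols)
```

<p> A tall phone screenshot therefore gets more rows than columns, and a wide desktop screenshot the reverse, instead of both being squashed into the same square grid. 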
</p> <p> The ScreenAI model is trained in two stages: a pre-training stage followed by a fine-tuning stage. First, self-supervised learning is applied to automatically generate data labels, which are then used to train ViT and the language model. ViT is frozen during the fine-tuning stage, where most data used is manually labeled by human raters. </p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjS1qatfLUw6BZZgkPxrv0Hx1pAPAehiF8q3kfA0BUyyPx4XXpwZRr75nYl99fTIQwLNmOHXhSBbpzHDnw6yQXZls1ZV-IE-d75jP5M02cRSZTYuU8FJBS4mubPzUPIuvcj_oqkEJcWtNWtnLmPZ3P1jJlDmc8GA1WNq00jUwl2o8gfLIIXlknrjy4z6y7Y/s1600/image6.gif" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="583" data-original-width="1600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjS1qatfLUw6BZZgkPxrv0Hx1pAPAehiF8q3kfA0BUyyPx4XXpwZRr75nYl99fTIQwLNmOHXhSBbpzHDnw6yQXZls1ZV-IE-d75jP5M02cRSZTYuU8FJBS4mubPzUPIuvcj_oqkEJcWtNWtnLmPZ3P1jJlDmc8GA1WNq00jUwl2o8gfLIIXlknrjy4z6y7Y/s16000/image6.gif" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">ScreenAI model architecture.</td></tr></tbody></table> <br /> <div style="line-height: 40%;"> <br /> </div> <h2>Data generation</h2> <p> To create a pre-training dataset for ScreenAI, we first compile an extensive collection of screenshots from various devices, including desktops, mobile, and tablets. This is achieved by using <a href="https://arxiv.org/abs/1910.10683" target="_blank">publicly accessible web pages</a> and following the programmatic exploration approach used for the <a href="https://dl.acm.org/doi/10.1145/3126594.3126651" target="_blank">RICO dataset</a> for mobile apps. 
We then apply a layout annotator, based on the <a href="https://arxiv.org/abs/2005.12872" target="_blank">DETR</a> model, that identifies and labels a wide range of UI elements (e.g., image, pictogram, button, text) and their spatial relationships. Pictograms undergo further analysis using an <a href="https://arxiv.org/abs/2210.02663" target="_blank">icon classifier</a> capable of distinguishing 77 different icon types. This detailed classification is essential for interpreting the subtle information conveyed through icons. For icons that are not covered by the classifier, and for infographics and images, we use the PaLI image captioning model to generate descriptive captions that provide contextual information. We also apply an <a href="https://cloud.google.com/use-cases/ocr" target="_blank">optical character recognition</a> (OCR) engine to extract and annotate textual content on screen. We combine the OCR text with the previous annotations to create a detailed description of each screen. 
</p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_wzxsb1U_PH17m3dG92ny7PpJjIYK39k1NQme1i5GM63tAd_OGdxMAV2_OQQVQSdkdyY1Tb3s8ibI2M3Kp1VpdNMsBr0ugBcBdL_r6dUwOwdfJfBMn3ae9Zl3zM2IpfZV654DFybMhMLimy0cuUNsnU5L8O2byu9eHmhdWcIvsb1t8AWi-tKNkXFq7Neo/s1747/image2.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1055" data-original-width="1747" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_wzxsb1U_PH17m3dG92ny7PpJjIYK39k1NQme1i5GM63tAd_OGdxMAV2_OQQVQSdkdyY1Tb3s8ibI2M3Kp1VpdNMsBr0ugBcBdL_r6dUwOwdfJfBMn3ae9Zl3zM2IpfZV654DFybMhMLimy0cuUNsnU5L8O2byu9eHmhdWcIvsb1t8AWi-tKNkXFq7Neo/s16000/image2.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">A mobile app screenshot with generated annotations that include UI elements and their descriptions, e.g., <code>TEXT</code> elements also contain the text content from OCR, <code>IMAGE</code> elements contain image captions, <code>LIST_ITEMs</code> contain all their child elements.</td></tr></tbody></table> <br /> <div style="line-height: 40%;"> <br /> </div> <h3>LLM-based data generation</h3> <p> We enhance the pre-training data's diversity using <a href="https://blog.google/technology/ai/google-palm-2-ai-large-language-model/">PaLM 2</a> to generate input-output pairs in a two-step process. First, screen annotations are generated using the technique outlined above, then we craft a prompt around this schema for the LLM to create synthetic data. This process requires prompt engineering and iterative refinement to find an effective prompt. We assess the generated data's quality through human validation against a quality threshold. 
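</p> <p> The second step can be sketched as follows: fill a prompt template with the screen schema, send it to the LLM, and keep only responses that parse into the expected structure. This is an illustrative sketch, not the pipeline used for ScreenAI; the template text is abridged, <code>build_prompt</code> and <code>parse_qa_pairs</code> are hypothetical helpers, the example schema and response are made up, and the LLM call itself is omitted. </p>

```python
import json

# Abridged template. Doubled braces are literal braces under str.format.
PROMPT_TEMPLATE = (
    "You only speak JSON. Do not write text that isn't JSON.\n"
    "You are given the following mobile screenshot, described in words. "
    "Can you generate {num_questions} questions regarding the content of the "
    "screenshot as well as the corresponding short answers to them?\n"
    "Your answer should be structured as follows:\n"
    "questions: [ {{question: the question, answer: the answer }}, ... ]\n"
    "{screen_schema}"
)


def build_prompt(screen_schema, num_questions=5):
    """Insert the textual screen description into the prompt template."""
    return PROMPT_TEMPLATE.format(
        num_questions=num_questions, screen_schema=screen_schema
    )


def parse_qa_pairs(response_text):
    """Return a list of (question, answer) pairs, or [] if the response is unusable."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    questions = payload.get("questions")
    if not isinstance(questions, list):
        return []
    pairs = []
    for item in questions:
        if not isinstance(item, dict):
            continue
        question, answer = item.get("question"), item.get("answer")
        if isinstance(question, str) and isinstance(answer, str) and question and answer:
            pairs.append((question.strip(), answer.strip()))
    return pairs


# Made-up screen schema and a made-up LLM response, for illustration only.
schema = 'TEXT "Opens at 9 AM" 0 100 0 50'
response = '{"questions": [{"question": "When does the restaurant open?", "answer": "9 AM"}]}'
print(build_prompt(schema))
print(parse_qa_pairs(response))
```

<p> Responses that fail to parse are simply dropped in this sketch; a human validation step like the one described above would still be needed to judge the quality of the pairs that remain. 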
</p> <br /> <pre class="prettyprint" style="margin-left: 40px; margin-right: 40px; white-space: pre-wrap;"><font color="#008000">You only speak JSON. Do not write text that isn’t JSON. You are given the following mobile screenshot, described in words. Can you generate 5 questions regarding the content of the screenshot as well as the corresponding short answers to them? The answer should be as short as possible, containing only the necessary information. Your answer should be structured as follows: questions: [ {{question: the question, answer: the answer }}, ... ] {THE SCREEN SCHEMA} </font></pre> <br /> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td class="tr-caption" style="text-align: center;">A sample prompt for QA data generation.</td></tr></tbody></table> <p> By combining the natural language capabilities of LLMs with a structured schema, we simulate a wide range of user interactions and scenarios to generate synthetic, realistic tasks. In particular, we generate three categories of tasks: </p> <ul> <li><strong>Question answering</strong>: The model is asked to answer questions regarding the content of the screenshots, e.g., “When does the restaurant open?” </li><li><strong>Screen navigation</strong>: The model is asked to convert a natural language utterance into an executable action on a screen, e.g., “Click the search button.” </li><li><strong>Screen summarization</strong>: The model is asked to summarize the screen content in one or two sentences. 
</li> </ul> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiinxXWrVJQr3tZJ4-o3ipkdJriUqTRbi2CFWor4I2SpyMiswx6uZOM2ZJW0gZC75MXYshkjXPABvDuSnhR44ceNwDpkvaSLa4R3v4C-hEsnHdEc-JUUx31zZmDHDDwhWaMDqnD0wo6ibt7qBZfaYN_yx1myH77k-ruO9fjd33SiLnP0jLnjOfmhdEHbsR7/s1398/image3.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1398" data-original-width="1272" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiinxXWrVJQr3tZJ4-o3ipkdJriUqTRbi2CFWor4I2SpyMiswx6uZOM2ZJW0gZC75MXYshkjXPABvDuSnhR44ceNwDpkvaSLa4R3v4C-hEsnHdEc-JUUx31zZmDHDDwhWaMDqnD0wo6ibt7qBZfaYN_yx1myH77k-ruO9fjd33SiLnP0jLnjOfmhdEHbsR7/s16000/image3.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Block diagram of our workflow for generating data for QA, summarization and navigation tasks using existing ScreenAI models and LLMs. Each task uses a custom prompt to emphasize desired aspects, like questions related to counting, involving reasoning, etc.</td></tr></tbody></table> <br /> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><img height="540" src="https://lh7-us.googleusercontent.com/LmUtXBMXK-zy_rMShHQ_Hk4vQeXu2Kpx8zfzjhE3uAREczbkbGTEjZ7OMTbqtB37lD4rF31xJsoWdVXNAXLbbM1Uc_01WZWmOfBg9RwyAUEToPpa1W38Pt117Zj5LrNfnxXqjXoAJDZd-zcAIgU4QSoBaAKsIrSi8_POI14F5hguN1NJL9a2RsrKg6WHz7w" style="margin-left: auto; margin-right: auto; margin-top: 0px;" width="705" /></td></tr><tr><td class="tr-caption" style="text-align: center;">LLM-generated data. Examples for screen QA, navigation and summarization. 
For navigation, the action bounding box is displayed in red on the screenshot.</td></tr></tbody></table> <br /> <div style="line-height: 40%;"> <br /> </div> <h2>Experiments and results</h2> <p> As previously mentioned, ScreenAI is trained in two stages: pre-training and fine-tuning. Pre-training data labels are obtained using self-supervised learning, and fine-tuning data labels come from human raters. </p> <p> We fine-tune ScreenAI using public QA, summarization, and navigation datasets and a variety of tasks related to UIs. For QA, we use well-established benchmarks in the multimodal and document understanding field, such as <a href="https://github.com/vis-nlp/ChartQA">ChartQA</a>, <a href="https://rrc.cvc.uab.es/?ch=17&amp;com=evaluation&amp;task=1">DocVQA</a>, <a href="https://rrc.cvc.uab.es/?ch=17&amp;com=tasks">Multi page DocVQA</a>, <a href="https://arxiv.org/abs/2104.12756">InfographicVQA</a>, <a href="https://ocr-vqa.github.io/">OCR-VQA</a>, <a href="https://x-lance.github.io/WebSRC/">WebSRC</a> and <a href="https://github.com/google-research-datasets/screen_qa">ScreenQA</a>. For navigation, datasets used include <a href="https://github.com/google-research-datasets/uibert/tree/main">Referring Expressions</a>, <a href="https://github.com/aburns4/MoTIF">MoTIF</a>, <a href="https://arxiv.org/abs/2209.15099">Mug</a>, and <a href="https://github.com/google-research/google-research/tree/master/android_in_the_wild">Android in the Wild</a>. Finally, we use <a href="https://github.com/google-research-datasets/screen2words">Screen2Words</a> for screen summarization and <a href="https://paperswithcode.com/paper/widget-captioning-generating-natural-language/review/">Widget Captioning</a> for describing specific UI elements. Along with the fine-tuning datasets, we evaluate the fine-tuned ScreenAI model using three novel benchmarks: </p> <ol> <li>Screen Annotation: Enables evaluation of the model's layout annotation and spatial understanding capabilities. 
</li><li>ScreenQA Short: A variation of ScreenQA, where its ground truth answers have been shortened to contain only the relevant information that better aligns with other QA tasks. </li><li>Complex ScreenQA: Complements ScreenQA Short with more difficult questions (counting, arithmetic, comparison, and non-answerable questions) and contains screens with various aspect ratios. </li> </ol> <p> The fine-tuned ScreenAI model achieves state-of-the-art results on various UI and infographic-based tasks (<a href="https://x-lance.github.io/WebSRC/">WebSRC</a> and <a href="https://github.com/aburns4/MoTIF">MoTIF</a>) and best-in-class performance on <a href="https://github.com/vis-nlp/ChartQA">Chart QA</a>, <a href="https://rrc.cvc.uab.es/?ch=17&amp;com=evaluation&amp;task=1">DocVQA</a>, and <a href="https://arxiv.org/abs/2104.12756">InfographicVQA</a> compared to models of similar size. ScreenAI achieves competitive performance on Screen2Words and OCR-VQA. Additionally, we report results on the new benchmark datasets introduced to serve as a baseline for further research. 
</p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijJAw824LdVbrFU3c7oerx9Ik86dWnuQ2NqliLpUZLp6U-9pDxZKsw10VSMfYOSwns-GWJRdSCj3UmyxytOZxfoM64psBSKCjLYa-3zkXDt8mGvFbNpydwS1Ya2dhDeYfihWL1mVCyTWIzdgfblxawoxukWW1vLLwfNWMNKQ64B8wUM5SlNKgegdGxXlr7/s1183/image2.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1137" data-original-width="1183" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijJAw824LdVbrFU3c7oerx9Ik86dWnuQ2NqliLpUZLp6U-9pDxZKsw10VSMfYOSwns-GWJRdSCj3UmyxytOZxfoM64psBSKCjLYa-3zkXDt8mGvFbNpydwS1Ya2dhDeYfihWL1mVCyTWIzdgfblxawoxukWW1vLLwfNWMNKQ64B8wUM5SlNKgegdGxXlr7/s16000/image2.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Comparing model performance of ScreenAI with state-of-the-art (SOTA) models of similar size.</td></tr></tbody></table> <p> Next, we examine ScreenAI’s scaling capabilities and observe that, across all tasks, increasing the model size improves performance, and the improvements have not saturated at the largest size. 
</p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKNMvTyz1RhM0wqgn7eAGB9Lev3YUhKhHrcAmJt3SB1Gi6ozIaxHoPzAj-bm6II-_91viG2FXrfNZiiwSSI_YNQGwKGyO6YkAW05Cfl9oys869f7DMyJcthlj6c0CLwzMAGP8HM9AmxdCK92d4PL2Ujz-tI4CZsQOlzlecMLgElWBjl9FZtj-zWIWata2k/s1999/image1.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="523" data-original-width="1999" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKNMvTyz1RhM0wqgn7eAGB9Lev3YUhKhHrcAmJt3SB1Gi6ozIaxHoPzAj-bm6II-_91viG2FXrfNZiiwSSI_YNQGwKGyO6YkAW05Cfl9oys869f7DMyJcthlj6c0CLwzMAGP8HM9AmxdCK92d4PL2Ujz-tI4CZsQOlzlecMLgElWBjl9FZtj-zWIWata2k/s16000/image1.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Model performance increases with size, and the performance has not saturated even at the largest size of 5B params.</td></tr></tbody></table> <br /> <div style="line-height: 40%;"> <br /> </div> <h2>Conclusion</h2> <p> We introduce the ScreenAI model along with a unified representation that enables us to develop self-supervised learning tasks leveraging data from all these domains. We also illustrate the impact of data generation using LLMs and investigate improving model performance on specific aspects by modifying the training mixture. We apply all of these techniques to build multi-task trained models that perform competitively with state-of-the-art approaches on a number of public benchmarks. However, we also note that our approach still lags behind large models and further research is needed to bridge this gap. </p> <div style="line-height: 40%;"> <br /> </div> <h2>Acknowledgements</h2> <p> <em>This project is the result of joint work with Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Carbune, Jason Lin, Jindong Chen and Abhanshu Sharma. 
We thank Fangyu Liu, Xi Chen, Efi Kokiopoulou, Jesse Berent, Gabriel Barcik, Lukas Zilka, Oriana Riva, Gang Li, Yang Li, Radu Soricut, and Tania Bedrax-Weiss for their insightful feedback and discussions, along with Rahul Aralikatte, Hao Cheng and Daniel Kim for their support in data preparation. We also thank Jay Yagnik, Blaise Aguera y Arcas, Ewa Dominowska, David Petrou, and Matt Sharifi for their leadership, vision and support. We are very grateful to Tom Small for helping us create the animation in this post.</em> </p>

Mar 19, 2024

SCIN: A new resource for representative dermatology images

<span class="byline-author">Posted by Pooja Rao, Research Scientist, Google Research</span> <img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_fSTMFxLAMHLJ0rw7OAddGSPMW2tRl8kmTr2mWiiJunKxB8ZflMJeWkBmB5IqCD2LvRoikpN7OYnZO3CdKpArGn32b4o-T8ZD6XCPxmUBtE1-sPBi6J05y5_UrfbWSMTjNpldKYzM3xjXoC0iWU7q_a7Ktfi2S1hVHLY8uq1986yp_pgEjQn3elNuSUbJ/s1600/SCINHero.png" style="display: none;" /> <p> Health datasets play a crucial role in research and medical education, but it can be challenging to create a dataset that represents the real world. For example, dermatology conditions are diverse in their appearance and severity and manifest differently across skin tones. Yet, existing dermatology image datasets often lack representation of everyday conditions (like rashes, allergies and infections) and skew towards lighter skin tones. Furthermore, race and ethnicity information is frequently missing, hindering our ability to assess disparities or create solutions. </p> <a name='more'></a> <p> To address these limitations, we are releasing the <a href="https://github.com/google-research-datasets/scin">Skin Condition Image Network (SCIN) dataset</a> in collaboration with physicians at <a href="https://med.stanford.edu/">Stanford Medicine</a>. We designed SCIN to reflect the broad range of concerns that people search for online, supplementing the types of conditions typically found in clinical datasets. It contains images across various skin tones and body parts, helping to ensure that future AI tools work effectively for all. We've made <a href="https://github.com/google-research-datasets/scin">the SCIN dataset</a> freely available as an open-access resource for researchers, educators, and developers, and have taken careful steps to protect contributor privacy. 
</p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-lvUDxsY1bC8xXeRFKGtdyRiCk25knKK3tKzW2dCVtfvzFMUYvM7laqOBS0yP6Dnur5Fd945gbC96OMoiJ2nvguO6uguDArYkvnLUz5glvPlNpI1THL_bctcQCGlR670V4szxkHlcdvAJbP7T8HS7U3ASnHh_sWhSxoKJSsLN-1IPUpysj5ErdHaduz5r/s1327/image1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1118" data-original-width="1327" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-lvUDxsY1bC8xXeRFKGtdyRiCk25knKK3tKzW2dCVtfvzFMUYvM7laqOBS0yP6Dnur5Fd945gbC96OMoiJ2nvguO6uguDArYkvnLUz5glvPlNpI1THL_bctcQCGlR670V4szxkHlcdvAJbP7T8HS7U3ASnHh_sWhSxoKJSsLN-1IPUpysj5ErdHaduz5r/s16000/image1.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Example set of images and metadata from the SCIN dataset.</td></tr></tbody></table> <div style="line-height:40%;"> <br> </div> <h2>Dataset composition</h2> <p> The SCIN dataset currently contains over 10,000 images of skin, nail, or hair conditions, directly contributed by individuals experiencing them. All contributions were made voluntarily with informed consent by individuals in the US, under an institutional-review board approved study. To provide context for retrospective dermatologist labeling, contributors were asked to take images both close-up and from slightly further away. They were given the option to self-report demographic information and <a href="https://en.wikipedia.org/wiki/Fitzpatrick_scale">tanning propensity</a> (self-reported Fitzpatrick Skin Type, i.e., sFST), and to describe the texture, duration and symptoms related to their concern. </p> <p> One to three dermatologists labeled each contribution with up to five dermatology conditions, along with a confidence score for each label. 
The SCIN dataset contains these individual labels, as well as an aggregated and weighted differential diagnosis derived from them that could be useful for model testing or training. These labels were assigned retrospectively and are not equivalent to a clinical diagnosis, but they allow us to compare the distribution of dermatology conditions in the SCIN dataset with existing datasets. </p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7oYE7nKEvgBaW6SEHfGFzCrhnKqX5w86_7ujHMbpMENOByxcUTgAzXJrZCgv6kbDVmTN8NmKSBBSvF4XkWKcKf5DT_b3A5D50ZpAr-93i3a69KUFOZy54diZxH_wcf1PeKdFlRbEe_OZODxS0N4ZrHSaiki8ZslUfFUatw4w-0p0zzD4GRwlqgmPLR6gw/s1851/image2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="775" data-original-width="1851" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7oYE7nKEvgBaW6SEHfGFzCrhnKqX5w86_7ujHMbpMENOByxcUTgAzXJrZCgv6kbDVmTN8NmKSBBSvF4XkWKcKf5DT_b3A5D50ZpAr-93i3a69KUFOZy54diZxH_wcf1PeKdFlRbEe_OZODxS0N4ZrHSaiki8ZslUfFUatw4w-0p0zzD4GRwlqgmPLR6gw/s16000/image2.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The SCIN dataset contains largely allergic, inflammatory and infectious conditions while datasets from clinical sources focus on benign and malignant <a href="https://en.wikipedia.org/wiki/Neoplasm">neoplasms</a>.</td></tr></tbody></table> <p> While many existing dermatology datasets focus on malignant and benign tumors and are intended to assist with skin cancer diagnosis, the SCIN dataset consists largely of common allergic, inflammatory, and infectious conditions. The majority of images in the SCIN dataset show early-stage concerns — more than half arose less than a week before the photo, and 30% arose less than a day before the image was taken. 
Conditions within this time window are seldom seen within the health system and therefore are underrepresented in existing dermatology datasets. </p> <p> We also obtained dermatologist estimates of Fitzpatrick Skin Type (estimated FST or eFST) and layperson labeler estimates of <a href="https://en.wikipedia.org/wiki/Monk_Skin_Tone_Scale">Monk Skin Tone</a> (eMST) for the images. This allowed comparison of the skin condition and skin type distributions to those in existing dermatology datasets. Although we did not selectively target any skin types or skin tones, the SCIN dataset has a balanced Fitzpatrick skin type distribution (with more of Types 3, 4, 5, and 6) compared to similar datasets from clinical sources. </p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNnhVt5yEHKdMsi-tMYH9Q9oITruBrlrrfaQk8oopWHBr1qq6lfPrZnLrav-y2w7i9vgptlNDw_xKX3J8W0fZ1NfU-cOeINXc6bgf2vHJL3bc-UCWA7T846QQHkTvob6QbB3sR0HbwI9Vms3oXtAZ_zbrd4w_eAKLTo5-obYoG3A2urPmiF7RS5GcgVRhH/s1851/image3.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="620" data-original-width="1851" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNnhVt5yEHKdMsi-tMYH9Q9oITruBrlrrfaQk8oopWHBr1qq6lfPrZnLrav-y2w7i9vgptlNDw_xKX3J8W0fZ1NfU-cOeINXc6bgf2vHJL3bc-UCWA7T846QQHkTvob6QbB3sR0HbwI9Vms3oXtAZ_zbrd4w_eAKLTo5-obYoG3A2urPmiF7RS5GcgVRhH/s16000/image3.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Self-reported and dermatologist-estimated Fitzpatrick Skin Type distribution in the SCIN dataset compared with existing un-enriched dermatology datasets <a href="https://github.com/mattgroh/fitzpatrick17k">(Fitzpatrick17k</a>, <a href="https://www.fc.up.pt/addi/ph2%20database.html">PH²</a>, <a 
href="https://www.it.pt/AutomaticPage?id=3459">SKINL2</a>, and <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7479321/">PAD-UFES-20</a>).</td></tr></tbody></table> <p> The <a href="https://en.wikipedia.org/wiki/Fitzpatrick_scale">Fitzpatrick Skin Type</a> scale was originally developed as a photo-typing scale to measure the response of skin types to UV radiation, and it is widely used in dermatology research. The Monk Skin Tone scale is a newer 10-shade scale that measures skin tone rather than skin phototype, capturing more nuanced differences between the darker skin tones. While neither scale was intended for retrospective estimation using images, we include these labels to enable future research into skin type and tone representation in dermatology. For example, the SCIN dataset provides an initial benchmark for the distribution of these skin types and tones in the US population. </p> <p> The SCIN dataset has a high representation of women and younger individuals, likely reflecting a combination of factors. These could include differences in skin condition incidence, propensity to seek health information online, and variations in willingness to contribute to research across demographics. </p> <div style="line-height:40%;"> <br> </div> <h2>Crowdsourcing method</h2> <p> To create the SCIN dataset, we used a novel crowdsourcing method, which we describe in the accompanying <a href="https://arxiv.org/abs/2402.18545">research paper</a> co-authored with investigators at <a href="https://med.stanford.edu/">Stanford Medicine</a>. This approach empowers individuals to play an active role in healthcare research. It allows us to reach people at earlier stages of their health concerns, potentially before they seek formal care. Crucially, this method uses advertisements on web search result pages, the starting point for many people’s health journeys, to connect with participants. 
</p> <p> Our results demonstrate that crowdsourcing can yield a high-quality dataset with a low spam rate. Over 97.5% of contributions were genuine images of skin conditions. After performing further filtering steps to exclude images that were out of scope for the SCIN dataset and to remove duplicates, we were able to release nearly 90% of the contributions received over the 8-month study period. Most images were sharp and well-exposed. Approximately half of the contributions include self-reported demographics, and 80% contain self-reported information relating to the skin condition, such as texture, duration, or other symptoms. We found that dermatologists’ ability to retrospectively assign a differential diagnosis depended more on the availability of self-reported information than on image quality. </p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1QMRINpok_qmh5jjtktgqytapBRfHWDFxLKffzY9L_jG8uE8oJXA7QwtGY76gPksw5EH0yLuO7Ihk3IitXQDCjQ54DXlxFtpClbIIZzZAb6fDufHR-aW1m81cAMBqxmPIZsN8p3VYlys8b9cczZOzI-VB9d1Nwzk8nCnPTSCDwwh1fmEf4Q8DRdJHo6dR/s1999/image4.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1610" data-original-width="1999" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1QMRINpok_qmh5jjtktgqytapBRfHWDFxLKffzY9L_jG8uE8oJXA7QwtGY76gPksw5EH0yLuO7Ihk3IitXQDCjQ54DXlxFtpClbIIZzZAb6fDufHR-aW1m81cAMBqxmPIZsN8p3VYlys8b9cczZOzI-VB9d1Nwzk8nCnPTSCDwwh1fmEf4Q8DRdJHo6dR/s16000/image4.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Dermatologist confidence in their labels (scale from 1-5) depended on the availability of self-reported demographic and symptom information.</td></tr></tbody></table> <p> While perfect image de-identification can never be guaranteed, protecting the 
privacy of individuals who contributed their images was a top priority when creating the SCIN dataset. Through informed consent, contributors were made aware of potential re-identification risks and advised to avoid uploading images with identifying features. Post-submission privacy protection measures included manual redaction or cropping to exclude potentially identifying areas, reverse image searches to exclude publicly available copies, and metadata removal or aggregation. The SCIN <a href="https://github.com/google-research-datasets/scin?tab=License-1-ov-file#readme">Data Use License</a> prohibits attempts to re-identify contributors. </p> <p> We hope the SCIN dataset will be a helpful resource for those working to advance inclusive dermatology research, education, and AI tool development. By demonstrating an alternative to traditional dataset creation methods, SCIN paves the way for more representative datasets in areas where self-reported data or retrospective labeling is feasible. </p> <div style="line-height:40%;"> <br> </div> <h2>Acknowledgements</h2> <p> <em>We are grateful to all our co-authors: Abbi Ward, Jimmy Li, Julie Wang, Sriram Lakshminarasimhan, Ashley Carrick, Bilson Campana, Jay Hartford, Pradeep Kumar S, Tiya Tiyasirisokchai, Sunny Virmani, Renee Wong, Yossi Matias, Greg S. Corrado, Dale R. Webster, Dawn Siegel (Stanford Medicine), Steven Lin (Stanford Medicine), Justin Ko (Stanford Medicine), Alan Karthikesalingam and Christopher Semturs. We also thank Yetunde Ibitoye, Sami Lachgar, Lisa Lehmann, Javier Perez, Margaret Ann Smith (Stanford Medicine), Rachelle Sico, Amit Talreja, Annisah Um’rani and Wayne Westerlind for their essential contributions to this work. Finally, we are grateful to Heather Cole-Lewis, Naama Hammel, Ivor Horn, Michael Howell, Yun Liu, and Eric Teasley for their insightful comments on the study design and manuscript. </em> </p>

Mar 19, 2024

MELON: Reconstructing 3D objects from images with unknown poses

<span class="byline-author">Posted by Mark Matthews, Senior Software Engineer, and Dmitry Lagun, Research Scientist, Google Research</span> <img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8LjCbKjfNXVUyCpGiZysx_pNF5BK8p5VBCJXXPaz_Bb75CW-33weoMh0YaNcn4AdmGN-Pufd_XlsRzo2MWZLQxqgtri7Nip9tXoGX0CritvRKF-63StOWxp_gVaY-MTnOk9IvJdVt_CczVR6Ip_R8Yv32MHTw2-FckCTF4UOFrgMyq3PCPCkZaZ-nyMcE/s320/MELON%20HERO.jpg" style="display: none;" /> <p> A person's prior experience and understanding of the world generally enables them to easily infer what an object looks like in whole, even if only looking at a few 2D pictures of it. Yet the capacity for a computer to reconstruct the shape of an object in 3D given only a few images has remained a difficult algorithmic problem for years. This fundamental computer vision task has applications ranging from the creation of e-commerce 3D models to autonomous vehicle navigation. </p> <a name='more'></a> <p> A key part of the problem is how to determine the exact positions from which images were taken, known as <em>pose inference</em>. If camera poses are known, a range of successful techniques — such as <a href="https://www.matthewtancik.com/nerf">neural radiance fields</a> (NeRF) or <a href="https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/">3D Gaussian Splatting</a> — can reconstruct an object in 3D. But if these poses are not available, then we face a difficult “chicken and egg” problem where we could determine the poses if we knew the 3D object, but we can’t reconstruct the 3D object until we know the camera poses. The problem is made harder by pseudo-symmetries — i.e., many objects look similar when viewed from different angles. For example, square objects like a chair tend to look similar every 90° rotation. Pseudo-symmetries of an object can be revealed by rendering it on a turntable from various angles and plotting its photometric <a href="https://en.wikipedia.org/wiki/Self-similarity">self-similarity</a> map. 
</p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjt0nP5M8f5UodttSIPoY5t0JRXEuLosGgock3B0lyOzIn4icGF5jwVuxgX0PiRqc0kBbJ36CLiGA3KPrmaQbjKElGeHrsSRmkpDppU9abE84nuYu9MquqE3gULDzz_INDutmL2i1Wv3_tUpTh5U9UwSck9YRUeVyg-md2GByg3EQYYy7Vs_aeTEk5akpSo/s1764/image5.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="923" data-original-width="1764" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjt0nP5M8f5UodttSIPoY5t0JRXEuLosGgock3B0lyOzIn4icGF5jwVuxgX0PiRqc0kBbJ36CLiGA3KPrmaQbjKElGeHrsSRmkpDppU9abE84nuYu9MquqE3gULDzz_INDutmL2i1Wv3_tUpTh5U9UwSck9YRUeVyg-md2GByg3EQYYy7Vs_aeTEk5akpSo/s16000/image5.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Self-Similarity map of a toy truck model. <strong>Left:</strong> The model is rendered on a turntable from various <a href="https://en.wikipedia.org/wiki/Azimuth">azimuthal angles</a>, θ. <strong>Right:</strong> The average <a href="https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm">L2</a> RGB similarity of a rendering from θ with that of θ*. The pseudo-similarities are indicated by the dashed red lines.</td></tr></tbody></table> <p> The diagram above only visualizes one dimension of rotation. It becomes even more complex (and difficult to visualize) when introducing more degrees of freedom. Pseudo-symmetries make the problem <em>ill-posed</em>, with naïve approaches often converging to local minima. In practice, such an approach might mistake the back view as the front view of an object, because they share a similar silhouette. 
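</p> <p> A self-similarity curve like the one in the figure above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: <code>toy_render</code> is a hypothetical stand-in for a real renderer that returns a three-pixel "image" whose first two pixels repeat every 180°, and the distance is a plain mean squared difference. </p>

```python
import math

def toy_render(theta_deg):
    """Hypothetical stand-in for a turntable renderer (not a real renderer).

    Returns a tiny 'image' as a list of floats. The first two pixels repeat
    every 180 degrees; the third breaks that symmetry slightly, mimicking an
    object whose front and back views look alike.
    """
    t = math.radians(theta_deg)
    return [abs(math.cos(t)), abs(math.sin(t)), 0.1 * math.cos(t)]

def l2_distance(img_a, img_b):
    """Mean squared per-pixel difference between two equally sized images."""
    return sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)

def self_similarity(reference_deg, step_deg=10):
    """Distance between the reference view and every turntable view."""
    ref = toy_render(reference_deg)
    return {theta: l2_distance(toy_render(theta), ref)
            for theta in range(0, 360, step_deg)}

def local_minima(curve):
    """Angles whose distance is lower than both circular neighbours."""
    angles = sorted(curve)
    n = len(angles)
    return [angles[i] for i in range(n)
            if curve[angles[i]] < curve[angles[i - 1]]
            and curve[angles[i]] < curve[angles[(i + 1) % n]]]

curve = self_similarity(reference_deg=0)
minima = local_minima(curve)
print(minima)  # -> [0, 180]: the true view plus one pseudo-symmetric view
```

<p> In this toy setup the view from 180° scores almost as well as the true view at 0°, which is the kind of spurious minimum that can trap a naïve pose optimizer. </p> <p> 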
Previous techniques (such as <a href="https://chenhsuanlin.bitbucket.io/bundle-adjusting-NeRF/">BARF</a> or <a href="https://arxiv.org/abs/2205.15768">SAMURAI</a>) side-step this problem by relying on an initial pose estimate that starts close to the global minimum. But how can we approach this if such estimates aren’t available? </p> <p> Methods such as <a href="https://openaccess.thecvf.com/content/ICCV2021/papers/Meng_GNeRF_GAN-Based_Neural_Radiance_Field_Without_Posed_Camera_ICCV_2021_paper.pdf">GNeRF</a> and <a href="https://dl.acm.org/doi/10.1145/3503161.3548078">VMRF</a> leverage <a href="https://en.wikipedia.org/wiki/Generative_adversarial_network">generative adversarial networks</a> (GANs) to overcome the problem. These techniques have the ability to artificially “amplify” a limited number of training views, aiding reconstruction. GAN techniques, however, often have complex, sometimes unstable, training processes, making robust and reliable convergence difficult to achieve in practice. A range of other successful methods, such as <a href="https://openaccess.thecvf.com/content/CVPR2023/html/Sinha_SparsePose_Sparse-View_Camera_Pose_Regression_and_Refinement_CVPR_2023_paper.html">SparsePose</a> or <a href="https://rust-paper.github.io/">RUST</a>, can infer poses from a limited number of views, but require pre-training on a large dataset of posed images, which isn’t always available, and can suffer from “domain-gap” issues when inferring poses for different types of images. </p> <p> In “<a href="https://arxiv.org/abs/2303.08096">MELON: NeRF with Unposed Images in SO(3)</a>”, spotlighted at <a href="https://3dvconf.github.io/2024/">3DV 2024</a>, we present a technique that can determine object-centric camera poses entirely from scratch while reconstructing the object in 3D. 
<a href="https://melon-nerf.github.io/">MELON</a> (Modulo Equivalent Latent Optimization of NeRF) is one of the first techniques that can do this without initial camera pose estimates, complex training schemes or pre-training on labeled data. MELON is a relatively simple technique that can easily be integrated into existing NeRF methods. We demonstrate that MELON can reconstruct a NeRF from unposed images with state-of-the-art accuracy while requiring as few as 4–6 images of an object. </p> <div style="line-height: 40%;"> <br /> </div> <h2>MELON</h2> <p> We leverage two key techniques to aid convergence of this ill-posed problem. The first is a very lightweight, dynamically trained <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">convolutional neural network</a> (CNN) encoder that regresses camera poses from training images. We pass a downscaled training image to a four-layer CNN that infers the camera pose. This CNN is initialized from noise and requires no pre-training. Its capacity is so small that it forces similar-looking images to map to similar poses, providing an implicit regularization that greatly aids convergence. </p> <p> The second technique is a <em>modulo loss</em> that simultaneously considers pseudo-symmetries of an object. We render the object from a fixed set of viewpoints for each training image, backpropagating the loss only through the view that best fits the training image. This effectively considers the plausibility of multiple views for each image. In practice, we find <em>N</em>=2 views (viewing an object from the other side) is all that’s required in most cases, but we sometimes get better results with <em>N</em>=4 for square objects. </p> <p> These two techniques are integrated into standard NeRF training, except that instead of fixed camera poses, poses are inferred by the CNN and duplicated by the modulo loss. Photometric gradients back-propagate through the best-fitting cameras into the CNN. 
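</p> <p> The modulo loss described above can be sketched roughly as follows. This is a minimal illustration, not the MELON implementation: it assumes the <em>N</em> replicas are spaced evenly in azimuth (consistent with <em>N</em>=2 meaning "the other side"), and <code>toy_render</code>, <code>photometric_l2</code> and <code>modulo_loss</code> are hypothetical names, with <code>toy_render</code> standing in for NeRF rendering. </p>

```python
import math

def toy_render(azimuth_deg):
    """Hypothetical stand-in for NeRF rendering at a given azimuth (degrees)."""
    t = math.radians(azimuth_deg)
    return [abs(math.cos(t)), abs(math.sin(t)), 0.1 * math.cos(t)]

def photometric_l2(rendered, target):
    """Mean squared colour error between a rendering and a training image."""
    return sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(target)

def modulo_loss(predicted_azimuth_deg, target_image, n_replicas=2):
    """Replicate the predicted pose around the azimuth circle (assumed evenly
    spaced) and keep only the smallest photometric error.

    Returns (loss, best_azimuth). In a real training loop only the
    best-fitting replica would receive gradients.
    """
    step = 360.0 / n_replicas
    candidates = [(predicted_azimuth_deg + k * step) % 360.0
                  for k in range(n_replicas)]
    losses = [photometric_l2(toy_render(a), target_image) for a in candidates]
    best = min(range(n_replicas), key=losses.__getitem__)
    return losses[best], candidates[best]

# The training image was taken from 20 degrees, but the encoder currently
# predicts a pose on roughly the opposite side of the object (195 degrees).
target = toy_render(20.0)
plain_loss = photometric_l2(toy_render(195.0), target)
loss, best_azimuth = modulo_loss(195.0, target, n_replicas=2)
print(best_azimuth)       # -> 15.0, the replica on the correct side
print(loss < plain_loss)  # -> True
```

<p> Because the loss is taken from the replica at 15°, a prediction that landed on the wrong side of the object is no longer pushed toward a pseudo-symmetric local minimum. </p> <p> 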
We observe that cameras generally converge quickly to globally optimal poses (see animation below). After training of the neural field, MELON can synthesize novel views using standard NeRF rendering methods. </p> <p> We simplify the problem by using the <a href="https://github.com/bmild/nerf">NeRF-Synthetic</a> dataset, a popular benchmark for NeRF research and common in the pose-inference literature. This synthetic dataset has cameras at precisely fixed distances and a consistent “up” orientation, requiring us to infer only the <a href="https://en.wikipedia.org/wiki/Spherical_coordinate_system">polar coordinates</a> of the camera. This is the same as an object at the center of a globe with a camera always pointing at it, moving along the surface. We then only need the latitude and longitude (2 degrees of freedom) to specify the camera pose. </p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEjisRopoeGPgbCRa3sQ7hmBUtnfI6TRapBD7Yn96xeDA_LxzTayiw3DMijPHS0ovkLVTcQGpp2_gAyA_P5BCPwXuEcz7lApC8WQbGfMvj_aAxShjgsmcklf_-4ekgbFH6VZ92Ey3Ta4XAhZvEdc00D2o7SzPIOSnFAj8CgrdmdJunijsGaw1Zx46b94wk/s1315/image4.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="395" data-original-width="1315" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEjisRopoeGPgbCRa3sQ7hmBUtnfI6TRapBD7Yn96xeDA_LxzTayiw3DMijPHS0ovkLVTcQGpp2_gAyA_P5BCPwXuEcz7lApC8WQbGfMvj_aAxShjgsmcklf_-4ekgbFH6VZ92Ey3Ta4XAhZvEdc00D2o7SzPIOSnFAj8CgrdmdJunijsGaw1Zx46b94wk/s16000/image4.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">MELON uses a dynamically trained lightweight CNN encoder that predicts a pose for each image. Predicted poses are replicated by the <em>modulo loss, </em>which only penalizes the smallest L2 distance from the ground truth color. 
At evaluation time, the neural field can be used to generate novel views.</td></tr></tbody></table> <br /> <div style="line-height: 40%;"> <br /> </div> <h2>Results</h2> <p> We compute two key metrics to evaluate MELON’s performance on the NeRF Synthetic dataset. The error in orientation between the ground truth and inferred poses can be quantified as a single angular error that we average across all training images, the pose error. We then test the accuracy of MELON’s rendered objects from novel views by measuring the <a href="https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio">peak signal-to-noise ratio</a> (PSNR) against held out test views. We see that MELON quickly converges to the approximate poses of most cameras within the first 1,000 steps of training, and achieves a competitive PSNR of 27.5 dB after 50k steps. </p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjU5wdw89PfwRbvZeaIWLM3rNEAo69__A-ovDwB5x8emIkAGZq05FgF-wDMNlkXPS6tOcC_0NJVD4Glq8eX02yb3CDIiqXbadI4lnvcZ_MI9sHUkz8risxP1orPA8ZnTZUq-PcRLPoEc_AmFuARCokXHQlTOv_q35TH1tivuK2PpA54hO7q7kh_M8ZynO-J/s960/image1.gif" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="480" data-original-width="960" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjU5wdw89PfwRbvZeaIWLM3rNEAo69__A-ovDwB5x8emIkAGZq05FgF-wDMNlkXPS6tOcC_0NJVD4Glq8eX02yb3CDIiqXbadI4lnvcZ_MI9sHUkz8risxP1orPA8ZnTZUq-PcRLPoEc_AmFuARCokXHQlTOv_q35TH1tivuK2PpA54hO7q7kh_M8ZynO-J/s16000/image1.gif" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Convergence of MELON on a toy truck model during optimization. <strong>Left</strong>: Rendering of the NeRF. 
<strong>Right</strong>: Polar plot of predicted (blue <em>x</em>), and ground truth (red dot) cameras.</td></tr></tbody></table> <p> MELON achieves similar results for other scenes in the NeRF Synthetic dataset. </p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWEC7CE_iWu1QZ_jgEUHHCEdqaUBMO7cK-1DZuHaZRDq4Y59_CriUlb_aOSJP5psB6Cbs1E41mm81EsfwVM0zAUojRKToWwiDmPfaWFPr2UGqf6F4n3P8ZpgYxiqyWIgst6op3Fhsbu0nlR727zLVV38KqJvNFY_KDeoJbdOjJFpHjLZkEd95Z9TqSg4R_/s1999/image2.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="644" data-original-width="1999" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWEC7CE_iWu1QZ_jgEUHHCEdqaUBMO7cK-1DZuHaZRDq4Y59_CriUlb_aOSJP5psB6Cbs1E41mm81EsfwVM0zAUojRKToWwiDmPfaWFPr2UGqf6F4n3P8ZpgYxiqyWIgst6op3Fhsbu0nlR727zLVV38KqJvNFY_KDeoJbdOjJFpHjLZkEd95Z9TqSg4R_/s16000/image2.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Reconstruction quality comparison between ground-truth (GT) and MELON on NeRF-Synthetic scenes after 100k training steps.</td></tr></tbody></table> <br /> <div style="line-height: 40%;"> <br /> </div> <h3>Noisy images</h3> <p> MELON also works well when performing <a href="https://en.wikipedia.org/wiki/View_synthesis">novel view synthesis</a> from extremely noisy, unposed images. We add varying amounts, <em>σ</em>, of <a href="https://en.wikipedia.org/wiki/Additive_white_Gaussian_noise">white Gaussian noise</a> to the training images. For example, the object in <em>σ</em>=1.0 below is impossible to make out, yet MELON can determine the pose and generate novel views of the object. 
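</p> <p> To give a feel for these noise levels, the sketch below adds white Gaussian noise to a stand-in image and reports the PSNR of the noisy input itself, using the same PSNR metric as in the results above. The pixel range of [0, 1], the unclipped noise and the particular <em>σ</em> values other than 1.0 are assumptions of this example, not details taken from the paper. </p>

```python
import math
import random

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean white Gaussian noise with standard deviation sigma.

    Pixel values are assumed to lie in [0, 1]; the result is left unclipped,
    which is an assumption of this sketch.
    """
    return [p + rng.gauss(0.0, sigma) for p in image]

def psnr(image, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((a - b) ** 2 for a, b in zip(image, reference)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

rng = random.Random(0)
clean = [rng.random() for _ in range(10_000)]  # stand-in for a clean view

# Illustrative noise levels; only sigma = 1.0 is mentioned in the text.
for sigma in (0.125, 0.25, 0.5, 1.0):
    noisy = add_gaussian_noise(clean, sigma, rng)
    print(f"sigma={sigma}: PSNR of noisy input = {psnr(noisy, clean):.1f} dB")
```

<p> Under these assumptions the mean squared error of the noisy input is close to <em>σ</em>², so at <em>σ</em>=1.0 the input PSNR is roughly 0 dB, meaning the noise is about as strong as the full pixel range. 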
</p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHKYFcj-CKKc5kUvfsoOD5rBTp2QMnd3CdYiVzXjMClNwJrcgSrvIZngAdLgxUthE-aiXx5NapxcMx66i-Bi9RhC0zTRVkA0R8fj2A7lOnIdFDIE3YkTh_hWO2PhPa0FjYWYHuNUuae_tPhsrmVHJAkCeeI1f0ooJGe44KgpcO7jVNyLcnUvwtMX-KpJdD/s1182/image1.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="568" data-original-width="1182" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHKYFcj-CKKc5kUvfsoOD5rBTp2QMnd3CdYiVzXjMClNwJrcgSrvIZngAdLgxUthE-aiXx5NapxcMx66i-Bi9RhC0zTRVkA0R8fj2A7lOnIdFDIE3YkTh_hWO2PhPa0FjYWYHuNUuae_tPhsrmVHJAkCeeI1f0ooJGe44KgpcO7jVNyLcnUvwtMX-KpJdD/s16000/image1.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Novel view synthesis from noisy unposed 128×128 images. Top: Example of noise level present in training views. Bottom: Reconstructed model from noisy training views and mean angular pose error.</td></tr></tbody></table> <p> This perhaps shouldn’t be too surprising, given that techniques like <a href="https://bmild.github.io/rawnerf/">RawNeRF</a> have demonstrated NeRF’s excellent de-noising capabilities with known camera poses. The fact that MELON works for noisy images of unknown camera poses so robustly was unexpected. </p> <div style="line-height: 40%;"> <br /> </div> <h2>Conclusion</h2> <p> We present MELON, a technique that can determine object-centric camera poses to reconstruct objects in 3D without the need for approximate pose initializations, complex GAN training schemes or pre-training on labeled data. MELON is a relatively simple technique that can easily be integrated into existing NeRF methods. Though we only demonstrated MELON on synthetic images we are adapting our technique to work in real world conditions. 
See the <a href="https://arxiv.org/abs/2303.08096">paper</a> and <a href="https://melon-nerf.github.io/">MELON site</a> to learn more. </p> <div style="line-height: 40%;"> <br /> </div> <h2>Acknowledgements</h2> <p> <em>We would like to thank our paper co-authors Axel Levy, Matan Sela, and Gordon Wetzstein, as well as Florian Schroff and Hartwig Adam for continuous help in building this technology. We also thank Matthew Brown, Ricardo Martin-Brualla and Frederic Poitevin for their helpful feedback on the paper draft. We also acknowledge the use of the computational resources at the SLAC Shared Scientific Data Facility (SDF).</em> </p>

Mar 18, 2024

HEAL: A framework for health equity assessment of machine learning performance

<span class="byline-author">Posted by Mike Schaekermann, Research Scientist, Google Research, and Ivor Horn, Chief Health Equity Officer &amp; Director, Google Core</span> <img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYi3V0CsXup8WA6SSjPagoMWfkIpbr9oRWEaUM1vIWOX8_TsZs6ikqOn6qIGbqUzAPhOxwhEPNfWSkECIxRz5fJ629cRGScLraFn2CSw53Sr5_li8Fe7A9I1nMShys_15IiUZNhNiPh_ueFVcu_7f34A-A0pMXXVdDaSoSAf2h0jETJ1PemIR5I6o9pIIW/s1600/HEAL-Hero.png" style="display: none;" /> <p> Health equity is a major societal concern worldwide, and disparities have many causes. These causes include limitations in access to healthcare, differences in clinical treatment, and even fundamental differences in diagnostic technology. In dermatology, for example, skin cancer outcomes are worse for populations such as minorities, those with lower socioeconomic status, or individuals with limited healthcare access. While recent advances in machine learning (ML) and artificial intelligence (AI) hold great promise for improving healthcare, the transition from research to bedside must be accompanied by a careful understanding of whether and how these technologies impact health equity. </p> <a name='more'></a> <p> <em>Health equity</em> is defined by public health organizations as fairness of opportunity for everyone to be as healthy as possible. Importantly, equity may be different from <em>equality</em>. For example, people with greater barriers to improving their health may require more or different effort to experience this fair opportunity. Similarly, equity is not <em>fairness</em> as defined in the AI for healthcare literature. Whereas AI fairness often strives for equal performance of the AI technology across different patient populations, this does not center the goal of prioritizing performance with respect to pre-existing health disparities. 
</p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi21VRS33NG-Imj1XlKXWtrwUrl4loEEywV0tO8M0JWtUFFksbTLOhilTZtMdJTgOBdXACUPQX-f5TMAFkABFhdv_cEDmFn4d-JirU78covJI32sHus6XQVJ1C1elwM_MExsQfeVCpFYlq9QZeynLNpLqmW8GqM-DKWiGSyi_18n8Xb3-8IeepHSyBZ6_2l/s1999/image2.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1999" data-original-width="1609" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi21VRS33NG-Imj1XlKXWtrwUrl4loEEywV0tO8M0JWtUFFksbTLOhilTZtMdJTgOBdXACUPQX-f5TMAFkABFhdv_cEDmFn4d-JirU78covJI32sHus6XQVJ1C1elwM_MExsQfeVCpFYlq9QZeynLNpLqmW8GqM-DKWiGSyi_18n8Xb3-8IeepHSyBZ6_2l/s16000/image2.jpg" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Health equity considerations. An intervention (e.g., an ML-based tool, indicated in dark blue) promotes health equity if it helps reduce existing disparities in health outcomes (indicated in lighter blue).</td></tr></tbody></table> <p> In “<a href="https://www.thelancet.com/journals/eclinm/article/PIIS2589-5370(24)00058-0/fulltext">Health Equity Assessment of machine Learning performance (HEAL): a framework and dermatology AI model case study</a>”, published in <a href="https://www.thelancet.com/journals/eclinm/home"><i>The Lancet eClinicalMedicine</i></a>, we propose a methodology to quantitatively assess whether ML-based health technologies perform equitably. In other words, does the ML model perform well for those with the worst health outcomes for the condition(s) the model is meant to address? 
This goal is anchored in the principle that health equity should prioritize and measure model performance with respect to disparate health outcomes, which may be due to a number of factors that include structural inequities (e.g., demographic, social, cultural, political, economic, environmental and geographic). </p> <br /> <h2>The health equity framework (HEAL)</h2> <p> The HEAL framework proposes a 4-step process to estimate the likelihood that an ML-based health technology performs equitably: </p> <ol> <li> Identify factors associated with health inequities and define tool performance metrics, </li> <li> Identify and quantify pre-existing health disparities, </li> <li> Measure the performance of the tool for each subpopulation, </li> <li> Measure the likelihood that the tool prioritizes performance with respect to health disparities. </li> </ol> <p> The final step’s output is termed the HEAL metric, which quantifies how anticorrelated the ML model’s performance is with health disparities. In other words, does the model perform better for populations that have worse health outcomes? </p> <p> This 4-step process is designed to inform improvements for making ML model performance more equitable, and is meant to be iterative and re-evaluated on a regular basis. For example, the availability of health outcomes data in step (2) can inform the choice of demographic factors and brackets in step (1), and the framework can be applied again with new datasets, models and populations. 
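</p> <p> Steps (3) and (4) can be sketched numerically. The example below is a hedged illustration rather than the published procedure: it assumes anticorrelation is measured as the negative Spearman rank correlation between subgroup health outcomes and subgroup performance, and that the likelihood is estimated as the fraction of bootstrap resamples with positive anticorrelation. All subgroup values and function names are hypothetical. </p>

```python
import random

def ranks(values):
    """Average ranks (1 = smallest), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5

def anticorrelation(outcome_scores, performance):
    """Negative Spearman correlation: positive when performance is higher
    for subgroups with worse health outcomes (lower outcome scores)."""
    return -pearson(ranks(outcome_scores), ranks(performance))

def heal_likelihood(outcome_scores, hits_by_group, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples with positive anticorrelation
    (an assumed way of turning step 4 into a likelihood)."""
    rng = random.Random(seed)
    positive = 0
    for _ in range(n_boot):
        perf = []
        for hits in hits_by_group:
            sample = [rng.choice(hits) for _ in hits]
            perf.append(sum(sample) / len(sample))
        positive += anticorrelation(outcome_scores, perf) > 0
    return positive / n_boot

# Hypothetical subgroups: higher outcome score = better pre-existing outcomes.
outcome_scores = [0.9, 0.7, 0.5, 0.3]
# Hypothetical per-case results (1 = model output judged correct).
hits_by_group = [[1] * 66 + [0] * 34, [1] * 72 + [0] * 28,
                 [1] * 78 + [0] * 22, [1] * 84 + [0] * 16]
point_performance = [sum(h) / len(h) for h in hits_by_group]

print(anticorrelation(outcome_scores, point_performance))
print(heal_likelihood(outcome_scores, hits_by_group))
```

<p> In this made-up example the model performs best for the subgroup with the worst outcomes, so the point anticorrelation is at its maximum and most bootstrap resamples agree. 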
</p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoGLCxn9QWS5QQpW39mJH1A_pw9wniWKIGGapN_gBC5WdxAWo4jHRS29GhNq7XBgNdZ867tMdP7TcszMz2WxUR4sYBFz0-dJ4cQZCODN2YFRjCP14QhNh_kMVGUdklbToOCYwHXV-UofhZdwZzDZudaVedOqvcC-QbW3LtMGb04FwFclbfzKHVUcqHodW_/s1999/image1.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1352" data-original-width="1999" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoGLCxn9QWS5QQpW39mJH1A_pw9wniWKIGGapN_gBC5WdxAWo4jHRS29GhNq7XBgNdZ867tMdP7TcszMz2WxUR4sYBFz0-dJ4cQZCODN2YFRjCP14QhNh_kMVGUdklbToOCYwHXV-UofhZdwZzDZudaVedOqvcC-QbW3LtMGb04FwFclbfzKHVUcqHodW_/s16000/image1.jpg" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Framework for Health Equity Assessment of machine Learning performance (HEAL).&nbsp;Our guiding principle is to avoid exacerbating health inequities, and these steps help us identify disparities and assess for inequitable model performance to move towards better outcomes for all.</td></tr></tbody></table> <p> With this work, we take a step towards encouraging explicit assessment of the health equity considerations of AI technologies, and encourage prioritization of efforts during model development to reduce health inequities for subpopulations exposed to structural inequities that can precipitate disparate outcomes. We should note that the present framework does not model causal relationships and, therefore, cannot quantify the actual impact a new technology will have on reducing health outcome disparities. However, the HEAL metric may help identify opportunities for improvement, where the current performance is not prioritized with respect to pre-existing health disparities. 
</p> <br /> <h2>Case study on a dermatology model</h2> <p> As an illustrative case study, we applied the framework to a dermatology model, which uses a convolutional neural network similar to that described in <a href="https://blog.research.google/2019/09/using-deep-learning-to-inform.html">prior work</a>. This example dermatology model was trained to classify 288 skin conditions using a development dataset of 29k cases. The input to the model consists of three photos of a skin concern along with demographic information and a brief structured medical history. The output consists of a ranked list of possible matching skin conditions. </p> <p> Using the HEAL framework, we evaluated this model by assessing whether it prioritized performance with respect to pre-existing health outcomes. The model was designed to predict possible dermatologic conditions (from a list of hundreds) based on photos of a skin concern and patient metadata. The model is evaluated using a top-3 agreement metric, which quantifies how often the top 3 output conditions include the most likely condition as suggested by a dermatologist panel. The HEAL metric is computed via the anticorrelation of this top-3 agreement with health outcome rankings. </p> <p> We used a dataset of 5,420 teledermatology cases, enriched for diversity in age, sex and race/ethnicity, to retrospectively evaluate the model’s HEAL metric. The dataset consisted of “store-and-forward” cases from patients aged 20 years or older from primary care providers in the USA and skin cancer clinics in Australia. Based on a review of the literature, we decided to explore race/ethnicity, sex and age as potential factors of inequity, and used sampling techniques to ensure that our evaluation dataset had sufficient representation of all race/ethnicity, sex and age groups. 
To quantify pre-existing health outcomes for each subgroup we relied on measurements from <a href="https://www.who.int/data/gho/data/themes/mortality-and-global-health-estimates/global-health-estimates-leading-causes-of-dalys">public</a> <a href="https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30925-9/fulltext">databases</a> endorsed by the World Health Organization, such as <a href="https://www.who.int/data/gho/indicator-metadata-registry/imr-details/4427">Years of Life Lost</a> (YLLs) and <a href="https://www.who.int/data/gho/indicator-metadata-registry/imr-details/158">Disability-Adjusted Life Years</a> (DALYs; years of life lost plus years lived with disability). </p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSS4J8AzS5iaHYvB7RyUVEDkx1ykrC7zOEAbUvjb8ZybZRZ0C71fRlJjPYBzGYVu9D3Ok0zRdz4MUdHMX6rOqnYKoHv91QNPw0TiqHJ6MKjtgn_UIqW-xoZeihO-A-ZrPgWT8bs-t9bSZWmMQ9AJaQh85BZWHH-T0KPWMx2unNO9HpTzYXiD_24gwNYWot/s1511/Table1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="602" data-original-width="1511" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSS4J8AzS5iaHYvB7RyUVEDkx1ykrC7zOEAbUvjb8ZybZRZ0C71fRlJjPYBzGYVu9D3Ok0zRdz4MUdHMX6rOqnYKoHv91QNPw0TiqHJ6MKjtgn_UIqW-xoZeihO-A-ZrPgWT8bs-t9bSZWmMQ9AJaQh85BZWHH-T0KPWMx2unNO9HpTzYXiD_24gwNYWot/s16000/Table1.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">HEAL metric for all dermatologic conditions across race/ethnicity subpopulations, including health outcomes (YLLs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance.<br />(* Higher is better; measures the likelihood the model performs equitably with respect to the axes in this table.)</td></tr></tbody></table> 
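<p> The anticorrelation computation described above can be sketched in a few lines of Python. This is a minimal sketch of one plausible reading, not the implementation from the paper: it assumes the metric is estimated by bootstrapping cases within each subgroup, recomputing top-3 agreement, and counting how often performance is higher where the health burden is higher. The function names (<code>average_ranks</code>, <code>spearman</code>, <code>heal_metric</code>), the number of resamples, the handling of ties, and the toy numbers are all illustrative assumptions; the paper gives the exact definition. </p>

```python
import random
from statistics import mean


def average_ranks(values):
    """Rank values ascending (1 = smallest), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation, i.e. Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: an all-tied ranking counts as no correlation
    return cov / (var_x * var_y) ** 0.5


def heal_metric(burden, hits, n_boot=2000, seed=0):
    """Share of bootstrap resamples in which performance tracks health burden.

    burden: {subgroup: health burden, e.g. YLLs per 100,000; higher = worse}
    hits:   {subgroup: list of 0/1 flags, 1 = top-3 agreement on a case}
    Performance rising with burden is the same as performance being
    anticorrelated with health outcomes (where a better outcome = lower burden).
    """
    rng = random.Random(seed)
    groups = sorted(burden)
    burdens = [burden[g] for g in groups]
    positive = 0
    for _ in range(n_boot):
        # Resample cases with replacement inside each subgroup.
        perf = [mean(rng.choices(hits[g], k=len(hits[g]))) for g in groups]
        if spearman(burdens, perf) > 0:
            positive += 1
    return positive / n_boot


# Hypothetical toy data (not from the study): subgroup A has the highest burden.
toy_burden = {"A": 120.0, "B": 80.0, "C": 40.0}
toy_hits = {
    "A": [1] * 90 + [0] * 10,
    "B": [1] * 80 + [0] * 20,
    "C": [1] * 70 + [0] * 30,
}
score = heal_metric(toy_burden, toy_hits)
```

<p> In this toy setup, a score close to 1 would suggest that the model tends to perform best for the subgroups with the worst outcomes, while a score close to 0 would suggest the opposite. Bootstrapping is used here because top-3 agreement is measured on a finite sample of cases, so the ranking of subgroups by performance is itself uncertain. </p>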
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMAQjyuGMXvzq4FxZg5Vhlgozwwnzza-QS-mjr3i0oOnDFIeqUGTrPxX2c7ssbpCZtLUoT2lpr8bXg_nJ3ToaaVe6Grge-HcWQl8SFy1gaBCoT-6ZHtFmQV4_S2sA6eOsdMFryegLjZFwOcPiqZDfFFItxqS96ysTZZn1OXVcbQSOG5WazZGjxSkNt9JQK/s1518/Table2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="316" data-original-width="1518" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMAQjyuGMXvzq4FxZg5Vhlgozwwnzza-QS-mjr3i0oOnDFIeqUGTrPxX2c7ssbpCZtLUoT2lpr8bXg_nJ3ToaaVe6Grge-HcWQl8SFy1gaBCoT-6ZHtFmQV4_S2sA6eOsdMFryegLjZFwOcPiqZDfFFItxqS96ysTZZn1OXVcbQSOG5WazZGjxSkNt9JQK/s16000/Table2.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">HEAL metric for all dermatologic conditions across sexes, including health outcomes (DALYs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance. (* As above.)</td></tr></tbody></table> <p> Our analysis estimated that the model was 80.5% likely to perform equitably across race/ethnicity subgroups and 92.1% likely to perform equitably across sexes. </p> <p> However, while the model was likely to perform equitably across age groups for cancer conditions specifically, we discovered that it had room for improvement across age groups for non-cancer conditions. For example, those 70+ have the poorest health outcomes related to non-cancer skin conditions, yet the model didn't prioritize performance for this subgroup. 
</p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4s5yfNQCksLIqP3kYuDXahUlOcJSCEtt-JkSTsecDft21uJ8JR0imnsPVGYHVQnc7OPo1WOkcwx2Yevu6su-rbqc1Fl6_NfzCKl0_vOvZA3PPnLkVWKFk7jHPJCm-x69MupVih_zct1YOXJVvSNUIsvn4rICk-_RWbOeuKj4HdRphBOakRXsiJ4lETJ_M/s1508/Table3.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="644" data-original-width="1508" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4s5yfNQCksLIqP3kYuDXahUlOcJSCEtt-JkSTsecDft21uJ8JR0imnsPVGYHVQnc7OPo1WOkcwx2Yevu6su-rbqc1Fl6_NfzCKl0_vOvZA3PPnLkVWKFk7jHPJCm-x69MupVih_zct1YOXJVvSNUIsvn4rICk-_RWbOeuKj4HdRphBOakRXsiJ4lETJ_M/s16000/Table3.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">HEAL metrics for all cancer and non-cancer dermatologic conditions across age groups, including health outcomes (DALYs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance. (* As above.)</td></tr></tbody></table> <br /> <h2>Putting things in context</h2> <p> For holistic evaluation, the HEAL metric cannot be employed in isolation. Instead this metric should be contextualized alongside many other factors ranging from computational efficiency and data privacy to ethical values, and aspects that may influence the results (e.g., selection bias or differences in representativeness of the evaluation data across demographic groups). </p> <p> As an adversarial example, the HEAL metric can be artificially improved by deliberately reducing model performance for the most advantaged subpopulation until performance for that subpopulation is worse than all others. 
For illustrative purposes, given subpopulations A and B where A has worse health outcomes than B, consider the choice between two models: Model 1 (M1) performs 5 percentage points better for subpopulation A than for subpopulation B. Model 2 (M2) performs 5 percentage points worse for subpopulation A than for subpopulation B. The HEAL metric would be higher for M1 because it prioritizes performance on a subpopulation with worse outcomes. However, M1 may have absolute performances of just 75% and 70% for subpopulations A and B respectively, while M2 has absolute performances of 75% and 80% for subpopulations A and B respectively. Choosing M1 over M2 would lead to worse overall performance, because subpopulation B is worse off while subpopulation A is no better off. </p> <p> Accordingly, the HEAL metric should be used alongside a <a href="https://en.wikipedia.org/wiki/Pareto_efficiency">Pareto condition</a> (discussed further in the paper), which restricts model changes so that outcomes for each subpopulation are either unchanged or improved compared to the status quo, and performance does not worsen for any subpopulation. </p> <p> The HEAL framework, in its current form, assesses the likelihood that an ML-based model prioritizes performance with respect to pre-existing health disparities for specific subpopulations. This differs from the goal of understanding whether ML will reduce disparities in outcomes across subpopulations in reality. Specifically, modeling improvements in outcomes requires a causal understanding of steps in the care journey that happen both before and after use of any given model. Future research is needed to address this gap. </p> <br /> <h2>Conclusion</h2> <p> The HEAL framework enables a quantitative assessment of the likelihood that health AI technologies prioritize performance with respect to health disparities. 
The case study demonstrates how to apply the framework in the dermatological domain, indicating a high likelihood that model performance is prioritized with respect to health disparities across sex and race/ethnicity, but also revealing the potential for improvements for non-cancer conditions across age. The case study also illustrates limitations in the ability to apply all recommended aspects of the framework (e.g., mapping societal context, availability of data), thus highlighting the complexity of health equity considerations of ML-based tools. </p> <p> This work is a proposed approach to address a grand challenge for AI and health equity, and may provide a useful evaluation framework not only during model development, but also during pre-implementation and real-world monitoring stages, e.g., in the form of health equity dashboards. We hold that the strength of the HEAL framework is in its future application to various AI tools and use cases and its refinement in the process. Finally, we acknowledge that a successful approach towards understanding the impact of AI technologies on health equity needs to be more than a set of metrics. It will require a set of goals agreed upon by a community that represents those who will be most impacted by a model. </p> <br /> <h2>Acknowledgements</h2> <p> <em>The research described here is joint work across many teams at Google. We are grateful to all our co-authors: Terry Spitz, Malcolm Pyles, Heather Cole-Lewis, Ellery Wulczyn, Stephen R. Pfohl, Donald Martin, Jr., Ronnachai Jaroensri, Geoff Keeling, Yuan Liu, Stephanie Farquhar, Qinghan Xue, Jenna Lester, Cían Hughes, Patricia Strachan, Fraser Tan, Peggy Bui, Craig H. Mermel, Lily H. Peng, Yossi Matias, Greg S. Corrado, Dale R. Webster, Sunny Virmani, Christopher Semturs, Yun Liu, and Po-Hsuan Cameron Chen. 
We also thank Lauren Winer, Sami Lachgar, Ting-An Lin, Aaron Loh, Morgan Du, Jenny Rizk, Renee Wong, Ashley Carrick, Preeti Singh, Annisah Um'rani, Jessica Schrouff, Alexander Brown, and Anna Iurchenko for their support of this project.</em> </p>
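<p> As a supplementary illustration of the Pareto condition discussed in "Putting things in context", the M1 versus M2 comparison can be written out as a small check. This is a minimal sketch using the illustrative 75%/70% and 75%/80% figures from that section; the function names (<code>performance_gap</code>, <code>subgroups_made_worse</code>) are hypothetical and are not taken from the paper. </p>

```python
def performance_gap(perf, worse_outcome_group, better_outcome_group):
    """Performance for the group with worse health outcomes minus the other group.

    A positive gap means the model prioritizes the group with worse outcomes.
    """
    return perf[worse_outcome_group] - perf[better_outcome_group]


def subgroups_made_worse(status_quo, candidate):
    """Subgroups whose performance would drop if we switched to the candidate."""
    return [g for g in status_quo if candidate[g] < status_quo[g]]


# Illustrative numbers from the text: A has worse health outcomes than B.
m1 = {"A": 0.75, "B": 0.70}
m2 = {"A": 0.75, "B": 0.80}

gap_m1 = performance_gap(m1, "A", "B")  # positive: M1 favors subpopulation A
gap_m2 = performance_gap(m2, "A", "B")  # negative: M2 favors subpopulation B

# Switching from M2 to M1 improves the gap but breaks the Pareto condition,
# because subpopulation B loses performance and A gains nothing.
dropped = subgroups_made_worse(status_quo=m2, candidate=m1)

# Switching from M1 to M2 makes no subgroup worse off.
dropped_reverse = subgroups_made_worse(status_quo=m1, candidate=m2)
```

<p> A check of this kind could sit alongside the HEAL metric so that a higher score is only accepted when no subpopulation ends up with lower performance than before. </p>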

Mar 15, 2024