Everyone working from home this year has figured out their own ways to stay focused. For Chris Lunt, it’s a squat red sphere he keeps on his desk, painted to depict the furrowed brow of Bodhidharma, the founder of Zen Buddhism — a Daruma doll.
“It’s a doll you buy when you take on a difficult task,” explained Lunt, chief technology officer for the United States’ All of Us research program. The hollow figurine comes painted with two white, unblinking eyes: one to be filled in when the challenge begins, and the other once it has been completed.
“I’ve had this now for four years,” said Lunt, holding up his one-eyed Daruma to the camera in a recent video call. That’s when All of Us, a $1.5 billion federal initiative to collect health data from 1 million Americans of diverse backgrounds, got underway. It’s been heralded as an ambitious bet to broaden the reach of precision medicine. But the technology being developed in lockstep with All of Us also stands to democratize research.
“All this work was to create new science,” said Lunt. “We want to bring this research to an audience who’s been denied access. That is a participant problem, but it’s also a researcher problem.”
Over the last four years, Lunt has worked with partners at the Broad Institute, Vanderbilt University Medical Center, and Alphabet’s life science company Verily to build Researcher Workbench, a cloud-based analytics platform opened up to approved researchers around the country in recent months. As more researchers and more data enter its digital walls, the tool — and its back end, an open-source system called Terra — will be put to the test.
Big datasets like those All of Us is building — which include health records, imaging data, and one day, fully sequenced genomes — are a gold mine of insights about medicine. They’re also growing far more common as scientists analyze unprecedented amounts of biological data, from the microbes in our guts and proteins our cells express to the pixels in our MRIs and the mess of information in our medical records.
“When I started Google Genomics back in 2013,” said David Glazer, now an engineering director at Verily, “I used my standard warmup joke: ‘What’s the definition of big data to a biologist?’ And the answer was, ‘It doesn’t fit in Excel.’ The thing that I didn’t really realize is in 2013 it was a massive new achievement for biology to be able to generate enough data to not fit in a spreadsheet.”
But tapping into multiple big datasets quickly becomes unwieldy, and relatively few researchers have had the technical infrastructure, money, and state-of-the-art security to do so.
In 2017, Glazer and colleagues at four academic institutions published a Medium post laying out some of the biggest challenges, including accessibility. Together, they proposed a framework to enable a new era of biomedical analysis: a “data biosphere.” It would let researchers store, share, organize, and analyze data at scale, and would be built under open-source licenses and to standards for responsible biological data sharing.
That manifesto ultimately manifested in Terra, a cloud-based platform launched in 2017 by the Broad Institute and Verily.
“The questions researchers are asking increasingly require combining multiple datasets with different kinds of data,” said Clare Bernard, a senior director at the Broad Institute who works on the Terra platform, which currently has more than 15,000 users. Terra allows those datasets to live on one platform, stored a single time in the cloud — cutting down on the cost and security risk of keeping many local copies.
But the promise of Terra’s platform isn’t simply in those technical improvements. It’s in how they will directly affect the people using the system.
The data sandbox
Most critically, Terra built a massive sandbox to play with all that diverse data. It supports analysis in Python and R, common languages in bioinformatics research, and builds in some of the field’s favorite applications, including Jupyter Notebook, RStudio, and Galaxy. It also lets researchers save workflows, a standardized way of compiling all the computational steps that go into a piece of scientific analysis.
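To make that concrete, here is a minimal sketch of the kind of notebook-style analysis a researcher might capture as a reusable workflow. The cohort data below is invented stand-in data for illustration; on a real platform, it would instead be pulled from the hosted dataset.

```python
import pandas as pd

# Stand-in cohort table (hypothetical values, not real participant data).
cohort = pd.DataFrame({
    "participant_id": [101, 102, 103, 104],
    "condition": ["diabetes", "diabetes", "control", "control"],
    "hba1c": [8.1, 7.4, 5.2, 5.5],
})

# Step 1: filter to the subgroup of interest.
cases = cohort[cohort["condition"] == "diabetes"]

# Step 2: compute a summary statistic for that subgroup.
mean_hba1c = cases["hba1c"].mean()
print(f"Mean HbA1c among cases: {mean_hba1c:.2f}")  # → 7.75
```

Saving the filtering and summary steps together is what makes the analysis reproducible: another scientist can rerun the same sequence against the shared data rather than reconstructing it from a methods section.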
“Alongside the growth in genomic data and the growth in diversity of data is the increase in collaboration, both between different labs and between people with different functions,” said Bernard. “Data generators, tool developers, and researchers.”
Terra’s centralized workflows will allow many more types of scientists to work together.
Consider, for example, a researcher who wants to dive into the data All of Us has collected on patients with diabetes. “What the platform does very nicely is let workspaces organize around teams,” said Andrea Ramirez, an endocrinologist at Vanderbilt University Medical Center and All of Us contributor. “So when you have a principal investigator who’s the world expert in monogenic diabetes and he cannot write Python, that’s fine. You can work with a bioinformatician who understands the data model and have a lovely team environment.”
That collaboration doesn’t just happen within teams; it can exist with anyone else on the platform. On both Terra and the All of Us front end, users can make their analytical workflows public — to show their work. Other scientists can then reproduce their code-based analysis, a critical step in the scientific process. It’s also an extremely difficult one to execute when data analysis is run on a local computer, using whatever software a particular team is used to.
So Terra points toward something bigger than enabling big data analysis for biology: It could help democratize science by inviting more and more diverse researchers to play in the sandbox.
For All of Us, fulfilling the mandate of democratization is especially important. With that goal in mind, the Researcher Workbench uses a “data passport” that allows researchers to gain access to all the project’s data, not just what’s needed for a specific study. And it includes graphical interfaces that make it easier for researchers with less engineering experience to use the platform.
The lower cost of cloud computing can open up the field to more researchers, too. On Terra, users pay only for the cloud data storage and computing resources they use (after using up $300 of free credits); All of Us is letting researchers run analysis for free while the Researcher Workbench is still in beta mode. Once those promos run out, “the average analyses we’ve done with months of work have been less than $15 per project,” said Ramirez, who was an author on a preprint in June validating early data from All of Us through the workbench. “A postdoc in another analysis tried a huge machine learning project and couldn’t even get the cost over $50.”
Genomics as the next frontier
The democratized costs of cloud computing aren’t a given, though. The bargain-basement prices Ramirez is citing? Those are based on analyses of phenotypic data — the clinical information from EHRs and surveys, plus physical measurements collected at visits with All of Us participants.
“That’s a huge number of data points, but a very small number of bytes,” said Glazer. Soon, though, researchers will be able to analyze genomic datasets on the scale of petabytes. All of Us only began releasing genetic results to its participants in December; by the end of 2021, according to Lunt, it plans to release some of those sequences, as many as six figures’ worth, to the Researcher Workbench.
With datasets of that size, “you could very quickly run up extremely high cloud bills without efficient search and pull strategies,” said Ramirez. So the Broad Institute and its partners on Terra have been working to store genomic data as efficiently as possible. At the same time, All of Us is building front-end tools to help researchers make the most of that data.
“To find a variant in their gene of interest, they can use a drop-down menu in a graphical user interface,” said Ramirez. “This is the first time at this scale user-centric tools are trying to be built.”
These custom features will be critical to keeping All of Us’s genetic data as accessible as possible. But they also gesture to the needs of different kinds of users — ones who don’t necessarily want to share their work, and instead want to use Terra for proprietary research.
It’s those users that Terra is poised to serve in the coming year. Since its launch, Terra has run on Google Cloud. But in January, the platform announced a multiyear partnership with Microsoft, which has a strong network of health industry clients using its services, including its own Azure cloud. In the future, Terra will support work on both clouds — and in so doing, will likely pull in a fair number of Microsoft’s clients.
When it announced the Microsoft partnership, Terra quietly put out a call for enterprise clients. A single line in its press release leads to a Google form where users can sign up to be a “Trusted Tester for early access to new commercial offerings.” Terra’s site suggests early access to those products will open up in the second quarter of 2021.
“Fundamentally, Terra will remain an open platform,” said Greg Moore, a Microsoft corporate vice president who founded Google Cloud’s health care division. “It’s really allowing researchers to bring proprietary datasets into the Terra platform that will allow the acceleration of biomedical research.” That will include the practical aspects of making it easy for them to load that data in, along with all the beefed up trust, security, and compliance capabilities that corporate entities expect.
While a commercial offering will close some doors within Terra, the platform’s expansion to Microsoft’s cloud may enable democratization in a different way. “Terra is a very global offering,” said Bernard, with thousands of researchers a month connecting from around the world and hosting datasets on the platform. Microsoft, with its internationally distributed Azure data centers, could make it even easier for the global research community to work together.
For Lunt, those kinds of collaborations are what he’s kept in mind while working to build Researcher Workbench. And he’s starting to see the project’s goals come to fruition.
In late January, researchers from the University of California, San Diego published results online in the American Journal of Ophthalmology validating a machine learning model that predicts the need for surgery in patients with glaucoma. Then, they retrained it with new electronic health record data from All of Us participants, which the authors said offered a “greater diversity of participants compared to our cohort for the original model, which was derived from a single academic center.” The retrained model performed notably better than the original.
It is the first peer-reviewed publication using data from the Researcher Workbench.
“I said when I started the program, I can fill in the other eye when the first paper comes out,” Lunt said, holding up his desktop Daruma. Now, it will finally have its depth perception.