Data Rescue

Efforts to keep data from going dark pay dividends in understanding the present and learning from the past

On a late-February Saturday morning, more than 20 volunteers filled an upstairs classroom in the University of Colorado Boulder’s Wolf Law Building with gentle keyboard clicks, mouse clacks, and collaborative chatter about uploads, IP addresses, and large files. Their charge: data rescue.

The event launched Data Rescue Boulder, one of more than 30 hackathons that have popped up in cities across the United States and Canada. Functioning as a member of the Data Refuge Project and coordinating with the Environmental Data and Governance Initiative (EDGI), Data Rescue Boulder is focused on harvesting and archiving scientific datasets from “at-risk” federal sites.

At-risk data is data that’s currently available but may be threatened or vulnerable to suddenly going missing or “dark.” In some cases data may be at risk as a result of censorship; in others, looming budget cuts could mean that data storage is deprioritized and the data is lost.

“Sometimes it’s neglect, sometimes it’s malfeasance,” says Dave Gallaher, who works for the National Snow and Ice Data Center (NSIDC) in Boulder and serves as chair of the Research Data Alliance, an international group focused on enabling open data sharing. “For whatever reason, that [at-risk] data is now not safe.”

Data Rescue Boulder volunteers were driven by national politics and changes seen and tracked on federal websites. EDGI members monitor changes to federal agency websites on a daily basis and have sounded the alarm on deleted language around climate change, along with various rules, reports and datasets. In May 2017, the Washington Post reported that as of February 2017 there were 195,245 public datasets available on data.gov—that number dropped to 156,000 in late April before rising to 192,648 the week of May 14. And though watchdogs are concerned about information being withheld from the public, experts say the decrease may reflect the consolidation of datasets, while federal officials say the disappearance of datasets was due to a glitch that happened during work on metadata.
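To put the reported data.gov counts in proportion, the short sketch below computes the relative change between them. The counts are the figures quoted above (the late-April number is rounded in the source); the function and variable names are illustrative only.

```python
def percent_change(old, new):
    """Relative change from old to new, expressed as a percentage."""
    if old == 0:
        raise ValueError("old count must be nonzero")
    return (new - old) / old * 100.0


# Dataset counts as reported in the article; the late-April figure is rounded.
feb_count = 195_245
april_count = 156_000
may_count = 192_648

drop_pct = percent_change(feb_count, april_count)
net_pct = percent_change(feb_count, may_count)

print(f"February to late April: {drop_pct:.1f}%")
print(f"February to mid-May: {net_pct:.1f}%")
```

By this arithmetic, the late-April count was about 20 percent below the February figure, and the mid-May count was about 1 percent below it.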

In their efforts at Data Rescue Boulder, volunteers targeted information compiled by the U.S. Environmental Protection Agency, National Oceanic and Atmospheric Administration, National Institutes of Health, and others. Over the two weeks following the rescue event, volunteers collected and republished all 2.65 million StreamCat datasets on small U.S. streams and waterways, plus millions of other datasets on climate, water, air quality, health and more. That put volunteers on track to create a refuge for more than 1 billion scientific records, according to Joan Saez, an executive at the Colorado data company Cloud BIRST, who spearheaded and sponsored the Boulder event. As of May 1, 2017, EDGI volunteers and members across the country had “seeded,” or marked, 63,076 webpages so that a webcrawler could copy the material on those pages and their subdomains.

EDGI was founded only in November 2016, and this public rescue effort, driven by fear that scientific data will be lost as a result of decisions made by the new administration, was especially large. But data archiving projects aren’t unusual. Whenever the federal administration changes hands, libraries conduct end-of-term harvests in which they cooperate with the Government Publishing Office to copy the outgoing administration’s websites. Meanwhile, others like Gallaher keep a steady focus on rescuing at-risk and dark data around the world.

“All kinds of people have all kinds of data,” Gallaher says, including old at-risk data. “Things like weather records or streamflow gauge records from all over Africa. I’ve seen pictures of storage warehouses that are just full of paper records literally being chewed on by rats,” he says. Old British ship records, too, offer temperature, precipitation and barometric pressure data from more than 300 years ago, but the paper is disintegrating. The same is true of old film. And while other media may not disintegrate, some has been dumped because it was too expensive to store. Holding data for the sake of hoarding information isn’t helpful, but with historical records may come insights into problems we face today and a better ability to map out the future.

For almost a decade Gallaher, alongside his team at NSIDC, has been shedding light on dark data. The team’s greatest victory so far has been the recovery, reprocessing and digitizing of the National Aeronautics and Space Administration’s (NASA) data from the Nimbus I, II and III missions, weather satellites that flew in 1964, 1966 and 1969. Their data recovery effort, which extended the polar sea ice record back from 1979 to the 1960s, was recognized when the team won the 2016 International Data Rescue Award in Geosciences. An extended sea ice record means more accurate climate modeling and a better understanding of how sea ice has changed over the past 50 years.

Bringing this dark data to light wasn’t easy. After the Nimbus missions were complete, spools of the data along with reels of film containing hundreds of thousands of time-stamped images were boxed up. The information wasn’t lost, but it wasn’t easy to access—the data went dark.

Nimbus III, photographed in 1967 at the Goddard Space Flight Center, flew in 1969 to collect meteorological data. Over the past decade, the National Snow and Ice Data Center recovered and digitized information from the Nimbus I, II, and III missions, using that data to extend the polar sea ice record. Photo courtesy of NASA’s Goddard Space Flight Center.

In 2009, when Gallaher heard that NASA scientists had successfully recovered high-resolution infrared radiometer images from the Nimbus missions, he requested them. Gallaher found that the data was stored in boxes, the boxes were full of canisters, and each canister contained about 200 feet of film. None of it had been opened for at least 40 years. Using the data would mean scanning and looking at more than 250,000 individual frames.

And there were other roadblocks. To determine what the satellite was looking at, Gallaher’s team had to piece the time-stamped images together with Nimbus orbital data that was stored by the North American Aerospace Defense Command. Then there were the clouds. Nimbus was designed to look at weather, so the team had to craft special techniques to see past the storm clouds in the images, finding the darkest pixel in each photo to indicate ocean and ice, in order to mark the ice edge.
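The idea of using dark pixels to see past bright clouds can be illustrated with a generic minimum-brightness composite: for each location, keep the darkest value observed across several aligned frames of the same area, since clouds are bright and move between frames. The sketch below is only an illustration of that general idea, not the NSIDC team’s actual processing. The function names, the tiny example frames, and the brightness threshold are all hypothetical.

```python
def min_brightness_composite(frames):
    """Return the per-pixel minimum across aligned grayscale frames.

    Each frame is a list of rows; all frames must share one shape.
    """
    if not frames:
        raise ValueError("need at least one frame")
    rows, cols = len(frames[0]), len(frames[0][0])
    for frame in frames:
        if len(frame) != rows or any(len(row) != cols for row in frame):
            raise ValueError("frames must share one shape")
    return [
        [min(frame[r][c] for frame in frames) for c in range(cols)]
        for r in range(rows)
    ]


def ice_edge_columns(composite, threshold):
    """For each row, return the first column at or above threshold, else None."""
    return [
        next((c for c, value in enumerate(row) if value >= threshold), None)
        for row in composite
    ]


# Hypothetical brightness values: ocean ~20, ice ~180, cloud ~250.
frame_a = [[250, 20, 180, 180],
           [20, 20, 250, 180]]
frame_b = [[20, 250, 180, 250],
           [20, 250, 20, 180]]

composite = min_brightness_composite([frame_a, frame_b])
edges = ice_edge_columns(composite, threshold=100)
print(composite)
print(edges)
```

In this toy example the cloud values disappear from the composite because each cloudy pixel is clear in the other frame, leaving a dark ocean region next to a brighter ice region whose boundary can then be located row by row.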

Still, the team got it done. Now that the imagery is published and public, Gallaher is on to a new project looking at old Environmental Science Services Administration data with the hope that its operational weather satellites, which began taking daily snapshots of the Earth in 1965, can fill in the Nimbus data gaps and help evaluate how often drought occurred over the study period. With a more complete and continuous record, researchers will be able to make better connections between polar ice changes and shifts in climate and sea levels over time.

Meanwhile, the Research Data Alliance is looking at the big questions of data sustainability, such as how to create and store data so it will last. Although the alliance has developed standards, no format or medium is guaranteed, or even likely, to remain relevant in another 10, 100 or 1,000 years. Regardless, it’s a question worth grappling with, Gallaher says. “It’s the basis of every science. If you lose your own historic data, what are you doing?”