What is an archive and why do we need archives?
Originally developed in 2018 by L. M. Rebull for the NASA Open Data website.
Lots of the issues I discuss here are also in the February 05, 2017 episode of the Spacepod podcast.
Why are there archives of NASA's astronomy data? The main reason is, of course, that a lot of blood, sweat, tears, and money have gone into acquiring those data, and we don't want to lose the data! But today's post is on the wider question of why we have archives of astronomy data.
Why are there archives at all? You have some sense of the larger problem if you take a lot of photos. You probably have photos that you took on your cell phone. But what about your partner's or kid's cell phone photos of the same event? And do you have a separate digital camera? Does your partner/kid also have such a camera? Then you might start looking at the files you have amassed on your computer(s). These pictures were taken on this date. Or were they just unloaded on this date? Or were they unloaded to this computer on this date, and to that computer on this other date? Wait, whose camera was this? The files have a different numbering system. What if you find an SD card floating around the bottom of your camera bag or purse? Was that ever unloaded? Do you have a backup of the photos from the cell phone you just accidentally dropped on the pavement? Maybe your Windows machine just got hijacked by one of those 'ransom' viruses, and unless you pay off the hackers, you will lose your last six months of photos. Every time you get a new computer, it takes longer and longer to transfer all your photos.
Even if/when you have sorted out all of these organizational things, you still have problems. Where is the photo you need to find now of this specific event that might have happened in the first six months of 2010? Here in Southern California, the weather is often nice, so remembering that it was a sunny day or that you decided not to take a jacket that day doesn't help much to narrow down the list of possible months in which you should look. And, if you've gotten into family history research or even just been cleaning out a relative's attic, basement, or garage, you've undoubtedly found old photos where you can identify a person but not a place, or only one of the 10 people in the photo, and forget coming up with a date.
An astronomy archive has to not only keep all the data, and keep it safely, on upgraded operating systems and functioning disks, but also make it easy to search and find what you need quickly, by target and/or date. It also has to help you when you get stuck.
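To make "search by target and/or date" concrete, here is a minimal sketch in Python of a toy catalog that can answer that kind of question. Everything in it is made up for illustration: the class names `Observation` and `TinyArchive`, the file names, and the dates are assumptions, and a real archive like the ones discussed here works with databases rather than an in-memory dictionary.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    """One archived file, described by what was observed and when (hypothetical record)."""
    target: str
    observed_on: date
    filename: str


class TinyArchive:
    """Toy in-memory catalog, indexed by lower-cased target name."""

    def __init__(self):
        self._by_target = {}

    def ingest(self, obs):
        # Store the record under a case-insensitive target key.
        self._by_target.setdefault(obs.target.lower(), []).append(obs)

    def search(self, target=None, start=None, end=None):
        # Narrow by target first (if given), then by an inclusive date range.
        if target is not None:
            candidates = self._by_target.get(target.lower(), [])
        else:
            candidates = [o for group in self._by_target.values() for o in group]
        hits = [
            o for o in candidates
            if (start is None or o.observed_on >= start)
            and (end is None or o.observed_on <= end)
        ]
        return sorted(hits, key=lambda o: o.observed_on)


# Made-up example records.
archive = TinyArchive()
archive.ingest(Observation("M31", date(2010, 3, 14), "m31_a.fits"))
archive.ingest(Observation("M31", date(2012, 7, 1), "m31_b.fits"))
archive.ingest(Observation("Orion Nebula", date(2010, 5, 2), "orion_a.fits"))

# "Something that happened in the first six months of 2010."
first_half_2010 = archive.search(start=date(2010, 1, 1), end=date(2010, 6, 30))
print([o.filename for o in first_half_2010])
```

The point of the sketch is that the date and the target are recorded when the data come in, so nobody has to reconstruct them later from memory, which is exactly what goes wrong with the shoebox of family photos.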
An astronomy archive has three main purposes, on which I elaborate below:
- Ingest new data (and new reprocessing of old data).
- Maintain and serve a vital repository of irreplaceable data.
- Enable cutting-edge research.
Operating missions produce new data all the time; old data are often reprocessed, and making those newly reprocessed data available is also important. Sometimes reprocessing old data just makes small improvements, and sometimes it yields new products built from old data. For example, high-level data products are designed to be used by astronomers who are not necessarily experts in those data, that instrument, and/or that telescope. These high-level data are ready to use in many cases.
Archives also need to support both observation planning and mission planning. Particularly in the context of NASA astronomy archives, new mission planning is happening all the time. If you want to plan for future missions, you need to learn from past missions.
There is more science in the archives waiting to be done. We are still learning things from an old NASA mission called IRAS that flew in the 1980s; it was the first all-sky infrared survey. Decades later, those data are still being used for new discoveries. New science discoveries are lurking in these archives, and you need good tools to make those discoveries. Archives need to continuously develop new tools for working with the data, new ways of filtering the data, new ways to merge data sets (say, across wavelengths or even archives, or as a function of time), and (note that this is an important bit) continue to provide user support by experts. It's not just a matter of throwing the data on the web and letting the world have at it. It's having tools that make it easy, lots of documentation that tells you how to use the data, and a helpdesk to which you can submit even really detailed questions when you have them.
The archive I work for is IRSA (the Infrared Science Archive). IRSA is the home for NASA's long-wavelength data, which means infrared and longer-wavelength light.
IRSA has literally petabytes of data. (1000 GB = 1 TB, a terabyte; 1000 TB = 1 PB, a petabyte.) We have (invoking my inner Carl Sagan) billions and billions of rows in databases.
IRSA has so much data that 10% of all refereed astronomy journal articles worldwide use data that come, at least in part, from IRSA. That means that probably at least one out of every 10 professional astronomers is working right now with data that originally came from IRSA.
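To get a feel for those units, here is a small Python sketch of the arithmetic, using the decimal (powers of 1000) definitions given above. The 5 MB photo size is purely an assumption for illustration, as is the helper name `human_readable`; neither describes any actual archive holdings.

```python
# Decimal storage units, following the 1000x steps described in the text.
MB = 1000**2
GB = 1000 * MB
TB = 1000 * GB
PB = 1000 * TB


def human_readable(n_bytes):
    """Express a byte count in the largest unit that fits (hypothetical helper)."""
    for unit, size in (("PB", PB), ("TB", TB), ("GB", GB), ("MB", MB)):
        if n_bytes >= size:
            return f"{n_bytes / size:.1f} {unit}"
    return f"{n_bytes} bytes"


# Assumption for illustration only: a typical phone photo is about 5 MB.
photo_size = 5 * MB
photos_per_petabyte = PB // photo_size

print(photos_per_petabyte)          # 200000000
print(human_readable(2_500 * TB))   # 2.5 PB
```

In other words, under that assumed photo size, a single petabyte would hold about 200 million cell phone photos.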
IRSA holds data from, among other missions, the Spitzer Space Telescope. By tracking Spitzer publications, we know that half of the refereed publications from Spitzer come not from the people who originally requested the data, but from people who accessed the data via the archive. Hubble's archive finds a similar result for Hubble. Archives (good archives!) can literally double the scientific productivity of a mission. That's a really powerful statement.
IRSA, like other archives, is always developing new tools and new ways of manipulating data. Archives are currently undergoing a transformation from "give me the data and let me take it home to process it" (which is where we've been for decades) to "let me do some processing at the archive and just send me home with part of it." Databases are now getting so large that some part of the analysis needs to be done near the data, i.e., at the archive. These powerful new tools will help everyone (professional astronomers, educators, amateurs) make better use of the archives.
There are other NASA astrophysics archives too -- I'll just list them here for now:
- NED, the NASA/IPAC Extragalactic Database, also at Caltech-IPAC
- The NASA Exoplanet Archive, also at Caltech-IPAC
- MAST, the Mikulski Archive for Space Telescopes (at Space Telescope Science Institute in Baltimore)
- HEASARC, the High Energy Astrophysics Science Archive Research Center (at NASA/Goddard Space Flight Center in Greenbelt, MD)
- ADS, the Astrophysics Data System (all the astronomy literature, at the Harvard-Smithsonian Center for Astrophysics)