
Open data

Breakthrough Listen generates huge amounts of data, particularly from Green Bank and Parkes, the two main radio telescopes where we're doing routine observations (and we'll be generating even more once we begin routine observations with MeerKAT). Part of our mission is to make as much of that data as possible publicly accessible, and to that end we already have a substantial amount of data in the Breakthrough Listen Open Data Archive, with much more to come in the near future. We've also made available some tutorials to accompany the data, including one that lets you see the signal from the Voyager spacecraft in data taken with the Green Bank Telescope, in amazing detail given that the spacecraft is 20 billion kilometers from Earth.

The raw data generated by our instruments (a stream of what are known as "complex voltages" from the telescope) amount to hundreds of terabytes per day, which is simply impractical to archive in the long term. Fortunately, for many purposes it's sufficient to archive the data in a form that takes up about 2 percent as much space. We get there by performing a mathematical operation, known as a Fourier transform, on the raw voltage data, thereby turning it into a spectrogram.
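To make the voltage-to-spectrogram step concrete, here is a minimal sketch in Python using NumPy. It is not our production pipeline: the function name, the channel count, and the integration length are illustrative assumptions, and the input is a synthetic tone plus noise rather than telescope data. It splits the voltage stream into blocks, Fourier transforms each block, takes the power (which discards phase), and averages consecutive spectra together. In this sketch it is the last two steps, rather than the transform itself, that shrink the data.

```python
import numpy as np

def voltages_to_spectrogram(voltages, n_chan, n_int):
    """Channelize complex voltages and integrate power (illustrative only).

    voltages: 1-D complex array of time samples
    n_chan:   FFT length, i.e. number of frequency channels
    n_int:    number of consecutive spectra averaged into one output row
    """
    n_spectra = len(voltages) // n_chan
    n_rows = n_spectra // n_int
    # Drop trailing samples that do not fill a whole integration.
    used = voltages[: n_rows * n_int * n_chan]
    blocks = used.reshape(n_rows * n_int, n_chan)
    # FFT each block; fftshift puts the lowest frequency in column 0.
    spectra = np.fft.fftshift(np.fft.fft(blocks, axis=1), axes=1)
    # Taking the power discards the phase of the signal.
    power = np.abs(spectra) ** 2
    # Averaging n_int spectra trades time resolution for data volume.
    return power.reshape(n_rows, n_int, n_chan).mean(axis=1)

# Synthetic input (an assumption, not real data): a complex tone in noise.
rng = np.random.default_rng(0)
n_chan, n_int, n_rows = 64, 16, 8
n_samples = n_chan * n_int * n_rows
t = np.arange(n_samples)
tone_bin = 10  # tone completes 10 cycles per 64-sample block
noise = 0.1 * (rng.standard_normal(n_samples) + 1j * rng.standard_normal(n_samples))
voltages = np.exp(2j * np.pi * tone_bin * t / n_chan) + noise

spectrogram = voltages_to_spectrogram(voltages, n_chan, n_int)
print(spectrogram.shape)           # (time rows, frequency channels)
print(spectrogram.argmax(axis=1))  # brightest channel in each row
```

The output rows are time steps and the columns are frequency channels, so the tone shows up as a bright vertical line in the same column of every row.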

Spectrograms, also referred to as dynamic spectra, filterbanks, or waterfall data, consist of arrays of measured intensities as a function of frequency and time. Essentially they can be treated as two-dimensional arrays, much like images, in common programming languages such as Python. A variety of operations can be performed on them, including searches for narrowband signals (which have formed the basis for most SETI experiments done to date), running more complex digital signal processing or machine learning algorithms, or visualizing the data in various ways. When the project began we stored most of our spectrogram data in "filterbank" format, but we have now started to switch to a format called HDF5 that has several advantages: it is more compressible, it is easier to access in small chunks without reading the whole file into memory, and it is a more common standard for data interchange.
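As a rough illustration of what a narrowband search on a spectrogram can look like, the sketch below averages a waterfall over time and flags channels that stand well above the noise. This is a deliberately simplified stand-in, not the search code we actually run: the function name and the threshold are assumptions, the data are synthetic, and the sketch ignores signals that drift in frequency over the course of an observation.

```python
import numpy as np

def find_narrowband_channels(spectrogram, snr_threshold=10.0):
    """Flag frequency channels whose time-averaged power is an outlier.

    spectrogram: 2-D array with shape (time, frequency)
    Returns the indices of flagged channels and the per-channel SNR.
    """
    mean_spectrum = spectrogram.mean(axis=0)
    # Median and MAD give a noise estimate that a few bright
    # channels cannot skew the way a plain mean and std would be.
    median = np.median(mean_spectrum)
    mad = np.median(np.abs(mean_spectrum - median))
    sigma = 1.4826 * mad  # MAD to standard deviation for Gaussian noise
    snr = (mean_spectrum - median) / sigma
    return np.flatnonzero(snr > snr_threshold), snr

# Synthetic waterfall (an assumption): noise power plus one steady tone.
rng = np.random.default_rng(1)
waterfall = rng.chisquare(df=2, size=(32, 256))
waterfall[:, 100] += 20.0  # inject a narrowband signal in channel 100

hits, snr = find_narrowband_channels(waterfall)
print(hits)
```

Because the array is indexed as (time, frequency), the same data can be passed straight to an image-plotting routine to produce a waterfall plot.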

Sometimes, though, we want to preserve raw data, particularly on certain targets of interest. Although the data volumes are much larger, raw data preserve both the amplitude and phase of the incoming signal (important for a number of signal detection and processing methods) and preserve the full frequency and time resolution (rather than averaging in time or frequency as we do for the filterbanks). One example of raw data that we have made publicly available is the raw data on the fast radio burst FRB 121102 that we announced earlier in 2019. In the blog post that accompanies that release, my colleague Vishal Gajjar goes into more detail about the technical aspects of the raw format, as well as the filterbanks that we generated for this target.

The raw datasets we've made available so far are in a format called GUPPI RAW, familiar to radio astronomers but not commonly used elsewhere. So we were excited when our collaborators at a company called DeepSig proposed that we try making some of our data available in a format called SigMF. At the FOSDEM conference in Brussels in February 2019, DeepSig announced that they were opening up libraries that can read and write this format to the world as free, open source software.

Working with Curtin University's Greg Hellbourg (a former postdoc with Breakthrough Listen) we've converted some small samples of data from Listen observations at the Green Bank Telescope into SigMF. Additionally, our collaborators at the SETI Institute are now releasing some of their data from the Allen Telescope Array as SigMF files.

We're hopeful that releasing raw data in this open format will encourage more collaboration between SETI researchers and experts in the tech and RF industries, and that we'll be able to work together on algorithms to identify a wide range of signals, both in our data and in similar data from other sources, in search of that elusive signal from a technological civilization beyond Earth.

You can access SigMF format raw voltage data from Breakthrough Listen observations of the Voyager spacecraft with the Green Bank Telescope here (one data file per polarization, with associated SigMF headers). SigMF files from ATA observations of several other objects, including Voyager, are here.
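For readers who have not met SigMF before: a recording is a pair of files, a JSON header (`.sigmf-meta`) and a flat binary file of samples (`.sigmf-data`). The sketch below reads such a pair using only the standard `json` module and NumPy. It is a minimal illustration, not the official SigMF library: the `read_sigmf` helper is hypothetical, it handles only two sample types, and the demo writes its own tiny synthetic recording with placeholder header values rather than using our files. Check the `core:datatype` field in a real header before assuming either type applies.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def read_sigmf(basename):
    """Read <basename>.sigmf-meta and <basename>.sigmf-data (hypothetical helper).

    Supports only 'cf32_le' (complex float32, little endian) and
    'ci8' (interleaved 8-bit integer I/Q); other types raise ValueError.
    """
    base = Path(basename)
    meta_path = base.with_name(base.name + ".sigmf-meta")
    data_path = base.with_name(base.name + ".sigmf-data")
    meta = json.loads(meta_path.read_text())
    datatype = meta["global"]["core:datatype"]
    if datatype == "cf32_le":
        samples = np.fromfile(str(data_path), dtype="<c8")
    elif datatype == "ci8":
        raw = np.fromfile(str(data_path), dtype=np.int8).astype(np.float32)
        samples = raw[0::2] + 1j * raw[1::2]  # de-interleave I and Q
    else:
        raise ValueError(f"datatype {datatype!r} not handled in this sketch")
    return meta, samples

# Demo on a synthetic recording; all header values are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "example"
    original = np.array([1 + 2j, -3 + 4j, 5 - 6j], dtype="<c8")
    original.tofile(str(base.with_name("example.sigmf-data")))
    header = {
        "global": {
            "core:datatype": "cf32_le",
            "core:sample_rate": 1.0e6,  # placeholder, in Hz
            "core:version": "1.0.0",
        },
        "captures": [{"core:sample_start": 0, "core:frequency": 1.0e9}],
        "annotations": [],
    }
    base.with_name("example.sigmf-meta").write_text(json.dumps(header))
    loaded_meta, loaded = read_sigmf(base)

print(loaded_meta["global"]["core:sample_rate"], loaded)
```

Since there is one data file per polarization, a helper like this would be called once per file, giving two complex arrays that can be processed separately or combined.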

Image: Voyager carrier signal from Breakthrough Listen Green Bank observations - spectrogram generated by Greg Hellbourg (Curtin University)