Breakthrough Listen announces release of 400 terabytes of Green Bank Telescope data from the repeating fast radio burst FRB 121102

Fast radio bursts (FRBs) are one of the most mysterious classes of objects in the Universe. They originate outside our Galaxy, last only around a millisecond, and can release as much energy in a fraction of a second as the Sun does in an entire year. As of the end of 2018, about 65 FRBs had been discovered, of which only one (FRB 121102) was known to show repeated pulses. Last week, the CHIME collaboration announced the discovery of 13 new FRBs, including another “repeater” (FRB 180814). As with FRB 121102, FRB 180814 exhibits remarkable spectral and temporal variability. This variability, together with the ultimate source of FRBs, remains unexplained.

In August 2017, the Breakthrough Listen program conducted a five-hour observation of FRB 121102 using the Green Bank Telescope, as part of its campaign of targeting exotic and anomalous astronomical phenomena. The Listen team recorded around 380 TB of baseband voltage data with the Breakthrough Listen digital backend (MacMahon et al. 2018, PASP, 130, 044502), observing across 4 to 8 GHz. These unique data have already led to a number of discoveries. An initial search found 21 pulses, with emission seen up to 8 GHz, the highest frequency at which an FRB had been observed (Gajjar et al. 2018, ApJ, 863, 2). These pulses were used to investigate the polarization properties of the source, and it was found that FRB 121102 is embedded in a highly dense and magnetized environment (Michilli et al. 2018, Nature, 553, 182). Later in 2018, we used novel machine learning tools to search these data and found 72 new bursts (Zhang et al. 2018, ApJ, 866, 149).

Today we are releasing the entire baseband raw-voltage dataset collected during these observations, totaling nearly 400 TB. This data release marks the first time raw voltage data for FRB detections have been released to the public. We hope that the citizen science and engineering community will help us use these data to their maximum potential and extract the additional insights into FRBs that surely lie buried within them.

The data are available for download through the Breakthrough Initiatives Open Data Portal, under the target name FRB121102.

The files are large and in technical formats, as described below.

Good hunting! If you have any questions on these datasets, feel free to contact me at vishalg@berkeley.edu

DATA FORMAT:

For these observations, raw voltages were recorded using the Breakthrough Listen digital backend and data recording system, as described in MacMahon et al. (2018).

The Breakthrough Listen team uses the RAW file format (hereafter, raw-file) to store channelized voltages from radio telescopes. This format is based on the GUPPI RAW format, which was originally developed to store pulsar data for the “GUPPI” pulsar processor. Both RAW and GUPPI RAW are loosely related to the FITS file format. A raw-file is a series of "header data units", each consisting of an ASCII text header section followed by a binary data section. In some cases, the header section is followed by padding bytes that belong to neither the header nor the data. The header section contains metadata that describes the data section and provides other relevant details (e.g. time, sky position, frequency) for the voltage samples in the data section. A detailed description of the RAW format can be found here and here.

FRB 121102 DATA:

These observations were conducted in 10 sessions, each 30 minutes in length, denoted by scan numbers 11 to 20. Scan number 10 was a 1-minute noise diode calibration recording on the source, with a switching frequency of 25 Hz (a period of 0.04 seconds). The bursts already found during these observations are listed in Table 2 in this paper, with arrival times given in seconds from the beginning of the observations. Each scan is precisely 1800 seconds long, so for any given burst, the relevant scan number can be determined from the arrival time. Data were recorded across 32 individual compute nodes spanning the entire 4 to 8 GHz of bandwidth. Each node covered a bandwidth of 187.5 MHz. The last node of each group of eight records the same sub-band as the first node of the next group (blc07 and blc10, blc17 and blc20, blc27 and blc30), so those rows are intentionally identical. Node names and the corresponding frequency ranges are listed below.

Node     Start Frequency (MHz)   End Frequency (MHz)
blc00    9220.21484375           9032.71484375
blc01    9032.71484375           8845.21484375
blc02    8845.21484375           8657.71484375
blc03    8657.71484375           8470.21484375
blc04    8470.21484375           8282.71484375
blc05    8282.71484375           8095.21484375
blc06    8095.21484375           7907.71484375
blc07    7907.71484375           7720.21484375
blc10    7907.71484375           7720.21484375
blc11    7720.21484375           7532.71484375
blc12    7532.71484375           7345.21484375
blc13    7345.21484375           7157.71484375
blc14    7157.71484375           6970.21484375
blc15    6970.21484375           6782.71484375
blc16    6782.71484375           6595.21484375
blc17    6595.21484375           6407.71484375
blc20    6595.21484375           6407.71484375
blc21    6407.71484375           6220.21484375
blc22    6220.21484375           6032.71484375
blc23    6032.71484375           5845.21484375
blc24    5845.21484375           5657.71484375
blc25    5657.71484375           5470.21484375
blc26    5470.21484375           5282.71484375
blc27    5282.71484375           5095.21484375
blc30    5282.71484375           5095.21484375
blc31    5095.21484375           4907.71484375
blc32    4907.71484375           4720.21484375
blc33    4720.21484375           4532.71484375
blc34    4532.71484375           4345.21484375
blc35    4345.21484375           4157.71484375
blc36    4157.71484375           3970.21484375
blc37    3970.21484375           3782.71484375
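
The table follows a regular pattern, so the band for any node can be computed from its name. The sketch below is an illustration only: the function names node_band and nodes_covering are hypothetical, and the constants and the "four groups of eight nodes" layout are read off the table rather than taken from Breakthrough Listen software.

```python
import re

TOP_MHZ = 9220.21484375   # start frequency of blc00, from the table
NODE_BW_MHZ = 187.5       # bandwidth recorded by each node
NODES_PER_GROUP = 8       # blcX0 to blcX7


def node_band(node):
    """Return (start_mhz, end_mhz) for a node name such as 'blc17'."""
    match = re.fullmatch(r"blc([0-3])([0-7])", node)
    if match is None:
        raise ValueError(f"unexpected node name: {node!r}")
    group, index = int(match.group(1)), int(match.group(2))
    # Each group starts on the last sub-band of the previous group,
    # so a new group advances by 7 sub-bands rather than 8.
    step = group * (NODES_PER_GROUP - 1) + index
    start = TOP_MHZ - step * NODE_BW_MHZ
    return start, start - NODE_BW_MHZ


def nodes_covering(freq_mhz):
    """List the nodes whose band contains the given frequency."""
    names = [f"blc{g}{i}" for g in range(4) for i in range(NODES_PER_GROUP)]
    found = []
    for name in names:
        start, end = node_band(name)
        if end < freq_mhz <= start:
            found.append(name)
    return found


print(node_band("blc00"))
print(nodes_covering(7800.0))
```

Because of the overlap between groups, a frequency can be covered by two nodes, which is why nodes_covering returns a list.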

File names of these raw data files contain helpful information. An example of the raw filename format is:

blc00_guppi_57991_49836_DIAG_FRB121102_0010.0000.raw

This is parsed as follows (with _ as the delimiter):

blc00 : the server node which recorded this data (blc00-blc37 octal)
guppi : keyword noting this was recorded using guppidaq software
57991 : the modified Julian date (MJD) of the start of this observation
49836 : the seconds since midnight (UT) of this observation
DIAG_FRB121102 : the target name
0010.0000.raw : a sequence and suffix, which is further parsed as follows (with . as the delimiter):
    0010 : the "sequence number" or scan number of this target during a single observation session, an arbitrary increasing integer
    0000 : the data order number. The first data file starts at 0000, and when it "rolls over" the next file is 0001, etc.
    raw : denotes that this is a raw data product
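
The parsing rules above can be sketched in Python. This is a minimal illustration, not part of the Breakthrough Listen tooling: the function name parse_raw_name and the returned dictionary keys are hypothetical. Since the target name itself can contain the _ delimiter (as in DIAG_FRB121102), the sketch uses a regular expression anchored on the trailing sequence and suffix instead of a plain split. The start time is derived from the MJD and seconds fields, ignoring leap-second subtleties.

```python
import re
from datetime import datetime, timedelta, timezone

RAW_NAME = re.compile(
    r"(?P<node>blc\d\d)_guppi_(?P<mjd>\d+)_(?P<seconds>\d+)_"
    r"(?P<target>.+)_(?P<scan>\d{4})\.(?P<order>\d{4})\.raw"
)
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)  # MJD 0


def parse_raw_name(filename):
    """Split a raw file name into its documented components."""
    match = RAW_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a raw file name: {filename!r}")
    mjd = int(match["mjd"])
    seconds = int(match["seconds"])
    return {
        "node": match["node"],
        "target": match["target"],        # may itself contain underscores
        "scan": int(match["scan"]),
        "order": int(match["order"]),
        "start": MJD_EPOCH + timedelta(days=mjd, seconds=seconds),
    }


info = parse_raw_name("blc00_guppi_57991_49836_DIAG_FRB121102_0010.0000.raw")
print(info["node"], info["target"], info["scan"], info["order"])
print(info["start"].isoformat())
```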

We have also made filterbank format data products available for each scan. These files concatenate all data for a given scan/compute node pair and reduce it to a stream of n-bit numbers corresponding to total intensity data for multiple polarization and/or frequency channels.

The file names for the reduced filterbank products are similar:

blc00_guppi_57991_49836_DIAG_FRB121102_0010.gpuspec.0000.fil

The difference is that part of the sequence and suffix (0000.raw from above) has been replaced by:

gpuspec : keyword noting this was reduced using gpuspec software
0000 : code noting frequency/time resolution, where:
    0000 : fine frequency (~3 Hz frequency bins, ~18 second time bins)
    0001 : fine time (~360 kHz frequency bins, ~350 microsecond time bins)
    0002 : mid frequency/time (~3 kHz frequency bins, ~1 second time bins)
fil : to denote this is a filterbank data product
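
As a small illustration of the naming scheme for the reduced products, the sketch below maps a filterbank file name to its resolution description and to the prefix it shares with the corresponding raw files. The function name describe_filterbank and the wording of the descriptions are hypothetical; the codes themselves are the three listed above.

```python
# Resolution codes for the gpuspec filterbank products, as listed above.
RESOLUTIONS = {
    "0000": "fine frequency (~3 Hz bins, ~18 s bins)",
    "0001": "fine time (~360 kHz bins, ~350 microsecond bins)",
    "0002": "mid frequency/time (~3 kHz bins, ~1 s bins)",
}


def describe_filterbank(filename):
    """Return (shared prefix, resolution description) for a .fil name."""
    # Split from the right so dots inside a target name cannot interfere.
    parts = filename.rsplit(".", 3)
    if len(parts) != 4 or parts[1] != "gpuspec" or parts[3] != "fil":
        raise ValueError(f"not a gpuspec filterbank name: {filename!r}")
    prefix, code = parts[0], parts[2]
    if code not in RESOLUTIONS:
        raise ValueError(f"unknown resolution code: {code}")
    return prefix, RESOLUTIONS[code]


name = "blc00_guppi_57991_49836_DIAG_FRB121102_0010.gpuspec.0001.fil"
prefix, resolution = describe_filterbank(name)
print(prefix)
print(resolution)
```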

To read the frequency (and other necessary information) from the header of a raw file, one can use the Linux ‘fold’ command, which breaks the 80-character header records onto separate lines.

fold -w 80 <raw file> | more
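
The same header can also be read programmatically. The following is a minimal sketch, assuming the GUPPI-style layout of fixed 80-character "KEY = value" records terminated by an END record; it reads only the first header section and does not handle the optional padding bytes or the data section. The helper names read_first_header and make_card are hypothetical, and the demo header is synthetic, with illustrative values rather than values copied from a real file. Treating OBSFREQ as the center frequency and OBSBW as the signed bandwidth in MHz is likewise an assumption to check against the format description.

```python
import io

CARD_LEN = 80  # each header record is 80 ASCII characters, with no newline


def read_first_header(stream):
    """Parse the first header section of a raw file into a dict of strings."""
    header = {}
    while True:
        card = stream.read(CARD_LEN)
        if len(card) < CARD_LEN:
            raise ValueError("reached end of stream before the END record")
        text = card.decode("ascii")
        if text.strip() == "END":
            return header
        key, sep, value = text.partition("=")
        if not sep:
            continue  # skip records that are not KEY = value pairs
        # String values are wrapped in single quotes; numbers are bare.
        header[key.strip()] = value.strip().strip("'").strip()


def make_card(key, value):
    """Build one synthetic 80-byte record (used for the demo only)."""
    return f"{key:<8}= {value}".ljust(CARD_LEN).encode("ascii")


# Synthetic header with illustrative values, not copied from a real file.
demo = io.BytesIO(
    make_card("OBSFREQ", 9126.46484375)
    + make_card("OBSBW", -187.5)
    + make_card("SRC_NAME", "'DIAG_FRB121102'")
    + "END".ljust(CARD_LEN).encode("ascii")
    + b"\x00" * 16  # stand-in for the binary data section
)
header = read_first_header(demo)
center = float(header["OBSFREQ"])
half_bw = abs(float(header["OBSBW"])) / 2
print(f"band: {center - half_bw} to {center + half_bw} MHz")
```

For a file on disk, the same function can be called on a handle opened in binary mode, for example with open(path, "rb").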

Each scan on each node is further divided into ~23 second segments to keep the file sizes manageable. The Breakthrough Listen team has developed this Python package to display and manipulate raw files. The most important routine is extract_blocks.py, which can be used to extract raw voltages around a given time. For example, burst number 1 occurred 16.22 seconds after the recording started, which places it in scan 11. If we want to extract a few seconds of data around this burst from a single compute node (for example blc00), we can use extract_blocks.py with the following command line arguments.

python extract_blocks.py <path to all raw files from blc00> blc00_guppi_57991_49905_DIAG_FRB121102_0011 15.7505 17.250 <output path>

This command will examine all the raw files from scan 11 for the blc00 compute node and find the appropriate raw file to extract the requested 1.5 seconds of data. If the given time interval is spread across two raw files, it can combine the data appropriately.
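
For bursts later in the observation, the scan number and the time within that scan have to be worked out from the arrival time first. The sketch below shows that arithmetic under stated assumptions: arrival times are measured from the start of scan 11, the ten scans run back to back with no gaps, and the start and stop times passed to extract_blocks.py are measured from the start of the scan. The function name locate_burst and the symmetric 0.75 second half-width are hypothetical choices for illustration and do not reproduce the exact window used in the command above.

```python
FIRST_SCAN = 11         # scans 11 to 20 are the on-source sessions
N_SCANS = 10
SCAN_LENGTH_S = 1800.0  # each scan is precisely 1800 seconds long


def locate_burst(arrival_s, half_width_s=0.75):
    """Return (scan, offset_s, start_s, stop_s) for a burst arrival time.

    arrival_s is in seconds from the beginning of the observations.
    The returned offset and window are in seconds from the scan start.
    """
    if not 0 <= arrival_s < N_SCANS * SCAN_LENGTH_S:
        raise ValueError("arrival time is outside the ten scans")
    scan_index, offset = divmod(arrival_s, SCAN_LENGTH_S)
    scan = FIRST_SCAN + int(scan_index)
    # Clip the window so it never runs past either end of the scan.
    start = max(0.0, offset - half_width_s)
    stop = min(SCAN_LENGTH_S, offset + half_width_s)
    return scan, offset, start, stop


scan, offset, start, stop = locate_burst(16.22)
print(f"scan {scan:04d}: burst at {offset:.2f} s, extract {start:.2f} to {stop:.2f} s")
```

The scan number then selects the file name prefix for each node (the prefix also contains the scan's own start time, which has to be taken from the actual file names).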

Once individual files are extracted from all compute nodes (32 extracted raw files for a single burst), one can use the splicer_raw.py routine to combine these raw data files into one single contiguous raw data file. This single raw file can then be coherently dedispersed to any desired spectral and temporal resolution. A description of how to perform coherent dedispersion on RAW data, as well as perform other tasks with standard pulsar tools, can be found here.

Data are released under the CC BY 4.0 license. If you make use of these datasets for academic work, please cite the following papers:

Gajjar et al. 2018, ApJ, 863, 2
Zhang et al. 2018, ApJ, 866, 149