Monday, September 21, 2020

Periodogram vs FFT for Exploring X-Rays from Cygnus X-3

This month we ran a data challenge, a chance to test and stretch our data exploration and detective skills. This article walks through one approach to finding a pattern in the data, highlighting good data mining practice along the way.



A video walkthrough of the challenge and solution is here: [youtube]. The slides are here: [link]. 

The challenge guide is here: [pdf].



The Mysterious Cygnus X-3

Astronomers and astrophysicists work out what distant objects in deep space are by analysing the only thing that reaches us - the radiation those objects emit. Some objects are easy to classify, some difficult.

Cygnus X-3 has always been a difficult object to classify because its emitted radiation doesn't easily fall into a known category. Over recent years, evidence has grown that it is a binary system, that is, two objects orbiting each other. It also gives off powerful and erratic bursts of x-rays, like explosions, suggesting an unstable state.



For this challenge, a set of x-ray measurements from the SWIFT mission was provided to participants as a CSV file. Amongst other data, it contains measurements of received x-rays (counts per area per time), and associated timestamps.

The challenge was to find any pattern in the data, and if possible, infer what it meant for the physics of Cygnus X-3.

Given that there is still lots to understand about Cygnus X-3, it is entirely possible that we might uncover a new insight!



A First Look At The Data

It is always a good idea to look at any data we intend to explore and apply tools to. In fact, we should do this after every step of a data pipeline, to ensure the data still looks like what we expect.

You can follow the python notebook we used here:


The following summarises the shape of our data.

RangeIndex: 67220 entries, 0 to 67219
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   time    67220 non-null  float64
 1   rate    67220 non-null  float64
dtypes: float64(2)
memory usage: 1.0 MB

We can see there are 67,220 data rows, organised as two columns labelled time and rate (amount of radiation).
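As a sketch, loading and summarising a CSV like this with pandas might look as follows. The rows here are toy stand-ins, not the actual challenge data, but the column names match the summary above.

```python
import io
import pandas as pd

# Toy stand-in for the challenge CSV (the real file has 67,220 rows);
# the column names match those reported above.
csv_text = """time,rate
53348.7,0.05
53348.9,0.07
53349.4,-0.02
"""

df = pd.read_csv(io.StringIO(csv_text))

# info() prints the row count, column names, dtypes and memory usage,
# which is exactly the kind of summary shown above.
df.info()
```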

The following shows the very first view of our data.


We can see straight away that there are some very large outliers. Before we rush to remove that data from our set, we should consider whether these are genuine measurements recording a flare-up of radiation from Cygnus X-3. This is a valid hypothesis that shouldn't be discounted immediately.

As data exploration should be exploration, we can see what happens if we do remove those very large values. One hypothesis is that a radiation sensor measuring photon counts should never give negative readings, and that the huge spike is in fact a large negative number. It might be too simplistic to remove the smaller negative values as well, because there may be a systematic shift in the data as it is recorded. One common approach is to keep only values within 1 to 3 standard deviations of the mean, but the fact that everyone does it is false comfort: the largest values might be the most informative.

We'll proceed with a brutal cutting off of negative values, keeping open the option to revisit this decision.
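A minimal sketch of that cut, using boolean indexing on a toy pandas frame (column names assumed to match the challenge file):

```python
import pandas as pd

# Toy frame standing in for the SWIFT measurements.
df = pd.DataFrame({"time": [0.0, 1.0, 2.0, 3.0],
                   "rate": [0.05, -0.02, 0.10, -1.50]})

# Keep only non-negative count rates. The row labels of the surviving
# rows are preserved, which is why the cleaned data reports an
# Int64Index (with gaps) rather than a RangeIndex.
cleaned = df[df["rate"] >= 0]
removed = len(df) - len(cleaned)
```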

The following shows the data after it has had negative values removed.


We can now see more detail in the data. There appears to be a train of spikes, with much of the data below a magnitude of approximately 0.1. One participant suggested exploring the train of spikes to see if there was a layering of progressively larger spikes following an exponential pattern.

Int64Index: 64403 entries, 0 to 67219
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   time    64403 non-null  float64
 1   rate    64403 non-null  float64
dtypes: float64(2)
memory usage: 1.5 MB

The cleaning process removed 2,817 data rows, leaving 64,403 rows.


Fourier Analysis

A common first strategy is to see if data contains repeating elements. The go-to tool for uncovering a periodic pattern is the Fourier transform.

A Fourier transform switches the view of data from time to frequency. The following illustrates the idea.


We can see a simple pure sine wave sin(x) has only one frequency, and so has only one peak in the frequency domain. Similarly, a faster pure sine wave sin(2x) also has only one frequency, but a higher one, so its peak is further to the right, at a larger value.

The following shows signals that are made of two pure sine waves combined.


Although the first signal looks less simple, a Fourier analysis shows that it is made up of just two frequencies; it is sin(x) + sin(2x). The final signal is similar, but this time the higher frequency component has a larger magnitude; it is sin(x) + 2sin(2x).

You can experiment with building signals yourself:

Fourier transforms are very useful and very powerful. We can see how they can reveal the presence of dominant frequencies if periodic patterns do indeed exist in the data.
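The idea can be demonstrated with a short NumPy sketch: sampling sin(x) + sin(2x) at regular intervals over whole periods, then taking a fast Fourier transform, shows the two component frequencies as the two dominant peaks.

```python
import numpy as np

# Sample sin(x) + sin(2x) at regular intervals over 16 full periods.
n = 1024
x = np.linspace(0, 16 * 2 * np.pi, n, endpoint=False)
signal = np.sin(x) + np.sin(2 * x)

# The real-input FFT gives one magnitude per frequency bin; bin k
# corresponds to k complete cycles across the sampled window.
spectrum = np.abs(np.fft.rfft(signal))

# The two largest bins are 16 and 32: sin(x) completes 16 cycles
# over the window, sin(2x) completes 32.
top_two = sorted(np.argsort(spectrum)[-2:])
```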

The following shows the results of the discrete cosine transform (DCT), a slight simplification of the fast Fourier transform (FFT).


Looking at the results, and also zooming into the region near the origin, shows no apparent periodicity in the data at all. The large first spike is due to the presence of a zero-frequency "DC" component in the data, and can be ignored here.

Should we give up on trying to find any periodicity in the data?


Irregular Data Sampling

Before giving up, let's revisit the data. We assumed the radiation measurements were taken at regular intervals. Looking closely at the data shows this isn't true. There are many ways to show this, such as viewing the differences between consecutive timestamps.
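One quick check, sketched here on hypothetical timestamps rather than the real mission data, is to look at those consecutive differences directly:

```python
import numpy as np

# Hypothetical observation times (in days) with uneven gaps,
# standing in for the real mission timestamps.
times = np.array([0.00, 0.10, 0.15, 0.90, 1.00, 2.50])

# For a regularly sampled series every consecutive gap is identical;
# here the gaps vary widely.
gaps = np.diff(times)
is_regular = np.allclose(gaps, gaps[0])
```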

The following shows the timestamps plotted one after another. If they were regularly spaced, they would follow the straight blue line; we can see they don't.


The popular FFT (and the simpler DCT) assumes data is regularly sampled. That means we applied them inappropriately here.


Re-Sampling Data

One approach to remedying the irregularity of the data is to resample it into regular time intervals. Python's rich libraries provide many data processing tools, including the ability to resample data; see the notebook for example code.

The following shows the DCTs of the data resampled to 1 day, 1 week and 1 month intervals.


Again, there don't appear to be any standout peaks in frequency.

Resampling loses information by definition, and this may have destroyed any weak periodic signal in the data.


Periodograms

Looking back at the data again, we see that we have 64,403 rows over about 13.9 years. That's roughly 13 samples per day on average. That seems very sparse, and perhaps too infrequent to find any periodic signal.

The challenges seem stacked against us.

Luckily we can use a different algorithm for finding periodicities in data, called a periodogram. There are several variants of these methods, but essentially they are a brute-force search for which frequency in a given range best fits the data. 


They are much simpler than the highly optimised FFT, which also makes assumptions about the data. For us, the key benefit is that they don't require the data to be regularly sampled.

The following shows the results of a Lomb-Scargle periodogram after the algorithm was configured to search across a rather wide range of frequencies.
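As a minimal sketch of the technique using `scipy.signal.lombscargle`, here we recover a known period from a synthetic, irregularly sampled sinusoid (the data, period and frequency range are all illustrative, not the real light curve):

```python
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(42)

# Synthetic, irregularly sampled sinusoid with a known 0.2-day period
# (a stand-in for the real light curve).
t = np.sort(rng.uniform(0, 10, 500))
true_period = 0.2
y = np.sin(2 * np.pi * t / true_period) + 0.1 * rng.normal(size=t.size)

# Brute-force scan over candidate angular frequencies; unlike the FFT,
# no regular sampling is required.
freqs = np.linspace(0.1, 60.0, 5000)
power = lombscargle(t, y, freqs)

# The best-fitting angular frequency converts back to a period.
best_period = 2 * np.pi / freqs[np.argmax(power)]
```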


This time there is a definite peak in the data. It corresponds to a period of 4.79 hours.


Cygnus X-3 Rotation Period

Through several distinct methods, the scientific consensus is that Cygnus X-3 is a binary system in which the two bodies orbit each other with a period of 4.79 hours.

It is remarkable that we arrived so accurately at the same conclusion with this very simple exploration of the data.


Principles

As we worked through this data challenge we were reminded of good principles for working with data:



Further Reading

Although simple in concept, there is a lot of complexity in using Fourier transform algorithms effectively. The following additional resources include guides to avoiding the pitfalls.