Introduction

Solar flares are sudden bursts of energy on the sun’s surface that can have significant impacts on near-Earth space. In this project, we aim to leverage machine learning techniques to classify solar flares based on vector magnetic field data. Historically, solar flares have been observed and recorded for over a century; however, early observations were primarily based on visual and photographic methods, allowing astronomers to categorize solar flares by their appearance and intensity. We hope to achieve more accurate results through our dataset containing 8,874 solar records spanning from May 2010 to December 2019. This dataset was obtained through Python’s Sunpy library by collecting data from the Joint Science Operations Center (JSOC) and the Space Weather Prediction Center (SWPC). Each record in the dataset is provided in CSV format and contains valuable information that can be used for predicting solar flare events. Furthermore, the authors of the dataset specifically stated “This data set can be used to support the research of solar flare forecasting” (Hollanda 1).

Link to cleaned data

Problem Definition

The project aims to address the critical challenge of predicting solar flares accurately. This is important as solar flares have the potential to disrupt communication systems, navigation, and power grids, posing risks to space missions and astronauts. Currently, it is very difficult to predict solar flares occuring by human forecasting (Nishizuka, Sugiura, et al. 1). According to NASA “The damage to satellites and power grids can be very expensive and disruptive.” (NASA 2). Our motivation is to mitigate this damage by classifying solar flares effectively.

Methods

We began by cleaning the dataset by formatting it to fit correctly in an excel spreadsheet, splitting values by a delimiter to ensure consistent data, and removing outliers/NULL values that were present in the data. The cleaned dataset formed the basis for subsequent analysis.

To classify solar flares, we used PCA as our data preprocessing method and Random Forests as our machine learning method.

We used PCA to extract essential information from data, reduce computational complexity, and remove multicollinearity among variables which aids in more efficient and accurate modeling. The initial dataset underwent a cleaning process to ensure the quality of the data. Once we used PCA to reduce the dimensionality of the dataset, we utilized Random Forests as our model.

Random Forests was chosen for several reasons, one being its ability to handle complex non-linear relationships. Solar flare prediction involves many irregular variables creating multitudes of parameters that do not simply follow linear trends. The Random Forests model excels at analyzing non-linear relationships due to its utilization of decision trees, which made it a good fit for our dataset and problem definition. Furthermore, due to their ensemble nature, where multiple decision trees are combined, Random Forests are generally more robust against overfitting compared to single decision trees. This is crucial in solar flare prediction where the dataset might have high variability creating high dimensionality. Finally, Random Forests can efficiently handle large datasets and high-dimensional feature spaces, making them ideal for our data with thousands of solar records.

Logistic Regression was chosen based on several key considerations. One of the primary advantages of Logistic Regression is its ability to simplify complex relationships and classifications, which aligns with the nature of predicting solar flares. Despite the complex and non-linear relationships inherent in the dataset, Logistic Regression is effective at capturing the probability of an event occurring. The model’s interpretability is another advantage, as its binary nature makes it easy to understand. Logistic Regression’s simplicity and efficiency are particularly advantageous when dealing with large, high-dimensional datasets like ours. The algorithm also mitigates overfitting concerns well. This makes Logistic Regression a reliable choice, offering a balance between interpretability and efficiency while dealing with irregular solar parameters.

Finally, we implemented our unsupervised model: K-Means. The K-Means algorithm partitions a dataset into K distinct, non-overlapping clusters. The algorithm aims to group similar data points into the same cluster. As an unsupervised model, the K-Means algorithm does not use labels to classify the data. A few reasons we chose K-Means as opposed to other unsupervised learning algorithms were simplicity, efficiency, and scalability. The algorithm is computationally efficient and very easy to implement. With a large dataset like the solar flares dataset we used, efficiency was a priority. We also didn’t need to worry about the size of our dataset because the algorithm is very scalable. However, due to the several dimensions in our dataset, we found that a supervised learning algorithm would better suit our needs. The results of the K-Means clustering further defend this idea.

Random Forests Results and Discussion

We chose to implement our supervised model using random forest first as this would give us better results over our unsupervised k means choice. As we expected, our random forest model performed exceptionally well, with a 99% accuracy as shown below. This suggests that the model performs well in terms of overall predictions.

Accuracy — Model classification accuracy

Our Precision and recall for class 0 (No flare, or minor non - disruptive flare) were both 1.00, which is exceptional. This means everytime the model predicted class zero, it was correct. For class 1, (disruptive flares), we had a score of 0.94, which means it correctly identifies 94% of disruptive flares. The F1-score, which is the harmonic mean of precision and recall, is understandably also 1.00 and 0.94 respectively, reflecting excellent performance. The macro average for precision, recall, and F1-score is 0.97, which means that the model performs consistently across both classes without significant bias toward one class, despite the imbalance in the number of class instances. The weighted average takes further into account the imbalance in the dataset by weighting the metrics based on the number of instances of each class. With a weighted average of 0.99, the model’s performance is weighted towards the majority class but still maintains high performance overall.

We also generated a confusion matrix, as seen below.

The results were as follows:

True Positives (TP): The upper left square (1684) indicates the number of times the model correctly predicted the absence of solar flares or classified them as A, B, or C class flares. This is the count of correct predictions for the majority class.
True Negatives (TN): The lower right square (78) shows the number of times the model correctly predicted the presence of significant flares (X or M class).
False Positives (FP): The upper right square (5) indicates the number of times the model incorrectly predicted the significant flares (X or M class) when there were actually no flares or they were A, B, or C class.
False Negatives (FN): The lower left square (5) shows the number of times the model failed to predict the significant flares (X or M class), instead predicting no flare or a minor one.

Based on these values, we can conclude that the model exceeds in predicting the majority class (in which there are no noteworthy/significant flares). This is evident from the high precision and recall scores. However, the model does perform slightly less well in terms of identifying significant flares (these are the class X and M flares), which is indicated by the slightly lower precision and recall scores. The false positive rate is low, suggesting a relatively low number of instances where the model incorrectly predicts significant flares when there are none.

We then wanted to investigate which features were the most important to the classification of flares. We generated the following plot:

However, this is a combined importance chart, as PCA reduced the amount of features to 11. Using the code below, we map the feature importances from a reduced PCA feature space back onto the original feature space to identify which features were most significant in the Random Forest model’s decisions.

This resulted in the following graph:

So what are these important features?

Quality - This attribute refers to a flag in the data set showing whether the flare record is noisy. When errors occur during the original data processing, the quality attribute reports them by holding values higher than 65,536.
ABSNJZH - Absolute value of the net current helicity. FlareNumber - Refers to data from the GOES Event representing whether a flare occurred or not. Attribute’s values labeled as 1 are related to M- or X-class flare events. On the other hand, when their values equal 0, they are related to A-, B-, or C-class events, or no event.
SAVNCPP - Sum of the absolute value of the net current per polarity.
TOTFY - Sum of y-component of Lorentz force

Logistic Regression Results and Discussion

We also implemented Logistic Regression, another supervised model. We wanted to see if the high accuracies from our random forest would replicate itself in a different, yet still supervised model. Logistic Regression performed similarly to our Random Forest model with a 99% accuracy as shown below. This suggests that the model also performs well in terms of overall predictions, just like our Random Forest.

Our Precision and recall for class 0 were both 1.00, which is exceptional. This means everytime the model predicted class zero, it was correct. For class 1, we had a precision score of 0.91, and a recall score of 0.95.. The F1-score, which is the harmonic mean of precision and recall, is understandably also 1.00 and 0.93 respectively, once again reflecting excellent performance. The macro average for precision is 0.95, 0.97 for recall, and overall F1-score is 0.96, which means that the model performs consistently across both classes without significant bias toward one class, despite the imbalance in the number of class instances. (Sound familiar?) The weighted average takes further into account the imbalance in the dataset by weighting the metrics based on the number of instances of each class. With a weighted average of 0.99 across the board once again, the model’s performance is weighted towards the majority class but still maintains high performance overall.

The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. The curve is created by plotting the True Positive Rate (also known as recall) against the False Positive Rate at various threshold settings.

Our ROC curve displays an AUC (Area Under the Curve) of nearly 1.0. The dashed blue line represents perfect randomness, with an Area under the curve of 0.5.

Here’s what this means:

True Positive Rate (Y-axis): The TPR is plotted on the y-axis and represents the proportion of positive data points that are correctly considered positive by the model. With TPR at almost 1.0 across all thresholds, it indicates near perfect recall.
False Positive Rate (X-axis): The FPR is plotted on the x-axis and shows the proportion of negative data points that are mistakenly considered positive by the model. Our curve for FPR starts at close to (0,1), which indicates that the FPR is almost perfect.
AUC - Area Under the Curve: FOllowing the above, the AUC for our ROC curve is practically 1.0, which means the model can almost perfectly differentiate between the positive and negative class across all thresholds.

Overall, our logistic regression performed a miniscule step below our Random Forest, which still claims the highest accuracy and scores across the board. Evidently, Supervised Learning is a perfect match with the true labels for our data set.

K-Means Result and Discussion

The results of our K-Means model were not as great as our two supervised models.

With a low silhouette score of .194 (in a range of -1 to 1), we can see our model wasn’t very accurate as a higher score means the clusters were well matched while a score of 0 indicates random matching. However, the Calinski-Harabasz Index Score is high meaning the clusters are more dense and separated. But, with a low Rand Index of .043, this means our data clusterings only agreed on very few pairs of points.

We also found the most optimal amount of clusters by using the elbow method.

kmeans Elbow — Elbow Method for Optimal Clusters

Seen in the chart above, four clusters is the most optimal as that is where the “elbow” in the graph is.

Here is our clustering visualized:

The T-distributed Stochastic Neighbor Embedding (t-SNE) algorithm reduces the dimensionality of the data from eleven PCA dimensions to two dimensions for the purpose of visualization while trying to maintain the high-dimensional structure. Here’s how to interpret the chart:

The data points are colored according to the cluster they belong to, with four different colors representing the four clusters (0, 1, 2, 3). If clusters are clearly separated, it suggests that the KMeans algorithm has found distinct groups within the data. While there is some separation, there’s also some degree of overlap, particularly between the purple and yellow clusters, which suggests that there may be some ambiguity or similarity between the data points in these clusters. It’s important to note that t-SNE is a stochastic algorithm, meaning the exact layout can change each time you run it, though the overall structure should be consistent. The distances between clusters in t-SNE space do not have a meaningful interpretation; for example, one cluster appearing farther away from the others doesn’t necessarily mean it is more distinct in the original high-dimensional space.

Why K-means performed so poorly

When KMeans clustering does not perform well on a dataset where supervised algorithms have achieved high accuracy, there are several potential reasons for the discrepancy. First, the supervised model has clear guidance on what to predict based on labeled examples. KMeans has no information about the correct output, the algorithm only groups data based on feature similarity. KMeans optimizes for intra-cluster similarity and inter-cluster differences, not for label prediction accuracy. KMeans assumes also clusters are spherical and equally sized, which was not be the case with complex data like solar flare predictions - first, relations may be non linear, and furthermore, the “clusters” were absolutely not equally sized, as there were many more “no flare” occurrences than “flare” occurrences. Overall, this is the result that we expected from KMeans, however it was interesting to see just how differently the models performed.

Conclusion

In conclusion, our project focused on leveraging machine learning techniques to address the critical challenge of accurately predicting solar flares. By utilizing a comprehensive dataset sourced from multiple repositories, we aimed to contribute to the improvement of solar flare forecasting. The methods employed involved data preprocessing using PCA and implementing supervised models such as Random Forests and Logistic Regression, as well as an unsupervised model, K-Means.

The Random Forests model exhibited exceptional performance, achieving a 99.4% accuracy, with precision and recall scores reflecting its effectiveness in both identifying and classifying solar flares. Logistic Regression, although slightly below Random Forests in accuracy, still demonstrated remarkable performance, emphasizing the suitability of supervised learning for our dataset.

On the other hand, the K-Means unsupervised model faced challenges, yielding results that were not as promising as the supervised models. The silhouette score indicated suboptimal accuracy, and the Calinski-Harabasz Index suggested dense but poorly matched clusters. We identified potential reasons for this discrepancy, such as KMeans’ sensitivity to noise and the assumption of spherical, equally-sized clusters, which may not align with the complex, non-linear nature of solar flare predictions.

Our findings contribute valuable insights into the effectiveness of different machine learning approaches for solar flare prediction. Future work could involve further exploration of feature engineering and addressing the challenges associated with unsupervised clustering in this context. Our journey and insights are detailed in our final ML project presentation, accessible through the provided link.

References

Hollanda, A. (2021, June 9) et al. Data set for solar flare prediction using helioseismic and magnetic imager vector magnetic field data. Data in Brief. https://www.sciencedirect.com/science/article/pii/S235234092100487X?via%3Dihub

NASA. (n.d.). The Impact of Flares. NASA. https://hesperia.gsfc.nasa.gov/rhessi3/mission/science/the-impact-of-flares/index.html

Nishizuka, N., Sugiura, K., Kubo, Y., Den, M., & Ishii, M. (2018). Deep Flare Net (DEFN) model for Solar flare prediction. The Astrophysical Journal, 858(2), 113. https://doi.org/10.3847/1538-4357/aab9a7

Classifying Solar Flares

A study by Anish Kanduri, Shirley Benedict, Andrew Gao, and Rushda Umrani