S&P 500 index correction prediction with Machine Learning

Sergiu Iatco
3 min read · Oct 3, 2021


Everything begins with an intention; the next step is to start, and harder still is to accept that there will never be a perfect time when your work reaches its best version, so at some point you have to share it with the community. As with everything, it takes time for development and time for incremental ideas.

The initial intention was to create an online machine learning proof of concept starting from a blank sheet.

I am curious about the stock market, which is the modern foundation that connects our needs and aspirations as a society and as individuals.

Nowadays stock market participants acknowledge that valuations are high, but on the other hand no one wants to abandon the trend (the trend is your friend), while at the same time everyone wants to be aware of forthcoming corrections. It is impossible to achieve both, but with historical data one can estimate the probabilities of such events, and science is probably better than pure guessing. 😊

1. The purpose is to predict, with supervised learning regression, US stock market corrections within the next six months based on historical data.

2. Reasoning about the data used in the model: the P/E ratio and the Buffett Indicator.

To evaluate a stock or an index, one of the most widely used indicators is the P/E ratio. This is the first feature.

As an aggregate indicator of market valuation, the Buffett Indicator is preferred in the 21st century. Before the Dot-com bubble burst, investors considered Warren Buffett's valuation methods outdated; but after many stocks burst to dust, and others failed to recover to their previous highs even after 20 years, many investors turned back to the Buffett Indicator with an appreciation that lasts to this day. This is the second primary feature.

3. Collecting data. The S&P 500 index and the S&P 500 P/E ratio are extracted from data tables. The Buffett Indicator is calculated from GDP and Wilshire 5000 Full Cap Price Index data extracted with the FRED API. It requires an individual API key, but you can apply for one online for free.
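
As a minimal sketch of this step, the snippet below pulls both series with the fredapi package and forms the ratio. The series IDs (GDP, WILL5000PRFC) and the use of the Wilshire 5000 level as a proxy for total market cap are my assumptions here, not necessarily what the notebook itself does.

```python
import pandas as pd
from fredapi import Fred  # pip install fredapi

# "YOUR_API_KEY" stands in for your free individual FRED API key
fred = Fred(api_key="YOUR_API_KEY")

gdp = fred.get_series("GDP")                # US GDP, quarterly, billions of dollars
wilshire = fred.get_series("WILL5000PRFC")  # Wilshire 5000 Full Cap Price Index

# forward-fill the quarterly GDP against the index dates and form the ratio;
# the Wilshire 5000 level is a common proxy for total US market capitalization
df = pd.concat([gdp, wilshire], axis=1, keys=["gdp", "wilshire"]).ffill().dropna()
buffett_indicator = df["wilshire"] / df["gdp"]
```
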

4. Preparing data. The target is the market correction. The correction is calculated from the S&P 500 index as the drop from the value at the beginning of a period to the minimum value over the following six months.
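
A minimal sketch of how such a target could be computed with pandas, assuming a monthly sp500 series (the random walk below only stands in for real data to make the snippet runnable):

```python
import numpy as np
import pandas as pd

# sp500 would be a monthly series of S&P 500 values; a random walk
# stands in here only as placeholder data
idx = pd.date_range("2000-01-31", periods=120, freq="M")
sp500 = pd.Series(3000 + np.random.default_rng(0).normal(0, 50, 120).cumsum(),
                  index=idx)

# rolling(6).min() at month t covers months t-5..t; shifting it back by 6
# turns it into the minimum over the *next* six months, t+1..t+6
future_min = sp500.rolling(window=6).min().shift(-6)
correction = 1 - future_min / sp500  # fractional drop, e.g. 0.08 = an 8% correction
```
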

The primary data for the features are the P/E ratio and the Buffett Indicator; to be usable they have to be converted into meaningful statistics. To explore different samples of statistics I used a band definition, which is how many past months are taken into account as the cause of a correction (12, 24, 36, 48, 60).
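
A sketch of what such band features might look like, assuming simple rolling statistics over the trailing window; the actual notebook may compute different statistics:

```python
import pandas as pd

def band_features(series: pd.Series, band: int) -> pd.DataFrame:
    """Summary statistics over the trailing `band` months of a raw indicator."""
    roll = series.rolling(window=band)
    return pd.DataFrame({
        f"mean_{band}": roll.mean(),
        f"std_{band}": roll.std(),
        f"min_{band}": roll.min(),
        f"max_{band}": roll.max(),
    })

# e.g. features for every band size tried in the post:
# X = pd.concat([band_features(pe_ratio, b) for b in (12, 24, 36, 48, 60)], axis=1)
```
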

I used code snippets from the Kaggle Earthquake competition and from https://machinelearningmastery.com/. I wrote the code using the official documentation of the libraries, but for quick, targeted issues or solutions I mostly used https://stackoverflow.com/ or https://www.geeksforgeeks.org/. I tried https://www.kite.com/, but it was helpful only as a starting point toward the official documentation, not for ready-to-use code.

5. Model training. I used the XGBoost library for training because it is the first model that worked out of the box for me with great results. It would be worth building a model with a Recurrent Neural Network in Keras to compare scores, and worth training a model to predict price instead of correction. The model is flexible; features can be added by aligning data to the index date of the target market correction.
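
A minimal training sketch with the XGBoost scikit-learn wrapper, assuming a feature matrix X and target y from the previous step (random data stands in below); the hyperparameters are illustrative, not the ones used in the notebook:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# X would hold the band features and y the 6-month correction target;
# random data stands in here only to show the training call
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 8)), rng.normal(size=500)

# shuffle=False keeps the time order, avoiding look-ahead leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train)
print("R2:", r2_score(y_test, model.predict(X_test)))
```
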

6. Prediction. I tried different prediction setups, such as a grid search over the primary features and the band size. As we know, the score varies due to the stochastic nature of optimization, so I additionally applied random cross-fold bootstrapping to average the predictions.
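
One way such bootstrapped averaging could be implemented is sketched below; bootstrap_predict and model_factory are hypothetical names of mine, and the repository's n_times variable is mirrored here as a parameter:

```python
import numpy as np

def bootstrap_predict(model_factory, X_train, y_train, X_new, n_times=10):
    """Fit on random resamples of the training set and average the predictions."""
    rng = np.random.default_rng()
    preds = []
    for _ in range(n_times):
        # sample training rows with replacement (numpy arrays assumed)
        idx = rng.integers(0, len(X_train), size=len(X_train))
        model = model_factory()
        model.fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_new))
    return np.mean(preds, axis=0)

# usage sketch:
# avg_pred = bootstrap_predict(lambda: xgb.XGBRegressor(), X_tr, y_tr, X_new)
```
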

7. Financial value of the S&P 500 index prediction. With the P/E ratio the model predicts a correction of 4%, with the Buffett Indicator a correction of 8%, and with the feature set {P/E, Price, Buffett Indicator} a correction of 7%. The R² score is about 0.8.

8. The model is available online at https://github.com/itsergiu/ with a link to https://mybinder.org/, where you can start the Jupyter Notebook to execute it, play with the data, and analyze the results. Execution requires processing power, but mybinder can handle it (no guarantee) from what I tested with 60 bands, which means a 60-month frame for training and prediction. To speed it up, decrease the variable n_times from n_times = 10 to n_times = 1.

The model works with the provided data. It is blind to other events such as the ongoing pandemic with its emerging lockdowns, the China crackdown, and the Evergrande debt crisis, any of which might be a snowball trigger.

What other valuable features are available to include, and how can the noise be excluded from the model to get a more accurate prediction?
