Machine Learning in the Stock Market
Use Machine Learning To Possibly Become A Millionaire: Predicting The Stock Market? The stock market is one of the most well-known infrastructures through which anyone can potentially make a fortune. If anyone could crack the code to predicting future stock prices, they would practically rule the world. There’s just one problem: it’s pretty much impossible to accurately predict the future of the stock market.
In this project, we will work with historical data about the stock prices of publicly listed companies. We will implement a mix of machine learning algorithms to predict future stock prices, starting with a simple algorithm like a decision tree, and then moving on to advanced techniques like auto ARIMA and neural networks.
You can find the code I used in my GitHub repo.
Decision Tree to Trade Bank of America Stock (Real Time Value Analysis)
- Decision trees are one of the more popular machine-learning algorithms for their ability to model noisy data, easily pick up non-linear trends, and capture relationships between your indicators; they also have the benefit of being easy to interpret.
- Decision trees take a top-down, “divide-and-conquer” approach to analyzing data. They look for the indicator, and indicator value, that best splits the data into two distinct groups.
- The algorithm then repeats this process on each subsequent group until it correctly classifies every data point or a stopping criterion is reached.
- Each split, known as a “node”, tries to maximize the purity of the resulting “branches”. The purity is basically the probability that a data point falls in a given class, in our case “up” or “down”, and is measured by the “information gain” of each split.
- In this model, we are going to use real-time data for Bank of America (BAC) stock from 2000 to 2020. For this we will use the quantmod package.
Building a Strategy
- Let’s see how we can quickly build a strategy using 4 technical indicators to see whether today’s price of BoA’s stock is going to close up or down. The 4 technical indicators are:
- RSI -> Calculate a 3-period relative strength index (RSI) off the open price
- EMA -> Calculate a 5-period exponential moving average (EMA)
- MACD -> Calculate a MACD with standard parameters
- SMI -> Stochastic Oscillator with standard parameters
- Then we calculate the variable we are looking to predict and build our data sets.
- Class -> Calculate the difference between the close price and the open price. If PriceChange > 0 the class is “UP”, otherwise “DOWN”
[1] "BAC"
[1] "RSI3" "EMAcross" "MACDsignal" "Stochastic" "Class"
Data slicing is the step of splitting the data into a train set and a test set. The training set is used for model building; the test set must not be mixed in while building the model. Even during standardization, we should not standardize using the test set. 75 percent of the data goes to training and 25 percent to testing.
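A sketch of this split with caret’s createDataPartition() (the variable names follow the sketch above and the seed is illustrative):

library(caret)

set.seed(100)
intrain  <- createDataPartition(y = dataSet$Class, p = 0.75, list = FALSE)
training <- dataSet[intrain, ]
testing  <- dataSet[-intrain, ]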
Training the Decision Tree classifier with information gain as the splitting criterion
We are setting 3 parameters of the trainControl() method. The “method” parameter holds the details about the resampling method; we set it to “repeatedcv”, i.e., repeated cross-validation.
The “number” parameter holds the number of resampling iterations, and the “repeats” parameter contains the number of complete sets of folds to compute for our repeated cross-validation. We use number = 10 and repeats = 3. trainControl() returns a list, which we pass on to our train() method.
To select the splitting strategy for the decision tree, we pass a “parms” parameter to train(). It should contain a list of parameters for the rpart method. For the splitting criterion, we add a “split” element with the value “information” for information gain or “gini” for the Gini index. We use information gain as the criterion, as in the sketch below.
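A minimal sketch of the resampling setup and the rpart fit (assuming the training/testing split above; tuneLength = 10 is an assumption consistent with the ten cp values reported below):

library(caret)

trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

set.seed(100)
dtree_fit <- train(Class ~ ., data = training,
                   method = "rpart",                     # CART decision tree
                   parms = list(split = "information"),  # information-gain splits
                   trControl = trctrl,
                   preProcess = c("center", "scale"),
                   tuneLength = 10)                      # try 10 values of cp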
Trained Decision Tree classifier results:
- We can check the result of our train() method by printing the dtree_fit variable. It shows the accuracy metrics for different values of cp, the complexity parameter of our decision tree.
CART
3908 samples
4 predictor
2 classes: 'DOWN', 'UP'
Pre-processing: centered (4), scaled (4)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 3517, 3518, 3517, 3517, 3517, 3517, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.001388889 0.5141566 0.0301167276
0.001562500 0.5145822 0.0308771188
0.001822917 0.5152644 0.0322216484
0.001909722 0.5154349 0.0325509692
0.002083333 0.5159460 0.0335094058
0.002343750 0.5170542 0.0345975203
0.002951389 0.5177349 0.0358283997
0.003645833 0.5188436 0.0377622413
0.018229167 0.5058795 0.0006184276
0.019531250 0.5053680 -0.0004969420
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.003645833.
Plot Decision Tree
Decision Tree Prediction
Now our model is trained with cp = 0.003645833, which gave a cross-validated accuracy of 0.5188436. Next we predict on the test set.
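A sketch of how the test-set predictions and the confusion matrix below might be produced (names follow the earlier split sketch):

test_pred <- predict(dtree_fit, newdata = testing)
confusionMatrix(test_pred, as.factor(testing$Class))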
Confusion Matrix and Statistics
Reference
Prediction DOWN UP
DOWN 317 279
UP 345 360
Accuracy : 0.5204
95% CI : (0.4928, 0.5478)
No Information Rate : 0.5088
P-Value [Acc > NIR] : 0.210682
Kappa : 0.0422
Mcnemar's Test P-Value : 0.009266
Sensitivity : 0.4789
Specificity : 0.5634
Pos Pred Value : 0.5319
Neg Pred Value : 0.5106
Prevalence : 0.5088
Detection Rate : 0.2437
Detection Prevalence : 0.4581
Balanced Accuracy : 0.5211
'Positive' Class : DOWN
ARIMA Forecasting Expedia Stock Price incorporating COVID-19 (Real Time)
The goal of this project is to predict the future stock price of Expedia using predictive forecasting models, principally ARIMA, and then to analyse the fitted models.
The dataset of Expedia stock is obtained from Yahoo Finance using the quantmod package in R.
The timeline of the data is from 2019 till the present day (11/26/2020).
We shall also try to understand the impact of the COVID-19 disaster on the stock price of Expedia.
Forecasting
- A forecasting algorithm is a process that seeks to predict future values based on the past and present data.
- These historical data points are extracted and prepared, and are then used to predict future values of a selected variable of the dataset.
Data Preparation
Importing the data: We obtain Expedia stock price data from “2019-07-01” to “2020-11-26” for our analysis using the quantmod package. To analyse the impact of COVID-19 on the Expedia stock price, we take two sets of data, as in the sketch below.
Data from “2019-07-01” to “2020-03-28” is the before-COVID data.
Data from “2020-04-01” till date is the after-COVID data.
All the analysis and the models will be made on both datasets to analyse the impact of COVID-19, if any.
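A minimal import sketch (the google_data_* variable names are kept from the document’s own output below; the dates follow the text):

library(quantmod)

getSymbols("EXPE", src = "yahoo", from = "2019-07-01", to = "2020-03-28")
google_data_before_covid <- EXPE   # before-COVID window

getSymbols("EXPE", src = "yahoo", from = "2020-04-01")
google_data_after_covid <- EXPE    # after-COVID window, through today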
[1] "EXPE"
[1] "EXPE"
colnames(google_data_after_covid)
1 EXPE.Open
2 EXPE.High
3 EXPE.Low
4 EXPE.Close
5 EXPE.Volume
6 EXPE.Adjusted
Graphical Representation of Data
ARIMA Model
- Let us first analyse the ACF and PACF Graph of each of the two datasets.
- We then use the auto.arima function to determine a time series model for each dataset; a sketch follows.
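A hedged sketch of the model search (lambda = "auto" is an assumption consistent with the Box-Cox transformations reported below; the modelfit_* names match the residual tests later in this section):

library(forecast)

# Closing prices as plain time series
tsData_before_covid_close <- ts(as.numeric(Cl(google_data_before_covid)))
tsData_after_covid_close  <- ts(as.numeric(Cl(google_data_after_covid)))

# ACF/PACF diagnostics for each series
acf(tsData_before_covid_close); pacf(tsData_before_covid_close)
acf(tsData_after_covid_close);  pacf(tsData_after_covid_close)

# Automatic ARIMA order selection with a Box-Cox transformation
modelfit_before_covid <- auto.arima(tsData_before_covid_close, lambda = "auto")
modelfit_after_covid  <- auto.arima(tsData_after_covid_close,  lambda = "auto")
summary(modelfit_before_covid)
summary(modelfit_after_covid)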
Series: tsData_before_covid_close
ARIMA(3,1,2)
Box Cox transformation: lambda= -0.5522943
Coefficients:
ar1 ar2 ar3 ma1 ma2
0.9436 -0.2552 -0.2793 -0.9659 0.7154
s.e. 0.1333 0.1520 0.0991 0.1174 0.1094
sigma^2 estimated as 1.264e-05: log likelihood=791.34
AIC=-1570.69 AICc=-1570.22 BIC=-1551.3
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.3468986 3.779995 2.064975 -0.4697368 2.119611 1.117624
ACF1
Training set -0.01611792
Series: tsData_after_covid_close
ARIMA(1,1,0) with drift
Box Cox transformation: lambda= 1.049021
Coefficients:
ar1 drift
-0.0993 0.5238
s.e. 0.0785 0.3099
sigma^2 estimated as 19.12: log likelihood=-470.79
AIC=947.57 AICc=947.73 BIC=956.86
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.002349432 3.471361 2.361182 -0.07665327 2.827689 0.9868023
ACF1
Training set 0.002088858
From the auto.arima function, we conclude the following models for the two datasets:
- Before COVID-19: ARIMA(3,1,2)
- After COVID-19: ARIMA(1,1,0)
After obtaining the model, we then perform residual diagnostics for each of the fitted models.
From the residual plots, we can confirm that the residuals have a mean of 0 and constant variance. The ACF is approximately 0 for lags > 0, and the PACF is as well.
So we can say that the residuals behave like white noise and conclude that the models ARIMA(3,1,2) and ARIMA(1,1,0) fit the data well. Alternatively, we can test this at a chosen significance level using the Box-Ljung test.
Diagnostic measures
Try to find patterns in the residuals of the chosen model by plotting the ACF of the residuals and performing a portmanteau test. If the plot does not look like white noise, we need to try modified models.
Once the residuals look like white noise, calculate forecasts.
Box-Ljung test is a test of independence at all lags up to the one specified. Instead of testing randomness at each distinct lag, it tests the “overall” randomness based on a number of lags, and is therefore a portmanteau test. It is applied to the residuals of a fitted ARIMA model, not the original series, and in such applications the hypothesis actually being tested is that the residuals from the ARIMA model have no autocorrelation.
The ACF of the residuals shows no significant autocorrelations.
The p-values for the Ljung-Box Q test are all well above 0.05, indicating “non-significance.” The residuals also appear normal, as their quantiles rest on a line and aren’t all over the place.
Augmented Dickey-Fuller & Kwiatkowski-Phillips-Schmidt-Shin
- We then conduct an ADF (Augmented Dickey-Fuller) test and a KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test to check the stationarity of each dataset’s closing-price series, as in the sketch below.
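A minimal sketch of these tests with the tseries package (the ADF null hypothesis is a unit root, i.e., non-stationarity; the KPSS null is level stationarity):

library(tseries)

adf.test(tsData_before_covid_close)
adf.test(tsData_after_covid_close)

kpss.test(tsData_before_covid_close)
kpss.test(tsData_after_covid_close)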
Augmented Dickey-Fuller Test
data: tsData_before_covid_close
Dickey-Fuller = -2.1153, Lag order = 5, p-value = 0.5279
alternative hypothesis: stationary
Augmented Dickey-Fuller Test
data: tsData_after_covid_close
Dickey-Fuller = -2.4665, Lag order = 5, p-value = 0.3817
alternative hypothesis: stationary
From the above ADF tests, we can conclude the following:
For the dataset before COVID-19, the ADF test gives a p-value of 0.5279, which is greater than 0.05, implying that the time series is not stationary.
For the dataset after COVID-19, the ADF test gives a p-value of 0.3817, which is greater than 0.05, implying that the time series is not stationary.
KPSS Test for Level Stationarity
data: tsData_before_covid_close
KPSS Level = 2.3907, Truncation lag parameter = 4, p-value = 0.01
KPSS Test for Level Stationarity
data: tsData_after_covid_close
KPSS Level = 2.6532, Truncation lag parameter = 4, p-value = 0.01
From the above KPSS tests, we can conclude the following:
For the dataset before COVID-19, the KPSS test gives a p-value of 0.01, which is less than 0.05, implying that the time series is not stationary.
For the dataset after COVID-19, the KPSS test gives a p-value of 0.01, which is less than 0.05, implying that the time series is not stationary.
Thus, we can conclude from the above tests that the time series data is not stationary.
Forecasting with ARIMA Models
- Forecast errors for the before-COVID dataset
- Forecast errors for the after-COVID dataset
Holt Winters
To make forecasts using exponential smoothing in R, we can fit a predictive model using the “HoltWinters()” function. For simple exponential smoothing we would set the parameters beta = FALSE and gamma = FALSE in the HoltWinters() call (the beta and gamma parameters control the trend and seasonal components, respectively). Here we keep the trend component and set only gamma = FALSE, i.e., Holt’s exponential smoothing, as the model output below shows.
The HoltWinters() function returns a list variable that contains several named elements. The output of HoltWinters() tells us that the estimated value of the alpha parameter is about 0.982813.
We can plot the original time series.
Holt-Winters exponential smoothing with trend and without seasonal component.
Call:
HoltWinters(x = tsData_before_covid_close, gamma = FALSE)
Smoothing parameters:
alpha: 0.982813
beta : 0.02381423
gamma: FALSE
Coefficients:
[,1]
a 60.091954
b -0.869578
- As a measure of the accuracy of the forecasts, we can calculate the sum of squared errors for the in-sample forecast errors, that is, the forecast errors for the time period covered by our original time series. The sum of squared errors is stored in the named element “SSE” of the fitted model’s list, so we can get its value by typing:
[1] "skirtsseriesforecasts$SSE"
[1] 2798.291
Forecasting with HoltWinters
- We can make forecasts for further time points by using the “forecast.HoltWinters()” function in the R “forecast” package.
- When using the forecast.HoltWinters() function, as its first argument (input), you pass it the predictive model that you have already fitted using the HoltWinters() function.
- You specify how many further time points you want forecasts for using the “h” parameter of forecast(). Here we are going to predict the price for the next 45 days, as sketched below.
- The forecast.HoltWinters() function gives you the point forecast for each horizon, an 80% prediction interval, and a 95% prediction interval. For example, the forecasted stock price for day 233 is about 20.96095, with a 95% prediction interval of (-56.8291332, 98.75102).
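A hedged sketch of the fit-and-forecast step (the holt_fit/holt_forecast names are illustrative; the HoltWinters() call matches the model summary shown earlier):

library(forecast)

holt_fit      <- HoltWinters(tsData_before_covid_close, gamma = FALSE)
holt_forecast <- forecast(holt_fit, h = 45)  # 45 days ahead

holt_forecast        # point forecasts with 80% and 95% intervals
plot(holt_forecast)  # plot the forecasts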
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
189 59.22238 54.261714 64.18304 51.6356981 66.80905
190 58.35280 51.315518 65.39008 47.5902068 69.11539
191 57.48322 48.787761 66.17868 44.1846635 70.78178
192 56.61364 46.469767 66.75752 41.0999238 72.12736
193 55.74406 44.280562 67.20757 38.2121512 73.27598
194 54.87449 42.178858 67.57011 35.4581988 74.29077
195 54.00491 40.140333 67.86948 32.8008698 75.20895
196 53.13533 38.149294 68.12137 30.2161653 76.05449
197 52.26575 36.194958 68.33655 27.6875932 76.84391
198 51.39617 34.269562 68.52279 25.2032811 77.58907
199 50.52660 32.367315 68.68588 22.7543724 78.29882
200 49.65702 30.483773 68.83026 20.3340704 78.97997
201 48.78744 28.615447 68.95943 17.9370389 79.63784
202 47.91786 26.759545 69.07618 15.5590078 80.27672
203 47.04828 24.913796 69.18277 13.1965057 80.90006
204 46.17871 23.076330 69.28108 10.8466714 81.51074
205 45.30913 21.245588 69.37267 8.5071196 82.11114
206 44.43955 19.420256 69.45884 6.1758412 82.70326
207 43.56997 17.599217 69.54073 3.8511288 83.28882
208 42.70039 15.781515 69.61927 1.5315205 83.86927
209 41.83082 13.966326 69.69531 -0.7842448 84.44588
210 40.96124 12.152934 69.76954 -3.0972610 85.01974
211 40.09166 10.340716 69.84261 -5.4084829 85.59180
212 39.22208 8.529124 69.91504 -7.7187474 86.16291
213 38.35250 6.717675 69.98733 -10.0287917 86.73380
214 37.48293 4.905945 70.05991 -12.3392679 87.30512
215 36.61335 3.093553 70.13314 -14.6507544 87.87745
216 35.74377 1.280164 70.20738 -16.9637667 88.45131
217 34.87419 -0.534523 70.28291 -19.2787648 89.02715
218 34.00461 -2.350779 70.36001 -21.5961611 89.60539
219 33.13504 -4.168844 70.43892 -23.9163256 90.18640
220 32.26546 -5.988937 70.51985 -26.2395913 90.77051
221 31.39588 -7.811255 70.60302 -28.5662585 91.35802
222 30.52630 -9.635974 70.68858 -30.8965985 91.94920
223 29.65672 -11.463254 70.77670 -33.2308566 92.54431
224 28.78715 -13.293243 70.86754 -35.5692552 93.14355
225 27.91757 -15.126070 70.96121 -37.9119963 93.74713
226 27.04799 -16.961857 71.05784 -40.2592630 94.35524
227 26.17841 -18.800712 71.15754 -42.6112221 94.96805
228 25.30884 -20.642734 71.26040 -44.9680254 95.58570
229 24.43926 -22.488014 71.36653 -47.3298109 96.20832
230 23.56968 -24.336634 71.47599 -49.6967045 96.83606
231 22.70010 -26.188669 71.58887 -52.0688212 97.46902
232 21.83052 -28.044188 71.70523 -54.4462656 98.10731
233 20.96095 -29.903253 71.82514 -56.8291332 98.75102
- To plot the predictions made by forecast.HoltWinters(), we can use the “plot.forecast()” function:
- Prediction for the next ten days.
Ljung-Box Tests
Box Test for modelfit_after_covid$residuals.
To check for correlations between successive forecast errors, we can make a correlogram and use the Ljung-Box test, as sketched below.
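A minimal sketch of the tests (lag = 1 is an assumption consistent with df = 1 in the output below):

Box.test(modelfit_after_covid$residuals,  lag = 1, type = "Ljung-Box")
Box.test(modelfit_before_covid$residuals, lag = 1, type = "Ljung-Box")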
Box-Ljung test
data: modelfit_after_covid$residuals
X-squared = 1.6058e-05, df = 1, p-value = 0.9968
- Box Test for modelfit_before_covid$residuals
Box-Ljung test
data: modelfit_before_covid$residuals
X-squared = 0.0047812, df = 1, p-value = 0.9449
Conclusion
Here, the p-value for both models is greater than 0.05. Hence, at a significance level of 0.05 we fail to reject the null hypothesis and conclude that the residuals behave like white noise. This means the models fit the data well.
Using a Neural Network to Model the Amazon Stock Price Change (Real Time Analysis)
Artificial neural networks are very powerful and popular machine-learning algorithms that mimic how a brain works in order to find patterns in your data.
In this project, we will build a basic neural network to model the price change of Amazon stocks in real Time.
ANNs make predictions by sending the inputs (in our case, the indicators) through the network of neurons, with the neurons firing off depending on the weights of the incoming signals. The final output is determined by the strength of the signals coming from the previous layer of neurons.
Data Import
Use “quantmod” package to download information for Amazon stocks.
Let’s see how we can quickly build a strategy using 4 technical indicators to model today’s price of Amazon’s stock. The 4 technical indicators are:
- RSI -> Calculate a 3-period relative strength index (RSI) off the open price
- EMA -> Calculate a 5-period exponential moving average (EMA)
- MACD -> Calculate a MACD with standard parameters
- BBands -> Bollinger Bands %B with standard parameters (the BollingerB column in the dataset below)
[1] "AMZN"
Normalize Data
One of the most important procedures when forming a neural network is data normalization. This involves adjusting the data to a common scale so as to accurately compare predicted and actual values. Failure to normalize the data will typically result in the prediction value remaining the same across all observations, regardless of the input values.
Max-Min Normalization: For this method, we invoke the following function to normalize our data:
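A minimal sketch of such a max-min scaler (the DataSetAmazon data frame name is an assumption):

# Scale each value into [0, 1]
normalize <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Apply column-wise to the indicator data frame
DataSetAmazon <- as.data.frame(lapply(DataSetAmazon, normalize))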
Data slicing:
- As before, data slicing splits the data into train and test sets: 75 percent for training and 25 percent for testing, with the test set kept out of model building and standardization.
'data.frame': 5209 obs. of 5 variables:
$ RSI3 : num 0.0853 0.0409 0.0402 0.6396 0.4188 ...
$ EMAcross : num 0.538 0.534 0.539 0.553 0.545 ...
$ MACDsignal: num 0.727 0.705 0.681 0.661 0.643 ...
$ BollingerB: num 0.423 0.353 0.362 0.477 0.408 ...
$ Price : num 0.417 0.424 0.444 0.422 0.433 ...
- TrainingParameters:
- The train() method is passed a repeated cross-validation resampling method with 10 resampling iterations, repeated 3 times, as sketched below.
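A sketch of the control object assumed by the train() call later in this section:

TrainingParameters <- trainControl(method = "repeatedcv", number = 10, repeats = 3)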
Machine Learning: Classification using Neural Networks
- Model Training
- We can use neuralnet() to train a NN model. Also, the train() function from caret can help us tune parameters. We can plot the result to see which set of parameters fits our data best.
- The nnet package by default uses the logistic activation function.
- Data Pre-Processing With Caret: The scale transform calculates the standard deviation for an attribute and divides each value by that standard deviation.
- The center transform calculates the mean for an attribute and subtracts it from each value.
- Combining the scale and center transforms will standardize your data.
- Attributes will have a mean value of 0 and a standard deviation of 1.
- Training transforms can be prepared and applied automatically during model evaluation.
- Transforms applied during training are prepared using the preProcess() and passed to the train() function via the preProcess argument.
- Backpropagation algorithm is a supervised learning method for multilayer feed-forward networks from the field of Artificial Neural Networks.
- The principle of the backpropagation approach is to model a given function by modifying internal weightings of input signals to produce an expected output signal. The system is trained using a supervised learning method, where the error between the system’s output and a known expected output is presented to the system and used to modify its internal state.
- We use Backpropagation as algorithm in neural network package.
# Tuning grid: hidden-layer sizes 1-5 and weight decay 0.1 or 0.2
nnetGrid <- expand.grid(size  = seq(from = 1, to = 5, by = 1),
                        decay = seq(from = 0.1, to = 0.2, by = 0.1))
str(subTrain)
'data.frame': 12507 obs. of 5 variables:
$ RSI3 : num 0.0853 0.0409 0.0402 0.4188 0.3651 ...
$ EMAcross : num 0.538 0.534 0.539 0.545 0.544 ...
$ MACDsignal: num 0.727 0.705 0.681 0.643 0.625 ...
$ BollingerB: num 0.423 0.353 0.362 0.408 0.386 ...
$ Price : num 0.417 0.424 0.444 0.433 0.426 ...
# Regression of Price on the four indicators, tuned over nnetGrid
nn_model <- train(Price ~ RSI3 + EMAcross + MACDsignal + BollingerB, subTrain,
                  method = "nnet",                    # single-hidden-layer network
                  algorithm = "backprop",             # backpropagation
                  trControl = TrainingParameters,
                  preProcess = c("scale", "center"),  # standardize the predictors
                  na.action = na.omit,
                  tuneGrid = nnetGrid,
                  trace = FALSE,
                  verbose = FALSE)
- Based on the caret neural network model, train() selects the hidden layer: caret picks the best neural network based on size and decay.
- We can inspect the resampled error for the different hidden-layer sizes and decay values below:
   size decay       RMSE   Rsquared        MAE      RMSESD RsquaredSD        MAESD
1     1   0.1 0.03933584 0.01238798 0.01586244 0.002504076 0.01443134 0.0007305646
2     1   0.2 0.03934006 0.01239107 0.01583777 0.002507265 0.01451383 0.0007313067
3     2   0.1 0.03933346 0.01268185 0.01586449 0.002503906 0.01468389 0.0007312625
4     2   0.2 0.03933083 0.01281302 0.01584319 0.002506894 0.01488929 0.0007335536
5     3   0.1 0.03933819 0.01287279 0.01586899 0.002504096 0.01467428 0.0007326783
6     3   0.2 0.03933054 0.01293707 0.01584766 0.002504418 0.01501688 0.0007321606
7     4   0.1 0.03932220 0.01314280 0.01584623 0.002496180 0.01459575 0.0007377006
8     4   0.2 0.03932984 0.01300564 0.01584778 0.002504590 0.01508909 0.0007321347
9     5   0.1 0.03932381 0.01361986 0.01584481 0.002522508 0.01473453 0.0007322239
10    5   0.2 0.03932938 0.01304931 0.01584684 0.002504771 0.01512491 0.0007327730
- Prediction
- Now, our model is trained with accuracy = 0.8889. We are ready to predict prices for our test set.
actual prediction
7 0.4362379 0.4274795
10 0.4256909 0.4273008
97 0.4278731 0.4284938
139 0.4238725 0.4280213
483 0.4315391 0.4283605
783 0.4282804 0.4271653
896 0.4283095 0.4268096
908 0.4300552 0.4267116
911 0.4315391 0.4280867
952 0.4203374 0.4286907
1012 0.4257782 0.4279932
1711 0.4255164 0.4265486
1746 0.4299679 0.4280298
1804 0.4386092 0.4280147
2058 0.4317427 0.4280102
2073 0.4374163 0.4289039
2176 0.4056444 0.4292499
2276 0.4372417 0.4271038
2695 0.4251672 0.4280948
2746 0.4161477 0.4282272
2772 0.4228978 0.4272442
2784 0.4386092 0.4272447
3057 0.4236834 0.4273921
3139 0.4282513 0.4273274
3178 0.4408495 0.4281902
3220 0.4289496 0.4283887
3335 0.4316845 0.4266736
3443 0.4418097 0.4276801
3503 0.4245853 0.4275239
4018 0.4396857 0.4311302
4237 0.4353504 0.4281576
4261 0.4384055 0.4282462
4421 0.4145185 0.4284854
4526 0.5582194 0.4272961
4664 0.4444863 0.4258982
4786 0.4411985 0.4281155
4843 0.3776257 0.4279618
4865 0.4139076 0.4277111
4937 0.4610998 0.4286968
5030 0.5170493 0.4270784
Accuracy of the Neural Network model:
Roughly 99% accuracy is achieved in predicting the price using the stock technical indicators.
RSI3 EMAcross MACDsignal BollingerB Price
7 0.32459911 0.5442847 0.6075494 0.3572573 0.4362379
10 0.20618845 0.5412815 0.5634201 0.3204323 0.4256909
97 0.24607269 0.5460137 0.2333221 0.3782227 0.4278731
139 0.84671114 0.5597144 0.7114260 0.7928638 0.4238725
483 0.80588690 0.5503068 0.6449523 0.7104961 0.4315391
783 0.54896975 0.5481855 0.7942074 0.5794676 0.4282804
896 0.36179313 0.5467044 0.7485756 0.4977961 0.4283095
908 0.36059256 0.5461656 0.7215884 0.5917109 0.4300552
911 0.87072321 0.5542228 0.7250983 0.8051809 0.4315391
952 0.72799440 0.5495230 0.5471357 0.5809809 0.4203374
1012 0.25508071 0.5462964 0.4156671 0.3042368 0.4257782
1711 0.17516899 0.5466259 0.7149182 0.3455139 0.4255164
1746 0.42878584 0.5473307 0.5385448 0.3833825 0.4299679
1804 0.95547536 0.5649359 0.7302750 0.9376553 0.4386092
2058 0.75475738 0.5528730 0.6755122 0.7233058 0.4317427
2073 0.85834756 0.5557843 0.6024193 0.5896256 0.4374163
2176 0.61647522 0.5560750 0.3465414 0.4212848 0.4056444
2276 0.38169960 0.5460403 0.7248211 0.4346285 0.4372417
2695 0.79630557 0.5568765 0.7103222 0.6807527 0.4251672
2746 0.86740868 0.5570476 0.6578077 0.8334958 0.4161477
2772 0.33325260 0.5425081 0.6137948 0.4871775 0.4228978
2784 0.11853439 0.5357345 0.5229431 0.2911652 0.4386092
3057 0.37948310 0.5450882 0.6474615 0.4269868 0.4236834
3139 0.47640198 0.5473919 0.6464690 0.6264217 0.4282513
3178 0.64602849 0.5532853 0.6476852 0.4856036 0.4408495
3220 0.78578515 0.5543646 0.6158716 0.6913156 0.4289496
3335 0.14403635 0.5345840 0.5990995 0.4992964 0.4316845
3443 0.95580068 0.6123559 0.6557729 0.9230118 0.4418097
3503 0.48426281 0.5469840 0.6377598 0.5614055 0.4245853
4018 0.04419073 0.4504553 0.4264717 0.1988550 0.4396857
4237 0.59029580 0.5558939 0.5694519 0.5221716 0.4353504
4261 0.87353837 0.5836426 0.6483177 0.7248558 0.4384055
4421 0.88473691 0.5823825 0.5743706 0.7673681 0.4145185
4526 0.59590136 0.5793888 0.7331421 0.5560403 0.5582194
4664 0.98636455 0.6589197 0.6674518 0.8129811 0.4444863
4786 0.62740722 0.5626872 0.5948031 0.5374183 0.4411985
4843 0.54126423 0.5618588 0.6234117 0.4116830 0.3776257
4865 0.63945676 0.5788562 0.5987941 0.6760626 0.4139076
4937 0.20789866 0.4876023 0.5509471 0.2674240 0.4610998
5030 0.44684424 0.5332037 0.7226938 0.6420079 0.5170493
[1] 0.0034655906 -0.0006396635 -0.0002464325 -0.0016497116 0.0012600367
[6] 0.0004426085 0.0005953472 0.0013262634 0.0013685815 -0.0033262027
[11] -0.0008800795 -0.0004101826 0.0007687922 0.0041881569 0.0014795282
[16] 0.0033666456 -0.0094548054 0.0040098370 -0.0011635185 -0.0048179652
[21] -0.0017289311 0.0044925247 -0.0014748155 0.0003667483 0.0049999668
[26] 0.0002225924 0.0019863283 0.0055785501 -0.0011681378 0.0033806843
[31] 0.0028470762 0.0040164223 -0.0055743799 0.0494191079 0.0073310991
[36] 0.0051666092 -0.0203900965 -0.0055105203 0.0126964783 0.0344970986
[1] 0.9974791
- Plotting nnet variable importance
nnet variable importance
Overall
EMAcross 100.000
MACDsignal 11.556
BollingerB 6.033
RSI3 0.000
Graphical Representation of our Neural Network
Conclusion
From the above implementation, the results are impressive (99% accuracy) and convincing in terms of using a machine learning algorithm to decide on the price of the stock. The majority of the attributes in the dataset contribute significantly to building a predictive model.
Using SVM to predict price change for WALMART (Real Time Analysis)
Machine Learning: Classification using SVM
- SVM is another classification method, used here to predict whether a trading day falls into the ‘UP’ or ‘DOWN’ class.
- The linear, polynomial, and RBF (Gaussian) kernels in SVM differ simply in how they form the hyperplane decision boundary between the classes.
- The kernel functions map the original dataset (linear or nonlinear) into a higher-dimensional space, with a view to making the data linearly separable.
- Usually the linear and polynomial kernels are less time consuming but provide lower accuracy than the RBF or Gaussian kernels.
- k-fold cross-validation divides the training set into k distinct subsets; each subset is then held out for validation in turn while the other k-1 are used for training. This gives a more reliable assessment of the classifier. Overall, if you are unsure which kernel method would be best, a good practice is to use something like 10-fold cross-validation on the training set and then pick the best algorithm.
colnames(DataSetWalmart)
1 Class
2 RSI
3 EMAcross
4 MACD
5 SMI
6 WPR
7 ADX
8 CCI
9 CMO
10 ROC
SVM Classifier using Linear Kernel
The caret package provides the train() method for training our data with various algorithms; we just pass different parameter values for different algorithms. Before train(), we first use the trainControl() method.
We are setting 3 parameters of the trainControl() method. The “method” parameter holds the details about the resampling method. We can set “method” to many values like “boot”, “boot632”, “cv”, “repeatedcv”, “LOOCV”, “LGOCV”, etc. For this project, let’s use repeatedcv, i.e., repeated cross-validation.
The “number” parameter holds the number of resampling iterations. The “repeats” parameter contains the number of complete sets of folds to compute for our repeated cross-validation. Here we use number = 10 with a single repeat, matching the output below. This trainControl() method returns a list, which we pass on to our train() method.
Before training our SVM classifier, we call set.seed() for reproducibility.
For training the SVM classifier, the train() method is passed “svmLinear” as its “method” parameter. We pass our target variable Class: the formula “Class ~ .” uses all the other attributes as predictors with Class as the target. The “trControl” parameter takes the result of our trainControl() method, and the “preProcess” parameter controls preprocessing of the training data, as in the sketch below.
As discussed earlier, preprocessing is a mandatory task for our data. We pass 2 values in our “preProcess” parameter: “center” and “scale”. Together these standardize the training data so that each attribute has a mean of approximately 0 and a standard deviation of 1. The “tuneGrid” (or “tuneLength”) parameter controls how the algorithm’s cost parameter is tuned.
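A hedged sketch of the linear-kernel fit (trainingWalmart is an illustrative name; the C grid matches the two values in the output below):

trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)

set.seed(100)
svm_Linear <- train(Class ~ ., data = trainingWalmart,
                    method = "svmLinear",
                    trControl = trctrl,
                    preProcess = c("center", "scale"),
                    tuneGrid = expand.grid(C = c(0.25, 0.5)))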
Support Vector Machines with Linear Kernel
18651 samples
7 predictor
2 classes: 'DOWN', 'UP'
Pre-processing: centered (7), scaled (7)
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 16786, 16786, 16786, 16786, 16786, 16786, ...
Resampling results across tuning parameters:
C Accuracy Kappa
0.25 0.8209209 0.6412469
0.50 0.8211354 0.6416789
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was C = 0.5.
- The above model shows that our classifier gives the best accuracy at C = 0.5. Let’s try to make predictions using this model for our test set and check its accuracy.
predictionsvm DOWN UP
DOWN 28 8
UP 5 28
- Accuracy on the test set is about 81% using C = 0.5.
[1] 0.8115942
Confusion Matrix and Statistics
Reference
Prediction DOWN UP
DOWN 28 8
UP 5 28
Accuracy : 0.8116
95% CI : (0.6994, 0.8957)
No Information Rate : 0.5217
P-Value [Acc > NIR] : 5.253e-07
Kappa : 0.6239
Mcnemar's Test P-Value : 0.5791
Sensitivity : 0.8485
Specificity : 0.7778
Pos Pred Value : 0.7778
Neg Pred Value : 0.8485
Prevalence : 0.4783
Detection Rate : 0.4058
Detection Prevalence : 0.5217
Balanced Accuracy : 0.8131
'Positive' Class : DOWN
SVM Classifier using Non-Linear Kernel
- Now we will try to build a model using a non-linear kernel, the Radial Basis Function. To use the RBF kernel, we just change the train() method’s “method” parameter to “svmRadial”. The radial kernel needs proper values for the cost parameter “C” and the “sigma” parameter, as in the sketch below.
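A sketch of the radial-kernel fit (holding sigma fixed at 0.5 to match the output below; names follow the earlier sketch):

set.seed(100)
svm_Radial <- train(Class ~ ., data = trainingWalmart,
                    method = "svmRadial",
                    trControl = trctrl,
                    preProcess = c("center", "scale"),
                    tuneGrid = expand.grid(sigma = 0.5, C = c(0.25, 0.5)))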
Support Vector Machines with Radial Basis Function Kernel
18651 samples
7 predictor
2 classes: 'DOWN', 'UP'
Pre-processing: centered (7), scaled (7)
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 16786, 16786, 16786, 16786, 16786, 16786, ...
Resampling results across tuning parameters:
C Accuracy Kappa
0.25 0.8565757 0.7127826
0.50 0.8618302 0.7233515
Tuning parameter 'sigma' was held constant at a value of 0.5
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.5 and C = 0.5.
- The SVM-RBF kernel evaluates the parameter combinations and presents the best values of sigma and C. Based on the output, the best values are sigma = 0.5 and C = 0.5. Let’s check our trained model’s accuracy on the test set.
[1] 0.7971014
- Final prediction accuracy on the test set is 0.7971014.
Comparison between SVM models
- Comparison between the SVM Linear and Radial models, as sketched below.
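A sketch of the comparison with caret’s resamples(), whose summary matches the output below:

algo_results <- resamples(list(SVM_RADIAL = svm_Radial, SVM_LINEAR = svm_Linear))
summary(algo_results)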
Call:
summary.resamples(object = algo_results)
Models: SVM_RADIAL, SVM_LINEAR
Number of resamples: 10
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
SVM_RADIAL 0.8509383 0.8571046 0.8630027 0.8618302 0.8657375 0.8723861 0
SVM_LINEAR 0.8144772 0.8171582 0.8209115 0.8211354 0.8248692 0.8284182 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
SVM_RADIAL 0.7011356 0.7139651 0.7255579 0.7233515 0.7312067 0.7446252 0
SVM_LINEAR 0.6280912 0.6334363 0.6410743 0.6416789 0.6492305 0.6567207 0
Conclusion
From the above implementation, the results are impressive and convincing in terms of using a machine learning algorithm to decide on the price change of Walmart. The majority of the attributes in the dataset contribute significantly to building a predictive model. Both SVM approaches achieve a good accuracy rate (>80%) and are easy to implement.
Amazon Price Trend Predict - Random Forest (Real Time Analysis)
Stock market prediction is an incredibly difficult task, due to the randomness and noisiness found in the market. Yet, predicting market behaviors is a very important task. Correctly predicting stock price directions can be used to maximize profits, as well as to minimize risk. There are two types of methods to predicting market behavior. One is predicting the future price of an asset. This is usually done using time series analysis to fit a specific model, like ARIMA or GARCH, to some historical data. The other is predicting the future trend of an asset. That is, whether one thinks it will go up or down in price, treating it as a classification problem.
The goal of this project is to create an intelligent model, using the Random Forest model, that can correctly forecast the behavior of a stock’s price n days out.
Data Import
Use “quantmod” package to download information for Amazon stocks.
The data used for this project consists of regular stock data (open, close, volume, etc.) from Yahoo finance, and ranges from the year 2000 to 2018.
From this data, technical indicators were calculated for every stock. Below are all the technical indicators used for this model:
- Relative Strength Index
- Stochastic Oscillator
- William %R
- Moving Average Convergence Divergence
- Price Rate of Change
- On Balance Volume
The last step of pre-processing was calculating the response variable. Since we are treating this as a classification problem, the response variable is binary. The equation for calculating it is: Response = Close(t+n) − Close(t).
That is, the adjusted close price at time t+n (where n is the number of days out you want to predict) minus the current adjusted close price maps to a value saying the stock price went up from its value at time t, or that it went down. A sketch of this calculation follows.
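A minimal sketch of the response calculation (assuming the GSPC object loaded with quantmod in the next step; n = 1 is an illustrative horizon):

n     <- 1                                    # days ahead to predict
close <- as.numeric(Ad(GSPC))                 # adjusted close prices
# Close_{t+n} - Close_t, padded with NA at the end of the series
change   <- c(close[-(1:n)], rep(NA, n)) - close
response <- ifelse(change > 0, 1, 0)          # 1 = up, 0 = down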
Methodology
[1] "^GSPC"
- Dataset used for this project:
colnames(dataset1)
1 rsi
2 EMA
3 signal
4 pctB
5 GSPC.Close
Check for missing data in the indicator columns (rsi, EMA, signal, pctB) and in GSPC.Close, and omit the NA values from the dataset.
Print the number of missing values for each attribute in the dataset.
[1] "rsi"
[1] 20
[1] "EMA"
[1] 19
[1] "signal"
[1] 33
[1] "pctB"
[1] 19
[1] "GSPC.Close"
[1] 0
Random Forest Machine Learning Model
Split the dataset into training and test sets: 80% of the data is training data and 20% is the test set.
Feature scaling -> normalize/scale the features, dropping the predicted variable from the scaling step, as in the code below.
# Feature Scaling (Normalization and dropping the predicted variable)
training_set[-5] = scale(training_set[-5])
test_set[-5] = scale(test_set[-5])
- Applying the Random Forest model on the training set, as sketched below.
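A hedged sketch of the fit (the summary output below, with votes of length 3768 = 1884 × 2 and err.rate of length 30 = ntree × 3, is consistent with ntree = 10, though that value is an inference):

library(randomForest)

set.seed(100)
classifier <- randomForest(x = training_set[-5],              # the four scaled indicators
                           y = as.factor(training_set[[5]]),  # binary up/down response
                           ntree = 10)
summary(classifier)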
Length Class Mode
call 4 -none- call
type 1 -none- character
predicted 1884 factor numeric
err.rate 30 -none- numeric
confusion 6 -none- numeric
votes 3768 matrix numeric
oob.times 1884 -none- numeric
classes 2 -none- character
importance 4 -none- numeric
importanceSD 0 -none- NULL
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 14 -none- list
y 1884 factor numeric
test 0 -none- NULL
inbag 0 -none- NULL
Prediction & Accuracy.
- After building the model, we can check the forecasting accuracy using a confusion matrix.
- The final accuracy of the model is about 51 percent.
predict_val
0 1
0 89 128
1 102 152
[1] "Model Accuracy is"
[1] 0.5116773
Ensemble Method for Random Forest
Random Forest
1884 samples
4 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 1257, 1256, 1255
Resampling results across tuning parameters:
mtry Accuracy Kappa
3 0.5058141 -0.003131317
6 0.5021045 -0.009128875
9 0.5132502 0.013836823
Kappa was used to select the optimal model using the largest value.
The final value used for the model was mtry = 9.
Overall insights obtained from the implemented project
- As we can see, the model has its highest accuracy around 51 percent. While this may not seem like much, it is often extremely hard to predict the price of stocks, and even a small improvement over random guessing can make a difference given the amount of money at stake. After all, if it were that easy to predict prices, wouldn’t we all be trading stocks for easy money instead of learning these algorithms?
Conclusion
This is a beginning of the use of ML algorithms for predicting time series data such as stock prices. The approach can be modified and optimized in many ways to produce better, more efficient, and more accurate results.
Choosing the right technical indicators that influence the price change is daunting. In this project we tried to predict the price change with a variety of technical indicators fed into different ML algorithms.
Although we achieved an accuracy of 99 percent with a few ML algorithms, there are many other things that impact stock prices, such as political and social upheavals, current affairs, etc.
Thus, we can say that stock market price change is a highly dynamic process.