Machine Learning in the Stock Market
Use Machine Learning To Possibly Become A Millionaire: Predicting The Stock Market? The stock market is one of the most well-known infrastructures through which anyone can potentially make a fortune. If anyone could crack the code to predicting future stock prices, they would practically rule the world. There’s just one problem: it’s pretty much impossible to accurately predict the future of the stock market.
In this project, we will work with historical data about the stock prices of publicly listed companies. We will implement a mix of machine learning algorithms to predict future stock prices, starting with a simple algorithm like a decision tree, and then moving on to advanced techniques like auto ARIMA and neural networks.
You can find the code I used in my GitHub repo.
Decision Tree to Trade Bank of America Stock (Real Time Value Analysis)
- Decision trees are one of the more popular machine-learning algorithms for their ability to model noisy data, easily pick up non-linear trends, and capture relationships between your indicators; they also have the benefit of being easy to interpret.
- Decision trees take a top-down, “divide-and-conquer” approach to analyzing data. They look for the indicator, and indicator value, that best splits the data into two distinct groups.
- The algorithm then repeats this process on each subsequent group until it correctly classifies every data point or a stopping criterion is reached.
- Each split, known as a “node”, tries to maximize the purity of the resulting “branches”. The purity is basically the probability that a data point falls in a given class, in our case “up” or “down”, and is measured by the “information gain” of each split.
- In this model, we are going to use real-time data for Bank of America (BAC) stock from 2000 to 2020. For this we will use the quantmod package.
Building a Strategy
- Let’s see how we can quickly build a strategy using 4 technical indicators to see whether today’s price of BoA’s stock is going to close up or down. The 4 technical indicators are:
- RSI -> Calculate a 3-period relative strength index (RSI) off the open price
- EMA -> Calculate a 5-period exponential moving average (EMA)
- MACD -> Calculate a MACD with standard parameters
- SMI -> Stochastic Oscillator with standard parameters
- Then we calculate the variable we are looking to predict and build our data sets.
- Class -> Calculate the difference between the close price and the open price. If PriceChange > 0 the class is “UP”, otherwise “DOWN”
[1] "BAC"
[1] "RSI3" "EMAcross" "MACDsignal" "Stochastic" "Class"
Data slicing is the step of splitting the data into a train set and a test set. The training set is used for model building; the test set must not be mixed in while building the model. Even during standardization, we should not standardize using the test set. 75 percent of the data goes to training and 25 percent to testing.
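A sketch of this split with caret’s createDataPartition() (the variable names follow the sketch above and the seed is illustrative):

library(caret)

set.seed(100)
intrain  <- createDataPartition(y = dataSet$Class, p = 0.75, list = FALSE)
training <- dataSet[intrain, ]
testing  <- dataSet[-intrain, ]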
Training the Decision Tree classifier with information gain as the splitting criterion
We are setting 3 parameters of the trainControl() method. The “method” parameter holds the details about the resampling method; we set it to “repeatedcv”, i.e., repeated cross-validation.
The “number” parameter holds the number of resampling iterations, and the “repeats” parameter contains the number of complete sets of folds to compute for our repeated cross-validation. We use number = 10 and repeats = 3. trainControl() returns a list, which we pass on to our train() method.
To select the splitting strategy for the decision tree, we pass a “parms” parameter to train(). It should contain a list of parameters for the rpart method. For the splitting criterion, we add a “split” element with the value “information” for information gain or “gini” for the Gini index. We use information gain as the criterion, as in the sketch below.
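A minimal sketch of the resampling setup and the rpart fit (assuming the training/testing split above; tuneLength = 10 is an assumption consistent with the ten cp values reported below):

library(caret)

trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

set.seed(100)
dtree_fit <- train(Class ~ ., data = training,
                   method = "rpart",                     # CART decision tree
                   parms = list(split = "information"),  # information-gain splits
                   trControl = trctrl,
                   preProcess = c("center", "scale"),
                   tuneLength = 10)                      # try 10 values of cp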
Trained Decision Tree classifier results:
- We can check the result of our train() method by printing the dtree_fit variable. It shows the accuracy metrics for different values of cp, the complexity parameter of our decision tree.
CART
3908 samples
4 predictor
2 classes: 'DOWN', 'UP'
Pre-processing: centered (4), scaled (4)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 3517, 3518, 3517, 3517, 3517, 3517, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.001388889 0.5141566 0.0301167276
0.001562500 0.5145822 0.0308771188
0.001822917 0.5152644 0.0322216484
0.001909722 0.5154349 0.0325509692
0.002083333 0.5159460 0.0335094058
0.002343750 0.5170542 0.0345975203
0.002951389 0.5177349 0.0358283997
0.003645833 0.5188436 0.0377622413
0.018229167 0.5058795 0.0006184276
0.019531250 0.5053680 -0.0004969420
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.003645833.
Plot Decision Tree
Decision Tree Prediction
Now our model is trained with cp = 0.003645833, which gave a cross-validated accuracy of 0.5188436. Next we predict on the test set.
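A sketch of how the test-set predictions and the confusion matrix below might be produced (names follow the earlier split sketch):

test_pred <- predict(dtree_fit, newdata = testing)
confusionMatrix(test_pred, as.factor(testing$Class))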
Confusion Matrix and Statistics
Reference
Prediction DOWN UP
DOWN 317 279
UP 345 360
Accuracy : 0.5204
95% CI : (0.4928, 0.5478)
No Information Rate : 0.5088
P-Value [Acc > NIR] : 0.210682
Kappa : 0.0422
Mcnemar's Test P-Value : 0.009266
Sensitivity : 0.4789
Specificity : 0.5634
Pos Pred Value : 0.5319
Neg Pred Value : 0.5106
Prevalence : 0.5088
Detection Rate : 0.2437
Detection Prevalence : 0.4581
Balanced Accuracy : 0.5211
'Positive' Class : DOWN
ARIMA Forecasting Expedia Stock Price incorporating COVID-19 (Real Time)
The goal of this project is to predict the future stock price of Expedia using predictive forecasting models, principally ARIMA, and then to analyse the fitted models.
The dataset of Expedia stock is obtained from Yahoo Finance using the quantmod package in R.
The timeline of the data is from 2019 till the present day (11/26/2020).
We shall also try to understand the impact of the COVID-19 disaster on the stock price of Expedia.
Forecasting
- A forecasting algorithm is a process that seeks to predict future values based on the past and present data.
- These historical data points are extracted and prepared, and are then used to predict future values of a selected variable of the dataset.
Data Preparation
Importing the data: We obtain Expedia stock price data from “2019-07-01” to “2020-11-26” for our analysis using the quantmod package. To analyse the impact of COVID-19 on the Expedia stock price, we take two sets of data, as in the sketch below.
Data from “2019-07-01” to “2020-03-28” is the before-COVID data.
Data from “2020-04-01” till date is the after-COVID data.
All the analysis and the models will be made on both datasets to analyse the impact of COVID-19, if any.
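A minimal import sketch (the google_data_* variable names are kept from the document’s own output below; the dates follow the text):

library(quantmod)

getSymbols("EXPE", src = "yahoo", from = "2019-07-01", to = "2020-03-28")
google_data_before_covid <- EXPE   # before-COVID window

getSymbols("EXPE", src = "yahoo", from = "2020-04-01")
google_data_after_covid <- EXPE    # after-COVID window, through today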
[1] "EXPE"
[1] "EXPE"
colnames(google_data_after_covid)
1 EXPE.Open
2 EXPE.High
3 EXPE.Low
4 EXPE.Close
5 EXPE.Volume
6 EXPE.Adjusted
Graphical Representation of Data
ARIMA Model
- Let us first analyse the ACF and PACF Graph of each of the two datasets.
- We then use the auto.arima function to determine a time series model for each dataset; a sketch follows.
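A hedged sketch of the model search (lambda = "auto" is an assumption consistent with the Box-Cox transformations reported below; the modelfit_* names match the residual tests later in this section):

library(forecast)

# Closing prices as plain time series
tsData_before_covid_close <- ts(as.numeric(Cl(google_data_before_covid)))
tsData_after_covid_close  <- ts(as.numeric(Cl(google_data_after_covid)))

# ACF/PACF diagnostics for each series
acf(tsData_before_covid_close); pacf(tsData_before_covid_close)
acf(tsData_after_covid_close);  pacf(tsData_after_covid_close)

# Automatic ARIMA order selection with a Box-Cox transformation
modelfit_before_covid <- auto.arima(tsData_before_covid_close, lambda = "auto")
modelfit_after_covid  <- auto.arima(tsData_after_covid_close,  lambda = "auto")
summary(modelfit_before_covid)
summary(modelfit_after_covid)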
Series: tsData_before_covid_close
ARIMA(3,1,2)
Box Cox transformation: lambda= -0.5522943
Coefficients:
ar1 ar2 ar3 ma1 ma2
0.9436 -0.2552 -0.2793 -0.9659 0.7154
s.e. 0.1333 0.1520 0.0991 0.1174 0.1094
sigma^2 estimated as 1.264e-05: log likelihood=791.34
AIC=-1570.69 AICc=-1570.22 BIC=-1551.3
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.3468986 3.779995 2.064975 -0.4697368 2.119611 1.117624
ACF1
Training set -0.01611792
Series: tsData_after_covid_close
ARIMA(1,1,0) with drift
Box Cox transformation: lambda= 1.049021
Coefficients:
ar1 drift
-0.0993 0.5238
s.e. 0.0785 0.3099
sigma^2 estimated as 19.12: log likelihood=-470.79
AIC=947.57 AICc=947.73 BIC=956.86
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.002349432 3.471361 2.361182 -0.07665327 2.827689 0.9868023
ACF1
Training set 0.002088858
From the auto.arima function, we conclude the following models for the two datasets:
- Before COVID-19: ARIMA(3,1,2)
- After COVID-19: ARIMA(1,1,0)
After obtaining the model, we then perform residual diagnostics for each of the fitted models.
From the residual plots, we can confirm that the residuals have a mean of 0 and constant variance. The ACF is approximately 0 for lags > 0, and the PACF is as well.
So we can say that the residuals behave like white noise and conclude that the models ARIMA(3,1,2) and ARIMA(1,1,0) fit the data well. Alternatively, we can test this at a chosen significance level using the Box-Ljung test.
Diagnostic measures
Try to find patterns in the residuals of the chosen model by plotting the ACF of the residuals and performing a portmanteau test. If the plot does not look like white noise, we need to try modified models.
Once the residuals look like white noise, calculate forecasts.
Box-Ljung test is a test of independence at all lags up to the one specified. Instead of testing randomness at each distinct lag, it tests the “overall” randomness based on a number of lags, and is therefore a portmanteau test. It is applied to the residuals of a fitted ARIMA model, not the original series, and in such applications the hypothesis actually being tested is that the residuals from the ARIMA model have no autocorrelation.
The ACF of the residuals shows no significant autocorrelations.
The p-values for the Ljung-Box Q test are all well above 0.05, indicating “non-significance.” The residuals also appear normal, as their quantiles rest on a line and aren’t all over the place.
Augmented Dickey-Fuller & Kwiatkowski-Phillips-Schmidt-Shin
- We then conduct an ADF (Augmented Dickey-Fuller) test and a KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test to check the stationarity of each dataset’s closing-price series, as in the sketch below.
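A minimal sketch of these tests with the tseries package (the ADF null hypothesis is a unit root, i.e., non-stationarity; the KPSS null is level stationarity):

library(tseries)

adf.test(tsData_before_covid_close)
adf.test(tsData_after_covid_close)

kpss.test(tsData_before_covid_close)
kpss.test(tsData_after_covid_close)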
Augmented Dickey-Fuller Test
data: tsData_before_covid_close
Dickey-Fuller = -2.1153, Lag order = 5, p-value = 0.5279
alternative hypothesis: stationary
Augmented Dickey-Fuller Test
data: tsData_after_covid_close
Dickey-Fuller = -2.4665, Lag order = 5, p-value = 0.3817
alternative hypothesis: stationary
From the above ADF tests, we can conclude the following:
For the dataset before COVID-19, the ADF test gives a p-value of 0.5279, which is greater than 0.05, implying that the time series is not stationary.
For the dataset after COVID-19, the ADF test gives a p-value of 0.3817, which is greater than 0.05, implying that the time series is not stationary.
KPSS Test for Level Stationarity
data: tsData_before_covid_close
KPSS Level = 2.3907, Truncation lag parameter = 4, p-value = 0.01
KPSS Test for Level Stationarity
data: tsData_after_covid_close
KPSS Level = 2.6532, Truncation lag parameter = 4, p-value = 0.01
From the above KPSS tests, we can conclude the following:
For the dataset before COVID-19, the KPSS test gives a p-value of 0.01, which is less than 0.05, implying that the time series is not stationary.
For the dataset after COVID-19, the KPSS test gives a p-value of 0.01, which is less than 0.05, implying that the time series is not stationary.
Thus, we can conclude from the above tests that the time series data is not stationary.
Forecasting with ARIMA Models
- Forecast errors for the before-COVID dataset
- Forecast errors for the after-COVID dataset
Holt Winters
To make forecasts using exponential smoothing in R, we can fit a predictive model using the “HoltWinters()” function. For simple exponential smoothing we would set the parameters beta = FALSE and gamma = FALSE in the HoltWinters() call (the beta and gamma parameters control the trend and seasonal components, respectively). Here we keep the trend component and set only gamma = FALSE, i.e., Holt’s exponential smoothing, as the model output below shows.
The HoltWinters() function returns a list variable that contains several named elements. The output of HoltWinters() tells us that the estimated value of the alpha parameter is about 0.982813.
We can plot the original time series.
Holt-Winters exponential smoothing with trend and without seasonal component.
Call:
HoltWinters(x = tsData_before_covid_close, gamma = FALSE)
Smoothing parameters:
alpha: 0.982813
beta : 0.02381423
gamma: FALSE
Coefficients:
[,1]
a 60.091954
b -0.869578
- As a measure of the accuracy of the forecasts, we can calculate the sum of squared errors for the in-sample forecast errors, that is, the forecast errors for the time period covered by our original time series. The sum of squared errors is stored in the named element “SSE” of the fitted model’s list, so we can get its value by typing:
[1] "skirtsseriesforecasts$SSE"
[1] 2798.291
Forecasting with HoltWinters
- We can make forecasts for further time points by using the “forecast.HoltWinters()” function in the R “forecast” package.
- When using the forecast.HoltWinters() function, as its first argument (input), you pass it the predictive model that you have already fitted using the HoltWinters() function.
- You specify how many further time points you want forecasts for using the “h” parameter of forecast(). Here we are going to predict the price for the next 45 days, as sketched below.
- The forecast.HoltWinters() function gives you the point forecast for each horizon, an 80% prediction interval, and a 95% prediction interval. For example, the forecasted stock price for day 233 is about 20.96095, with a 95% prediction interval of (-56.8291332, 98.75102).
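A hedged sketch of the fit-and-forecast step (the holt_fit/holt_forecast names are illustrative; the HoltWinters() call matches the model summary shown earlier):

library(forecast)

holt_fit      <- HoltWinters(tsData_before_covid_close, gamma = FALSE)
holt_forecast <- forecast(holt_fit, h = 45)  # 45 days ahead

holt_forecast        # point forecasts with 80% and 95% intervals
plot(holt_forecast)  # plot the forecasts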
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
189 59.22238 54.261714 64.18304 51.6356981 66.80905
190 58.35280 51.315518 65.39008 47.5902068 69.11539
191 57.48322 48.787761 66.17868 44.1846635 70.78178
192 56.61364 46.469767 66.75752 41.0999238 72.12736
193 55.74406 44.280562 67.20757 38.2121512 73.27598
194 54.87449 42.178858 67.57011 35.4581988 74.29077
195 54.00491 40.140333 67.86948 32.8008698 75.20895
196 53.13533 38.149294 68.12137 30.2161653 76.05449
197 52.26575 36.194958 68.33655 27.6875932 76.84391
198 51.39617 34.269562 68.52279 25.2032811 77.58907
199 50.52660 32.367315 68.68588 22.7543724 78.29882
200 49.65702 30.483773 68.83026 20.3340704 78.97997
201 48.78744 28.615447 68.95943 17.9370389 79.63784
202 47.91786 26.759545 69.07618 15.5590078 80.27672
203 47.04828 24.913796 69.18277 13.1965057 80.90006
204 46.17871 23.076330 69.28108 10.8466714 81.51074
205 45.30913 21.245588 69.37267 8.5071196 82.11114
206 44.43955 19.420256 69.45884 6.1758412 82.70326
207 43.56997 17.599217 69.54073 3.8511288 83.28882
208 42.70039 15.781515 69.61927 1.5315205 83.86927
209 41.83082 13.966326 69.69531 -0.7842448 84.44588
210 40.96124 12.152934 69.76954 -3.0972610 85.01974
211 40.09166 10.340716 69.84261 -5.4084829 85.59180
212 39.22208 8.529124 69.91504 -7.7187474 86.16291
213 38.35250 6.717675 69.98733 -10.0287917 86.73380
214 37.48293 4.905945 70.05991 -12.3392679 87.30512
215 36.61335 3.093553 70.13314 -14.6507544 87.87745
216 35.74377 1.280164 70.20738 -16.9637667 88.45131
217 34.87419 -0.534523 70.28291 -19.2787648 89.02715
218 34.00461 -2.350779 70.36001 -21.5961611 89.60539
219 33.13504 -4.168844 70.43892 -23.9163256 90.18640
220 32.26546 -5.988937 70.51985 -26.2395913 90.77051
221 31.39588 -7.811255 70.60302 -28.5662585 91.35802
222 30.52630 -9.635974 70.68858 -30.8965985 91.94920
223 29.65672 -11.463254 70.77670 -33.2308566 92.54431
224 28.78715 -13.293243 70.86754 -35.5692552 93.14355
225 27.91757 -15.126070 70.96121 -37.9119963 93.74713
226 27.04799 -16.961857 71.05784 -40.2592630 94.35524
227 26.17841 -18.800712 71.15754 -42.6112221 94.96805
228 25.30884 -20.642734 71.26040 -44.9680254 95.58570
229 24.43926 -22.488014 71.36653 -47.3298109 96.20832
230 23.56968 -24.336634 71.47599 -49.6967045 96.83606
231 22.70010 -26.188669 71.58887 -52.0688212 97.46902
232 21.83052 -28.044188 71.70523 -54.4462656 98.10731
233 20.96095 -29.903253 71.82514 -56.8291332 98.75102
- To plot the predictions made by forecast.HoltWinters(), we can use the “plot.forecast()” function:
- Prediction for the next ten days.
Ljung-Box Tests
Box Test for modelfit_after_covid$residuals.
To check for correlations between successive forecast errors, we can make a correlogram and use the Ljung-Box test, as sketched below.
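A minimal sketch of the tests (lag = 1 is an assumption consistent with df = 1 in the output below):

Box.test(modelfit_after_covid$residuals,  lag = 1, type = "Ljung-Box")
Box.test(modelfit_before_covid$residuals, lag = 1, type = "Ljung-Box")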
Box-Ljung test
data: modelfit_after_covid$residuals
X-squared = 1.6058e-05, df = 1, p-value = 0.9968
- Box Test for modelfit_before_covid$residuals
Box-Ljung test
data: modelfit_before_covid$residuals
X-squared = 0.0047812, df = 1, p-value = 0.9449
Conclusion
Here, the p-value for both models is greater than 0.05. Hence, at a significance level of 0.05 we fail to reject the null hypothesis and conclude that the residuals behave like white noise. This means the models fit the data well.
Using a Neural Network to Model the Amazon Stock Price Change (Real Time Analysis)
Artificial neural networks are very powerful and popular machine-learning algorithms that mimic how a brain works in order to find patterns in your data.
In this project, we will build a basic neural network to model the price change of Amazon stocks in real Time.
ANNs make predictions by sending the inputs (in our case, the indicators) through the network of neurons, with the neurons firing off depending on the weights of the incoming signals. The final output is determined by the strength of the signals coming from the previous layer of neurons.
Data Import
Use “quantmod” package to download information for Amazon stocks.
Let’s see how we can quickly build a strategy using 4 technical indicators to model today’s price of Amazon’s stock. The 4 technical indicators are:
- RSI -> Calculate a 3-period relative strength index (RSI) off the open price
- EMA -> Calculate a 5-period exponential moving average (EMA)
- MACD -> Calculate a MACD with standard parameters
- BBands -> Bollinger Bands %B with standard parameters (the BollingerB column in the dataset below)
[1] "AMZN"
Normalize Data
One of the most important procedures when forming a neural network is data normalization. This involves adjusting the data to a common scale so as to accurately compare predicted and actual values. Failure to normalize the data will typically result in the prediction value remaining the same across all observations, regardless of the input values.
Max-Min Normalization: For this method, we invoke the following function to normalize our data:
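A minimal sketch of such a max-min scaler (the DataSetAmazon data frame name is an assumption):

# Scale each value into [0, 1]
normalize <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Apply column-wise to the indicator data frame
DataSetAmazon <- as.data.frame(lapply(DataSetAmazon, normalize))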
Data slicing:
- As before, data slicing splits the data into train and test sets: 75 percent for training and 25 percent for testing, with the test set kept out of model building and standardization.
'data.frame': 5209 obs. of 5 variables:
$ RSI3 : num 0.0853 0.0409 0.0402 0.6396 0.4188 ...
$ EMAcross : num 0.538 0.534 0.539 0.553 0.545 ...
$ MACDsignal: num 0.727 0.705 0.681 0.661 0.643 ...
$ BollingerB: num 0.423 0.353 0.362 0.477 0.408 ...
$ Price : num 0.417 0.424 0.444 0.422 0.433 ...
- TrainingParameters:
- The train() method is passed a repeated cross-validation resampling method with 10 resampling iterations, repeated 3 times, as sketched below.
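A sketch of the control object assumed by the train() call later in this section:

TrainingParameters <- trainControl(method = "repeatedcv", number = 10, repeats = 3)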
Machine Learning: Classification using Neural Networks
- Model Training
- We can use neuralnet() to train a NN model. Also, the train() function from caret can help us tune parameters. We can plot the result to see which set of parameters fits our data best.
- The nnet package by default uses the logistic activation function.
- Data Pre-Processing With Caret: The scale transform calculates the standard deviation for an attribute and divides each value by that standard deviation.
- The center transform calculates the mean for an attribute and subtracts it from each value.
- Combining the scale and center transforms will standardize your data.
- Attributes will have a mean value of 0 and a standard deviation of 1.
- Training transforms can be prepared and applied automatically during model evaluation.
- Transforms applied during training are prepared using the preProcess() and passed to the train() function via the preProcess argument.
- Backpropagation algorithm is a supervised learning method for multilayer feed-forward networks from the field of Artificial Neural Networks.
- The principle of the backpropagation approach is to model a given function by modifying internal weightings of input signals to produce an expected output signal. The system is trained using a supervised learning method, where the error between the system’s output and a known expected output is presented to the system and used to modify its internal state.
- We use Backpropagation as algorithm in neural network package.
# Tuning grid: hidden-layer sizes 1-5 and weight decay 0.1 or 0.2
nnetGrid <- expand.grid(size  = seq(from = 1, to = 5, by = 1),
                        decay = seq(from = 0.1, to = 0.2, by = 0.1))
str(subTrain)
'data.frame': 12507 obs. of 5 variables:
$ RSI3 : num 0.0853 0.0409 0.0402 0.4188 0.3651 ...
$ EMAcross : num 0.538 0.534 0.539 0.545 0.544 ...
$ MACDsignal: num 0.727 0.705 0.681 0.643 0.625 ...
$ BollingerB: num 0.423 0.353 0.362 0.408 0.386 ...
$ Price : num 0.417 0.424 0.444 0.433 0.426 ...
# Regression of Price on the four indicators, tuned over nnetGrid
nn_model <- train(Price ~ RSI3 + EMAcross + MACDsignal + BollingerB, subTrain,
                  method = "nnet",                    # single-hidden-layer network
                  algorithm = "backprop",             # backpropagation
                  trControl = TrainingParameters,
                  preProcess = c("scale", "center"),  # standardize the predictors
                  na.action = na.omit,
                  tuneGrid = nnetGrid,
                  trace = FALSE,
                  verbose = FALSE)
- Based on the caret neural network model, train() selects the hidden layer: caret picks the best neural network based on size and decay.
- We can inspect the resampled error for the different hidden-layer sizes and decay values below:
   size decay       RMSE   Rsquared        MAE      RMSESD RsquaredSD        MAESD
1     1   0.1 0.03933584 0.01238798 0.01586244 0.002504076 0.01443134 0.0007305646
2     1   0.2 0.03934006 0.01239107 0.01583777 0.002507265 0.01451383 0.0007313067
3     2   0.1 0.03933346 0.01268185 0.01586449 0.002503906 0.01468389 0.0007312625
4     2   0.2 0.03933083 0.01281302 0.01584319 0.002506894 0.01488929 0.0007335536
5     3   0.1 0.03933819 0.01287279 0.01586899 0.002504096 0.01467428 0.0007326783
6     3   0.2 0.03933054 0.01293707 0.01584766 0.002504418 0.01501688 0.0007321606
7     4   0.1 0.03932220 0.01314280 0.01584623 0.002496180 0.01459575 0.0007377006
8     4   0.2 0.03932984 0.01300564 0.01584778 0.002504590 0.01508909 0.0007321347
9     5   0.1 0.03932381 0.01361986 0.01584481 0.002522508 0.01473453 0.0007322239
10    5   0.2 0.03932938 0.01304931 0.01584684 0.002504771 0.01512491 0.0007327730
- Prediction
- Now, our model is trained with accuracy = 0.8889. We are ready to predict prices for our test set.
actual prediction
7 0.4362379 0.4274795
10 0.4256909 0.4273008
97 0.4278731 0.4284938
139 0.4238725 0.4280213
483 0.4315391 0.4283605
783 0.4282804 0.4271653
896 0.4283095 0.4268096
908 0.4300552 0.4267116
911 0.4315391 0.4280867
952 0.4203374 0.4286907
1012 0.4257782 0.4279932
1711 0.4255164 0.4265486
1746 0.4299679 0.4280298
1804 0.4386092 0.4280147
2058 0.4317427 0.4280102
2073 0.4374163 0.4289039
2176 0.4056444 0.4292499
2276 0.4372417 0.4271038
2695 0.4251672 0.4280948
2746 0.4161477 0.4282272
2772 0.4228978 0.4272442
2784 0.4386092 0.4272447
3057 0.4236834 0.4273921
3139 0.4282513 0.4273274
3178 0.4408495 0.4281902
3220 0.4289496 0.4283887
3335 0.4316845 0.4266736
3443 0.4418097 0.4276801
3503 0.4245853 0.4275239
4018 0.4396857 0.4311302
4237 0.4353504 0.4281576
4261 0.4384055 0.4282462
4421 0.4145185 0.4284854
4526 0.5582194 0.4272961
4664 0.4444863 0.4258982
4786 0.4411985 0.4281155
4843 0.3776257 0.4279618
4865 0.4139076 0.4277111
4937 0.4610998 0.4286968
5030 0.5170493 0.4270784
Accuracy of the Neural Network model:
Roughly 99% accuracy is achieved in predicting the price using the stock technical indicators.
RSI3 EMAcross MACDsignal BollingerB Price
7 0.32459911 0.5442847 0.6075494 0.3572573 0.4362379
10 0.20618845 0.5412815 0.5634201 0.3204323 0.4256909
97 0.24607269 0.5460137 0.2333221 0.3782227 0.4278731
139 0.84671114 0.5597144 0.7114260 0.7928638 0.4238725
483 0.80588690 0.5503068 0.6449523 0.7104961 0.4315391
783 0.54896975 0.5481855 0.7942074 0.5794676 0.4282804
896 0.36179313 0.5467044 0.7485756 0.4977961 0.4283095
908 0.36059256 0.5461656 0.7215884 0.5917109 0.4300552
911 0.87072321 0.5542228 0.7250983 0.8051809 0.4315391
952 0.72799440 0.5495230 0.5471357 0.5809809 0.4203374
1012 0.25508071 0.5462964 0.4156671 0.3042368 0.4257782
1711 0.17516899 0.5466259 0.7149182 0.3455139 0.4255164
1746 0.42878584 0.5473307 0.5385448 0.3833825 0.4299679
1804 0.95547536 0.5649359 0.7302750 0.9376553 0.4386092
2058 0.75475738 0.5528730 0.6755122 0.7233058 0.4317427
2073 0.85834756 0.5557843 0.6024193 0.5896256 0.4374163
2176 0.61647522 0.5560750 0.3465414 0.4212848 0.4056444
2276 0.38169960 0.5460403 0.7248211 0.4346285 0.4372417
2695 0.79630557 0.5568765 0.7103222 0.6807527 0.4251672
2746 0.86740868 0.5570476 0.6578077 0.8334958 0.4161477
2772 0.33325260 0.5425081 0.6137948 0.4871775 0.4228978
2784 0.11853439 0.5357345 0.5229431 0.2911652 0.4386092
3057 0.37948310 0.5450882 0.6474615 0.4269868 0.4236834
3139 0.47640198 0.5473919 0.6464690 0.6264217 0.4282513
3178 0.64602849 0.5532853 0.6476852 0.4856036 0.4408495
3220 0.78578515 0.5543646 0.6158716 0.6913156 0.4289496
3335 0.14403635 0.5345840 0.5990995 0.4992964 0.4316845
3443 0.95580068 0.6123559 0.6557729 0.9230118 0.4418097
3503 0.48426281 0.5469840 0.6377598 0.5614055 0.4245853
4018 0.04419073 0.4504553 0.4264717 0.1988550 0.4396857
4237 0.59029580 0.5558939 0.5694519 0.5221716 0.4353504
4261 0.87353837 0.5836426 0.6483177 0.7248558 0.4384055
4421 0.88473691 0.5823825 0.5743706 0.7673681 0.4145185
4526 0.59590136 0.5793888 0.7331421 0.5560403 0.5582194
4664 0.98636455 0.6589197 0.6674518 0.8129811 0.4444863
4786 0.62740722 0.5626872 0.5948031 0.5374183 0.4411985
4843 0.54126423 0.5618588 0.6234117 0.4116830 0.3776257
4865 0.63945676 0.5788562 0.5987941 0.6760626 0.4139076
4937 0.20789866 0.4876023 0.5509471 0.2674240 0.4610998
5030 0.44684424 0.5332037 0.7226938 0.6420079 0.5170493
[1] 0.0034655906 -0.0006396635 -0.0002464325 -0.0016497116 0.0012600367
[6] 0.0004426085 0.0005953472 0.0013262634 0.0013685815 -0.0033262027
[11] -0.0008800795 -0.0004101826 0.0007687922 0.0041881569 0.0014795282
[16] 0.0033666456 -0.0094548054 0.0040098370 -0.0011635185 -0.0048179652
[21] -0.0017289311 0.0044925247 -0.0014748155 0.0003667483 0.0049999668
[26] 0.0002225924 0.0019863283 0.0055785501 -0.0011681378 0.0033806843
[31] 0.0028470762 0.0040164223 -0.0055743799 0.0494191079 0.0073310991
[36] 0.0051666092 -0.0203900965 -0.0055105203 0.0126964783 0.0344970986
[1] 0.9974791
- Plotting nnet variable importance
nnet variable importance
Overall
EMAcross 100.000
MACDsignal 11.556
BollingerB 6.033
RSI3 0.000
Graphical Representation of our Neural Network
Conclusion
From the above implementation, the results are impressive (99% accuracy) and convincing in terms of using a machine learning algorithm to decide on the price of the stock. The majority of the attributes in the dataset contribute significantly to building a predictive model.
Using SVM to predict price change for WALMART (Real Time Analysis)
Machine Learning: Classification using SVM
- SVM is another classification method, used here to predict whether a trading day falls into the ‘UP’ or ‘DOWN’ class.
- The linear, polynomial, and RBF (Gaussian) kernels in SVM differ simply in how they form the hyperplane decision boundary between the classes.
- The kernel functions map the original dataset (linear or nonlinear) into a higher-dimensional space, with a view to making the data linearly separable.
- Usually the linear and polynomial kernels are less time consuming but provide lower accuracy than the RBF or Gaussian kernels.
- k-fold cross-validation divides the training set into k distinct subsets; each subset is then held out for validation in turn while the other k-1 are used for training. This gives a more reliable assessment of the classifier. Overall, if you are unsure which kernel method would be best, a good practice is to use something like 10-fold cross-validation on the training set and then pick the best algorithm.
colnames(DataSetWalmart)
1 Class
2 RSI
3 EMAcross
4 MACD
5 SMI
6 WPR
7 ADX
8 CCI
9 CMO
10 ROC
SVM Classifier using Linear Kernel
The caret package provides the train() method for training our data with various algorithms; we just pass different parameter values for different algorithms. Before train(), we first use the trainControl() method.
We are setting 3 parameters of the trainControl() method. The “method” parameter holds the details about the resampling method. We can set “method” to many values like “boot”, “boot632”, “cv”, “repeatedcv”, “LOOCV”, “LGOCV”, etc. For this project, let’s use repeatedcv, i.e., repeated cross-validation.
The “number” parameter holds the number of resampling iterations. The “repeats” parameter contains the number of complete sets of folds to compute for our repeated cross-validation. Here we use number = 10 with a single repeat, matching the output below. This trainControl() method returns a list, which we pass on to our train() method.
Before training our SVM classifier, we call set.seed() for reproducibility.
For training the SVM classifier, the train() method is passed “svmLinear” as its “method” parameter. We pass our target variable Class: the formula “Class ~ .” uses all the other attributes as predictors with Class as the target. The “trControl” parameter takes the result of our trainControl() method, and the “preProcess” parameter controls preprocessing of the training data, as in the sketch below.
As discussed earlier, preprocessing is a mandatory task for our data. We pass 2 values in our “preProcess” parameter: “center” and “scale”. Together these standardize the training data so that each attribute has a mean of approximately 0 and a standard deviation of 1. The “tuneGrid” (or “tuneLength”) parameter controls how the algorithm’s cost parameter is tuned.
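A hedged sketch of the linear-kernel fit (trainingWalmart is an illustrative name; the C grid matches the two values in the output below):

trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)

set.seed(100)
svm_Linear <- train(Class ~ ., data = trainingWalmart,
                    method = "svmLinear",
                    trControl = trctrl,
                    preProcess = c("center", "scale"),
                    tuneGrid = expand.grid(C = c(0.25, 0.5)))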
Support Vector Machines with Linear Kernel
18651 samples
7 predictor
2 classes: 'DOWN', 'UP'
Pre-processing: centered (7), scaled (7)
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 16786, 16786, 16786, 16786, 16786, 16786, ...
Resampling results across tuning parameters:
C Accuracy Kappa
0.25 0.8209209 0.6412469
0.50 0.8211354 0.6416789
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was C = 0.5.
- The above model shows that our classifier gives the best accuracy at C = 0.5. Let’s try to make predictions using this model for our test set and check its accuracy.
predictionsvm DOWN UP
DOWN 28 8
UP 5 28
- Accuracy on the test set is about 81% using C = 0.5.
[1] 0.8115942
Confusion Matrix and Statistics
Reference
Prediction DOWN UP
DOWN 28 8
UP 5 28
Accuracy : 0.8116
95% CI : (0.6994, 0.8957)
No Information Rate : 0.5217
P-Value [Acc > NIR] : 5.253e-07
Kappa : 0.6239
Mcnemar's Test P-Value : 0.5791
Sensitivity : 0.8485
Specificity : 0.7778
Pos Pred Value : 0.7778
Neg Pred Value : 0.8485
Prevalence : 0.4783
Detection Rate : 0.4058
Detection Prevalence : 0.5217
Balanced Accuracy : 0.8131
'Positive' Class : DOWN
SVM Classifier using Non-Linear Kernel
- Now we will try to build a model using a non-linear kernel, the Radial Basis Function. To use the RBF kernel, we just change the train() method’s “method” parameter to “svmRadial”. The radial kernel needs proper values for the cost parameter “C” and the “sigma” parameter, as in the sketch below.
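A sketch of the radial-kernel fit (holding sigma fixed at 0.5 to match the output below; names follow the earlier sketch):

set.seed(100)
svm_Radial <- train(Class ~ ., data = trainingWalmart,
                    method = "svmRadial",
                    trControl = trctrl,
                    preProcess = c("center", "scale"),
                    tuneGrid = expand.grid(sigma = 0.5, C = c(0.25, 0.5)))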
Support Vector Machines with Radial Basis Function Kernel
18651 samples
7 predictor
2 classes: 'DOWN', 'UP'
Pre-processing: centered (7), scaled (7)
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 16786, 16786, 16786, 16786, 16786, 16786, ...
Resampling results across tuning parameters:
C Accuracy Kappa
0.25 0.8565757 0.7127826
0.50 0.8618302 0.7233515
Tuning parameter 'sigma' was held constant at a value of 0.5
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.5 and C = 0.5.
- The SVM-RBF kernel evaluates the parameter combinations and presents the best values of sigma and C. Based on the output, the best values are sigma = 0.5 and C = 0.5. Let’s check our trained model’s accuracy on the test set.
[1] 0.7971014
- Final prediction accuracy on the test set is 0.7971014.
Comparison between SVM models
- Comparison between the SVM Linear and Radial models, as sketched below.
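A sketch of the comparison with caret’s resamples(), whose summary matches the output below:

algo_results <- resamples(list(SVM_RADIAL = svm_Radial, SVM_LINEAR = svm_Linear))
summary(algo_results)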
Call:
summary.resamples(object = algo_results)
Models: SVM_RADIAL, SVM_LINEAR
Number of resamples: 10
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
SVM_RADIAL 0.8509383 0.8571046 0.8630027 0.8618302 0.8657375 0.8723861 0
SVM_LINEAR 0.8144772 0.8171582 0.8209115 0.8211354 0.8248692 0.8284182 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
SVM_RADIAL 0.7011356 0.7139651 0.7255579 0.7233515 0.7312067 0.7446252 0
SVM_LINEAR 0.6280912 0.6334363 0.6410743 0.6416789 0.6492305 0.6567207 0
Conclusion
From the above implementation, the results are impressive and convincing in terms of using a machine learning algorithm to decide on the price change of Walmart. The majority of the attributes in the dataset contribute significantly to building a predictive model. Both SVM approaches achieve a good accuracy rate (>80%) and are easy to implement.
Amazon Price Trend Predict - Random Forest (Real Time Analysis)
Stock market prediction is an incredibly difficult task, due to the randomness and noisiness found in the market. Yet, predicting market behaviors is a very important task. Correctly predicting stock price directions can be used to maximize profits, as well as to minimize risk. There are two types of methods to predicting market behavior. One is predicting the future price of an asset. This is usually done using time series analysis to fit a specific model, like ARIMA or GARCH, to some historical data. The other is predicting the future trend of an asset. That is, whether one thinks it will go up or down in price, treating it as a classification problem.
The goal of this project is to create an intelligent model, using the Random Forest model, that can correctly forecast the behavior of a stock’s price n days out.
Data Import
Use “quantmod” package to download information for Amazon stocks.
The data used for this project consists of regular stock data (open, close, volume, etc.) from Yahoo finance, and ranges from the year 2000 to 2018.
From this data, technical indicators were calculated for every stock. Below are all the technical indicators used for this model:
- Relative Strength Index
- Stochastic Oscillator
- William %R
- Moving Average Convergence Divergence
- Price Rate of Change
- On Balance Volume
The last step of pre-processing was calculating the response variable. Since we are treating this as a classification problem, the response variable is binary. The equation for calculating it is: Response = Close(t+n) − Close(t).
That is, the adjusted close price at time t+n (where n is the number of days out you want to predict) minus the current adjusted close price maps to a value saying the stock price went up from its value at time t, or that it went down. A sketch of this calculation follows.
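A minimal sketch of the response calculation (assuming the GSPC object loaded with quantmod in the next step; n = 1 is an illustrative horizon):

n     <- 1                                    # days ahead to predict
close <- as.numeric(Ad(GSPC))                 # adjusted close prices
# Close_{t+n} - Close_t, padded with NA at the end of the series
change   <- c(close[-(1:n)], rep(NA, n)) - close
response <- ifelse(change > 0, 1, 0)          # 1 = up, 0 = down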
Methodology
[1] "^GSPC"
- Dataset used for this project:
colnames(dataset1)
1 rsi
2 EMA
3 signal
4 pctB
5 GSPC.Close
Check for missing data in the indicator columns (rsi, EMA, signal, pctB) and in GSPC.Close, and omit the NA values from the dataset.
Print the number of missing values for each attribute in the dataset.
[1] "rsi"
[1] 20
[1] "EMA"
[1] 19
[1] "signal"
[1] 33
[1] "pctB"
[1] 19
[1] "GSPC.Close"
[1] 0
Random Forest Machine Learning Model
Split the dataset into training and test sets: 80% of the data is training data and 20% is the test set.
Feature scaling -> normalize/scale the features, dropping the predicted variable from the scaling step, as in the code below.
# Feature Scaling (Normalization and dropping the predicted variable)
training_set[-5] = scale(training_set[-5])
test_set[-5] = scale(test_set[-5])
- Applying the Random Forest model on the training set, as sketched below.
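A hedged sketch of the fit (the summary output below, with votes of length 3768 = 1884 × 2 and err.rate of length 30 = ntree × 3, is consistent with ntree = 10, though that value is an inference):

library(randomForest)

set.seed(100)
classifier <- randomForest(x = training_set[-5],              # the four scaled indicators
                           y = as.factor(training_set[[5]]),  # binary up/down response
                           ntree = 10)
summary(classifier)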
Length Class Mode
call 4 -none- call
type 1 -none- character
predicted 1884 factor numeric
err.rate 30 -none- numeric
confusion 6 -none- numeric
votes 3768 matrix numeric
oob.times 1884 -none- numeric
classes 2 -none- character
importance 4 -none- numeric
importanceSD 0 -none- NULL
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 14 -none- list
y 1884 factor numeric
test 0 -none- NULL
inbag 0 -none- NULL
Prediction & Accuracy.
- After building the model, we can check the forecasting accuracy using a confusion matrix.
- The final accuracy of the model is about 51 percent.
predict_val
0 1
0 89 128
1 102 152
[1] "Model Accuracy is"
[1] 0.5116773
Ensemble Method for Random Forest
Random Forest
1884 samples
4 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 1257, 1256, 1255
Resampling results across tuning parameters:
mtry Accuracy Kappa
3 0.5058141 -0.003131317
6 0.5021045 -0.009128875
9 0.5132502 0.013836823
Kappa was used to select the optimal model using the largest value.
The final value used for the model was mtry = 9.
Overall insights obtained from the implemented project
- As we can see, the model has its highest accuracy around 51 percent. While this may not seem like much, it is often extremely hard to predict the price of stocks, and even a small improvement over random guessing can make a difference given the amount of money at stake. After all, if it were that easy to predict prices, wouldn’t we all be trading stocks for easy money instead of learning these algorithms?
Conclusion
This is a beginning of the use of ML algorithms for predicting time series data such as stock prices. The approach can be modified and optimized in many ways to produce better, more efficient, and more accurate results.
Choosing the right technical indicators that influence the price change is daunting. In this project we tried to predict the price change with a variety of technical indicators fed into different ML algorithms.
Although we achieved an accuracy of 99 percent with a few ML algorithms, there are many other things that impact stock prices, such as political and social upheavals, current affairs, etc.
Thus, we can say that stock market price change is a highly dynamic process.