Introduction

Data science has become a powerful tool in almost every industry, because we generate a huge amount of data every day. Finance and investment is a popular field for data scientists, who use historical data to try to forecast future stock prices. In this project, I will guide you through predicting the next-day closing price of the most active stock from its historical data. I hope it gives you some inspiration for your own stock predictions. This project is a reimplementation based on this wonderful tutorial: Predicting Stock Prices with Python

Before we start, let me give you the basic idea. First, we will use Selenium for web scraping to get the name of the most active stock on Yahoo Finance. Selenium is a popular tool not only for web scraping but also for automated web application testing. If you haven't used Selenium before, I recommend watching this fantastic YouTube Selenium tutorial to set it up, and using this Selenium Python tutorial as a more detailed reference. Then, we will get the historical data for that stock. Next, we will make predictions using a simple machine learning model. Finally, we will send our predictions to our clients by email.

The image is from this website: goldennest.sg

Prerequisites

First, we need to import the necessary packages. Make sure you have installed all the packages below:

import numpy as np
from datetime import datetime
import smtplib
from selenium import webdriver
import os
import pandas as pd
# For prediction
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing  # sklearn.cross_validation is no longer available
from sklearn.model_selection import train_test_split  # use this instead
# For stock data
from iexfinance.stocks import get_historical_data
from iexfinance.refdata import get_symbols
import matplotlib.pyplot as plt
import copy

Get Stock Name

First, we create our Chrome driver and use driver.get(url) to navigate to the desired webpage, https://finance.yahoo.com/most-active, which displays the top 25 most active stocks. If you are interested in other stocks, you can change this link to the URL you want. Inside webdriver.Chrome() you need to pass the path to your chromedriver.

driver = webdriver.Chrome(
    'Type the directory of your chromedriver here')
url = "https://finance.yahoo.com/most-active"
driver.get(url)

Next, we want to find the XPath of the most active stock's name. Follow the steps below to get it:

First, go to your desired webpage and inspect the element of that webpage.

Click the "Select an Element" button:

Click on the first ticker:

Next, copy the XPath following the instructions below:

After you find the XPath, you can get the element using the code below:

from selenium.webdriver.common.by import By  # find_element_by_xpath was removed in Selenium 4.3
ticker = driver.find_element(
    By.XPATH, '//*[@id="scr-res-table"]/div[1]/table/tbody/tr[1]/td[1]/a')
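To see what this XPath actually selects, here is a small offline sketch that runs the same path against a tiny hand-written table using Python's built-in xml.etree.ElementTree. The markup is a made-up, heavily simplified stand-in for the Yahoo Finance page (the second ticker and both company names are placeholders), so treat it as an illustration of the path logic, not of the real page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the "most active" table markup.
PAGE = """
<html><body>
<div id="scr-res-table">
  <div>
    <table>
      <tbody>
        <tr><td><a>SNDL</a></td><td>Company A</td></tr>
        <tr><td><a>XYZ</a></td><td>Company B</td></tr>
      </tbody>
    </table>
  </div>
</div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Same path as in the Selenium call; ElementTree needs a leading "." for
# a search relative to the root element.
xpath = ".//*[@id='scr-res-table']/div[1]/table/tbody/tr[1]/td[1]/a"

# tr[1]/td[1] picks the first cell of the first row, i.e. the top ticker.
first_ticker = root.find(xpath).text
print(first_ticker)
```

Changing tr[1] to tr[2] in the path would select the second row's ticker instead.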

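Last, we get the stock name from ticker.text. We have to read it before calling driver.quit(), because the element can no longer be queried once the browser session is closed. (ticker.text is already a plain Python string, so the deep copy below is a precaution rather than a requirement.)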

stock = copy.deepcopy(ticker.text)
driver.quit()

We have now scraped the name of the most active stock using Selenium. Below is the name we got from Yahoo Finance:

stock
'SNDL'

Get Historical Data

Now we can get the historical data for the most active stock. In get_historical_data(), you need to set the start and end dates and the output format (we will use pandas in this project).

For the token, you need to create an account on IEX Cloud to get your API token. You can choose the free plan, but it offers only very limited access.

Last, we save the historical data to a CSV file.

start = datetime(2020, 3, 1)
end = datetime(2021, 3, 1)

# Download historical stock data
df = get_historical_data(stock, start, end, output_format='pandas',
                         token="pk_422a359c341b427ea05864740c233fe3")
csv_name = stock + '.csv'
df.to_csv(csv_name)
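If you would rather not paste your token directly into the script, one option is to read it from an environment variable. The sketch below is only a suggestion: the helper name load_iex_token and the choice of IEX_TOKEN as the variable name are my own assumptions, not something this project requires.

```python
import os


def load_iex_token(env_var="IEX_TOKEN"):
    """Return the API token stored in an environment variable.

    Raises a clear error instead of silently passing an empty token.
    (Hypothetical helper; the variable name is an assumption.)
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your IEX Cloud token")
    return token


# Usage, assuming the variable was exported in your shell beforehand:
# df = get_historical_data(stock, start, end, output_format='pandas',
#                          token=load_iex_token())
```

This keeps the token out of the file you share or commit, at the cost of one extra setup step.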

Your dataframe should look like this:

Unnamed: 0 close high low open symbol volume id key subkey ... uLow uVolume fOpen fClose fHigh fLow fVolume label change changePercent
0 2020-03-02 1.48 1.540 1.40 1.44 SNDL 1056600 HISTORICAL_PRICES SNDL NaN ... 1.40 1056600 1.44 1.48 1.540 1.40 1056600 Mar 2, 20 0.06 0.0423
1 2020-03-03 1.48 1.500 1.35 1.45 SNDL 945902 HISTORICAL_PRICES SNDL NaN ... 1.35 945902 1.45 1.48 1.500 1.35 945902 Mar 3, 20 0.00 0.0000
2 2020-03-04 1.65 1.700 1.47 1.50 SNDL 1522873 HISTORICAL_PRICES SNDL NaN ... 1.47 1522873 1.50 1.65 1.700 1.47 1522873 Mar 4, 20 0.17 0.1149
3 2020-03-05 1.45 1.650 1.44 1.60 SNDL 673430 HISTORICAL_PRICES SNDL NaN ... 1.44 673430 1.60 1.45 1.650 1.44 673430 Mar 5, 20 -0.20 -0.1212
4 2020-03-06 1.33 1.475 1.33 1.45 SNDL 494977 HISTORICAL_PRICES SNDL NaN ... 1.33 494977 1.45 1.33 1.475 1.33 494977 Mar 6, 20 -0.12 -0.0828

5 rows × 26 columns

Modeling

Before modeling, we need to clean our data a little. First, we select the useful features, because the dataframe has too many columns. Then we add the prediction column.

# Read the data
data = pd.read_csv(csv_name)
# Feature selection (.copy() avoids pandas' SettingWithCopyWarning when adding a column)
df = data[['close', 'high', 'low', 'open', 'volume', 'change']].copy()
# Add a prediction column (e.g. today's prediction target is tomorrow's close price)
df['prediction'] = df['close'].shift(-1)
# Drop the last row, because its value in the prediction column is NaN
df.dropna(inplace=True)
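To make the shift(-1) step concrete, here is a minimal pure-Python sketch of the same labelling, using the first five closing prices from the sample data above. The helper name add_next_day_target is made up for illustration and is not part of pandas.

```python
def add_next_day_target(closes):
    """Pair each close with the next day's close.

    Mirrors shift(-1) followed by dropna(): the last day has no
    "tomorrow", so it is dropped.
    """
    return list(zip(closes[:-1], closes[1:]))


closes = [1.48, 1.48, 1.65, 1.45, 1.33]  # first five closes from the sample data
for today, tomorrow in add_next_day_target(closes):
    print(f"close={today:.2f} -> prediction target={tomorrow:.2f}")
```

Five closing prices yield four training pairs, which is why the dataframe loses one row after dropna().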

Your new data frame will look like this:

close high low open volume change prediction
0 1.48 1.540 1.40 1.440 1056600 0.06 1.48
1 1.48 1.500 1.35 1.450 945902 0.00 1.65
2 1.65 1.700 1.47 1.500 1522873 0.17 1.45
3 1.45 1.650 1.44 1.600 673430 -0.20 1.33
4 1.33 1.475 1.33 1.450 494977 -0.12 1.22
... ... ... ... ... ... ... ...
246 1.43 1.600 1.40 1.425 255266388 -0.10 1.26
247 1.26 1.330 1.10 1.290 397249358 -0.17 1.45
248 1.45 1.470 1.28 1.320 433296256 0.19 1.37
249 1.37 1.640 1.36 1.540 391487356 -0.08 1.33
250 1.33 1.490 1.31 1.390 255416545 -0.04 1.35

251 rows × 7 columns

Next, we build our regression model. The comments in the code explain each step:

# X holds the predictor variables, Y is the target variable
X = np.array(df.drop(columns=['prediction']))
Y = np.array(df['prediction'])
# Normalize our predictor variables
X = preprocessing.scale(X)

# The last row of the predictor variables
X_prediction = X[-1:]
# The last row of the target variable
Y_ans = Y[-1:]

# Delete the last row in X and Y, because we don't want it in the training data
X = X[:-1]
Y = Y[:-1]

# Split our data into train and test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# Perform the regression on the training data
clf = LinearRegression()
clf.fit(X_train, Y_train)
# Predict the closing price for X_prediction
prediction = clf.predict(X_prediction)
# R^2 score of our model on the test data (score() returns R^2, not a classification accuracy)
result = clf.score(X_test, Y_test)
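The value returned by score() for a LinearRegression model is the coefficient of determination, R^2. As a rough sketch of what that number means, here is the formula written out in plain Python. The function name r2_score and the numbers are made up for illustration only.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # spread around the mean
    return 1.0 - ss_res / ss_tot


# Made-up numbers, purely for illustration.
actual = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]
always_mean = [2.5, 2.5, 2.5, 2.5]

print(r2_score(actual, perfect))      # 1.0: every prediction is exact
print(r2_score(actual, always_mean))  # 0.0: no better than predicting the mean
```

A score can also be negative when the predictions are worse than simply predicting the mean, so it should not be read as a percentage of correct predictions.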

Last, we can write a function to send the result to our clients. The smtplib module lets you send email to any internet machine that runs an SMTP listener. Check the SMTP reference for more details.

def sendMessage(text):
    # If you're using Gmail to send the message, you will likely need
    # an app password, because Google has phased out the
    # "Allow less secure apps" option for most accounts
    username = "Your Email Address"
    password = "The password of your email"

    to = "the email you will send to ...@gmail.com"

    subject = "Stock Prediction"
    msg = 'Subject: {}\n\n{}'.format(subject, text)

    server = smtplib.SMTP('smtp.gmail.com', 587)
    server.starttls()
    server.login(username, password)
    server.sendmail(username, to, msg)
    server.quit()
    print('Sent')

# The second-to-last row of data is the day we predict from
last_day = data.iloc[-2]
output = ("\n\nStock: " + str(stock)
          + "\nClose Price on " + str(last_day['Unnamed: 0']) + ": $" + str(last_day['close'])
          + "\nPrediction for the next day closing: $%.2f" % prediction[0]
          + "\nActual closing: $%.2f" % Y_ans[0]
          + "\nModel R^2 score: %.2f%%" % (result * 100.0))

sendMessage(output)
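As an alternative to assembling the message by hand, the standard library's email.message.EmailMessage can build the headers and body for you. The sketch below only constructs and prints a message; it does not send anything. The helper names, addresses, and numbers are all placeholders invented for this example.

```python
from email.message import EmailMessage


def build_report(stock, date, close, predicted, actual, r2):
    """Format the prediction summary as plain text (hypothetical helper)."""
    return (
        f"Stock: {stock}\n"
        f"Close Price on {date}: ${close:.2f}\n"
        f"Prediction for the next day closing: ${predicted:.2f}\n"
        f"Actual closing: ${actual:.2f}\n"
        f"Model R^2 score: {r2:.2f}"
    )


def build_email(sender, recipient, body):
    """Wrap the report in an EmailMessage with the usual headers."""
    msg = EmailMessage()
    msg["Subject"] = "Stock Prediction"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body)
    return msg


# Placeholder addresses and made-up numbers, for illustration only.
report = build_report("XYZ", "2021-01-04", 1.33, 1.36, 1.35, 0.95)
message = build_email("sender@example.com", "client@example.com", report)
print(message)
```

A message built this way can be handed to smtplib's send_message() method in place of the sendmail() call, which saves you from formatting the Subject line yourself.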

The email will be in this format:

Final Application

Finally, we can put everything together into a complete application. The following code does the job:

# Final application 
def getStocks():
    #Navigating to the Yahoo stock screener
    
    driver = webdriver.Chrome(
        'Type the directory of your chromedriver here')
    url = "https://finance.yahoo.com/most-active"
    driver.get(url)

    # Grab the first (most active) ticker on the stock screener list
    # Xpath: //*[@id="scr-res-table"]/div[1]/table/tbody/tr[1]/td[1]/a
    # find_element_by_xpath was removed in Selenium 4.3; use find_element(By.XPATH, ...)
    from selenium.webdriver.common.by import By
    ticker = driver.find_element(
        By.XPATH, '//*[@id="scr-res-table"]/div[1]/table/tbody/tr[1]/td[1]/a')

    # Read the text before quitting the driver
    stock = copy.deepcopy(ticker.text)
    
    driver.quit()
    return predictData(stock)
    

        
def sendMessage(text):
    # If you're using Gmail to send the message, you will likely need
    # an app password, because Google has phased out the
    # "Allow less secure apps" option for most accounts
    username = "Your Email Address"
    password = "The password of your email"

    to = "the email you will send to ...@gmail.com"

    subject = "Stock Prediction"
    msg = 'Subject: {}\n\n{}'.format(subject, text)

    server = smtplib.SMTP('smtp.gmail.com', 587)
    server.starttls()
    server.login(username, password)
    server.sendmail(username, to, msg)
    server.quit()

    print('Sent')
    
def predictData(stock):

    start = datetime(2018, 10, 2)
    end = datetime(2019, 10, 2)

    df = get_historical_data(stock, start, end, output_format='pandas',
                             token="pk_422a359c341b427ea05864740c233fe3")
    csv_name = stock + '.csv'
    df.to_csv(csv_name)

    data = pd.read_csv(csv_name)
    # .copy() avoids pandas' SettingWithCopyWarning when adding a column
    df = data[['close', 'high', 'low', 'open', 'volume', 'change']].copy()
    df['prediction'] = df['close'].shift(-1)
    df.dropna(inplace=True)

    # Predicting the stock price in the future
    X = np.array(df.drop(columns=['prediction']))
    Y = np.array(df['prediction'])
    X = preprocessing.scale(X)
    X_prediction = X[-1:]
    Y_ans = Y[-1:]
    X = X[:-1]
    Y = Y[:-1]
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

    # Performing the regression on the training data
    clf = LinearRegression()
    clf.fit(X_train, Y_train)
    prediction = clf.predict(X_prediction)
    # Score on the held-out test data, as in the Modeling section (R^2, not accuracy)
    result = clf.score(X_test, Y_test)

    # The second-to-last row of data is the day we predict from
    last_day = data.iloc[-2]
    output = ("\n\nStock: " + str(stock)
              + "\nClose Price on " + str(last_day['Unnamed: 0']) + ": $" + str(last_day['close'])
              + "\nPrediction for tomorrow closing: $%.2f" % prediction[0]
              + "\nActual closing: $%.2f" % Y_ans[0]
              + "\nModel R^2 score: %.2f%%" % (result * 100.0))

    sendMessage(output)
    
if __name__ == '__main__':
    getStocks()

Conclusion

Thank you for reading. I hope this tutorial benefits you. If you have any questions, feel free to comment below. Good luck, everyone!