Data Analytics Assignment: Case Scenarios Based on Intelligent Systems
Question
Task:
Data Analytics Assignment Task: Carefully read the following two questions and provide the appropriate answer.
Question 1: The bankruptcy-prediction problem can be viewed as a classification problem. The data set you will be using for this problem includes one ratio that has been computed from the financial statements of real-world firms. Such ratios have been used in studies involving bankruptcy prediction. The first sample (training set) includes 68 data values on firms that went bankrupt and firms that did not. This will be your training sample. The second sample (testing set) of 68 firms also consists of some bankrupt firms and some non-bankrupt firms. Your goal is to use different classifiers to build a training model by randomly selecting 40 data points (20 points from category 1 and 20 points from category 0), and then test its performance on 40 data points randomly selected from the testing set. (Try to analyze the new cases yourself manually before you run the neural network and see how well you do.)
Students have to use the following classifiers. The number of classifiers depends on the number of members in the group, e.g. a group of four members will use four of the following five classifiers.
1. Neural network
2. Support vector machine
3. Nearest neighbor algorithm
4. Decision tree
5. Naive Bayes
You have to prepare a report which includes the following:
1. Explain the process of building each classifier using the training set (add the screenshots).
2. Explain how you evaluated the classifier.
3. Create the confusion matrix based on 70% (training) / 30% (testing).
4. Predict the category of the values (any random 40 values) in the table used for the testing set.
5. Compare the results between the different classifiers and discuss which one is the best and why.
Question 2: Create a DASHBOARD. For creating a dashboard, the group can use the above database or any other database. The group has to prepare a report which includes the following:
1. Write an introduction about the dataset used and add the reference (link).
2. Create at least four figures (different graphs) and add them to the dashboard.
3. Add a screenshot of each step.
4. Describe the figures in the dashboard. The student can use any software to create the dashboard, such as Microsoft Excel, Tableau, etc.
Answer
Answer 1
The data analytics assignment has been completed with the help of RapidMiner software. Two classification models have been used for this assignment: decision tree and naïve Bayes classifier.
Decision tree
Two models have been designed to produce the required outcomes. The first model builds a decision tree from a random selection of 40 data points from the training set, consisting of 20 from category 0 and 20 from category 1. A breakpoint has been used to view the data being sampled for training and testing. The model used for decision tree design and testing is shown below:
Figure 1: Model used for decision tree
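The class-balanced sampling step described above can be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions: the function name `stratified_sample`, the row layout, and the placeholder WC values are hypothetical, and the even 34/34 class split of the 68 firms is assumed rather than taken from the data set.

```python
import random

def stratified_sample(rows, label_key, per_class, seed=None):
    """Draw `per_class` rows at random, without replacement, from each label value."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    sample = []
    for label in sorted(by_label):
        sample.extend(rng.sample(by_label[label], per_class))
    rng.shuffle(sample)  # mix the two categories together
    return sample

# Hypothetical stand-in for the 68-firm training set (placeholder WC values,
# assumed 34 firms per category); the real data set is not reproduced here.
_placeholder_rng = random.Random(0)
training_set = [
    {"Firm": i, "WC": round(_placeholder_rng.uniform(0, 600), 3), "Category": i % 2}
    for i in range(68)
]

# 20 rows from category 0 and 20 rows from category 1, as in the assignment
train_40 = stratified_sample(training_set, "Category", per_class=20, seed=42)
```

Fixing the seed makes the draw repeatable, which is useful when several classifiers have to be compared on the same sampled rows.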
The following table shows the training and testing samples selected from the data set used:
Using the model and the above data samples, the following decision tree is obtained:
Figure 2: Decision tree
The predictions for the data are shown in the following table:
Result
The textual representation of the decision tree is as follows:
WC > 447.846: 0 {1=0, 0=8}
WC ≤ 447.846
|   WC > 190.223: 1 {1=20, 0=9}
|   WC ≤ 190.223: 0 {1=0, 0=3}
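The tree printed above reduces to two thresholds on WC, so it can be written as a small function. The sketch below is a plain-Python restatement of those rules, not RapidMiner output; the function name `predict_category` is hypothetical. It also adds up the leaf counts shown in braces to get the fit on the 40 sampled training points, which is a different quantity from the hold-out accuracy reported further down.

```python
def predict_category(wc):
    """Apply the decision tree rules from the textual representation."""
    if wc > 447.846:
        return 0
    if wc > 190.223:
        return 1
    return 0

# (predicted class, class counts) for each leaf, copied from the printed tree
leaf_counts = [
    (0, {1: 0, 0: 8}),
    (1, {1: 20, 0: 9}),
    (0, {1: 0, 0: 3}),
]

# A training point is fitted correctly when its class matches the leaf's prediction
correct = sum(counts[pred] for pred, counts in leaf_counts)
total = sum(sum(counts.values()) for _, counts in leaf_counts)
training_fit = correct / total  # 31 of 40 points, i.e. 0.775
```

The nine category-0 firms that fall in the middle leaf are the only training points the tree misclassifies, because a single attribute cannot separate them from the category-1 firms in the same WC range.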
A second model has been designed alongside it to determine the accuracy on the training data set by dividing it into two parts: 70% for training and 30% for testing. The model designed is as follows:
Figure 3: Model designed for accuracy calculation
The confusion matrix of the training data set is as follows:
The accuracy of this model has been found to be 75%.
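For readers who want to see how such a figure is derived, here is a minimal sketch of a 70/30 split and of reading accuracy off a confusion matrix. The helper names (`split_rows`, `confusion_matrix`, `accuracy`) and the label lists are hypothetical and chosen only for illustration; they do not reproduce the confusion matrix or the 75% result of the RapidMiner model.

```python
import random
from collections import Counter

def split_rows(rows, train_fraction=0.7, seed=None):
    """Shuffle the rows and cut them into a training part and a testing part."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def confusion_matrix(actual, predicted, labels=(1, 0)):
    """Rows are actual classes, columns are predicted classes, in `labels` order."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

def accuracy(matrix):
    """Share of cases on the main diagonal (actual class == predicted class)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

train_part, test_part = split_rows(range(40), train_fraction=0.7, seed=1)

# Hypothetical hold-out labels, for illustration only
actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
matrix = confusion_matrix(actual, predicted)  # [[4, 1], [1, 4]]
holdout_accuracy = accuracy(matrix)           # 8 correct out of 10
```

With a hold-out set this small, each misclassified firm moves the accuracy by several percentage points, so differences between classifiers should be read with care.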
Naïve Bayes classifier
The second model used for the analysis of the same dataset is Naïve Bayes. This model follows the same approach as above, sampling 40 data points by category: 20 from category 0 and 20 from category 1. The testing sample also consists of 40 randomly chosen data points. To make comparison easier, both models have been built on the same randomly sampled data. The Naïve Bayes model is as follows:
Figure 4: Naïve Bayes classifier model
The following data sample has been used for the analysis:
With the help of the above data set the following prediction has been made:
Result
The Simple distribution is defined as:
Distribution model for label attribute Category
Class 1 (0.500): 1 distributions
Class 0 (0.500): 1 distributions
The mean and standard deviation of the data are as follows:
Figure 5: Simple distribution of the data calculated
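To illustrate how a naïve Bayes model with one numeric attribute turns class priors, means and standard deviations into confidence values, the following sketch assumes a normal density per class. The priors of 0.5 come from the distribution model above; the means and standard deviations in `params` are hypothetical placeholders, since the values in Figure 5 are not reproduced in the text, and the function names are illustrative.

```python
import math

def gaussian_pdf(x, mean, std):
    """Normal density, assumed here as the per-class model for WC."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def posterior(wc, params, priors):
    """Bayes' rule: prior times likelihood, normalised over the classes."""
    joint = {c: priors[c] * gaussian_pdf(wc, *params[c]) for c in params}
    total = sum(joint.values())
    return {c: joint[c] / total for c in joint}

priors = {1: 0.5, 0: 0.5}  # from the distribution model in the source

# Hypothetical (mean, standard deviation) of WC per class, for illustration only
params = {1: (300.0, 80.0), 0: (350.0, 250.0)}

confidence = posterior(310.0, params, priors)
prediction = max(confidence, key=confidence.get)  # class with the higher confidence
```

The two entries of `confidence` play the role of the confidence (1) and confidence (0) columns in the prediction table: they sum to one, and the predicted category is the one with the larger value.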
The accuracy is calculated as directed by dividing the training data set into 70% training data and 30% testing data. The following model has been used to calculate the accuracy:
Figure 6: Naïve Bayes model for accuracy measurement
The confusion matrix of the model is shown as:
The accuracy of the model has been found to be 70%.
Answer 2
Figure 7: Dashboard of the result data from Question 1
The data used for the design of this dashboard are the two result data sets obtained in the previous question. These have been chosen in order to analyse the relations between the different data in the tables. The charts have been created in Excel. Each of the tables comprises 40 data points with six columns: Row no., Firm, prediction (Category), confidence (1), confidence (0), and WC. The four charts created depict the three main data components of the data set. One type of chart plots prediction (Category) against WC, and the other plots prediction (Category) along with confidence (1) and confidence (0). Both types of chart have been drawn for the Decision Tree dataset as well as the Naïve Bayes dataset.
The first chart shows that the value of 1 for prediction (Category) is mostly concentrated within the range of 0 to 500 in terms of WC. The three points which can be seen for prediction (Category) as 0 can be considered outliers, as one of the values is far higher than the rest. If a different value had been present, the distribution of the points might have changed. The second chart shows that the trend lines are consistent for a group of values and have a sharp fall or rise depending on the data points. The sharp fall can be considered the mirror opposite of the sharp rise in the trend lines. This change in value to either 0 or 1 corresponds to a predicted (Category) value of 0.
The third chart has been designed for the data collected from the Naïve Bayes classification algorithm. Here also we can see that the prediction (Category) value of 1 has the highest number of values within the range of 0 to 500, and the large value is included again, which can be considered an outlier for this data set as well. However, it can be seen that there is a change in the points with a prediction (Category) of 0. This is due to the different calculation processes of the two algorithms, which cause minute errors to be incorporated into the calculation. The fourth chart gives an idea of the prediction (Category) alongside the confidence values for 0 and 1. The three smaller lines which end at 1 are the ones which have been predicted to be 0.