## Description

The goal of this project is to implement a program for inferring imperfect decision trees.

### Theory

We discussed in class the classic ID-3 approach for inferring decision trees. The algorithm keeps splitting nodes as long as the nodes have nonzero entropy and features are available. This project's goal is to infer imperfect decision trees with a small number of nodes and with small entropy.

Consider a tree that partitions the instances into $k$ leaves: $S = \{s_1, s_2, \ldots, s_k\}$. The target attribute is $T$, and the purpose of the tree is to enable us to infer the target attribute. As in the case of ID-3, our goal is to minimize:

$$\mathrm{Entropy}(T \mid S) = \sum_{j=1}^{k} \mathrm{Prob}(s_j)\,\mathrm{Entropy}(T \mid s_j)$$

A greedy approach for inferring such a tree is to iteratively determine which node should be partitioned further. The contribution of a single leaf $s_j$ to the total uncertainty is:

$$E_j = \mathrm{Prob}(s_j)\,\mathrm{Entropy}(T \mid s_j) = \frac{|s_j|}{n}\,\mathrm{Entropy}(T \mid s_j) \tag{1}$$

where $|s_j|$ is the number of instances in $s_j$ and $n$ is the total number of instances. Therefore, it makes sense to choose the $s_j$ with maximal $E_j$.

A better criterion is obtained using lookahead. Let $A_1, \ldots, A_m$ be the attributes, and let $T_j$ be the target attribute restricted to the subset $s_j$. If we choose $s_j$, it is to be partitioned using the attribute $A_i$ that minimizes $\mathrm{Entropy}(T_j \mid A_i)$, computed as:

$$\mathrm{Entropy}(T_j \mid A_i) = \sum_{l} \frac{|a_l|}{|s_j|}\,\mathrm{Entropy}(T_j \mid a_l)$$

where $a_l$ is the subset of $s_j$ whose instances take the $l$-th value of $A_i$. The total change in the entropy of the tree is then:

$$E_{ji} = \mathrm{Prob}(s_j)\left(\mathrm{Entropy}(T_j \mid A_i) - \mathrm{Entropy}(T \mid s_j)\right) = -\mathrm{Prob}(s_j)\,\mathrm{Gain}(s_j, A_i)$$

where:

$$\mathrm{Gain}(s_j, A_i) = \mathrm{Entropy}(T \mid s_j) - \mathrm{Entropy}(T_j \mid A_i)$$

Therefore, we should choose the pair $(s_j, A_i)$ that minimizes $E_{ji}$, or, equivalently, the one that maximizes:

$$F = \max_{i,j} \frac{|s_j|}{n}\,\mathrm{Gain}(s_j, A_i) \tag{2}$$

### Example

| #  | A1 | A2 | A3 | Target attribute |
|----|----|----|----|------------------|
| 1  | R  | Y  | 0  | a |
| 2  | R  | X  | 0  | a |
| 3  | R  | X  | 0  | b |
| 4  | R  | X  | 1  | a |
| 5  | G  | X  | 1  | b |
| 6  | G  | X  | 1  | a |
| 7  | G  | X  | 1  | a |
| 8  | G  | X  | 1  | b |
| 9  | B  | Y  | 1  | a |
| 10 | B  | Y  | 0  | b |

Starting with the single leaf {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, it must be chosen as the first leaf to split. We have Entropy(T) = 0.970951 and:

- Entropy(T|A1) = 0.924511, Gain(T, A1) = 0.0464393
- Entropy(T|A2) = 0.965148, Gain(T, A2) = 0.00580215
- Entropy(T|A3) = 0.95098, Gain(T, A3) = 0.0199731

Therefore, the best choice is A1. This produces the following partition into three leaves: s1 = {1, 2, 3, 4}, s2 = {5, 6, 7, 8}, s3 = {9, 10}. We have:

- Entropy(s1) = 0.811278
- Entropy(s1|A1) = 0.811278, Gain(s1, A1) = 0
- Entropy(s1|A2) = 0.688722, Gain(s1, A2) = 0.122556
- Entropy(s1|A3) = 0.688722, Gain(s1, A3) = 0.122556
- F1 = 0.4 × 0.122556 = 0.0490224

and:

- Entropy(s2) = 1
- Entropy(s2|A1) = 1, Gain(s2, A1) = 0
- Entropy(s2|A2) = 1, Gain(s2, A2) = 0
- Entropy(s2|A3) = 1, Gain(s2, A3) = 0
- F2 = 0.4 × 0 = 0

and:

- Entropy(s3) = 1
- Entropy(s3|A1) = 1, Gain(s3, A1) = 0
- Entropy(s3|A2) = 1, Gain(s3, A2) = 0
- Entropy(s3|A3) = 0, Gain(s3, A3) = 1
- F3 = 0.2 × 1 = 0.2

Therefore, the next leaf to split is s3, and it is split according to A3. This produces the following partition into four leaves: s1 = {1, 2, 3, 4}, s2 = {5, 6, 7, 8}, s4 = {9}, s5 = {10}.
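To make the lookahead criterion concrete, the following is a minimal Python sketch (not a required part of the submission) that reproduces the F values of the example above. The function names and the in-memory data layout are illustrative assumptions, not part of the specification:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of target labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def conditional_entropy(rows, attr, target):
    """Entropy(T_j | A_i): weighted entropy of the target after
    splitting the rows on the values of attribute `attr`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    total = len(rows)
    return sum(len(g) / total * entropy(g) for g in groups.values())

# The example dataset, one tuple per instance: (A1, A2, A3, Target).
data = [
    ("R", "Y", 0, "a"), ("R", "X", 0, "a"), ("R", "X", 0, "b"),
    ("R", "X", 1, "a"), ("G", "X", 1, "b"), ("G", "X", 1, "a"),
    ("G", "X", 1, "a"), ("G", "X", 1, "b"), ("B", "Y", 1, "a"),
    ("B", "Y", 0, "b"),
]
n = len(data)

# Criterion (2): for each leaf s_j, F_j = (|s_j|/n) * max_i Gain(s_j, A_i).
leaves = {"s1": data[0:4], "s2": data[4:8], "s3": data[8:10]}
for name, rows in leaves.items():
    base = entropy([r[3] for r in rows])
    gains = [base - conditional_entropy(rows, a, 3) for a in range(3)]
    print(name, len(rows) / n * max(gains))  # s1: 0.0490..., s2: 0.0, s3: 0.2
```

Running this prints F1 = 0.0490224, F2 = 0, and F3 = 0.2, matching the walkthrough, so s3 is the leaf chosen for splitting.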
### Input/Output

The dataset is described in a file containing only integer numbers. The first number in the file is m, the number of instances; the second is n, the number of features. These two numbers are followed by the m × n feature values, row by row. The last attribute is taken as the target attribute. Example:

```
10 4
0 0 0 0
0 0 1 0
1 0 2 0
2 2 0 1
0 0 1 1
0 1 2 1
1 1 1 1
1 2 2 0
0 0 1 0
2 0 1 0
```

You may assume that a feature has no more than 3 distinct values, so that the numbers in each column can be either 0, 1, or 2. The target attribute (the last column) is binary, so its values are 0 or 1.

A partition is described in a text file. For a dataset with m instances, a partition is described by a sequence of numbers in the range 1 ... m. Each partition is specified on a separate line, and the first string in a row describing a partition is the partition id. With 10 instances, here is the file representing the partition {1,10}, {2,3,4,5}, {6,7,8,9}, where the first partition has id X, the second id Y2, and the third id Z:

```
X 1 10
Y2 2 3 4 5
Z 6 7 8 9
```

Your program should read as input one dataset file and one partition file. It should determine which partition to split, and according to what attribute. This information should be printed as a message. The program should then produce as output the new partition file. Example:

```
MyProgram
Enter names of the files dataset input-partition output-partition
dataset-1.txt partition-2.txt partition-3.txt
Partition Z was replaced with partitions Z1,Z2 using Feature 3
```

If the file partition-2.txt is as specified above, the file partition-3.txt might be:

```
X 1 10
Y2 2 3 4 5
Z1 7
Z2 6 8 9
```

### Requirements

Submission must be on eLearning.
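For concreteness, here is one way the two file formats might be read and written in Python, assuming whitespace-separated tokens exactly as in the examples above; all function names are illustrative:

```python
def read_dataset(path):
    """Read a dataset file: m, n, then m*n integer values row by row.
    The last column is the target attribute."""
    with open(path) as f:
        tokens = f.read().split()
    m, n = int(tokens[0]), int(tokens[1])
    values = list(map(int, tokens[2:2 + m * n]))
    return [values[i * n:(i + 1) * n] for i in range(m)]

def read_partition(path):
    """Read a partition file: each line is an id followed by 1-based
    instance numbers. Returns a list of (id, [instances]) pairs."""
    parts = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields:
                parts.append((fields[0], [int(x) for x in fields[1:]]))
    return parts

def write_partition(path, parts):
    """Write a partition in the same id-then-instances format."""
    with open(path, "w") as f:
        for pid, instances in parts:
            f.write(pid + " " + " ".join(map(str, instances)) + "\n")
```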
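And a hedged sketch of the selection-and-split step itself, reusing `entropy` and `conditional_entropy` from the earlier sketch. The prompting step is omitted (the file names from the example dialogue are hardcoded), and numbering the new ids by attribute-value order (Z1, Z2, ...) is an assumption about the naming scheme:

```python
def best_split(rows, parts):
    """Return (leaf index, attribute index) maximizing criterion (2).
    Features are all columns except the last one (the target)."""
    n_total = len(rows)
    target = len(rows[0]) - 1
    best_f, best_j, best_a = -1.0, None, None
    for j, (_, instances) in enumerate(parts):
        subset = [rows[i - 1] for i in instances]  # instances are 1-based
        base = entropy([r[target] for r in subset])
        for a in range(target):
            f = len(subset) / n_total * (base - conditional_entropy(subset, a, target))
            if f > best_f:
                best_f, best_j, best_a = f, j, a
    return best_j, best_a

def split_leaf(rows, parts, j, a):
    """Replace leaf j by one new leaf per distinct value of attribute a.
    New ids append 1, 2, ... to the old id (an assumed naming scheme)."""
    pid, instances = parts[j]
    groups = {}
    for i in instances:
        groups.setdefault(rows[i - 1][a], []).append(i)
    new = [(pid + str(k), grp)
           for k, (_, grp) in enumerate(sorted(groups.items()), start=1)]
    return parts[:j] + new + parts[j + 1:], [p for p, _ in new]

# A real submission would prompt for the three file names as in the
# example dialogue; they are hardcoded here for brevity.
rows = read_dataset("dataset-1.txt")
parts = read_partition("partition-2.txt")
j, a = best_split(rows, parts)
new_parts, new_ids = split_leaf(rows, parts, j, a)
print("Partition %s was replaced with partitions %s using Feature %d"
      % (parts[j][0], ",".join(new_ids), a + 1))
write_partition("partition-3.txt", new_parts)
```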