## Description

Introduction.

This homework is worth 5.5% of your total grade. If you choose to skip it, the ﬁnal exam will be worth 5.5% more. If you complete all the homeworks after Exam #2, there will be a special bonus in terms of your ﬁnal grade computation. Additionally, doing this work now is essential to doing well in the ﬁnal exam. This homework requires no programming, only pen/pencil on paper computations. So you will submit your answers as a PDF ﬁle on GRADESCOPE. Question 1. Estimate the cost of following operations with the information given below. All costs should be in number of pages (read or written). Recall that when the ﬁnal result is found, it is put in the output buﬀer. This operation has no additional cost.

R(A,B,C,D,E) TUPLES(R)= 100,000 PAGES(R)= 2,000 S(F,G,A) TUPLES(S) = 3,000,000 PAGES(S)= 6,000

(a) Block-nested loop join for R ./ S with M=101 blocks. For join ordering, choose the lowest cost one and only show the result of that join. Answer: cost = PAGES(R) + [PAGES(R) M−1 ]×PAGES(S) = 2000 + 20×6000 = 12200 (b) External sorting of S using M=60 blocks. Answer: cost of read and merge = 6000 × 2 = 12000 6000/60 = 100 groups which is larger than M = 60, so repeat to merge groups. Total cost = 12000 + 12000 + 6000 = 30000

(c) External sorting of S using M=100 blocks. Answer: cost of read and merge = 6000 × 2 = 12000 6000/100 = 60 groups which is smaller than M = 60, sort directly Total cost = 12000 + 6000 = 18000

(d) Hash join for R ./ S with M=101 blocks. Describe how join works and any assumptions you make in your computation. Assume tuples in R and S will uniformly distribute to 101 blocks. If a buckets after hashing will not ﬁt in memory, then you will use block nested loop join. Answer: hash cost = 2 × (2000 + 6000) = 16000 R buckets: 2000/101 = 20 pages/bucket, S buckets: 6000/101 = 60 pages/bucket so enough memory to allocate Total cost = 16000 + 2000 + 6000 = 24000

(e) Sort merge join for R ./ S with M=101 blocks. Sort each relation ﬁrst and then join. Note that if the number of partially sorted groups is smaller than allocated memory (in this case 101 blocks), they can be sorted and joined together in a single step. Answer: cost of sort R = 2 × 4000 + 2 × 4000 = 8000 cost of sort S(only step 1 of external sort) = 6000 × 2 =12000 Then use 60 blocks to do sort merge and join for S and 1 block to read R Total cost = 8000 + 12000 + 8000 = 28000

1

Question 2. Estimate the cost of query Q1 using diﬀerent query plans. All costs should be in number of pages. Assume suﬃcient amount of memory is allocated for each query plan.

R(A,B,C,D,E) TUPLES(R)= 100,000 PAGES(R)= 2,000 Q1: SELECT A,B FROM R WHERE C>10 AND D=25

Index I1 with three levels (root,internal,leaf) on R(C,D,A) with 800 nodes in the leaf level

Index I2 with three levels (root,internal,leaf) on R(D,A,B,C) with 1,500 nodes in the leaf level

Index I3 with three levels (root,internal,leaf) on R(C) with 300 nodes in the leaf level

Index I4 with three levels (root,internal,leaf) on R(D) with 250 nodes in the leaf level

Expected number of tuples of R for conditions:

Tuples(R.C>10)=20,000 Tuples(R.D=25)=1,000 Tuples(R.C>10 and R.D=25)=50

(a) Plan 1: sequential scan over R. Answer: Because of sequential scan, cost = PAGES(R) = 2000

(b) Plan 2: using index I1. Answer: Index I1 has 800 nodes in the leaf level, so 10000/800 = 125 tuples/leave node. Because condition 1 in Q1 is C > 10 which is a range, we need to go over all leave node with C > 10, and cost of this step = 20000/125 = 160 to 161 pages. Then because B is not in I1, so we need to scan all satisﬁed tuples with the maximum cost of 50 pages. Total cost = 2 + (160 to 161) + 50 = 212 to 213 pages

(c) Plan 3: using index I2. Answer: Index I2 has 1500 nodes in the leaf level, so 10000/1500 = 66 tuples/leave node. First we need to go over all leave node with D = 25, and cost of this step = 1000/66 = 16 to 17 pages. Then because A and B are in I2, so we do not need to scan all satisﬁed tuples. Total cost = 2 + 16 to 17 = 18 to 19 pages

(d) Plan 4: using index I3. Answer: Index I3 has 3000 nodes in the leaf level, so 10000/3000 = 333 tuples/leave node. First we need to go over all leave node with C > 25, and cost of this step = 20000/333 = 60 to 61 pages. Then because A and B and D are not in I3, so we need to scan all satisﬁed tuples. Total cost = 2 + (60 to 61) + 20000 = 20062 to 20063 pages

2

(e) Plan 5: using index I4. Answer: Index I4 has 250 nodes in the leaf level, so 10000/250 = 400 tuples/leave node. First we need to go over all leave node with D = 25, and cost of this step = 1000/400 = 3 to 4 pages. Then because A, B and C are not in I4, so we do need to scan all satisﬁed tuples. Total cost = 2 + (3 to 4) + 1000 = 1005 to 1006 pages

(f) Plan 6: using index I3 and I4 both. Answer: Index I2 has 1500 nodes in the leaf level, so 10000/1500 = 66 tuples/leave node. From part (d), cost to get tuples with (C > 10) = 62 to 63 pages. From part (e), cost to get tuples with (D = 25) = 5 to 6 pages. Then we do intersection for above two group of tuples to ﬁnd tuples satisfy both conditions. Total cost = (62 to 63) + (5 to 6) + ((60 to 61) + (3 to 4)) + 50 = 180 to 184 pages

3

Question 3. Estimate the size (number of tuples) of the following queries:

Q1: select * from games where id = 21; Q2: select * from contestants where gameid = 2345; Q3: select * from contestants where shortname = ‘Gilbert’; Q4: select * from contestants where gameid = 2345 and shortname = ‘Gilbert’; Q5: select * from games g, contestants c where g.gameid=c.gameid and c.shortname=’Gilbert’; Q6: select * from responses where iscorrect = True; Q7: select * from responses where shortname = ‘Gilbert’ and gameid=’5881′; Q8: select * from responses where shortname = ‘Gilbert’ or gameid=’5881′; Q9: select * from responses r, contestants c where c.shortname=r.shortname; Q10: select * from responses r, contestants c where c.shortname=r.shortname and c.gameid=r.gameid;

using the following statistics (VALUES is the same DISTINCT values):

Tuples(Games) = 10,000 Values(Games.gameid) = 10,000 Values(Games.id) = 40 Values(Games.airdate) = 10,000

Tuples(Contestants)=30,000 Values(Contestants.shortname)=3,000 Values(Contestants.fullname)=18,000 Values(Contestants.gameid)=10,000

Tuples(Responses)=740,000 Values(Responses.gameid)=10,000 Values(Responses.clueid)=520,000 Values(Responses.shortname)=3,000 Values(Responses.iscorrect)=2

These are close to the real statistics, but they are not identical. However, the following is a great educational exercise (not required for solving the homework). You can get the statistics for the real data and compare the estimates to the real results (just by running select count(*) queries). When do the estimates are far oﬀ from reality and why?

4

Answer:

Q1 1000× 1 40 = 250 Q2 30000× 1 10000 = 3 Q3 30000× 1 3000 = 10 Q4 30000× 1 10000×3000

= 1

Q5 10000× 30000 3000×10000

= 10

Q6 740000 2 = 370000 Q7 740000× 1 3000×10000

= 1

Q8 740000×(1− 1 10000)×(1− 1 3000) = 321 Q9 30000× 740000 3000 = 7400000 Q10 30000× 740000 3000×10000 = 740

5

Question 4. Estimate the cost of the following query plans for the same query. Which one is the cheapest plan?

The abbreviations used:

BNLJ Block nested loop join SMJ Sort merge join O-T-F On the ﬂy (pipelined operation) Sort External sort I-O-S Index only scan SS Sequential scan

For simplicity, we will give you the size of the result of each operation. You can assume that there are appropriate projections to reduce the size of intermediate relations which are included in the given sizes.

Furthermore, assume that the join operation is over a foreign key (so for each S.D, there is a single R.A value). This simply makes the computation of sort merge join much simpler after a sort.

PAGES(R) = 100 PAGES(S) = 800 PAGES(σR.B>20 R) = 25 PAGES(R ./R.A=S.D S) = 700 PAGES((σR.B>20 R) ./R.A=S.D S) = 175

Index scan for Plan 2 uses an index on R.B,R.A,R.C. Assume that index has 2 levels, root and 100 leaf pages. The selectivity of the condition σR.B>20 R is 1/4.

Note that operations in each query plan are pipelined from lower operations to upper operations, by placing the output of one operation to the input buﬀer of the next operation.

Answer:

1. Cost of BNLJ = 100 + 800× 100 50 = 1700 Cost of sort = 700 50 = 14 < 50 Total cost = 1700 + 700×2 + 700 = 3800 2. Cost of selection = 1 + 25 = 26 Cost of BNLJ = 25 + 800 ×25 50 = 425 Cost of sort = 175 50 = 4 < 50 Total cost = 26 + 425 + 2 × 175 + 175 = 976 3. Cost of selection = 100 Cost of sort 1 = 100 50 ×2 + 100 = 300 Cost of sort 2 = 800 50 ×2 + 800 = 2400 Cost of SMJ = 300 + 2400 = 2700 Total cost = 100 + 300 + 2400 + 2700 = 5500
6