Suppose we wish to use a postings intersection procedure to determine simply the list of documents that satisfy a /k clause, rather than returning the list of positions, as in Figure 2.12 (page 42). For simplicity, assume k ≥ 2. Let L denote the total number of occurrences of the two terms in the document collection (i.e., the sum of their collection frequencies). Which of the following is true? Justify your answer.
a. The merge can be accomplished in a number of steps linear in L and independent of k, and we can ensure that each pointer moves only to the right.
b. The merge can be accomplished in a number of steps linear in L and independent of k, but a pointer may be forced to move non-monotonically (i.e., to sometimes back up)
c. The merge can require kL steps in some cases.
Figure 2.12 can be found from this pdf file - http://nlp.stanford.edu/IR-book/pdf/02voc.pdf
Q2.
How should the Boolean query x AND NOT y be handled? Why is naive evaluation of this query normally very expensive? Write out a postings merge algorithm that evaluates this query efficiently.
Q3.
If all the hub and authority scores are initialized to 1, what is the hub/authority score of a node after one iteration?
Q4.
In the preceding discussion we encountered two recommended "hard constants" - the increment on te being ten times the last fetch time, and the number of back queues being three times the number of crawl threads. How are these two constants related?