by Jennifer Stanczak (jenstan in BIT330, Fall 2008)
Questions and queries
Web search engines
When I was younger, I always played with Barbie dolls. I had tons of them, from Beach Barbie to Doctor Barbie. I knew that Barbies had been around for a long time, but I wondered who created the popular doll. For my web search, I will try to find the creator of the Barbie doll.
In all three search engines, I will use the search query “Barbie doll creator.”
Blog search engines
For my search in the blog search engines, I will be looking for reviews of Google's new web browser, Google Chrome. I don't know anything about it, so I am looking for reviews that explain what it does and whether it is a good web browser to use.
In all three blog search engines, I will use the search query “google chrome.”
Data that I collected
Search engine overlap data
| Web search | Live | Google | Yahoo Web |
| --- | --- | --- | --- |
| Live | 80 | 25 | 20 |
| Google | | 45 | 20 |
| Yahoo Web | | | 75 |
| All | 10 | | |

| Blog search | Technorati | Google Blog | Bloglines |
| --- | --- | --- | --- |
| Technorati | 30 | 5 | 10 |
| Google Blog | | 70 | 10 |
| Bloglines | | | 55 |
| All | 5 | | |

Search engine ranking overlap data
This table provides a measure of how much of Google's responses are reproduced by Yahoo.
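| GY: Google (rows) vs. Yahoo (columns) | Top 5 | Top 10 | Top 20 |
| --- | --- | --- | --- |
| Top 5 | 1 | 2 | 2 |
| Top 10 | 2 | 3 | 3 |
| Top 20 | 2 | 4 | 4 |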
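The rank overlap tables in this section can be read as counts of shared results at different cutoffs. The sketch below is one way such a table might be computed, assuming each cell counts the results that appear both in the top i of one engine and in the top j of the other. The function name `overlap_matrix`, the short cutoffs, and the sample result lists are hypothetical and are not taken from the assignment.

```python
def overlap_matrix(ranked_a, ranked_b, cutoffs=(5, 10, 20)):
    """Count shared results between the top i of list A and the top j of list B.

    Rows correspond to cutoffs applied to ranked_a, columns to cutoffs
    applied to ranked_b. This definition is an assumption for illustration.
    """
    return [
        [len(set(ranked_a[:i]) & set(ranked_b[:j])) for j in cutoffs]
        for i in cutoffs
    ]


# Hypothetical ranked result lists (placeholders, not real search results).
engine_a = ["a", "b", "c", "d"]
engine_b = ["b", "x", "a", "y"]

# Small cutoffs keep the example short; the tables here use 5, 10, and 20.
matrix_ab = overlap_matrix(engine_a, engine_b, cutoffs=(2, 4))
matrix_ba = overlap_matrix(engine_b, engine_a, cutoffs=(2, 4))
print(matrix_ab)
print(matrix_ba)
```

Under this definition, swapping the two lists transposes the matrix, which is the same relationship the GY and YG tables show.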

This table provides a measure of how much of Yahoo's responses are reproduced by Google.
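| YG: Yahoo (rows) vs. Google (columns) | Top 5 | Top 10 | Top 20 |
| --- | --- | --- | --- |
| Top 5 | 1 | 2 | 2 |
| Top 10 | 2 | 3 | 4 |
| Top 20 | 2 | 3 | 4 |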

This table provides a measure of how much of Bloglines' responses are reproduced by Google Blog Search.
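| BG: Bloglines (rows) vs. Google Blog (columns) | Top 5 | Top 10 | Top 20 |
| --- | --- | --- | --- |
| Top 5 | 0 | 0 | 0 |
| Top 10 | 0 | 1 | 2 |
| Top 20 | 0 | 1 | 2 |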

This table provides a measure of how much of Google Blog Search's responses are reproduced by Bloglines.
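| GB: Google Blog (rows) vs. Bloglines (columns) | Top 5 | Top 10 | Top 20 |
| --- | --- | --- | --- |
| Top 5 | 0 | 0 | 0 |
| Top 10 | 0 | 1 | 1 |
| Top 20 | 0 | 2 | 2 |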

Results
Web search
| Web search | Precision: Live | Precision: Google | Precision: Yahoo | Overlap: L/G | Overlap: L/Y | Overlap: G/Y | All: L/G/Y | Precision_{Google} - Precision_{Yahoo} |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mean | 42.78 | 54.44 | 51.67 | 18.33 | 20 | 20.56 | 10 | 2.78 |
| Median | 42.5 | 57.5 | 52.5 | 20 | 20 | 20 | 10 | 5 |
| Mode | 15 | 70 | 70 | 10 | 10 | 25 | 10 | 10 |
| Std. Dev. | 22.77 | 20.07 | 22.43 | 9.549 | 11.38 | 7.838 | 7.475 | 14.17 |
| N | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 18 |

In the above table, I calculated the mean, median, mode, and standard deviation for the precision and overlap of the search engines. I also added a column for the precision of Google minus the precision of Yahoo. I chose these two search engines because they had a higher average precision than Live Search. I then used this column to perform a hypothesis test to determine whether there was sufficient evidence to conclude that Google is more precise than Yahoo. My null hypothesis is that μ_{d} is less than or equal to zero, and the alternative hypothesis is that μ_{d} is greater than zero, where μ_{d} is the mean of the differences. I chose a significance level (alpha) of 0.025 and calculated the t-statistic to be 0.83, which does not fall within the rejection region of greater than 2.110 (the critical value for 17 degrees of freedom). Therefore, I fail to reject the null hypothesis. There is insufficient statistical evidence to conclude that the mean difference between the precisions of Google and Yahoo is greater than zero (that is, no evidence that Google has a higher precision).
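The t-statistic for this test can be reproduced from the summary statistics in the difference column. The sketch below assumes a one-sample t-test on the paired differences (mean 2.78, standard deviation 14.17, N = 18). The function name `paired_t_statistic` is hypothetical, and the critical value 2.110 is the one quoted in the text.

```python
import math


def paired_t_statistic(mean_diff, sd_diff, n):
    """t = mean of the differences divided by its standard error."""
    standard_error = sd_diff / math.sqrt(n)
    return mean_diff / standard_error


# Summary statistics of Precision_{Google} - Precision_{Yahoo} from the table.
t_web = paired_t_statistic(mean_diff=2.78, sd_diff=14.17, n=18)

# One-tailed critical value at alpha = 0.025 with 17 degrees of freedom.
critical_value = 2.110
reject_null_web = t_web > critical_value

print(round(t_web, 2), reject_null_web)  # prints: 0.83 False
```

Because only the summary statistics are needed, the test can be checked without the 18 individual observations.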


GY

| Statistic | o(5,5) | o(10,5) | o(20,5) | o(5,10) | o(10,10) | o(20,10) | o(5,20) | o(10,20) | o(20,20) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mean | 1.0588 | 1.3529 | 1.6471 | 1.2941 | 2 | 2.6471 | 1.6471 | 2.4706 | 3.7059 |
| Median | 1 | 1 | 2 | 1 | 2 | 3 | 1 | 3 | 4 |
| Mode | 1 | 0 | 0 | 1 | 1 | 4 | 1 | 3 | 5 |
| Std. Dev. | 1.1974 | 1.3201 | 1.4116 | 1.2127 | 1.3229 | 1.7299 | 1.2217 | 1.5459 | 2.1144 |
| N | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |

YG

| Statistic | o(5,5) | o(10,5) | o(20,5) | o(5,10) | o(10,10) | o(20,10) | o(5,20) | o(10,20) | o(20,20) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mean | 1.0588 | 1.1765 | 1.6471 | 1.4706 | 1.9412 | 2.4706 | 1.8824 | 2.6471 | 3.7647 |
| Median | 1 | 1 | 1 | 1 | 2 | 3 | 2 | 3 | 4 |
| Mode | 1 | 0 | 1 | 1 | 3 | 3 | 1 | 4 | 5 |
| Std. Dev. | 1.1974 | 1.2862 | 1.3666 | 1.2307 | 1.3906 | 1.5858 | 1.269 | 1.7299 | 2.0775 |
| N | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |

| Measure | Mean overlap |
| --- | --- |
| Top 5 results in Yahoo also appearing in Google | 1.6471 |
| Results 5-10 of Yahoo also appearing in Google | 2.6471 - 1.6471 = 1 |
| Results 10-20 of Yahoo also appearing in Google | 3.7059 - 2.6471 = 1.0588 (divided by 2 to put in terms of 5 results = 0.5294) |
| Top 5 results in Google also appearing in Yahoo | 1.6471 |
| Results 5-10 of Google also appearing in Yahoo | 2.4706 - 1.6471 = 0.8235 |
| Results 10-20 of Google also appearing in Yahoo | 3.7647 - 2.4706 = 1.2941 (divided by 2 to put in terms of 5 results = 0.64705) |

In the above tables, I calculated the mean, median, mode, and standard deviation for the overlap of rankings in Google/Yahoo and Yahoo/Google. The idea is to determine whether a top result in one search engine is more likely than a lower result to appear in the other search engine. Therefore, I used the means to calculate the average number of overlaps in the top 5, 5-10, and 10-20 results. I found that there is indeed more overlap in the top results than in the lower results: per five results, the mean overlap is about 1.65 for the top 5, between 0.82 and 1.0 for results 5-10, and between 0.53 and 0.65 for results 10-20.
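The rank-band arithmetic described here can be written as a small helper. The sketch below assumes the three inputs are the mean overlaps for the other engine's top 5, top 10, and top 20 results, each measured against the first engine's top 20. The function name `overlap_per_five` is hypothetical; the input values are the means reported in this section.

```python
def overlap_per_five(top5, top10, top20):
    """Convert cumulative mean overlaps into overlap per block of five results.

    The 10-20 band covers ten results, so it is halved to be comparable
    with the two five-result bands.
    """
    band_5_10 = top10 - top5
    band_10_20 = (top20 - top10) / 2
    return [round(top5, 4), round(band_5_10, 4), round(band_10_20, 4)]


# Yahoo results also appearing in Google's top 20.
yahoo_in_google = overlap_per_five(1.6471, 2.6471, 3.7059)

# Google results also appearing in Yahoo's top 20.
google_in_yahoo = overlap_per_five(1.6471, 2.4706, 3.7647)

print(yahoo_in_google)
print(google_in_yahoo)
```

A list that decreases from left to right means that higher-ranked results are more likely to be shared between the two engines.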
Blog search
| Blog search | Precision: Technorati | Precision: Google Blog | Precision: Bloglines | Overlap: T/G | Overlap: T/B | Overlap: G/B | All: T/G/B | Precision_{Google Blog} - Precision_{Bloglines} |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mean | 33.06 | 52.5 | 44.44 | 3.611 | 9.167 | 6.944 | 1.389 | 8.06 |
| Median | 30 | 42.5 | 47.5 | 0 | 7.5 | 5 | 0 | 10 |
| Mode | 30 | 40 | 50 | 0 | 5 | 5 | 0 | 10 |
| Std. Dev. | 21.15 | 22.18 | 14.34 | 7.031 | 7.717 | 6.449 | 3.346 | 17.75 |
| N | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 18 |

In the above table, I calculated the mean, median, mode, and standard deviation for the precision and overlap of the blog search engines. I also added a column for the precision of Google Blog minus the precision of Bloglines. I chose these two search engines because they had a higher average precision than Technorati. I then used this column to perform a hypothesis test to determine whether there was sufficient evidence to conclude that Google Blog searches are more precise than Bloglines searches. My null hypothesis is that μ_{d} is less than or equal to zero, and the alternative hypothesis is that μ_{d} is greater than zero, where μ_{d} is the mean of the differences. I chose a significance level (alpha) of 0.025 and calculated the t-statistic to be 1.93, which does not fall within the rejection region of greater than 2.110 (the critical value for 17 degrees of freedom). Therefore, I fail to reject the null hypothesis. There is insufficient statistical evidence to conclude that the mean difference between the precisions of Google Blog and Bloglines is greater than zero (that is, no evidence that Google Blog has a higher precision).
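Since 1.93 is fairly close to the cutoff of 2.110, it may help to see roughly where the one-tailed p-value falls. The sketch below is an illustration rather than part of the original analysis. It recomputes the t-statistic from the summary statistics (mean 8.06, standard deviation 17.75, N = 18) and approximates the upper-tail probability of a t distribution with 17 degrees of freedom by numerically integrating its density with Simpson's rule. The function names `t_density` and `upper_tail_probability` are hypothetical.

```python
import math


def t_density(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coefficient = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2)
    )
    return coefficient * (1 + x * x / df) ** (-(df + 1) / 2)


def upper_tail_probability(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t].

    steps must be even. The density is symmetric, so P(T > 0) = 0.5.
    """
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_density(i * h, df)
    return 0.5 - total * h / 3


n = 18
t_blog = 8.06 / (17.75 / math.sqrt(n))
p_blog = upper_tail_probability(t_blog, df=n - 1)

print(round(t_blog, 2))
print(p_blog < 0.025)  # False means the null hypothesis is not rejected at alpha = 0.025
```

The approximate p-value lands between 0.025 and 0.05, which is consistent with failing to reject at the chosen alpha of 0.025.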


GB

| Statistic | o(5,5) | o(10,5) | o(20,5) | o(5,10) | o(10,10) | o(20,10) | o(5,20) | o(10,20) | o(20,20) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mean | 0.2941 | 0.3529 | 0.4706 | 0.4118 | 0.4706 | 0.8235 | 0.7059 | 0.7647 | 1.0588 |
| Median | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Mode | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Std. Dev. | 0.4697 | 0.6063 | 0.6243 | 0.6183 | 0.7174 | 1.0146 | 0.9196 | 1.0914 | 1.1974 |
| N | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |

BG

| Statistic | o(5,5) | o(10,5) | o(20,5) | o(5,10) | o(10,10) | o(20,10) | o(5,20) | o(10,20) | o(20,20) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mean | 0.2941 | 0.3529 | 0.5882 | 0.4118 | 0.5294 | 0.8235 | 0.5294 | 0.8824 | 1.1176 |
| Median | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| Mode | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Std. Dev. | 0.4697 | 0.6063 | 0.8703 | 0.6183 | 0.7174 | 1.0744 | 0.6243 | 0.9926 | 1.1663 |
| N | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |

| Measure | Mean overlap |
| --- | --- |
| Top 5 results in Bloglines also appearing in Google Blog | 0.4706 |
| Results 5-10 of Bloglines also appearing in Google Blog | 0.8235 - 0.4706 = 0.3529 |
| Results 10-20 of Bloglines also appearing in Google Blog | 1.0588 - 0.8235 = 0.2353 (divided by 2 to put in terms of 5 results = 0.11765) |
| Top 5 results in Google Blog also appearing in Bloglines | 0.5882 |
| Results 5-10 of Google Blog also appearing in Bloglines | 0.8235 - 0.5882 = 0.2353 |
| Results 10-20 of Google Blog also appearing in Bloglines | 1.1176 - 0.8235 = 0.2941 (divided by 2 to put in terms of 5 results = 0.14705) |

In the above tables, I calculated the mean, median, mode, and standard deviation for the overlap of rankings in Google Blog/Bloglines and Bloglines/Google Blog. The idea is to determine whether a top result in one search engine is more likely than a lower result to appear in the other search engine. Therefore, I used the means to calculate the average number of overlaps in the top 5, 5-10, and 10-20 results. I found that there is indeed more overlap in the top results than in the lower results: per five results, the mean overlap is between 0.47 and 0.59 for the top 5, between 0.24 and 0.35 for results 5-10, and between 0.12 and 0.15 for results 10-20.
Discussion
Web search
Based on the data sets, I cannot conclude that any one search engine is more precise than another. My hypothesis test of whether Google searches were more precise than Yahoo searches found no statistical evidence that they were. The top results in each engine were more likely than the lower results to also appear in the other engine. Therefore, when you perform a search and look only at the top few results, you are essentially getting a lot of the same results no matter which search engine you use. I recommend that a person searching for information use the search engines interchangeably for the most part. To save time, I would not recommend running the same search in all three, because there is not much difference in results or precision. If this topic were investigated further, I would recommend a bigger sample size. The small samples used here do not represent the true data well, and they make statistical analysis more difficult because the sample is not large enough to assume normality.
Blog search
Based on the data sets, I cannot conclude that any one blog search engine is more precise than another. My hypothesis test of whether Google Blog searches were more precise than Bloglines searches found no statistical evidence that they were. The top results in each engine were more likely than the lower results to also appear in the other engine. Therefore, if you look only at the top results, you will get some of the same results across the two search engines. However, because the overlap was so low, I would recommend running a query in all of the blog search engines if time allows. If this topic were investigated further, I would recommend a few changes in the methods and approach. First, I would use bigger samples, because the small samples used here do not represent the true data well. Second, I would revise my query, because searching for a term that relates specifically to the search engine, such as "Google Chrome" in the Google Blog search engine, could distort the results somewhat.