Alexi    Terry, what was the idea behind the research, this notion of PageRank?
Terry    They started doing the research in an era when people had just begun to build search engines for the web. The web started off, erm, really the idea was, there was a bunch of interesting stuff and you browsed, you surfed. You went from page to page, saw what was there, and that was fun. Erm, and then people realised that there was enough interesting and serious stuff that they might want to actually go somewhere where they could find something they wanted. So a number of people at different places created what were called search engines. Erm, and the basic idea was that you create an index that lets you find where things are on the web.

So if you have here, and this is sort of a sketch of what it might be, web pages, each of these boxes is a page: a, b, c, and d. Each one has certain words in it: television, computer, circuit, whatever it is. And each one can have links, where the links point to another page. So this page on computers and nets may point to this one for televisions and computers, and so on.

Now, what they realised, this is before Google, with the people doing the original 'spiders', as they were called, on the web. What the spiders could do is, you could give them the address, give the computer the address of this page. The computer could make a list of all the words that were on that page and also find this page, because there was a link. Then it would go to this page, make a list of all the words on that page, and then it could follow the links there. And computers had gotten fast enough and powerful enough, and the web was small enough, that you could actually build a complete index. So you'd end up with something, think of the index in the back of a book, so the word computer appears on pages a, b, and d, the word television appears on this page, and so on.

So if I went to AltaVista, let's say, which was one of these early search engines, and I typed in computer, it would look in the index it had made and it would give me a search, a list of results that said a, b, d, and so on. And this made it possible to go find something on the web, instead of just browsing around and seeing where you got to.
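(As a rough illustration of the kind of index Terry sketches here, the short Python snippet below builds a word-to-pages index by following links from one page to the next. The pages, words, and links are invented for the example, not taken from the whiteboard sketch in the interview.)

```python
from collections import defaultdict

# Hypothetical pages for illustration: each has some words and outgoing links.
pages = {
    "a": {"words": ["computer", "network"], "links": ["b", "d"]},
    "b": {"words": ["television", "computer"], "links": ["c"]},
    "c": {"words": ["circuit"], "links": ["a"]},
    "d": {"words": ["computer", "circuit"], "links": ["b"]},
}

def crawl(start, pages):
    """Follow links from a starting page and build a word -> pages index,
    like the index in the back of a book."""
    index = defaultdict(set)
    seen, frontier = set(), [start]
    while frontier:
        page = frontier.pop()
        if page in seen:
            continue
        seen.add(page)
        for word in pages[page]["words"]:
            index[word].add(page)              # record which pages each word appears on
        frontier.extend(pages[page]["links"])  # follow the links onward
    return index

index = crawl("a", pages)
print(sorted(index["computer"]))  # ['a', 'b', 'd'] -- the pages that mention "computer"
```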
Alexi    But the problem of course is that if somebody said 'computer' a thousand times, because that was the keyword being searched, it would push the result up, and it wouldn't necessarily be the most...
Terry    Exactly, so they have to decide: if there are three results, it's no problem, but if there are a hundred results or a thousand results, which ones do you show? And how do you know that a is more interesting than d, or b is more interesting than d? So the question of what was interesting, what was relevant, wasn't addressed by having just a regular index like this. So the problem really, here's where the founders of Google came in. Sergey and Larry decided that they could do a better job of finding the interestingness, the relevance, what makes a page something you want to see, other than just that it happens to have the words that you searched for.
Alexi    And how did they go about identifying interestingness? Because that's a very subjective idea, isn't it?
Terry    So interestingness is of course subjective, and there is no, what places like Yahoo did is, they had human beings go through and say, here's an interesting page, here's an interesting page. That was the, the people, Yahoo was the most famous one, but there were a lot of people in that era who would go through and check out pages. And again, that worked when the web was very small.
Alexi    Exactly, that would not scale.
Terry    And as the web gets bigger, you can't hire people to go out and look at all the pages. So the question is, how do you get people who you don't hire to, in some sense, give you judgements on which pages are interesting? And they had a very interesting sort of metaphor for this, which is: imagine a crowd of people all surfing the internet. So you take millions of people, start them out all over the internet, and they get to a page and they'll follow a link, and from there maybe they'll follow another link. Now if you could actually get millions of people and all the paths they take, you would see that traffic would end up concentrating on certain places. A lot of people would end up here on this page and only a few people went to this page. Then when you got around to giving your search results, you would give the ones that got a lot of this virtual traffic. Now this is not actual people going, because you don't have millions of people, you don't have data on that. But you can imagine where they would go.
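(A rough sketch of that thought experiment, assuming an invented four-page link structure: drop a crowd of imaginary surfers onto random pages, let each follow random links with an occasional restart, and count where the traffic piles up.)

```python
import random
from collections import Counter

# Invented link structure, purely for illustration.
links = {
    "a": ["b", "d"],
    "b": ["c"],
    "c": ["a", "b"],
    "d": ["b"],
}

def simulate(links, surfers=10_000, steps=20, restart=0.15):
    """Count page visits by imaginary random surfers; 'restart' is the chance
    of jumping to a random page, so nobody gets stuck in a dead end."""
    visits = Counter()
    pages = list(links)
    for _ in range(surfers):
        page = random.choice(pages)
        for _ in range(steps):
            visits[page] += 1
            if random.random() < restart or not links[page]:
                page = random.choice(pages)        # start over somewhere random
            else:
                page = random.choice(links[page])  # follow one of the outgoing links
    return visits

print(simulate(links).most_common())  # pages with the most simulated traffic come first
```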
Alexi    So if we kind of take this outside of the web, this would be like places in a city that have a lot of people driving through them, for example, a particular junction, or an important building, or something like that. That's what these websites, that's what the search algorithm identified?
Terry    That's what decided what's the most relevant, what's the most interesting. So, there's no, there is no simple way to actually get that data, because the people who know where other people went on the web are only the service providers, and they don't give out that information. But what they realised is, if they used links, they could get an approximation of how interesting pages were. So they built a second index, which not only kept track of what words were on each page, but where it was linked from. So you might see here that page b has a link coming in from a, and a link coming in from x.
So they actually had information that gave them the full link structure of the web, where every link goes from and to. Then they could take this and they applied a mathematical algorithm, called the PageRank algorithm, which was intended to basically simulate, in some sense, the result of what would happen if you had an infinite number of monkeys, if you put thousands, millions and millions of people on the web and let them just start browsing. And the result that they could get out of running this algorithm, which of course didn't require millions and billions of things going on, erm, was a good approximation that page b, let's say, is the one that would get the most traffic of a, b, and d. So then when you search for computer, it brings b to the top of your listing.
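(The sketch below is a simplified PageRank-style calculation over an invented link structure, not Google's actual code: starting from the full map of which pages link where, it repeatedly passes each page's score along its outgoing links until the scores settle, approximating the surfer traffic without simulating anyone.)

```python
# Invented link structure; b has the most incoming links (from a, c, d and x).
out_links = {
    "a": ["b", "d"],
    "b": ["c"],
    "c": ["a", "b"],
    "d": ["b"],
    "x": ["b"],
}

def pagerank(out_links, damping=0.85, iterations=50):
    """Repeatedly redistribute each page's score along its outgoing links
    until the scores settle; 'damping' plays the role of the random restart."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}                 # start everyone off equal
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, targets in out_links.items():
            targets = targets or pages                 # a dead end shares with everyone
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share              # pass score along each link
        rank = new_rank
    return rank

for page, score in sorted(pagerank(out_links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))                       # b should come out on top
```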
Alexi    So if a page had a lot of people going to it or referencing it, then that would increase its interestingness, it would increase its reputation?
Terry    It's a little bit like in academics, where you have citations. So I write an academic paper and I say, see so-and-so's paper from such-and-such year. That indicates that that's an interesting paper. And it's sort of the same thing here: if you have lots of links pointing to you, that indicates that a lot of people have decided you're interesting enough to put in a link pointing to you. So that's really the basis of the algorithm.