Linux Software

Juggernaut GPLd Search Engine 86

real bio pointed us to Juggernautsearch, which actually looks interesting. It's GPL'd. It can index 800 million pages every 3 months and deliver 10 million pages a day on a Pentium II. So I guess if you want to run your own Altavista, you can.
This discussion has been archived. No new comments can be posted.

  • The server weeds out the doubles...

    I'm wondering if this would be THE weak link in this idea. It seems to me that, given the speed at which the spidering engines work, you'd need a huge number of processors at the "server" level just to eliminate the doubles. I use a freeware tool which collects data from 10-12 different search engines and attempts to eliminate the duplicates, and I still get the same page numerous times from differently sourced origin points, not even counting sites which are mirrored.

    Any thoughts out there on how to solve this problem?
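
    One obvious starting point (just a rough sketch, nothing from Juggernaut's code; the canonicalization rules and MD5 fingerprinting are illustrative choices): canonicalize URLs before comparing them, and fingerprint page bodies so mirrored copies collapse to one entry.

    # Rough sketch of server-side duplicate elimination; the heuristics
    # here are illustrative, not taken from Juggernaut itself.
    import hashlib
    from urllib.parse import urlsplit, urlunsplit

    seen_urls = set()      # canonical URLs already queued or fetched
    seen_bodies = set()    # fingerprints of page content, to catch mirrors

    def canonicalize(url):
        """Normalize a URL so trivial variants compare equal."""
        parts = urlsplit(url.strip())
        host = (parts.hostname or "").lower()
        path = parts.path or "/"
        return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

    def is_new_url(url):
        key = canonicalize(url)
        if key in seen_urls:
            return False
        seen_urls.add(key)
        return True

    def is_new_content(body_bytes):
        digest = hashlib.md5(body_bytes).hexdigest()
        if digest in seen_bodies:
            return False
        seen_bodies.add(digest)
        return True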

  • Lycos. AltaVista. Yahoo.

    These are all big-name, big-money companies living on borrowed time.

    Just as every ISP provides DNS, mail and usenet services to their clients, the time is rapidly approaching when they will provide search/indexing services based on open industry standards. Products that integrate the search process into the OS, like Copernic or Apple's Sherlock, are a clear indication of where the technology will go.

    All it takes is a co-operatively networked "juggernaut search" system, the logical successor/complement to DNS, to topple the search/portal companies.

    SoupIsGood Food
  • We use htdig for the internal network for the precise reason that we want it to only return linked directories, trees, and files.

    And most crawlers can easily be limited to a particular site, or set of sites. Even wget does that.
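
    For illustration only (this is not ht://Dig's or wget's code, just a toy sketch in Python; the start URL and page limit are placeholders): staying on one site boils down to filtering discovered links by hostname before queueing them.

    # Toy same-site crawler: follows links only within the starting host.
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlsplit
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += [v for k, v in attrs if k == "href" and v]

    def crawl_site(start, max_pages=50):
        allowed_host = urlsplit(start).hostname
        queue, seen = [start], set()
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen or urlsplit(url).hostname != allowed_host:
                continue                  # skip duplicates and off-site links
            seen.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue
            parser = LinkParser()
            parser.feed(html)
            queue += [urljoin(url, link) for link in parser.links]
        return seen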
  • Did you even read the page? It's a demo version; you're searching a minimal subset of their database.

    Moderate this down, please.

  • I believe Excite was the first company to come up with this idea, when it was Architext Inc. They used a bit of distributed crawling on Stanford owned Sun boxes in off-hours.

    The idea of a client reporting crawling is interesting, but I have two issues with it. First, it's essentially what is being done with services like AllAdvantage or free ISPs. They may monitor the sites you go to in order to build a database for advertisers, instead of for searching. The second issue is that net surfing would be bogged down considerably unless there was high bandwidth for the project.

    One way someone could do this though, is to create an open proxy server on a big pipe, which would log all the sites users went to. This would be voluntary of course, and the database of sites could be added to the findings of a crawling bot.
    Food for thought...
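
    As a very rough sketch of that proxy idea (the port, log file name, and GET-only handling with no HTTPS support are all arbitrary simplifications, not a real design):

    # Bare-bones logging HTTP proxy (GET only, no CONNECT/HTTPS support).
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
    from urllib.request import urlopen

    LOG_PATH = "visited_urls.log"     # hypothetical log file

    class LoggingProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            # In proxy mode the request line carries an absolute URL.
            with open(LOG_PATH, "a") as log:
                log.write(self.path + "\n")
            try:
                upstream = urlopen(self.path, timeout=15)
                body = upstream.read()
                self.send_response(upstream.status)
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            except OSError:
                self.send_error(502)

    if __name__ == "__main__":
        ThreadingHTTPServer(("", 8888), LoggingProxy).serve_forever()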

  • Like Besilent [besilent.com]? That might be a good idea, actually.

  • by Anonymous Coward
    Having just finished a research project on the state of the marketplace in search engines and related technology, I'll be interested to see how Juggernaut stands up in practice. For anyone interested in the best of what's already out there for free, at least under some conditions, I suggest checking out the following:

    - the IB project at Basis Systeme netzwerk [bsn.com],

    - the former commercial products from PLS [pls.com] that AOL is now giving away,

    - the ASF [etymon.com] project,

    - the Webglimpse [webglimpse.net] pages,

    - the pages for the mg [rmit.edu.au] system.

    For a comprehensive presentation on the subject, see the searchtools [searchtools.com] site.

  • by tgd ( 2822 ) on Thursday December 09, 1999 @11:53AM (#1472665)
    I'm sure I'm not the only one who believes that the top-level stories on here should have moderation on them too.

    I mean, really! This search engine hardly works at all, only the search part is free (and that's the no-brainer part of any search engine), it certainly doesn't index 800 million pages (I rarely got any results on any queries) and yet they still appear on here like some news item.

    Did they pay slashdot? Are they a major stockholder now? What's the deal? Or was a story once again posted without being checked first?

    Give me seven million dollars, I'll double check my stories...
  • For the love of god, LAUNCH RIGHT! Don't say you can index 800 million pages in three months when your database gives fewer results than Lycos circa 1996.

    Puts me in mind of GPLTrans. "Tests have shown it's more reliable than Babelfish and InterTran," they claimed. Which was actually correct, provided you only want to translate the one sentence they tested it with :-).

  • Naw. The Whiner's a lurker on the MTK boards. Probably picked it up from me when I ranted about the topic a few months ago.

    SoupIsGood Foof
  • Did you even read the page? It's a demo version; you're searching a minimal subset of their database.

    Clearly not obvious to the casual observer, and the entire page just doesn't reflect the claimed quality of the engine itself.

    It's a botched launch, and right after GPLTrans too.

    Yours Truly,

    Dan Kaminsky
    DoxPara Research
    http://www.doxpara.com
  • Yea, great it's Op So and all, but it's tough to beat Google. Case in point, I run a small site out of my apartment (for about a month now). I have yet to do any search engine placement or promotion, other than meta tagging. A search on Google with my full site name returns a page full of my pages. Also, if you search Google for "Wah" my page just makes it in there at the bottom.

    Google rocks!!
  • Keep in mind that they probably filter the pages that they store in their engine, or maybe have it limited for that frontend (though I doubt that). Though it may not be such a great engine for a general-user search-all task, if it were localized to, say, a student home page directory, it would be a whole different story. And searching through a hundred thousand pages wouldn't take nearly as long as a few million. But hey, it's a start.
  • well:
    800,000,000 pages/3 months =

    266,666,666 pages/1 month =
    8,888,888 pages/day =
    370,370 pages/hour =
    6172 pages/minute =
    103 pages/second

    Assuming ~3K of indexable text per page, that's roughly a 2.5 Mbit/sec connection.

    If those claims hold up it's about 1.5 orders of magnitude faster than htdig, assuming you can keep it fed.

    bumppo
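
    The same back-of-the-envelope numbers, for anyone who wants to vary the page-size assumption:

    # Back-of-the-envelope check of the claimed rate (90-day quarter,
    # ~3 KB of indexable text per page, as assumed above).
    pages = 800_000_000
    seconds = 90 * 24 * 60 * 60
    per_second = pages / seconds
    bits_per_second = per_second * 3 * 1024 * 8
    print(f"{per_second:.0f} pages/sec")            # ~103
    print(f"{bits_per_second / 1e6:.1f} Mbit/sec")  # ~2.5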

  • Google rocks!!

    No really. I can't even use my work's internal search engine anymore--I use Google, which finds more useful content in the public documentation than our lousy engine can find in the private!

    Yours Truly,

    Dan Kaminsky
    DoxPara Research
    http://www.doxpara.com
  • give gamers some credit. It's not like it used to be, but you still have to have a pretty good knowledge of your system to get the most out of gaming. Most of what I learned about computers (and perhaps why my Linux is weak..) was learned figuring out how to make various games work. In the win3/dos4gw gaming era, if you didn't know how to reconfigure memory you couldn't play half the games.

    Not to mention the hardware expertise that comes with 10 years of upgrades.

    There need to be more games on Linux; it helps draw in that next generation of hackers (ones that will start using *real* OSes at 12). If you really want the tide to turn (faster, I guess), support Linux gaming.
  • Thanks for the compliment. I'll agree that ht://Dig is more suitable for medium-sized collections, but "medium" in this case is starting to stretch to over a million URLs. In many cases, it's simply an issue of system resources--to index 250 million URLs like Google's first set, or 800 million like Juggernaut claims, you need some pretty big iron, at least in terms of RAM and RAID arrays.

    We'd obviously love feedback on how well it scales, since we rarely get such reports. It's an area that we'd like to improve (since many of the developers don't run "mini-Altavistas" themselves).

    I haven't been able to check out the Juggernaut code since it's heavily slashdotted right now. But suffice to say, we'll be checking out whatever code they've made available to see if there are any interesting optimizations.
  • by jfunk ( 33224 ) <jfunk@roadrunner.nf.net> on Thursday December 09, 1999 @05:49AM (#1472675) Homepage
    A previous discussion here incited this:

    http://www.dizz.net/ [dizz.net]

    Basically, we need to get down exactly what to do and how to do it. More developers would be nice too...

    Here's part of one of my messages on the list:

    "The servers can perform database updating/maintenance and may also run client software itself. The client software sends it's finished "work units" to it's designated server. The servers assign IP addresses to be indexed to each client. Say a client is indexing in Australia and hits a link located in New York. The client will tell it's server about the link (and any other non-local ones) which will send them to the server nearest each link. The New York server sends a work unit to an arbitrary client waiting for links to index. It indexes, so on, etc. The cycle continues."


    You can get on the list at http://www.egroups.com/group/dizz-net [egroups.com].
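
    Not dizz.net code, just a toy rendition of the scheme quoted above: each server queues hosts for its own region and forwards "non-local" links to the peer nearest them. The TLD-based routing rule, region names, and example hostnames are all invented for illustration.

    # Toy work-unit routing for the quoted design.
    from collections import deque

    class RegionServer:
        def __init__(self, region, peers):
            self.region = region
            self.peers = peers        # region name -> RegionServer
            self.pending = deque()    # hosts waiting for an idle client

        def nearest_region(self, host):
            tld = host.rsplit(".", 1)[-1]     # placeholder "nearness" rule
            return tld if tld in self.peers else self.region

        def submit_links(self, hosts):
            for host in hosts:
                target = self.nearest_region(host)
                if target == self.region:
                    self.pending.append(host)
                else:
                    self.peers[target].pending.append(host)

        def next_work_unit(self):
            return self.pending.popleft() if self.pending else None

    servers = {}
    servers["au"] = RegionServer("au", servers)
    servers["us"] = RegionServer("us", servers)
    # A client crawling in Australia reports a US link; it gets routed
    # to the "us" server's queue (hostnames are made up).
    servers["au"].submit_links(["www.example.com.au", "www.example.us"])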
  • Have they come out with a search engine yet that, before giving you the results for your keyword searches, TAKES OUT the 404 errors? That would be something nice. Do your keyword search, and then have the search engine check each and every link to see if there's a 404 or whatever, and if there is, take it out of the results before it hands them over, and save the results for next time in case someone else does the same search.
  • Not possible, if the search is supposed to respond quickly. The machine would have to make outgoing requests to each result page to check whether it still exists; even done in parallel, that's going to be quite a performance hit. At a minimum, you won't get a response until the slowest website found has responded.
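
    Roughly what that looks like in practice (just a sketch; the timeout and thread count are arbitrary): even with the checks run in parallel, the response waits for the slowest site, or you cap it with a timeout and accept some stale links.

    # Query-time dead-link filtering: the latency floor is the slowest
    # HEAD request, bounded only by the timeout you are willing to eat.
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import Request, urlopen

    def is_alive(url, timeout=3.0):
        try:
            urlopen(Request(url, method="HEAD"), timeout=timeout)
            return True
        except OSError:               # HTTP errors, DNS failures, timeouts
            return False

    def filter_dead_links(result_urls):
        with ThreadPoolExecutor(max_workers=20) as pool:
            alive = list(pool.map(is_alive, result_urls))
        return [url for url, ok in zip(result_urls, alive) if ok]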

  • I think many of us would agree that it is not GPL in spirit. But nonetheless, once they've made the code available that doesn't mean folks won't be reverse-engineering their database and projects like ht://Dig won't be examining what code was released.

    Ultimately, their method of business may change in unexpected ways. Let's say someone reverse-engineers their database. Suddenly their revenue stream will disappear (unless they have some sort of patent, but that's another story). So they'll have to make money on support and/or hosting the indexing/searching for people w/o the hardware.

    Let's not look a gift horse in the mouth. Ultimately the community will derive benefit from this code, either through cross-pollination with projects like ht://Dig, or simply by getting people interested in the concept of an open-source version of large search engines.
  • I consider the claims of indexing 800 million pages to be somewhat exaggerated, since no one else is doing this at this time, simply because the hardware required to do so is so intensive. The figure is no doubt based entirely on an "estimate" performed on a much smaller sample and extrapolated - and these are often wrong.

  • Yes sorry, that wasn't the reason. I should have checked before posting. This is the reason [distributed.net].
    An extract:

    Databases, by definition, mean dealing with huge amounts of data. They also often contain very small computational requirements (although this is not always the case). This means that the bottleneck for database operations usually isn't CPU horsepower, but disk bandwidth. This means that distributed.net would be ill suited to help.
  • Inktomi http://www.in isn't made to run on zillions of coordinated home machines, but it is made to be scalable and run on a network of spider machines. They've since branched out to cache servers and other businesses, but the search engine was their initial project.


    And other posters have pointed out Harvest.

  • But to make it do that, you'd want the crawler part, not just the database searcher, and from what everyone else is saying, the crawler is the part that isn't GPL'ed and isn't free. I don't think it's likely worth whatever they're asking for the crawler just to fulfill that simple need. The free part just searches the local database generated by the crawler, or purchased from Juggernaut, when you ask it to. Won't generate browsing-like traffic over your connection.
  • But if this were integrated into a browser... you would save bandwidth, because lots of pages would be evaluated without an extra download; the page is loaded anyway.

    Of course this idea will raise questions about privacy and such, and the most popular pages would get evaluated most frequently (which isn't such a bad feature actually; some search engines work this way, indexing cnn.com multiple times a day but putting mypage.com on a lower priority).

  • I think you're right -- that would probably work. If I'm understanding you right, the spider would return two lists, the "intra-site" links and the "extra-site" links, and the server/servers storing the data would only have to check the "outside" links.

    Would this work for database driven sites as well?
  • I pressed a button that said "you are 1 point beneath your current threshold" and the preceding message popped up. Did I do that? If I did, sorry. I'm kinda new here.
  • Sorry. I pay 19c/mb recvd. No way in hell I'm gonna participate. :-)
  • Anyone tried it yet? Having a strong, open-sourced search engine would be a tremendous boon to institutions on a tight budget. We have a reasonably large webspace here and we're always watching for effective ways to make the whole thing searchable.
    --
    A little song, a little dance, a little seltzer down your pants.
  • by Effugas ( 2378 ) on Thursday December 09, 1999 @04:37AM (#1472691) Homepage
    For the love of god, LAUNCH RIGHT!

    Don't say you can index 800 million pages in three months when your database gives fewer results than Lycos circa 1996.

    Hyperbole is rife in the computer world in general, and it's one of the genuine strengths of the Open Source community that we're very results oriented--Apache gets *results*. Samba *works*, and actually *does* knock NT out of the park in terms of flexibility and feature sets. And so on.

    There are exceptions, granted, but we don't stretch our credibility to the breaking point nearly as much as stock-price-manipu^H^H^H^H^H^Hmaximizing corporations practically have to.

    My problem with Juggernaut is that, while their technology might be awesome, their online index *isn't*. When you don't even get enough hits back to compare whether the hits are delivered in an optimum order, you know there's a problem. That, combined with the fact that the site looks decidedly 1996-ish (sorry, I know there's a webmaster out there who doesn't like me right now), tarnishes the otherwise excellent announcement that we now ostensibly (pending testing) have an extremely high-quantity and high-quality search engine system, not to mention the birth of a new business model--the internal search engine of external content.

    Honestly, I must admit there's something to be said about companies purchasing internal versions of large search engines, just so no outside source can watch the unencrypted stream of queries coming from a given company to deduce what projects they're working on.

    The Juggernaut guys may be on to something, but I'm still a Google addict.

    Yours Truly,

    Dan Kaminsky
    DoxPara Research
    http://www.doxpara.com
  • by madHomer ( 2207 ) on Thursday December 09, 1999 @04:39AM (#1472693)
    From their demo, I am not too impressed with the search at all. It seems to be lacking many advanced options. Also, what is up with this??

    >>
    first fully automated crawler that can reindex all 800 million World Wide Web pages every three months fully available to the public for a nominal two year subscription fee.


    Does that mean that they give away the search engine but you have to purchase the database???

    I think that there are better options out there right now. One GPL'd search engine out there that I have liked a lot is HTDIG (http://www.htdig.org). It does not have the horsepower that juggernautsearch "claims", but it is great for intranet/corporate/university website search.

    If you are looking for a good search engine, you may also want to read the ask slashdot thread from last year on this topic. (http://slashdot.org/askslashdot/98/10/24/1756224.shtml)
  • by rde ( 17364 )
    I typed in 'jj thompson' to see whether it would find my page about the legendary physicist (it's indexed by most engines). It didn't bother returning any matches, or even a 'no matches' message. And it's the most horrible page I've seen in a while. Green on pink text? Yeuch.
  • Interesting... I just looked at their site. The "advanced" search does not seem to be any different than the "standard" search, and the database they have on-line appears to only be a fraction of the web.

    I'd like to see a site that really uses the software to see if it is any good. Also I noticed that they seem to be selling a URL list for the search engine... It would seem to me that the engine should be able to find its own URL's...

  • All we need now is for someone to write a GPL crawler, since we obviously have the file format for the database.

    Anybody want to contribute one ?
  • From what I read here [juggernautsearch.com] and here [165.90.48.2]: the "Juggernaut search Engine" and the "Juggernaut Search Engine Crawler" are two separate pieces of software. The former is GPLed. The latter is not for sale but you can purchase the database it creates (or get a demo/sampler subset of the database for free)





    -
    "I am not trying to prove that I am right... I am only trying to find out whether." -Bertolt Brecht
  • I'll take them at their word on the performance aspect. However, my brief test of the engine disturbed me a bit. It is not obvious to me where the results of a search appear. Cuz it ain't on the homepage. Speed and coverage are key, but they mean nothing without results.
  • by Dilbert_ ( 17488 ) on Thursday December 09, 1999 @04:36AM (#1472700) Homepage
    I have been wondering for a while now: couldn't building the index for such a search engine be distributed (like SETI@HOME or RC5)? The server would do the actual page serving, querying etc., but the spidering would be done by the clients. They'd each receive a batch of URLs from the server and start indexing them, collecting lists of URLs and sending those back to the server. The server weeds out the doubles, and assigns those URLs to the clients again. The more people participated, the bigger the index would grow, as the available bandwidth would also increase.

    Hmmm... maybe I should patent this...
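
    A client for that scheme could be as simple as the loop below; the coordinator URL, endpoint names, and message format are invented purely for the sake of the sketch.

    # Client side of the idea above: pull a batch of URLs, index them
    # locally, and send back the word lists plus discovered links.
    import json
    import re
    from urllib.request import Request, urlopen

    SERVER = "http://coordinator.example.org"    # hypothetical coordinator

    def fetch_batch():
        with urlopen(SERVER + "/work") as resp:
            return json.load(resp)               # e.g. {"urls": [...]}

    def index_page(url):
        html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        words = sorted(set(re.findall(r"[a-z0-9]+", html.lower())))
        links = re.findall(r'href="([^"]+)"', html)
        return {"url": url, "words": words, "links": links}

    def run_once():
        results = []
        for url in fetch_batch().get("urls", []):
            try:
                results.append(index_page(url))
            except OSError:
                continue                         # skip unreachable pages
        payload = json.dumps(results).encode("utf-8")
        urlopen(Request(SERVER + "/results", data=payload,
                        headers={"Content-Type": "application/json"}))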

  • Sounds nice but isn't the bigger concern for something like that the amount of bandwidth it would use to catalog all those pages?

    Sure it will run on a PII but does it need a T3 to run efficiently?

    Josh
  • This is a really great thing, and here's why I think so:

    You can run your own altavista. . . and as the open source 'canon' grows, folks will also be able to have an amazon.com, a slashdot, and whatever else you want to do on the Web.

    But why just the Web? With enough open-source game engines, applications, and other code to build on . . .

    Well, just imagine what happens when the first Open Source 'killer App' is released. (Not that sendmail, apache, and others aren't already -- I'm talking userland, here.) What if the Next Big Computer Game was Open Source? How many zillions would install Linux to play it?

    What if Open Source was suddenly the dominant software paradigm?

    Can I just say, 'Oh, YEAH!'?

    -Omar

  • I checked out the search engine. I would think that if they are selling a robot that claims to be able to index the entire web every three months they would have an online database to prove it.

    try searching on slashdot. You get one link which is at least 2 years old :)

    Dazzle them with bullshit.

  • The green-on-pink text in the search box reads "Try One or Two Keywords in ANY International Language". So I, being a Typical American, tried English. Single English words (or any single search term, for that matter) work fine. However, using two keywords (be they my name, "Microsoft Windows," "carrot cake," etc., etc., etc.) just returns the home page all over again.

    So you can only search on one keyword at a time, it has a butt-ugly page, it doesn't return relevant links, and it has a horrible domain name to boot. What a waste.

    Oh wait, it's GPL'ed! Hooray! Down with the software monopolies! We'll take over the world!

    Groan...
  • Well, a search for "Timberwolves" (as in the MN Timberwolves) returned 2 matches while a search for "Linux" returned only 469, and compared to other engines it is slow (a couple of seconds per search). Google returned 21600 and 829000 matches respectively, both in under a second. I am sure that Google is probably running on more/faster machines, but I'm not going to do anything in one place if I can do it quicker somewhere else. If they expect anyone to use it, they better get to indexing those 800 million pages every 3 months, or they won't be serving more than 10 searches a day. And I'll second the motion that the interface sucks. Needs a good deal of work.
  • >How many zillions would install Linux to play it?

    no zillions. Linux is hard, people. Maybe not to the regular /. population. But to the mass games playing market it is. Damn tricky stuff. To most minds a game, no matter how great, is not worth learning and installing a whole new OS, especially one like linux.
  • I'm obviously a bit biased, but there *are* strong, open-sourced search engines. Try ht://Dig for example www.htdig.org [htdig.org] or if you don't like that, you should check out the excellent SearchTools.com [searchtools.com] website. Cheers, -Geoff
  • http://www.htdig.org/ [htdig.org] is a GPL'd search engine that will crawl your site. It can go a certain depth, or start from any given page (like a site index page). We use it at saintjoe.edu [saintjoe.edu] and it works wonderful for everything we need.

    We have the indexer running on a cron job twice a week during the middle of the night. It does kinda screw up webalizer results, but you can work around that.

    There's also one called glimpse, but my experience with that a few years ago showed it to not be as useful as htdig. Things might have changed, though, and YMMV.

  • true - but most people aren't "gamers"; they're just people who play games. A big difference. Sure, you and I may have been inside the back of our machines with screwdrivers more often than we'd like...but that would scare most people half to death. fdisk might be your friend, but most people are scared to defrag in case it breaks something.

    getting off topic here, alas

  • Has anyone mirrored the FTP site yet? I'm downloading at 4.2 k/second.... and this is at work, where I'm more used to 150+ k/second at this time of day.... I'm very eager to check out the database format, but it seems I need to first download a 50 meg file...
  • Yes, exactly the problem. With this sort of distributed app, the bottleneck is usually just moving the data around -- not the computational analysis. Distributed.net is based on a centralized server feeding chunks of data to thousands of clients. A distributed search engine is the reverse: clients crawl somewhat independently around the web, analyze the data, and then send summarized information up the hierarchy. See dizz.net [dizz.net]
  • How can it index the entire "800 million webpages" out there and only find 19 hits for "xml"? And the Specification wasn't one of them.
    Not one single hit for "wide open beavers"! And the colors are just awful.
  • by Anonymous Coward on Thursday December 09, 1999 @04:50AM (#1472718)
    But why is it that when I search for "ugly webpage" I get the Juggernaut Technical Support page?

    Oh, I get it, I got EXACTLY what I searched for! :-)
  • by gbnewby ( 74175 ) on Thursday December 09, 1999 @04:51AM (#1472719) Homepage
    Need a free search engine? Try ht://dig [htdig.org]. It's been around awhile, and is stable and highly configurable. It includes a spider, but is more suitable for medium sized collections, not the whole Web.

    Examination of their ftp distribution site reveals this is an early work in progress...most docs are "under construction," and even their helpers.txt (supposedly giving credit to others) is basically empty.

    I'll post more if/when their src tarball ever finishes downloading (54M - whew!...and the site is getting /.'ed right now). My guess is they drew heavily from ht://dig, WAIS, SMART and other public-source search engines and spiders.

    For those who can't get through to the site: they hope to sell subscriptions to their database, so that you can run their search engine internally. It's not clear whether they intend to license the spider/crawler or just the database.

    Meanwhile, to those who have complained that easy searches turn up with nil results: read the page, dudes! It says clearly that you're searching a minimal test collection, but can search the whole thing (on your local system, seems like) for a subscription fee.

    Credibility break: I'm an information science professor and design/evaluate alternate information retrieval systems.

  • Honestly, the search engine is pretty poor in quality right now. It seems to have only indexed a small proportion of publicly accessible web pages. BUT, to paraphrase another poster's idea, the open-source nature of the project could be profitably combined with the whole Beowulf cluster and distributed computing concepts, such that a far greater number of web sites could be indexed than previously possible. Moreover, with distributed spidering, more in-depth analysis can be run on the text contained in the pages. However, I do see 3 major problems with this:

    -The bandwidth that most users have is not commensurate with their processing capability
    -The index might become stale if a site is not visited repeatedly, but in the distributed spidering case, this risks either duplication of work or gaps where nobody visited the pages in a large web site
    -The ability to rank pages based on "relevance" or "linkability" (ala Google) is decreased in this scenario


    --
    Flames? Think I'm a karma whore?
  • The Juggernautsearch Engine crawler is the first fully automated crawler that can reindex all 800 million World Wide Web pages every three months fully available to the public for a nominal two year subscription fee.


    With the search engine being GPLed, it still relies on a subscription service in order for it to function. It mentions nothing about the crawler needed to create the database, but it also mentions that you are free to create your own database. Is it just me, or is this a contradiction?

    For the smallest subscription it gives 1.6 million urls at $100 a year. This price goes up to $500 for 10 Million urls.

    For such a useful program, it is limiting itself to its own database which costs money to use.

    Just my .02
  • That's for Managing Gigabytes, and there also is a great book (note that there's a second edition out now) with the same name on the topic from Witten, Moffat and Bell. Very well written. Go to http://www.mds.rmit.edu.au/mg/intro/about_mg.html to learn about the software, including links. It also has a Freshmeat appindex: http://freshmeat.net/appindex/1999/09/09/936885957.html

    BTW, I'm not associated with the university, the book or whatever. I just enjoyed reading it.
  • This has been discussed on several occasions at distributed.net. The stumbling block is that you would be creating a commercially valuable resource in a not-for-profit organisation. Oh yes, and issues of faking and spamming come to light when you think about it.
  • by Signal 11 ( 7608 ) on Thursday December 09, 1999 @04:56AM (#1472725)
    It would be outlawed by ISPs faster than you can say "slashdot" three times fast. That bandwidth is over a shared medium. If you go and increase the load by even 10-15% at most places the T1 saturates and the QoS drops like a lead balloon.

    Making this a distributed effort would only be useful in a clustering environment a la Beowulf, where tight synchronization would be needed to prevent machines from revisiting the same websites. Other than that, distributed processing for web crawlers is... dubious.

  • Ummmm - no patent - it's already been done.

    The main problem, as other posters have commented, in doing anything like this in a co-operative fashion is the large commercial value of the results. It also requires those taking part to have a significant amount of bandwidth (to pull in all of the content and then to exchange indexes).

    The spidering part of the process is one of the least processor-intensive - once you've completed it you're left with a large glob of data. You then need to convert that into an inverted index, which would still be large and would then need passing to a central server, which would then have to do further processing in order to actually merge it into the whole (a toy sketch of that indexing and merging step follows at the end of this comment).

    The Harvest Indexing system (http://www.tardis.ed.ac.uk/harvest) sought to develop a system like this. It separated the searching and crawling tasks, so it would be possible to have a large number of crawlers (probably topologically close to the sites they were indexing), which then gave their results to an indexing system which collated them and presented them to the world.

    The problem here is that you've still got one large, monolithic system at the indexing end. TERENA, as part of the TF-CHIC project, developed a referral system (based on WHOIS+) to allow there to be one central gateway which then passed search requests to a large number of individual engines, each of which could run different software. Kind of like a fancy metasearch engine.

    Originally the plan for devolving things locally was that if the indexes were generated by people who know the pages, then you'll get a higher standard of index. Aliweb, for instance, had a file per server which contained an index of all of the objects available on that server.

    The problem with this is easily shown up by metatag abuse. If the person running the spider has a commercial interest in the sites they're indexing, they'll often go and fabricate the index so that their sites appear higher in searches.

    Cheers.

    Simon.
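
    To make the inverted-index step above concrete (a toy sketch only; the input format is invented and this is not Harvest's actual index layout):

    # Toy version of the two steps above: build an inverted index from
    # crawled text, then merge partial indexes at a central server.
    import re
    from collections import defaultdict

    def build_index(pages):
        """pages: dict of url -> plain text; returns word -> set of urls."""
        index = defaultdict(set)
        for url, text in pages.items():
            for word in set(re.findall(r"[a-z0-9]+", text.lower())):
                index[word].add(url)
        return index

    def merge_indexes(partial_indexes):
        """Merge partial indexes from many crawlers into one."""
        merged = defaultdict(set)
        for index in partial_indexes:
            for word, urls in index.items():
                merged[word] |= urls
        return merged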

  • A crawler is a fine tool for searching the web. Because it only attempts to go to those sites that are already linked to somewhere else, it doesn't bombard machines with useless requests.
    However, it really does not work when you would like it to find pages that no one points to. Those unique pages are well hidden from crawlers, even if you e-mail all of your friends about them. Until one of your friends puts a link on his start page, you're immune.
    For an organization, it's the wrong avenue of approach. Organizations tend to keep their internet files on a small set of machines, in very specific directory structures. The best search engine for those machines should have permission to look at the directory structures and go through every file in them when it updates its database (sketched below). This ensures that every file in that organization is collected and that no links going outside the organization are followed.

    Ken Boucher
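
    A sketch of that filesystem-side approach (the document root path and file extensions are arbitrary examples):

    # Walk the organization's document root directly instead of crawling
    # links, so unlinked pages are indexed too and no external links are
    # ever followed.
    import os

    DOC_ROOT = "/var/www/htdocs"     # hypothetical document root

    def all_documents(root=DOC_ROOT, suffixes=(".html", ".htm", ".txt")):
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if name.lower().endswith(suffixes):
                    yield os.path.join(dirpath, name)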

  • the database they have on-line appears to only be a fraction of the web

    I highly doubt any of the claims that they make... why? Well the facts just don't seem to add up.

    The biggest flaw I found was that these guys claim they have been 'programming' for 'the last four years straight' to be 'able to provide the most efficiently coded and fastest search engine on the Internet', and yet the engine requires that you have Perl (leading me to believe that is what the whole thing is written in), as stated on the support page [165.90.48.2] (which isn't more than 1 small paragraph saying 'this is a work in progress'). It says:

    The GNU Juggernaut Search Engine, as distributed by Juggernautsearch LLC, depends on the use of Perl software (version 5.005 or later).

    And if these guys did have the world's greatest search engine, would they not have enough pride to design a website a little better than one using a tiled background and a pink coloured table?

    Finally, why would the online demo suck and not have many hits? Well, they admit to that at least...

    The present database online represents a small sample compared to the full size database available through subscription.

    But why only have a few sites indexed only? Why not have the whole thing but only show us a few? If I search for slashdot I should get more than 1 hit...

    I usually don't like to be a sceptic, but the whole thing just smells funny, especially when there are very few concrete details about the whole thing.
    -dr

  • Yeah, only allow the client to spider the current domain and record any "outside" links it encounters. When it's finished with the current domain, it will send its outside link list back to the server, and the server will assign another domain for the client to spider.
  • Those other sites aren't just search engines/indexes, that's why they're called portals. Have you seen Yahoo! or MSN lately? Shopping, auctions, ticketing, weather, scores, communities, calendaring, gaming, stock quotes, maps, make-your-own web pages, chat, news, e-mail, messaging, etc., etc. etc.? And you think Juggernaut or anything similar is going to make these companies go away? Borrowed time? Good God, man, I've got some beachfront property to sell ya! :)

    Cheers,
    ZicoKnows@hotmail.com

  • Well, if I could get in to download it...I'll give it a try, we've been using htdig here at UK for years, but it's not very useable for the size of index we are trying to work with...takes basically all day to rebuild the index on a Xeon450 w/512M...An Infoseek demo we tried under Linux worked great, indexed much faster, had many more options, etc...but at ~$20K up front and ~$15K/year...well...
  • Well, alright, so it can do 10 mil a day on a PII with what? A T3? How much bandwidth do you need before the processor becomes the limiting factor with this engine? I certainly don't think my 26.4 connection at home can handle 10 mil pages a day. They should make some mention of that on the page.
    On a side note, I was very disappointed when a search for "deez nuts" came up dry.. oh well.

    //Phizzy
  • by Anonymous Coward
    You need more coffee.

    He was clearly not talking about just trying a simple search, he was asking if anyone had actually downloaded the code and gotten it running. This, I'm sure you'll agree, is likely to be a process that takes a bit longer than typing his message did.

  • Why would that be a problem? Sendmail and Apache are commercially valuable too, aren't they? And considering faking and spamming: like it was said in the 'Open Source Seti@Home' debate, nothing prevents sending out the same block to two or more randomly chosen clients so the results could be verified. If clients start behaving weird... don't send them blocks anymore!

  • Just grab the Harvest crawler and change the format it outputs.
  • by jd ( 1658 ) <imipak@ y a hoo.com> on Thursday December 09, 1999 @05:19AM (#1472741) Homepage Journal
    I've got to ask "Why Bother?". It seems easier to just use Harvest, update the version of Glimpse it uses, and tidy up the database a bit.

    Unlike Juggernaut, it's a complete search engine system (crawler, database & front-end), it was developed over a long time, and has capabilities that even most modern search engines don't (such as relaxed spelling).

    IMHO, it would be better for the Open Source community, as a whole, if someone picked up Harvest, modernised it and maintained it. At present, it's the best "openish" source Search Engine out there, and it's going to waste.

  • Somehow, this really bothers me. I fully support any software author's right to distribute his code under any terms he sees fit.

    BUT...

    When you release a piece of software under the GPL, there have come to be certain expectations about how things will work. One expectation that I have is that the software will be fully functional.

    This search engine is GPL in only the most technical sense. It is certainly not GPL in spirit. What we have here is just good old-fashioned shareware. It's free enough to test it out, but if you really want to make use of it, it's going to cost you. That's certainly not how I've come to expect things to work when I run GPL'd software.

    These guys have every right in the world to make some cash from their hard work. I just wish they'd use a licensing scheme that more closely reflected the way they are doing business.

    Or am I just nuts?
