Siddha Chakra

Thursday, November 23, 2006

Google File System (GFS)

Google File System (GFS)
Reducing Complexity

Google's distributed storage architecture for data is combined with distributed execution of the software that parses and analyzes it.

To keep software developers from spending too much time on the arcana of distributed programming, Google invented MapReduce as a way of simplifying the process. According to a 2004 Google Labs paper, without MapReduce the company found "the issues of how to parallelize the computation, distribute the data and handle failures" tended to obscure the simplest computation "with large amounts of complex code."

Much as the GFS offers an interface for storage across multiple servers, MapReduce takes programming instructions and assigns them to be executed in parallel on many computers. It breaks calculations into two parts—a first stage, which produces a set of intermediate results, and a second, which computes a final answer. The concept comes from functional programming languages such as Lisp (Google's version is implemented in C++, with interfaces to Java and Python).

A typical first-week training assignment for a new programmer hired by Google is to write a software routine that uses MapReduce to count all occurrences of words in a set of Web documents. In that case, the "map" would involve tallying all occurrences of each word on each page—not bothering to add them at this stage, just ticking off records for each one like hash marks on a sheet of scratch paper. The programmer would then write a reduce function to do the math—in this case, taking the scratch paper data, the intermediate results, and producing a count for the number of times each word occurs on each page.

One example, from a Google developer presentation, shows how the phrase "to be or not to be" would move through this process.
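To make this concrete, here is a minimal single-machine sketch of the two phases in Python. The document set, function names and structure are illustrative assumptions, not Google's C++ implementation, and the distributed runtime is omitted entirely:

# Word count expressed as map and reduce steps; a toy, single-machine sketch.
from collections import defaultdict

def map_phase(doc_id, text):
    # Emit an intermediate (word, 1) record for every occurrence - the "hash marks".
    return [(word, 1) for word in text.lower().split()]

def reduce_phase(word, counts):
    # Do the math: sum the hash marks for one word to get its final count.
    return word, sum(counts)

documents = {"page1": "to be or not to be"}

# Map stage: tick off occurrences without adding them up yet.
intermediate = defaultdict(list)
for doc_id, text in documents.items():
    for word, count in map_phase(doc_id, text):
        intermediate[word].append(count)

# Reduce stage: turn the intermediate results into final counts.
results = dict(reduce_phase(word, counts) for word, counts in intermediate.items())
print(results)   # {'to': 2, 'be': 2, 'or': 1, 'not': 1}

Run on the phrase above, the map phase emits six (word, 1) records and the reduce phase collapses them into the four final counts shown in the comment.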


While this might seem trivial, it's the kind of calculation Google performs ad infinitum. More important, the general technique can be applied to many statistical analysis problems. In principle, it could be applied to other data mining problems that might exist within your company, such as searching for recurring categories of complaints in warranty claims against your products.

But it's particularly key for Google, which invests heavily in a statistical style of computing, not just for search but for solving other problems like automatic translation between human languages such as English and Arabic (using common patterns drawn from existing translations of words and phrases to divine the rules for producing new translations).

MapReduce includes its own middleware—server software that automatically breaks computing jobs apart and puts them back together. This is similar to the way a Java programmer relies on the Java Virtual Machine to handle memory management, in contrast with languages like C++ that make the programmer responsible for manually allocating and releasing computer memory. In the case of MapReduce, the programmer is freed from defining how a computation will be divided among the servers in a Google cluster.

Typically, programs incorporating MapReduce load large quantities of data, which are then broken up into pieces of 16 to 64 megabytes. The MapReduce run-time system creates duplicate copies of each map or reduce function, picks idle worker machines to perform them and tracks the results.

Worker machines load their assigned piece of input data, process it into a structure of key-value pairs, and notify the master when the mapped data is ready to be sorted and passed to a reduce function. In this way, the map and reduce functions alternate chewing through the data until all of it has been processed. An answer is then returned to the client application.

If something goes wrong along the way, and a worker fails to return the results of its map or reduce calculation, the master reassigns it to another computer.
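The orchestration described in the last few paragraphs can be caricatured in a few lines of Python. This toy "runtime" only mimics the splitting, shuffling and reassignment behavior; the chunk size, failure rate and retry loop are invented for illustration and are not Google's actual middleware:

# Toy job runner: split input into chunks, hand each chunk to a worker,
# reassign the chunk if the worker fails, then shuffle and reduce.
import random

CHUNK_SIZE = 4   # stand-in for the 16-64 MB input splits

def split_into_chunks(records, size=CHUNK_SIZE):
    return [records[i:i + size] for i in range(0, len(records), size)]

def unreliable_worker(chunk):
    # Pretend worker that occasionally crashes before returning its map output.
    if random.random() < 0.2:
        raise RuntimeError("worker died")
    return [(word, 1) for record in chunk for word in record.split()]

def run_job(records):
    intermediate = []
    for chunk in split_into_chunks(records):
        while True:                      # the master reassigns failed chunks
            try:
                intermediate.extend(unreliable_worker(chunk))
                break
            except RuntimeError:
                continue                 # try again on another (simulated) idle worker
    # Shuffle: group intermediate values by key, then reduce each group.
    grouped = {}
    for key, value in intermediate:
        grouped.setdefault(key, []).append(value)
    return {key: sum(values) for key, values in grouped.items()}

print(run_job(["to be or not to be", "to be is to do"]))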

As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Google engineer Jeff Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes.

Mobile Computing

Mobile Computing

Introduction:
The Wireless Application Protocol (WAP) defines a global wireless protocol specification, designed to work across heterogeneous wireless network technologies and communication standards for digital mobile phones and supported by over 1,200 companies. WAP makes it possible for mobile phones to access the Internet and retrieve information on simplified display screens or through a voice interface. WAP phones can receive text and data, including pages downloaded from the World Wide Web, as well as voice. As you can see, WAP has many capabilities, and because of these capabilities, together with the wave of upcoming handsets featuring built-in cameras, video streaming and graphics/text integration, WAP is well on its way to making a comeback (Armor, 2002).

WAP's Comeback: Enhanced Messaging Service (EMS) combines text and graphics and uses a WAP browser. Motorola is coming out with a new flat-screen color display phone that includes a camera as well as an MP3 player and supports MMS and EMS through the WAP 2.0 browser. This has paved the way for WAP to make a quiet comeback, even though early WAP browsers were not that great. Many handset vendors feel that the current version of WAP finally gives the technology the capability it was not able to deliver in the beginning (Nobel, 2002).

Short Message Service (SMS): The transformation of mobile phones into all-purpose data devices that enrich our personal lives through constant communication is already evident in the use of the Short Message Service (SMS) (Nadel, 2002). SMS is a globally accepted wireless service that enables the transmission of alphanumeric messages between mobile subscribers and external systems such as e-mail, paging, and voice-mail systems. SMS provides a mechanism for transmitting short messages to and from wireless devices. The service makes use of a local base station, which acts as a store-and-forward system for short messages. The wireless network provides the mechanisms required to find the destination station(s) and transports short messages between the base station and wireless stations (Armor, 2002).

The benefits of SMS to subscribers center on convenience, flexibility, and seamless integration of messaging services and data access. Looking at things from this point of view the primary benefit is the use of the handset as an extension of the computer (Armor, 2002). In order to make this vision of wireless technology a reality many people are depending on the implementation and adoption of 3G technology (Nadel, 2002).

The Next (Third) Generation (3G): 3G can be defined as the next (third) generation of wireless technology beyond personal communications services (Armor, 2002). 3G is a catch-all phrase that describes the next step in mobile phones. Speeds of 100 Kbps and up can be attained by sending and receiving all data and voice traffic in packets. Portable bandwidth will rise to the level of wired broadband connections through the use of 3G technology, enabling instant messaging, videoconferencing, gaming, and so forth. Handsets are projected to become a useful part of our lives, just as PCs increased in usefulness and value when they connected to one another and to the Internet (Nadel, 2002).

Java Intelligent Network Infrastructure (Jini): Jini may be the best-known component of the pervasive computing technology because of the marketing efforts of Sun. Jini regulates the communication between computers and other devices in the network and allows peripherals to be connected to the network without special configurations and used immediately. The self-identifying devices transmit their technical specifications and eliminate the need for “manual” driver selection (Armor, 2002).

Bluetooth: Bluetooth is the specification for small form-factor, low-cost, short-range radio links between mobile PCs, phones, and other portable devices. Bluetooth devices can detect each other automatically and set up a network connection. Using a radio frequency of 2.4 GHz, they transfer data from one adapter to another, whereby the signals don't have a predefined direction and can, in principle, be received by any other device. Bluetooth also enables devices to communicate with each other on the basis of Jini technology without being connected by cable (Armor, 2002).

Bluetooth contains a maximum of three voice channels and seven data channels per piconet. Security is administered at the link layer. Each link is encoded and protected against both eavesdropping and interference (Armor, 2002).

Wireless Security: Since the ratification of 802.11b, WLAN devices have come packaged with a security mechanism called WEP (Wired Equivalent Privacy), which allows for the encryption of wireless traffic. But encryption is turned off by default in wireless devices and software, and in a lot of cases it is never actually turned on (Ward off the, 2002).

WEP relies on a secret key to encrypt packets transmitted between a mobile station (a device with a wireless Ethernet card) and an access point (a base station connecting to a wired network). An integrity check ensures that packets aren't modified in transit. In practice, however, many installations use a single key that's shared by all mobile stations and access points (Kay, 2002).

However, even when WEP is turned on, it still isn't terribly secure, as some managers never change the key because it has to be done manually. Beyond that, a skilled attacker can steal a key in a matter of seconds (Ward off the, 2002).
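As a rough illustration of why a single, rarely changed key is risky, the sketch below shows generic keystream reuse in a stream cipher: anyone who captures two packets encrypted with the same keystream can XOR the ciphertexts and cancel the keystream entirely. This is a simplified, assumed model and does not reproduce WEP's actual RC4 and initialization-vector mechanics.

# Generic stream-cipher sketch (not WEP itself): reusing one keystream leaks data.
import os

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

keystream = os.urandom(32)       # stands in for the keystream derived from the shared key
packet1 = b"user=alice&action=read"
packet2 = b"user=bob&action=write"

cipher1 = xor_bytes(packet1, keystream)
cipher2 = xor_bytes(packet2, keystream)

# An eavesdropper with both ciphertexts learns the XOR of the two plaintexts,
# which exposes their structure without the key ever being recovered.
leaked = xor_bytes(cipher1, cipher2)
assert leaked == xor_bytes(packet1, packet2)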

Even though WEP has weaknesses, it is still acceptable for use by households and small offices. Large corporations should use layers of security (Ward off the, 2002).
There is an anticipated replacement, 802.11i, which will incorporate stronger encryption known as the Advanced Encryption Standard (AES). This standard wards off intruders through the use of a complex algorithm to encode the data. Analysts expect a final standard in about a year. The drawback to this new standard is that it has caused IT managers to worry about their current WLAN equipment becoming obsolete. In the meantime, however, stronger encryption is available through Wi-Fi Protected Access (WPA) (Ward off the, 2002).

At present, the best security option is installing a virtual private network, or VPN: a data tunnel that can safely carry traffic from an employee's computer to the corporate network over a public medium such as air. Network managers can either leave their security in the hands of a large supplier that manages everything from administration to security, or buy any brand of hardware and get security without being locked into a single supplier (Ward off the, 2002).

Monday, November 20, 2006

Learning C++

 Learning C++



"everything sites" and "Websites for authors of C++ books and
articles" below.

Tutorials about C++
http://cplus.about.com/
C++ Annotations (moving from C to C++)
http://www.icce.rug.nl/documents/cplusplus/

DevCentral tutorials for C and C++
http://devcentral.iftech.com/learning/tutorials/

C++ tutorials for Windows 32, how to do without MFC, getting the compiler
to do the hard work of avoiding memory leaks, games, frequency analysis etc
http://www.relisoft.com/

... interactive guide to C++ ... written with Pascal users in mind
http://tqd.advanced.org/3074/

Coronado enterprises tutorials (formerly Gordon Dodrill's)
You can see sample chapters, but are charged for the full tutorials
http://www.coronadoenterprises.com/

Guru of the Week - i.e. discussion papers on using C++
http://www.cntc.com/resources/gotw.html

Tutorials etc on Borland's CBuilder
http://www.richplum.co.uk/cbuilder/

Tutorial on the STL by Phil Ottewell.
http://www.yrl.co.uk/~phil/stl/stl.htmlx
http://www.pottsoft.com/home/stl/stl.htmlx
He has also got a tutorial on C for Fortran users
http://www.pottsoft.com/home/c_course/course.html

Notes for a university lecture course, but
maybe there is enough here for independent study.
http://m2tech.net/cppclass/

Note on pointers - perhaps more oriented towards C than C++.
http://www.cudenver.edu/~tgibson/tutorial/

Very simple C under DOS or MS-windows. Not much C++;
possibly useful to someone interested in programming
MS-windows without MFC etc.
http://www.cpp-programming.com

Weekly newsletter on C++ and other things: aimed at helping new
and intermediate programmers improve their coding skills.
http://www.cyberelectric.net.au/~collins

www.informit.com - a site run by Macmillan USA containing a lot
of information, including several well-known C++ books for
free download - if you are prepared to supply your name and email address
http://www.informit.com/

C++ in 21 days - 2nd edition
http://newdata.box.sk/bx/c/

A variety of C++ books on line (Macmillan, Sams, Wiley, IDG etc)
You can see the tables of contents, but you will have to have a
subscription to read the books themselves after a free trial.
http://www.itknowledge.com/reference/dir.programminglanguages.c1.html

Elementary introduction to C++ (mostly the C subset)
http://clio.mit.csu.edu.au/TTT/

How to use function-pointers in C and C++, callbacks, functors
http://www.function-pointer.org
http://www.newty.de/fpt/fpt.html

Short C++ tutorial, aimed at people who already have
experience with an object-oriented programming language
http://www.entish.org/realquickcpp/

Articles about Win32, C++, and MFC using the VC++ compiler.
http://www.codersource.net


Site lists


Google web directory
http://directory.google.com/Top/Computers/Programming/Languages/C%2B%2B/

University of Cambridge Department of Engineering
http://www-h.eng.cam.ac.uk/help/tpl/languages/C++.html

Object-Oriented Numerics Web Site
http://oonumerics.org/

German scientific computing (in English)
http://scicomp.math.uni-augsburg.de/~scicomp/

World-wide-web "C++ Virtual Library"
http://www.desy.de/user/projects/C++.html

Karim Ratib's list of C++ sites (Scientific computing,
graphs, GUIs etc)
http://www.IRO.UMontreal.CA/~ratib/code/

Phil Austin's list of oo sites for scientific computing
http://www.geog.ubc.ca/~phil/oo

Manfred Schneider's list of sites (CETUS links)
http://www.objenv.com/cetus/software.html
http://www.rhein-neckar.de/~cetus/software.html

Site list from Forschungszentrum Juelich
http://www.fz-juelich.de/zam/cxx/extern.html

C++ and C SIG (New York)
http://www.cppsig.org/

This file
http://www.robertnz.com/cpp_site.html

"Connected Object Solutions" list
http://www.connobj.com/refserv.htm

Warren Young's list - especially STL
http://www.cyberport.com/~tangent/programming/index.html


Recover ur Data

U Can Recover ur Data

A lot of people don't know it, but when we delete a file from a computer, it isn't really deleted. The operating system simply removes it from the file list and makes the space the file was using available for new data to be written. In other words, the operating system doesn't "zero" (i.e., clean) the space the file was using.

The operating system acts like that in order to save time. Imagine a large file that occupies lots of sectors on the hard drive. To really delete this file from the disk the operating system would have to fill with zeros (or any other value) all sectors occupied by this file. This could take a lot of time. Instead, it simply removes the file name from the directory where the file is located and marks the sectors the file used as available space.

This means that it is possible to recover a deleted file, since its data wasn't really removed from the disk. Data recovery software works by looking for sectors that contain data but are not currently in use by any listed file.

This leads us to a very important security question: if you have really confidential files that must not be read by anyone else, deleting them from the disk simply by hitting the Del key and then emptying the Recycle Bin isn't enough: they can be recovered with an advanced data recovery tool.
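For files that really must not be recoverable, one hedged approach is to overwrite the file's contents before deleting it. The sketch below is a simple illustration, not a certified secure-erase tool; the file name is hypothetical, and journaling filesystems or SSD wear levelling can still keep stray copies. It fills the file with zeros and only then removes it:

# Overwrite a file with zeros, flush the zeros to disk, then delete it.
import os

def zero_and_delete(path, chunk_size=1024 * 1024):
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        written = 0
        while written < size:
            block = min(chunk_size, size - written)
            f.write(b"\x00" * block)        # overwrite the file's own sectors
            written += block
        f.flush()
        os.fsync(f.fileno())                # force the zeros onto the disk
    os.remove(path)                         # now the directory entry can go too

# Example (hypothetical file): zero_and_delete("secret_report.txt")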

Disk formatting is no different. When we format a hard drive, the data that was there isn't deleted, making it possible to recover data with an advanced data recovery tool even after formatting the drive. A lot of people with a hard disk full of confidential data think that by formatting the drive they are killing any chance of data recovery. This is far from true.

When you format a disk, the operating system only "zeros" the root directory and the tables containing the list of sectors occupied by files (this table is called the FAT). Notice that when you format a hard drive, a message "Verifying x%" is shown. The hard drive isn't being formatted; the format command is only testing the magnetic surface of the disk for errors and, if an error is found, marking the defective area as bad (the famous "bad blocks" or "bad sectors").

So, just as when we delete files, the hard drive isn't really "zeroed" when we format it. In order to really "zero" your hard drive, use a utility such as Zero Fill from Quantum. This utility fills every sector of the hard drive with zeros, making it impossible to recover any data after it is run, which doesn't happen with the normal format procedure. You can also use the so-called "low-level format utilities". These programs fill all sectors with zeros as well; download the one that matches your hard drive manufacturer.

A bad block or faulty sector is the name given to a damaged area on a hard disk. It is a physical problem, i.e., the hard disk's magnetic media is defective. When we run a disk utility such as Scandisk or Norton Disk Doctor, such faulty sectors are marked with a "B".

Several users have written us asking how to proceed to recover hard disks with bad blocks. Many note that bad blocks disappear after low level formatting the hard disks.

What really happens, however, is that current "low-level format" programs do not actually physically format the disk. If they did, the hard disk would be damaged, since hard disk tracks carry a signal called servo that guides the hard disk head. If we really formatted a hard disk at low level, these servo marks would be erased and the hard disk head would be unable to position itself any longer.

Low level formatting programs are utilities for detecting bad sectors and wiping the disk (for security reasons, for instance, after concluding a confidential project), not carrying out – despite their name – low level formatting.

These programs have an interesting function, which consists of updating the disk's bad sector map. When you use this option, the program scans the disk, seeking defective sectors and updating the disk's map.

When you run a high level formatting (through the Format command), this command skips the sectors contained in this bad sector table. Accordingly, there will not be any sector marked B ("Bad Block") in the FAT, although the defective sectors remain on the disk.

Defective sectors are not removed, but merely noted in this table of bad sectors, resulting in the system ignoring them (in other words, the sectors are hidden).

If new bad sectors keep occurring after running this procedure, you should get rid of the disk, as its magnetic surface is deteriorating, for some reason.



 

Saturday, November 18, 2006

Biometric Security

Biometric Security



What is biometrics?



Biometrics is the science of measuring physical properties of living beings.

(1) Biometric authentication is the automatic recognition of a living being using suitable body characteristics.

(2) By measuring an individual's physical features in an authentication inquiry and comparing this data with stored biometric reference data, the identity of a specific user is determined.



How it all began:

The concept of biometrics probably began with the use of facial features to identify other people. It was in 1882 that Alphonse Bertillon, Chief of the Criminal Identification Division of the Police Department in Paris, France, developed a detailed method of identification based on a number of bodily measurements and physical descriptions. The Bertillon system of anthropometric identification gained wide acceptance before fingerprint identification superseded it.



It was Sir Francis Galton, a British scientist, who proposed the use of fingerprints for identification purposes in the late 19th century. He analysed fingerprint patterns in detail and finally presented a new classification system using prints of all ten fingers, which is the basis of identification systems even today. Subsequently, a British police official, Sir Edward Richard Henry, introduced fingerprinting as a means of identifying criminals.



What are the advantages of biometric systems for authentication?

Advancing automation and the development of new technological systems, such as the internet and cellular phones, have led users to rely more often on technical means rather than on other human beings for authentication. Personal identification has taken the form of secret passwords and PINs. Everyday examples requiring a password include the ATM, the cellular phone, and internet access on a personal computer. In order that a password cannot be guessed, it should be as long as possible, not appear in a dictionary, and include symbols such as +, -, %, or #. Moreover, for security purposes, a password should never be written down, never be given to another person, and should be changed at least every three months. When one considers that many people today need up to 30 passwords, most of which are rarely used, and that the expense and annoyance of a forgotten password is enormous, it is clear that users are forced to sacrifice security due to memory limitations. While the password is very machine-friendly, it is far from user-friendly.

There is a solution that returns to the ways of nature. In order to identify an individual, humans differentiate between physical features such as facial structure or sound of the voice. Biometrics, as the science of measuring and compiling distinguishing physical features, now recognizes many further features as ideal for the definite identification of even an identical twin. Examples include a fingerprint, the iris, and vein structure. In order to perform recognition tasks at the level of the human brain (assuming that the brain would only use one single biometric trait), 100 million computations per second are required.

In the development of biometric identification systems, physical and behavioral features for recognition are required which:
• are as unique as possible, that is, an identical trait won't appear in two people: Uniqueness
• occur in as many people as possible: Universality
• don't change over time: Permanence
• are measurable with simple technical instruments: Measurability
• are easy and comfortable to measure: User friendliness

Biometric traits develop:
• Through genetics: genotypic
• Through random variations in the early phases of an embryo's development: randotypic (often called phenotypic)
• Or through training: behavioral

What are the most well known biometric features used for authentication purposes?

Biometric Trait Description
Fingerprint Finger lines, pore structure
Signature (dynamic) Writing with pressure and speed differentials
Facial geometry Distance of specific facial features (eyes, nose, mouth)
Iris Iris pattern
Retina Eye background (pattern of the vein structure)
Hand geometry Measurement of fingers and palm
Finger geometry Finger measurement
Vein structure of back of hand Vein structure of the back of the hand
Ear form Dimensions of the visible ear
Voice Tone or timbre
DNA DNA code as the carrier of human heredity
Odor Chemical composition of one's odor
Keyboard strokes Rhythm of keyboard strokes (PC or other keyboard)

Which biometric features are most constant over time?
Reasons for variation over time:
• Growth
• Wear and tear
• Aging
• Dirt and grime
• Injury and subsequent regeneration etc.



Biometric features that are minimally affected by such variation are preferred. The degree to which this is possible is shown in the following table. Easily changed effects, such as dirt, and quickly healing injuries, such as an abrasion, are not taken into consideration.



The following table rates the permanence of each trait over time (the more o's, the greater the permanence).



Biometric Trait Permanence over time
Fingerprint (Minutia) oooooo
Signature (dynamic) oooo
Facial structure ooooo
Iris pattern ooooooooo
Retina oooooooo
Hand geometry ooooooo
Finger geometry ooooooo
Vein structure of the back of the hand oooooo
Ear form oooooo
Voice (Tone) ooo
DNA ooooooooo
Odor oooooo?
Keyboard strokes oooo
Comparison: Password ooooo



Which organizations attend to standardizing biometric systems?

• ISO/IEC JTC1 (world)

• DIN NI-AHGB & NI-37 (Germany)



What is the difference between identification and verification?

• In identification, the recorded biometric feature is compared to all biometric data saved in a system. If there is a match, the identification is successful, and the corresponding user name or user ID may be processed subsequently.

• In a verification, the user enters her/his identity into the system (e.g., via a keypad or card), then a biometric feature is scanned. The biometric trait must only be compared to the one previously saved reference feature corresponding to the ID. If a match occurs, verification is successful.

• If a system has only one saved reference trait, identification is similar to verification, but the user need not first enter his or her identity, as for example, access to a mobile phone which should only be used by its owner.
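The difference is easy to see in code. The sketch below uses a made-up numeric "template" and a simple distance threshold purely for illustration; real biometric matchers use far more sophisticated feature extraction and scoring.

# Identification (1:N search) versus verification (1:1 check), with toy templates.
def distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

THRESHOLD = 5
reference_db = {                      # user ID -> stored reference template
    "alice": [10, 22, 31, 44],
    "bob":   [55, 12, 80, 3],
}

def identify(sample):
    # Compare the sample against every stored template (1:N).
    best_id, best_score = min(
        ((uid, distance(sample, tpl)) for uid, tpl in reference_db.items()),
        key=lambda item: item[1],
    )
    return best_id if best_score <= THRESHOLD else None

def verify(claimed_id, sample):
    # Compare the sample only against the template stored for the claimed ID (1:1).
    template = reference_db.get(claimed_id)
    return template is not None and distance(sample, template) <= THRESHOLD

print(identify([11, 21, 30, 44]))          # -> 'alice'
print(verify("bob", [11, 21, 30, 44]))     # -> False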



What are the advantages of verification over identification?

• Verification is much faster than identification when the number of saved reference features/users is very high.

• Verification shows a better biometric performance than identification, especially when the number of reference traits/users is very high.



What makes up a biometric authentication system?

A basic biometric system is made up of:

• a sensor to record the biometric trait

• a computer unit to process and eventually save the biometric trait

• an application, for which the user's authentication is necessary

Generally, computation speeds adequate for pattern recognition are required. This is about 100 million operations per second, which has only recently been attained by affordable hardware (PC, DSP).



Is biometrics more "secure" than passwords?

This question poses at least two problems: not all biometric systems are alike, and the term "secure", though commonly used, is not exactly defined. However, we can try to collect pros and cons in order to find at least an intuitive answer.



It is a matter of fact that the security of password-protected assets depends in particular on the user. If the user has to memorize too many passwords, he will reuse the same password for as many applications as possible. If this is not possible, he will construct very simple passwords. If this also fails (e.g., because the construction rules are too complex), the next fall-back is to write the password down on paper. This would transform "secret knowledge" into "personal possession". Of course, not every user will react this way. Rather, personal motivation plays an important role: is he aware of the potential loss caused by careless handling of the password? That is easy if the user is the owner. But often someone else's property (e.g., the employer's) has to be guarded, whose value one can often hardly estimate. If motivation is missing, any password primarily tends to be felt as bothersome. In this case, and that seems to be the normal case, it can be assumed that biometrics has considerable advantages.



Conversely, passwords offer unbeatable theoretical protection: an eight-character password which is allowed to contain any symbol from an 8-bit alphabet offers 256^8, roughly 10^19, possible combinations! This is a real challenge for any biometric feature. The requirements are obvious: such a password is maximally difficult to learn, it must not be written down, it must not be passed to anyone, the input must take place in absolute secrecy, it must not be extorted, and the technical implementation must be perfect. This leads us to the practical aspects: the implementation must be protected against replay attacks, keyboard dummies (e.g., false ATMs), wiretapping, etc. Biometric features have to cope with such problems as well. However, it can be assumed that the protection of biometric feature acquisition is not easier than the protection of password entry, provided the implementation expense is comparable!



Conclusion: Surely, there are cases where passwords offer more security than biometric features. However, these cases are not common!



Information Sources:



• Biometrics site of Jan Krissler and Lisa Thalheim

(http://www.biometrische-systeme.org/)

• Avanti Biometrics Site

(http://www.avanti.1to1.org/)

• Biometrics Research Homepage at Michigan State University (http://biometrics.cse.msu.edu/)

• NIST National Institute of Standards and Technology (http://www.nist.gov/biometrics)

Search Engines

A search engine or search service is a document retrieval system designed to help find information stored on a computer system, such as on the World Wide Web, inside a corporate or proprietary network, or in a personal computer. The search engine allows one to ask for content meeting specific criteria (typically those containing a given word or phrase) and retrieves a list of items that match those criteria. Search engines use regularly updated indexes to operate quickly and efficiently.

History

The very first tool used for searching on the Internet was Archie. [1] (The name stands for "archive" without the "v", not the character from the 'Archie' comic book series). It was created in 1990 by Alan Emtage, a student at McGill University in Montreal. The program downloaded the directory listings of all the files located on public anonymous FTP (File Transfer Protocol) sites, creating a searchable database of filenames, but Archie could not search by file contents.
While Archie indexed computer files, Gopher indexed plain text documents. Gopher was created in 1991 by Mark McCahill at the University of Minnesota and was named after the school's mascot[1]. Because these were text files, most of the Gopher sites became websites after the creation of the World Wide Web.

Two other programs, Veronica and Jughead, searched the files stored in Gopher index systems. Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) provided a keyword search of most Gopher menu titles in the entire Gopher listings. Jughead (Jonzy's Universal Gopher Hierarchy Excavation And Display) was a tool for obtaining menu information from various Gopher servers.

Google

Around 2001, the Google search engine rose to prominence. Its success was based in part on the concept of link popularity and PageRank. The number of other websites and webpages that link to a given page is taken into consideration with PageRank, on the premise that good or desirable pages are linked to more than others. The PageRank of linking pages and the number of links on these pages contribute to the PageRank of the linked page. This makes it possible for Google to order its results by how many websites link to each found page. Google's minimalist user interface was very popular with users, and has since spawned a number of imitators.

Google and most other web engines utilize not only PageRank but more than 150 criteria to determine relevancy. The algorithm "remembers" where it has been and indexes the number of cross-links and relates these into groupings. PageRank is based on citation analysis that was developed in the 1950s by Eugene Garfield at the University of Pennsylvania. Google's founders cite Garfield's work in their original paper. In this way virtual communities of webpages are found. Teoma's search technology uses a communities approach in its ranking algorithm. NEC Research Institute has worked on similar technology. Web link analysis was first developed by Dr. Jon Kleinberg and his team while working on the CLEVER project at IBM's Almaden Research Center. Google is currently the most popular search engine.
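As a rough, hedged illustration of the link-popularity idea, here is a minimal PageRank-style power iteration in Python; the damping factor, iteration count and toy graph are conventional textbook assumptions, not Google's production code, which combines PageRank with many other signals:

# Minimal PageRank power iteration over a toy link graph.
def pagerank(links, damping=0.85, iterations=50):
    # links maps each page to the list of pages it links to.
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            targets = outgoing or pages          # dangling pages spread rank evenly
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

graph = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
}
print(sorted(pagerank(graph).items(), key=lambda kv: -kv[1]))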


How search engines work
A search engine operates in the following order:
Web crawling
Indexing
Searching


Web search engines work by storing information about a large number of web pages, which they retrieve from the WWW itself. These pages are retrieved by a Web crawler (sometimes also known as a spider) — an automated Web browser which follows every link it sees. Exclusions can be made by the use of robots.txt. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries. Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. This cached page always holds the actual search text since it is the one that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it. This problem might be considered to be a mild form of linkrot, and Google's handling of it increases usability by satisfying user expectations that the search terms will be on the returned webpage. This satisfies the principle of least astonishment since the user normally expects the search terms to be on the returned pages. Increased search relevance makes these cached pages very useful, even beyond the fact that they may contain data that may no longer be available elsewhere.
When a user comes to the search engine and makes a query, typically by giving key words, the engine looks up the index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. Most search engines support the use of the boolean terms AND, OR and NOT to further specify the search query. An advanced feature is proximity search, which allows users to define the distance between keywords.
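The indexing and querying stages can be sketched with a toy inverted index; the documents and helper names below are invented for illustration and bear no relation to any real engine's internals:

# Toy inverted index with boolean AND / OR / NOT queries.
from collections import defaultdict

documents = {
    1: "to be or not to be",
    2: "the search engine indexes every page",
    3: "not every page is indexed",
}

index = defaultdict(set)                 # word -> set of document ids
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(require=(), any_of=(), exclude=()):
    # AND over `require`, OR over `any_of`, NOT over `exclude`.
    results = set(documents)
    for word in require:
        results &= index.get(word, set())
    if any_of:
        results &= set().union(*(index.get(w, set()) for w in any_of))
    for word in exclude:
        results -= index.get(word, set())
    return results

print(search(require=["page"], exclude=["not"]))   # -> {2}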
The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of webpages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve.
Most Web search engines are commercial ventures supported by advertising revenue and, as a result, some employ the controversial practice of allowing advertisers to pay money to have their listings ranked higher in search results. Those search engines which do not accept money for their search results make money by running search-related ads alongside the regular search engine results. The search engines make money every time someone clicks on one of these ads.
The vast majority of search engines are run by private companies using proprietary algorithms and closed databases, the most popular currently being Google, MSN Search, and Yahoo! Search. However, open-source search engine technology does exist, such as ht://Dig, Nutch, Senas, Egothor, OpenFTS, DataparkSearch and many others.

Storage costs and crawling time
Storage costs are not the limiting resource in search engine implementation. Simply storing 10 billion pages of 10 kbytes each (compressed) requires 100 TB, and another 100 TB or so for indexes, giving a total hardware cost of under $200k: 400 disk drives of 500 GB each spread across 100 cheap PCs.
However, a public search engine requires considerably more resources than this to calculate query results and to provide high availability. Also, the costs of operating a large server farm are not trivial.
Crawling 10 billion pages with 100 machines crawling at 100 pages/second each would take 1 million seconds, or about 11.6 days, on a very high capacity Internet connection. Most search engines crawl a small fraction of the Web (10-20% of pages) at around this frequency or better, but also crawl dynamic websites (e.g. news sites and blogs) at a much higher frequency.
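The figures above are easy to verify with the stated assumptions (10 billion pages, roughly 10 kbytes per compressed page, 100 machines crawling 100 pages per second each):

# Back-of-the-envelope check of the storage and crawl-time estimates.
pages = 10_000_000_000                  # 10 billion pages
page_size_bytes = 10_000                # ~10 kbytes per compressed page
storage_tb = pages * page_size_bytes / 1e12
print(f"page storage: ~{storage_tb:.0f} TB (plus roughly as much again for indexes)")

machines, pages_per_second = 100, 100
seconds = pages / (machines * pages_per_second)
print(f"crawl time: {seconds:,.0f} seconds, about {seconds / 86_400:.1f} days")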

A network security policy

A network security policy is a generic document that outlines rules for computer network access, determines how policies are enforced and lays out some of the basic architecture of the company's security/network security environment. The document itself is usually several pages long and written by a committee. A security policy goes far beyond the simple idea of "keep the bad guys out". It's a very complex document, meant to govern data access, web-browsing habits, use of passwords and encryption, email attachments and more. It specifies these rules for individuals or groups of individuals throughout the company.
A security policy should keep malicious users out and also exert control over potentially risky users within your organization. The first step in creating a policy is to understand what information and services are available (and to which users), what the potential for damage is and whether any protection is already in place to prevent misuse.
In addition, the security policy should dictate a hierarchy of access permissions; that is, grant users access only to what is necessary for the completion of their work.
While writing the security document can be a major undertaking, a good start can be achieved by using a template. The policies could be expressed as a set of instructions that could be understood by special-purpose network hardware dedicated to securing the network.
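As a closing sketch of the "hierarchy of access permissions" idea, a policy can be expressed as plain data and checked mechanically. The groups and resources below are hypothetical, and in practice such rules are enforced by firewalls, directory services and dedicated network hardware rather than a script like this:

# Least-privilege policy expressed as data: each group gets only what it needs.
policy = {
    "engineering": {"source-repo", "build-server"},
    "finance":     {"ledger-db"},
    "everyone":    {"email", "intranet"},
}

def is_allowed(user_groups, resource):
    # Grant access only if one of the user's groups is explicitly given the resource.
    return any(resource in policy.get(group, set()) for group in user_groups)

print(is_allowed(["engineering", "everyone"], "ledger-db"))   # False - not needed for their work
print(is_allowed(["finance", "everyone"], "ledger-db"))       # True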