Gnutella Web Caching System
Version 2 Specifications Client Developers' Guide
Copyright (c) 2003 Hauke Dämpfling,
version 1.9.4 / 18.6.2003, http://www.gnucleus.com/gwebcache/newgwc.html
This document serves as a guide for client developers that covers how to use the
"new" GWebCache system (according to the "version 2
specifications", also referred to as GWC2). This document should be
considered "beta". Clients and caches using these specs have not
been thoroughly tested.
GWebCache, even though it is designed for simplicity, will only work if
several key functionalities are implemented by developers. Therefore,
developers, read this document carefully.
To understand why this is so important: Because some clients had errors in
their code, people who ran GWebCaches had (and may still have) much grief,
because these clients relentlessly hammered away at the servers, in some cases
even continuing to hammer servers' IPs when the virtual web servers were shut
down. Such utter lack of responsibility in coding put many users in a situation
that they could not escape from, and such a situation must not be repeated.
Therefore, I hope that you understand why it is critical that you read and
understand this entire document. And, when you get ready to release your
shiny new client with GWebCache v2 functionality, you will thoroughly test
the interaction with a web cache before making any releases.
A bunch of Thank Yous for support of the GWebCache project with many
ideas and code: John Marshall, Robert Rainwater, Guo Xu, Tor Klingberg,
Christopher Rohrs, Mike Green, Nick Randall, ...
If you have any questions, comments, suggestions, (constructive)
criticisms, etc., please post them in the Forum
A GWebCache is a script on a web server; clients talk to it using normal HTTP. It stores
IP addresses of Gnutella nodes and the URLs of other caches. Clients (ultrapeers)
make regular updates to GWCs to keep the information fresh.
Summary of Important Things to Remember
Each of these points is described in more detail below.
- Your client must use GWebCache only if it has no other way to discover
hosts. First, use your Pong cache and such.
- Your client may send updates only if it meets certain criteria.
For example, it must accept incoming connections as an ultrapeer. More on this below.
- In any case, your client must not send more than one request per hour.
Your client will be rejected anyway, and you don't want to be rejected.
- If your client fails to contact a cache, it must not send requests to that
cache again. If a cache is down, it's down!
- Remember that GWebCaches are run by volunteers in their own webspace.
Do not abuse the privilege of accessing GWebCaches, as they have
limited CPU and bandwidth resources. Don't DDoS your users and service
Step 1: How to store GWC data in your client
- Keep an array of GWebCache URLs, and for each URL, store a flag as to
whether or not your client has successfully contacted this cache. The client
should forget this flag when it exits and saves the list to (for example) a
text file, but it must keep the flag in memory while running.
- Do not hardcode any cache URLs. Include a default list of GWCs with
your client, but do not hardcode the URLs.
- You must remove any caches from your list that do not respond
correctly. More on this later.
- Hosts will be returned in the standard numerical IP : port format (i.e.
- URLs always begin with
- Before your client accepts new URLs into its internal list, it must
make the following changes:
- If the URL contains any %XX sequences where XX is a pair of hex digits (0-9,
a-f, A-F), replace them by the ASCII character with that hex value (e.g.
%7E is ASCII character 0x7E, decimal 126, char "~").
- If the URL ends in "index.EXT" where EXT is any of the
following: "php", "cgi", "asp", "cfm",
"jsp" (this list is not complete), then trim this filename.
- Trim any trailing slashes (/). (For example
- This check is encouraged: perform a DNS lookup of the web server you
are adding to your list and compare that IP address to those of the
servers already in the list. Do not replace the webserver's hostname
with its IP address! This would screw up virtual servers very
badly. This check is meant to avoid ambiguities between hostnames that
have the same IP address. For example, both "zero-g.net" and
"www.zero-g.net" are working hostnames for the same site, but
this should not cause duplicate entries in your list of cache URLs.
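The storage and URL-rewriting rules of Step 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names `normalize_cache_url` and `CacheList` are hypothetical, the host `example.com` is a placeholder, and the optional DNS comparison is left out.

```python
import re
from urllib.parse import unquote

# Script filenames trimmed from the end of a URL. The guide says its list
# of extensions is not complete, so this pattern is only a starting point.
INDEX_RE = re.compile(r"/index\.(php|cgi|asp|cfm|jsp)$")


def normalize_cache_url(url):
    """Apply the three rewriting rules to a URL before it is stored."""
    # Rule 1: decode %XX sequences (latin-1 maps each byte to one character).
    url = unquote(url.strip(), encoding="latin-1")
    # Rule 2: trim a trailing index.EXT filename.
    url = INDEX_RE.sub("", url)
    # Rule 3: trim trailing slashes.
    return url.rstrip("/")


class CacheList:
    """Cache URLs mapped to a 'successfully contacted' flag, kept in memory."""

    def __init__(self):
        self.contacted = {}  # normalized URL -> bool

    def add(self, url):
        """Store a normalized URL as not yet contacted; ignore duplicates."""
        self.contacted.setdefault(normalize_cache_url(url), False)

    def mark_contacted(self, url):
        """Flag a cache after a response from it was parsed successfully."""
        self.contacted[normalize_cache_url(url)] = True

    def remove(self, url):
        """Forget a cache that did not respond correctly."""
        self.contacted.pop(normalize_cache_url(url), None)
```

Because every URL is normalized before it is used as a key, "http://example.com/%7Euser/gwc/index.php" and "http://example.com/~user/gwc/" end up as a single entry.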
Step 2: How to interact with GWebCaches
- Your client must not exclusively rely on GWebCache. Your client
must use its internal host cache (information gathered from Pongs) and X-Try
headers with priority above GWebCache.
- Use a standard HTTP library. GWebCaches are regular scripts on
regular web servers and therefore rely on your client understanding regular,
full HTTP. (For example, 3xx responses mean "redirect" and 4xx-5xx
means "error".) Make sure that your HTTP libraries
provide a mechanism for identifying HTTP error codes.
- Do not use HTTP proxies. If the HTTP library you use uses proxies,
they should be disabled. (Scripts need to see the client's IP.)
- This should not be an issue if you use standard HTTP libraries, but since
it's happened before: make sure your libraries speak HTTP/1.1 and support
virtual hosts. (For example, the "Host:" header.)
- When you contact a GWebCache, you can get four different kinds of
responses, listed here. If you get anything that is not a normal
GWebCache response, delete that cache's URL from your internal list.
- Normal GWebCache responses (described below)
- GWebCache error (response begins with string "ERROR")
- Invalid response (not parseable)
- HTTP error (HTTP codes 400 to 599)
- In all cases except the first, your client must forget about that
cache and must not retry. Note that in cases 2 and 3,
the HTTP response code will still be 2xx ("OK"), but these
responses still mean that the cache has had an error. In other words, only
when you can successfully parse the response did the request succeed.
- Note that, as defined below, a GWebCache will now always output at
least one line - this differs from the original GWebCache
specifications, which said that GWebCache may return an empty string. Now,
returning an empty file is invalid (note that "empty file"
means that there may still be one or more CRLF/CR/LFs in the file).
- When contacting a web cache, pick a random cache from your internal
list of caches.
- There is absolutely no reason to send more than one request per hour.
Updates can be combined with Gets and Pings. Ideally, your client will make
one request at startup only if necessary (more on this below),
and then only one update an hour if it meets the criteria
(more on this below too).
- Make sure your client can handle different end-of-line formats. Clients
and servers may be on different platforms so there is no guarantee as to
whether you will get CR, LF, or CRLF. As an example, here is some simple
logic for converting everything to LFs: If the returned file contains any
LFs, then remove all CRs, else replace all CRs by LFs.
- Your client must supply version information to a GWebCache. This is done
via the "client" parameter. Version information is a 4-character
string of uppercase letters (your client's ID) plus a max of 16 characters
for the version number. (Examples: "
- IP Addresses must not begin with leading zeros, i.e. not
001.002.003.012 (this is dumb, and nobody does this anyway, but I just
wanted to be clear).
- Your client will send requests via HTTP GET. This means that your request
has the form: [the cache's URL] + "?" + any number of the following:
[parameter name] + "=" + [escape-encoded parameter value] + "&" +
[next parameter name] + "=" + [escape-encoded value] etc.
The order of the parameters does not matter. Each parameter should appear
- "Escape Encoding" (RFC1738)
means replacing all characters that are not letters, numbers, dashes
"-", underscores "_", or periods "." with
"%" + [2-character hexadecimal ASCII code of the character].
To make this replacement:
Step 1: replace all "%" by their representation: "%25"
Step 2: replace all other non-alphanumeric characters except "-",
"_" and "." by a percent sign (%) followed by two hex digits.
- Example requests:
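As an illustration of the request format, escape encoding, and end-of-line handling described in this step, here is a minimal Python sketch. It is not taken from the original guide. The function names are hypothetical, and the host `example.com`, the client string `TEST1.0`, and the `get` parameter name in the usage note are placeholders assumed for the example.

```python
import string

# Characters that are passed through unchanged by the escape encoding.
SAFE = set(string.ascii_letters + string.digits + "-_.")


def escape_encode(value):
    """Replace every character outside letters, digits, '-', '_' and '.'
    with '%' followed by two hex digits."""
    out = []
    # Encoding to UTF-8 first is an assumption for non-ASCII input;
    # for plain ASCII it yields exactly one byte per character.
    for byte in value.encode("utf-8"):
        ch = chr(byte)
        out.append(ch if ch in SAFE else "%%%02X" % byte)
    return "".join(out)


def build_request(cache_url, params):
    """Join [cache URL] + '?' + name=value pairs separated by '&'."""
    query = "&".join(
        "%s=%s" % (name, escape_encode(str(value)))
        for name, value in params.items()
    )
    return cache_url + "?" + query


def normalize_eol(text):
    """If the text contains any LF, remove all CRs; otherwise turn CRs into LFs."""
    if "\n" in text:
        return text.replace("\r", "")
    return text.replace("\r", "\n")
```

For example, `build_request("http://example.com/gwc", {"client": "TEST1.0", "get": 1})` produces `http://example.com/gwc?client=TEST1.0&get=1`, and `escape_encode("10.0.0.1:6346")` produces `10.0.0.1%3A6346`.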
Step 3: GWebCache output format
- Output of a GWebCache is in line-by-line format, according to the
- "x" can be either "I" = Informational, "U" =
URL, "H" = Host. So far, the following responses have been defined:
I - Informational Response
- field 1:
- field 2: (version string)
Included in response to a Ping request, returns GWebCache version
- field 1:
- field 2: OK
Returned when the update completed successfully (but possibly
there were warnings!)
- field 2: WARNING
field 3: "You came back too early", "Rejected
IP" or "Rejected URL" (others may be added as needed)
A WARNING response to an update generally means that your client
did something wrong. Note that warnings can appear in
addition to an OK response.
- field 1:
Returned when there is no other output, so your client doesn't get
bored. (Actually, this is because GWC must always output at least one line.)
U - URLs
- field 1:
The URL of the alternate cache, beginning with http://
- field 2:
The time since submission of this URL to the cache in seconds
H - Hosts
- field 1:
The Host:Port of a host
- field 2:
The time since submission of this host to the cache in seconds
- Your client should of course be prepared to expect any other responses, as
long as they are in the above format: they begin with a character (a-z, A-Z,
0-9), then a pipe (|), then any number of characters and pipes. Also make
sure your client can handle extensions to the above formats (for example,
expect to have more information following an "
response, i.e. something like "
etc.). In other words, your parser should be very general.
- A GWebCache may also provide an extra HTTP header for your client,
"X-Remote-IP". This header is analogous to the
"Remote-IP" header provided in the Gnutella handshaking protocol,
with the difference that it cannot be trusted as much. Trust the
Remote-IP header from Gnutella connections instead. X-Remote-IP is what the
web server thinks your IP address is, and this could be wrong due to
transparent proxies and the like.
- Example responses:
- Short response to a simple Get:
- Response to an update combined with a ping:
I|update|WARNING|You came back too early
- Some responses that are currently not given but that are valid and
your parser should still handle:
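A general parser along the lines described in this step might look like the following Python sketch. It is an illustration, not the guide's own code. The function names are hypothetical, and the sample lines in the usage note (including the `X|...` record) are made up to show that unknown record types and extra fields are tolerated.

```python
import re

# A valid line starts with one alphanumeric character followed by a pipe.
LINE_RE = re.compile(r"^[A-Za-z0-9]\|")


def parse_response(body):
    """Return a list of field lists, or None if the response is an error
    or cannot be parsed (in which case the cache should be dropped)."""
    # Normalize CR, LF and CRLF line endings to LF.
    text = body.replace("\r", "") if "\n" in body else body.replace("\r", "\n")
    lines = [line for line in text.split("\n") if line.strip()]
    if not lines:                      # an empty file is invalid
        return None
    if lines[0].startswith("ERROR"):   # GWebCache error
        return None
    if not all(LINE_RE.match(line) for line in lines):
        return None                    # not parseable
    return [line.split("|") for line in lines]


def sort_records(records):
    """Group records by type letter; unknown types are skipped and any
    extra trailing fields are kept as they are."""
    result = {"I": [], "U": [], "H": []}
    for fields in records:
        if fields[0] in result:
            result[fields[0]].append(fields[1:])
    return result
```

Given the body `"H|10.0.0.1:6346|120\r\nI|update|WARNING|You came back too early\r\nX|some|future|record\r\n"`, `parse_response` returns three records and `sort_records` files the first under "H", the second under "I", and ignores the third. An HTML error page or a body starting with "ERROR" yields `None`.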
Step 4: How to make updates to a cache
- To make an update, your client must meet the following criteria.
Note that these are the same as the standard Ultrapeer criteria:
- Your client must have been online (running & connected) for at
least an hour.
- Your client must accept incoming connections. (This is usually
tested by keeping track of whether or not your client has received any
- In other words, leaf nodes must not send updates.
- Your client must support the Remote-IP Gnutella header. This
header is essential for a client so that it can find its own IP address
(for example, if your client is behind a firewall or NAT router). If
your client does not yet support this header, you should start
supporting it now. Ask on the
GDF if you have any questions regarding implementation.
- If your client meets these criteria, your client should send updates once
an hour. This is limited by the GWebCache and any updates sent too early
will be rejected. Again, there is absolutely no reason to send more than
one request per hour to a GWC.
- Updates are sent through the following parameters:
ip=[your client's numerical IP]:[your client's port for incoming connections]
url=[the URL of a web cache that your client has successfully contacted]
- The IP address you send must be your client's IP address.
This IP address will be checked against the one that the server sees. If
your client is behind a transparent HTTP proxy, there is not much
you can do about it; your updates will most likely fail. However, if
your IP address is rejected ("
on more than one cache then your client should consider not sending any
- The URL you send must be one that your client has successfully
contacted. This is why, as noted above, you must keep track of which caches
your client has successfully contacted.
For example, Gnucleus keeps GWebCaches flagged with either
"ALIVE" or "UNTESTED". Any web cache that is added
to the internal list is initially flagged as "Untested". When
making Get requests, Gnucleus uses a cache flagged as
"Untested". If the cache is successfully contacted, the URL is
flagged as "Alive". When making updates, Gnucleus sends the
update to an "Untested" cache, and sends an "Alive"
cache in the
- Don't forget that the parameter values must be URL-escape-encoded.
(See the above explanation.)
- To send an update to the cache running at "
with your IP/port
126.96.36.199:123 and sending the URL
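The update criteria and parameters of this step can be sketched in Python as follows. This is a hedged illustration, not the guide's own example: the function names, the hosts `a.example` and `b.example`, and the client string `TEST1.0` are hypothetical, and only the `client`, `ip`, and `url` parameters named in the text are used. A real cache may expect further parameters that are not shown here.

```python
import re

ONE_HOUR = 3600  # seconds


def escape_encode(value):
    """Percent-encode everything except letters, digits, '-', '_' and '.'
    (ASCII input assumed)."""
    return re.sub(r"[^A-Za-z0-9._-]",
                  lambda m: "%%%02X" % ord(m.group()), value)


def may_send_update(uptime, accepts_incoming, is_leaf, knows_own_ip,
                    since_last_request):
    """Check the criteria from this step. Times are in seconds;
    knows_own_ip means the address was learned via the Remote-IP header."""
    return (uptime >= ONE_HOUR
            and accepts_incoming
            and not is_leaf
            and knows_own_ip
            and since_last_request >= ONE_HOUR)


def build_update_url(target_cache, contacted_cache, ip, port, client):
    """Build the GET URL for an update sent to target_cache.
    contacted_cache must be a cache this client has successfully contacted."""
    params = [
        ("client", client),
        ("ip", "%s:%d" % (ip, port)),
        ("url", contacted_cache),
    ]
    query = "&".join("%s=%s" % (name, escape_encode(value))
                     for name, value in params)
    return target_cache + "?" + query
```

Calling `build_update_url("http://a.example/gwc", "http://b.example/gwc", "10.0.0.1", 6346, "TEST1.0")` returns `http://a.example/gwc?client=TEST1.0&ip=10.0.0.1%3A6346&url=http%3A%2F%2Fb.example%2Fgwc`. The eligibility check is kept separate so that the one-request-per-hour limit is enforced before any URL is built.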
Step 5: How to request information from a GWebCache
- When your client needs IP addresses to connect to, first try your
internal host cache (information gathered from Pongs and X-Try headers).
On startup, your client should try about 20 IPs from its internal cache, and
only then should it contact a GWebCache.
- Requesting information is simple. Send the following parameter:
- If the GWebCache has hosts and/or URLs stored, it will return them
according to the format defined above.
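The startup order described in this step (internal host cache first, a GWebCache only as a fallback) can be sketched in Python as follows. This is an illustration under assumptions: `find_hosts` is a hypothetical name, and `try_connect` and `fetch_hosts` are hypothetical callbacks standing in for the client's own Gnutella connection code and HTTP request code.

```python
import random

MAX_INTERNAL_TRIES = 20  # "about 20 IPs" from the internal host cache


def find_hosts(host_cache, gwc_urls, try_connect, fetch_hosts, rng=random):
    """Return a list of usable hosts.

    host_cache  -- 'ip:port' strings gathered from Pongs and X-Try headers
    gwc_urls    -- mutable list of known GWebCache URLs
    try_connect -- callback: host -> True if a connection succeeded
    fetch_hosts -- callback: cache URL -> list of hosts, or None on any
                   HTTP error, GWebCache error, or unparseable response
    """
    # First choice: the client's own host cache.
    for host in host_cache[:MAX_INTERNAL_TRIES]:
        if try_connect(host):
            return [host]
    # Fallback: exactly one request to one randomly picked cache.
    if not gwc_urls:
        return []
    url = rng.choice(gwc_urls)
    hosts = fetch_hosts(url)
    if hosts is None:
        gwc_urls.remove(url)  # a failed cache is forgotten, never retried
        return []
    return hosts
```

The function makes at most one GWebCache request per call, so the caller only has to make sure it is not called more often than the hourly limit allows.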
Extras: Using the "Network" Parameter
- GWebCache now supports storing more than one list of Hosts/URLs. A cache
owner may enable his/her cache to store more than just the default Gnutella
hosts. Your client should simply send the extra parameter: "
of network]". When you contact a cache, there are two situations:
- The cache supports the network you are asking for. Interaction with the
GWC will be unchanged.
- The cache does not support the network you are asking for. The following
things will happen:
- The cache will send the extra response "
- When sending Updates: The cache will assume that the URL you are
submitting supports the network that you are asking for (!). The URL
will be stored internally along with the network name. Any other clients
that ask for this network will be given this URL as a kind of
"redirect" or "try other".
- When sending Gets: If the cache knows about a URL that supports this
network then it will return that URL. Think of this as a
Extras: Using the Timestamp information
- This feature is experimental; we will keep the timestamp
information but might add more information as we see necessary.
- As you may have noticed, GWC returns the "age" (time since
submission) of all URLs and IPs it stores. This information is provided as a
kind of "freshness" information.
- What your client can do with this information:
- If you notice that the information in the cache is "very
fresh" then your client can consider not sending an update for a
while. For example: if you notice that a cache has information that was
submitted less than a minute ago, you can wait two hours instead of one
until you send an update.
- Be very careful with this: If you notice that the information
in the cache is very old, then your client can consider sending an
update a little earlier. For example: if you notice a cache hasn't
gotten an update for more than an hour, you can send an update right
away. Remember, this is very dangerous - your client should still not
send more than one request an hour.
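One cautious way to use the age values is sketched below in Python. It is only an illustration: the function name and the 60-second threshold parameter are assumptions taken from the "less than a minute" example above. The sketch only ever lengthens the interval, so the one-request-per-hour limit is never undercut, even when the cache's information is old.

```python
ONE_HOUR = 3600  # seconds


def next_update_delay(ages, fresh_threshold=60):
    """Return how many seconds to wait after the last request before
    sending the next update.

    ages -- ages in seconds taken from field 2 of the H and U lines
    """
    # Very fresh data: someone else updated moments ago, so back off.
    if ages and min(ages) < fresh_threshold:
        return 2 * ONE_HOUR
    # Old or missing data: update as early as allowed, which is still
    # one hour after this client's previous request.
    return ONE_HOUR
```

For example, `next_update_delay([30, 500])` returns 7200, while `next_update_delay([4000])` and `next_update_delay([])` both return 3600.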
Extras: Clustering Information
- The GWC2 beta supports the new "
parameter. This functionality is currently provided for testing only, so
consider it "alpha".
- On update requests, if you include the extra parameter "
these keywords will be stored along with the host you submit.
- The following limitations are placed on the keyword string: it may only
contain the characters [A-Za-z0-9.-_:], and it may not be longer than 256
characters (yes, the entire keyword string).
- Characters that aren't allowed are stripped and any keywords beyond the 256
characters are dropped.
- On get requests, the keywords are returned in the field after the
age parameter, like so:
- Changed "alpha" to "beta" status
- Added clustering information
- Smaller corrections and updates
- Replaced "Important Traffic Issues" by "Summary of
Important Things to Remember"
- Added Timestamp information
- Added Traffic section
- Clarified Remote-IP/X-Remote-IP issues
- First release of "Developers' Guide"
See also: http://www.gnucleus.com/
Copyright (c) 2003 Hauke Dämpfling.
License Terms: FDL.