ch02: URLs and Resources
Last updated
URL syntax and what the various URL components mean and do
URL shortcuts that many web clients support, including relative URLs and expandomatic URLs
URL encoding and character rules
Common URL schemes that support a variety of Internet information systems
The future of URLs, including uniform resource names (URNs), a framework to support objects that move from place to place while retaining stable names
URL: identifies a resource by describing where it is located.
URN: identifies a resource by name, regardless of where it currently resides.
Example: http://www.joes-hardware.com/seasonal/index-fall.html
The first part of the URL (http) is the URL scheme. The scheme tells a web client how to access the resource. In this case, the URL says to use the HTTP protocol.
The second part of the URL (www.joes-hardware.com) is the server location. This tells the web client where the resource is hosted.
The third part of the URL (/seasonal/index-fall.html) is the resource path. The path tells which particular local resource on the server is being requested.
URL format:
<scheme>://<user>:<password>@<host>:<port>/<path>;<params>?<query>#<frag>
General URL components:
Component | Description | Default Value |
scheme | Which protocol to use when accessing a server to get a resource. | None |
user | The username some schemes require to access a resource. | anonymous |
password | The password that may be included after the username, separated by a colon (:). | Email address |
host | The hostname or dotted IP address of the server hosting the resource. | None |
port | The port number on which the server hosting the resource is listening. Many schemes have default port numbers. | Scheme-specific |
path | The local name for the resource on the server, separated from the previous URL components by a slash (/). The syntax of the path component is server- and scheme-specific. | None |
params | Used by some schemes to specify input parameters. Params are name/value pairs. A URL can contain multiple params fields, separated from each other and from the rest of the path by semicolons (;). | None |
query | Used by some schemes to pass parameters to active applications (such as databases, bulletin boards, search engines, and other Internet gateways). There is no common format for the contents of the query component. It is separated from the rest of the URL by the "?" character. | None |
frag | A name for a piece or part of the resource. The frag field is not passed to the server when referencing the object; it is used internally by the client. It is separated from the rest of the URL by the "#" character. | None |
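To see how the general format maps onto a concrete URL, here is a minimal sketch using Python's standard `urllib.parse`. The URL is a hypothetical one made up to exercise every component; it does not come from the chapter.

```python
from urllib.parse import urlparse

# Hypothetical URL containing every component of the general format.
url = "http://joe:joespasswd@www.joes-hardware.com:80/tools.html;type=d?item=12731#drills"
parts = urlparse(url)

# Map the parser's attribute names onto the component names in the table above.
components = {
    "scheme": parts.scheme,
    "user": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "params": parts.params,   # urlparse only splits params off the last path segment
    "query": parts.query,
    "frag": parts.fragment,
}

for name, value in components.items():
    print(f"{name:8} {value}")
```

Note that `urlparse` is only one possible parser; other libraries may name or group the components differently.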
Many servers require a username and password before you can access data through them. FTP servers are a common example of this.
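As a rough illustration of how a client might fall back to defaults when a URL omits credentials, consider the sketch below. The host `ftp.example.com`, the helper `credentials`, and the placeholder default password are all assumptions for illustration; real clients choose their own default password.

```python
from urllib.parse import urlsplit

DEFAULT_USER = "anonymous"
DEFAULT_PASSWORD = "guest@example.com"  # hypothetical placeholder, not a real client default

def credentials(url):
    """Return (user, password) from a URL, falling back to the defaults."""
    parts = urlsplit(url)
    user = parts.username or DEFAULT_USER
    password = parts.password or DEFAULT_PASSWORD
    return user, password

# No credentials, user only, and user plus password.
print(credentials("ftp://ftp.example.com/pub/gnu"))
print(credentials("ftp://joe@ftp.example.com/pub/gnu"))
print(credentials("ftp://joe:joespasswd@ftp.example.com/pub/gnu"))
```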
Many protocols require more information to work.
Applications interpreting URLs need these protocol parameters to access the resource.
To give applications the input parameters they need in order to talk to the server correctly, URLs have a params component. This component is just a list of name/value pairs in the URL, separated from the rest of the URL by ";" characters. For example: ftp://prep.ai.mit.edu/pub/gnu;type=d
Each segment can have its own params:
http://www.joes-hardware.com/hammers;sale=false/index.html;graphics=true
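Python's `urlparse` only separates params from the last path segment, so per-segment params need a little manual splitting. The following is a minimal sketch; the helper name `segment_params` is made up for this example.

```python
from urllib.parse import urlsplit

def segment_params(url):
    """Return a list of (segment, params_dict) pairs for each path segment."""
    path = urlsplit(url).path  # urlsplit leaves ";params" inside the path
    result = []
    for segment in path.split("/"):
        if not segment:
            continue  # skip the empty piece before the leading slash
        name, *raw_params = segment.split(";")
        params = {}
        for item in raw_params:
            key, _, value = item.partition("=")
            params[key] = value
        result.append((name, params))
    return result

url = "http://www.joes-hardware.com/hammers;sale=false/index.html;graphics=true"
for name, params in segment_params(url):
    print(name, params)
```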
If you use relative URLs, you can move a set of documents around and still have their links work, because they will be interpreted relative to the new base. This allows for things like mirroring content on other servers.
Base URLs
The base URLs can come from a few places:
Explicitly provided in the resource
An HTML document, for example, may include a <BASE> HTML tag defining the base URL by which to convert all relative URLs in that HTML document.
Base URL of the encapsulating resource
Use the URL of the resource in which it is embedded as a base.
No base URL
In some instances, there is no base URL. This often means that you have an absolute URL; however, sometimes you may just have an incomplete or broken URL.
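Resolving a relative URL against a base can be sketched with `urljoin` from Python's standard library. The mirror host `mirror.example.com` and the page names are assumptions for illustration. The same relative link resolves correctly against either base, which is what makes mirroring work.

```python
from urllib.parse import urljoin

relative = "./hammers.html"

# Base taken from the document that contains the link.
base = "http://www.joes-hardware.com/tools.html"
resolved = urljoin(base, relative)
print(resolved)  # http://www.joes-hardware.com/hammers.html

# The same documents copied to a hypothetical mirror: the link still works.
mirror_base = "http://mirror.example.com/joes/tools.html"
mirrored = urljoin(mirror_base, relative)
print(mirrored)  # http://mirror.example.com/joes/hammers.html

# An absolute URL ignores the base entirely.
absolute = urljoin(base, "http://www.joes-hardware.com/seasonal/index-fall.html")
print(absolute)
```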
Portable, Readable and Complete.
Escape sequences allow the encoding of arbitrary character values or data using a restricted subset of the US-ASCII character set, yielding portability and completeness.
An encoding scheme was devised to represent characters in a URL that are not safe. The encoding simply represents the unsafe character by an "escape" notation, consisting of a percent sign (%) followed by two hexadecimal digits that represent the ASCII code of the character.
Examples:
Character | ASCII code |
~ | 126(0x7E) |
SPACE | 32(0x20) |
% | 37(0x25) |
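The escape notation for the three characters in the table can be reproduced with a few lines of Python. This is a sketch of the rule as described here, limited to US-ASCII; the helper `escape_char` and the example URL are assumptions for illustration, and decoding uses the standard `urllib.parse.unquote`.

```python
from urllib.parse import unquote

def escape_char(ch: str) -> str:
    """Encode one US-ASCII character as '%' plus two hex digits."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("only US-ASCII characters are handled in this sketch")
    return f"%{code:02X}"

for ch in ("~", " ", "%"):
    print(repr(ch), "->", escape_char(ch))

# Decoding reverses the escape; the URL below is a hypothetical example.
encoded = "http://www.joes-hardware.com/more%20tools.html"
print(unquote(encoded))
```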
Several characters have been reserved to have special meaning inside of a URL:
Character | Reservation/Restriction |
% | Reserved as escape token for encoded characters |
/ | Reserved for delimiting path segments in the path component |
. | Reserved in the path component |
.. | Reserved in the path component |
# | Reserved as the fragment delimiter |
? | Reserved as the query-string delimiter |
; | Reserved as the params delimiter |
: | Reserved to delimit the scheme, user/password, and host/port components |
$, + | Reserved |
@&= | Reserved because they have special meaning in the context of some schemes |
{}|\^~[]` | Restricted because of unsafe handling by various transport agents, such as gateways |
<>" | Unsafe; should be encoded because these characters often have meaning outside the scope of the URL, such as delimiting the URL itself in a document |
0x00-0x1F,0x7F | Restricted; characters within these hex ranges fall within the nonprintable section of the US-ASCII character set |
>0x7F | Restricted; characters whose hex values fall within this range do not fall within the 7-bit range of the US-ASCII character set |
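The restricted rows of the table can be turned into a simple check that flags characters needing an escape. This is only a sketch: the function names and category labels are made up here, and the unsafe set assumes the backquote and double-quote characters belong to the two "unsafe" rows.

```python
# Characters the table marks as unsafe or restricted for transport agents
# (assumed set; includes backquote and double quote).
UNSAFE = set('{}|\\^~[]`<>"')

def restriction(ch):
    """Return a label for a restricted character, or None if it is not restricted."""
    code = ord(ch)
    if code <= 0x1F or code == 0x7F:
        return "nonprintable"
    if code > 0x7F:
        return "non-ascii"
    if ch in UNSAFE:
        return "unsafe"
    return None

def find_restricted(text):
    """List (index, character, label) for every restricted character in text."""
    found = []
    for i, ch in enumerate(text):
        label = restriction(ch)
        if label:
            found.append((i, ch, label))
    return found

print(find_restricted("/a<b\tc\u00e9"))
```

Reserved characters such as "/" or "?" are deliberately not flagged, since whether they need escaping depends on where they appear in the URL.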
URLs are really addresses, not true names. This means that a URL tells you where something is located at the moment, not where it will always be.