ch02: URLs and Resources

0. Guide

  • URL syntax and what the various URL components mean and do

  • URL shortcuts that many web clients support, including relative URLs and expandomatic URLs

  • URL encoding and character rules

  • Common URL schemes that support a variety of Internet information systems

  • The future of URLs, including uniform resource names (URNs), a framework to support objects that move from place to place while retaining stable names

1. Navigating the Internet's Resources

URL: identifies a resource by describing where it is located.

URN: identifies a resource by name, regardless of where it currently resides.

Example: http://www.joes-hardware.com/seasonal/index-fall.html

  • The first part of the URL (http) is the URL scheme. The scheme tells a web client how to access the resource. In this case, the URL says to use the HTTP protocol.

  • The second part of the URL (www.joes-hardware.com) is the server location. This tells the web client where the resource is hosted.

  • The third part of the URL (/seasonal/index-fall.html) is the resource path. The path tells which particular local resource on the server is being requested.
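The three parts described above can be pulled apart with Python's standard `urllib.parse` module. This is only a small sketch of the idea; the variable names are illustrative.

```python
from urllib.parse import urlsplit

url = "http://www.joes-hardware.com/seasonal/index-fall.html"
parts = urlsplit(url)

scheme = parts.scheme   # how to access the resource
server = parts.netloc   # where the resource is hosted
path = parts.path       # which local resource on that server

print(scheme, server, path)
```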

2. URL Syntax

URL format:

<scheme>://<user>:<password>@<host>:<port>/<path>;<params>?<query>#<frag>

General URL components:

| Component | Description | Default Value |
| --- | --- | --- |
| scheme | Which protocol to use when accessing a server to get a resource. | None |
| user | The username some schemes require to access a resource. | anonymous |
| password | The password that may be included after the username, separated by a colon (:). | <Email address> |
| host | The hostname or dotted IP address of the server hosting the resource. | None |
| port | The port number on which the server hosting the resource is listening. Many schemes have default port numbers. | Scheme-specific |
| path | The local name for the resource on the server, separated from the previous URL components by a slash (/). The syntax of the path component is server- and scheme-specific. | None |
| params | Used by some schemes to specify input parameters. Params are name/value pairs. A URL can contain multiple params fields, separated from each other and from the rest of the path by semicolons (;). | None |
| query | Used by some schemes to pass parameters to active applications (such as databases, bulletin boards, search engines, and other Internet gateways). There is no common format for the contents of the query component. It is separated from the rest of the URL by the "?" character. | None |
| frag | A name for a piece or part of the resource. The frag field is not passed to the server when referencing the object; it is used internally by the client. It is separated from the rest of the URL by the "#" character. | None |
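As a rough illustration of the general format, the sketch below splits a URL into the nine components using `urllib.parse.urlparse`. The example URL, the `split_url` helper, and the `DEFAULT_PORTS` map are assumptions made for this illustration and are not part of the notes above.

```python
from urllib.parse import urlparse

# Illustrative default ports for a few schemes; not an exhaustive list.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def split_url(url):
    """Return the nine general URL components as a dict."""
    p = urlparse(url)
    port = p.port if p.port is not None else DEFAULT_PORTS.get(p.scheme)
    return {
        "scheme": p.scheme,
        "user": p.username,      # None when the URL has no user part
        "password": p.password,  # None when the URL has no password part
        "host": p.hostname,
        "port": port,            # falls back to the scheme's default
        "path": p.path,
        "params": p.params,
        "query": p.query,
        "frag": p.fragment,
    }

# Hypothetical URL that exercises every component.
example = "http://joe:joespasswd@www.joes-hardware.com:8080/tools.html;type=d?item=12731#top"
components = split_url(example)
for name, value in components.items():
    print(f"{name:9} {value}")
```

Note that `urlparse` only separates params from the last path segment, so this sketch does not cover URLs that carry params on earlier segments.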

2.1 Usernames and Passwords

Many servers require a username and password before you can access data through them. FTP servers are a common example of this. Examples:

ftp://ftp.prep.ai.mit.edu/pub/gnu
ftp://anonymous@ftp.prep.ai.mit.edu/pub/gnu
ftp://anonymous:my_passwd@ftp.prep.ai.mit.edu/pub/gnu
http://joe:joespasswd@www.joes-hardware.com/sales_info.txt
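One way a client might read the credentials out of the URLs above is sketched here. The `credentials` helper is hypothetical, and falling back to "anonymous" when no user is present mirrors the default listed in the component table; the sketch leaves a missing password as `None` rather than guessing a default.

```python
from urllib.parse import urlsplit

def credentials(url, default_user="anonymous"):
    """Return (user, password) from a URL, defaulting the user name."""
    parts = urlsplit(url)
    user = parts.username or default_user
    return user, parts.password  # password is None when absent

urls = [
    "ftp://ftp.prep.ai.mit.edu/pub/gnu",
    "ftp://anonymous@ftp.prep.ai.mit.edu/pub/gnu",
    "ftp://anonymous:my_passwd@ftp.prep.ai.mit.edu/pub/gnu",
    "http://joe:joespasswd@www.joes-hardware.com/sales_info.txt",
]
for u in urls:
    print(credentials(u), u)
```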

2.2 Parameters

Many protocols require more information to work.

Applications interpreting URLs need these protocol parameters to access the resource.

To give applications the input parameters they need in order to talk to the server correctly, URLs have a params component. This component is just a list of name/value pairs in the URL, separated from the rest of the URL (and from each other) by ";" characters. For example: ftp://prep.ai.mit.edu/pub/gnu;type=d

Here there is one param, type=d, where the name of the param is "type" and its value is "d".

Each path segment can have its own params:

http://www.joes-hardware.com/hammers;sale=false/index.html;graphics=true
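A minimal sketch of reading per-segment params follows. The `segment_params` helper is hypothetical; it assumes each param is a `name=value` pair and ignores anything after a ";" that lacks an "=".

```python
from urllib.parse import urlsplit

def segment_params(url):
    """Return a list of (segment, params_dict) for each path segment."""
    path = urlsplit(url).path  # urlsplit leaves ';params' inside the path
    result = []
    for segment in path.split("/")[1:]:  # skip the empty piece before the first '/'
        name, *pairs = segment.split(";")
        params = dict(pair.split("=", 1) for pair in pairs if "=" in pair)
        result.append((name, params))
    return result

print(segment_params(
    "http://www.joes-hardware.com/hammers;sale=false/index.html;graphics=true"
))
print(segment_params("ftp://prep.ai.mit.edu/pub/gnu;type=d"))
```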

3. URL Shortcuts

3.1 Relative URLs

If you use relative URLs, you can move a set of documents around and still have their links work, because they will be interpreted relative to the new base. This allows for things like mirroring content on other servers.
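The mirroring idea can be sketched with `urllib.parse.urljoin`, which resolves a reference against a base URL. The page names and the mirror host below are made up for illustration.

```python
from urllib.parse import urljoin

relative = "./hammers.html"

# The same relative URL resolved against two different (hypothetical) bases.
original = urljoin("http://www.joes-hardware.com/tools.html", relative)
mirrored = urljoin("http://mirror.example.com/joes/tools.html", relative)

# An absolute URL ignores the base entirely.
absolute = urljoin("http://www.joes-hardware.com/tools.html",
                   "http://www.example.com/index.html")

print(original)
print(mirrored)
print(absolute)
```

Because the link itself never names a server, the document keeps working after it is copied to the mirror.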

Base URLs

The base URL can come from a few places:

  • Explicitly provided in the resource: An HTML document, for example, may include a <BASE> HTML tag defining the base URL against which all relative URLs in that document are resolved.

  • Base URL of the encapsulating resource: If no base is given explicitly, use the URL of the resource in which the relative URL is embedded as the base.

  • No base URL: In some instances, there is no base URL. This often means that you have an absolute URL; however, sometimes you may just have an incomplete or broken URL.

4. Shady Characters

URLs are designed to be portable, readable, and complete.

4.1 The URL Character Set

Escape sequences allow the encoding of arbitrary character values or data using a restricted subset of the US-ASCII character set, yielding portability and completeness.

4.2 Encoding Mechanisms

An encoding scheme was devised to represent characters in a URL that are not safe. The encoding simply represents the unsafe character by an "escape" notation, consisting of a percent sign (%) followed by two hexadecimal digits that represent the ASCII code of the character.

Examples:

| Character | ASCII code |
| --- | --- |
| ~ | 126 (0x7E) |
| SPACE | 32 (0x20) |
| % | 37 (0x25) |
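The escape notation is simple enough to sketch by hand. The `escape` helper below is hypothetical and handles a single ASCII character; decoding uses `urllib.parse.unquote` from the standard library. The file names in the decoding examples are illustrative.

```python
from urllib.parse import unquote

def escape(char):
    """Percent-encode one ASCII character as '%' plus two hex digits."""
    code = ord(char)
    if code > 0x7F:
        raise ValueError("this sketch handles single ASCII characters only")
    return "%{:02X}".format(code)

for ch in ["~", " ", "%"]:
    print(repr(ch), ord(ch), escape(ch))

# Decoding reverses the mapping.
print(unquote("http://www.joes-hardware.com/%7Ejoe"))
print(unquote("more%20tools.html"))
print(unquote("100%25satisfaction.html"))
```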

4.3 Character Restrictions

Several characters have been reserved to have special meaning inside of a URL:

| Character | Reservation/Restriction |
| --- | --- |
| % | Reserved as escape token for encoded characters |
| / | Reserved for delimiting (splitting up) path segments in the path component |
| . | Reserved in the path component |
| .. | Reserved in the path component |
| # | Reserved as the fragment delimiter |
| ? | Reserved as the query-string delimiter |
| ; | Reserved as the params delimiter |
| : | Reserved to delimit the scheme, user/password, and host/port components |
| $, + | Reserved |
| @ & = | Reserved because they have special meaning in the context of some schemes |
| { } \| \ ^ ~ [ ] \` | Restricted because of unsafe handling by various transport agents, such as gateways |
| < > " | Unsafe; should be encoded because these characters often have meaning outside the scope of the URL, such as delimiting the URL itself in a document |
| 0x00-0x1F, 0x7F | Restricted; characters within these hex ranges fall within the nonprintable section of the US-ASCII character set |
| >0x7F | Restricted; characters whose hex values fall within this range do not fall within the 7-bit range of the US-ASCII character set |
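In practice a client would usually let a library decide which characters to escape. The sketch below uses `urllib.parse.quote` with an empty `safe` set so that reserved delimiters inside a single path segment are encoded as well; the `encode_segment` helper and the sample strings are illustrative. Python's `quote` follows a newer URL specification than the table above, so its set of always-safe characters may not match the table exactly.

```python
from urllib.parse import quote

def encode_segment(segment):
    """Percent-encode one path segment, including reserved delimiters."""
    return quote(segment, safe="")

samples = [
    "100% satisfaction#1.html",  # '%', space, and '#' are all escaped
    "a/b",                       # '/' would otherwise split the segment
    "joe@home;v=1",              # '@', ';', and '=' are reserved
    "café",                      # characters above 0x7F are sent as UTF-8 bytes
]
for s in samples:
    print(s, "->", encode_segment(s))
```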

5. The Future

URLs are really addresses, not true names. This means that a URL tells you where something is located, for the moment.
