URLb-0.0.1: DEPRECATED A simple, liberal URL parser.

Safe HaskellSafe-Infered

Network.URLb

Description

URL parser, following RFC 3986 (http://tools.ietf.org/html/rfc3986).

Synopsis

Documentation

data URL Source

URL "...refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism".

A breakdown of URLs, per the diagram from RFC 3986:

     foo://example.com:8042/over/there?name=ferret#nose
     \_/   \______________/\_________/ \_________/ \__/
      |           |            |            |        |
   scheme     authority       path        query   fragment
      |   _____________________|__
     / \ /                        \
     urn:example:animal:ferret:nose

For the most part, URL parts are made of strings with percent encoding required of certain characters. The scheme is especially limited in the allowable data:

  scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

Note well that no percent encoding is allowed.

The authority section, nominally denoting userinfo@host:port, is in fact quite flexible, allowing percent encoding for the hostname and userinfo section; only the port has a byte range restriction, to digits.

Since this datatype represents the data in a URL and not its particular encoded form, we use ByteString liberally.

class Encode t whereSource

Class for encoding items from this module as URLs.

Methods

encode :: t -> ByteStringSource

class Parse t whereSource

Class for parsing URL-related datatypes.

Methods

parser :: Parser tSource

userinfoOctet :: Word8 -> BoolSource

 *( unreserved / pct-encoded / sub-delims / ":" )

regNameOctet :: Word8 -> BoolSource

 *( unreserved / pct-encoded / sub-delims )

pathRootlessP :: Parser ByteStringSource

Paths are quite sophisticated, with 5 productions to handle the different URI contexts in which they appear. However, for the purpose of URL parsing, we can assume that paths are always separated from the authority (even the empty authority) with a / and thus can work with a relatively simple subset of the productions in the RFC.

  path-rootless = segment-nz *( "/" segment )

  ...

  segment-nz    = 1*pchar

  ...

  pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

Although literal slash runs are not permitted by the RFC, equivalent content can be encoded with percent encoding.

segmentOctet :: Word8 -> BoolSource

To parse the authority and path:

  • we parse an authority and then optionally a slash and a path or
  • we parse a single slash and then optionally a path.

withPercents :: (Word8 -> Bool) -> Parser ByteStringSource

Parse a bytestream, accepting either literal bytes matching the predicate or any percent encoded characters.

percentEncode :: Word8 -> ByteStringSource

Transform any octet to its percent encoded form.

selectiveEncode :: (Word8 -> Bool) -> ByteString -> ByteStringSource

Percent encode a ByteString, ignoring octets that match the predicate.

pathEncode :: ByteString -> ByteStringSource

Slash runs are not allowed in encoded paths. Here, this is interpreted to mean that the first slash in path data, which would come after the slash separating the path and the scheme or authority, should be escaped.