Safe Haskell | Safe-Infered |
---|
URL parser, following RFC 3986 (http://tools.ietf.org/html/rfc3986).
- data URL = URL {
- scheme :: Scheme
- authority :: Maybe Authority
- path :: ByteString
- query :: ByteString
- fragment :: ByteString
- data Authority = Authority {
- userinfo :: ByteString
- host :: ByteString
- port :: Maybe Word16
- newtype Scheme = Scheme ByteString
- class Encode t where
- encode :: t -> ByteString
- class Parse t where
- userinfoOctet :: Word8 -> Bool
- userinfoP :: Parser ByteString
- regNameOctet :: Word8 -> Bool
- regNameP :: Parser ByteString
- percent :: Parser Word8
- pathRootlessP :: Parser ByteString
- segmentOctet :: Word8 -> Bool
- authorityPath :: Parser (Maybe Authority, ByteString)
- queryFragmentOctet :: Word8 -> Bool
- queryFragmentP :: Parser ByteString
- usingOnly :: Int -> Parser t -> Parser t
- withPercents :: (Word8 -> Bool) -> Parser ByteString
- percentEncode :: Word8 -> ByteString
- selectiveEncode :: (Word8 -> Bool) -> ByteString -> ByteString
- concatNonEmpty :: ByteString -> ByteString -> ByteString
- pathEncode :: ByteString -> ByteString
- fromString' :: Parse a => String -> Either String a
- fromRight :: Either [Char] t -> t
Documentation
URL "...refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism".
A breakdown of URLs, per the diagram from RFC 3986:
foo://example.com:8042/over/there?name=ferret#nose \_/ \______________/\_________/ \_________/ \__/ | | | | | scheme authority path query fragment | _____________________|__ / \ / \ urn:example:animal:ferret:nose
For the most part, URL parts are made of strings with percent encoding required of certain characters. The scheme is especially limited in the allowable data:
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
Note well that no percent encoding is allowed.
The authority section, nominally denoting userinfo@host:port
, is in
fact quite flexible, allowing percent encoding for the hostname and
userinfo section; only the port has a byte range restriction, to digits.
Since this datatype represents the data in a URL and not its particular
encoded form, we use ByteString
liberally.
URL | |
|
Authority | |
|
Class for encoding items from this module as URLs.
encode :: t -> ByteStringSource
Class for parsing URL-related datatypes.
userinfoOctet :: Word8 -> BoolSource
*( unreserved / pct-encoded / sub-delims / ":" )
regNameOctet :: Word8 -> BoolSource
*( unreserved / pct-encoded / sub-delims )
pathRootlessP :: Parser ByteStringSource
Paths are quite sophisticated, with 5 productions to handle the different
URI contexts in which they appear. However, for the purpose of URL
parsing, we can assume that paths are always separated from the authority
(even the empty authority) with a /
and thus can work with a relatively
simple subset of the productions in the RFC.
path-rootless = segment-nz *( "/" segment ) ... segment-nz = 1*pchar ... pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
Although literal slash runs are not permitted by the RFC, equivalent content can be encoded with percent encoding.
segmentOctet :: Word8 -> BoolSource
To parse the authority and path:
- we parse an authority and then optionally a slash and a path or
- we parse a single slash and then optionally a path.
withPercents :: (Word8 -> Bool) -> Parser ByteStringSource
Parse a bytestream, accepting either literal bytes matching the predicate or any percent encoded characters.
percentEncode :: Word8 -> ByteStringSource
Transform any octet to its percent encoded form.
selectiveEncode :: (Word8 -> Bool) -> ByteString -> ByteStringSource
Percent encode a ByteString
, ignoring octets that match the predicate.
pathEncode :: ByteString -> ByteStringSource
Slash runs are not allowed in encoded paths. Here, this is interpreted to mean that the first slash in path data, which would come after the slash separating the path and the scheme or authority, should be escaped.