Copyright	(c) 2013 Peter Simons
License	BSD3
Maintainer	phlummox2@gmail.com
Stability	provisional
Portability	portable
Safe Haskell	Safe
Language	Haskell98

Text.ParserCombinators.Parsec.Rfc2822NS

Contents

Useful parser combinators
Miscellaneous tokens (section 3.2.6)
Date and Time Specification (section 3.3)
Address Specification (section 3.4)
- Addr-spec specification (section 3.4.1)
Overall message syntax (section 3.5)
Field definitions (section 3.6)
Miscellaneous obsolete tokens (section 4.1)
Obsolete folding white space (section 4.2)
Obsolete Date and Time (section 4.3)
Obsolete Addressing (section 4.4)
Obsolete header fields (section 4.5)

Description

This module provides parsers for the grammar defined in RFC2822, "Internet Message Format", http://www.faqs.org/rfcs/rfc2822.html.

Please note: The module is not particularly well tested.

Addendum for Nonstandard Version:

This module currently deviates from the RFC by:

Allowing for non-standard line endings.

These allowances are subject to change and should not be used when parsing incoming messages, only when parsing messages that have been stored on disk. The goal of these nonstandard parsers is to provide a higher probability of parsing common headers (rather than only those explicitly defined in the RFC), as well as to allow for potential oddities or changes that may occur during storage of an email message. These parsers have been rebranded so as not to conflict with the standard parsers available from the excellent hsemail package, upon which this package depends. Patches to this package only (namely 'hsemail-ns') should be sent to phlummox2@gmail.com; patches to the proper parsers can be sent to the original maintainer.

Documentation

crlf :: CharParser a String Source #

Useful parser combinators

maybeOption :: GenParser tok st a -> GenParser tok st (Maybe a) Source #

Return Nothing if the given parser doesn't match. This combinator is included in the latest parsec distribution as optionMaybe, but ghc-6.6.1 apparently doesn't have it.
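As a rough illustration of the behaviour described above, the combinator can be sketched with Parsec's own option. The name maybeOptionSketch is hypothetical and this is not necessarily how the module defines maybeOption.

```haskell
import Text.ParserCombinators.Parsec

-- Hypothetical re-implementation for illustration only; the module's own
-- definition of maybeOption may differ.
maybeOptionSketch :: GenParser tok st a -> GenParser tok st (Maybe a)
maybeOptionSketch p = option Nothing (fmap Just p)

main :: IO ()
main = do
  print (parse (maybeOptionSketch digit) "" "7x")  -- prints: Right (Just '7')
  print (parse (maybeOptionSketch digit) "" "x")   -- prints: Right Nothing
```

Note that, as with option, Nothing is only returned when the given parser fails without consuming input; wrap the argument in try if it may fail after consuming something.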

unfold :: CharParser a b -> CharParser a b Source #

unfold = between (optional cfws) (optional cfws)

header :: String -> CharParser a b -> CharParser a b Source #

Construct a parser for a message header line from the header's name and a parser for the body.
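The idea of building a header-line parser from a name and a body parser can be sketched as follows. This is a simplified stand-in under stated assumptions: headerSketch and countBody are hypothetical names, the header name is matched case-sensitively, and the line must end in a literal CRLF. The module's header may well be more lenient.

```haskell
import Text.ParserCombinators.Parsec

-- Simplified stand-in for a header combinator (an assumption, not the
-- module's implementation): "<name>:" followed by the body and a CRLF.
headerSketch :: String -> CharParser st a -> CharParser st a
headerSketch name body =
  between (string (name ++ ":")) (string "\r\n") body

-- Hypothetical body parser: optional spaces, then a decimal number.
countBody :: CharParser st Int
countBody = do
  skipMany (char ' ')
  digits <- many1 digit
  return (read digits)

main :: IO ()
main = print (parse (headerSketch "X-Count" countBody) "" "X-Count: 42\r\n")
-- prints: Right 42
```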

obs_header :: String -> CharParser a b -> CharParser a b Source #

Like header, but allows the obsolete white-space rules.

Primitive Tokens (section 3.2.1)

no_ws_ctl :: CharParser a Char Source #

Match any US-ASCII non-whitespace control character.

text :: CharParser a Char Source #

Match any US-ASCII character except for \r, \n.

specials :: CharParser a Char Source #

Match any of the RFC's "special" characters: ()<>[]:;@,.\".

Quoted characters (section 3.2.2)

quoted_pair :: CharParser a String Source #

Match a "quoted pair". All characters matched by text may be quoted. Note that the parser returns both characters, the backslash and the actual content.
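A minimal sketch of a parser that returns both characters of a quoted pair is shown here. The names textSketch and quotedPairSketch are hypothetical; the character range follows the "text" rule of RFC 2822 (US-ASCII without NUL, CR and LF), which may not be exactly what this module accepts.

```haskell
import Text.ParserCombinators.Parsec

-- RFC 2822 "text": any US-ASCII character except NUL, CR and LF.
-- (Assumption: the module's own text parser may accept a different set.)
textSketch :: CharParser st Char
textSketch = satisfy (\c -> c > '\NUL' && c <= '\DEL' && c /= '\r' && c /= '\n')

-- Return the backslash together with the character it quotes.
quotedPairSketch :: CharParser st String
quotedPairSketch = do
  _ <- char '\\'
  c <- textSketch
  return ['\\', c]

main :: IO ()
main = print (parse quotedPairSketch "" "\\\"rest")
-- prints: Right "\\\""
```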

Folding white space and comments (section 3.2.3)

fws :: CharParser a String Source #

Match "folding whitespace". That is any combination of wsp and crlf followed by wsp.

ctext :: CharParser a Char Source #

Match any non-whitespace, non-control character except for "(", ")", and "\". This is used to describe the legal content of comments.

Note: This parser accepts 8-bit characters, even though this is not legal according to the RFC. Unfortunately, 8-bit content in comments has become fairly common in the real world, so we'll just accept the fact.

comment :: CharParser a String Source #

Match a "comment". That is any combination of ctext, quoted_pairs, and fws between parentheses. Comments may nest.
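Nesting falls out naturally when the comment parser refers to itself. The following is only a sketch: commentSketch, plainTextSketch and quotedPairSketch are hypothetical names, folding whitespace is not treated specially, and returning the enclosing parentheses is a choice made here, not a statement about what comment returns.

```haskell
import Text.ParserCombinators.Parsec

-- A backslash followed by any character (simplified quoted pair).
quotedPairSketch :: CharParser st String
quotedPairSketch = do
  _ <- char '\\'
  c <- anyChar
  return ['\\', c]

-- One or more characters that are neither parentheses nor a backslash.
plainTextSketch :: CharParser st String
plainTextSketch = many1 (noneOf "()\\")

-- A comment is "(" ... ")", where the content may itself contain comments.
commentSketch :: CharParser st String
commentSketch = do
  _ <- char '('
  parts <- many (plainTextSketch <|> quotedPairSketch <|> commentSketch)
  _ <- char ')'
  return ("(" ++ concat parts ++ ")")

main :: IO ()
main = print (parse commentSketch "" "(a (nested) b) tail")
-- prints: Right "(a (nested) b)"
```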

cfws :: CharParser a String Source #

Match any combination of fws and comments.

Atom (section 3.2.4)

atext :: CharParser a Char Source #

Match any US-ASCII character except for control characters, specials, or space. atom and dot_atom are made up of this.

atom :: CharParser a String Source #

Match one or more atext characters and skip any preceding or trailing cfws.

dot_atom :: CharParser a String Source #

Match dot_atom_text and skip any preceding or trailing cfws.

dot_atom_text :: CharParser a String Source #

Match two or more atexts interspersed by dots.

Quoted strings (section 3.2.5)

qtext :: CharParser a Char Source #

Match any non-whitespace, non-control US-ASCII character except for "\" and """.

qcontent :: CharParser a String Source #

Match either qtext or quoted_pair.

quoted_string :: CharParser a String Source #

Match any number of qcontent between double quotes. Any cfws preceding or following the quoted string is skipped automatically.

Miscellaneous tokens (section 3.2.6)

word :: CharParser a String Source #

Match either atom or quoted_string.

phrase :: CharParser a [String] Source #

Match either one or more words or an obs_phrase.

utext :: CharParser a Char Source #

Match any non-whitespace, non-control US-ASCII character except for "\" and """.

unstructured :: CharParser a String Source #

Match any number of utext tokens.

"Unstructured text" is used in free text fields such as subject. Please note that any comments or whitespace that prefaces or follows the actual utext is included in the returned string.

Date and Time Specification (section 3.3)

date_time :: CharParser a CalendarTime Source #

Parse a date and time specification of the form

  Thu, 19 Dec 2002 20:35:46 +0200

where the weekday specification "Thu," is optional. The parser returns a CalendarTime, which is set to the appropriate values. Note, though, that not all fields of CalendarTime will necessarily be set correctly! When no weekday has been provided, the parser will set this field to Monday, regardless of whether the day actually is a Monday or not. Similarly, the day of the year will always be returned as 0. The timezone name will always be empty: "".

Nor will the date_time parser perform any consistency checking. It will accept

   40 Apr 2002 13:12 +0100

as a perfectly valid date.

In order to get all fields set to meaningful values, and in order to verify the date's consistency, you will have to feed it into any of the conversion routines provided in System.Time, such as toClockTime. (When doing this, keep in mind that most functions return local time. This will not necessarily be the time you're expecting.)
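If only the calendar date needs checking, a lighter alternative can be sketched with fromGregorianValid from the time package, which is a different library from the System.Time routines mentioned above. The helper isConsistentDate is a hypothetical name, and it assumes the month has already been converted to a number from 1 to 12.

```haskell
import Data.Maybe (isJust)
import Data.Time.Calendar (fromGregorianValid)

-- True if year/month/day name a real calendar date.
-- Assumption: the month is given as a number from 1 to 12.
isConsistentDate :: Integer -> Int -> Int -> Bool
isConsistentDate y m d = isJust (fromGregorianValid y m d)

main :: IO ()
main = do
  print (isConsistentDate 2002 4 40)   -- prints: False
  print (isConsistentDate 2002 12 19)  -- prints: True
```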

day_of_week :: CharParser a Day Source #

This parser matches a day_name or an obs_day_of_week (optionally wrapped in folding whitespace) and returns its Day value.

day_name :: CharParser a Day Source #

This parser will match the abbreviated weekday names ("Mon", "Tue", ...) and return the appropriate Day value.

date :: CharParser a (Int, Month, Int) Source #

This parser will match a date of the form "dd:mm:yyyy" and return a triple of the form (Int,Month,Int) - corresponding to (year,month,day).

year :: CharParser a Int Source #

This parser will match a four digit number and return its integer value. No range checking is performed.

month :: CharParser a Month Source #

This parser will match a month_name, optionally wrapped in folding whitespace, or an obs_month and return its Month value.

month_name :: CharParser a Month Source #

This parser will match the abbreviated month names ("Jan", "Feb", ...) and return the appropriate Month value.

day_of_month :: CharParser a Int Source #

day :: CharParser a Int Source #

Match a 1 or 2-digit number (day of month), recognizing both standard and obsolete folding syntax.

time :: CharParser a (TimeDiff, Int) Source #

This parser will match a time_of_day specification followed by a zone. It returns the tuple (TimeDiff,Int) corresponding to the return values of either parser.

time_of_day :: CharParser a TimeDiff Source #

This parser will match a time-of-day specification of "hh:mm" or "hh:mm:ss" and return the corresponding time as a TimeDiff.

hour :: CharParser a Int Source #

This parser will match a two-digit number and return its integer value. No range checking is performed.

minute :: CharParser a Int Source #

This parser will match a two-digit number and return its integer value. No range checking is performed.

second :: CharParser a Int Source #

This parser will match a two-digit number and return its integer value. No range checking takes place.

zone :: CharParser a Int Source #

This parser will match a timezone specification of the form "+hhmm" or "-hhmm" and return the zone's offset to UTC in seconds as an integer. obs_zone is matched as well.
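The conversion from "+hhmm" or "-hhmm" to an offset in seconds can be sketched like this. zoneSketch is a hypothetical name; it covers only the numeric form and does not handle the obsolete zone names that obs_zone accepts.

```haskell
import Text.ParserCombinators.Parsec

-- Parse "+hhmm" or "-hhmm" and return the offset to UTC in seconds.
-- Sketch only: obsolete zone names are not handled here.
zoneSketch :: CharParser st Int
zoneSketch = do
  sign <- (char '+' >> return 1) <|> (char '-' >> return (-1))
  hh <- count 2 digit
  mm <- count 2 digit
  return (sign * (read hh * 3600 + read mm * 60))

main :: IO ()
main = do
  print (parse zoneSketch "" "+0200")  -- prints: Right 7200
  print (parse zoneSketch "" "-0530")  -- prints: Right (-19800)
```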

Address Specification (section 3.4)

data NameAddr Source #

A NameAddr is composed of an optional realname and a mandatory e-mail address.

Constructors

NameAddr
Fields nameAddr_name :: Maybe String nameAddr_addr :: String

Instances

Eq NameAddr Source #
Methods (==) :: NameAddr -> NameAddr -> Bool # (/=) :: NameAddr -> NameAddr -> Bool #
Show NameAddr Source #
Methods showsPrec :: Int -> NameAddr -> ShowS # show :: NameAddr -> String # showList :: [NameAddr] -> ShowS #

address :: CharParser a [NameAddr] Source #

Parse a single mailbox or an address group and return the address(es).

mailbox :: CharParser a NameAddr Source #

Parse a name_addr or an addr_spec and return the address.

name_addr :: CharParser a NameAddr Source #

Parse an angle_addr, optionally prefaced with a display_name, and return the address.

angle_addr :: CharParser a String Source #

Parse an angle_addr or an obs_angle_addr and return the address.

group :: CharParser a [NameAddr] Source #

Parse a "group" of addresses. That is a display_name, followed by a colon, optionally followed by a mailbox_list, followed by a semicolon. The found address(es) are returned, which may be none. Here is an example:

>>> parse group "" "my group: user1@example.org, user2@example.org;"
Right [NameAddr {nameAddr_name = Nothing, nameAddr_addr = "user1@example.org"},NameAddr {nameAddr_name = Nothing, nameAddr_addr = "user2@example.org"}]

display_name :: CharParser a String Source #

Parse and return a phrase.

mailbox_list :: CharParser a [NameAddr] Source #

Parse a list of mailbox addresses, every two addresses being separated by a comma, and return the list of found address(es).

address_list :: CharParser a [NameAddr] Source #

Parse a list of addresses (as matched by address), every two addresses being separated by a comma, and return the list of found address(es).

Addr-spec specification (section 3.4.1)

addr_spec :: CharParser a String Source #

Parse an "address specification". That is a local_part, followed by an "@" character, followed by a domain. Return the complete address as String, ignoring any whitespace or any comments.

local_part :: CharParser a String Source #

Parse and return a "local part" of an addr_spec. That is either a dot_atom or a quoted_string.

domain :: CharParser a String Source #

Parse and return a "domain part" of an addr_spec. That is either a dot_atom or a domain_literal.

domain_literal :: CharParser a String Source #

Parse a "domain literal". That is a "[" character, followed by any amount of dcontent, followed by a terminating "]" character. The complete string is returned verbatim.

dcontent :: CharParser a String Source #

Parse and return any characters that are legal in a domain_literal. That is dtext or a quoted_pair.

dtext :: CharParser a Char Source #

Parse and return any ASCII characters except "[", "]", and "\".

Overall message syntax (section 3.5)

data GenericMessage a Source #

This data type represents a parsed Internet Message as defined in this RFC. It consists of an arbitrary number of header lines, represented in the Field data type, and a message body, which may be empty.

Constructors

Message [Field] a

Instances

Show a => Show (GenericMessage a) Source #
Methods showsPrec :: Int -> GenericMessage a -> ShowS # show :: GenericMessage a -> String # showList :: [GenericMessage a] -> ShowS #

type Message = GenericMessage String Source #

message :: CharParser a Message Source #

Parse a complete message as defined by this RFC and break it down into the separate header fields and the message body. Header lines that contain syntax errors will not cause the parser to abort. Rather, these headers will appear as OptionalFields (which are unparsed) in the resulting Message. A message must be really, really badly broken for this parser to fail.

This behaviour was chosen because it is impossible to predict what the user of this module considers to be a fatal error; traditionally, parsers are very forgiving when it comes to Internet messages.

If you want to implement a really strict parser, you'll have to put the appropriate parser together yourself. You'll find that this is rather easy to do. Refer to the fields parser for further details.

body :: CharParser a String Source #

A message body is just an unstructured sequence of characters.

Field definitions (section 3.6)

data Field Source #

This data type represents any of the header fields defined in this RFC. Each of the various constructors contains the return value of the corresponding parser.

Constructors

OptionalField String String
From [NameAddr]
Sender NameAddr
ReturnPath String
ReplyTo [NameAddr]
To [NameAddr]
Cc [NameAddr]
Bcc [NameAddr]
MessageID String
InReplyTo [String]
References [String]
Subject String
Comments String
Keywords [[String]]
Date CalendarTime
ResentDate CalendarTime
ResentFrom [NameAddr]
ResentSender NameAddr
ResentTo [NameAddr]
ResentCc [NameAddr]
ResentBcc [NameAddr]
ResentMessageID String
ResentReplyTo [NameAddr]
Received ([(String, String)], CalendarTime)
ObsReceived [(String, String)]

Instances

Show Field Source #
Methods showsPrec :: Int -> Field -> ShowS # show :: Field -> String # showList :: [Field] -> ShowS #

fields :: CharParser a [Field] Source #

This parser will parse an arbitrary number of header fields as defined in this RFC. For each field, an appropriate Field value is created, all of them making up the Field list that this parser returns.

If you look at the implementation of this parser, you will find that it uses Parsec's try modifier around all of the fields. The idea behind this is that fields that contain syntax errors fall back to the catch-all optional_field. Thus, this parser will hardly ever return a syntax error, which conforms with the idea that any message that can possibly be accepted should be.
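The try-and-fall-back pattern can be illustrated on a toy grammar. Everything in this sketch is hypothetical (the FieldSketch type, the "X-Count" header and all parser names); it only shows how a strict field parser wrapped in try gives way to a catch-all when the header body is malformed.

```haskell
import Text.ParserCombinators.Parsec

-- Toy field type: one structured header plus a catch-all.
data FieldSketch
  = CountS Int
  | OptionalS String String
  deriving (Eq, Show)

-- Strict parser: "X-Count:" followed by a number and CRLF.
countField :: CharParser st FieldSketch
countField = do
  _ <- string "X-Count:"
  skipMany (char ' ')
  n <- many1 digit
  _ <- string "\r\n"
  return (CountS (read n))

-- Catch-all: any name, a colon, and the raw text up to CRLF.
optionalFieldSketch :: CharParser st FieldSketch
optionalFieldSketch = do
  name <- many1 (noneOf ":\r\n")
  _ <- char ':'
  value <- manyTill anyChar (try (string "\r\n"))
  return (OptionalS name value)

-- try lets a malformed X-Count line be re-read by the catch-all.
fieldsSketch :: CharParser st [FieldSketch]
fieldsSketch = many (try countField <|> optionalFieldSketch)

main :: IO ()
main = print (parse fieldsSketch "" "X-Count: 12\r\nX-Count: oops\r\n")
-- prints: Right [CountS 12,OptionalS "X-Count" " oops"]
```

A stricter parser would simply omit the catch-all alternative, so that a malformed header makes the whole parse fail.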

The origination date field (section 3.6.1)

orig_date :: CharParser a CalendarTime Source #

Parse a "Date:" header line and return the date it contains as a CalendarTime.

Originator fields (section 3.6.2)

from :: CharParser a [NameAddr] Source #

Parse a "From:" header line and return the mailbox_list address(es) contained in it.

sender :: CharParser a NameAddr Source #

Parse a "Sender:" header line and return the mailbox address contained in it.

reply_to :: CharParser a [NameAddr] Source #

Parse a "Reply-To:" header line and return the address_list address(es) contained in it.

Destination address fields (section 3.6.3)

to :: CharParser a [NameAddr] Source #

Parse a "To:" header line and return the address_list address(es) contained in it.

cc :: CharParser a [NameAddr] Source #

Parse a "Cc:" header line and return the address_list address(es) contained in it.

bcc :: CharParser a [NameAddr] Source #

Parse a "Bcc:" header line and return the address_list address(es) contained in it.

Identification fields (section 3.6.4)

message_id :: CharParser a String Source #

Parse a "Message-Id:" header line and return the msg_id contained in it.

in_reply_to :: CharParser a [String] Source #

Parse an "In-Reply-To:" header line and return the list of msg_ids contained in it.

references :: CharParser a [String] Source #

Parse a "References:" header line and return the list of msg_ids contained in it.

msg_id :: CharParser a String Source #

Parse a "message ID" and return it. A message ID is almost identical to an angle_addr, but with stricter rules about folding and whitespace.

id_left :: CharParser a String Source #

Parse a "left ID" part of a msg_id. This is almost identical to the local_part of an e-mail address, but with stricter rules about folding and whitespace.

id_right :: CharParser a String Source #

Parse a "right ID" part of a msg_id. This is almost identical to the domain of an e-mail address, but with stricter rules about folding and whitespace.

no_fold_quote :: CharParser a String Source #

Parse one or more occurrences of qtext or quoted_pair and return the concatenated string. This makes up the id_left of a msg_id.

no_fold_literal :: CharParser a String Source #

Parse one or more occurrences of dtext or quoted_pair and return the concatenated string. This makes up the id_right of a msg_id.

Informational fields (section 3.6.5)

subject :: CharParser a String Source #

Parse a "Subject:" header line and return its contents verbatim. Please note that all whitespace and/or comments are preserved, i.e. the result of parsing "Subject: foo" is " foo", not "foo".
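Because surrounding whitespace is preserved, callers that want the bare subject text have to trim it themselves. A minimal sketch, with trimField as a hypothetical helper name:

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- Strip leading and trailing whitespace from a parsed field value.
trimField :: String -> String
trimField = dropWhileEnd isSpace . dropWhile isSpace

main :: IO ()
main = print (trimField " foo")
-- prints: "foo"
```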

comments :: CharParser a String Source #

Parse a "Comments:" header line and return its contents verbatim. Please note that all whitespace and/or comments are preserved, i.e. the result of parsing "Comments: foo" is " foo", not "foo".

keywords :: CharParser a [[String]] Source #

Parse a "Keywords:" header line and return the list of phrases found. Please note that each phrase is again a list of atoms, as returned by the phrase parser.

Resent fields (section 3.6.6)

resent_date :: CharParser a CalendarTime Source #

Parse a "Resent-Date:" header line and return the date it contains as CalendarTime.

resent_from :: CharParser a [NameAddr] Source #

Parse a "Resent-From:" header line and return the mailbox_list address(es) contained in it.

resent_sender :: CharParser a NameAddr Source #

Parse a "Resent-Sender:" header line and return the mailbox address contained in it.

resent_to :: CharParser a [NameAddr] Source #

Parse a "Resent-To:" header line and return the address_list address(es) contained in it.

resent_cc :: CharParser a [NameAddr] Source #

Parse a "Resent-Cc:" header line and return the address_list address(es) contained in it.

resent_bcc :: CharParser a [NameAddr] Source #

Parse a "Resent-Bcc:" header line and return the address_list address(es) contained in it. (This list may be empty.)

resent_msg_id :: CharParser a String Source #

Parse a "Resent-Message-ID:" header line and return the msg_id contained in it.

Trace fields (section 3.6.7)

return_path :: CharParser a String Source #

path :: CharParser a String Source #

received :: CharParser a ([(String, String)], CalendarTime) Source #

name_val_list :: CharParser a [(String, String)] Source #

name_val_pair :: CharParser a (String, String) Source #

item_name :: CharParser a String Source #

item_value :: CharParser a String Source #

Optional fields (section 3.6.8)

optional_field :: CharParser a (String, String) Source #

Parse an arbitrary header field and return a tuple containing the field_name and unstructured text of the header. The name will not contain the terminating colon.

field_name :: CharParser a String Source #

Parse and return an arbitrary header field name. That is one or more ftext characters.

ftext :: CharParser a Char Source #

Match and return any ASCII character except for control characters, whitespace, and ":".

Miscellaneous obsolete tokens (section 4.1)

obs_qp :: CharParser a String Source #

Match the obsolete "quoted pair" syntax, which - unlike quoted_pair - allowed any ASCII character to be specified when quoted. The parser will return both the backslash and the actual character.

obs_text :: CharParser a String Source #

Match the obsolete "text" syntax, which - unlike text - allowed "carriage returns" and "linefeeds". This is really weird; you had better consult the RFC for details. The parser will return the complete string, including those special characters.

obs_char :: CharParser a Char Source #

Match and return the obsolete "char" syntax, which - unlike character - did not allow "carriage return" and "linefeed".

obs_utext :: CharParser a String Source #

Match and return the obsolete "utext" syntax, which is identical to obs_text.

obs_phrase :: CharParser a [String] Source #

Match the obsolete "phrase" syntax, which - unlike phrase - allows dots between tokens.

obs_phrase_list :: CharParser a [String] Source #

Match a "phrase list" syntax and return the list of Strings that make up the phrase. In contrast to a phrase, the obs_phrase_list separates the individual words by commas. This syntax is - as you will have guessed - obsolete.

Obsolete folding white space (section 4.2)

obs_fws :: CharParser a String Source #

Parse and return an "obsolete fws" token. That is at least one wsp character, followed by an arbitrary number (including zero) of crlf followed by at least one more wsp character.

Obsolete Date and Time (section 4.3)

obs_day_of_week :: CharParser a Day Source #

Parse a day_name but allow for the obsolete folding syntax.

obs_year :: CharParser a Int Source #

Parse a year but allow for a two-digit number (obsolete) and the obsolete folding syntax.
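RFC 2822 (section 4.3) specifies how short years are to be interpreted: two-digit years from 00 to 49 have 2000 added, while two-digit years from 50 to 99 and all three-digit years have 1900 added. The sketch below implements that rule; normalizeYear is a hypothetical name, and whether obs_year applies exactly this normalisation is an assumption.

```haskell
-- Interpret a short (obsolete) year as RFC 2822 section 4.3 prescribes.
-- Assumption: the input is a non-negative number as read from the header.
normalizeYear :: Int -> Int
normalizeYear n
  | n <= 49   = 2000 + n  -- 00..49  -> 2000..2049
  | n <= 999  = 1900 + n  -- 50..999 -> 1950..2899
  | otherwise = n         -- four or more digits: taken as-is

main :: IO ()
main = print (map normalizeYear [2, 97, 102, 2002])
-- prints: [2002,1997,2002,2002]
```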

obs_month :: CharParser a Month Source #

Parse a month_name but allow for the obsolete folding syntax.

obs_day :: CharParser a Int Source #

Parse a day but allow for the obsolete folding syntax.

obs_hour :: CharParser a Int Source #

Parse an hour but allow for the obsolete folding syntax.

obs_minute :: CharParser a Int Source #

Parse a minute but allow for the obsolete folding syntax.

obs_second :: CharParser a Int Source #

Parse a second but allow for the obsolete folding syntax.

obs_zone :: CharParser a Int Source #

Match the obsolete zone names and return the appropriate offset.

Obsolete Addressing (section 4.4)

obs_angle_addr :: CharParser a String Source #

This parser matches the "obsolete angle address" syntax, a construct that used to be called "route address" in earlier RFCs. It differs from a standard angle_addr in two ways: (1) it allows far more liberal insertion of folding whitespace and comments and (2) the address may contain a "route" (which this parser ignores):

>>> parse obs_angle_addr "" "<@example1.org,@example2.org:joe@example.org>"
Right "<joe@example.org>"

obs_route :: CharParser a [String] Source #

This parser parses the "route" part of obs_angle_addr and returns the list of Strings that make up this route. Relies on obs_domain_list for the actual parsing.

obs_domain_list :: CharParser a [String] Source #

This parser parses a list of domain names, each of them prefaced with an "at". Multiple names are separated by a comma. The list of domains is returned - and may be empty.

obs_local_part :: CharParser a String Source #

Parse the obsolete syntax of a local_part, which allowed for more liberal insertion of folding whitespace and comments. The actual string is returned.

obs_domain :: CharParser a String Source #

Parse the obsolete syntax of a domain, which allowed for more liberal insertion of folding whitespace and comments. The actual string is returned.

obs_mbox_list :: CharParser a [NameAddr] Source #

This parser will match the obsolete syntax for a mailbox_list. This one is quite weird: An obs_mbox_list contains an arbitrary number of mailboxes - including none -, which are separated by commas. But you may have multiple consecutive commas without giving a mailbox. You may also have a valid obs_mbox_list that contains no mailbox at all. On the other hand, you must have at least one comma. The following example is valid:

>>> parse obs_mbox_list "" ","
Right []

But this one is not:

>>> parse obs_mbox_list "" "joe@example.org"
Left (line 1, column 16):
unexpected end of input
expecting obsolete syntax for a list of mailboxes

obs_addr_list :: CharParser a [NameAddr] Source #

This parser is identical to obs_mbox_list but parses a list of addresses rather than mailboxes. The main difference is that an address may contain groups. Please note that as of now, the parser will return a simple list of addresses; the grouping information is lost.