arxiv-0.0.3: A client for the Arxiv API
Copyright(c) Tobias Schoofs
LicenseLGPL
Stabilityexperimental
Portabilityportable
Safe HaskellSafe-Inferred
LanguageHaskell98

Network.Api.Arxiv

Description

The ArXiv API is split in two parts: Request and Response. The Request part contains a simple language to define queries, a query parser and some helpers to navigate through the results of a mutlipage query (which, in fact, induces a new query).

The Response part contains an API to access the fields of the result based on TagSoup.

This library does not contain functions to actually execute and manage http requests. It is intended to be used with existing http libraries such as http-conduit. An example how to use the ArXiv library with http-conduit is included in this documentation.

Synopsis

Request

Requests are URL parameters, either "search_query" or "id_list". This module provides functions to build and parse these parameters, to create the full request string and to navigate through a multi-page request with a maximum number of items per page.

For details of the Arxiv request format, please refer to the Arxiv documentation.

baseUrl :: String Source #

The Arxiv base URL "arxiv.org"

apiUrl :: String Source #

The Arxiv API URL "export.arxiv.org/api"

apiQuery :: String Source #

The query string ("search_query=" or "id_list=")

data Field Source #

Field data type; a field consist of a field identifier (author, title, etc.) and a list of search terms.

Constructors

Ti [Term]

Title

Au [Term]

Author

Abs [Term]

Abstract

Co [Term]

Comment

Jr [Term]

Journal

Cat [Term]

Category

Rn [Term]

Report Number

Id [Term]

Article identifier

All [Term]

Any of the above

Instances

Instances details
Eq Field Source # 
Instance details

Defined in Network.Api.Arxiv

Methods

(==) :: Field -> Field -> Bool #

(/=) :: Field -> Field -> Bool #

Show Field Source # 
Instance details

Defined in Network.Api.Arxiv

Methods

showsPrec :: Int -> Field -> ShowS #

show :: Field -> String #

showList :: [Field] -> ShowS #

data Expression Source #

Expression data type. An expression is either a field or a logical connection of two expressions using the basic operators AND, OR and ANDNOT.

Constructors

Exp Field

Just a field

And Expression Expression

Logical "and"

Or Expression Expression

Logical "or"

AndNot Expression Expression

Logical "and . not"

Instances

Instances details
Eq Expression Source # 
Instance details

Defined in Network.Api.Arxiv

Show Expression Source # 
Instance details

Defined in Network.Api.Arxiv

(/*/) :: Expression -> Expression -> Expression infix 9 Source #

AND operator. The symbol was chosen because 0 * 1 = 0.

(/+/) :: Expression -> Expression -> Expression infix 9 Source #

OR operator. The symbol was chosen because 0 + 1 = 1.

(/-/) :: Expression -> Expression -> Expression infix 9 Source #

ANDNOT operator. The symbol was chosen because 1 - 1 = 0 and 1 - 0 = 1.

Expression Example

Expressions are intended to ease the construction of well-formed queries in application code. A simple example:

let au = Exp $ Au ["Knuth"]
    t1 = Exp $ Ti ["The Art of Computer Programming"]
    t2 = Exp $ Ti ["Concrete Mathematics"]
    ex = au /*/ (t1 /+/ t2)
 in ...

Queries

data Query Source #

Query data type.

You usually want to create a query like:

let e = (Exp $ Au ["Aaronson"]) /*/ (
          (Exp $ Ti ["quantum"]) /+/
          (Exp $ Ti ["complexity"]))
 in Query {
       qExp   = Just e,
       qIds   = ["0902.3175v2","1406.2858v1","0912.3825v1"],
       qStart = 0,
       qItems = 10}

Constructors

Query 

Fields

Instances

Instances details
Eq Query Source # 
Instance details

Defined in Network.Api.Arxiv

Methods

(==) :: Query -> Query -> Bool #

(/=) :: Query -> Query -> Bool #

Show Query Source # 
Instance details

Defined in Network.Api.Arxiv

Methods

showsPrec :: Int -> Query -> ShowS #

show :: Query -> String #

showList :: [Query] -> ShowS #

nextPage :: Query -> Query Source #

Prepares the query to fetch the next page adding "items per page" to "start index".

parseQuery :: String -> Either String Expression Source #

Parses an expression from a string. Please refer to the Arxiv documentation for details on query format.

Just a minor remark here: The operators OR, AND and ANDNOT are case sensitive. "andnot" would be interpreted as part of a title, for instance: "ti:functional+andnot+object+oriented" is just one title; "ti:functional+ANDNOT+object+oriented" would cause an error, because a field identifier (ti, au, etc.) is expected after "+ANDNOT+".

The other way round: the field content itself is not case sensitive, i.e. "ti:complexity" or "au:aaronson" is the same as "ti:Complexity" and "au:Aaronson" respectively. This is a feature of the very arXiv API.

You may want to refer to the comments under preprocess and exp2str for some more details on our interpretation of the Arxiv documentation.

preprocess :: String -> String Source #

This is an internal function used by parseQuery. It may be occasionally useful for direct use: It replaces " ", "(", ")" and """ by "+", "%28", "%29" and "%22" respectively.

Usually, these substitutions are performed when transforming a string to an URL, which should be done by your http library anyway (e.g. http-conduit). But this step is usually after parsing has been performed on the input string. (Considering a work flow like: parseQuery >>= mkQuery >>= parseUrl >>= execQuery.) The parser, however, accepts only the URL-encoded characters and, thus, some preprocessing may be necessary.

The other way round, this means that you may use parentheses, spaces and quotation marks instead of the URL encodings. But be careful! Do not introduce two successive spaces - we do not check for whitespace!

parseIds :: String -> [Identifier] Source #

Converts a string containing comma-separated identifiers into a list of Identifiers. As stated already: No whitespace!

mkQuery :: Query -> String Source #

Generates the complete query string including URL, query expression, id list and item control

exp2str :: Expression -> String Source #

Create a query string from an expression. Note that we create redundant parentheses, for instance "a AND b OR c" will be encoded as "a+AND+%28b+OR+c%29". The rationale is that the API specification is not clear on how expressions are parsed. The above expression could be understood as "a AND (b OR c)" or as "(a AND b) OR c". To avoid confusion, one should always use parentheses to group boolean expressions - even if some of these parentheses appear to be redundant under one or the other parsing strategy.

items2str :: Query -> String Source #

Converts the query to a string containing only the item control

ids2str :: [Identifier] -> String Source #

Converts a list of Identifier to a string with comma-separated identifiers

itemControl :: Int -> Int -> String Source #

Generates the item control of a query string according to first item and results per page:

  • Int: Start index for this page
  • Int: Number of results per page.

Response

Response processing expects [Tag String] as input (see TagSoup). The result produced by your http library (such as http-conduit) must be converted to [Tag String] before the result is passed to the response functions defined here (see also the example below).

The response functions extract information from the tag soup, either in String, Int or TagSoup format.

For details of the Arxiv Feed format, please refer to the Arxiv documentation.

totalResults :: [Tag String] -> Int Source #

Total results of the query

startIndex :: [Tag String] -> Int Source #

Start index of this page of results

itemsPerPage :: [Tag String] -> Int Source #

Number of items per page

getEntry :: [Tag String] -> ([Tag String], [Tag String]) Source #

Get the first entry in the tag soup. The function returns a tuple of

  • The entry (if any)
  • The rest of the tag soup following the first entry.

forEachEntry :: [Tag String] -> ([Tag String] -> r) -> [r] Source #

Loop through all entries in the result feed applying a function on each one. The results are returned as list. The function is similar to map with the arguments swapped (as in Foldable forM).

Arguments:

  • [Tag String]: The TagSoup through which we are looping
  • [Tag String] -> r: The function we are applying per entry; the TagSoup passed in to the function represents the current entry.

Example:

forEachEntry soup $ \e ->
  let y = case getYear e of
            "" -> "s.a."
            x  -> x
      a = case getAuthorNames e of
            [] -> "Anonymous"
            as -> head as ++ 
   in a ++ " (" ++ y ++ ")"

Would retrieve the name of the first author and the year of publication (like "Aaronson (2013)") from all entries.

forEachEntryM :: Monad m => [Tag String] -> ([Tag String] -> m r) -> m [r] Source #

Variant of forEachEntry for monadic actions.

forEachEntryM_ :: Monad m => [Tag String] -> ([Tag String] -> m ()) -> m () Source #

Variant of forEachEntryM for actions that do not return a result.

checkForError :: [Tag String] -> Either String () Source #

Checks if the feed contains an error message, i.e.

  • it has only one entry,
  • the title of this entry is "Error" and
  • its id field contains an error message, which is returned as Left.

Apparently, this function is not necessary, since the Arxiv site returns error feeds with status code 400 ("bad request"), which should be handled by your http library anyway.

exhausted :: [Tag String] -> Bool Source #

Checks whether the query is exhausted or not, i.e. whether all pages have been fetched already. The first argument is the entire response (not just a part of it).

getId :: [Tag String] -> String Source #

Gets the article identifier as it can be used in an "id_list" query, i.e. without the URL. The [Tag String] argument is expected to be a single entry.

getIdUrl :: [Tag String] -> String Source #

Gets the full contents of the id field (which contains an URL before the article identifier). The [Tag String] argument is expected to be a single entry.

getUpdated :: [Tag String] -> String Source #

Gets the contents of the "updated" field in this entry, i.e. the date when the article was last updated. Be aware that there is another "updated" field right below the root node of the result. Make sure your are operating on an entry, not on the root node!

getPublished :: [Tag String] -> String Source #

Gets the contents of the "published" field in this entry, i.e. the date when the article was last uploaded.

getYear :: [Tag String] -> String Source #

Gets the year of the "published" field in this entry.

getTitle :: [Tag String] -> String Source #

Gets the title of this entry.

getSummary :: [Tag String] -> String Source #

Gets the summary of this entry.

getComment :: [Tag String] -> String Source #

Gets author''s comment (in "arxiv:comment") of this entry.

getJournal :: [Tag String] -> String Source #

Gets the journal information (in "arxiv:journal_ref") of this entry.

getDoi :: [Tag String] -> String Source #

Gets the digital object identifier (in "arxiv:doi") of this entry.

data Link Source #

The Link data type

Constructors

Link 

Fields

  • lnkHref :: String

    The hyperlink

  • lnkType :: String

    The link type (a MIME type)

  • lnkTitle :: String

    The link title (e.g. "pdf" would be the link where we find the article in pdf format)

  • lnkRel :: String

    the link relation (e.g. "related" would point to the related information, such as the pdf document)

Instances

getLinks :: [Tag String] -> [Link] Source #

Gets all links in the entry.

getPdfLink :: [Tag String] -> Maybe Link Source #

Gets only the pdf link of this entry (if any).

getPdf :: [Tag String] -> String Source #

Gets the hyperlink to the pdf document of this entry (if any).

data Category Source #

Category data type

Constructors

Category 

Fields

Instances

Instances details
Eq Category Source # 
Instance details

Defined in Network.Api.Arxiv

Show Category Source # 
Instance details

Defined in Network.Api.Arxiv

getCategories :: [Tag String] -> [Category] Source #

Gets the categories of this entry.

getPrimaryCategory :: [Tag String] -> Maybe Category Source #

Gets the primary category of this entry (if any).

data Author Source #

The Author data type

Constructors

Author 

Fields

Instances

Instances details
Eq Author Source # 
Instance details

Defined in Network.Api.Arxiv

Methods

(==) :: Author -> Author -> Bool #

(/=) :: Author -> Author -> Bool #

Show Author Source # 
Instance details

Defined in Network.Api.Arxiv

getAuthors :: [Tag String] -> [Author] Source #

Gets the authors of this entry.

getAuthorNames :: [Tag String] -> [String] Source #

Gets the names of all authors of this entry.

A complete Example using http-conduit

module Main
where

  import qualified Network.Api.Arxiv as Ax
  import           Network.Api.Arxiv (Expression(..), 
                               Field(..), (/*/), (/+/))
  import           Network.Socket (withSocketsDo)
  import           Network.HTTP.Simple as HT
  import           Network.HTTP.Conduit (parseRequest)
  import           Network.HTTP.Types.Status
  import           Data.List (intercalate)
  import qualified Data.ByteString as B hiding (unpack) 
  import qualified Data.ByteString.Char8 as B  (unpack) 
  import           Data.Conduit ((.|))
  import qualified Data.Conduit as C
  import qualified Data.Conduit.List as CL
  import           Data.Function ((&))
  import           Text.HTML.TagSoup
  import           Control.Monad.Trans (liftIO)
  import           Control.Monad.Trans.Resource (MonadResource)
  import           Control.Applicative ((<$>))

  main :: IO ()
  main = withSocketsDo (execQuery makeQuery)
    
  makeQuery :: Ax.Query
  makeQuery = 
    let au = Exp $ Au ["Aaronson"]
        t1 = Exp $ Ti ["quantum"]
        t2 = Exp $ Ti ["complexity"]
        x  = au /*/ (t1 /+/ t2)
     in Ax.Query {
          Ax.qExp   = Just x,
          Ax.qIds   = [],
          Ax.qStart = 0,
          Ax.qItems = 25}

  type Soup = Tag String

  execQuery :: Ax.Query -> IO ()
  execQuery q = C.runConduitRes (searchAxv q .| outSnk)

  ----------------------------------------------------------------------
  -- Execute query and start a source
  ----------------------------------------------------------------------
  searchAxv :: MonadResource m => Ax.Query -> C.ConduitT () String m ()
  searchAxv q = 
    let s = Ax.mkQuery q
     in do rsp <- HT.httpBS =<< liftIO (parseRequest s)
           case getResponseStatus rsp of
             (Status 200 _) -> getSoup (getResponseBody rsp)
                               >>= results q
             st             -> error $ "Error:" ++ show st

  ----------------------------------------------------------------------
  -- Consume page by page
  ----------------------------------------------------------------------
  getSoup :: MonadResource m =>  
             B.ByteString -> C.ConduitT () String m [Soup]
  getSoup b = concat <$> (C.yield b .| toSoup .| CL.consume)

  ----------------------------------------------------------------------
  -- Receive a ByteString and yield Soup
  ----------------------------------------------------------------------
  toSoup :: MonadResource m => C.ConduitT B.ByteString [Soup] m ()
  toSoup = C.awaitForever (C.yield . parseTags . B.unpack)

  ----------------------------------------------------------------------
  -- Yield all entries and fetch next page
  ----------------------------------------------------------------------
  results :: MonadResource m =>
             Ax.Query -> [Soup] -> C.ConduitT () String m ()
  results q sp = 
     if Ax.exhausted sp 
       then C.yield ("EOT: " ++ show (Ax.totalResults sp) ++ " results")
       else Ax.forEachEntryM_ sp (C.yield . mkResult) 
            >> searchAxv (Ax.nextPage q)
  
  ----------------------------------------------------------------------
  -- Get data and format
  ----------------------------------------------------------------------
  mkResult :: [Soup] -> String
  mkResult sp = let aus = Ax.getAuthorNames sp
                    y   = Ax.getYear sp
                    tmp = Ax.getTitle sp & clean ['\n', '\r', '\t']
                    ti  = if null tmp then "No title" else tmp
                 in intercalate ", " aus ++ " (" ++ y ++ "): " ++ ti
    where clean _ [] = []
          clean d (c:cs) | c `elem` d =   clean d cs
                         | otherwise  = c:clean d cs

  ----------------------------------------------------------------------
  -- Sink results 
  ----------------------------------------------------------------------
  outSnk :: MonadResource m => C.ConduitT String C.Void m ()
  outSnk = C.awaitForever (liftIO . putStrLn)