arxiv-0.0.1: A client for the Arxiv API

Portabilityportable
Stabilityexperimental
Safe HaskellNone

Network.Api.Arxiv

Contents

Description

The ArXiv API is split in two parts: Request and Response. The Request part contains a simple language to define queries, a query parser and some helpers to navigate through the results of a mutlipage query (which, in fact, induces a new query).

The Response part contains an API to access the fields of the result based on TagSoup.

This library does not contain functions to actually execute and manage http requests. It is intended to be used with existing http libraries such as http-conduit. An example how to use the ArXiv library with http-conduit is included in this documentation.

Synopsis

Request

Requests are URL parameters, either "search_query" or "id_list". This module provides functions to build and parse these parameters, to create the full request string and to navigate through a multi-page request with a maximum number of items per page.

For details of the Arxiv request format, please refer to the Arxiv documentation.

baseUrl :: StringSource

The Arxiv base URL "arxiv.org"

apiUrl :: StringSource

The Arxiv API URL "export.arxiv.org/api"

apiQuery :: StringSource

The query string ("search_query=" or "id_list=")

data Field Source

Field data type; a field consist of a field identifier (author, title, etc.) and a list of search terms.

Constructors

Ti [Term]

Title

Au [Term]

Author

Abs [Term]

Abstract

Co [Term]

Comment

Jr [Term]

Journal

Cat [Term]

Category

Rn [Term]

Report Number

Id [Term]

Article identifier

All [Term]

Any of the above

Instances

data Expression Source

Expression data type. An expression is either a field or a logical connection of two expressions using the basic operators AND, OR and ANDNOT.

Constructors

Exp Field

Just a field

And Expression Expression

Logical "and"

Or Expression Expression

Logical "or"

AndNot Expression Expression

Logical "and . not"

(/*/) :: Expression -> Expression -> ExpressionSource

AND operator. The symbol was chosen because 0 * 1 = 0.

(/+/) :: Expression -> Expression -> ExpressionSource

OR operator. The symbol was chosen because 0 + 1 = 1.

(/-/) :: Expression -> Expression -> ExpressionSource

ANDNOT operator. The symbol was chosen because 1 - 1 = 0 and 1 - 0 = 1.

Expression Example

Expressions are intended to ease the construction of well-formed queries in application code. A simple example:

 let au = Exp $ Au ["Knuth"]
     t1 = Exp $ Ti ["The Art of Computer Programming"]
     t2 = Exp $ Ti ["Concrete Mathematics"]
     ex = au /*/ (t1 /+/ t2)
  in ...

Queries

data Query Source

Query data type.

You usually want to create a query like:

 let e = (Exp $ Au ["Aaronson"]) /*/ (
           (Exp $ Ti ["quantum"]) /+/
           (Exp $ Ti ["complexity"]))
  in Query {
        qExp   = Just e,
        qIds   = ["0902.3175v2","1406.2858v1","0912.3825v1"],
        qStart = 0,
        qItems = 10}

Constructors

Query 

Fields

qExp :: Maybe Expression

The query expression

qIds :: [Identifier]

Id List

qStart :: Int

The first item we want to see

qItems :: Int

The number of items we want to see

Instances

nextPage :: Query -> QuerySource

Prepares the query to fetch the next page adding "items per page" to "start index".

parseQuery :: String -> Either String ExpressionSource

Parses an expression from a string. Please refer to the Arxiv documentation for details on query format.

Just a minor remark here: The operators OR, AND and ANDNOT are case sensitive. "andnot" would be interpreted as part of a title, for instance: "ti:functional+andnot+object+oriented" is just one title; "ti:functional+ANDNOT+object+oriented" would cause an error, because a field identifier (ti, au, etc.) is expected after "+ANDNOT+".

The other way round: the field content itself is not case sensitive, i.e. "ti:complexity" or "au:aaronson" is the same as "ti:Complexity" and "au:Aaronson" respectively. This is a feature of the very arXiv API.

You may want to refer to the comments under preprocess and exp2str for some more details on our interpretation of the Arxiv documentation.

preprocess :: String -> StringSource

This is an internal function used by parseQuery. It may be occasionally useful for direct use: It replaces " ", "(", ")" and """ by "+", "%28", "%29" and "%22" respectively.

Usually, these substitutions are performed when transforming a string to an URL, which should be done by your http library anyway (e.g. http-conduit). But this step is usually after parsing has been performed on the input string. (Considering a work flow like: parseQuery >>= mkQuery >>= parseUrl >>= execQuery.) The parser, however, accepts only the URL-encoded characters and, thus, some preprocessing may be necessary.

The other way round, this means that you may use parentheses, spaces and quotation marks instead of the URL encodings. But be careful! Do not introduce two successive spaces - we do not check for whitespace!

parseIds :: String -> [Identifier]Source

Converts a string containing comma-separated identifiers into a list of Identifiers. As stated already: No whitespace!

mkQuery :: Query -> StringSource

Generates the complete query string including URL, query expression, id list and item control

exp2str :: Expression -> StringSource

Create a query string from an expression. Note that we create redundant parentheses, for instance "a AND b OR c" will be encoded as "a+AND+%28b+OR+c%29". The rationale is that the API specification is not clear on how expressions are parsed. The above expression could be understood as "a AND (b OR c)" or as "(a AND b) OR c". To avoid confusion, one should always use parentheses to group boolean expressions - even if some of these parentheses appear to be redundant under one or the other parsing strategy.

items2str :: Query -> StringSource

Converts the query to a string containing only the item control

ids2str :: [Identifier] -> StringSource

Converts a list of Identifier to a string with comma-separated identifiers

itemControl :: Int -> Int -> StringSource

Generates the item control of a query string according to first item and results per page:

  • Int: Start index for this page
  • Int: Number of results per page.

Response

Response processing expects [Tag String] as input (see TagSoup). The result produced by your http library (such as http-conduit) must be converted to [Tag String] before the result is passed to the response functions defined here (see also the example below).

The response functions extract information from the tag soup, either in String, Int or TagSoup format.

For details of the Arxiv Feed format, please refer to the Arxiv documentation.

totalResults :: [Tag String] -> IntSource

Total results of the query

startIndex :: [Tag String] -> IntSource

Start index of this page of results

itemsPerPage :: [Tag String] -> IntSource

Number of items per page

getEntry :: [Tag String] -> ([Tag String], [Tag String])Source

Get the first entry in the tag soup. The function returns a tuple of

  • The entry (if any)
  • The rest of the tag soup following the first entry.

With getEntry, we can build a loop through all entries in the result (which is actually implemented in forEachEntry).

forEachEntry :: [Tag String] -> ([Tag String] -> r) -> [r]Source

Loop through all entries in the result feed applying a function on each one. The results are returned as list. The function is similar to map with the arguments swapped (as in Foldable forM).

Arguments:

  • [Tag String]: The TagSoup through which we are looping
  • [Tag String] -> r: The function we are applying per entry; the TagSoup passed in to the function represents the current entry.

Example:

 forEachEntry soup $ \e ->
   let y = case getYear e of
             "" -> "s.a."
             x  -> x
       a = case getAuthorNames e of
             [] -> "Anonymous"
             as -> head as ++ 
    in a ++ " (" ++ y ++ ")"

Would retrieve the name of the first author and the year of publication (like Aaronson (2013)) from all entries.

forEachEntryM :: Monad m => [Tag String] -> ([Tag String] -> m r) -> m [r]Source

Variant of forEachEntry for monadic actions.

forEachEntryM_ :: Monad m => [Tag String] -> ([Tag String] -> m ()) -> m ()Source

Variant of forEachEntryM for actions that do not return a result.

checkForError :: [Tag String] -> Either String ()Source

Checks if the feed contains an error message, i.e.

  • it has only one entry,
  • the title of this entry is "Error" and
  • its id field contains an error message, which is returned as Left.

Apparently, this function is not necessary, since the Arxiv site returns error feeds with status code 400 (bad request), which should be handled by your http library anyway.

exhausted :: [Tag String] -> BoolSource

Checks whether the query is exhausted or not, i.e. whether all pages have been fetched already. The first argument is the entire response (not just a part of it).

getId :: [Tag String] -> StringSource

Gets the article identifier as it can be used in an "id_list" query, i.e. without the URL. The [Tag String] argument is expected to be a single entry.

getIdUrl :: [Tag String] -> StringSource

Gets the full contents of the id field (which contains an URL before the article identifier). The [Tag String] argument is expected to be a single entry.

getUpdated :: [Tag String] -> StringSource

Gets the contents of the "update" field in this entry, i.e. the date when the article was last updated.

getPublished :: [Tag String] -> StringSource

Gets the contents of the "published" field in this entry, i.e. the date when the article was last uploaded.

getYear :: [Tag String] -> StringSource

Gets the year of the "published" field in this entry.

getTitle :: [Tag String] -> StringSource

Gets the title of this entry.

getSummary :: [Tag String] -> StringSource

Gets the summary of this entry.

getComment :: [Tag String] -> StringSource

Gets author''s comment (in "arxiv:comment") of this entry.

getJournal :: [Tag String] -> StringSource

Gets the journal information (in "arxiv:journal_ref") of this entry.

getDoi :: [Tag String] -> StringSource

Gets the digital object identifier (in "arxiv:doi") of this entry.

data Link Source

The Link data type

Constructors

Link 

Fields

lnkHref :: String

The hyperlink

lnkType :: String

The link type (a MIME type)

lnkTitle :: String

The link title (e.g. "pdf" would be the link where we find the article in pdf format)

lnkRel :: String

the link relation (e.g. "related" would point to the related information, such as the pdf document)

getLinks :: [Tag String] -> [Link]Source

Gets all links in the entry.

getPdfLink :: [Tag String] -> Maybe LinkSource

Gets only the pdf link of this entry (if any).

getPdf :: [Tag String] -> StringSource

Gets the hyperlink to the pdf document of this entry (if any).

data Category Source

Category data type

Constructors

Category 

Fields

catTerm :: String

The category term (e.g. "math-ph")

catScheme :: String

The category scheme

Instances

getCategories :: [Tag String] -> [Category]Source

Gets the categories of this entry.

getPrimaryCategory :: [Tag String] -> Maybe CategorySource

Gets the primary category of this entry (if any).

data Author Source

The Author data type

Constructors

Author 

Fields

auName :: String

Author name

auFil :: String

Author Affiliation

Instances

getAuthors :: [Tag String] -> [Author]Source

Gets the authors of this entry.

getAuthorNames :: [Tag String] -> [String]Source

Gets the names of all authors of this entry.

A complete Example using http-conduit

 module Main
 where
 
   import qualified Network.Api.Arxiv as Ax
   import           Network.Api.Arxiv (Expression(..), 
                                       Field(..), (/*/), (/+/))
   import           Network (withSocketsDo)
   import           Network.HTTP.Conduit
   import           Network.HTTP.Types.Status
   import           Data.List (intercalate)
   import qualified Data.ByteString as B hiding (unpack) 
   import qualified Data.ByteString.Char8 as B  (unpack) 
   import           Data.Conduit (($$+-), (=$), ($$))
   import qualified Data.Conduit as C
   import qualified Data.Conduit.List as CL
   import           Text.HTML.TagSoup
   import           Control.Monad.Trans (liftIO)
   import           Control.Monad.Trans.Resource (MonadResource)
   import           Control.Applicative ((<$>))
 
   main :: IO ()
   main = withSocketsDo (execQuery makeQuery)
     
   makeQuery :: Ax.Query
   makeQuery = 
     let au = Exp $ Au ["Aaronson"]
         t1 = Exp $ Ti ["quantum"]
         t2 = Exp $ Ti ["complexity"]
         x  = au /*/ (t1 /+/ t2)
      in Ax.Query {
           Ax.qExp   = Just x,
           Ax.qIds   = [],
           Ax.qStart = 0,
           Ax.qItems = 25}
 
   type Soup = Tag String
 
   execQuery :: Ax.Query -> IO ()
   execQuery q = withManager $ \m -> searchAxv m q $$ outSnk
 
   -----------------------------------------------------------------
   -- Execute query and start a source
   -----------------------------------------------------------------
   searchAxv :: MonadResource m =>
                Manager -> Ax.Query -> C.Source m String
   searchAxv m q = do
      u   <- liftIO (parseUrl $ mkQuery q)
      rsp <- http u m 
      case responseStatus rsp of
        (Status 200 _) -> getSoup rsp >>= results m q 
        st             -> error $ "Error:" ++ show st
 
   -----------------------------------------------------------------
   -- Consume page by page
   -----------------------------------------------------------------
   getSoup :: MonadResource m => 
              Response (C.ResumableSource m B.ByteString) -> m [Soup]
   getSoup rsp = concat <$> (responseBody rsp $$+- 
                                       toSoup =$ CL.consume)
 
   -----------------------------------------------------------------
   -- Receive a ByteString and yield Soup
   -----------------------------------------------------------------
   toSoup :: MonadResource m => C.Conduit B.ByteString m [Soup] 
   toSoup = C.awaitForever (C.yield . parseTags . B.unpack)
 
   ------------------------------------------------------------------
   -- Yield all entries and fetch next page
   ------------------------------------------------------------------
   results :: MonadResource m => 
              Manager -> Ax.Query -> [Soup] -> C.Source m String
   results m q sp = 
      if Ax.exhausted sp 
        then C.yield ("EOT: " ++ show (Ax.totalResults sp) ++ 
                      " results")
        else Ax.forEachEntryM_ sp (C.yield . mkResult) 
             >> searchAxv m (Ax.nextPage q)
   
   ------------------------------------------------------------------
   -- Get data and format somehow
   ------------------------------------------------------------------
   mkResult :: [Soup] -> String
   mkResult sp = let aus = Ax.getAuthorNames sp
                     y   = Ax.getYear sp
                     tmp = Ax.getTitle sp
                     ti  = if null tmp then "No title" else tmp
                  in intercalate ", " aus ++ " (" ++ y ++ "): " ++ ti
 
   ------------------------------------------------------------------
   -- Sink results 
   ------------------------------------------------------------------
   outSnk :: MonadResource m => C.Sink String m ()
   outSnk = C.awaitForever (liftIO . putStrLn)