Portability | portable |
---|---|
Stability | experimental |
Safe Haskell | None |
The ArXiv API is split in two parts: Request and Response. The Request part contains a simple language to define queries, a query parser and some helpers to navigate through the results of a mutlipage query (which, in fact, induces a new query).
The Response part contains an API to access the fields of the result based on TagSoup.
This library does not contain functions to actually execute and manage http requests. It is intended to be used with existing http libraries such as http-conduit. An example how to use the ArXiv library with http-conduit is included in this documentation.
- baseUrl :: String
- apiUrl :: String
- apiQuery :: String
- data Field
- data Expression
- (/*/) :: Expression -> Expression -> Expression
- (/+/) :: Expression -> Expression -> Expression
- (/-/) :: Expression -> Expression -> Expression
- data Query = Query {}
- nextPage :: Query -> Query
- parseQuery :: String -> Either String Expression
- preprocess :: String -> String
- parseIds :: String -> [Identifier]
- mkQuery :: Query -> String
- exp2str :: Expression -> String
- items2str :: Query -> String
- ids2str :: [Identifier] -> String
- itemControl :: Int -> Int -> String
- totalResults :: [Tag String] -> Int
- startIndex :: [Tag String] -> Int
- itemsPerPage :: [Tag String] -> Int
- getEntry :: [Tag String] -> ([Tag String], [Tag String])
- forEachEntry :: [Tag String] -> ([Tag String] -> r) -> [r]
- forEachEntryM :: Monad m => [Tag String] -> ([Tag String] -> m r) -> m [r]
- forEachEntryM_ :: Monad m => [Tag String] -> ([Tag String] -> m ()) -> m ()
- checkForError :: [Tag String] -> Either String ()
- exhausted :: [Tag String] -> Bool
- getId :: [Tag String] -> String
- getIdUrl :: [Tag String] -> String
- getUpdated :: [Tag String] -> String
- getPublished :: [Tag String] -> String
- getYear :: [Tag String] -> String
- getTitle :: [Tag String] -> String
- getSummary :: [Tag String] -> String
- getComment :: [Tag String] -> String
- getJournal :: [Tag String] -> String
- getDoi :: [Tag String] -> String
- data Link = Link {}
- getLinks :: [Tag String] -> [Link]
- getPdfLink :: [Tag String] -> Maybe Link
- getPdf :: [Tag String] -> String
- data Category = Category {}
- getCategories :: [Tag String] -> [Category]
- getPrimaryCategory :: [Tag String] -> Maybe Category
- data Author = Author {}
- getAuthors :: [Tag String] -> [Author]
- getAuthorNames :: [Tag String] -> [String]
Request
Requests are URL parameters, either "search_query" or "id_list". This module provides functions to build and parse these parameters, to create the full request string and to navigate through a multi-page request with a maximum number of items per page.
For details of the Arxiv request format, please refer to the Arxiv documentation.
Field data type; a field consist of a field identifier (author, title, etc.) and a list of search terms.
data Expression Source
Expression data type. An expression is either a field or a logical connection of two expressions using the basic operators AND, OR and ANDNOT.
Exp Field | Just a field |
And Expression Expression | Logical "and" |
Or Expression Expression | Logical "or" |
AndNot Expression Expression | Logical "and . not" |
(/*/) :: Expression -> Expression -> ExpressionSource
AND operator. The symbol was chosen because 0 * 1 = 0.
(/+/) :: Expression -> Expression -> ExpressionSource
OR operator. The symbol was chosen because 0 + 1 = 1.
(/-/) :: Expression -> Expression -> ExpressionSource
ANDNOT operator. The symbol was chosen because 1 - 1 = 0 and 1 - 0 = 1.
Expression Example
Expressions are intended to ease the construction of well-formed queries in application code. A simple example:
let au = Exp $ Au ["Knuth"] t1 = Exp $ Ti ["The Art of Computer Programming"] t2 = Exp $ Ti ["Concrete Mathematics"] ex = au /*/ (t1 /+/ t2) in ...
Queries
Query data type.
You usually want to create a query like:
let e = (Exp $ Au ["Aaronson"]) /*/ ( (Exp $ Ti ["quantum"]) /+/ (Exp $ Ti ["complexity"])) in Query { qExp = Just e, qIds = ["0902.3175v2","1406.2858v1","0912.3825v1"], qStart = 0, qItems = 10}
nextPage :: Query -> QuerySource
Prepares the query to fetch the next page adding "items per page" to "start index".
parseQuery :: String -> Either String ExpressionSource
Parses an expression from a string. Please refer to the Arxiv documentation for details on query format.
Just a minor remark here: The operators OR, AND and ANDNOT are case sensitive. "andnot" would be interpreted as part of a title, for instance: "ti:functional+andnot+object+oriented" is just one title; "ti:functional+ANDNOT+object+oriented" would cause an error, because a field identifier (ti, au, etc.) is expected after "+ANDNOT+".
The other way round: the field content itself is not case sensitive, i.e. "ti:complexity" or "au:aaronson" is the same as "ti:Complexity" and "au:Aaronson" respectively. This is a feature of the very arXiv API.
You may want to refer to the comments under
preprocess
and exp2str
for some more details
on our interpretation of the Arxiv documentation.
preprocess :: String -> StringSource
This is an internal function used by parseQuery
.
It may be occasionally useful for direct use:
It replaces " ", "(", ")" and """
by "+", "%28", "%29" and "%22"
respectively.
Usually, these substitutions are performed when transforming a string to an URL, which should be done by your http library anyway (e.g. http-conduit). But this step is usually after parsing has been performed on the input string. (Considering a work flow like: parseQuery >>= mkQuery >>= parseUrl >>= execQuery.) The parser, however, accepts only the URL-encoded characters and, thus, some preprocessing may be necessary.
The other way round, this means that you may use parentheses, spaces and quotation marks instead of the URL encodings. But be careful! Do not introduce two successive spaces - we do not check for whitespace!
parseIds :: String -> [Identifier]Source
Converts a string containing comma-separated identifiers
into a list of Identifier
s.
As stated already: No whitespace!
mkQuery :: Query -> StringSource
Generates the complete query string including URL, query expression, id list and item control
exp2str :: Expression -> StringSource
Create a query string from an expression. Note that we create redundant parentheses, for instance "a AND b OR c" will be encoded as "a+AND+%28b+OR+c%29". The rationale is that the API specification is not clear on how expressions are parsed. The above expression could be understood as "a AND (b OR c)" or as "(a AND b) OR c". To avoid confusion, one should always use parentheses to group boolean expressions - even if some of these parentheses appear to be redundant under one or the other parsing strategy.
ids2str :: [Identifier] -> StringSource
Converts a list of Identifier
to a string with comma-separated identifiers
itemControl :: Int -> Int -> StringSource
Response
Response processing expects [Tag String] as input (see TagSoup). The result produced by your http library (such as http-conduit) must be converted to [Tag String] before the result is passed to the response functions defined here (see also the example below).
The response functions extract information from the tag soup,
either in String
, Int
or TagSoup format.
For details of the Arxiv Feed format, please refer to the Arxiv documentation.
totalResults :: [Tag String] -> IntSource
Total results of the query
startIndex :: [Tag String] -> IntSource
Start index of this page of results
itemsPerPage :: [Tag String] -> IntSource
Number of items per page
getEntry :: [Tag String] -> ([Tag String], [Tag String])Source
Get the first entry in the tag soup. The function returns a tuple of
- The entry (if any)
- The rest of the tag soup following the first entry.
With getEntry, we can build a loop through all entries
in the result (which is actually implemented in forEachEntry
).
forEachEntry :: [Tag String] -> ([Tag String] -> r) -> [r]Source
Loop through all entries in the result feed
applying a function on each one.
The results are returned as list.
The function is similar to map
with the arguments swapped (as in Foldable forM
).
Arguments:
- [
Tag
String
]: The TagSoup through which we are looping - [
Tag
String
] -> r: The function we are applying per entry; the TagSoup passed in to the function represents the current entry.
Example:
forEachEntry soup $ \e -> let y = case getYear e of "" -> "s.a." x -> x a = case getAuthorNames e of [] -> "Anonymous" as -> head as ++ in a ++ " (" ++ y ++ ")"
Would retrieve the name of the first author and the year of publication (like Aaronson (2013)) from all entries.
forEachEntryM :: Monad m => [Tag String] -> ([Tag String] -> m r) -> m [r]Source
Variant of forEachEntry
for monadic actions.
forEachEntryM_ :: Monad m => [Tag String] -> ([Tag String] -> m ()) -> m ()Source
Variant of forEachEntryM
for actions
that do not return a result.
checkForError :: [Tag String] -> Either String ()Source
Checks if the feed contains an error message, i.e.
- it has only one entry,
- the title of this entry is "Error" and
- its id field contains an error message,
which is returned as
Left
.
Apparently, this function is not necessary, since the Arxiv site returns error feeds with status code 400 (bad request), which should be handled by your http library anyway.
exhausted :: [Tag String] -> BoolSource
Checks whether the query is exhausted or not, i.e. whether all pages have been fetched already. The first argument is the entire response (not just a part of it).
getId :: [Tag String] -> StringSource
Gets the article identifier as it can be used in an "id_list" query, i.e. without the URL. The [Tag String] argument is expected to be a single entry.
getIdUrl :: [Tag String] -> StringSource
Gets the full contents of the id field (which contains an URL before the article identifier). The [Tag String] argument is expected to be a single entry.
getUpdated :: [Tag String] -> StringSource
Gets the contents of the "update" field in this entry, i.e. the date when the article was last updated.
getPublished :: [Tag String] -> StringSource
Gets the contents of the "published" field in this entry, i.e. the date when the article was last uploaded.
getSummary :: [Tag String] -> StringSource
Gets the summary of this entry.
getComment :: [Tag String] -> StringSource
Gets author''s comment (in "arxiv:comment") of this entry.
getJournal :: [Tag String] -> StringSource
Gets the journal information (in "arxiv:journal_ref") of this entry.
getDoi :: [Tag String] -> StringSource
Gets the digital object identifier (in "arxiv:doi") of this entry.
The Link data type
getPdf :: [Tag String] -> StringSource
Gets the hyperlink to the pdf document of this entry (if any).
Category data type
getCategories :: [Tag String] -> [Category]Source
Gets the categories of this entry.
getPrimaryCategory :: [Tag String] -> Maybe CategorySource
Gets the primary category of this entry (if any).
The Author data type
getAuthors :: [Tag String] -> [Author]Source
Gets the authors of this entry.
getAuthorNames :: [Tag String] -> [String]Source
Gets the names of all authors of this entry.
A complete Example using http-conduit
module Main where import qualified Network.Api.Arxiv as Ax import Network.Api.Arxiv (Expression(..), Field(..), (/*/), (/+/)) import Network (withSocketsDo) import Network.HTTP.Conduit import Network.HTTP.Types.Status import Data.List (intercalate) import qualified Data.ByteString as B hiding (unpack) import qualified Data.ByteString.Char8 as B (unpack) import Data.Conduit (($$+-), (=$), ($$)) import qualified Data.Conduit as C import qualified Data.Conduit.List as CL import Text.HTML.TagSoup import Control.Monad.Trans (liftIO) import Control.Monad.Trans.Resource (MonadResource) import Control.Applicative ((<$>)) main :: IO () main = withSocketsDo (execQuery makeQuery) makeQuery :: Ax.Query makeQuery = let au = Exp $ Au ["Aaronson"] t1 = Exp $ Ti ["quantum"] t2 = Exp $ Ti ["complexity"] x = au /*/ (t1 /+/ t2) in Ax.Query { Ax.qExp = Just x, Ax.qIds = [], Ax.qStart = 0, Ax.qItems = 25} type Soup = Tag String execQuery :: Ax.Query -> IO () execQuery q = withManager $ \m -> searchAxv m q $$ outSnk ----------------------------------------------------------------- -- Execute query and start a source ----------------------------------------------------------------- searchAxv :: MonadResource m => Manager -> Ax.Query -> C.Source m String searchAxv m q = do u <- liftIO (parseUrl $ mkQuery q) rsp <- http u m case responseStatus rsp of (Status 200 _) -> getSoup rsp >>= results m q st -> error $ "Error:" ++ show st ----------------------------------------------------------------- -- Consume page by page ----------------------------------------------------------------- getSoup :: MonadResource m => Response (C.ResumableSource m B.ByteString) -> m [Soup] getSoup rsp = concat <$> (responseBody rsp $$+- toSoup =$ CL.consume) ----------------------------------------------------------------- -- Receive a ByteString and yield Soup ----------------------------------------------------------------- toSoup :: MonadResource m => C.Conduit B.ByteString m [Soup] toSoup = C.awaitForever (C.yield . parseTags . B.unpack) ------------------------------------------------------------------ -- Yield all entries and fetch next page ------------------------------------------------------------------ results :: MonadResource m => Manager -> Ax.Query -> [Soup] -> C.Source m String results m q sp = if Ax.exhausted sp then C.yield ("EOT: " ++ show (Ax.totalResults sp) ++ " results") else Ax.forEachEntryM_ sp (C.yield . mkResult) >> searchAxv m (Ax.nextPage q) ------------------------------------------------------------------ -- Get data and format somehow ------------------------------------------------------------------ mkResult :: [Soup] -> String mkResult sp = let aus = Ax.getAuthorNames sp y = Ax.getYear sp tmp = Ax.getTitle sp ti = if null tmp then "No title" else tmp in intercalate ", " aus ++ " (" ++ y ++ "): " ++ ti ------------------------------------------------------------------ -- Sink results ------------------------------------------------------------------ outSnk :: MonadResource m => C.Sink String m () outSnk = C.awaitForever (liftIO . putStrLn)