http-conduit-downloader-1.1.5: HTTP downloader tailored for web-crawler needs.
Safe HaskellSafe-Inferred
LanguageHaskell2010

Network.HTTP.Conduit.Downloader

Description

HTTP downloader tailored for web-crawler needs.

  • Handles all possible http-client exceptions and returns human readable error messages.
  • Handles some web server bugs (returning deflate data instead of gzip, invalid gzip encoding).
  • Uses OpenSSL instead of tls package (since tls doesn't handle all sites and works slower than OpenSSL).
  • Ignores invalid SSL sertificates.
  • Receives data in 32k chunks internally to reduce memory fragmentation on many parallel downloads.
  • Download timeout.
  • Total download size limit.
  • Returns HTTP headers for subsequent redownloads and handles Not modified results.
  • Can be used with external DNS resolver (hsdns-cache for example).
  • Keep-alive connections pool (thanks to http-client).

Typical workflow in crawler:

 withDnsCache $  c -> withDownloader $  d -> do
 ... -- got URL from queue
 ra <- resolveA c $ hostNameFromUrl url
 case ra of
     Left err -> ... -- uh oh, bad host
     Right ha -> do
         ... -- crawler politeness stuff (rate limits, queues)
         dr <- download d url (Just ha) opts
         case dr of
             DROK dat redownloadOptions ->
                 ... -- analyze data, save redownloadOpts for next download
             DRRedirect .. -> ...
             DRNotModified -> ...
             DRError e -> ...
 

It's highly recommended to use http://hackage.haskell.org/package/concurrent-dns-cache (preferably with single resolver pointing to locally running BIND) for DNS resolution since getAddrInfo used in http-client can be buggy and ineffective when it needs to resolve many hosts per second for a long time.

Synopsis

Download operations

urlGetContents :: String -> IO ByteString Source #

Download single URL with default DownloaderSettings. Fails if result is not DROK.

urlGetContentsPost :: String -> ByteString -> IO ByteString Source #

Post data and download single URL with default DownloaderSettings. Fails if result is not DROK.

download Source #

Arguments

:: Downloader 
-> String

URL

-> Maybe HostAddress

Optional resolved HostAddress

-> DownloadOptions 
-> IO DownloadResult 

Perform download

downloadG Source #

Arguments

:: (Request -> IO Request)

Function to modify Request (e.g. sign or make postRequest)

-> Downloader 
-> String

URL

-> Maybe HostAddress

Optional resolved HostAddress

-> DownloadOptions 
-> IO DownloadResult 

Generic version of download with ability to modify http-client Request.

rawDownload Source #

Arguments

:: (Request -> IO Request)

Function to modify Request (e.g. sign or make postRequest)

-> Downloader 
-> String

URL

-> Maybe HostAddress

Optional resolved HostAddress

-> DownloadOptions 
-> IO (DownloadResult, Maybe RawDownloadResult) 

Even more generic version of download, which returns RawDownloadResult. RawDownloadResult is optional since it can not be determined on timeouts and connection errors.

data DownloadResult Source #

Result of download operation.

Constructors

DROK ByteString DownloadOptions

Successful download with data and options for next download.

DRRedirect String

Redirect URL

DRError String

Error

DRNotModified

HTTP 304 Not Modified

type DownloadOptions = [String] Source #

If-None-Match and/or If-Modified-Since headers.

Downloader

data DownloaderSettings Source #

Settings used in downloader.

Constructors

DownloaderSettings 

Fields

  • dsUserAgent :: ByteString

    User agent string. Default: "Mozilla/5.0 (compatible; HttpConduitDownloader/1.0; +http://hackage.haskell.org/package/http-conduit-downloader)".

    Be a good crawler. Provide your User-Agent please.

  • dsTimeout :: Int

    Download timeout. Default: 30 seconds.

  • dsManagerSettings :: ManagerSettings

    Conduit Manager settings. Default: ManagerSettings with SSL certificate checks removed.

  • dsMaxDownloadSize :: Int

    Download size limit in bytes. Default: 10MB.

Instances

Instances details
Default DownloaderSettings Source # 
Instance details

Defined in Network.HTTP.Conduit.Downloader

data Downloader Source #

Keeps http-client Manager and DownloaderSettings.

withDownloader :: (Downloader -> IO a) -> IO a Source #

Create a new Downloader, use it in the provided function, and then release it.

withDownloaderSettings :: DownloaderSettings -> (Downloader -> IO a) -> IO a Source #

Create a new Downloader with provided settings, use it in the provided function, and then release it.

Utils

postRequest :: ByteString -> Request -> Request Source #

Make HTTP POST request.