http-conduit-downloader-1.0.32: HTTP downloader tailored for web-crawler needs.

Safe HaskellNone
LanguageHaskell98

Network.HTTP.Conduit.Downloader

Contents

Description

HTTP downloader tailored for web-crawler needs.

  • Handles all possible http-conduit exceptions and returns human readable error messages.
  • Handles some web server bugs (returning deflate data instead of gzip, invalid gzip encoding).
  • Uses OpenSSL instead of tls package (since tls doesn't handle all sites).
  • Ignores invalid SSL sertificates.
  • Receives data in 32k chunks internally to reduce memory fragmentation on many parallel downloads.
  • Download timeout.
  • Total download size limit.
  • Returns HTTP headers for subsequent redownloads and handles Not modified results.
  • Can be used with external DNS resolver (hsdns-cache for example).
  • Keep-alive connections pool (thanks to http-conduit).

Typical workflow in crawler:

 withDnsCache $  c -> withDownloader $  d -> do
 ... -- got URL from queue
 ra <- resolveA c $ hostNameFromUrl url
 case ra of
     Left err -> ... -- uh oh, bad host
     Right ha -> do
         ... -- crawler politeness stuff (rate limits, queues)
         dr <- download d url (Just ha) opts
         case dr of
             DROK dat redownloadOptions ->
                 ... -- analyze data, save redownloadOpts for next download
             DRRedirect .. -> ...
             DRNotModified -> ...
             DRError e -> ...
 

It's highly recommended to use http://hackage.haskell.org/package/concurrent-dns-cache (preferably with single resolver pointing to locally running BIND) for DNS resolution since getAddrInfo used in http-conduit can be buggy and ineffective when it needs to resolve many hosts per second for a long time.

Synopsis

Download operations

urlGetContents :: String -> IO ByteString Source #

Download single URL with default DownloaderSettings. Fails if result is not DROK.

urlGetContentsPost :: String -> ByteString -> IO ByteString Source #

Post data and download single URL with default DownloaderSettings. Fails if result is not DROK.

download Source #

Arguments

:: Downloader 
-> String

URL

-> Maybe HostAddress

Optional resolved HostAddress

-> DownloadOptions 
-> IO DownloadResult 

Perform download

downloadG Source #

Arguments

:: (Request -> ResourceT IO Request)

Function to modify Request (e.g. sign or make postRequest)

-> Downloader 
-> String

URL

-> Maybe HostAddress

Optional resolved HostAddress

-> DownloadOptions 
-> IO DownloadResult 

Generic version of download with ability to modify http-conduit Request.

rawDownload Source #

Arguments

:: (Request -> ResourceT IO Request)

Function to modify Request (e.g. sign or make postRequest)

-> Downloader 
-> String

URL

-> Maybe HostAddress

Optional resolved HostAddress

-> DownloadOptions 
-> IO (DownloadResult, Maybe RawDownloadResult) 

Even more generic version of download, which returns RawDownloadResult. RawDownloadResult is optional since it can not be determined on timeouts and connection errors.

type DownloadOptions = [String] Source #

If-None-Match and/or If-Modified-Since headers.

Downloader

data DownloaderSettings Source #

Settings used in downloader.

Constructors

DownloaderSettings 

Fields

  • dsUserAgent :: ByteString

    User agent string. Default: "Mozilla/5.0 (compatible; HttpConduitDownloader/1.0; +http://hackage.haskell.org/package/http-conduit-downloader)".

    Be a good crawler. Provide your User-Agent please.

  • dsTimeout :: Int

    Download timeout. Default: 30 seconds.

  • dsManagerSettings :: ManagerSettings

    Conduit Manager settings. Default: ManagerSettings with SSL certificate checks removed.

  • dsMaxDownloadSize :: Int

    Download size limit in bytes. Default: 10MB.

data Downloader Source #

Keeps http-conduit Manager and DownloaderSettings.

withDownloader :: (Downloader -> IO a) -> IO a Source #

Create a new Downloader, use it in the provided function, and then release it.

withDownloaderSettings :: DownloaderSettings -> (Downloader -> IO a) -> IO a Source #

Create a new Downloader with provided settings, use it in the provided function, and then release it.

Utils

postRequest :: ByteString -> Request -> Request Source #

Make HTTP POST request.

sinkByteString :: MonadIO m => Int -> ConduitT ByteString Void m (Maybe ByteString) Source #

Sink data using 32k buffers to reduce memory fragmentation. Returns Nothing if downloaded too much data.