http-conduit-downloader-1.0.21: HTTP downloader tailored for web-crawler needs.

Safe HaskellNone
LanguageHaskell98

Network.HTTP.Conduit.Downloader

Contents

Description

HTTP downloader tailored for web-crawler needs.

  • Handles all possible http-conduit exceptions and returns human readable error messages.
  • Handles some web server bugs (returning deflate data instead of gzip, invalid @gzip encoding).
  • Uses OpenSSL instead of tls package since it doesn't handle all sites.
  • Ignores invalid SSL sertificates.
  • Receives data in 32k blocks internally to reduce memory fragmentation on many parallel downloads.
  • Download timeout.
  • Total download size limit.
  • Returns HTTP headers for subsequent redownloads and handles Not modified results.
  • Can be used with external DNS resolver (hsdns-cache for example).
  • Keep-alive connections pool (thanks to http-conduit).

Typical workflow in crawler:

 withDnsCache $  c -> withDownloader $  d -> do
 ... -- got URL from queue
 ra <- resolveA c $ hostNameFromUrl url
 case ra of
     Left err -> ... -- uh oh, bad host
     Right ha -> do
         ... -- crawler politeness stuff (rate limits, domain queues)
         dr <- download d url (Just ha) opts
         case dr of
             DROK dat redownloadOpts ->
                 ... -- analyze data, save redownloadOpts for next download
             DRRedirect .. -> ...
             DRNotModified -> ...
             DRError e -> ...
 

It's highly recommended to use http://hackage.haskell.org/package/hsdns-cache for DNS resolution since getAddrInfo used in http-conduit can be buggy and ineffective when it needs to resolve many hosts per second for a long time.

Synopsis

Download operations

urlGetContents :: String -> IO ByteString Source

Download single URL with default DownloaderSettings. Fails if result is not DROK.

urlGetContentsPost :: String -> ByteString -> IO ByteString Source

Post data and download single URL with default DownloaderSettings. Fails if result is not DROK.

download Source

Arguments

:: Downloader 
-> String

URL

-> Maybe HostAddress

Optional resolved HostAddress

-> DownloadOptions 
-> IO DownloadResult 

Perform download

downloadG Source

Arguments

:: (Request -> ResourceT IO Request)

Function to modify Request (e.g. sign or make postRequest)

-> Downloader 
-> String

URL

-> Maybe HostAddress

Optional resolved HostAddress

-> DownloadOptions 
-> IO DownloadResult 

Generic version of download with ability to modify http-conduit Request.

data DownloadResult Source

Result of download operation.

Constructors

DROK ByteString DownloadOptions

Successful download with data and options for next download.

DRRedirect String

Redirect URL

DRError String

Error

DRNotModified

HTTP 304 Not Modified

type DownloadOptions = [String] Source

If-None-Match and/or If-Modified-Since headers.

Downloader

data DownloaderSettings Source

Settings used in downloader.

Constructors

DownloaderSettings 

Fields

dsUserAgent :: ByteString

User agent string. Default: "Mozilla/5.0 (compatible; HttpConduitDownloader/1.0; +http://hackage.haskell.org/package/http-conduit-downloader)".

Be a good crawler. Provide your User-Agent please.

dsTimeout :: Int

Download timeout. Default: 30 seconds.

dsManagerSettings :: ManagerSettings

Conduit Manager settings. Default: ManagerSettings with SSL certificate checks removed.

dsMaxDownloadSize :: Int

Download size limit in bytes. Default: 10MB.

data Downloader Source

Keeps http-conduit Manager and DownloaderSettings.

withDownloader :: (Downloader -> IO a) -> IO a Source

Create a new Downloader, use it in the provided function, and then release it.

withDownloaderSettings :: DownloaderSettings -> (Downloader -> IO a) -> IO a Source

Create a new Downloader with provided settings, use it in the provided function, and then release it.

Utils

postRequest :: ByteString -> Request -> Request Source

Make HTTP POST request.

sinkByteString :: MonadIO m => Int -> Sink ByteString m (Maybe ByteString) Source

Sink data using 32k buffers to reduce memory fragmentation. Returns Nothing if downloaded too much data.