http-conduit-downloader-1.0.6: HTTP downloader tailored for web-crawler needs.

Safe Haskell: None

Network.HTTP.Conduit.Downloader

Description

HTTP downloader tailored for web-crawler needs.

  • Handles all possible http-conduit exceptions and returns human-readable error messages.
  • Handles some web server bugs (returning deflate data instead of gzip).
  • Ignores invalid SSL certificates.
  • Receives data in 32k blocks internally to reduce memory fragmentation on many parallel downloads.
  • Download timeout.
  • Total download size limit.
  • Returns HTTP headers for subsequent redownloads and handles Not Modified results.
  • Can be used with an external DNS resolver (hsdns-cache, for example).
  • Keep-alive connection pool (thanks to http-conduit).

Typical workflow in a crawler:

  withDnsCache $ \ c -> withDownloader $ \ d -> do
      ... -- got URL from queue
      ra <- resolveA c $ hostNameFromUrl url
      case ra of
          Left err -> ... -- uh oh, bad host
          Right ha -> do
              ... -- crawler politeness stuff (rate limits, domain queues)
              dr <- download d url (Just ha) opts
              case dr of
                  DROK dat redownloadOpts ->
                      ... -- analyze data, save redownloadOpts for next download
                  DRRedirect .. -> ...
                  DRNotModified -> ...
                  DRError e -> ...

It's highly recommended to use http://hackage.haskell.org/package/hsdns-cache for DNS resolution, since getAddrInfo, which http-conduit uses, can be buggy and inefficient when it has to resolve many hosts per second over a long period.

Download operations

urlGetContents :: String -> IO ByteString

Download a single URL with the default DownloaderSettings. Fails if the result is not DROK.

urlGetContentsPost :: String -> ByteString -> IO ByteString

Post data and download a single URL with the default DownloaderSettings. Fails if the result is not DROK.
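Because these one-shot helpers fail on anything other than DROK, a caller usually wants to trap the failure. The following is a minimal sketch, assuming that the failure surfaces as an ordinary IO exception and that the returned ByteString is the strict variant; fetchSize is a hypothetical helper name, not part of the package.

```haskell
import Control.Exception (SomeException, try)
import qualified Data.ByteString.Char8 as B
import Network.HTTP.Conduit.Downloader (urlGetContents)

-- | Fetch a page and report its size in bytes, or the failure message.
-- Assumption: urlGetContents signals a non-DROK result by throwing an
-- exception, and returns a strict ByteString.
fetchSize :: String -> IO (Either String Int)
fetchSize url = do
    r <- try (urlGetContents url) :: IO (Either SomeException B.ByteString)
    return (either (Left . show) (Right . B.length) r)
```

For crawler use, download is the better fit, since it returns the failure as a DRError value instead of failing.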

download

  :: Downloader
  -> String               -- URL
  -> Maybe HostAddress    -- Optional resolved HostAddress
  -> DownloadOptions
  -> IO DownloadResult

Perform a download.

downloadG

  :: (Request (ResourceT IO) -> ResourceT IO (Request (ResourceT IO)))
                          -- Function to modify Request (e.g. sign or make postRequest)
  -> Downloader
  -> String               -- URL
  -> Maybe HostAddress    -- Optional resolved HostAddress
  -> DownloadOptions
  -> IO DownloadResult

Generic version of download with the ability to modify the http-conduit Request.
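As an illustration of the request-modifier argument, the sketch below turns a download into a POST by lifting postRequest into the expected monadic shape with return. It assumes the ByteString accepted by postRequest is the strict variant; postForm is a hypothetical name introduced here.

```haskell
import qualified Data.ByteString.Char8 as B
import Network.HTTP.Conduit.Downloader

-- | POST a body to a URL through an existing Downloader.
-- No pre-resolved address (Nothing) and no conditional headers ([]) are passed.
postForm :: Downloader -> String -> B.ByteString -> IO DownloadResult
postForm d url body = downloadG (return . postRequest body) d url Nothing []
```

A signing function would slot into the same position, provided it has the type Request (ResourceT IO) -> ResourceT IO (Request (ResourceT IO)).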

data DownloadResult

Result of download operation.

Constructors

DROK ByteString DownloadOptions

Successful download with data and options for next download.

DRRedirect String

Redirect URL

DRError String

Error

DRNotModified

HTTP 304 Not Modified

type DownloadOptions = [String]

If-None-Match and/or If-Modified-Since headers.
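The options returned in DROK are meant to be handed back on the next request for the same URL. A minimal sketch of that round trip follows; describe and pollTwice are hypothetical helpers, and the empty list on the first request is assumed to mean "no conditional headers".

```haskell
import Network.HTTP.Conduit.Downloader

-- | Summarise a result as a short label.
describe :: DownloadResult -> String
describe (DROK _ _)     = "changed"
describe DRNotModified  = "unchanged"
describe (DRRedirect u) = "redirected to " ++ u
describe (DRError e)    = "error: " ++ e

-- | Download a URL, then download it again with the options saved from the
-- first response, so the server can answer with 304 Not Modified.
pollTwice :: String -> IO String
pollTwice url = withDownloader $ \d -> do
    first <- download d url Nothing []
    case first of
        DROK _ opts -> fmap describe (download d url Nothing opts)
        other       -> return (describe other)
```

In a real crawler the options would be stored alongside the URL between crawl cycles, not reused immediately.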

Downloader

data DownloaderSettings

Settings used in downloader.

Constructors

DownloaderSettings 

Fields

dsUserAgent :: ByteString

User agent string. Default: "Mozilla/5.0 (compatible; HttpConduitDownloader/1.0; +http://hackage.haskell.org/package/http-conduit-downloader)".

Be a good crawler. Provide your User-Agent please.

dsTimeout :: Int

Download timeout. Default: 30 seconds.

dsManagerSettings :: ManagerSettings

Conduit Manager settings. Default: ManagerSettings with SSL certificate checks removed.

dsMaxDownloadSize :: Int

Download size limit. Default: 10MB.

data Downloader

Keeps http-conduit Manager and DownloaderSettings.

withDownloader :: (Downloader -> IO a) -> IO a

Create a new Downloader, use it in the provided function, and then release it.

withDownloaderSettings :: DownloaderSettings -> (Downloader -> IO a) -> IO a

Create a new Downloader with provided settings, use it in the provided function, and then release it.
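A sketch of overriding a few settings is shown below. It rests on several assumptions not stated on this page: that DownloaderSettings has a Data.Default instance supplying the documented defaults, that dsTimeout is in seconds and dsMaxDownloadSize in bytes, and that dsUserAgent is a strict ByteString. The user agent string and the names politeSettings and fetchWith are made up for illustration.

```haskell
import Data.Default (def)
import qualified Data.ByteString.Char8 as B
import Network.HTTP.Conduit.Downloader

-- | Defaults with a custom user agent, a shorter timeout and a smaller size cap.
-- Assumption: `def` is available for DownloaderSettings.
politeSettings :: DownloaderSettings
politeSettings = def
    { dsUserAgent       = B.pack "ExampleCrawler/0.1 (+http://example.com/bot)"
    , dsTimeout         = 10            -- assumed to be seconds
    , dsMaxDownloadSize = 1024 * 1024   -- assumed to be bytes (1 MB)
    }

-- | Download one URL with the custom settings.
fetchWith :: String -> IO DownloadResult
fetchWith url = withDownloaderSettings politeSettings $ \d ->
    download d url Nothing []
```

A long-running crawler would create the Downloader once and share it across requests, so that the keep-alive connection pool is actually reused.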

Utils

postRequest :: ByteString -> Request a -> Request b

Make HTTP POST request.

sinkByteString :: MonadIO m => Int -> Sink ByteString m (Maybe ByteString)

Sink data using 32k buffers to reduce memory fragmentation. Returns Nothing if too much data was downloaded.
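The "Nothing on too much data" contract can be pictured with a simplified pure model over a list of chunks. This is not the library's implementation: it does no 32k buffering and does not use conduit, and it assumes the Int argument is a byte limit. collectLimited is a hypothetical name.

```haskell
import qualified Data.ByteString.Char8 as B

-- | Concatenate chunks in order, giving up with Nothing as soon as the
-- running total exceeds the limit.
collectLimited :: Int -> [B.ByteString] -> Maybe B.ByteString
collectLimited limit = go 0 []
  where
    go _ acc [] = Just (B.concat (reverse acc))
    go n acc (c : cs)
        | n' > limit = Nothing
        | otherwise  = go n' (c : acc) cs
      where
        n' = n + B.length c
```

Stopping at the first chunk that crosses the limit means an oversized response is abandoned early instead of being read to the end.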