Creating a BitTorrent client in Haskell #1

Jakub Okoński – October 5th, 2015

Haskell, BitTorrent

This is the first post in a series on writing a BitTorrent client in Haskell.

BitTorrent is a protocol for the practice of peer-to-peer file sharing that is used to distribute large amounts of data over the Internet. BitTorrent is one of the most common protocols for transferring large files, and peer-to-peer networks have been estimated to collectively account for approximately 43% to 70% of all Internet traffic (depending on geographical location) as of February 2009.

Source: Wikipedia

I chose Haskell for this project because I wanted to try out some of the features in real use. The synchronous non-blocking I/O and green threads were appealing and so I decided to handle every peer connection in a separate thread… Which leads to Software Transactional Memory to coordinate them. I’ve used it before, but I figured there will be a lot of state and that this will be a good opportunity to evaluate it.

Sources:

Before we dive deep into managing connections and actual protocol logic, let us first make sure that we speak the data exchange protocols. This post will be all about that. The first one is called Bencode and is used for .torrent files and for talking to the tracker. The other is the Peer Wire Protocol and is used by peers to communicate with each other. Both are binary protocols. We’re going to use the well known Attoparsec parser combinator library for Bencoding and Data.Binary for PWP.

Speaking the Bencode format

At it’s core, Bencoding is very similar to JSON. Even though it’s a binary format, the numbers are encoded in ASCII text so they are unaffected by endianness.

Let’s make a table of raw values and their Bencoded representations.

Raw Values Bencoded Notes
"announce" 8:announce The number before the semicolon represents the string’s length in bytes.
412 i412e The number is encoded in ASCII.
["ok"] l2:oke Lists are heterogeneous and contain multiple values.
[("ok", 0)] d2:oki0ee Maps are heterogeneous and contain multiple (string, value) pairs. The keys are ordered.

Now that we understand the possible Bencoded values, we can create a datatype to store them.

data BValue = String ByteString
            | Number Integer
            | List [BValue]
            | Dictionary [(ByteString, BValue)]

I’m using an Integer to store the number. This will handle both sign and arbitrarily large inputs. We will be converting this to an efficient integer type depending on the context later on.

Let’s parse every variant of a value as we go in Attoparsec:

string :: Parser BValue
string = do
  n <- P.decimal
  void $ P.char ':'
  String <$> P.take n

Nothing complicated here, we read the number of characters, the delimiter and finally the raw bytestring.

number :: Parser BValue
number = Number <$> (P.char 'i' *> P.signed P.decimal <* P.char 'e')

In this one we scan the surroundings and extract a signed decimal number.

list :: Parser BValue
list = List <$> (P.char 'l' *> P.many' value <* P.char 'e')

We’re scanning the surroundings again and parsing all the values inside. The value function will be revealed in a second. All it does is combine all the parsers.

dictionary :: Parser BValue
dictionary = do
  void $ P.char 'd'
  pairs <- P.many' ((,) <$> string <*> value)
  void $ P.char 'e'
  return $ Dictionary $ fixPair <$> pairs
  where fixPair (String s, v) = (s, v)

Here we take care of the surroundings and extract multiple pairs of string-value pairs. Then we use fixPair to extract the raw bytestring from our boxed String subtype.

And to combine those variants into a full parser.

value :: Parser BValue
value = string <|> number <|> list <|> dictionary

Let’s test the parser on a chunk of a .torrent file I extracted.

$ cat > input
d8:announce42:http://tracker.archlinux.org:6969/announce
13:creation datei1438414567e4:infod6:lengthi688914432e
4:name29:archlinux-2015.08.01-dual.iso12:piece lengthi524288eee
$ stack ghci
...
*Bencoding> input <- B.readFile "input"
*Bencoding> :set -XOverloadedStrings
*Bencoding> let join = B.intercalate ""
*Bencoding> let joined = join $ B.split (fromIntegral $ fromEnum '\n') input
*Bencoding> P.parseOnly value joined
Right (Dictionary [
  ("announce",String "http://tracker.archlinux.org:6969/announce"),
  ("creation date",Number 1438414567),
  ("info",Dictionary [
    ("length",Number 688914432),
    ("name",String "archlinux-2015.08.01-dual.iso"),
    ("piece length",Number 524288)]
  )])

Output formatted by me.

Because multiline input is not convenient in GHCi, I split it up into lines, save it in a file and read it in in the interpreter. After reassembling the input, we can see the parser in action.

Lenses to query Bencoded values

Because traversing and extracting data from our BValue structure would involve a ton of pattern matches and nesting, we’ll define lenses to help us with that. I’m using the lens-family package. I won’t go into detail here, there are many lenses tutorials out there. We’re only defining traversal lenses, so this should be simple - just look where function f is applied.

bstring :: Traversal' BValue ByteString
bstring f (String s) = String <$> f s
bstring _ bv = pure bv

bnumber :: Traversal' BValue Integer
bnumber f (Number n) = Number <$> f n
bnumber _ bv = pure bv

blist :: Traversal' BValue BValue
blist f (List xs) = List <$> traverse f xs
blist _ bv = pure bv

bkey :: ByteString -> Traversal' BValue BValue
bkey k f bv@(Dictionary m) = case lookup k m of
                               Just v -> f v
                               Nothing -> pure bv
bkey _ _ bv = pure bv

I’m not an expert on lenses, but this worked pretty well for me so far.

Testing it out in GHCi, let’s continue where we left off in the earlier session:

...
*Bencoding> let Right result = P.parseOnly value joined
*Bencoding> result ^? bkey "announce" . bstring
Just "http://tracker.archlinux.org:6969/announce"
*Bencoding> result ^? bkey "info" . bkey "length" . bnumber
Just 688914432

I have been associating .torrent files with Bencoding until this point and it’s time to expand on that.

Those files actually have a strict schema called Metainfo. The structure is documented here.

Lenses for our Bencoded values are convenient, but we can do better by repackaging that data into a simpler data structure with a fixed schema. This will also serve as a validation layer for us, as we’ll see in a second.

While the BitTorrent protocol allows downloading multiple files per torrent, let’s skip that for now. Furthermore, we’ll fetch a subset of the data, just what we need to move on. I’m more interested in getting it to work first, then expanding features.

data MetaInfo = MetaInfo {
    info     :: InfoDictionary
  , infoHash :: ByteString
  , announce :: ByteString
  }

data InfoDictionary = InfoDictionary {
    pieceLength :: Word64
  , pieces      :: ByteString
  , name        :: ByteString
  , length      :: Word64
  }

pieces store 20-byte SHA1 hashes for each piece. These hashes are used to validate every piece the client downloads.

To extract the data from raw BValues into those types, I’m using Applicative Maybe to make this elegant.

parseMetaInfo :: BValue -> Maybe MetaInfo
parseMetaInfo bv = MetaInfo
                <$> (bv ^? bkey "info" >>= parseInfoDictionary)
                <*> (hash . serialize <$> bv ^? bkey "info")
                <*> bv ^? bkey "announce" . bstring

parseInfoDictionary :: BValue -> Maybe InfoDictionary
parseInfoDictionary bv = InfoDictionary
                      <$> (fromIntegral <$> bv ^? bkey "piece length" . bnumber)
                      <*> bv ^? bkey "pieces" . bstring
                      <*> bv ^? bkey "name" . bstring
                      <*> (fromIntegral <$> bv ^? bkey "length" . bnumber)

I’m serializing back the info dictionary to Bencoding and then hashing it with a SHA1. This is referred to as the Info hash and is used by the protocol to uniquely identify this torrent. We’ll use this later for things like asking trackers about peers and handshaking them.

Peer Wire Protocol

Messages in the Peer Wire Protocol begin with a 4 byte integer describing the remaining length of the message. Peers use these messages to talk to each other. Some messages exist to modify the state and drive the protocol, while others are used for requesting & transferring data between the peers.

Before I describe each message, let’s see a typical exchange so it’s easier to understand what each message does later on.

When a peer reaches out to another peer, it sets up a TCP connection with that peer. Handshakes are exchanged to identify the protocol, the version, extensions and so on. Bitfields are then exchanged to inform the other peer about our progress. A bitfield is just an efficiently packed array of bits that signifies completion of each piece in the torrent.

Every peer remembers four things:

  • if he’s choking the other peer,
  • if the other peer is choking him,
  • if he’s interested in the other peer,
  • if the other peer is interested in him

These four booleans are enough to coordinate the protocol. Peers start out as choked and uninterested, so how do we get to actually download some data? We first express interest in the other peer, letting him know we want to request some chunks. Then, we wait for the other peer to unchoke us. It’s entirely possible that he never will, depending on his algorithm, the limit of connections, etc. The BitTorrent protocol lets clients decide how to handle these messages. They are free to implement their own algorithms/heuristics. Finally, when the other peer unchokes us, we can request chunks and the other client must respond with data.

Let’s now go over the variants of PWP messages and see their hexdumps.

Keep alive

This is the most basic message used to keep the connections alive. Clients send it periodically.

00 00 00 00
     ^
     |
  length=0

Choke

This message tells the other peer that it’s choked and cannot request any data.

00 00 00 01    00
     ^         ^
     |         |
  length=1    id=0

Unchoke

This message tells the other peer that it’s unchoked and may request data.

00 00 00 01    01
     ^         ^
     |         |
  length=1    id=1

Interested

This message tells the other peer that we are interested in his data. It’s a hint for the other peer to unchoke us.

00 00 00 01    02
     ^         ^
     |         |
  length=1    id=2

Uninterested

This message tells the other peer we are not interested in his data anymore. He can choke us to use the upload slot for someone else.

00 00 00 01    03
     ^         ^
     |         |
  length=1    id=3

Have

This message tells the other peer that we have completed a piece.

00 00 00 05    04     00 00 00 05
     ^         ^           ^
     |         |           |
  length=5    id=4     pieceid=5

Bitfield

Sends the full bitfield (piece download progress) to the other peer.

00 00 00 05    05     ff fd ff ff
     ^         ^           ^
     |         |           |
 length=1+n   id=5      bitfield
                        length=n

In this example, the bitfield indicates all pieces except 14th as completed. A bitfield of 4 bytes could store the status of up to 32 pieces.

It’s interesting to note how small the overhead of the protocol is. The standard advises selecting a piece size (when creating the torrent) so that our Metainfo doesn’t exceed 70KB. If we assume there’s 64KB of just piece hashes, that makes 3200 pieces, which will make the bitfield only 400 bytes. On top of that, this message is sent only once per peer connection.

Request

Requests a chunk of n bytes inside a piece at offset. To download a full piece, the client has to sent out requests like those. The standard practice seems to be to request data at chunks of 215 bytes.

00 00 00 0d      06      00 00 00 01   00 00 00 02  00 00 00 03
     ^           ^            ^             ^            ^
     |           |            |             |            |
  length=13     id=6      pieceid=1      offset=2    length=3

Piece

Transfers data to the other peer.

00 00 00 0d      07    00 00 00 01  00 00 00 02   74 65 73 74
     ^           ^          ^            ^             ^
     |           |          |            |             |
  length=9+n    id=7    pieceid=1     offset=2     data="test"
                                                   length=n

Cancel

Cancels an outstanding request for data.

00 00 00 0d      08      00 00 00 01   00 00 00 02  00 00 00 03
     ^           ^            ^             ^            ^
     |           |            |             |            |
  length=13     id=8      pieceid=1      offset=2    length=3

Putting all of that together into a datatype and a Binary instance:

data PWP = KeepAlive
         | Choke
         | Unchoke
         | Interested
         | Uninterested
         | Have Word32
         | Bitfield ByteString
         | Request Word32 Word32 Word32
         | Piece Word32 Word32 ByteString
         | Cancel Word32 Word32 Word32

The only thing worth noting here is the use of unsigned 32bit integers as required by the protocol.

instance Binary PWP where
  put KeepAlive =
    put (0 :: Word32)
  put Choke = do
    put (1 :: Word32)
    put (0 :: Word8)
  put Unchoke = do
    put (1 :: Word32)
    put (1 :: Word8)
  put Interested = do
    put (1 :: Word32)
    put (2 :: Word8)
  put Uninterested = do
    put (1 :: Word32)
    put (3 :: Word8)
  put (Have pieceId) = do
    put (5 :: Word32)
    put (4 :: Word8)
    put pieceId
  put (Bitfield field) = do
    put (fromIntegral $ 1 + B.length field :: Word32)
    put (5 :: Word8)
    putByteString field
  put (Request piece offset len) = do
    put (13 :: Word32)
    put (6 :: Word8)
    put piece
    put offset
    put len
  put (Piece piece offset d) = do
    put (fromIntegral $ 9 + B.length d :: Word32)
    put (7 :: Word8)
    put piece
    put offset
    putByteString d
  put (Cancel piece offset len) = do
    put (13 :: Word32)
    put (8 :: Word8)
    put piece
    put offset
    put len

  get = do
    len <- get :: Get Word32
    case len of
      0 -> return KeepAlive
      _ -> do
        messageId <- get :: Get Word8
        case messageId of
          0 -> return Choke
          1 -> return Unchoke
          2 -> return Interested
          3 -> return Uninterested
          4 -> Have <$> get
          5 -> Bitfield <$> getByteString (fromIntegral len - 1)
          6 -> Request <$> get <*> get <*> get
          7 -> Piece <$> get <*> get <*> getByteString (fromIntegral len - 9)
          8 -> Cancel <$> get <*> get <*> get
          _ -> fail "incorrect message!"

We’re now capable of speaking all the required protocols and are ready to initiate some connections. The journey continues in the next post, where we’ll implement the peer monad to handle connections with peers.

We love to solve tough problems. Got one?   Hire us