This is the first post in a series on writing a BitTorrent client in Haskell.
BitTorrent is a protocol for the practice of peer-to-peer file sharing that is
used to distribute large amounts of data over the Internet. BitTorrent is one of
the most common protocols for transferring large files, and peer-to-peer networks
have been estimated to collectively account for approximately 43% to 70% of all
Internet traffic (depending on geographical location) as of February 2009.
Source: Wikipedia
I chose Haskell for this project because I wanted to try out some of the
features in real use.
The synchronous non-blocking I/O and green threads were appealing and so I
decided to handle every peer connection in a separate thread…
Which leads to Software Transactional Memory to coordinate them.
I’ve used it before, but I figured there will be a lot of state and that
this will be a good opportunity to evaluate it.
Sources:
Before we dive deep into managing connections and actual protocol logic,
let us first make sure that we speak the data exchange protocols.
This post will be all about that.
The first one is called Bencode and is used for .torrent
files
and for talking to the tracker.
The other is the Peer Wire Protocol and is used by peers to communicate
with each other.
Both are binary protocols.
We’re going to use the well known Attoparsec parser combinator
library for Bencoding and Data.Binary for PWP.
At it’s core, Bencoding is very similar to JSON.
Even though it’s a binary format, the numbers are encoded in ASCII text
so they are unaffected by endianness.
Let’s make a table of raw values and their Bencoded representations.
Raw Values |
Bencoded |
Notes |
"announce" |
8:announce |
The number before the semicolon represents the string’s length in bytes. |
412 |
i412e |
The number is encoded in ASCII. |
["ok"] |
l2:oke |
Lists are heterogeneous and contain multiple values. |
[("ok", 0)] |
d2:oki0ee |
Maps are heterogeneous and
contain multiple (string, value) pairs.
The keys are ordered.
|
Now that we understand the possible Bencoded values, we can create a datatype
to store them.
data BValue = String ByteString
| Number Integer
| List [BValue]
| Dictionary [(ByteString, BValue)]
I’m using an Integer
to store the number.
This will handle both sign and arbitrarily large inputs.
We will be converting this to an efficient integer type depending
on the context later on.
Let’s parse every variant of a value as we go in Attoparsec:
string :: Parser BValue
string = do
n <- P.decimal
void $ P.char ':'
String <$> P.take n
Nothing complicated here, we read the number of characters, the delimiter
and finally the raw bytestring.
number :: Parser BValue
number = Number <$> (P.char 'i' *> P.signed P.decimal <* P.char 'e')
In this one we scan the surroundings and extract a signed decimal number.
list :: Parser BValue
list = List <$> (P.char 'l' *> P.many' value <* P.char 'e')
We’re scanning the surroundings again and parsing all the values inside.
The value
function will be revealed in a second. All it does is combine
all the parsers.
dictionary :: Parser BValue
dictionary = do
void $ P.char 'd'
pairs <- P.many' ((,) <$> string <*> value)
void $ P.char 'e'
return $ Dictionary $ fixPair <$> pairs
where fixPair (String s, v) = (s, v)
Here we take care of the surroundings and extract multiple pairs of string-value pairs.
Then we use fixPair
to extract the raw bytestring from our boxed String subtype.
And to combine those variants into a full parser.
value :: Parser BValue
value = string <|> number <|> list <|> dictionary
Let’s test the parser on a chunk of a .torrent
file I extracted.
$ cat > input
d8:announce42:http://tracker.archlinux.org:6969/announce
13:creation datei1438414567e4:infod6:lengthi688914432e
4:name29:archlinux-2015.08.01-dual.iso12:piece lengthi524288eee
$ stack ghci
...
*Bencoding> input <- B.readFile "input"
*Bencoding> :set -XOverloadedStrings
*Bencoding> let join = B.intercalate ""
*Bencoding> let joined = join $ B.split (fromIntegral $ fromEnum '\n') input
*Bencoding> P.parseOnly value joined
Right (Dictionary [
("announce",String "http://tracker.archlinux.org:6969/announce"),
("creation date",Number 1438414567),
("info",Dictionary [
("length",Number 688914432),
("name",String "archlinux-2015.08.01-dual.iso"),
("piece length",Number 524288)]
)])
Output formatted by me.
Because multiline input is not convenient in GHCi, I split it up into lines,
save it in a file and read it in in the interpreter.
After reassembling the input, we can see the parser in action.
Because traversing and extracting data from our BValue
structure would involve
a ton of pattern matches and nesting, we’ll define lenses to help us with that.
I’m using the lens-family
package.
I won’t go into detail here, there are many lenses tutorials out there.
We’re only defining traversal lenses, so this should be simple - just look where
function f
is applied.
bstring :: Traversal' BValue ByteString
bstring f (String s) = String <$> f s
bstring _ bv = pure bv
bnumber :: Traversal' BValue Integer
bnumber f (Number n) = Number <$> f n
bnumber _ bv = pure bv
blist :: Traversal' BValue BValue
blist f (List xs) = List <$> traverse f xs
blist _ bv = pure bv
bkey :: ByteString -> Traversal' BValue BValue
bkey k f bv@(Dictionary m) = case lookup k m of
Just v -> f v
Nothing -> pure bv
bkey _ _ bv = pure bv
I’m not an expert on lenses, but this worked pretty well for me so far.
Testing it out in GHCi, let’s continue where we left off in the earlier session:
...
*Bencoding> let Right result = P.parseOnly value joined
*Bencoding> result ^? bkey "announce" . bstring
Just "http://tracker.archlinux.org:6969/announce"
*Bencoding> result ^? bkey "info" . bkey "length" . bnumber
Just 688914432
I have been associating .torrent
files with Bencoding until this point
and it’s time to expand on that.
Those files actually have a strict schema called Metainfo.
The structure is documented here.
Lenses for our Bencoded values are convenient, but we can do better by
repackaging that data into a simpler data structure with a fixed schema.
This will also serve as a validation layer for us, as we’ll see in
a second.
While the BitTorrent protocol allows downloading multiple files per torrent,
let’s skip that for now.
Furthermore, we’ll fetch a subset of the data, just what we need to move on.
I’m more interested in getting it to work first, then expanding features.
data MetaInfo = MetaInfo {
info :: InfoDictionary
, infoHash :: ByteString
, announce :: ByteString
}
data InfoDictionary = InfoDictionary {
pieceLength :: Word64
, pieces :: ByteString
, name :: ByteString
, length :: Word64
}
pieces
store 20
-byte SHA1 hashes for each piece.
These hashes are used to validate every piece the client downloads.
To extract the data from raw BValue
s into those types, I’m using
Applicative Maybe
to make this elegant.
parseMetaInfo :: BValue -> Maybe MetaInfo
parseMetaInfo bv = MetaInfo
<$> (bv ^? bkey "info" >>= parseInfoDictionary)
<*> (hash . serialize <$> bv ^? bkey "info")
<*> bv ^? bkey "announce" . bstring
parseInfoDictionary :: BValue -> Maybe InfoDictionary
parseInfoDictionary bv = InfoDictionary
<$> (fromIntegral <$> bv ^? bkey "piece length" . bnumber)
<*> bv ^? bkey "pieces" . bstring
<*> bv ^? bkey "name" . bstring
<*> (fromIntegral <$> bv ^? bkey "length" . bnumber)
I’m serializing back the info dictionary to Bencoding and then hashing it with a SHA1.
This is referred to as the Info hash and is used by the protocol to uniquely identify
this torrent.
We’ll use this later for things like asking trackers about peers and
handshaking them.
Messages in the Peer Wire Protocol begin with a 4 byte integer describing the
remaining length of the message.
Peers use these messages to talk to each other.
Some messages exist to modify the state and drive the protocol,
while others are used for requesting & transferring data between the peers.
Before I describe each message, let’s see a typical exchange so it’s easier to
understand what each message does later on.
When a peer reaches out to another peer, it sets up a TCP connection with that
peer.
Handshakes are exchanged to identify the protocol, the version, extensions and so on.
Bitfields are then exchanged to inform the other peer about our progress.
A bitfield is just an efficiently packed array of bits that signifies completion
of each piece in the torrent.
Every peer remembers four things:
- if he’s choking the other peer,
- if the other peer is choking him,
- if he’s interested in the other peer,
- if the other peer is interested in him
These four booleans are enough to coordinate the protocol.
Peers start out as choked and uninterested, so how do we get to actually download
some data?
We first express interest in the other peer, letting him know we want to request
some chunks.
Then, we wait for the other peer to unchoke us.
It’s entirely possible that he never will, depending on his algorithm, the limit of
connections, etc.
The BitTorrent protocol lets clients decide how to handle these messages.
They are free to implement their own algorithms/heuristics.
Finally, when the other peer unchokes us, we can request chunks and the other client
must respond with data.
Let’s now go over the variants of PWP messages and see their hexdumps.
This is the most basic message used to keep the connections alive.
Clients send it periodically.
00 00 00 00
^
|
length=0
This message tells the other peer that it’s choked and cannot request any data.
00 00 00 01 00
^ ^
| |
length=1 id=0
This message tells the other peer that it’s unchoked and may request data.
00 00 00 01 01
^ ^
| |
length=1 id=1
This message tells the other peer that we are interested in his data.
It’s a hint for the other peer to unchoke us.
00 00 00 01 02
^ ^
| |
length=1 id=2
This message tells the other peer we are not interested in his data anymore.
He can choke us to use the upload slot for someone else.
00 00 00 01 03
^ ^
| |
length=1 id=3
This message tells the other peer that we have completed a piece.
00 00 00 05 04 00 00 00 05
^ ^ ^
| | |
length=5 id=4 pieceid=5
Sends the full bitfield (piece download progress) to the other peer.
00 00 00 05 05 ff fd ff ff
^ ^ ^
| | |
length=1+n id=5 bitfield
length=n
In this example, the bitfield indicates all pieces except 14th as completed.
A bitfield of 4 bytes could store the status of up to 32 pieces.
It’s interesting to note how small the overhead of the protocol is.
The standard advises selecting a piece size (when creating the torrent)
so that our Metainfo doesn’t exceed 70KB
.
If we assume there’s 64KB
of just piece hashes, that makes
3200
pieces, which will make the bitfield only 400
bytes.
On top of that, this message is sent only once per peer connection.
Requests a chunk of n bytes inside a piece at offset.
To download a full piece, the client has to sent out requests like those.
The standard practice seems to be to request data at chunks of
215
bytes.
00 00 00 0d 06 00 00 00 01 00 00 00 02 00 00 00 03
^ ^ ^ ^ ^
| | | | |
length=13 id=6 pieceid=1 offset=2 length=3
Transfers data to the other peer.
00 00 00 0d 07 00 00 00 01 00 00 00 02 74 65 73 74
^ ^ ^ ^ ^
| | | | |
length=9+n id=7 pieceid=1 offset=2 data="test"
length=n
Cancels an outstanding request for data.
00 00 00 0d 08 00 00 00 01 00 00 00 02 00 00 00 03
^ ^ ^ ^ ^
| | | | |
length=13 id=8 pieceid=1 offset=2 length=3
Putting all of that together into a datatype and a Binary
instance:
data PWP = KeepAlive
| Choke
| Unchoke
| Interested
| Uninterested
| Have Word32
| Bitfield ByteString
| Request Word32 Word32 Word32
| Piece Word32 Word32 ByteString
| Cancel Word32 Word32 Word32
The only thing worth noting here is the use of unsigned 32bit integers as required
by the protocol.
instance Binary PWP where
put KeepAlive =
put (0 :: Word32)
put Choke = do
put (1 :: Word32)
put (0 :: Word8)
put Unchoke = do
put (1 :: Word32)
put (1 :: Word8)
put Interested = do
put (1 :: Word32)
put (2 :: Word8)
put Uninterested = do
put (1 :: Word32)
put (3 :: Word8)
put (Have pieceId) = do
put (5 :: Word32)
put (4 :: Word8)
put pieceId
put (Bitfield field) = do
put (fromIntegral $ 1 + B.length field :: Word32)
put (5 :: Word8)
putByteString field
put (Request piece offset len) = do
put (13 :: Word32)
put (6 :: Word8)
put piece
put offset
put len
put (Piece piece offset d) = do
put (fromIntegral $ 9 + B.length d :: Word32)
put (7 :: Word8)
put piece
put offset
putByteString d
put (Cancel piece offset len) = do
put (13 :: Word32)
put (8 :: Word8)
put piece
put offset
put len
get = do
len <- get :: Get Word32
case len of
0 -> return KeepAlive
_ -> do
messageId <- get :: Get Word8
case messageId of
0 -> return Choke
1 -> return Unchoke
2 -> return Interested
3 -> return Uninterested
4 -> Have <$> get
5 -> Bitfield <$> getByteString (fromIntegral len - 1)
6 -> Request <$> get <*> get <*> get
7 -> Piece <$> get <*> get <*> getByteString (fromIntegral len - 9)
8 -> Cancel <$> get <*> get <*> get
_ -> fail "incorrect message!"
We’re now capable of speaking all the required protocols
and are ready to initiate some connections.
The journey continues in the next post, where we’ll implement the
peer monad to handle connections with peers.