Reading data from a socket

This item was added on: 2005/02/12

- Introduction
- Sample of incorrect code
- Mistake 1: Oh no, overflow...!
- Possible Mistake 2: Where did that extra byte come from?
- Mistake 3: Did I read you right?
- Stream Rules
- Some diagrams!
- Solution 1 - Use a delimiter
- Solution 2 - Use a data length indicator
- Solution 3 - Fixed sized messages
- Break up your code!

Introduction

When people create their first TCP/IP socket based program, they often don't realise what a minefield they're stepping into. There's a big leap involved in getting from a basic "hello world" style socket program to something that is actually of any real use. In the haste to build something bigger, better and more fun, it is easy to overlook the basic steps needed to allow successful exchange of data through a socket. Here we discuss some of these problems in detail, in the hope that the reader can avoid making incorrect assumptions from the start. This particular article covers the principles of controlling the receiving buffer. You'll see a classic buffer overflow.

Sample of incorrect code

The following is an extract from a simple socket program that receives data. It shows a few common mistakes:

```c
/* Sample1.c */
01 char Response[] = "COMMAND OK";
02 char CommandBuffer[BUFSIZ];
03
04 nBytes = recv(socket, CommandBuffer, sizeof(CommandBuffer), 0);
05
06 if (nBytes == -1)
07 {
08   /*
09    * Socket in error state
10    */
11   perror ("recv");
12   return 0;
13 }
14
15 if (nBytes == 0)
16 {
17   /*
18    * Socket has been closed
19    */
20   fprintf (stderr, "Socket %d closed", socket);
21   close (socket);
22   return 0;
23 }
24
25 /*
26  * Command read OK, let's process it!
27  */
28
29 CommandBuffer[nBytes] = '\0';
30
31 if (strcmp (CommandBuffer, "QUIT") == 0)
32 {
33   printf ("Remote program said QUIT!\n");
34   send(socket, Response, sizeof(Response), 0);
35 }
36
37 /*
38  * and so on....
39  */
```

Using this code, I'll highlight three problems: one minor and two major (but not in that order!).

Mistake 1: Oh no, overflow...!

On line 04 of Sample1.c, `recv()` is asked to fill `CommandBuffer` with up to `sizeof(CommandBuffer)` bytes. Let's assume that it does so successfully, filling the whole buffer, and the resulting count is stored in `nBytes`. Having got past the subsequent conditional statements, line 29 writes a \0 character to the buffer to null terminate it, ready for use as a string. This is a serious mistake: we've just written to memory we weren't supposed to; it's a classic buffer overflow. Effectively (taking `BUFSIZ` to be 512 for illustration), we've just done this:

```c
char CommandBuffer[512];
CommandBuffer[512] = '\0'; /* Oops, this is one byte past the end of the array */
```

If you're expecting to `recv()` a string, I'd suggest giving the `recv()` function `arraysize-1` bytes to write to, something like:

```c
nBytes = recv(socket, CommandBuffer, sizeof(CommandBuffer) - 1, 0);
```

That way, `nBytes` can be used to safely apply the \0 character, as in the original code.

Possible mistake 2: Where did that extra byte come from?

We can see that the code above expects the data it receives to be suitable for use as a string, with the exception that it will null terminate the array itself. This means it does not expect to receive a \0 character in the data from the socket. It would therefore be reasonable to assume that the application at the other end expects the same. On line 34 of Sample1.c, the `send()` function sends a string denoting that a command was accepted, but this code sends the null terminator as well. This is because `sizeof(Response)` is used, which yields the size of `COMMAND OK\0`: 11 bytes rather than the 10 visible characters. A better choice would have been to use `strlen()` to determine the length. This may not be a problem; the other end may be able to cope with or without a \0 but, at the very least, our design should be consistent, and we should be fully aware of what we're sending. Hence I labelled this section a "Possible mistake".
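
Both fixes can be wrapped in small helpers. The following is only a minimal sketch and not part of the original sample: the names `recv_string()` and `send_string()` are hypothetical, and POSIX socket headers are assumed. It still hands back whatever a single `recv()` call happens to deliver, so it does nothing to guarantee a complete command.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: receive at most bufsize - 1 bytes, so there is
 * always room left for the \0 that makes the buffer a C string.
 * Returns whatever recv() returned (-1 on error, 0 on close).
 */
static ssize_t recv_string(int sock, char *buf, size_t bufsize)
{
    ssize_t n;

    if (bufsize == 0)
        return -1;                 /* no room even for the terminator */

    n = recv(sock, buf, bufsize - 1, 0);
    if (n >= 0)
        buf[n] = '\0';             /* n <= bufsize - 1, so this is in bounds */

    return n;
}

/*
 * Hypothetical helper: send the characters of a C string WITHOUT the
 * trailing \0, by using strlen() rather than sizeof.
 */
static ssize_t send_string(int sock, const char *s)
{
    return send(sock, s, strlen(s), 0);
}
```

With these, line 04 would become something like `nBytes = recv_string(socket, CommandBuffer, sizeof(CommandBuffer));` and line 34 `send_string(socket, Response);`.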
In fact, if you're only moving strings around, the extra \0 is unlikely to cause problems, but if you start moving more complex data structures, then you definitely need to be more careful.

Mistake 3: Did I read you right?

This last mistake is a little more complicated than the first two, and takes more work to fix, but it is something that must be done. To summarise the problem: you can never be sure how much data the application will receive when it calls `recv()`. Just because you think the other end might send you one of a preset number of one-word commands, e.g. `QUIT`, doesn't mean that's what you'll `recv()`. To fully appreciate the dilemma, you first need to understand that TCP/IP is a "stream" based protocol. Let's discuss that bit first...

Stream Rules:

- Data is delivered as a series of bytes that will arrive at the target application in the order they were sent.
- Data arrives as and when everything in between the two applications feels like delivering it.
- Data can be split into multiple packets, dependent on lots of things that are mostly out of your control.
- `send()`ing multiple messages does not guarantee that you'll `recv()` the same number of messages.
- The receiving application must cater for split messages.
- The receiving application must cater for joined messages.
- There is no automatic magic marker at the start of a message, nor at the end of a message.

The key thing to note is that messages can be split or joined, and yes, you will probably have to do something about re-assembling them on the receiving end. This is where Sample1.c has failed; it makes no attempt to ensure that it has received a single, complete command before processing it.

Let's look at some samples of what could happen when you call `recv()` to get the command in Sample1.c:

Scenario 1: One command, one `recv()`. In this case, one call to `recv()` gets one command. Nice and simple!

```
+-------------+
|  recv() 1   |
+-------------+
|USER MYNAME\0|
+-------------+
```

Scenario 2: One command, two `recv()`s. We need two calls to `recv()`, and we must re-assemble the data.

```
+--------+--------+
|recv() 1|recv() 2|
+--------+--------+
|USER MYN|AME\0   |
+--------+--------+
```

Scenario 3: Two commands, one `recv()`. We need one call to `recv()`, and we must split the data in order to process both commands.

```
+------------------------------+
|           recv() 1           |
+------------------------------+
|USER MYNAME\0PASSWORD MYPASS\0|
+------------------------------+
```

Scenario 4: Two commands, two `recv()`s. We need two calls to `recv()`, and we must both split and re-assemble the data.

```
+------------------+------------+
|     recv() 1     |  recv() 2  |
+------------------+------------+
|USER MYNAME\0PASSW|ORD MYPASS\0|
+------------------+------------+
```

There are other scenarios, including "no data available" and "error conditions", which we will not cover here.

As you can see, the streaming protocol is quite ruthless with your data. It will chop and join wherever it feels like; it's up to you to fix it! Now we move to the fun part: how to actually do that fix. There are three solutions on offer here, none of which comes with full code; that is left as an exercise for the reader.

Solution 1 - Use a delimiter

When processing only strings, as in Sample1.c, you can use a delimiter byte to break up messages. A good choice would be the \0 character that terminates all C strings. The `recv()`ing program can behave in one of two ways:

1) Read a single byte at a time until it hits the \0 character, at which point it can assume it has received a complete command. This option is nice and simple, but comes with the overhead of repeatedly calling `recv()`, which is inefficient.

2) Read an arbitrary number of bytes, then parse the receiving buffer, looking for a \0 character. Once found, pass the details back to the calling function.
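
As a rough sketch of option 2, the parsing step might look like the code below. Everything in it is an assumption made for illustration: the `struct msgbuf` type, the `msgbuf_append()` and `msgbuf_next()` names, and the 512 byte capacity are all hypothetical. The idea is that bytes from each `recv()` are appended to a holding buffer, and complete \0 terminated messages are pulled off the front one at a time.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical holding buffer for bytes received but not yet processed. */
struct msgbuf {
    char   data[512];   /* arbitrary capacity, chosen for the example */
    size_t used;        /* number of valid bytes in data */
};

/* Append bytes just returned by recv(). Returns 0, or -1 if they don't fit. */
static int msgbuf_append(struct msgbuf *mb, const char *src, size_t n)
{
    if (n > sizeof(mb->data) - mb->used)
        return -1;
    memcpy(mb->data + mb->used, src, n);
    mb->used += n;
    return 0;
}

/*
 * Copy the first complete message (including its \0) into out and remove
 * it from the holding buffer, sliding any remaining bytes to the front.
 * Returns 1 if a message was extracted, 0 if no complete message has
 * arrived yet, or -1 if out is too small to hold it.
 */
static int msgbuf_next(struct msgbuf *mb, char *out, size_t outsize)
{
    char  *end = memchr(mb->data, '\0', mb->used);
    size_t len;

    if (end == NULL)
        return 0;                        /* split message: wait for more data */

    len = (size_t)(end - mb->data) + 1;  /* message length including the \0 */
    if (len > outsize)
        return -1;

    memcpy(out, mb->data, len);
    memmove(mb->data, mb->data + len, mb->used - len);
    mb->used -= len;
    return 1;
}
```

A caller would loop on `msgbuf_next()` after every `msgbuf_append()` until it returns 0, which covers both the split and the joined scenarios shown in the diagrams.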
But we mustn't forget about the bytes in the buffer after the first string; they'll need processing at some point, too.

Solution 2 - Use a data length indicator

Every message that is sent can be prefixed with a value that represents the data's length. The receiver starts by `recv()`ing a fixed number of bytes to get this length indicator and, once it has it, it `recv()`s that specific number of bytes. If all goes well, two `recv()` calls are all that is needed to read one message. Of course, the data may be split, meaning that you need two or more calls to get it... and don't forget this applies to getting the length indicator in the first place!

Sample data stream, using a 4 byte length indicator:

```
0017This is a message0010So is this
```

Pseudocode:

```
Prototype: int myrecv(void *buf, size_t max_buffer_size, size_t bytes_to_read);

INDICATOR_LEN = 4

rc = myrecv(buf, sizeof(buf), INDICATOR_LEN);
if (rc != INDICATOR_LEN) Then exit();

UserDataLength = ConvertNumberFromString(buf, INDICATOR_LEN);
if (UserDataLength Is Out_Of_Bounds) Then exit();

UserData = malloc(UserDataLength);
rc = myrecv(UserData, UserDataLength, UserDataLength);
if (rc == UserDataLength) Then We Received A Complete Message!
```

Here `myrecv()` is assumed to keep calling `recv()` until `bytes_to_read` bytes have arrived (or an error occurs), which is how it copes with split data.

Solution 3 - Fixed sized messages

This option is not really practical for strings, but if you're sending data structures around, they might well be of fixed size. In this case, `recv()`ing `sizeof(dataStructure)` bytes would be a good choice, while still handling split packets.

Use functions

Another common problem is making functions too long and complex. Remember, no matter how you choose to read data from a socket, break your program up into lots of small functions that perform specific tasks. It will make managing and debugging the code a lot easier.

Written by: Hammer
