Windows Sockets 2.0: Write Scalable Winsock Apps Using Completion Ports
Anthony Jones and Amol Deshpande
This article assumes you’re familiar with 1 2 3
Download the code for this article: Jones1000.exe (33KB)
Browse the code for this article at Code Center: WinSock Demo
SUMMARY
Writing a network-aware application isn't difficult, but writing one that is scalable can be challenging. Overlapped I/O using completion ports provides true scalability on Windows NT and Windows 2000. Completion ports and Windows Sockets 2.0 can be used to design applications that will scale to thousands of connections.
The article begins with a discussion of the implementation of a scalable server, discusses handling low-resource, high-demand conditions, and addresses the most common problems with scalability.
Learning to write network-aware applications has never been considered easy. In reality, though, there are just a few principles to master: creating and connecting a socket, accepting a connection, and sending and receiving data. The real difficulty is writing network applications that scale from a single connection to many thousands of connections. This article will discuss the development of scalable Windows NT® and Windows 2000-based applications that use Windows® Sockets 2.0 (Winsock2). The primary focus will be the server side of the client/server model; however, many of the topics discussed apply to both.
Because the notion of writing a scalable Winsock application implies a server application, the following discussion is pertinent to applications running on Windows NT 4.0 and Windows 2000. We're not including Windows NT 3.x because this solution relies on the features of Winsock2 that are available only on Windows NT 4.0 and newer.
APIs and Scalability
The overlapped I/O mechanism in Win32® allows an application to initiate an operation and receive notification of its completion later. This is especially useful for operations that take a long time to complete. The thread that initiates the overlapped operation is then free to do other things while the overlapped request completes behind the scenes. The only I/O model that provides true scalability on Windows NT and Windows 2000 is overlapped I/O using completion ports for notification. Mechanisms like the WSAAsyncSelect and select functions are provided for easier porting from Windows 3.1 and Unix respectively, but are not designed to scale. The completion port mechanism is optimized for the operating system's internal workings.
Completion Ports
A completion port is a queue into which the
operating system puts notifications of completed overlapped I/O requests. Once
the operation completes, a notification is sent to a worker thread that can
process the result. A socket may be associated with a completion port at any
point after creation.
Typically an application will also create a
number of worker threads to process these notifications. The number of worker
threads depends on the specific needs of the application. The ideal number is
one per processor, but that implies that none of these threads should execute a
blocking operation such as a synchronous read/write or a wait on an
event. Each thread is given a certain amount of CPU time, known as the quantum,
for which it can execute before another thread is allowed to grab a time slice.
If a thread performs a blocking operation, the operating system will throw away
its unused time slice and let other threads execute instead. Thus, the first
thread has not fully utilized its quantum, and the application should therefore
have other threads ready to run and utilize that time slot.
Using a
completion port is a two-step process. First, the completion port is created, as
shown in the following code:
HANDLE hIocp;

hIocp = CreateIoCompletionPort(
    INVALID_HANDLE_VALUE,
    NULL,
    (ULONG_PTR)0,
    0);
if (hIocp == NULL) {
    // Error
}
Once the completion port is created, each socket that wants to use the completion port must be associated with it. This is done by calling CreateIoCompletionPort again, this time setting the first parameter, FileHandle, to the socket handle to be associated, and setting ExistingCompletionPort to the handle of the completion port you just created.
The following code
creates a socket and associates it with the completion port created earlier:
SOCKET s;

s = socket(AF_INET, SOCK_STREAM, 0);
if (s == INVALID_SOCKET) {
    // Error
}
if (CreateIoCompletionPort((HANDLE)s,
                           hIocp,
                           (ULONG_PTR)0,
                           0) == NULL)
{
    // Error
}
•••
At this point, the socket s is associated with the completion port. Any overlapped operations performed on the socket will use the completion port for notification. Note that the third parameter of CreateIoCompletionPort allows a completion key to be specified along with the socket handle to be associated. This can be used to pass context information that is associated with the socket. Each time a completion notification arrives, this context information can be retrieved.
Once the completion port has been created and sockets have been associated with it, one or more threads are needed to process the completion notifications. Each thread sits in a loop, calling GetQueuedCompletionStatus on each iteration to retrieve the next completion notification.
Before illustrating what a typical worker thread looks
like, we need to address the ways in which an application keeps track of its
overlapped operations. When an overlapped call is made, a pointer to an
overlapped structure is passed as a parameter. GetQueuedCompletionStatus will
return the same pointer when the operation completes. With this structure alone,
however, an application can't tell which operation just completed. In order to
keep track of the operations that have completed, it's useful to define your own
OVERLAPPED structure that contains any extra information about each operation
queued to the completion port (see Figure 1).
Whenever an overlapped
operation is performed, an OVERLAPPEDPLUS structure is passed as the
lpOverlapped parameter (as in WSASend, WSARecv, and so on). This allows you to
set operation state information for each overlapped call. When the operation
completes, the OVERLAPPED pointer returned from GetQueuedCompletionStatus will
now point to your extended structure. Note that the OVERLAPPED field within the
extended structure does not necessarily have to be the first field. After the
pointer to the OVERLAPPED structure is returned, the CONTAINING_RECORD macro can
be used to obtain a pointer to the extended structure.
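To make the pointer arithmetic concrete, here is a minimal, portable sketch of what CONTAINING_RECORD does. It is an illustration only: FAKE_OVERLAPPED and OP_CONTEXT are hypothetical stand-ins for OVERLAPPED and the extended structure, and MY_CONTAINING_RECORD is a local macro written with offsetof that mirrors the idea behind the Windows macro. The embedded field is deliberately not the first member.

```c
#include <stddef.h>

/* Hypothetical stand-in for the Win32 OVERLAPPED structure. */
typedef struct {
    unsigned long internal;
    unsigned long internal_high;
    void         *event;
} FAKE_OVERLAPPED;

/* Hypothetical per-operation context; the overlapped field is NOT first. */
typedef struct {
    int             op_code;
    FAKE_OVERLAPPED ol;
    int             socket_id;
} OP_CONTEXT;

/* Same idea as CONTAINING_RECORD: subtract the field's offset
   from the field's address to get the enclosing structure. */
#define MY_CONTAINING_RECORD(address, type, field) \
    ((type *)((char *)(address) - offsetof(type, field)))

/* Given the pointer handed back at completion time, recover the context. */
static OP_CONTEXT *context_from_overlapped(FAKE_OVERLAPPED *pol)
{
    return MY_CONTAINING_RECORD(pol, OP_CONTEXT, ol);
}
```

Because the offset is computed by the compiler, the extended structure can be reordered later without touching the code that recovers it.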
Take a look at the example worker thread in Figure 2. The PerHandleKey variable will return anything that was passed as the CompletionKey parameter to CreateIoCompletionPort when associating a given socket handle. The Overlap variable receives a pointer to the OVERLAPPED field of the OVERLAPPEDPLUS structure that was used to initiate the overlapped operation. Keep in mind that if an overlapped operation fails immediately (that is, returns SOCKET_ERROR and the error is not WSA_IO_PENDING), then no completion notification will be posted to the queue. Conversely, if the overlapped call succeeds, or fails with WSA_IO_PENDING, a completion notification will always be posted to the completion port.
For more
information on using completion ports with Winsock, take a look at
the Microsoft® Platform SDK, which includes a Winsock completion port
sample (under the Winsock section in the iocp directory). More sample
information can be found at http://msdn.microsoft.com/library/techart/msdn_servrapp.htm.
Additionally, consult Network Programming for Microsoft Windows by
Anthony Jones and Jim Ohlund (Microsoft Press, 1999), which includes samples for
completion ports as well as the other I/O models.
Figure 1 Overlapped Structure
typedef struct _OVERLAPPEDPLUS {
    OVERLAPPED ol;
    SOCKET     s, sclient;
    int        OpCode;
    WSABUF     wbuf;
    DWORD      dwBytes, dwFlags;
    // other useful information
} OVERLAPPEDPLUS;

#define OP_READ   0
#define OP_WRITE  1
#define OP_ACCEPT 2
Figure 2 Worker Thread
DWORD WINAPI WorkerThread(LPVOID lpParam)
{
    ULONG_PTR      *PerHandleKey;
    OVERLAPPED     *Overlap;
    OVERLAPPEDPLUS *OverlapPlus,
                   *newolp;
    DWORD           dwBytesXfered;
    int             ret;

    while (1)
    {
        ret = GetQueuedCompletionStatus(
            hIocp,
            &dwBytesXfered,
            (PULONG_PTR)&PerHandleKey,
            &Overlap,
            INFINITE);
        if (ret == 0)
        {
            // Operation failed
            continue;
        }
        OverlapPlus = CONTAINING_RECORD(Overlap, OVERLAPPEDPLUS, ol);

        switch (OverlapPlus->OpCode)
        {
        case OP_ACCEPT:
            // Client socket is contained in OverlapPlus->sclient
            // Add client to completion port
            CreateIoCompletionPort(
                (HANDLE)OverlapPlus->sclient,
                hIocp,
                (ULONG_PTR)0,
                0);

            // Need a new OVERLAPPEDPLUS structure
            // for the newly accepted socket. Perhaps
            // keep a lookaside list of free structures.
            newolp = AllocateOverlappedPlus();
            if (!newolp)
            {
                // Error
            }
            newolp->s = OverlapPlus->sclient;
            // The operation posted below is a send, so tag it as a write
            newolp->OpCode = OP_WRITE;

            // This function prepares the data to be sent
            PrepareSendBuffer(&newolp->wbuf);

            ret = WSASend(
                newolp->s,
                &newolp->wbuf,
                1,
                &newolp->dwBytes,
                0,
                &newolp->ol,
                NULL);
            if (ret == SOCKET_ERROR)
            {
                if (WSAGetLastError() != WSA_IO_PENDING)
                {
                    // Error
                }
            }

            // Put structure in lookaside list for later use
            FreeOverlappedPlus(OverlapPlus);

            // Signal accept thread to issue another AcceptEx
            SetEvent(hAcceptThread);
            break;

        case OP_READ:
            // Process the data read
            // •••

            // Repost the read if necessary, reusing the same
            // receive buffer as before
            memset(&OverlapPlus->ol, 0, sizeof(OVERLAPPED));
            ret = WSARecv(
                OverlapPlus->s,
                &OverlapPlus->wbuf,
                1,
                &OverlapPlus->dwBytes,
                &OverlapPlus->dwFlags,
                &OverlapPlus->ol,
                NULL);
            if (ret == SOCKET_ERROR)
            {
                if (WSAGetLastError() != WSA_IO_PENDING)
                {
                    // Error
                }
            }
            break;

        case OP_WRITE:
            // Process the data sent, etc.
            break;
        } // switch
    } // while
} // WorkerThread
The Windows NT and Windows 2000 Sockets Architecture
A basic understanding of the sockets architecture of Windows NT and Windows 2000 is helpful in fully comprehending the principles of scalability. Figure 3 illustrates the current implementation of Winsock in Windows 2000. An application should not depend on the specific details mentioned here (names of drivers, DLLs, and so on), as these may change in a future release of the operating system.
Figure 3 Socket Architecture
The Windows Sockets 2.0 specification allows for a
variety of protocols and their related providers. These user-mode service
providers can be layered on top of existing providers in order to extend their
functionality. For example, a proxy layered service provider (LSP) may install
itself on top of the existing TCP/IP provider. This allows the proxy LSP to
intercept and redirect or log calls to the base provider.
Unlike some
other operating systems, the Windows NT and Windows 2000 transport protocols do
not have a sockets-style interface which applications can use to talk to them
directly. Instead, they implement a much more general API called the Transport
Driver Interface (TDI). The generality of this API keeps the subsystems of
Windows NT from being tied to a particular flavor-of-the-decade network
programming interface. The Winsock kernel mode driver provides the sockets emulation
(currently implemented in AFD.SYS). This driver is responsible for the
connection and buffer management needed to provide a sockets-style interface to
an application. AFD.SYS, in turn, uses TDI to talk to the transport protocol
driver.
Who Manages the Buffers?
As just mentioned, AFD.SYS handles buffer
management for applications that use Winsock to talk to the transport protocol drivers. This
means that when an application calls the send or WSASend function to send data,
the data gets copied by AFD.SYS to its internal buffers (up to the SO_SNDBUF
setting) and the send or WSASend function returns immediately. The data is then
sent by AFD.SYS behind the application's back, so to speak. Of course, if the
application wants to issue a send for a buffer larger than the SO_SNDBUF
setting, the WSASend call blocks until all the data is sent.
Similarly, on receiving data from the remote client, AFD.SYS will copy the data to its own buffers as long as the application has no receive outstanding on the socket and the SO_RCVBUF setting is not exceeded. When the application calls recv or WSARecv, the data is copied from AFD.SYS's buffers to the application-provided buffer.
In most cases, this architecture works
very well. This is especially true for applications that use traditional socket
paradigms with nonoverlapped sends and receives. Before going apoplectic over
the buffer copying that's involved in sending and receiving data, a programmer
should take great care to understand the consequences of turning off the
buffering in AFD.SYS, which can be done by setting the SO_SNDBUF and SO_RCVBUF
values to 0 using the setsockopt API.
Consider, for example, an
application that turns off buffering by setting SO_SNDBUF to 0 and issues a
blocking send. In this case, the application's buffer is locked into memory by
the kernel and the send API does not return until the other end of the
connection acknowledges the entire buffer. That may seem like a neat way to
determine whether all your data has actually been received by the other side,
but in fact it is a bad thing to do. For one thing, even acknowledgment by the
remote TCP is no guarantee that the data will be delivered to the client
application, as there may be out-of-resource conditions that prevent it from
copying the data from AFD.SYS. An even more significant problem with this
approach is that your application can only do one send at a time in each thread.
This is extremely inefficient, to say the least.
Turning off receive
buffering in AFD.SYS by setting SO_RCVBUF to 0 offers no real performance gains.
Setting the receive buffer to 0 forces received data to be buffered at a lower
layer than Winsock.
Again, this leads to buffer copying when you actually post a receive, which
defeats your purpose in turning off AFD's buffering.
It should be clear
by now that turning off buffering is a really bad idea for most applications.
Turning off receive buffering is not usually necessary, as long as the
application takes care to always have a few overlapped WSARecvs outstanding on a
connection. The availability of posted application buffers removes the need for
AFD to buffer incoming data.
However, a high-performance server
application can turn off the send buffering, yet not lose performance. Such an
application must, however, take great care to ensure that it posts multiple
overlapped sends, instead of waiting for one overlapped send to complete before
posting another. If the application posts overlapped sends in a sequential
manner, it wastes the time window between one send completion and the posting of
the next send. If it had another buffer already posted, the transport would be
able to use that buffer immediately and not wait for the application's next send
operation.
Resource Constraints
A major design goal of any server application
is robustness. That is, you want your server application to ride out any
transient problems that might occur, such as a spike in the number of client
requests, temporary lack of available memory, or other relatively short-lived
phenomena. To handle these incidents gracefully, the application developer
should be aware of the resource constraints on typical Windows NT and Windows
2000-based systems.
The most basic resource that you have direct
control over is the bandwidth of the network on which the application is sending
data. It's a fair assumption that an application that uses the User Datagram
Protocol (UDP) is probably already aware of this limitation, since such a server
would want to minimize packet loss. However, even with TCP connections, a server
should take great care to never overrun the network for extended periods of
time. Otherwise, there will be a lot of retransmissions and aborted connections.
The specifics of the bandwidth management method are application-dependent and
are beyond the scope of this article.
Virtual memory used by the application also needs careful management. Conservative memory allocations and frees, perhaps using lookaside lists (a cache) to reuse previous allocations, will keep the server application's footprint smaller and allow the system to keep more of the application address space in memory all the time. An application can also use the SetProcessWorkingSetSize Win32 API to increase the amount of physical memory the operating system will let it use.
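The lookaside idea can be sketched in a few lines of portable C. This is an illustration under stated assumptions, not code from the article's sample: the names CONN_CTX, ctx_alloc, and ctx_free are hypothetical, the list is a simple singly linked free list with no size cap, and no locking is shown, so a multithreaded server would have to protect it (for example, with a critical section).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-connection context; a real one would carry more state. */
typedef struct CONN_CTX {
    struct CONN_CTX *next_free;   /* link used only while on the free list */
    char             buffer[4096];
} CONN_CTX;

/* Head of the lookaside (free) list. Not thread safe as written. */
static CONN_CTX *g_free_list = NULL;

/* Reuse a cached structure if one exists; otherwise fall back to the heap. */
static CONN_CTX *ctx_alloc(void)
{
    CONN_CTX *ctx = g_free_list;

    if (ctx != NULL)
        g_free_list = ctx->next_free;
    else
        ctx = (CONN_CTX *)malloc(sizeof(*ctx));

    if (ctx != NULL)
        memset(ctx, 0, sizeof(*ctx));
    return ctx;
}

/* Return a structure to the list instead of freeing it. */
static void ctx_free(CONN_CTX *ctx)
{
    ctx->next_free = g_free_list;
    g_free_list = ctx;
}
```

A production version would probably bound the length of the list so that a traffic spike does not pin memory indefinitely.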
There are two
other resource constraints that an application indirectly encounters when using
Winsock. The first
one is the locked page limit. Whenever an application posts a send or receive,
and AFD.SYS's buffering is disabled, all pages in the buffer are locked into
physical memory. They need to be locked because the memory will be accessed by
kernel-mode drivers and cannot be paged out for the duration of the access. This
would not be a problem in most circumstances, except that the operating system
must make sure that there is always some pageable memory available to other
applications. The goal is to prevent an ill-behaved application from locking up
all of the physical RAM and bringing down the system. This means that your
application must be conscious of hitting a system-defined limit on the number of
pages locked in memory.
The limit on locked memory in Windows NT and Windows 2000 is about one-eighth of the physical RAM for all applications combined. This is a rough estimate and should not be used as an exact figure on which to base calculations. Just be aware that an overlapped operation may occasionally fail with ERROR_INSUFFICIENT_RESOURCES, and this limitation is a likely cause if there are too many sends or receives pending. The application should take care not to have an excessive amount of memory locked in this fashion. Also note that every page containing any part of your buffer(s) will be locked, so it pays to have buffers that are aligned on page boundaries.
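The benefit of page alignment is easy to see with a little arithmetic. The helper below is a sketch with a hypothetical name (pages_spanned) and an assumed 4096-byte page; on a real system the page size should be queried at run time (for example, from the dwPageSize field filled in by GetSystemInfo).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed page size; typical on x86, but query it at run time in real code. */
#define ASSUMED_PAGE_SIZE 4096u

/* Number of pages touched by a buffer of len bytes starting at address addr. */
static size_t pages_spanned(uintptr_t addr, size_t len)
{
    uintptr_t first, last;

    if (len == 0)
        return 0;
    first = addr / ASSUMED_PAGE_SIZE;
    last  = (addr + len - 1) / ASSUMED_PAGE_SIZE;
    return (size_t)(last - first + 1);
}
```

Under these assumptions, a 4096-byte buffer that starts on a page boundary locks one page, while the same buffer starting 16 bytes later straddles two pages and locks both.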
The other resource limitation that an
application will run into somewhere in its lifetime is the system non-paged pool
limit. The Windows NT and Windows 2000 drivers have the ability to allocate
memory from a special non-paged pool. The memory allocated from this region is
never paged out. It is intended to store information that can be accessed by
various kernel-mode components, some of which may not be able to access a
location in memory that is paged out. Whenever an application creates a socket
(or opens a file, for that matter), some amount of non-paged pool is
allocated.
In addition, the act of binding and/or connecting a socket
also results in additional non-paged pool allocations. Add to this the fact that
an outstanding I/O request, such as a send or a receive, allocates a little more
non-paged pool (a small structure is required to keep track of pending I/O
operations), and you can see that eventually there will be a problem. The
operating system therefore limits the amount of non-pageable memory. The exact
amount of non-paged pool allocated per connection is different for Windows NT
4.0 and Windows 2000 and will likely be different again for future versions of
Windows. In the interests of your application's longevity, you should not
calculate the exact amount of non-paged pool you need.
However, the application must take care to avoid hitting the non-paged limit. When the system runs low on non-paged pool memory, you expose yourself to the risk that some driver that's completely unrelated to your application will throw a fit because it cannot allocate non-paged pool memory at that particular time. In the worst case, this can lead to a system crash. This is especially likely (but impossible to predict in advance) in the presence of third-party devices and drivers on a system. You must also remember that there might be other server applications running on the same machine that consume non-paged pool memory. It is best to be very conservative in your resource estimation, and design the application accordingly.
Handling the resource constraints is complicated by the fact that there is no special error code returned when either of the conditions is encountered. The application will get generic WSAENOBUFS or ERROR_INSUFFICIENT_RESOURCES errors from various calls. To handle these errors, first increase the working set of the application to some reasonable maximum. (For more information on adjusting your working set, see the Bugslayer column by John Robbins in this issue of MSDN Magazine.) Then, if you still continue to get these errors, check the possibility that you may be exceeding the bandwidth of the medium. Once you have done that, make sure you don't have too many sends or receives outstanding. Finally, if you still receive out-of-resource errors, you're most probably running into non-paged pool limits. To free up non-paged pool memory, the application must close a good portion of its outstanding connections and wait for the transient situation to correct itself.
Accepting Connections
One of the most common things a server does is accept connections from clients. The AcceptEx function is the only Winsock API capable of using overlapped I/O to accept connections on a socket. The interesting thing about AcceptEx is that it requires an additional socket as one of the parameters to the API. In a normal, synchronous accept function call, the new socket is the return value from the API. However, since AcceptEx is an overlapped operation, the accepted socket must be created (but not bound or connected) in advance, and passed to the API. A typical pseudocode snippet that uses AcceptEx might look like the following:
do {
    - Wait for a previous AcceptEx to complete
    - Create a new socket and associate it with the completion port
    - Allocate context structure etc.
    - Post an AcceptEx request.
} while (TRUE);
A responsive server must always have enough AcceptEx calls outstanding so that any client connection can be immediately handled. The number of posted AcceptEx operations will depend on the type of traffic your server expects. A high incoming connection rate (because of short-lived connections or spurts in traffic) requires more outstanding AcceptEx calls than an application where the clients connect infrequently. It may be wise to let the number of posted AcceptEx operations vary between application-specific low and high watermarks, and avoid deciding on one fixed number as the magic figure.
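One way to express the watermark idea is a small policy function that the accept thread consults whenever an AcceptEx completes. This is only a sketch: ACCEPT_POLICY and accepts_to_post are hypothetical names, and the refill rule (top up to the high watermark once the count drops below the low watermark) is one reasonable choice among several.

```c
/* Hypothetical tunables: keep outstanding accepts between these bounds. */
typedef struct {
    int low_watermark;    /* refill when outstanding drops below this */
    int high_watermark;   /* never have more than this many posted */
} ACCEPT_POLICY;

/* How many new AcceptEx requests to post right now (0 if none are needed). */
static int accepts_to_post(const ACCEPT_POLICY *p, int outstanding)
{
    if (outstanding >= p->low_watermark)
        return 0;
    return p->high_watermark - outstanding;
}
```

Refilling in batches rather than one at a time keeps the accept thread from waking up for every single connection during a burst.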
On Windows 2000, Winsock provides some
help in determining if the application is falling behind on posting AcceptEx
requests. When creating the listening socket, associate it with an event by
using the WSAEventSelect API and registering for an FD_ACCEPT notification. If
there are no accept operations pending, the event will be signaled by an
incoming connection. This event can thus be used as an indication that you need
to post more AcceptEx requests or detect a possible misbehaving remote entity,
as we'll describe shortly. This mechanism is not available on Windows NT
4.0.
A significant benefit to using the AcceptEx call is the ability to receive data and accept a client connection in one call via the lpOutputBuffer parameter. This means that if a client connects and immediately sends data, AcceptEx will complete only after the connection is established and the client sends data. This can be very useful, but it can also lead to problems, since the AcceptEx operation will not complete until data is received, even if a connection has been established. This is because an AcceptEx call with an output buffer is not one atomic operation, but a two-step process consisting of accepting a connection and waiting for incoming data. However, the application is not notified that a connection has been accepted before data is received. That means a client could connect to your server and not send any data. With enough of these connections, your server will start to refuse connections to legitimate clients because it has no more accepts pending. This is a common method of waging a denial of service attack.
To prevent malicious attacks or stale connections, the accepting thread should occasionally check the sockets outstanding in AcceptEx by calling getsockopt with the SO_CONNECT_TIME option. The option value is set to the number of seconds the socket has been connected, or -1 (0xFFFFFFFF) if it is still unconnected. The WSAEventSelect feature serves as an excellent indicator that the sockets that are outstanding in AcceptEx need their connection times checked. Any connections that have existed for a while without receiving data from the client should be terminated by closing the socket supplied to AcceptEx. An application should not, under most noncritical circumstances, close a socket that is outstanding in AcceptEx but not yet connected. For performance reasons, the kernel-mode data structures created for and associated with such an AcceptEx request will not be cleaned up until a new connection comes in or the listening socket itself is closed.
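The decision itself can be isolated in a small, portable helper. In this sketch the value passed in is assumed to have been read on Windows with getsockopt(s, SOL_SOCKET, SO_CONNECT_TIME, (char *)&seconds, &len); the function name and the max_idle_seconds threshold are hypothetical and application-chosen.

```c
#include <stdint.h>

/* Value SO_CONNECT_TIME reports for a socket that is not yet connected. */
#define NOT_CONNECTED 0xFFFFFFFFu

/*
 * Decide whether a socket outstanding in AcceptEx should be closed.
 * connect_seconds is the value read with getsockopt(SO_CONNECT_TIME);
 * max_idle_seconds is a hypothetical, application-chosen threshold.
 */
static int should_close_pending_accept(uint32_t connect_seconds,
                                       uint32_t max_idle_seconds)
{
    if (connect_seconds == NOT_CONNECTED)
        return 0;   /* still waiting for a client: leave it posted */
    return connect_seconds >= max_idle_seconds;
}
```

Note that the helper never recommends closing an unconnected socket, which matches the advice above about leaving such AcceptEx requests alone.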
It may
seem that the logical thread to post AcceptEx requests is one of the worker
threads that is associated with the completion port and involved in processing
other I/O completion notifications. However, recall that a worker thread should
not execute a blocking or high-latency system call if such an action can be
avoided. One of the side effects of the layered architecture of Winsock2 is that
the overhead to a socket or WSASocket API call may be significant. Every
AcceptEx call requires the creation of a new socket, so it is best to have a
separate thread that posts AcceptEx and is not involved in other I/O processing.
You may also choose to use this thread for performing other tasks such as event
logging.
One last thing to note about AcceptEx is that a Winsock2 implementation from another vendor is not required to implement it. This also applies to the other APIs that are specific to Microsoft, such as TransmitFile, GetAcceptExSockaddrs, and any others that Microsoft may add in a later version of Windows. On systems running Windows NT and Windows 2000, these APIs are implemented in the Microsoft provider DLL (mswsock.dll), and can be invoked by linking with mswsock.lib, or by dynamically loading the function pointers via WSAIoctl with the SIO_GET_EXTENSION_FUNCTION_POINTER control code.
Calling the
function without previously obtaining a function pointer (that is, by linking
with mswsock.lib and calling AcceptEx directly) is costly because AcceptEx sits
outside the layered architecture of Winsock2. AcceptEx must request a function
pointer using WSAIoctl for every call on the off chance that the application is
actually trying to invoke AcceptEx from a provider layered on top of mswsock
(see Figure 3). To avoid this significant performance penalty on each
call, an application that intends to use these APIs should obtain the pointers
to these functions directly from the layered provider by calling WSAIoctl.
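A sketch of the load-once pattern follows. WSAIoctl, SIO_GET_EXTENSION_FUNCTION_POINTER, WSAID_ACCEPTEX, and LPFN_ACCEPTEX come from the Windows headers; EXT_CACHE, ext_cache_get, and load_acceptex are hypothetical names introduced here. The non-Windows branch contains only a stand-in loader so that the caching logic can be exercised elsewhere, and no locking is shown.

```c
#include <stddef.h>
#ifdef _WIN32
#include <winsock2.h>
#include <mswsock.h>
#endif

/* A loader fetches the extension function pointer; ctx is loader-specific. */
typedef void *(*ext_loader)(void *ctx);

/* Hypothetical cache: the pointer is fetched once and then reused. */
typedef struct {
    void *fn;
    int   attempted;
} EXT_CACHE;

static void *ext_cache_get(EXT_CACHE *c, ext_loader load, void *ctx)
{
    if (!c->attempted) {
        c->fn = load(ctx);
        c->attempted = 1;
    }
    return c->fn;
}

#ifdef _WIN32
/* ctx points to a SOCKET created by the provider that will be used. */
static void *load_acceptex(void *ctx)
{
    SOCKET        s = *(SOCKET *)ctx;
    GUID          guid = WSAID_ACCEPTEX;
    LPFN_ACCEPTEX fn = NULL;
    DWORD         bytes = 0;

    if (WSAIoctl(s, SIO_GET_EXTENSION_FUNCTION_POINTER,
                 &guid, sizeof(guid), &fn, sizeof(fn),
                 &bytes, NULL, NULL) == SOCKET_ERROR)
        return NULL;
    return (void *)fn;
}
#else
/* Stand-in loader so the sketch can be exercised off Windows. */
static int g_load_calls = 0;
static int g_dummy_target;

static void *load_acceptex(void *ctx)
{
    (void)ctx;
    g_load_calls++;
    return &g_dummy_target;
}
#endif
```

On Windows, a caller might write `LPFN_ACCEPTEX pAcceptEx = (LPFN_ACCEPTEX)ext_cache_get(&cache, load_acceptex, &listenSock);` once at startup and then invoke pAcceptEx for every accept, avoiding the per-call lookup described above.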
TransmitFile and TransmitPackets
Winsock offers two functions for transmitting data that are optimized for file and memory transfers. The TransmitFile API is present on both Windows NT 4.0 and Windows 2000, while TransmitPackets is a new Microsoft extension function that is expected to be available in a future release of Windows. TransmitFile allows the contents of a file to be transferred on a socket. Normally, if an application were to send the contents of a file over a socket, it would have to call CreateFile to open the file and then loop on ReadFile and WSASend until the entire file was read. This is very inefficient because each ReadFile and WSASend call requires a transition from user mode to kernel mode. TransmitFile simply requires an open handle to the file to transmit and the number of bytes to transfer. The only overhead is opening the file via CreateFile, followed by a single kernel-mode transition. If your app sends the contents of files over sockets, this is the API to use.
The TransmitPackets API
takes the TransmitFile API a step further by allowing the caller to specify
multiple file handles and memory buffers to be transmitted in a single call. The
function prototype looks like this:
BOOL TransmitPackets(
    SOCKET                     hSocket,
    LPTRANSMIT_PACKETS_ELEMENT lpPacketArray,
    DWORD                      nElementCount,
    DWORD                      nSendSize,
    LPOVERLAPPED               lpOverlapped,
    DWORD                      dwFlags
);
The lpPacketArray is an array of structures. Each entry can specify either a file handle or a memory buffer to be transmitted. The structure is defined as:
typedef struct _TRANSMIT_PACKETS_ELEMENT {
    DWORD dwElFlags;
    DWORD cLength;
    union {
        struct {
            LARGE_INTEGER nFileOffset;
            HANDLE        hFile;
        };
        PVOID pBuffer;
    };
} TRANSMIT_PACKETS_ELEMENT;
The fields are self-explanatory. The dwElFlags field identifies whether the current element specifies a file handle or memory buffer via the constants TF_ELEMENT_FILE and TF_ELEMENT_MEMORY. The cLength field dictates how many bytes to send from the given data source (a zero indicates the entire file in the case of a file element). The unnamed union then contains either the memory buffer or the file handle (and possibly an offset) of the data to be sent.
Another benefit of using these two APIs is that you can reuse
the socket handle by specifying the TF_REUSE_SOCKET flag in addition to the
TF_DISCONNECT flag. Once the API completes the data transfer, a transport-level
disconnect is initiated. The socket can then be reused in an AcceptEx call.
Using this optimization would lessen the overhead associated with creating
sockets in the separate accept thread, as discussed earlier.
The only caveat of using either of these two extension APIs is that on Windows NT Workstation or Windows 2000 Professional only two requests will be processed at a time. You must be running Windows NT Server, Windows 2000 Server, Windows 2000 Advanced Server, or Windows 2000 Datacenter Server to get full use of these specialized APIs.
Putting it Together
In the preceding sections, we covered the APIs
and methods necessary for high-performance, scalable applications, as
well as the resource bottlenecks that may be encountered. What does this mean to
you? Well, that depends on how your server and client are structured. The more
control you have over the design of both the client and server, the better you
can avoid bottlenecks.
Let's look at a sample scenario: a server that handles clients that connect, send a request, receive data from the server, and then disconnect. Here, the server will create a listening socket and associate it with a completion port, creating a worker thread for each CPU. Another thread will post the AcceptEx calls. Since you know the client will connect and immediately send data, supplying a receive buffer can make things substantially easier. Of course, you shouldn't forget to occasionally poll the client sockets used in the AcceptEx calls, using the SO_CONNECT_TIME option to make sure there are no stale connections.
An important issue in this design is to determine how many
outstanding AcceptEx calls are allowed. Because a receive buffer is being posted
with each accept call, a significant number of pages could be locked in memory.
(Remember each overlapped operation consumes a small portion of non-paged pool
and also locks any data buffers into memory.) There is no real answer or
concrete formula for determining how many accept calls should be allowed. The
best solution is to make this number tunable so that performance tests may be
run to determine the best value for the typical environment that the server will
be running in.
Now that you have determined how the server will accept
connections, the next step is sending data. An important factor in deciding how
to send data is the number of concurrent connections you expect the server to
handle. In general, the server should limit the number of concurrent
connections, as well as the number of outstanding send calls. More established
connections mean more non-paged pool usage. The number of concurrent send calls
should be limited to prevent reaching the locked pages limit. Again, both of
these limits should be tunable.
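The two tunable limits can be kept together with the counters they govern. The sketch below uses hypothetical names throughout (SERVER_LIMITS, SERVER_COUNTERS, and the three functions) and plain, unsynchronized counters; a server with several worker threads would need interlocked operations or a lock around them.

```c
/* Hypothetical tunables, ideally loaded from configuration. */
typedef struct {
    long max_connections;     /* bounds non-paged pool usage */
    long max_pending_sends;   /* bounds locked pages */
} SERVER_LIMITS;

/* Current usage. Not thread safe as written. */
typedef struct {
    long connections;
    long pending_sends;
} SERVER_COUNTERS;

/* Returns 1 and counts the connection if under the cap, else returns 0. */
static int try_admit_connection(const SERVER_LIMITS *lim, SERVER_COUNTERS *cnt)
{
    if (cnt->connections >= lim->max_connections)
        return 0;
    cnt->connections++;
    return 1;
}

/* Returns 1 and reserves a send slot if under the cap, else returns 0. */
static int try_reserve_send(const SERVER_LIMITS *lim, SERVER_COUNTERS *cnt)
{
    if (cnt->pending_sends >= lim->max_pending_sends)
        return 0;
    cnt->pending_sends++;
    return 1;
}

/* Call when a send completion is dequeued from the completion port. */
static void release_send(SERVER_COUNTERS *cnt)
{
    if (cnt->pending_sends > 0)
        cnt->pending_sends--;
}
```

When try_reserve_send returns 0, the caller can queue the data in user-mode memory and post it after a later completion frees a slot, rather than issuing yet another overlapped send.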
In this situation it is not necessary to disable the per-socket receive buffers, since the only receive that occurs is in the AcceptEx call. Of course, it wouldn't hurt for you to guarantee that each connection has a receive buffer posted. Now, if the client/server interaction changes so that the client sends additional data after the initial request, disabling the receive buffer would be a bad idea unless you guarantee that an overlapped receive is posted on each connection to receive these additional requests.
Conclusion
Developing a scalable Winsock server is not terribly difficult. It's a matter of setting up a listening socket, accepting connections, and making overlapped send and receive calls. The main challenge lies in managing resources by placing limits on the number of outstanding overlapped calls so that the non-paged pool is not exhausted. Following the guidelines we covered here will allow you to create high-performance, scalable server applications.
For related articles see:
Writing Windows NT Server Applications in MFC Using I/O Completion Ports
I/O Completion Ports
For background information see:
Network Programming for Microsoft Windows by Anthony Jones and Jim Ohlund (Microsoft Press, 1999)
Anthony Jones and Amol Deshpande work in the Microsoft Windows 2000 Networking group. Anthony is coauthor of Network Programming for Microsoft Windows (Microsoft Press, 1999).
From the October 2000 issue of MSDN Magazine.