Thursday, September 29, 2011

Cisco H-REAP Local Authentication Improvements

A while back I wrote about Cisco H-REAP, how it works, and some considerations and feature limitations.

As a follow-up, I'd like to briefly cover some recent changes in how H-REAP access points perform client authentication, specifically the H-REAP Local Authentication features, since this topic is confusing for most engineers (in part because of the name).

H-REAP Local Authentication in code version 7.0.116.0 encompasses three distinct options for authentication of clients when "connected" to a WLC or "standalone".

Backup RADIUS Servers (AP as NAS)
This is the long-standing H-REAP feature that allows the AP to source RADIUS authentication to external 'backup' RADIUS servers when in standalone mode and the WLC is unreachable. In this scenario, the AP is the authenticator, which forwards requests to the authentication server (RADIUS server). The authentication server must have the APs defined as a NAS.

This permits the APs to accept new clients and successfully complete authentication without reliance on the WLC. This is really a high availability feature for the WLAN to prevent an outage if one or multiple controllers fail.

Define the H-REAP backup RADIUS servers from the H-REAP Group:

H-REAP Backup RADIUS Servers


Update - Based on feedback in the comments, it should be pointed out that the AP and RADIUS server must use the same shared secret as configured on the wireless LAN controller. The AP inherits the RADIUS server definition and configuration from the controller, and it cannot be configured separately. This applies to both AP-as-NAS scenarios described in this post.

Local Authentication (AP as RADIUS)
This feature allows the H-REAP access point to assume the role of both the authenticator and the authentication server (RADIUS server). The AP is required to have local users defined and only supports a few EAP types, notably EAP-FAST and LEAP.

This is what most engineers assume "H-REAP Local Authentication" means, without knowledge of the third option below. Having locally defined users and restricted EAP types makes this option not all that attractive for most organizations, but there are some who have obviously requested it from Cisco. It beats static pre-shared keys for security, but creates an administrative burden on network administrators, who must define and manage local users on the H-REAP APs.

Many organizations already have user accounts stored in corporate directory systems such as Active Directory, and definition of users in an alternate database is not preferred or allowed. That said, this feature doesn't make a lot of sense for user accounts tied to real people since corporate security policies usually dictate that passwords are rotated regularly and support personnel don't have access to those passwords, preventing them from being replicated on H-REAP APs. However, it may make sense for non-user IDs attached to embedded devices such as IP phones, Vocera badges, or handheld scanners.

Enable H-REAP Local Authentication (AP as RADIUS) in the H-REAP Group:

Enable H-REAP Local Authentication by the AP

Additionally, define local users and configure EAP-FAST and LEAP:

H-REAP Local Auth User Definition

Local Authentication (AP as NAS)
Historically, when the H-REAP AP was connected to a WLC, all authentication flowed through the controller. In code version 7.0.116.0 this changed. Now administrators can configure H-REAP APs to source client authentication and bypass the WLC, even in connected mode. This is beneficial when the AP is logically closer to the RADIUS server than the WLC and should source authentication itself to improve performance. A remote site that contains both H-REAP APs and a RADIUS server, with a centralized controller across the WAN, is one example.

Enable H-REAP Local Authentication (AP as NAS) in each WLAN:

Enable H-REAP Local Authentication in the WLAN

When this option is enabled, the AP sends client authentication requests to servers in this order:
  1. WLAN AAA Servers
  2. H-REAP Group Backup RADIUS Servers (primary/secondary)
  3. AP Local Authentication (local users on the AP)
One other quick note is that the H-REAP APs also support RADIUS fallback. So if a primary RADIUS server fails and subsequently comes back online, the AP will recognize this and begin sending client authentications back to the preferred server.
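The selection order and fallback behavior above can be sketched in a few lines of Python. This is only an illustrative model, not how the AP is actually implemented: the `RadiusServer` class, the `pick_auth_target` function, and the server names are all hypothetical, and "reachable" is reduced to a simple flag.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RadiusServer:
    """Hypothetical stand-in for a RADIUS server definition."""
    name: str
    reachable: bool = True


def pick_auth_target(wlan_aaa: List[RadiusServer],
                     group_backup: List[RadiusServer],
                     local_auth_enabled: bool) -> Optional[str]:
    """Return the first usable authentication target in preference order.

    Order modeled: WLAN AAA servers, then H-REAP Group backup servers,
    then the AP's own local user database.
    """
    for server in list(wlan_aaa) + list(group_backup):
        if server.reachable:
            return server.name
    return "ap-local-users" if local_auth_enabled else None


primary = RadiusServer("wlan-aaa-1")
backup = RadiusServer("group-backup-1")

primary.reachable = False   # primary fails: the backup is selected
during_outage = pick_auth_target([primary], [backup], True)

primary.reachable = True    # primary returns: selection falls back to it
after_recovery = pick_auth_target([primary], [backup], True)
```

Because the list is walked in preference order on every request, a recovered primary server is chosen again automatically, which is the essence of RADIUS fallback.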

Hopefully this clarifies some of these new features that aid both high availability and performance.

Cheers,
Andrew

Wednesday, September 7, 2011

Microsoft Lync QoS

This week I'm back to my favorite topic, quality of service. Engineers across multiple teams at my current employer have been working on a project to enable wireless VoIP using softphones running Microsoft Lync on both Windows 7 and XP. This project has provided an opportunity to review how our organization handles multi-function wireless devices and network performance, with particular focus on quality of service mechanisms.

We've found some interesting things...

Wireless QoS - Not a "One-Size Fits All" Policy (Anymore)
Since Wi-Fi is a shared medium with control distributed to all active nodes in the system, proper network arbitration and performance are heavily dependent on client behavior. When dealing with QoS using 802.11e/WMM, this means accurate application traffic marking and queuing by client devices themselves. 802.11e prescribes 8 user priorities and 4 priority queues to provide a base level of differentiated services for traffic (you can read more about this in my Wireless QoS 5-part series).

Many wireless LAN vendors have only provided basic support for wireless QoS by reading the QoS values within frames and queuing traffic for downstream transmission to clients. Some vendors have begun going beyond the basics to provide customers with a feature called "Airtime Fairness", a set of proprietary extensions that give the infrastructure more in-depth control of clients. I particularly like how Devin Akin at Aerohive highlights that this feature is really manipulation of the environment to suit a specific policy, whether it be equitable between clients or not. However, these are still mainly downlink traffic mechanisms, and the uplink flow is still controlled by the client. (There are some exceptions to this that involve infrastructure vendors monkeying with client TCP windows, acknowledgments, etc., but let's not get into that now.)

Device Convergence
A Swiss-Army Knife of Capabilities
Traditionally, vendors have implemented QoS on a per-SSID basis. This worked decently once upon a time, when all devices had only a single purpose in life. It was fairly easy for administrators to segment data-only devices such as laptops from voice-capable devices such as IP phones. No problem. Everything in the SSID gets slapped with a QoS template and we're done.

But what happens now when we have veritable Swiss-Army devices that perform multiple functions? What SSID do we put those in? How do we differentiate network performance and QoS based on application flows rather than device or SSID?

The answer is that wireless clients and the network must both have more intelligence to handle dynamic QoS requirements. Device convergence and the use of multi-purpose devices eliminate the ability to effectively use static 1:1 QoS policies tied to wireless SSIDs. We need something better!

Microsoft Lync QoS - A Case Study
Microsoft Lync is a software platform for unified communications providing data, voice, and video collaboration on Windows workstations. Many organizations are exploring device convergence to expand capabilities available to all employees while controlling capital and recurring costs.

As part of our lab verification of Lync prior to production deployment, we tested Wi-Fi quality of service integration. Microsoft Windows Vista, Windows 7, and Server 2008 platforms support policy-based QoS, while older Windows XP systems only support two service levels with the QoS Packet Scheduler. I'll focus on Windows 7 workstations with the more robust policy-based QoS capabilities. We set up an Active Directory GPO to classify and mark all voice traffic coming from Lync with a DSCP value of 46 (expedited forwarding), which is a general best practice on IP-based networks (see RFC 3246 section 2.7) and conforms with Cisco's QoS Baseline recommendations.

The resulting traffic analysis verified that Lync correctly marked DSCP in the packets, but we also noticed that the layer 2 Class of Service (CoS) marking for 802.11e/WMM is set to 5. We expected to see 6, since the 802.11-2007 standard clearly states in table 9-1 that user priorities 6 and 7 are reserved for voice, while 4 and 5 are reserved for video (note that the Wikipedia entry on 802.11e contains incorrect data). This is distinctly in conflict with the 802.1p CoS values for wired LANs (using the revised 802.1Q-2005 values) which places voice traffic in priority 5.
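To make the queue assignment concrete, here is a small Python sketch of how 802.11e/WMM user priorities map onto the four access categories. The dictionary reflects the standard WMM assignment; the names `ACCESS_CATEGORY_BY_UP` and `access_category` are just labels chosen for this illustration.

```python
# 802.11e/WMM user priority (UP) to access category (AC) assignment.
ACCESS_CATEGORY_BY_UP = {
    1: "AC_BK",  # background
    2: "AC_BK",
    0: "AC_BE",  # best effort
    3: "AC_BE",
    4: "AC_VI",  # video
    5: "AC_VI",
    6: "AC_VO",  # voice
    7: "AC_VO",
}


def access_category(user_priority: int) -> str:
    """Return the WMM access category for an 802.11e user priority (0-7)."""
    if user_priority not in ACCESS_CATEGORY_BY_UP:
        raise ValueError("user priority must be between 0 and 7")
    return ACCESS_CATEGORY_BY_UP[user_priority]


observed_lync_queue = access_category(5)   # what we saw in the capture
expected_voice_queue = access_category(6)  # what we expected for voice
```

A frame marked with priority 5 therefore contends for the medium with video parameters, not voice parameters.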

Microsoft Lync marks layer 2 CoS values based on mapping IP Precedence values (3 most-significant bits in IP ToS header field)

Moreover, the mappings between layer 2 and layer 3 markings are distinctly different between vendors. Microsoft maps layer 2 CoS values, whether 802.1p or 802.11e, to the same set of layer 3 values based on IP Precedence (not DSCP) and does not differentiate between wired and wireless networks. Mappings in the opposite direction (DSCP to layer 2) also rely on only the IP Precedence values and not on the full DSCP codepoint. Therefore, a DSCP value of 46 is translated as an IP Precedence of 5 and a layer 2 priority of 5. Cisco, on the other hand, maps values differently for wired and wireless networks (see Table 2-7) to accommodate the variations between standards.
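The arithmetic behind that translation is simple enough to sketch in Python. DSCP occupies the six most-significant bits of the ToS byte and IP Precedence the top three, so dropping the low three DSCP bits yields the precedence. The function names below are hypothetical, and the second function only models the behavior described above (layer 2 priority equals IP Precedence); it is not Microsoft's code.

```python
def dscp_to_ip_precedence(dscp: int) -> int:
    """Return the IP Precedence (top 3 bits) of a 6-bit DSCP value."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be between 0 and 63")
    return dscp >> 3


def precedence_based_l2_priority(dscp: int) -> int:
    """Model of the described behavior: layer 2 priority is the IP
    Precedence, regardless of wired or wireless connection."""
    return dscp_to_ip_precedence(dscp)


ef_priority = precedence_based_l2_priority(46)  # 0b101110 -> 0b101 -> 5
```

DSCP 46 is binary 101110, so the top three bits are 101, which is precedence 5 and therefore layer 2 priority 5.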

Differences in layer 2 Class of Service (CoS) values between IEEE standards and layer 2 to layer 3 mapping implementations by vendors create complexity

The root of the issue is two-fold. First, the 802.1p and 802.11e QoS values are clearly in conflict. This makes accurate QoS implementation difficult to achieve because variations need to be dealt with correctly by each and every solution implementation. This introduces plenty of room for error. Second, Microsoft's QoS implementation in Windows maps DSCP to layer 2 CoS based on legacy IP Precedence values. It does not differentiate between wired and wireless network connections and adjust markings appropriately.

Network-Wide Ramifications
The standard practice of marking voice with DSCP 46 will result in the improper classification, marking and queuing of the traffic throughout the network.

First, wireless client transmissions (upstream) by the Windows workstations will get placed into the video queue and will not receive the appropriate contention window values. This will reduce the statistical advantage that voice frames receive for transmission over the air and could impact voice latency and jitter, especially in video-rich networks. As video adoption in the enterprise increases, especially with increasing mobile device usage, this could have serious effects on voice quality.

Second, many wireless network infrastructure vendors do not provide the ability to inspect and re-classify traffic and are forced to trust the client markings. For example, the Cisco Unified Wireless Network simply enforces a maximum QoS value for each SSID that clients cannot exceed. If the SSID is configured for Platinum QoS, then no maximum can be exceeded and traffic will be translated based on the client marking. The CAPWAP tunnel will map the client's layer 2 value of 5 (video) to a DSCP value of 34 for the outer tunnel IP header. Once the traffic is de-encapsulated by the controller, it applies an 802.1p value of 4 based on the CAPWAP packet DSCP value 34 mapped back to a layer 2 wired value, while leaving the client packet's original DSCP value of 46 untouched. Given the best practice of configuring switches to trust DSCP from CAPWAP APs and trust CoS from controllers, this traffic will be mishandled by intermediate switches and routers as well. In fact, the default Cisco switch behavior (when QoS is enabled) is for DSCP transparency to be disabled, so the trusted layer 2 value coming from the WLC overrides the original client DSCP marking, which is re-written to 32 based on the default switch CoS-to-DSCP mapping.
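The chain of translations is easier to follow when traced step by step. The Python sketch below is purely illustrative: the lookup tables are deliberately partial and contain only the values quoted in the paragraph above, and the function and table names are hypothetical.

```python
# Partial tables holding only the values quoted in the text above.
WIRELESS_UP_TO_CAPWAP_DSCP = {5: 34}  # AP: client 802.11e priority -> outer DSCP
CAPWAP_DSCP_TO_8021P = {34: 4}        # WLC: outer DSCP -> wired 802.1p CoS
SWITCH_COS_TO_DSCP = {4: 32}          # switch: trusted CoS -> rewritten DSCP


def trace_upstream_marking(client_dscp: int) -> dict:
    """Trace how a client's voice marking changes on its way upstream."""
    client_up = client_dscp >> 3  # client maps by IP Precedence
    capwap_dscp = WIRELESS_UP_TO_CAPWAP_DSCP[client_up]
    wlc_cos = CAPWAP_DSCP_TO_8021P[capwap_dscp]
    rewritten_dscp = SWITCH_COS_TO_DSCP[wlc_cos]
    return {
        "client_dscp": client_dscp,
        "client_up": client_up,
        "capwap_dscp": capwap_dscp,
        "wlc_cos": wlc_cos,
        "rewritten_dscp": rewritten_dscp,
    }


lync_voice = trace_upstream_marking(46)
# client_up 5 -> capwap_dscp 34 -> wlc_cos 4 -> rewritten_dscp 32
```

The packet leaves the client as DSCP 46 and, after three well-intentioned translations, ends up as DSCP 32 on the wired network.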

Third, advanced wireless voice control features will be ineffective since the voice traffic is not properly identified within the voice queue. This includes TSPEC, call admission control (CAC), traffic bandwidth reservation, expedited bandwidth requests to facilitate emergency 911 calls, and voice stream metrics collection and reporting. Many of these features are only available to traffic streams within the voice queue. Additionally, off-channel scanning used by RRM and rogue detection is configured to defer if traffic in user priorities 4, 5, or 6 has been received in the last 100 ms. If this policy has been changed to only defer for priority 6 traffic, it could negatively impact Lync traffic in priority 5 by increasing network re-transmissions and network latency.

Finally, WAN bandwidth across the network could be negatively impacted. Typically, network administrators will design WAN circuits to reserve bandwidth for a specific number of voice calls (based on Erlang calculations) as well as enforce a per-call bandwidth limitation based on concurrent call volume. Since the voice calls will not be placed into the voice queue, no traffic admission or policing can occur. Should the environment grow to a point where concurrent voice calls exceed the design, the WAN bandwidth may not be sufficient to handle the additional calls and the network will not be able to restrict call admission, resulting in poor voice performance.
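For readers unfamiliar with Erlang calculations, here is a minimal Python sketch of that kind of sizing exercise. It assumes the classic Erlang B model (blocked calls are dropped, not queued), which may or may not be the model a given voice team uses; the function names are hypothetical, and the offered load and blocking target in the example are made-up numbers.

```python
def erlang_b(offered_load_erlangs: float, channels: int) -> float:
    """Blocking probability for a given offered load and channel count,
    computed with the standard iterative form of the Erlang B formula."""
    blocking = 1.0  # with zero channels, every call is blocked
    for n in range(1, channels + 1):
        blocking = (offered_load_erlangs * blocking) / (
            n + offered_load_erlangs * blocking
        )
    return blocking


def channels_needed(offered_load_erlangs: float,
                    target_blocking: float,
                    max_channels: int = 1000) -> int:
    """Smallest number of concurrent calls to provision so that the
    blocking probability does not exceed the target."""
    for n in range(max_channels + 1):
        if erlang_b(offered_load_erlangs, n) <= target_blocking:
            return n
    raise ValueError("target not reachable within max_channels")


# Hypothetical branch office: 2 Erlangs of busy-hour voice load and a
# target of no more than 40% blocking (unrealistically high, chosen
# only to keep the arithmetic easy to check by hand).
calls_to_reserve = channels_needed(2.0, 0.4)
```

Multiplying the resulting call count by the per-call bandwidth of the chosen codec gives the voice reservation for the circuit, and that reservation only protects calls that actually land in the voice queue.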

Let's also not forget the risk of human error somewhere down the road. Assuming documentation is in proper order, policy accommodations have been made throughout the network, and engineer transitions include adequate knowledge transfer, a clear risk still exists that future changes will inadvertently affect voice traffic since it isn't supposed to be within the video queue.

Integrating Lync Into a Broader Network Architecture
Our networks aren't created in a vacuum; they are a shared resource, and engineers must design them to handle the varying capabilities and demands of all connected endpoints and traffic flows. So, what options are available to mitigate this issue and effectively handle Microsoft Lync voice traffic on a converged wireless network?

We could ignore the problem and apply the standard voice QoS value of DSCP 46 (EF) in our Windows policy. This will result in accurate marking and traffic handling for wired clients, but suffers the ramifications previously described for wireless clients and broader network resource impacts. Hardly an optimal solution.

Instead, consider applying a non-standard DSCP marking. Using this approach we use policy-based QoS on Windows platforms to mark voice traffic as DSCP 48 (CS6), which allows it to be mapped to the correct wireless layer 2 CoS value of 6 and be queued correctly for transmission by the client.

The problem is that this DSCP value is reserved for Internetwork Control traffic by most networking vendors (including Cisco). Network administrators will need to ensure that wired and wireless integration is configured correctly, otherwise incorrect classification could propagate throughout the network for these traffic flows. For Cisco wireless networks this means that switch ports should be configured to trust DSCP from APs and CoS from controllers. This trust, coupled with the Cisco wireless mappings between CoS and DSCP, ensures that the correct DSCP value of 46 (EF) will be used both in the CAPWAP tunnel and re-written by the switch (based on WLC CoS trust) on the client packet once forwarded upstream out of the controller.

If using another wireless vendor, engineers should review the QoS capabilities of the solution and integration options to ensure accurate handling of this traffic. For example, many vendors support robust inspection and classification of traffic flows to match configured QoS policy directly in the access point. In these instances, configure APs to identify voice traffic and apply QoS based on defined policy.

This configuration will also cause incorrect markings when the workstations are connected to a wired network, which is still a likely occurrence for laptops in most environments. Network administrators should ensure that switches are configured to strip the client QoS markings on switch ports throughout the network, which is standard practice. This way the switch will ignore the client markings and re-apply QoS policy using traffic inspection techniques. Ethernet switching largely eliminates medium contention concerns on wired networks, so having an incorrect policy on the first-hop upstream is not a large concern.

I prefer to implement this solution. With proper end-to-end network QoS implemented, accurate traffic handling can be accomplished on both wired and wireless networks.

Revolution or Evolution? - Andrew's Take
Clearly something went wrong with standards-based QoS development. Divergent layer 2 QoS definitions exist from the IEEE standards. Furthermore, no standard mappings exist between layer 2 and layer 3 QoS values, as evidenced by differences in vendor implementations.

As mobile device convergence proliferates and wired and wireless networks continue to become more closely integrated, consistent end-to-end QoS policy definition and traffic handling will be critical in supporting increasing network demands.

Microsoft Lync highlights these challenges. Increasingly, organizations are looking for converged solutions to provide improved business capabilities and service. And many client platforms won't have as robust control or policy definition options as Windows. How will your organization handle voice, video and data over iOS, Android, Blackberry, Windows Phone 7, and other platforms?

Cheers,
Andrew

