I Work For Dell

Whilst I work for Dell, the opinions expressed in this blog are those of the author.
Dell provides IT products, solutions and services to all industry sectors including end user organisations and cloud / service providers.

Monday 15 September 2014

Cloud Security FUD - Worrying

Apparently, according to a Computer Weekly article:

http://www.computerweekly.com/news/2240230281/Cloud-FUD-far-removed-from-realities-finds-research?asrc=EM_MDN_34048363&utm_medium=EM&utm_source=MDN&utm_campaign=20140915_Cloud%20FUD%20far%20removed%20from%20reality,%20finds%20research_

there is a lot of FUD surrounding security problems in the cloud.

Let me start by saying that I believe that there are many good ways in which cloud offerings can be secured.  Some will be inadequate, some will be adequate, some will be very good and some will be so secure that it really should be called a private network and not a cloud.

But this article claims that there should be little concern about security.  It's not a good article for proving that premise; in fact, it does the opposite.

In particular, this line concerns me:  "only 2% of organisations admitted to experiencing a cloud service-related data security breach".

2% admitted to a problem.  That's 1 in 50 companies.  1 in 50.  To me, this heightens concerns about security; it certainly does nothing to make me comfortable.  Given that many organisations may not want to admit to security breaches in a survey, the real number is probably higher than declared here.  So is it 1 in 25 companies?  1 in 10?

The survey reported in the article appears to be promoted by the Cloud Industry Forum.  I thought this was a reasonably reputable body.  Is this article, and the CIF's lack of concern over 2% of companies experiencing data breaches, really doing the industry any good?  Or CIF itself?

Tuesday 20 May 2014

How Many Is Too Many VDI Sessions?

I was reading through some reference architectures this morning and I noticed a (non-Dell) offering for 7,000 VDI user sessions. This was billed as a large scale reference architecture - in reality, a fairly typical medium sized deployment.

Dig deeper and we find that the testing was performed on 7,000 running sessions, but with only 80% of them actually running any application activity - so that's 5,600 active sessions. User density per server (in this example) drops from an apparent 145 sessions to a more realistic 116.  Each VDI session was allocated 1 vCPU and 2GB RAM, and each server was deployed with 256GB RAM.  At a density of 116 sessions per server that's 232GB RAM - add the requirement for the vSphere hypervisor and the server is at its maximum memory.  At the full 7,000 users, meaning 145 sessions per server, the server specification is not sufficient to support the allocation per user (145 x 2GB = 290GB, plus an allowance for vSphere).  Reported memory utilisation in the report was fine, but there is a clear risk of memory over-commitment.
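
As a rough check on the memory arithmetic above, here is a minimal sketch using the figures from that reference architecture; the hypervisor overhead figure is my own assumption, not one from the paper:

    # Rough VDI memory sizing check - session and server figures are from the
    # reference architecture discussed above; hypervisor overhead is an assumption.
    RAM_PER_SESSION_GB = 2        # allocated per VDI session
    SERVER_RAM_GB = 256           # installed per server
    HYPERVISOR_OVERHEAD_GB = 8    # assumed allowance for vSphere itself

    for density in (116, 145):
        needed = density * RAM_PER_SESSION_GB + HYPERVISOR_OVERHEAD_GB
        verdict = "fits" if needed <= SERVER_RAM_GB else "over-committed"
        print(f"{density} sessions need ~{needed}GB of {SERVER_RAM_GB}GB installed -> {verdict}")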

The tests were conducted using the industry standard LoginVSI activity simulation tool. With 7,000 users the CPUs were maxed out and VSIMax was reached, which indicates user sessions suffering performance degradation - something to be avoided.

Storage utilisation is reported to be very low at around 3.5TB, which is an impressive level of reduction delivered through linked clone images, data reduction techniques and thin provisioning. However, the system deployed in the testing comprised just under 60TB of storage. This would appear to mean the storage footprint could be much smaller than that deployed, but it's not clear whether the larger storage volume would still be required to deliver the necessary IOPs. VDI sessions were all non-persistent, which typically gives greater storage efficiency than persistent sessions.
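
To separate "storage for capacity" from "storage for IOPs", a back-of-the-envelope model like the sketch below helps. Every input here (IOPs per user, read/write split, RAID penalty, per-spindle IOPs) is an illustrative assumption, not a figure from the paper:

    # Does the array need to be big for capacity, or big for IOPs?
    # All inputs below are illustrative assumptions.
    import math

    users = 7000
    iops_per_user = 10          # assumed steady-state IOPs per session
    write_ratio = 0.8           # VDI workloads are typically write-heavy
    raid_write_penalty = 2      # e.g. RAID 10 (RAID 5 would be 4)
    disk_iops = 180             # assumed per-spindle IOPs for a 10k drive

    front_end = users * iops_per_user
    back_end = front_end * (1 - write_ratio) + front_end * write_ratio * raid_write_penalty
    spindles = math.ceil(back_end / disk_iops)

    print(f"Front-end IOPs: {front_end}, back-end IOPs: {back_end:.0f}")
    print(f"Spindles needed for IOPs alone: {spindles}")
    # Compare spindles x capacity-per-drive with the ~3.5TB of actual data -
    # whichever is larger is what really drives the storage bill.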

So here are some points to think about when looking at reference architectures for VDI:

  • How many sessions are deployed?
  • How many sessions are concurrently active?
  • How much CPU is being consumed during peak concurrent activity (85% is a reasonable maximum)?
  • How much memory is being consumed during peak concurrent activity?
  • Is the server memory configuration high enough to avoid memory contention / over-commitment for the resources allocated to the VDI sessions?
  • How large are the VDI sessions - vCPU and memory allocations?  Do these match your actual requirements?  If not, how will this affect density per server when they are matched to your requirements?
  • How many IOPs are available when the system is under stress?
  • Are industry standard tools being used to generate load and measure performance?
  • If LoginVSI is in use, consider the point at which the VSI Index curve starts to climb steeply - this is when session performance is starting to degrade and is often well before reaching VSIMax.  If VSIMax is reached, performance is likely to be well beyond acceptability for users.
  • Are sessions persistent or non-persistent?  Does this match your users' need?
  • Is the volume of storage required there to provide storage capacity or is it there to provide IOPs capacity?  If volume of storage is matched to utilisation, will IOPs available suffer?
  • What IOPs have been assumed per user?  Will this reflect realistic IOPs in use?
  • Check IOPs proportions. Typically 75% / 25% or 80% / 20% write / read ratios are seen as reasonable for VDI sessions.
  • Has user experience been measured and reported (either subjectively or via a tool such as Stratusphere UX or similar)?
  • In a typical hypervisor environment - what will happen when a host is lost within the cluster design?  Will there be headroom on the surviving servers to handle the re-distributed workload?  (A quick way to check this is sketched after this list.)
  • What density will be achieved once you have applied your disaster recovery standards?
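
On the host-failure point in the list above, the N-1 arithmetic is quick to sketch; the cluster size and per-host ceiling below are illustrative assumptions:

    # N-1 headroom check: can the surviving hosts absorb a failed host's sessions?
    # Cluster size and per-host limits are illustrative assumptions.
    hosts = 8
    sessions_per_host_planned = 116   # planned steady-state density
    sessions_per_host_ceiling = 128   # assumed limit before contention bites

    total_sessions = hosts * sessions_per_host_planned
    per_host_after_failure = total_sessions / (hosts - 1)

    verdict = "within" if per_host_after_failure <= sessions_per_host_ceiling else "beyond"
    print(f"After one host failure each survivor carries {per_host_after_failure:.0f} "
          f"sessions - {verdict} the {sessions_per_host_ceiling}-session ceiling")
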
Explore the white papers and reference architectures very carefully, as each one takes a different approach to deployment and reporting.  They are very useful papers and give strong indications of what you can expect to be able to deploy.  When you compare between vendor papers, make sure it truly is an apples-for-apples comparison, apply lots of "what if?" analysis, and ensure you understand the differences between the papers and the actual deployment that will work for you in the real world.

Friday 2 May 2014

Domestics This Time - Leaky HomePlug Networks

A bit of a domestic diversion this time: I've been using HomePlugs to feed my Squeezeboxes, TVs, DS etc. for approximately 3 years and they've been completely reliable to date. I use a mix of TP-Link single plugs and Zyxel multi-port units.

Just in the last week I've started to have reliability issues with my network - weird stuff was happening, like the wifi recycling every 2 to 3 minutes and trouble logging into my EE router from my PC. Resetting the router to factory defaults seemed to resolve the problem for up to half a day or so, then everything started going wobbly again.

Then, just yesterday, I did a factory default reset of the router and as I was trying to log into it, the router suddenly flipped from an EE log in page to a Huawei log in page. Weird, as I don't have a Huawei router in the house, never mind plugged in. Access to the internet continued OK, but the devices plugged into the HomePlugs were not connecting properly with the network and the DS started doing that fading then coming back thing every couple of minutes which happens only when there's a network problem. Removing all the HomePlugs and running only on wifi made the whole thing stable, but the hifi very quiet of course.

I decided on a process of elimination by plugging in one HomePlug at a time to see which one was causing the problem. But this wasn't consistent - sometimes the problem arose with one device, then with another. Then, the weirdest thing happened - McAfee on my PC popped up with a warning that I had about 10 devices on my network that "are not protected by McAfee". One of these was a laptop named "CHRIS". A Chris lives over the road from me! This only happened when a HomePlug was connected up - any of them - but not when only running wifi.

A bit of Googling later and I find that the general consensus on "leakage" of HomePlug signals beyond the house circuit breaker box / consumption meter has changed significantly compared to 3 years ago. When I first started with these things, leakage beyond the consumer unit was "extremely unlikely". Now it's reckoned to be virtually certain in an apartment block, and "fairly likely" in houses.

In order to make these devices "plug and play", they all present themselves to a default network grouping of HomePlug devices called "HomePlugAV". So any of these devices that stick with this default name will talk to each other. I think one of my near neighbours must've recently decided to use HomePlugs when they haven't before, and our two networks are talking to each other. And because many routers default to the same network IP address, I was getting access to my neighbour's Huawei router which, for some reason, was "over powering" the IP address of my own router. So it's not surprising that my HomePlug devices and the rest of the network were getting confused - the DHCP server and default gateway were probably flipping between my own router and my neighbour's.

So the resolution was to install the HomePlug management software on my PC and rename the network name on all the devices from HomePlugAV to something unique to my own house.

Now all seems to be well and stable - across HomePlug and wifi. I'll keep an eye on how it goes, and I have a list of the MAC addresses of the outsider HomePlugs so I can advise whichever neighbour they belong to to secure their own network.

So there you go. Take note and make the changes to your own HomePlugs before things start to go a bit wobbly in your network too.

Friday 21 March 2014

How Will You Cope With A Cloud Disaster?

What happens when a cloud service fails?

What happens when a cloud provider has a disaster?

How does your business continue?

Why does my internal IT cost me so much more than buying a few VMs from a cloud provider?

I've recently read some interesting points on this topic (they are paraphrased below and the authors will not be named), which give me cause to think about the approaches businesses are taking to cloud services, and some of the areas which appear to be adding risk to their business-critical IT services.  Here are a couple of examples and why they're not correct:

"One of the benefits of cloud is that IT service continuity becomes the problem of the service provider".  No!  This is one of the most concerning statements I've seen recently.  Service continuity is never the problem for the service provider.  It is always the concern of the business consuming that service.  It is possible that the way in which the service continuity is provided is within the remit of the service provider, but the continuity of the service itself is the problem of the consuming business.  Your business must ensure that the right provisions are in place, either through contractual arrangements (guaranteed service level agreements, recovery point objectives, recovery time objectives all with penalties which are commensurate with the business you will be loosing if these are not met) or, your business needs to think about service design as part of adopting cloud. See thoughts further down this article.

"public clouds do not offer disaster recovery".  That's also not strictly true.  By default, most probably don't.  However, many will offer the option to add such services.  Also, if you have an application that can run in multiple sites to provide high availability (HA) facilities, by agreeing with your service provider that they guarantee to run your application in multiple sites, then your HA can become your disaster recovery (DR) approach.  Also, the public cloud could be your DR facility - more later.

So a cloud strategy, just like any other IT strategy, needs backups and DR plans. To dismiss public cloud as not providing DR is to miss some of the options; to simply assume the service provider will look after DR is equally problematic. There are many approaches to this, and I'll suggest some below, but this isn't a comprehensive list or guide - it's here to provoke some thoughts and ideas.

By default, public cloud usually does not include DR in the traditional sense. However, by carefully selecting multiple cloud providers for a single business process or a cloud provider who can guarantee your systems will run in multiple sites, DR can be achieved - as long as your application is also designed with multi-site capabilities. In this case, DR is essentially just an extension of your HA approach.

Additionally, cloud can be part of your DR strategy. For example, you can run all your systems in house and upload data and source code up to cloud services on a regular basis. With the right process design and cloud service provider contract you can then invoke your DR by running that code and using that data that is stored in the cloud wherever your cloud provider has the capacity available.
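
As a minimal sketch of that "regular copies up to the cloud" idea, assuming an S3-compatible object store and the boto3 library - the bucket name and local path are hypothetical placeholders, and in practice the schedule would be driven by your recovery point objective:

    # Minimal sketch: push regular backup copies to a cloud object store as
    # part of a DR plan.  Assumes an S3-compatible service and boto3; the
    # bucket name and local path are hypothetical placeholders.
    import datetime
    import pathlib
    import boto3

    BUCKET = "example-dr-backups"          # hypothetical bucket
    BACKUP_DIR = pathlib.Path("/backups")  # hypothetical local staging area

    def push_backups():
        s3 = boto3.client("s3")
        stamp = datetime.date.today().isoformat()
        for path in BACKUP_DIR.glob("*.tar.gz"):
            # The key includes the date so each run keeps a recovery point
            s3.upload_file(str(path), BUCKET, f"{stamp}/{path.name}")

    if __name__ == "__main__":
        push_backups()  # schedule at an interval that meets your RPO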

So be careful how you design your service provision.  Think about the options, of which these are some:

- multiple physical locations offered by your service provider (make sure your data is still geographically compatible with relevant regulations, and your applications are designed for multi-site use);
- service provider provisioning of DR processes and facilities.  Make sure the design and contractual arrangements meet the business criticality of the system.  Where contractual penalties are agreed, make sure an hour of penalties equals or exceeds an hour of lost business, and that this is maintained as business volumes fluctuate (a rough sanity check is sketched after this list);
- choose multiple service providers for the same service on a continuous basis - make sure they don't share data centre facilities;
- choose one service provider to provide the service on an ongoing basis, and another to provide quick burstable capacity for use in a DR situation.  Design processes to ensure that the second service provider has current copies of your systems and regular updates of data, commensurate with your recovery point and recovery time objectives;
- use the cloud as your DR strategy for your in-house systems.  Regular copies of systems and data up to the cloud provider with a contract that allows you to start up your systems very quickly.  Treating the cloud provider as your second "warm" standby data centre is a viable approach.
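
On the penalty point above, a trivial sanity check makes any gap obvious; all figures are illustrative assumptions:

    # Do contractual penalties keep pace with lost business?
    # All figures are illustrative assumptions.
    revenue_lost_per_hour = 50_000   # business lost per hour of outage
    penalty_per_hour = 10_000        # what the contract pays back per hour
    outage_hours = 6

    shortfall = (revenue_lost_per_hour - penalty_per_hour) * outage_hours
    print(f"A {outage_hours} hour outage leaves the business "
          f"{shortfall:,} out of pocket despite the penalties")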

Make sure you use one of the above, or a similar approach, to ensure your business can continue when there's an IT disaster, and make sure that whichever approach you choose can meet your security requirements.  And test it frequently to make sure it actually delivers to the design parameters and service requirements.  Remember that when to invoke, how to invoke and who is responsible for what during invocation are just as important as the IT elements.  Also cover what happens when the disaster is over and you need to return to your regular service provision - getting back to normal can be just as hard as, or harder than, invoking disaster recovery.

When comparing the costs of your internal IT with the cost of buying a few VMs from a cloud provider, remember that you are unlikely to be comparing apples with apples - make sure the capacities, performance and guarantees of uptime and recovery time are equal.  Only then will the cost of the insurance service provided by most internal IT teams become apparent.  It's surprising how quickly the cost of a cloud service escalates when you truly match the service levels to existing provision.  This is one of the greatest risks with shadow IT services - they are often not comparable to internal services and are often putting critical business processes at risk.

Or choose to take the risk, of course - that's always an option, as some business services are not critical to the business.  If you can afford for a service to be unavailable for a few days or weeks, then it's probably appropriate not to pay for contingency planning up front.

Wednesday 19 March 2014

What Are The Top 3 Benefits of Cloud?

Thanks to LinkedIn user Brian Murphy for raising the following question:

What are the top 3 benefits of cloud computing?

Here are my musings:
The top 3 benefits will vary by customer size, complexity, maturity, market vertical etc.  What do I mean by this?
Well, for an SMB, lower cost may be number 1, but they'll only get that if they buy from public cloud - they won't achieve lower cost if they build a private cloud from scratch until they reach a critical mass.

For a large organisation's dev and test teams, agility, low cost of entry, ability to tear the environment down when they've finished etc., will be a great benefit.  If they're doing that in a public cloud offering, that could well be acceptable for that particular use case.  But for a production environment in the same organisation's highly regulated market place (where regulators are notoriously difficult to pin down when they use the term "adequate"), a private on site solution (cloud or not) could be the most cost effective and / or lowest risk approach to meeting regulatory requirements.

For a large and complex organisation that needs agility, that might be their number 1 benefit - but it doesn't apply if the main business of the organisation is doing the same activity over and over again while they're busy refining their processes to eke out every last cent of efficiency.  However, that organisation might choose to sell their super-efficient process as a "cloud" offering to other companies...

And so it goes on.  I wish you luck with your video, but I do find that there are many attempts to simplify "the cloud" that business execs are exposed to (e.g. in airline magazines) which really don't cover the basics: going back to what the organisation needs to achieve and then mapping the right approach to those specific needs, be that public, private, hybrid or no cloud at all.  Missing this fundamental point and doing cloud because an article, video or other source says it's a "good thing" is a mistake.

Thursday 13 March 2014

Struggling With Large Scale VDI?

Through customer conversations and ad-hoc surveys at events, I find that most organisations that have embarked on VDI projects tend to deliver to somewhere between 10% and 20% of their user base.  This is usually where the business case is easily justified - typical examples would be offshore developers, or senior executives who would like to use their tablets for business.

Those who have not embarked on VDI are often deterred from doing so by the cost per user or the complexity implications.  Those that have delivered to more than 20% of their user base often find that they are seeing poor performance, or much higher costs per user than they were expecting - often needing to throw more and more storage at the environment to get somewhere near the performance their users demand.

Many of the costs are related to the costs of purchasing and operating the storage environment that underpins VDI sessions - either the cost of capacity or the cost of providing enough storage performance to support the required user experience.

As a result of this, I've been working on looking at a number of options for removing these performance and cost bottlenecks.  Using the Dell lab facilities, we've been stress testing a number of these options technically and from a business case perspective, in partnership with large customer organisations.

Our conclusions lead us to a different way of thinking about VDI and a solution that will allow organisations to scale out to tens of thousands of users and give those users high performing VDI sessions.  You can learn about it through our webcast, which will be recorded for future viewing, but if you want to hear about this first, in an interactive session where you can ask questions, please register for the event on 27 March 2014 at the link below.  This will also give you access to the extensive white paper documenting the test results in our labs:

REGISTER NOW

Monday 10 February 2014

Counting Cores

Once upon a time, an application ran on a CPU.

Over time, those CPUs got faster and more powerful.

Then, one day, those CPUs had more than one core per CPU.  The multi-core CPU was born.

Over more time, those cores have multiplied many times and 8, 10, 12 and 16 core CPUs are now common.  Whilst CPU clock speeds have settled in the 2.0GHz to 3.0GHz range over the past few years, more "bang per buck" has been delivered through adding more and more cores, and through improvements in memory bandwidth, caching and moving connectivity ever closer to or onto the CPU.

So the applications will have kept pace and will be making use of all of these extra cores of compute, right?  Well, yes and no.  The advent of virtualization and hypervisors has effectively allowed multiple applications to share a common CPU compute platform.  This is an excellent way to consolidate and get more value out of this very powerful infrastructure.  It also allows an opportunity for laziness amongst application developers, as virtualization can "hide" those applications that haven't been adapted to the multi-core world.  So those "single threaded" apps are still out there, although they are becoming less and less common.

Why do I raise this now?  Just at the end of last week I was in conversation with a customer who had a bit of a dilemma when migrating a VM from an older virtual server farm to a newer one - from a 3 year old cluster to a 3 month old cluster.  The VM ran slower on the new farm than it did on the old.  We had a good discussion about resource utilisation (everything unstressed), storage IOPs and performance (nothing of real note) and versions of firmware, VMtools etc.  But the point of raising this here is that the old cluster had servers with 4 core 3.05GHz CPUs, while the new one has 8 core 2.5GHz CPUs.  Multi-thread capable apps would probably run faster on the new cluster, which has more cores.  However, it turns out that this particular application is still single threaded.  When the hypervisor schedules it across multiple cores, all we get is a queuing effect as the single thread is interrupted and moved between cores - and once the work does get done, it's getting done on slower cores.  On the older 4 core CPU there's less scheduling going on (perhaps very little, depending on what else is happening on that host), so the application runs quicker.
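
To make the single-threaded point concrete, here is a small sketch: a fixed lump of CPU-bound work only benefits from extra cores when it can actually be split into independent chunks.  The worker count and work size are arbitrary, and this illustrates the principle rather than reproducing the customer's application:

    # A fixed amount of CPU-bound work only speeds up with more cores when it
    # can be split into independent chunks.  Work size is arbitrary.
    import time
    from concurrent.futures import ProcessPoolExecutor

    def burn(n):
        total = 0
        for i in range(n):
            total += i * i
        return total

    WORK = 20_000_000

    if __name__ == "__main__":
        start = time.time()
        burn(WORK)                      # single thread: clock speed is all that matters
        print(f"single thread: {time.time() - start:.2f}s")

        start = time.time()
        with ProcessPoolExecutor(max_workers=4) as pool:
            list(pool.map(burn, [WORK // 4] * 4))   # split 4 ways: extra cores now help
        print(f"4 workers:     {time.time() - start:.2f}s")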

So, do we have to go back to the drawing board and re-write the application?  Scary from a cost point of view, I'm sure.  That would be best, but probably isn't practical in many circumstances.  Instead, the hypervisor setting that ties a particular VM to a single CPU core ("core affinity") can be used to overcome this performance hit.  It's a good and effective fix for the short to medium term.
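
The hypervisor exposes this as a per-VM affinity setting; purely to illustrate the same idea, the sketch below pins a process to a single core at the operating system level (Linux only, since os.sched_setaffinity is a Linux call) - it is not the vSphere setting itself:

    # Illustration of core affinity at the OS level (Linux only): pin the
    # current process to a single core so its one thread stops being
    # migrated between cores by the scheduler.
    import os

    if hasattr(os, "sched_setaffinity"):
        print("allowed cores before:", os.sched_getaffinity(0))
        os.sched_setaffinity(0, {0})   # pid 0 = this process, core 0 only
        print("allowed cores after: ", os.sched_getaffinity(0))
    else:
        print("sched_setaffinity is not available on this platform")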

To fix this for the longer term, the next time the hood's up on that application for some other fix or development, see if you can persuade the developers to make it multi-threaded - it's about time!

Thursday 6 February 2014

When Should You Not Go To The Cloud?

This post is in response to a question posed on LinkedIn by the Chief Executive Officer at TriVu Media:


Disclosure:  I work for Dell - we sell data centre products and services to end user organisations and to cloud / service provider organisations.

There are many factors in the decision to use or not use cloud.  For the purposes of responding to your question, I'm going to assume you are questioning the use of public cloud vs internal provision.

Scale / Cost:
If you are a very large enterprise, there is every reason for you to be capable of delivering an internal cloud that rivals or betters an external cloud in terms of performance, cost and security.  So if you have the scale, and the skills, then the potential is there to do something better within your organisation.
We should also consider the scale of individual projects.  For dev, test and small scale production, external cloud can make sense - pay as you grow models at the lower end of the scale are attractive.  The "$ per GB" model (as an example) of quick-to-deploy commodity compute environments is initially very attractive, but because costs scale linearly with consumption there is a real chance that, at some point, the multiples of unit cost gross up to a total that exceeds the cost of providing the service in a more traditional way.  So it makes sense to fully understand what you expect a service to consume in 18 months or 3 years time when deciding between external cloud and a more traditional route.  You also need to consider that, over time, the cost / benefit model will change as the service grows or shrinks.  What originally made sense internally might become smaller over time, or less critical to the business, and should eventually be farmed out to a commodity cloud provider.  And vice-versa: what started as an experimental service to "see how it goes" could grow into something large scale, on which you are now spending more than it would cost to provide internally.  I typically don't see much reference to this kind of change over time in the cloud discussion, and ignoring it is a high risk strategy.
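
To illustrate that crossover point, here is a deliberately simple model; every figure in it is an assumption chosen only to show the shape of the comparison:

    # Deliberately simple cost-crossover model: linear pay-as-you-go vs a
    # fixed internal platform with a low incremental cost.  Every figure is
    # an illustrative assumption.
    cloud_cost_per_tb_month = 100      # linear $/TB/month
    internal_fixed_per_month = 8_000   # platform, people, data centre share
    internal_cost_per_tb_month = 20    # incremental $/TB/month

    for tb in (10, 50, 100, 200):
        cloud = tb * cloud_cost_per_tb_month
        internal = internal_fixed_per_month + tb * internal_cost_per_tb_month
        cheaper = "cloud" if cloud < internal else "internal"
        print(f"{tb:>4} TB/month: cloud ${cloud:,} vs internal ${internal:,} -> {cheaper}")
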
For those organisations with a large investment in existing systems, the business case for ANY change (whether it involves cloud or not) must take into account the value of the existing service provision.  For example, it's likely to be difficult to justify the business case for migrating a service to the cloud if you've just made a multi-million dollar investment in the existing platform and that needs to be depreciated over 4 or 5 years.  Similarly, if you have a very large data centre facility, then as you migrate services out to the cloud the unit cost per square metre of data centre space becomes significantly larger for each of the services that remain.  So if team A move their service out to the cloud to save 25% on their costs, but that increases the running costs of teams B and C who stay behind by 30%, then the holistic business case doesn't stack up.  Be careful to look at the big picture, not just the individual services.

Control and Risk:
For some, control and security of data is paramount to their business - particularly in heavily regulated environments.  Whilst many public cloud offerings now offer the potential for good security, for some organisations this just isn't enough, as any compromise is potentially ruinous to their business, including the risk of hosting customer data outside appropriate geographic boundaries.  The more a company needs an iron grip on who accesses data, where that data lives, what audit rights they have over it and when they need to guarantee physical separation, the more heavily all of that weighs against a public cloud service.  The economics of cookie-cutter services, shared infrastructure etc. in the public cloud don't stack up as soon as you start to add customization.
I would also suggest, for many companies, that if they put something into an external cloud service then it may be wise to use more than one provider, if that is practical.  This has the benefit of keeping the providers in competition with each other, and provides a measure of protection in the event that one provider ceases to exist, under-performs, over-charges or removes that service from their offerings.

Integration Complexity:
Some organisations are struggling to cope with the complexity of their internal systems and how they all interact with each other, often without full documentation.  This means they struggle to understand what service impacts will occur when they un-pick old systems or add in new functionality.  If you then add the need for external connectivity and industry standard APIs, it's not easy.  The cost of change may well exceed the benefits of hooking existing services to new external services.  Requirements such as a single customer view, which some regulations demand, encourage organisations to keep services internal to ensure that the data remains under control and accurate; moving large volumes of data up and down external connections doesn't always make much sense in these situations.

Management:
Many larger companies who have outsourced services (including management of on-premises services) such as networks, data / storage management and some commodity processing applications have found managing the interfaces to these providers to be complex and onerous.  Getting all these providers to work together for the benefit of the company buying the services can demand a great deal of management time and rigorous processes - and that's just when everything is going smoothly.  When there are service issues that cross boundaries, then who is at fault and who needs to rectify the issues can be complex to identify and resolve at best, and can end up in long and expensive legal disputes at worst.  Add the potential for multiple cloud service vendors into that mix and you can see how quickly the costs of managing such an environment (and by that I don't just mean the operational costs, but also the risk / cost to customer service and regulatory compliance) could outweigh the expected benefits.

Some of the above issues can be mitigated with a rigorous management approach (which is not happening in the grey IT economy) and/or the right tooling (such as Dell Boomi or Dell Multi-Cloud Manager).  This is why internal IT needs to become the conduit for IT services - so that control to meet the organisation's objectives can be maintained whilst ensuring that IT services are provided in the most effective way: internally, private cloud or public cloud.  The correctly balanced mix will be the best solution for many organisations.  One size fits all is not likely to lead to a good result.

VMWORLD Europe 2013 BLOG LINKS

Whilst attending VMWORLD Europe 2013, I made notes on the sessions and captured some early thoughts about the potential impact of what I saw and heard.  Here's the summary of links to each of the blog posts, to help you quickly get the information that interests you most:

Technical Content


VMWORLD General Content


Local and Travel Content

The First Musing


This blog will be about thoughts that occur to me as I go about my daily role. Currently I work for Dell in the UK, in a mixed infrastructure architecture / design role for our largest global customer organisations. Much of the time I'm working on specific virtualisation / cloud solutions for specific customers or helping them develop their strategic direction. This gives me a good insight into the common challenges I see across a number of customers in different markets, which allows the development of reference architectures that we can take as solutions to many of our customers.

Brief career history, prior to Dell (most recent first):

  • Head of x86 Server Infrastructure Architecture, Lloyds Banking Group: 5 year overall infrastructure strategy & Direction; 3 year investment planning and ownership of architecture assets for x86 server, Windows server, VMware and Linux OS; Team of 11 architects; design of VMware platform for bank merger – 1100 host servers. 
  • Head of x86 Infrastructure Architecture for HBOS plc – role as above. 
  • Head of Infrastructure for Bank of Scotland Corporate Banking 
  • Head of IT Audit for Bank of Scotland 
  • Several roles in Civil Service IT