Friday, October 2, 2020

Answer to "Why Are WAN File Copies Slow"

Hi All,

I handle mergers and acquisitions, and during that process we are responsible for transferring terabytes of data from the old company to the new company.

I wrote this "how to / guide" for the guys doing the remote office file copy to the data center.

Though this was written for huge data sets, the limitations are the same for everyone who can now get large WAN pipes for cheap money but doesn't understand why their file copies / data transfers are running slow. This post is designed so Network IT staff can point business users to it to understand, in "GENERAL", why file copies are slow over giant WAN circuits.

** This post also contains solutions to beat the limitations. Do not do this stuff on a shared network with lots of users who ALL depend on network performance. It could easily saturate the network and drop users from WAN file shares, applications, etc.

Anyway, here are some of my recommendations. You will need to test which solution(s) work best. I am sure you’re using many of these already; hopefully a few are new and can help:

First, a reference point: network speed is in Mbps (megabits per second) and file size is in MB (megabytes). There are 8 bits to a byte, so in theory sending a 100 MB file in about one second would take an 800 Mbps link, which is roughly a 1 gig network. Not that this is possible in practice.
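To make the bits versus bytes math concrete, here is a small sketch of the best-case calculation. It assumes a perfect link with no overhead and no latency, which never happens in real life, and the function name is just something I made up for illustration.

```python
def ideal_transfer_seconds(size_megabytes: float, link_mbps: float) -> float:
    """Best-case transfer time: no overhead, no latency, link fully used."""
    size_megabits = size_megabytes * 8  # 8 bits to a byte
    return size_megabits / link_mbps


# 100 MB file over an 800 Mbps link
print(ideal_transfer_seconds(100, 800))

# 1 TB (taken as 1,000,000 MB) over a 100 Mbps link, shown in hours (about 22.2)
print(ideal_transfer_seconds(1_000_000, 100) / 3600)
```

Real copies will always take longer than these numbers, for the reasons listed next.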

If I had a 1 gig network, why can't I actually send files instantly?

  1. File checking – Small files create a significant problem because Windows first checks that the file does not already exist and then copies it. On small files this check can take longer than actually copying the file.
  2. Network overhead – 100 MB of files is usually around 110 MB of data once wrapped in network headers, etc.
  3. Network latency – the further you are away from the destination, the longer the systems WAIT for acknowledgements between each packet group sent.

a. Why are the systems waiting? Send 10 data packets, WAIT for a reply that the other side received the data, then send 10 more, and repeat. When you send billions of packets, the WAITING is the killer in WAN transfers.

b. Example - While a system is in the WAIT state, no data is sent, so the network utilization / transfer speeds appear really low.

c. Not actual numbers, but examples to demonstrate my point:

i. Houston, TX to Dallas, TX: perceived network utilization averages 80 Mbps of a 100 Mbps line (5 ms latency)

ii. California to Dallas: file copy speed may look like 15 Mbps of a 100 Mbps line (50 ms latency)

iii. If I upgrade the WAN network from 100 Mbps to a 10 gig monster circuit, California to Dallas is still a 15 Mbps copy speed because you did NOT change the network latency. The wait time is the killer.

iv. You cannot improve network latency, as networks are already mostly built on fiber optics (you're limited by the speed of light on long-haul circuits).
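Here is a rough sketch of why the fatter pipe does not help. It models one copy stream that sends a fixed amount of data and then waits one round trip for the acknowledgement. The 64 KiB window is an assumption I picked for illustration, not a measured value, so the numbers will not match my made-up examples above exactly, but the pattern is the same.

```python
def window_limited_mbps(window_bytes: int, rtt_ms: float, link_mbps: float) -> float:
    """Throughput of one stream that sends a window of data, then waits one round trip."""
    rtt_seconds = rtt_ms / 1000
    window_rate_mbps = (window_bytes * 8) / rtt_seconds / 1_000_000
    return min(window_rate_mbps, link_mbps)  # can never exceed the circuit itself


WINDOW = 64 * 1024  # assumed 64 KiB in flight per stream (illustrative only)

for link in (100, 10_000):      # 100 Mbps line vs 10 gig circuit
    for rtt in (5, 50):         # short haul vs long haul latency in ms
        rate = window_limited_mbps(WINDOW, rtt, link)
        print(f"{link} Mbps link, {rtt} ms latency: {rate:.1f} Mbps per stream")
```

At 50 ms the single stream gets the same rate on the 100 Mbps line and on the 10 gig circuit, because the limit comes from the waiting, not from the size of the pipe.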

Solutions:

  1. Multi-threaded RoboCopy using x threads, where x = site network bandwidth (in Mbps) / 5. With high latency and fat network pipes, each thread should be able to grab a perceived 5 Mbps.

a. Threads are like opening up x different windows and running a copy from each window.

b. The improvement comes from using the WAIT window to burst a different set of data.

c. Example of how to calculate thread count – 150 Mbps WAN connection = 150 Mbps / 5 Mbps = 30 threads

d. Change thread count based on actual tests from the site (max 128)

i. Run a test with 100 MB of files with X threads

ii. Run a test with 100 MB of files with Y threads

iii. Adjust as needed – A crazy high thread count may be worse, as multiple reads hit different areas of a local hard drive and create their own internal wait time issue.

e. https://pureinfotech.com/robocopy-multithreaded-file-copy-windows-10/
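The thread count rule of thumb above can be sketched as a small helper that also builds the RoboCopy command line. The paths are hypothetical placeholders, and the /E, /R:2 and /W:5 switches (copy subfolders, 2 retries, 5 second wait between retries) are my own starting choices, not part of the rule. Only /MT is the multi-threading switch.

```python
from typing import List


def robocopy_thread_count(wan_mbps: float, per_thread_mbps: float = 5.0) -> int:
    """Rule of thumb: one thread per 5 Mbps of WAN bandwidth, kept between 1 and 128."""
    threads = int(wan_mbps // per_thread_mbps)
    return max(1, min(threads, 128))  # RoboCopy /MT accepts at most 128 threads


def robocopy_command(source: str, destination: str, wan_mbps: float) -> List[str]:
    """Build a RoboCopy command line as a list of arguments (it is not executed here)."""
    threads = robocopy_thread_count(wan_mbps)
    return ["robocopy", source, destination, "/E", f"/MT:{threads}", "/R:2", "/W:5"]


# Hypothetical source and destination paths, 150 Mbps WAN connection
print(" ".join(robocopy_command(r"D:\Shares", r"\\dc-server\Shares", 150)))
```

For the 150 Mbps example this produces /MT:30. Treat that as the first test value and adjust from there based on what the site actually delivers.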

  1. Most USB drives can’t push 100 Mbps, not even close. Copy the files to the local disk if you can before copying to the DC.

  2. Encrypted local drives also reduce performance.

a. A faster CPU improves decryption. Grab a new machine, not an old one, to do the file copy.

b. Test laptop to laptop in the office to determine laptop limitations

i. Use this to gauge how many laptops you need. (see below)
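A quick sketch of that gauge: once you know what one laptop can sustain in the office test, divide the WAN bandwidth by it and round up. The 60 Mbps figure below is a made-up test result, not a real measurement.

```python
import math


def laptops_needed(wan_mbps: float, per_laptop_mbps: float) -> int:
    """Machines needed to fill the WAN link, given one machine's measured sustained rate."""
    return max(1, math.ceil(wan_mbps / per_laptop_mbps))


# Hypothetical: each laptop sustained 60 Mbps in the office test, WAN link is 150 Mbps
print(laptops_needed(150, 60))  # 150 / 60 = 2.5, rounded up to 3
```

This is only an upper bound on how many machines are useful. The storage on the receiving end may run out of steam before the WAN link does.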

  1. Disk reads from a single disk are limited by drive spinning speed. SSDs are preferred but still have limits. Remember that it is sustained read speed, not burst speed, that matters when copying files. Sustained speed is the slower rating and is not usually advertised.

  2. Desktops are FASTER across the board

a. Laptops are designed to be energy (battery) efficient.

b. Desktops use hardware that is fast, not energy efficient.

c. RoboCopy is a program – it uses CPU, etc. You don’t want energy efficient.

d. Network card drivers are programs too – they use CPU

e. Usually you get twice as many CPU cores on a desktop, which helps offload other apps running at the same time as the copy

f. Disable any power saving features

  1. You should be able to obtain the same performance improvement as aggressive multi-threading, or even better, if you take the data set and spread it across multiple desktops or laptops, each running multi-threaded RoboCopy.

a. User home drives A to O on one laptop

b. User home drives P to Z on another

c. Department shares on another

d. etc.
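The split can be sketched as a small planning helper that assigns each home folder to a machine by its first letter. The machine names and folder names are hypothetical, and the letter ranges are chosen so they do not overlap, otherwise the same folders would get copied twice.

```python
from typing import Dict, List


def split_by_first_letter(folders: List[str], ranges: Dict[str, str]) -> Dict[str, List[str]]:
    """Assign each folder to the machine whose letter range covers its first letter."""
    plan: Dict[str, List[str]] = {machine: [] for machine in ranges}
    for folder in sorted(folders):
        first = folder[0].upper()
        for machine, letters in ranges.items():
            start, end = letters.split("-")
            if start <= first <= end:
                plan[machine].append(folder)
                break  # first matching range wins; unmatched folders are left out
    return plan


ranges = {"laptop-1": "A-O", "laptop-2": "P-Z"}       # hypothetical machines
homes = ["alice", "bob", "pat", "zoe", "oscar"]        # hypothetical home folders
print(split_by_first_letter(homes, ranges))
```

Splitting by letter does not balance the data size. If one range holds most of the terabytes, move the boundary or hand the big folders to their own machine.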

  1. The best solution would be server to server, where file data is striped across multiple drives, using high-speed disk controllers and killer CPUs / an OS designed for file transfers, running multi-threaded RoboCopy.

  2. There are registry hacks that allow for WAIT window tweaking, but my preference is multi-threads / multi-copies since, in theory, the Windows operating system will adjust the WAIT window on its own based on response times.

  3. Don’t forget there is also a limit to how fast data can be written to the Data Center storage solution. It isn’t just you writing data: thousands of people use it all day, and there is backup job contention at night. If you’re copying laptop to laptop over a WAN, remember disk reads are usually twice as fast as disk writes. The writing could become the weakest link.

a. Don’t grab 50 laptops from an integration and saturate a 1 gig WAN connection, which in turn sends 1 gig of data to one storage solution. That will impact everyone using the file share or the backup process.

b. The receiving storage solution also has a preferred receive thread count – more laptops and threads aren’t always better. Again, test with 100 MB of data and determine what works best.

There is a sweet spot in the middle somewhere; you just need to find it.
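One way to pick the sweet spot from your test runs, sketched below: take the smallest thread count that gets within a few percent of the best throughput you measured. The 5% tolerance and the test numbers are made up for illustration, so plug in your own results.

```python
from typing import Dict


def pick_thread_count(results: Dict[int, float], tolerance: float = 0.05) -> int:
    """Return the smallest thread count whose throughput is within tolerance of the best."""
    best = max(results.values())
    good_enough = [threads for threads, mbps in results.items()
                   if mbps >= best * (1 - tolerance)]
    return min(good_enough)


# Made-up test results: thread count -> measured Mbps for the same 100 MB test set
measured = {8: 38.0, 16: 71.0, 32: 118.0, 64: 121.0, 128: 97.0}
print(pick_thread_count(measured))  # 32 is within 5% of the best (121 at 64 threads)
```

Choosing the smaller count leaves more headroom for everyone else using the WAN and the storage.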

(Reminder - this is not written for network guys. It is summarized for business users and generalizations were used, so no need to hate on parts that are not perfect.)


