
Commit aaa7a1a

Merge pull request MicrosoftDocs#2189 from nitinme/nitinme-master
Nitinme master
2 parents 5f3ab3c + fac4827 commit aaa7a1a

10 files changed: 249 additions & 14 deletions

articles/data-factory/data-factory-azure-datalake-connector.md

Lines changed: 5 additions & 2 deletions
@@ -1,4 +1,4 @@
----
+---
 title: Move data to/from Azure Data Lake Store | Microsoft Docs
 description: Learn how to move data to/from Azure Data Lake Store using Azure Data Factory
 services: data-factory
@@ -13,7 +13,7 @@ ms.workload: data-services
 ms.tgt_pltfrm: na
 ms.devlang: na
 ms.topic: article
-ms.date: 09/27/2016
+ms.date: 12/01/2016
 ms.author: jingwang

 ---
@@ -564,4 +564,7 @@ Properties available in the typeProperties section of the activity on the other
 [!INCLUDE [data-factory-column-mapping](../../includes/data-factory-column-mapping.md)]

 ## Performance and Tuning
+
+Whether you plan an initial movement of a large volume of historical data or an incremental load of production data, Azure Data Factory has options to improve the performance of those tasks. The **concurrency** parameter is part of the **Copy Activity** and defines how many activity windows are processed in parallel. The **parallelCopies** parameter defines the parallelism within a single activity run. Consider both parameters when designing data movement pipelines with Azure Data Factory to achieve the best throughput.
+
 See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.
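
Editorial note on the tuning paragraph added in this hunk: a minimal sketch of where the two settings live in a Copy Activity definition, assuming the JSON activity format used elsewhere in this article. The activity name, dataset names, schedule, and numeric values below are illustrative placeholders, not recommendations from the commit.

    {
        "name": "CopyBlobToAdls",
        "type": "Copy",
        "inputs": [ { "name": "InputBlobDataset" } ],
        "outputs": [ { "name": "OutputAdlsDataset" } ],
        "typeProperties": {
            "source": { "type": "BlobSource" },
            "sink": { "type": "AzureDataLakeStoreSink" },
            "parallelCopies": 8
        },
        "policy": {
            "concurrency": 4
        },
        "scheduler": {
            "frequency": "Hour",
            "interval": 1
        }
    }

In this shape, concurrency governs how many activity windows run at the same time, while parallelCopies governs parallelism within each individual copy run.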

articles/data-lake-store/data-lake-store-copy-data-azure-storage-blob.md

Lines changed: 18 additions & 2 deletions
@@ -13,7 +13,7 @@ ms.devlang: na
 ms.topic: article
 ms.tgt_pltfrm: na
 ms.workload: big-data
-ms.date: 10/05/2016
+ms.date: 12/02/2016
 ms.author: nitinme

 ---
@@ -39,6 +39,7 @@ Before you begin this article, you must have the following:

 * **An Azure subscription**. See [Get Azure free trial](https://azure.microsoft.com/pricing/free-trial/).
 * **Azure Storage Blobs** container with some data.
+* **An Azure Data Lake Store account**. For instructions on how to create one, see [Get started with Azure Data Lake Store](data-lake-store-get-started-portal.md).
 * **Azure Data Lake Analytics account (optional)** - See [Get started with Azure Data Lake Analytics](../data-lake-analytics/data-lake-analytics-get-started-portal.md) for instructions on how to create a Data Lake Store account.
 * **AdlCopy tool**. Install the AdlCopy tool from [http://aka.ms/downloadadlcopy](http://aka.ms/downloadadlcopy).

@@ -86,6 +87,10 @@ The parameters in the syntax are described below:

 AdlCopy /Source https://mystorage.blob.core.windows.net/mycluster/example/data/gutenberg/ /dest adl://mydatalakestore.azuredatalakestore.net/mynewfolder/ /sourcekey uJUfvD6cEvhfLoBae2yyQf8t9/BpbWZ4XoYj4kAS5Jf40pZaMNf0q6a8yqTxktwVgRED4vPHeh/50iS9atS5LQ==

+### Performance considerations
+
+If you are copying from an Azure Blob Storage account, you may be throttled on the blob storage side during the copy. This degrades the performance of your copy job. To learn more about the limits of Azure Blob Storage, see Azure Storage limits at [Azure subscription and service limits](../azure-subscription-service-limits.md).
+
 ## Use AdlCopy (as standalone) to copy data from another Data Lake Store account
 You can also use AdlCopy to copy data between two Data Lake Store accounts.

@@ -114,6 +119,10 @@ You can also use AdlCopy to copy data between two Data Lake Store accounts.

 AdlCopy /Source adl://mydatastore.azuredatalakestore.net/mynewfolder/ /dest adl://mynewdatalakestore.azuredatalakestore.net/mynewfolder/

+### Performance considerations
+
+When you use AdlCopy as a standalone tool, the copy runs on shared, Azure-managed resources. The performance you get in this environment depends on system load and available resources. This mode is best used for small transfers on an ad hoc basis. No parameters need to be tuned when using AdlCopy as a standalone tool.
+
 ## Use AdlCopy (with Data Lake Analytics account) to copy data
 You can also use your Data Lake Analytics account to run the AdlCopy job to copy data from Azure storage blobs to Data Lake Store. You would typically use this option when the data to be moved is in the range of gigabytes and terabytes, and you want better and predictable performance throughput.

@@ -132,11 +141,14 @@ For example:

 AdlCopy /Source https://mystorage.blob.core.windows.net/mycluster/example/data/gutenberg/ /dest swebhdfs://mydatalakestore.azuredatalakestore.net/mynewfolder/ /sourcekey uJUfvD6cEvhfLoBae2yyQf8t9/BpbWZ4XoYj4kAS5Jf40pZaMNf0q6a8yqTxktwVgRED4vPHeh/50iS9atS5LQ== /Account mydatalakeanalyticaccount /Units 2

-
 Similarly, run the following command to copy from an Azure Storage blob to a Data Lake Store account using Data Lake Analytics account:

 AdlCopy /Source adl://mysourcedatalakestore.azuredatalakestore.net/mynewfolder/ /dest adl://mydestdatastore.azuredatalakestore.net/mynewfolder/ /Account mydatalakeanalyticaccount /Units 2

+### Performance considerations
+
+When copying data in the range of terabytes, using AdlCopy with your own Azure Data Lake Analytics account provides better and more predictable performance. The parameter to tune is the number of Azure Data Lake Analytics units to use for the copy job. Increasing the number of units increases the performance of your copy job. Each file to be copied can use a maximum of one unit, so specifying more units than the number of files being copied does not increase performance.
+
 ## Use AdlCopy to copy data using pattern matching
 In this section, you learn how to use AdlCopy to copy data from a source (in our example below we use Azure Storage Blob) to a destination Data Lake Store account using pattern matching. For example, you can use the steps below to copy all files with .csv extension from the source blob to the destination.

@@ -156,6 +168,10 @@ In this section, you learn how to use AdlCopy to copy data from a source (in our
 ## Considerations for using AdlCopy
 * AdlCopy (for version 1.0.5), supports copying data from sources that collectively have more than thousands of files and folders. However, if you encounter issues copying a large dataset, you can distribute the files/folders into different sub-folders and use the path to those sub-folders as the source instead.

+## Performance considerations for using AdlCopy
+
+AdlCopy supports copying data containing thousands of files and folders. However, if you encounter issues copying a large dataset, you can distribute the files/folders into smaller sub-folders. AdlCopy was built for ad hoc copies. If you are trying to copy data on a recurring basis, consider using [Azure Data Factory](../data-factory/data-factory-azure-datalake-connector.md) instead, which provides full management around the copy operations.
+
 ## Next steps
 * [Secure data in Data Lake Store](data-lake-store-secure-data.md)
 * [Use Azure Data Lake Analytics with Data Lake Store](../data-lake-analytics/data-lake-analytics-get-started-portal.md)
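
Editorial note on the /Units guidance added above: because each file consumes at most one unit, a job that copies roughly 50 files could reasonably request up to 50 units, and requesting more would not help. A hypothetical invocation, with placeholder container, store, key, and account names, might look like the following.

    AdlCopy /Source https://mystorage.blob.core.windows.net/mycontainer/logs/ /dest adl://mydatalakestore.azuredatalakestore.net/logs/ /sourcekey <storage_account_key> /Account mydatalakeanalyticaccount /Units 50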

articles/data-lake-store/data-lake-store-copy-data-wasb-distcp.md

Lines changed: 50 additions & 4 deletions
@@ -1,4 +1,4 @@
----
+---
 title: Copy data to and from WASB into Data Lake Store using Distcp| Microsoft Docs
 description: Use Distcp tool to copy data to and from Azure Storage Blobs to Data Lake Store
 services: data-lake-store
@@ -13,7 +13,7 @@ ms.devlang: na
 ms.topic: article
 ms.tgt_pltfrm: na
 ms.workload: big-data
-ms.date: 10/28/2016
+ms.date: 12/02/2016
 ms.author: nitinme

 ---
@@ -30,7 +30,7 @@ Once you have created an HDInsight cluster that has access to a Data Lake Store
 Before you begin this article, you must have the following:

 * **An Azure subscription**. See [Get Azure free trial](https://azure.microsoft.com/pricing/free-trial/).
-* **Enable your Azure subscription** for Data Lake Store public preview. See [instructions](data-lake-store-get-started-portal.md).
+* **An Azure Data Lake Store account**. For instructions on how to create one, see [Get started with Azure Data Lake Store](data-lake-store-get-started-portal.md).
 * **Azure HDInsight cluster** with access to a Data Lake Store account. See [Create an HDInsight cluster with Data Lake Store](data-lake-store-hdinsight-hadoop-use-portal.md). Make sure you enable Remote Desktop for the cluster.

 ## Do you learn fast with videos?
@@ -41,7 +41,7 @@ An HDInsight cluster comes with the Distcp utility, which can be used to copy da

 1. If you have a Windows cluster, remote into an HDInsight cluster that has access to a Data Lake Store account. For instructions, see [Connect to clusters using RDP](../hdinsight/hdinsight-administer-use-management-portal.md#connect-to-clusters-using-rdp). From the cluster Desktop, open the Hadoop command line.

-If you have a Linux cluster, use SSH to connect to the cluster. See [Connect to a Linux-based HDInsight cluster](../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md#connect-to-a-linux-based-hdinsight-cluster). Run the commands from the SSH prompt.
+If you have a Linux cluster, use SSH to connect to the cluster. See [Connect to a Linux-based HDInsight cluster](../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md#connect). Run the commands from the SSH prompt.
 2. Verify whether you can access the Azure Storage Blobs (WASB). Run the following command:

 hdfs dfs -ls wasb://<container_name>@<storage_account_name>.blob.core.windows.net/
@@ -63,6 +63,52 @@ An HDInsight cluster comes with the Distcp utility, which can be used to copy da

 This will copy the contents of **/myfolder** in the Data Lake Store account to **/example/data/gutenberg/** folder in WASB.

+## Performance considerations while using DistCp
+
+Because DistCp's lowest granularity is a single file, setting the maximum number of simultaneous copies is the most important parameter for optimizing it against Data Lake Store. You control this through the number of mappers ('m') parameter on the command line, which specifies the maximum number of mappers used to copy data. The default value is 20.
+
+**Example**
+
+hadoop distcp wasb://<container_name>@<storage_account_name>.blob.core.windows.net/example/data/gutenberg adl://<data_lake_store_account>.azuredatalakestore.net:443/myfolder -m 100
+
+### How do I determine the number of mappers to use?
+
+Here's some guidance that you can use.
+
+* **Step 1: Determine total YARN memory** - The first step is to determine the YARN memory available to the cluster where you run the DistCp job. This information is available in the Ambari portal associated with the cluster. Navigate to YARN and view the Configs tab to see the YARN memory. To get the total YARN memory, multiply the YARN memory per node by the number of nodes in your cluster.
+
+* **Step 2: Calculate the number of mappers** - The value of **m** is equal to the quotient of the total YARN memory divided by the YARN container size. The YARN container size is also shown in the Ambari portal; navigate to YARN and view the Configs tab. The equation to arrive at the number of mappers (**m**) is
+
+m = (number of nodes * YARN memory for each node) / YARN container size
+
+**Example**
+
+Let's assume that you have 4 D14v2 nodes in the cluster and you are trying to transfer 10 TB of data from 10 different folders. Each of the folders contains varying amounts of data, and the file sizes within each folder are different.
+
+* Total YARN memory - From the Ambari portal, you determine that the YARN memory is 96 GB for a D14 node. So, the total YARN memory for a four-node cluster is:
+
+YARN memory = 4 * 96GB = 384GB
+
+* Number of mappers - From the Ambari portal, you determine that the YARN container size is 3072 MB for a D14 cluster node. So, the number of mappers is:
+
+m = (4 nodes * 96GB) / 3072MB = 128 mappers
+
+If other applications are using memory, then you can choose to use only a portion of your cluster's YARN memory for DistCp.
+
+### Copying large datasets
+
+When the size of the dataset to be moved is very large (for example, more than 1 TB) or if you have many different folders, consider using multiple DistCp jobs. There is likely no performance gain, but it spreads out the work so that if any job fails, you only need to restart that specific job rather than the entire transfer.
+
+### Limitations
+
+* DistCp tries to create mappers that are similar in size to optimize performance. Increasing the number of mappers may not always increase performance.
+
+* DistCp is limited to one mapper per file, so you should not have more mappers than you have files. Because DistCp can assign only one mapper to a file, this also limits the amount of concurrency that can be used to copy large files.
+
+* If you have a small number of large files, split them into 256-MB file chunks to give yourself more potential concurrency.
+
+* If you are copying from an Azure Blob Storage account, your copy job may be throttled on the blob storage side. This degrades the performance of your copy job. To learn more about the limits of Azure Blob Storage, see Azure Storage limits at [Azure subscription and service limits](../azure-subscription-service-limits.md).
+
 ## See also
 * [Copy data from Azure Storage Blobs to Data Lake Store](data-lake-store-copy-data-azure-storage-blob.md)
 * [Secure data in Data Lake Store](data-lake-store-secure-data.md)
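
Editorial note tying the worked example above back to the command it tunes: with the computed value of 128 mappers, the DistCp invocation from the new section simply passes that number to -m. The container, storage account, and Data Lake Store names below are the same placeholders used in the article, and 128 is only the illustrative result of the sample calculation, not a general recommendation.

    hadoop distcp wasb://<container_name>@<storage_account_name>.blob.core.windows.net/example/data/gutenberg adl://<data_lake_store_account>.azuredatalakestore.net:443/myfolder -m 128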

articles/data-lake-store/data-lake-store-data-transfer-sql-sqoop.md

Lines changed: 7 additions & 3 deletions
@@ -13,7 +13,7 @@ ms.devlang: na
 ms.topic: article
 ms.tgt_pltfrm: na
 ms.workload: big-data
-ms.date: 10/28/2016
+ms.date: 12/02/2016
 ms.author: nitinme

 ---
@@ -29,7 +29,7 @@ Big data applications are a natural choice for processing unstructured and semi-
 Before you begin this article, you must have the following:

 * **An Azure subscription**. See [Get Azure free trial](https://azure.microsoft.com/pricing/free-trial/).
-* **Enable your Azure subscription** for Data Lake Store public preview. See [instructions](data-lake-store-get-started-portal.md).
+* **An Azure Data Lake Store account**. For instructions on how to create one, see [Get started with Azure Data Lake Store](data-lake-store-get-started-portal.md).
 * **Azure HDInsight cluster** with access to a Data Lake Store account. See [Create an HDInsight cluster with Data Lake Store](data-lake-store-hdinsight-hadoop-use-portal.md). This article assumes you have an HDInsight Linux cluster with Data Lake Store access.
 * **Azure SQL Database**. For instructions on how to create one, see [Create an Azure SQL database](../sql-database/sql-database-get-started.md)

@@ -72,7 +72,7 @@ Before you begin this article, you must have the following:
 ## Use Sqoop from an HDInsight cluster with access to Data Lake Store
 An HDInsight cluster already has the Sqoop packages available. If you have configured the HDInsight cluster to use Data Lake Store as an additional storage, you can use Sqoop (without any configuration changes) to import/export data between a relational database (in this example, Azure SQL Database) and a Data Lake Store account.

-1. For this tutorial, we assume you created a Linux cluster so you should use SSH to connect to the cluster. See [Connect to a Linux-based HDInsight cluster](../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md#connect-to-a-linux-based-hdinsight-cluster).
+1. For this tutorial, we assume you created a Linux cluster so you should use SSH to connect to the cluster. See [Connect to a Linux-based HDInsight cluster](../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md#connect).
 2. Verify whether you can access the Data Lake Store account from the cluster. Run the following command from the SSH prompt:

 hdfs dfs -ls adl://<data_lake_store_account>.azuredatalakestore.net/
@@ -131,6 +131,10 @@ An HDInsight cluster already has the Sqoop packages available. If you have confi
 3 Erna Myers
 4 Annette Simpson

+## Performance considerations while using Sqoop
+
+To tune the performance of your Sqoop job when copying data to Data Lake Store, see the [Sqoop performance document](https://blogs.msdn.microsoft.com/bigdatasupport/2015/02/17/sqoop-job-performance-tuning-in-hdinsight-hadoop/).
+
 ## See also
 * [Copy data from Azure Storage Blobs to Data Lake Store](data-lake-store-copy-data-azure-storage-blob.md)
 * [Secure data in Data Lake Store](data-lake-store-secure-data.md)
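
Editorial note on the Sqoop change: the performance link added here applies to the kind of job the article walks through, a Sqoop import that writes directly to an adl:// path. A hypothetical import command is sketched below; the server, database, credential, table, and account names are placeholders, and the standard Sqoop --num-mappers option is the knob most relevant to the tuning guidance linked in this commit.

    sqoop import --connect "jdbc:sqlserver://<sql_server_name>.database.windows.net:1433;database=<database_name>" --username <user>@<sql_server_name> --password <password> --table Table1 --target-dir adl://<data_lake_store_account>.azuredatalakestore.net/Sqoop/Table1 --num-mappers 1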
