
Commit aaa7a1a

Merge pull request MicrosoftDocs#2189 from nitinme/nitinme-master
Nitinme master
2 parents 5f3ab3c + fac4827 commit aaa7a1a

10 files changed: 249 additions & 14 deletions

articles/data-factory/data-factory-azure-datalake-connector.md

Lines changed: 5 additions & 2 deletions
@@ -1,4 +1,4 @@
----
+---
 title: Move data to/from Azure Data Lake Store | Microsoft Docs
 description: Learn how to move data to/from Azure Data Lake Store using Azure Data Factory
 services: data-factory
@@ -13,7 +13,7 @@ ms.workload: data-services
 ms.tgt_pltfrm: na
 ms.devlang: na
 ms.topic: article
-ms.date: 09/27/2016
+ms.date: 12/01/2016
 ms.author: jingwang

 ---
@@ -564,4 +564,7 @@ Properties available in the typeProperties section of the activity on the other
 [!INCLUDE [data-factory-column-mapping](../../includes/data-factory-column-mapping.md)]

 ## Performance and Tuning
+
+Whether you plan an initial movement of a large volume of historical data or an incremental load of production data, Azure Data Factory has options to improve the performance of those tasks. The **concurrency** parameter is part of the **Copy Activity** and defines how many activity windows are processed in parallel. The **parallelCopies** parameter defines the parallelism within a single activity run. Consider both parameters when designing data movement pipelines with Azure Data Factory to achieve the best throughput.
+
 See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.
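
Editorial note on the tuning paragraph added in this hunk: a minimal sketch of where the two settings live in a Copy Activity definition, assuming the JSON activity format used elsewhere in this article. The activity name, dataset names, schedule, and numeric values below are illustrative placeholders, not recommendations from the commit.

    {
        "name": "CopyBlobToAdls",
        "type": "Copy",
        "inputs": [ { "name": "InputBlobDataset" } ],
        "outputs": [ { "name": "OutputAdlsDataset" } ],
        "typeProperties": {
            "source": { "type": "BlobSource" },
            "sink": { "type": "AzureDataLakeStoreSink" },
            "parallelCopies": 8
        },
        "policy": {
            "concurrency": 4
        },
        "scheduler": {
            "frequency": "Hour",
            "interval": 1
        }
    }

In this shape, concurrency governs how many activity windows run at the same time, while parallelCopies governs parallelism within each individual copy run.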

articles/data-lake-store/data-lake-store-copy-data-azure-storage-blob.md

Lines changed: 18 additions & 2 deletions
@@ -13,7 +13,7 @@ ms.devlang: na
 ms.topic: article
 ms.tgt_pltfrm: na
 ms.workload: big-data
-ms.date: 10/05/2016
+ms.date: 12/02/2016
 ms.author: nitinme

 ---
@@ -39,6 +39,7 @@ Before you begin this article, you must have the following:

 * **An Azure subscription**. See [Get Azure free trial](https://azure.microsoft.com/pricing/free-trial/).
 * **Azure Storage Blobs** container with some data.
+* **An Azure Data Lake Store account**. For instructions on how to create one, see [Get started with Azure Data Lake Store](data-lake-store-get-started-portal.md).
 * **Azure Data Lake Analytics account (optional)** - See [Get started with Azure Data Lake Analytics](../data-lake-analytics/data-lake-analytics-get-started-portal.md) for instructions on how to create a Data Lake Store account.
 * **AdlCopy tool**. Install the AdlCopy tool from [http://aka.ms/downloadadlcopy](http://aka.ms/downloadadlcopy).

@@ -86,6 +87,10 @@ The parameters in the syntax are described below:

 AdlCopy /Source https://mystorage.blob.core.windows.net/mycluster/example/data/gutenberg/ /dest adl://mydatalakestore.azuredatalakestore.net/mynewfolder/ /sourcekey uJUfvD6cEvhfLoBae2yyQf8t9/BpbWZ4XoYj4kAS5Jf40pZaMNf0q6a8yqTxktwVgRED4vPHeh/50iS9atS5LQ==

+### Performance considerations
+
+If you are copying from an Azure Blob Storage account, you may be throttled on the blob storage side during the copy. This degrades the performance of your copy job. To learn more about the limits of Azure Blob Storage, see Azure Storage limits at [Azure subscription and service limits](../azure-subscription-service-limits.md).
+
 ## Use AdlCopy (as standalone) to copy data from another Data Lake Store account
 You can also use AdlCopy to copy data between two Data Lake Store accounts.

@@ -114,6 +119,10 @@ You can also use AdlCopy to copy data between two Data Lake Store accounts.

 AdlCopy /Source adl://mydatastore.azuredatalakestore.net/mynewfolder/ /dest adl://mynewdatalakestore.azuredatalakestore.net/mynewfolder/

+### Performance considerations
+
+When you use AdlCopy as a standalone tool, the copy runs on shared, Azure-managed resources. The performance you get in this environment depends on system load and available resources. This mode is best used for small transfers on an ad hoc basis. No parameters need to be tuned when using AdlCopy as a standalone tool.
+
 ## Use AdlCopy (with Data Lake Analytics account) to copy data
 You can also use your Data Lake Analytics account to run the AdlCopy job to copy data from Azure storage blobs to Data Lake Store. You would typically use this option when the data to be moved is in the range of gigabytes and terabytes, and you want better and predictable performance throughput.

@@ -132,11 +141,14 @@ For example:

 AdlCopy /Source https://mystorage.blob.core.windows.net/mycluster/example/data/gutenberg/ /dest swebhdfs://mydatalakestore.azuredatalakestore.net/mynewfolder/ /sourcekey uJUfvD6cEvhfLoBae2yyQf8t9/BpbWZ4XoYj4kAS5Jf40pZaMNf0q6a8yqTxktwVgRED4vPHeh/50iS9atS5LQ== /Account mydatalakeanalyticaccount /Units 2

-
 Similarly, run the following command to copy from an Azure Storage blob to a Data Lake Store account using Data Lake Analytics account:

 AdlCopy /Source adl://mysourcedatalakestore.azuredatalakestore.net/mynewfolder/ /dest adl://mydestdatastore.azuredatalakestore.net/mynewfolder/ /Account mydatalakeanalyticaccount /Units 2

+### Performance considerations
+
+When copying data in the range of terabytes, using AdlCopy with your own Azure Data Lake Analytics account provides better and more predictable performance. The parameter to tune is the number of Azure Data Lake Analytics units to use for the copy job. Increasing the number of units increases the performance of your copy job. Each file to be copied can use a maximum of one unit, so specifying more units than the number of files being copied does not increase performance.
+
 ## Use AdlCopy to copy data using pattern matching
 In this section, you learn how to use AdlCopy to copy data from a source (in our example below we use Azure Storage Blob) to a destination Data Lake Store account using pattern matching. For example, you can use the steps below to copy all files with .csv extension from the source blob to the destination.

@@ -156,6 +168,10 @@ In this section, you learn how to use AdlCopy to copy data from a source (in our
 ## Considerations for using AdlCopy
 * AdlCopy (for version 1.0.5), supports copying data from sources that collectively have more than thousands of files and folders. However, if you encounter issues copying a large dataset, you can distribute the files/folders into different sub-folders and use the path to those sub-folders as the source instead.

+## Performance considerations for using AdlCopy
+
+AdlCopy supports copying data containing thousands of files and folders. However, if you encounter issues copying a large dataset, you can distribute the files/folders into smaller sub-folders. AdlCopy was built for ad hoc copies. If you are trying to copy data on a recurring basis, consider using [Azure Data Factory](../data-factory/data-factory-azure-datalake-connector.md) instead, which provides full management around the copy operations.
+
 ## Next steps
 * [Secure data in Data Lake Store](data-lake-store-secure-data.md)
 * [Use Azure Data Lake Analytics with Data Lake Store](../data-lake-analytics/data-lake-analytics-get-started-portal.md)
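
Editorial note on the /Units guidance added above: because each file consumes at most one unit, a job that copies roughly 50 files could reasonably request up to 50 units, and requesting more would not help. A hypothetical invocation, with placeholder container, store, key, and account names, might look like the following.

    AdlCopy /Source https://mystorage.blob.core.windows.net/mycontainer/logs/ /dest adl://mydatalakestore.azuredatalakestore.net/logs/ /sourcekey <storage_account_key> /Account mydatalakeanalyticaccount /Units 50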

articles/data-lake-store/data-lake-store-copy-data-wasb-distcp.md

Lines changed: 50 additions & 4 deletions
@@ -1,4 +1,4 @@
----
+---
 title: Copy data to and from WASB into Data Lake Store using Distcp| Microsoft Docs
 description: Use Distcp tool to copy data to and from Azure Storage Blobs to Data Lake Store
 services: data-lake-store
@@ -13,7 +13,7 @@ ms.devlang: na
 ms.topic: article
 ms.tgt_pltfrm: na
 ms.workload: big-data
-ms.date: 10/28/2016
+ms.date: 12/02/2016
 ms.author: nitinme

 ---
@@ -30,7 +30,7 @@ Once you have created an HDInsight cluster that has access to a Data Lake Store
 Before you begin this article, you must have the following:

 * **An Azure subscription**. See [Get Azure free trial](https://azure.microsoft.com/pricing/free-trial/).
-* **Enable your Azure subscription** for Data Lake Store public preview. See [instructions](data-lake-store-get-started-portal.md).
+* **An Azure Data Lake Store account**. For instructions on how to create one, see [Get started with Azure Data Lake Store](data-lake-store-get-started-portal.md).
 * **Azure HDInsight cluster** with access to a Data Lake Store account. See [Create an HDInsight cluster with Data Lake Store](data-lake-store-hdinsight-hadoop-use-portal.md). Make sure you enable Remote Desktop for the cluster.

 ## Do you learn fast with videos?
@@ -41,7 +41,7 @@ An HDInsight cluster comes with the Distcp utility, which can be used to copy da

 1. If you have a Windows cluster, remote into an HDInsight cluster that has access to a Data Lake Store account. For instructions, see [Connect to clusters using RDP](../hdinsight/hdinsight-administer-use-management-portal.md#connect-to-clusters-using-rdp). From the cluster Desktop, open the Hadoop command line.

-If you have a Linux cluster, use SSH to connect to the cluster. See [Connect to a Linux-based HDInsight cluster](../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md#connect-to-a-linux-based-hdinsight-cluster). Run the commands from the SSH prompt.
+If you have a Linux cluster, use SSH to connect to the cluster. See [Connect to a Linux-based HDInsight cluster](../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md#connect). Run the commands from the SSH prompt.
 2. Verify whether you can access the Azure Storage Blobs (WASB). Run the following command:

 hdfs dfs -ls wasb://<container_name>@<storage_account_name>.blob.core.windows.net/
@@ -63,6 +63,52 @@ An HDInsight cluster comes with the Distcp utility, which can be used to copy da

 This will copy the contents of **/myfolder** in the Data Lake Store account to **/example/data/gutenberg/** folder in WASB.

+## Performance considerations while using DistCp
+
+Because DistCp's lowest granularity is a single file, setting the maximum number of simultaneous copies is the most important parameter for optimizing it against Data Lake Store. You control this through the number of mappers ('m') parameter on the command line, which specifies the maximum number of mappers used to copy data. The default value is 20.
+
+**Example**
+
+hadoop distcp wasb://<container_name>@<storage_account_name>.blob.core.windows.net/example/data/gutenberg adl://<data_lake_store_account>.azuredatalakestore.net:443/myfolder -m 100
+
+### How do I determine the number of mappers to use?
+
+Here's some guidance that you can use.
+
+* **Step 1: Determine total YARN memory** - The first step is to determine the YARN memory available to the cluster where you run the DistCp job. This information is available in the Ambari portal associated with the cluster. Navigate to YARN and view the Configs tab to see the YARN memory. To get the total YARN memory, multiply the YARN memory per node by the number of nodes in your cluster.
+
+* **Step 2: Calculate the number of mappers** - The value of **m** is equal to the quotient of the total YARN memory divided by the YARN container size. The YARN container size is also shown in the Ambari portal; navigate to YARN and view the Configs tab. The equation to arrive at the number of mappers (**m**) is
+
+m = (number of nodes * YARN memory for each node) / YARN container size
+
+**Example**
+
+Let's assume that you have 4 D14v2 nodes in the cluster and you are trying to transfer 10 TB of data from 10 different folders. Each of the folders contains varying amounts of data, and the file sizes within each folder are different.
+
+* Total YARN memory - From the Ambari portal, you determine that the YARN memory is 96 GB for a D14 node. So, the total YARN memory for a four-node cluster is:
+
+YARN memory = 4 * 96GB = 384GB
+
+* Number of mappers - From the Ambari portal, you determine that the YARN container size is 3072 MB for a D14 cluster node. So, the number of mappers is:
+
+m = (4 nodes * 96GB) / 3072MB = 128 mappers
+
+If other applications are using memory, then you can choose to use only a portion of your cluster's YARN memory for DistCp.
+
+### Copying large datasets
+
+When the size of the dataset to be moved is very large (for example, more than 1 TB) or if you have many different folders, consider using multiple DistCp jobs. There is likely no performance gain, but it spreads out the work so that if any job fails, you only need to restart that specific job rather than the entire transfer.
+
+### Limitations
+
+* DistCp tries to create mappers that are similar in size to optimize performance. Increasing the number of mappers may not always increase performance.
+
+* DistCp is limited to one mapper per file, so you should not have more mappers than you have files. Because DistCp can assign only one mapper to a file, this also limits the amount of concurrency that can be used to copy large files.
+
+* If you have a small number of large files, split them into 256-MB file chunks to give yourself more potential concurrency.
+
+* If you are copying from an Azure Blob Storage account, your copy job may be throttled on the blob storage side. This degrades the performance of your copy job. To learn more about the limits of Azure Blob Storage, see Azure Storage limits at [Azure subscription and service limits](../azure-subscription-service-limits.md).
+
 ## See also
 * [Copy data from Azure Storage Blobs to Data Lake Store](data-lake-store-copy-data-azure-storage-blob.md)
 * [Secure data in Data Lake Store](data-lake-store-secure-data.md)
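
Editorial note tying the worked example above back to the command it tunes: with the computed value of 128 mappers, the DistCp invocation from the new section simply passes that number to -m. The container, storage account, and Data Lake Store names below are the same placeholders used in the article, and 128 is only the illustrative result of the sample calculation, not a general recommendation.

    hadoop distcp wasb://<container_name>@<storage_account_name>.blob.core.windows.net/example/data/gutenberg adl://<data_lake_store_account>.azuredatalakestore.net:443/myfolder -m 128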

articles/data-lake-store/data-lake-store-data-transfer-sql-sqoop.md

Lines changed: 7 additions & 3 deletions
@@ -13,7 +13,7 @@ ms.devlang: na
 ms.topic: article
 ms.tgt_pltfrm: na
 ms.workload: big-data
-ms.date: 10/28/2016
+ms.date: 12/02/2016
 ms.author: nitinme

 ---
@@ -29,7 +29,7 @@ Big data applications are a natural choice for processing unstructured and semi-
 Before you begin this article, you must have the following:

 * **An Azure subscription**. See [Get Azure free trial](https://azure.microsoft.com/pricing/free-trial/).
-* **Enable your Azure subscription** for Data Lake Store public preview. See [instructions](data-lake-store-get-started-portal.md).
+* **An Azure Data Lake Store account**. For instructions on how to create one, see [Get started with Azure Data Lake Store](data-lake-store-get-started-portal.md).
 * **Azure HDInsight cluster** with access to a Data Lake Store account. See [Create an HDInsight cluster with Data Lake Store](data-lake-store-hdinsight-hadoop-use-portal.md). This article assumes you have an HDInsight Linux cluster with Data Lake Store access.
 * **Azure SQL Database**. For instructions on how to create one, see [Create an Azure SQL database](../sql-database/sql-database-get-started.md)

@@ -72,7 +72,7 @@ Before you begin this article, you must have the following:
 ## Use Sqoop from an HDInsight cluster with access to Data Lake Store
 An HDInsight cluster already has the Sqoop packages available. If you have configured the HDInsight cluster to use Data Lake Store as an additional storage, you can use Sqoop (without any configuration changes) to import/export data between a relational database (in this example, Azure SQL Database) and a Data Lake Store account.

-1. For this tutorial, we assume you created a Linux cluster so you should use SSH to connect to the cluster. See [Connect to a Linux-based HDInsight cluster](../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md#connect-to-a-linux-based-hdinsight-cluster).
+1. For this tutorial, we assume you created a Linux cluster so you should use SSH to connect to the cluster. See [Connect to a Linux-based HDInsight cluster](../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md#connect).
 2. Verify whether you can access the Data Lake Store account from the cluster. Run the following command from the SSH prompt:

 hdfs dfs -ls adl://<data_lake_store_account>.azuredatalakestore.net/
@@ -131,6 +131,10 @@ An HDInsight cluster already has the Sqoop packages available. If you have confi
 3 Erna Myers
 4 Annette Simpson

+## Performance considerations while using Sqoop
+
+To tune the performance of your Sqoop job when copying data to Data Lake Store, see the [Sqoop performance document](https://blogs.msdn.microsoft.com/bigdatasupport/2015/02/17/sqoop-job-performance-tuning-in-hdinsight-hadoop/).
+
 ## See also
 * [Copy data from Azure Storage Blobs to Data Lake Store](data-lake-store-copy-data-azure-storage-blob.md)
 * [Secure data in Data Lake Store](data-lake-store-secure-data.md)
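
Editorial note on the Sqoop change: the performance link added here applies to the kind of job the article walks through, a Sqoop import that writes directly to an adl:// path. A hypothetical import command is sketched below; the server, database, credential, table, and account names are placeholders, and the standard Sqoop --num-mappers option is the knob most relevant to the tuning guidance linked in this commit.

    sqoop import --connect "jdbc:sqlserver://<sql_server_name>.database.windows.net:1433;database=<database_name>" --username <user>@<sql_server_name> --password <password> --table Table1 --target-dir adl://<data_lake_store_account>.azuredatalakestore.net/Sqoop/Table1 --num-mappers 1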
