Depending on whether the initial data movement involves a large volume of historical data or an incremental production data load, Azure Data Factory offers options to improve the performance of those tasks. The concurrency parameter is part of the **Copy Activity** and defines how many activity windows are processed in parallel. The **parallelCopies** parameter defines the parallelism within a single activity run. Consider both parameters when designing data movement pipelines with Azure Data Factory to achieve the best throughput.
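For illustration, here is a minimal sketch of where the two settings live in a (version 1) Azure Data Factory pipeline definition; the pipeline and dataset names, schedule, and values are placeholders rather than recommendations:

```json
{
  "name": "CopyBlobToAdlsPipeline",
  "properties": {
    "activities": [
      {
        "name": "BlobToAdlsCopy",
        "type": "Copy",
        "inputs": [ { "name": "BlobInputDataset" } ],
        "outputs": [ { "name": "AdlsOutputDataset" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "AzureDataLakeStoreSink" },
          "parallelCopies": 8
        },
        "policy": {
          "concurrency": 4,
          "timeout": "02:00:00"
        },
        "scheduler": { "frequency": "Hour", "interval": 1 }
      }
    ],
    "start": "2016-12-01T00:00:00Z",
    "end": "2016-12-08T00:00:00Z"
  }
}
```

With these values, up to four hourly activity windows run at the same time, and each run copies with up to eight parallel copy threads; see the performance guide linked below for guidance on choosing the numbers.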
See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.
articles/data-lake-store/data-lake-store-copy-data-azure-storage-blob.md (18 additions, 2 deletions)
@@ -13,7 +13,7 @@ ms.devlang: na
ms.topic: article
ms.tgt_pltfrm: na
ms.workload: big-data
-ms.date: 10/05/2016
+ms.date: 12/02/2016
ms.author: nitinme

---
@@ -39,6 +39,7 @@ Before you begin this article, you must have the following:
* **An Azure subscription**. See [Get Azure free trial](https://azure.microsoft.com/pricing/free-trial/).
* **Azure Storage Blobs** container with some data.
* **An Azure Data Lake Store account**. For instructions on how to create one, see [Get started with Azure Data Lake Store](data-lake-store-get-started-portal.md).
* **Azure Data Lake Analytics account (optional)** - See [Get started with Azure Data Lake Analytics](../data-lake-analytics/data-lake-analytics-get-started-portal.md) for instructions on how to create a Data Lake Analytics account.
* **AdlCopy tool**. Install the AdlCopy tool from [http://aka.ms/downloadadlcopy](http://aka.ms/downloadadlcopy).
@@ -86,6 +87,10 @@ The parameters in the syntax are described below:
If you are copying from an Azure Blob Storage account, you may be throttled during copy on the blob storage side. This will degrade the performance of your copy job. To learn more about the limits of Azure Blob Storage, see Azure Storage limits at [Azure subscription and service limits](../azure-subscription-service-limits.md).
## Use AdlCopy (as standalone) to copy data from another Data Lake Store account
You can also use AdlCopy to copy data between two Data Lake Store accounts.
@@ -114,6 +119,10 @@ You can also use AdlCopy to copy data between two Data Lake Store accounts.
When using AdlCopy as a standalone tool, the copy is run on shared, Azure managed resources. The performance you may get in this environment depends on system load and available resources. This mode is best used for small transfers on an ad hoc basis. No parameters need to be tuned when using AdlCopy as a standalone tool.
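For reference, a standalone Store-to-Store copy is a single command with only a source and a destination; the account names, folder paths, and the exact URI scheme shown here are placeholders, so check the AdlCopy syntax for your version of the tool:

```
AdlCopy /Source swebhdfs://sourcedatalakestore.azuredatalakestore.net/sourcefolder/ /Dest swebhdfs://destdatalakestore.azuredatalakestore.net/destfolder/
```

No /Account or /Units parameters are used in standalone mode.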
## Use AdlCopy (with Data Lake Analytics account) to copy data
You can also use your Data Lake Analytics account to run the AdlCopy job to copy data from Azure storage blobs to Data Lake Store. You would typically use this option when the data to be moved is in the range of gigabytes and terabytes, and you want better and predictable performance throughput.
When copying data in the range of terabytes, using AdlCopy with your own Azure Data Lake Analytics account provides better and more predictable performance. The parameter to tune is the number of Azure Data Lake Analytics units to use for the copy job. Increasing the number of units increases the performance of your copy job. Each file to be copied can use at most one unit, so specifying more units than the number of files being copied does not increase performance.
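As a hedged sketch (the storage account, container, key, Data Lake Store account, and Data Lake Analytics account names are placeholders), the Data Lake Analytics account and the number of units are supplied through the /Account and /Units parameters:

```
AdlCopy /Source https://mystorageaccount.blob.core.windows.net/mycontainer/ /Dest swebhdfs://mydatalakestore.azuredatalakestore.net/myfolder/ /SourceKey <storage-account-key> /Account myadlaaccount /Units 128
```

Requesting 128 units only helps if at least 128 files are being copied, because each file uses at most one unit.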
## Use AdlCopy to copy data using pattern matching
In this section, you learn how to use AdlCopy to copy data from a source (in the example below, an Azure Storage blob) to a destination Data Lake Store account using pattern matching. For example, you can use the steps below to copy all files with a .csv extension from the source blob to the destination.
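A hedged sketch of such a command follows; the accounts, container, key, and paths are placeholders, and the pattern is simply appended to a normal Blob-to-Store copy with the /Pattern parameter:

```
AdlCopy /Source https://mystorageaccount.blob.core.windows.net/mycontainer/ /Dest swebhdfs://mydatalakestore.azuredatalakestore.net/myfolder/ /SourceKey <storage-account-key> /Pattern *.csv
```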
@@ -156,6 +168,10 @@ In this section, you learn how to use AdlCopy to copy data from a source (in our
## Considerations for using AdlCopy
* AdlCopy (version 1.0.5) supports copying data from sources that collectively contain thousands of files and folders. However, if you encounter issues copying a large dataset, you can distribute the files/folders into different sub-folders and use the path to those sub-folders as the source instead.
## Performance considerations for using AdlCopy
AdlCopy supports copying data containing thousands of files and folders. However, if you encounter issues copying a large dataset, you can distribute the files/folders into smaller sub-folders. AdlCopy was built for ad hoc copies. If you need to copy data on a recurring basis, consider using [Azure Data Factory](../data-factory/data-factory-azure-datalake-connector.md), which provides full management around the copy operations.
## Next steps
* [Secure data in Data Lake Store](data-lake-store-secure-data.md)
* [Use Azure Data Lake Analytics with Data Lake Store](../data-lake-analytics/data-lake-analytics-get-started-portal.md)
articles/data-lake-store/data-lake-store-copy-data-wasb-distcp.md (50 additions, 4 deletions)
@@ -1,4 +1,4 @@
---
title: Copy data to and from WASB into Data Lake Store using Distcp | Microsoft Docs
description: Use the Distcp tool to copy data between Azure Storage Blobs and Data Lake Store
services: data-lake-store
@@ -13,7 +13,7 @@ ms.devlang: na
ms.topic: article
ms.tgt_pltfrm: na
ms.workload: big-data
-ms.date: 10/28/2016
+ms.date: 12/02/2016
ms.author: nitinme

---
@@ -30,7 +30,7 @@ Once you have created an HDInsight cluster that has access to a Data Lake Store
Before you begin this article, you must have the following:
* **An Azure subscription**. See [Get Azure free trial](https://azure.microsoft.com/pricing/free-trial/).
-* **Enable your Azure subscription** for Data Lake Store public preview. See [instructions](data-lake-store-get-started-portal.md).
+* **An Azure Data Lake Store account**. For instructions on how to create one, see [Get started with Azure Data Lake Store](data-lake-store-get-started-portal.md).
* **Azure HDInsight cluster** with access to a Data Lake Store account. See [Create an HDInsight cluster with Data Lake Store](data-lake-store-hdinsight-hadoop-use-portal.md). Make sure you enable Remote Desktop for the cluster.
## Do you learn fast with videos?
@@ -41,7 +41,7 @@ An HDInsight cluster comes with the Distcp utility, which can be used to copy da
1. If you have a Windows cluster, remote into an HDInsight cluster that has access to a Data Lake Store account. For instructions, see [Connect to clusters using RDP](../hdinsight/hdinsight-administer-use-management-portal.md#connect-to-clusters-using-rdp). From the cluster Desktop, open the Hadoop command line.
-   If you have a Linux cluster, use SSH to connect to the cluster. See [Connect to a Linux-based HDInsight cluster](../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md#connect-to-a-linux-based-hdinsight-cluster). Run the commands from the SSH prompt.
+   If you have a Linux cluster, use SSH to connect to the cluster. See [Connect to a Linux-based HDInsight cluster](../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md#connect). Run the commands from the SSH prompt.
2. Verify whether you can access the Azure Storage Blobs (WASB). Run the following command:
@@ -63,6 +63,52 @@ An HDInsight cluster comes with the Distcp utility, which can be used to copy da
This copies the contents of **/myfolder** in the Data Lake Store account to the **/example/data/gutenberg/** folder in WASB.
## Performance considerations while using DistCp
Because DistCp's lowest granularity is a single file, setting the maximum number of simultaneous copies is the most important parameter for optimizing it against Data Lake Store. The number of simultaneous copies is controlled by the number-of-mappers (**m**) parameter on the command line, which specifies the maximum number of mappers used to copy data. The default value is 20.
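For example (the storage account, container, Data Lake Store account, and paths below are placeholders), the mapper count is passed with the -m option on the DistCp command line:

```
hadoop distcp -m 128 wasb://mycontainer@mystorageaccount.blob.core.windows.net/example/data/ adl://mydatalakestore.azuredatalakestore.net/myfolder/
```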
### How do I determine the number of mappers to use?
Here's some guidance that you can use.
* **Step 1: Determine total YARN memory** - The first step is to determine the YARN memory available to the cluster where you run the DistCp job. This information is available in the Ambari portal associated with the cluster. Navigate to YARN and view the Configs tab to see the YARN memory. To get the total YARN memory, multiply the YARN memory per node by the number of nodes in your cluster.
* **Step 2: Calculate the number of mappers** - The value of **m** is the total YARN memory divided by the YARN container size. The YARN container size is also available in the Ambari portal: navigate to YARN and view the Configs tab. The equation to arrive at the number of mappers (**m**) is
    m = (number of nodes * YARN memory for each node) / YARN container size
**Example**
Let's assume that you have four D14 v2 nodes in the cluster and that you are trying to transfer 10 TB of data from 10 different folders. Each folder contains varying amounts of data, and the file sizes within each folder differ.
* Total YARN memory - From the Ambari portal, you determine that the YARN memory is 96 GB for a D14 node. So, the total YARN memory for a four-node cluster is:
    YARN memory = 4 * 96 GB = 384 GB
* Number of mappers - From the Ambari portal, you determine that the YARN container size is 3072 MB for a D14 cluster node. So, the number of mappers is:
    m = (4 nodes * 96 GB) / 3072 MB = 128 mappers
If other applications are using memory, then you can choose to only use a portion of your cluster’s YARN memory for DistCp.
### Copying large datasets
When the size of the dataset to be moved is very large (for example, >1 TB) or if you have many different folders, consider using multiple DistCp jobs. There is likely no performance gain, but spreading out the work means that if a job fails, you only need to restart that specific job rather than the entire transfer.
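A sketch of this approach, with one job per top-level folder (all names and URIs are placeholders):

```
hadoop distcp -m 64 wasb://mycontainer@mystorageaccount.blob.core.windows.net/data/folder1/ adl://mydatalakestore.azuredatalakestore.net/data/folder1/
hadoop distcp -m 64 wasb://mycontainer@mystorageaccount.blob.core.windows.net/data/folder2/ adl://mydatalakestore.azuredatalakestore.net/data/folder2/
```

Each command is an independent job, so a failed folder can be retried on its own.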
### Limitations
* DistCp tries to give each mapper a similar amount of data to copy in order to optimize performance. Increasing the number of mappers does not always increase performance.
* DistCp assigns at most one mapper to a file, so you should not use more mappers than you have files. Because a large file is always copied by a single mapper, this limits the concurrency available for copying large files.
* If you have a small number of large files, then you should split them into 256 MB file chunks to give you more potential concurrency.
* If you are copying from an Azure Blob Storage account, your copy job may be throttled on the blob storage side. This will degrade the performance of your copy job. To learn more about the limits of Azure Blob Storage, see Azure Storage limits at [Azure subscription and service limits](../azure-subscription-service-limits.md).
## See also
* [Copy data from Azure Storage Blobs to Data Lake Store](data-lake-store-copy-data-azure-storage-blob.md)
* [Secure data in Data Lake Store](data-lake-store-secure-data.md)
articles/data-lake-store/data-lake-store-data-transfer-sql-sqoop.md (7 additions, 3 deletions)
@@ -13,7 +13,7 @@ ms.devlang: na
ms.topic: article
ms.tgt_pltfrm: na
ms.workload: big-data
-ms.date: 10/28/2016
+ms.date: 12/02/2016
ms.author: nitinme

---
@@ -29,7 +29,7 @@ Big data applications are a natural choice for processing unstructured and semi-
Before you begin this article, you must have the following:
* **An Azure subscription**. See [Get Azure free trial](https://azure.microsoft.com/pricing/free-trial/).
-* **Enable your Azure subscription** for Data Lake Store public preview. See [instructions](data-lake-store-get-started-portal.md).
+* **An Azure Data Lake Store account**. For instructions on how to create one, see [Get started with Azure Data Lake Store](data-lake-store-get-started-portal.md).
* **Azure HDInsight cluster** with access to a Data Lake Store account. See [Create an HDInsight cluster with Data Lake Store](data-lake-store-hdinsight-hadoop-use-portal.md). This article assumes you have an HDInsight Linux cluster with Data Lake Store access.
* **Azure SQL Database**. For instructions on how to create one, see [Create an Azure SQL database](../sql-database/sql-database-get-started.md).
@@ -72,7 +72,7 @@ Before you begin this article, you must have the following:
## Use Sqoop from an HDInsight cluster with access to Data Lake Store
An HDInsight cluster already has the Sqoop packages available. If you have configured the HDInsight cluster to use Data Lake Store as additional storage, you can use Sqoop (without any configuration changes) to import and export data between a relational database (in this example, Azure SQL Database) and a Data Lake Store account.
-1. For this tutorial, we assume you created a Linux cluster so you should use SSH to connect to the cluster. See [Connect to a Linux-based HDInsight cluster](../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md#connect-to-a-linux-based-hdinsight-cluster).
+1. For this tutorial, we assume you created a Linux cluster so you should use SSH to connect to the cluster. See [Connect to a Linux-based HDInsight cluster](../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md#connect).
2. Verify whether you can access the Data Lake Store account from the cluster. Run the following command from the SSH prompt:
@@ -131,6 +131,10 @@ An HDInsight cluster already has the Sqoop packages available. If you have confi
3 Erna Myers
4 Annette Simpson
## Performance considerations while using Sqoop
For guidance on tuning the performance of your Sqoop job when copying data to Data Lake Store, see the [Sqoop performance document](https://blogs.msdn.microsoft.com/bigdatasupport/2015/02/17/sqoop-job-performance-tuning-in-hdinsight-hadoop/).
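The number of map tasks is the main parallelism knob in Sqoop itself. As a hedged sketch (the server, database, credentials, table, and target path are placeholders), it is set with the -m (or --num-mappers) option on the Sqoop command line:

```
sqoop import --connect "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb;user=myuser@myserver;password=<password>" --table Table1 --target-dir adl://mydatalakestore.azuredatalakestore.net/sqoop/Table1 -m 4
```

When -m is greater than 1, the table needs a primary key or an explicit --split-by column so that Sqoop can partition the work across mappers.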
## See also
* [Copy data from Azure Storage Blobs to Data Lake Store](data-lake-store-copy-data-azure-storage-blob.md)
* [Secure data in Data Lake Store](data-lake-store-secure-data.md)