How to Ace the AWS Data Analytics Specialty Examination?

Advanced Data Analytics and Optimal Handling in the AWS Cloud

Author Note - The Data Analytics Specialty certification has been retired and replaced by the AWS Data Engineering certification. The Data Analytics test course materials are being kept because they can still be helpful for in-depth data analysis in AWS. The author also offers a practice test course for the Data Engineering certification.

Goal of this Course

This course is focused on the Data Analytics specialty, which aims at developing your expertise in ingesting, storing, processing, and securing data. Every aspect of a data pipeline and the data lifecycle is examined. This will prepare you to take on the Data Analytics certification exam with confidence.

More specifically, this course enables you to obtain a complete understanding of the full data life cycle in a cloud environment and ensures the "five Vs" of data analysis and their context are well understood. They are the following:

  • Ingesting Data - Mechanisms for collecting the data, with attention to details such as the variety and velocity of the data.

  • Storing the Data - Primarily driven by the volume of data, but also subtly influenced by velocity, variety, and value (the last especially when securing the data).

  • Processing the Data - The key aspect and the "crown jewel" of data analytics: this ensures the value of data is properly "enshrined" so that data transforms into information and then delivers value and insights (which brings in the "sixth V", visualization). As part of its data flow analysis, this layer is also tasked with verifying the integrity and accuracy of the data (veracity).

  • Obtaining Insights from the Data - This serves as a prerequisite as well as a post narrative to the data that was processed. The key here is understanding how to summarize a large volume of data succinctly in a picture to derive insights from it.

  • Security - Data security is a dimension that pervades all of the above aspects and involves access controls as well as encryption (at rest and in transit). In addition, multi-factor authentication (MFA) and temporary, automatically expiring credentials are key principles for ensuring that only those who need to access the data can indeed access and use it.

The AWS offerings that come into play in managing the data life cycle across the dimensions described above are examined in detail to test your knowledge in the practice tests. The certification exam has 65 questions, but you are graded on only 50.

Sample Question and Answer with Explanation

As you will see in the sample explanation, it not only gives the details of the question and its answer, but also explains why the other choices are unfit for the given use case. Further, plenty of Exam Tips are underscored to help you face alternative exam scenarios or possible exam questions. Plenty of relevant inline and explicit references are also cited to help further your data analytics knowledge quest.

Question: You are contemplating changing workload management (WLM) in Redshift because you have a diverse set of users and different query characteristics. You have a large number of short queries, which can sometimes take long depending on the data cardinality in the tables. The previously configured action was to abort short queries that exceed a particular time, but cost and effort are expended in re-running them. You want to remediate this by finding a way for the query to continue running without aborting, while not affecting other short queries. What is the efficient way to set this up in Redshift?

  • Enable Manual WLM in Redshift and specify Queue Management Rules action to Hop.

  • Enable Automatic WLM in Redshift and specify Queue Management Rules action to Hop.

  • Enable Short Query Acceleration (SQA) to ensure short queries do not exceed their set duration.

  • Run the VACUUM command regularly at off hours to ensure the database is in sorted order, which will enhance query efficiency.





Explanation

Note - Illustrative figures are shown in the test explanation but are not reproduced here per Udemy guidelines.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers.

Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Redshift Spectrum queries employ massive parallelism to run very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in Amazon S3. Multiple clusters can concurrently query the same dataset in Amazon S3 without the need to make copies of the data for each cluster. An illustration in the test explanation shows a Redshift cluster and its components (Spectrum not shown).
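
As a rough illustration of that workflow, here is a minimal sketch using the Redshift Data API (boto3) to register an external schema and table over a Parquet dataset in S3 and then query it in place. The cluster, database, user, IAM role ARN, and S3 path are hypothetical placeholders, not values from the course.

```python
import boto3

# Minimal sketch: query Parquet files in S3 through Redshift Spectrum without
# loading them into the cluster. All identifiers below are hypothetical.
rsd = boto3.client("redshift-data")

statements = [
    # External schema backed by the Glue Data Catalog
    """CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_demo
       FROM DATA CATALOG DATABASE 'spectrum_db'
       IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
       CREATE EXTERNAL DATABASE IF NOT EXISTS;""",
    # External table pointing at the S3 files (the data stays in S3)
    """CREATE EXTERNAL TABLE spectrum_demo.sales (
           sale_id BIGINT,
           amount  DOUBLE PRECISION,
           sale_ts TIMESTAMP
       )
       STORED AS PARQUET
       LOCATION 's3://my-analytics-bucket/sales/';""",
    # The query itself runs mostly in the Spectrum layer
    "SELECT COUNT(*), SUM(amount) FROM spectrum_demo.sales;",
]

for sql in statements:
    resp = rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )
    print(resp["Id"])  # retrieve SELECT output later with get_statement_result
```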

Redshift Workload Management

Amazon Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries.

When you run a query, WLM assigns the query to a queue according to the user's user group or by matching a query group that is listed in the queue configuration with a query group label that the user sets at runtime. The WLM queues can be Auto (Redshift manages them) or manual (you manage them). The former is recommended.
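
To make queue routing concrete, here is a minimal sketch that labels queries with a query group so WLM can match them to a queue configured for that group. It uses the Data API's BatchExecuteStatement so all three statements share one session; the cluster, database, user, and the 'dashboard' group label are hypothetical, not from the course material.

```python
import boto3

rsd = boto3.client("redshift-data")

# Label this session's queries with a query group so WLM routes them to the
# queue whose configuration lists that group. All names are hypothetical.
resp = rsd.batch_execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    Sqls=[
        "SET query_group TO 'dashboard';",  # match a queue's query_group entry
        "SELECT region, SUM(amount) FROM sales GROUP BY region;",
        "RESET query_group;",               # stop labeling subsequent queries
    ],
)
print(resp["Id"])
```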

Types of WLM

  • Auto WLM determines the amount of resources that queries need and adjusts the concurrency based on the workload. When queries requiring large amounts of resources are in the system (for example, hash joins between large tables), the concurrency is lower.

    Exam Tip: You should use Auto WLM when your workload is highly unpredictable.

  • With manual WLM, you manage query concurrency and memory allocation, as opposed to auto WLM, where it’s managed by Amazon Redshift automatically. You configure separate WLM queues for different workloads like ETL, BI, and ad hoc and customize resource allocation. 

Exam Tip: Improve the throughput of WLM queues with Concurrency Scaling in Redshift. In manual WLM, you manage query concurrency and memory allocation to queries in units called slots.

WLM Queue Management Rule

You can use Redshift workload management to define multiple query queues and to route queries to the appropriate queues at runtime. Queue Management Rules (QMR) control query behavior in WLM queues. A QMR rule definition requires a rule name, one to three predicates or conditions, and an action. The action can be Log, Hop, or Abort.
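
As a hedged sketch of what such a setup might look like when applied programmatically, the snippet below defines two manual WLM queues plus the default queue and attaches a QMR rule that hops queries running longer than 20 seconds. The parameter group name, queue layout, group labels, and threshold are illustrative assumptions, and applying a new WLM configuration may require a cluster reboot for static properties.

```python
import json
import boto3

# Hypothetical manual WLM layout: a queue for short queries with a QMR rule
# that hops queries exceeding 20 seconds, a queue for long queries, and the
# default queue. Field names follow the wlm_json_configuration format.
wlm_config = [
    {
        "query_group": ["short"],
        "query_concurrency": 5,
        "memory_percent_to_use": 30,
        "rules": [
            {
                "rule_name": "hop_overrunning_short_queries",
                "predicate": [
                    {"metric_name": "query_execution_time", "operator": ">", "value": 20}
                ],
                "action": "hop",
            }
        ],
    },
    {"query_group": ["long"], "query_concurrency": 3, "memory_percent_to_use": 60},
    {"query_concurrency": 2},  # last entry acts as the default queue
]

redshift = boto3.client("redshift")
redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-analytics-params",  # hypothetical parameter group
    Parameters=[
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        }
    ],
)
```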

Redshift Short Query Acceleration

Short query acceleration (SQA) prioritizes selected short-running queries ahead of longer-running queries. SQA runs short-running queries in a dedicated space, so that SQA queries aren't forced to wait in queues behind longer queries. SQA only prioritizes short-running queries that are in a user-defined queue. With SQA, short-running queries begin running more quickly and users see results sooner.

If you enable SQA, you can reduce workload management (WLM) queues that are dedicated to running short queries. In addition, long-running queries don't need to contend with short queries for slots in a queue, so you can configure your WLM queues to use fewer query slots. When you use lower concurrency, query throughput is increased and overall system performance is improved for most workloads.
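
For completeness, a brief hedged sketch of enabling SQA programmatically: it is switched on by adding a short_query_queue element to the same wlm_json_configuration parameter used above (again with hypothetical names); SQA can also be enabled from the console.

```python
import json
import boto3

# Same hypothetical manual WLM layout as before, with SQA enabled by the
# extra {"short_query_queue": true} element appended to the queue list.
wlm_with_sqa = [
    {"query_group": ["short"], "query_concurrency": 5, "memory_percent_to_use": 30},
    {"query_group": ["long"], "query_concurrency": 3, "memory_percent_to_use": 60},
    {"query_concurrency": 2},     # default queue
    {"short_query_queue": True},  # turn on Short Query Acceleration
]

boto3.client("redshift").modify_cluster_parameter_group(
    ParameterGroupName="my-analytics-params",  # hypothetical parameter group
    Parameters=[
        {"ParameterName": "wlm_json_configuration", "ParameterValue": json.dumps(wlm_with_sqa)}
    ],
)
```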

Redshift VACUUM

When data is inserted into Redshift, it is not sorted and is written to an unsorted block. With unsorted data on disk, query performance might be degraded for operations that rely on sorted data, such as range-restricted scans or merge joins. When you run a DELETE query, Redshift soft-deletes the data. Similarly, when you perform an UPDATE, Redshift performs a DELETE followed by an INSERT in the background. When the VACUUM command is issued, it physically deletes the soft-deleted data and sorts the data again.

Amazon Redshift can automatically sort and perform a VACUUM DELETE operation on tables in the background. To clean up tables after a load or a series of incremental updates, you can also run the VACUUM command, either against the entire database or against individual tables. This command is commonly used for cleanup of tables.

A VACUUM command in Redshift ensures sorted order and reclaims space.
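
As one way to schedule this as off-hours maintenance, the sketch below issues a VACUUM through the Redshift Data API; the cluster, database, user, and table names are hypothetical, and the commented lines show the other common VACUUM modes.

```python
import boto3

rsd = boto3.client("redshift-data")

# Reclaim space from soft-deleted rows and re-sort the table in one pass.
# VACUUM FULL is the default mode; the variants below each do only one half:
#   VACUUM DELETE ONLY sales;  -- reclaim space, skip re-sorting
#   VACUUM SORT ONLY sales;    -- re-sort rows, skip space reclamation
resp = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",  # hypothetical names throughout
    Database="dev",
    DbUser="awsuser",
    Sql="VACUUM FULL sales;",
)
print(resp["Id"])
```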

Exam Tip: When you delete a row in Redshift, it is not deleted immediately. Further, remember that you cannot update a row in place in Redshift; an update is a delete followed by an insert.


Scenario

When you are managing multiple query queues via WLM in Redshift, the recommended way is to use automatic WLM to let Redshift manage via its automated query monitoring capabilities.

However, in this case the situation is that some short queries exceeded their threshold and were aborted; we want them instead to retain their state and continue. This can be done simply and effectively using query queue hopping. However, this is only possible if you have enabled WLM in manual mode, triggered either by a WLM query timeout or by a Queue Management Rule (QMR) you specified.

  • Queue Hopping: With manual WLM, you can manage system performance and your users' experience by modifying your WLM configuration to create separate queues for the long-running queries and the short-running queries. When a query is hopped, WLM attempts to route the query to the next matching queue based on the WLM queue assignment rules.

    NOTE: If the query doesn't match any other queue definition, the query is canceled; it is not assigned to the default queue. So ensure your queue definitions let the hopped query match another queue and avoid cancellation (see the sketch below for one way to verify hop actions).
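
One way to confirm that hop actions are actually firing is to inspect the STL_WLM_RULE_ACTION system log. A minimal sketch with hypothetical cluster, database, and user names:

```python
import boto3

rsd = boto3.client("redshift-data")

# List recent QMR actions and filter for hops; each row records the query,
# the service class (queue), the rule name, and when the action fired.
resp = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        SELECT userid, query, service_class, rule, action, recordtime
        FROM stl_wlm_rule_action
        WHERE action = 'hop'
        ORDER BY recordtime DESC
        LIMIT 50;
    """,
)
print(resp["Id"])  # fetch the rows later with get_statement_result(Id=...)
```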


Correct Choice

Enable Manual WLM in Redshift and specify Queue Management Rules action to Hop

Per the above explanation, this is the correct choice. With this setup, short queries that run long stop bogging down the other genuinely short queries. Use a WLM queue timeout or a QMR rule to simply hop the query to another queue meant for longer-duration queries.

Exam Tip: Queue hopping does not cancel queries; they preserve their state and continue where they left off. The queue hop action is only possible with manual WLM.


For the other choices

Enable Automatic WLM in Redshift and specify Queue Management Rules action to Hop

Automatic WLM, though preferred, does not support queue hopping, so this choice is ruled out.


Enable Short Query Acceleration (SQA) to ensure short queries do not exceed their set duration.

Short query acceleration can be enabled to improve performance, but it cannot guarantee time limits. The challenge here is that some "short" queries are not actually short; they take a long time and hog resources.

Exam Tip: SQA has another use. When you want to minimize the WLM queues (and cost) dedicated to short queries, think SQA.


Run the VACUUM command regularly at off hours to ensure the database is in sorted order, which will enhance query efficiency

Running the VACUUM command regularly is a best practice and can improve overall query performance, not just that of short queries. However, it does not stop a long-running short query from blocking other short queries, so it does not address this scenario.

Suggested Ways to Crack the Exam Questions

1. Study the question fully. Form an expectation of what the answer should be before reading the choices, but DO NOT attach yourself to this conclusion - keep it tentative.

2. Make sure you note the constraint keywords and asks such as least cost, high performance, least effort, etc.

3. Beware if a question says: "Pick the choice that is NOT TRUE" (or "Pick the choice that is FALSE") - natural human thinking tends to gravitate towards true.

4. Review ALL the choices - DO NOT make a selection without reading them all. A common mistake is picking the earliest choice that seems to fit as the best answer without reading all the choices, when the next choice could have been the perfect answer.

5. Answer ALL the questions even if you don't know the answer - there is no penalty for incorrect answers. Further, if you don't know a question or are not fully confident, eliminate the choices that are definitely wrong, then focus only on the plausible ones to choose the answer (your probability of being correct increases this way!).

6. Keep time to review at the end. Reasons:

  • Sometimes you may have inadvertently forgotten to answer a question, or

  • Beware: though you knew a question and meant to select the correct answer, a stray mouse click may have selected a wrong choice, or

  • You may have skipped a question to save time but then forgot about it: remember, every question's answer counts.

The practice tests will ensure you are prepared for the above, especially the pitfalls in #6.

Check List

  1. Prepare BEFORE you take these tests. Pretend each test is the exam itself and allocate your time accordingly.

  2. Read the explanations fully for a test after completing it, before moving on to the next test.

  3. Review your Incorrect Answers.

  4. Review also your Correct Answers.

    • For some questions, you could have used elimination or just guessed - so understand why that choice is indeed the right answer.

    • For some questions, you picked a choice thinking of a reason; the choice is correct, but that reason is wrong. Reading the explanation will tell you precisely why it is the right choice and for the right reason.

  5. Review the Incorrect choices because you need to know why they are not the right answers. Moreover, in a different problem context, they may be valid choices.

  6. If you wish to re-take the tests, wait a couple of days so your memory will not interfere with your understanding in answering the questions.

  7. A few days before the exam, see if you have time to quickly re-take all the Tests (or simply review the Exam Tip content).

  8. If you score 85-90 percent or above in the tests, I believe you are ready to take the exam.

AWS Data Analytics Certification Specialty Exam Domains and Weighting

The exam consists of five domains with weights indicated below (as of this writing). The questions tend to mix and match the domains to test your understanding of data life cycle as well as within a domain to test your depth of knowledge.

  1. Domain 1: Collection - 18%

  2. Domain 2: Storage and Data Management - 22%

  3. Domain 3: Processing - 24%

  4. Domain 4: Analysis and Visualization - 18%

  5. Domain 5: Security - 18%

This course can strengthen your foundations for the advanced data analytics in the cloud exam through its depth as well as its use-case-driven questions; the latter tend to be the style of the AWS exam questions. This course can also help your data journey evolve from analytics to more advanced aspects such as ML-based processing.

The course is organized into two tests that are designed to bring out the deep nuances associated with the setup, development, working, algorithmic, security, and operational aspects of a data life cycle, the AWS services and their interplay in data management, and how to do data analysis while optimizing cost, effort, or other key criteria.


Next Steps

Focus on the pattern of the ask in the practice test questions, not just the question.

Good Luck!