How to Pass the Amazon MLS-C01 Exam with the Help of Dumps?
DumpsPool provides the high-quality resources you have been searching for, so you can stop stressing and get ready for the exam. Our Online Test Engine gives you the guidance you need to pass the certification exam. We stand behind our results because we cover each topic in a precise and understandable manner. Our expert team prepared the latest Amazon MLS-C01 Dumps to meet your training needs, and they come in two formats: Dumps PDF and Online Test Engine.
How Do I Know Amazon MLS-C01 Dumps are Worth it?
Did we mention that our latest MLS-C01 Dumps PDF is also available as an Online Test Engine? And that is just the starting point. Of all the features DumpsPool offers, the money-back guarantee may be the most valuable one, so you do not have to worry about your payment. Beyond affordable Real Exam Dumps, you also receive three months of free updates.
You can easily browse our large catalog of certification exams and pick any exam to start your training. That's right, DumpsPool is not limited to Amazon exams. We know our customers need an authentic and reliable resource, so we make sure there is never any outdated content in our study materials. Our expert team keeps everything up to the mark by tracking every exam update. Our main focus is helping you understand the real exam format so you can pass the exam more easily.
IT Students Are Using our AWS Certified Machine Learning - Specialty Dumps Worldwide!
It is well established that certification exams are difficult to pass without help from experts, and that is exactly what AWS Certified Machine Learning - Specialty Practice Question Answers provide. You are supported by IT experts who have already been through what you are about to face. DumpsPool's 24/7 customer service keeps you in touch with these experts whenever you need them. Our 100% success rate and worldwide validity make us a trusted resource for candidates. The updated Dumps PDF helps you pass the exam on the first attempt, and the money-back guarantee lets you buy with confidence: you can claim a refund if you do not pass the exam.
How to Get MLS-C01 Real Exam Dumps?
Getting access to real exam dumps is as easy as pressing a button, literally! There are many resources available online, but most of them sell scams or copied content. So if you are going to attempt the MLS-C01 exam, you need to be sure you are buying the right kind of dumps. All the Dumps PDF available on DumpsPool is as unique and up to date as it can be, and our Practice Question Answers are tested and approved by professionals, making DumpsPool one of the most authentic resources on the internet. Our experts make sure the Online Test Engine is free from outdated or fake content, repeated questions, and false or vague information. We make every penny count, and you leave our platform fully satisfied!
Frequently Asked Questions
Amazon MLS-C01 Sample Question Answers
Question # 1
A data scientist stores financial datasets in Amazon S3. The data scientist uses Amazon Athena to query the datasets by using SQL. The data scientist uses Amazon SageMaker to deploy a machine learning (ML) model. The data scientist wants to obtain inferences from the model at the SageMaker endpoint. However, when the data scientist attempts to invoke the SageMaker endpoint, the data scientist receives SQL statement failures. The data scientist's IAM user is currently unable to invoke the SageMaker endpoint. Which combination of actions will give the data scientist's IAM user the ability to invoke the SageMaker endpoint? (Select THREE.)
A. Attach the AmazonAthenaFullAccess AWS managed policy to the user identity.
B. Include a policy statement for the data scientist's IAM user that allows the IAM user to perform the sagemaker:InvokeEndpoint action.
C. Include an inline policy for the data scientist's IAM user that allows SageMaker to read S3 objects.
D. Include a policy statement for the data scientist's IAM user that allows the IAM user to perform the sagemaker:GetRecord action.
E. Include the SQL statement "USING EXTERNAL FUNCTION ml_function_name" in the Athena SQL query.
F. Perform a user remapping in SageMaker to map the IAM user to another IAM user that is on the hosted endpoint.
Answer: B,C,E
Explanation: The correct combination of actions to enable the data scientist’s IAM user to
invoke the SageMaker endpoint is B, C, and E, because they ensure that the IAM user has
the necessary permissions, access, and syntax to query the ML model from Athena. These
actions have the following benefits:
B: Including a policy statement for the IAM user that allows the
sagemaker:InvokeEndpoint action grants the IAM user the permission to call the
SageMaker Runtime InvokeEndpoint API, which is used to get inferences from the
model hosted at the endpoint1.
C: Including an inline policy for the IAM user that allows SageMaker to read S3
objects enables the IAM user to access the data stored in S3, which is the source
of the Athena queries2.
E: Including the SQL statement “USING EXTERNAL FUNCTION
ml_function_name” in the Athena SQL query allows the IAM user to invoke the ML
model as an external function from Athena, which is a feature that enables
querying ML models from SQL statements3.
The other options are not correct or necessary, because they have the following
drawbacks:
A: Attaching the AmazonAthenaFullAccess AWS managed policy to the user
identity is not sufficient, because it does not grant the IAM user the permission to
invoke the SageMaker endpoint, which is required to query the ML model4.
D: Including a policy statement for the IAM user that allows the IAM user to
perform the sagemaker:GetRecord action is not relevant, because this action is
used to retrieve a single record from a feature group, which is not the case in this
scenario5.
F: Performing a user remapping in SageMaker to map the IAM user to another
IAM user that is on the hosted endpoint is not applicable, because this feature is
only available for multi-model endpoints, which are not used in this scenario.
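For illustration only, a minimal sketch of what the IAM policy statement (action B) and the Athena query (action E) might look like. The endpoint name, account ID, table name, function name, and columns are hypothetical placeholders, not values from the question.
invoke_endpoint_statement = {
    "Effect": "Allow",
    "Action": "sagemaker:InvokeEndpoint",
    # Placeholder endpoint ARN for the deployed model
    "Resource": "arn:aws:sagemaker:us-east-1:111122223333:endpoint/example-endpoint",
}
# Hypothetical Athena query that calls the SageMaker endpoint as an external function
athena_query = """
USING EXTERNAL FUNCTION predict_risk(amount DOUBLE, balance DOUBLE)
    RETURNS DOUBLE
    SAGEMAKER 'example-endpoint'
SELECT transaction_id, predict_risk(amount, balance) AS risk_score
FROM financial_dataset;
"""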
References:
1: InvokeEndpoint - Amazon SageMaker
2: Querying Data in Amazon S3 from Amazon Athena - Amazon Athena
3: Querying machine learning models from Amazon Athena using Amazon
SageMaker | AWS Machine Learning Blog
4: AmazonAthenaFullAccess - AWS Identity and Access Management
5: GetRecord - Amazon SageMaker Feature Store Runtime
Invoke a Multi-Model Endpoint - Amazon SageMaker
Question # 2
A Machine Learning Specialist is designing a scalable data storage solution for Amazon SageMaker. There is an existing TensorFlow-based model implemented as a train.py script that relies on static training data that is currently stored as TFRecords. Which method of providing training data to Amazon SageMaker would meet the business requirements with the LEAST development overhead?
A. Use Amazon SageMaker script mode and use train.py unchanged. Point the Amazon SageMaker training invocation to the local path of the data without reformatting the training data.
B. Use Amazon SageMaker script mode and use train.py unchanged. Put the TFRecord data into an Amazon S3 bucket. Point the Amazon SageMaker training invocation to the S3 bucket without reformatting the training data.
C. Rewrite the train.py script to add a section that converts TFRecords to protobuf and ingests the protobuf data instead of TFRecords.
D. Prepare the data in the format accepted by Amazon SageMaker. Use AWS Glue or AWS Lambda to reformat and store the data in an Amazon S3 bucket.
Answer: B
Explanation: Amazon SageMaker script mode is a feature that allows users to use training
scripts similar to those they would use outside SageMaker with SageMaker’s prebuilt
containers for various frameworks such as TensorFlow. Script mode supports reading data
from Amazon S3 buckets without requiring any changes to the training script. Therefore,
option B is the best method of providing training data to Amazon SageMaker that would
meet the business requirements with the least development overhead.
Option A is incorrect because using a local path of the data would not be scalable or
reliable, as it would depend on the availability and capacity of the local storage. Moreover,
using a local path of the data would not leverage the benefits of Amazon S3, such as
durability, security, and performance. Option C is incorrect because rewriting the train.py
script to convert TFRecords to protobuf would require additional development effort and
complexity, as well as introduce potential errors and inconsistencies in the data format.
Option D is incorrect because preparing the data in the format accepted by Amazon
SageMaker would also require additional development effort and complexity, as well as
involve using additional services such as AWS Glue or AWS Lambda, which would
increase the cost and maintenance of the solution.
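For illustration, a minimal sketch of option B using the SageMaker Python SDK; the role ARN, bucket path, instance type, and framework versions below are assumptions, not values from the question.
from sagemaker.tensorflow import TensorFlow

# Script mode: the existing train.py is used unchanged
estimator = TensorFlow(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role ARN
    instance_count=1,
    instance_type="ml.p3.2xlarge",      # assumed instance type
    framework_version="2.11",           # assumed TensorFlow version
    py_version="py39",
)

# Point the training job at the TFRecord data that was uploaded to S3 as-is
estimator.fit({"training": "s3://example-bucket/tfrecords/"})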
References:
Bring your own model with Amazon SageMaker script mode
GitHub - aws-samples/amazon-sagemaker-script-mode
Deep Dive on TensorFlow training with Amazon SageMaker and Amazon S3
amazon-sagemaker-script-mode/generate_cifar10_tfrecords.py at master
Question # 3
A credit card company wants to identify fraudulent transactions in real time. A data scientist builds a machine learning model for this purpose. The transactional data is captured and stored in Amazon S3. The historic data is already labeled with two classes: fraud (positive) and fair transactions (negative). The data scientist removes all the missing data and builds a classifier by using the XGBoost algorithm in Amazon SageMaker. The model produces the following results:
• True positive rate (TPR): 0.700
• False negative rate (FNR): 0.300
• True negative rate (TNR): 0.977
• False positive rate (FPR): 0.023
• Overall accuracy: 0.949
Which solution should the data scientist use to improve the performance of the model?
A. Apply the Synthetic Minority Oversampling Technique (SMOTE) on the minority class in the training dataset. Retrain the model with the updated training data.
B. Apply the Synthetic Minority Oversampling Technique (SMOTE) on the majority class in the training dataset. Retrain the model with the updated training data.
C. Undersample the minority class.
D. Oversample the majority class.
Answer: A
Explanation: The solution that the data scientist should use to improve the performance of
the model is to apply the Synthetic Minority Oversampling Technique (SMOTE) on the
minority class in the training dataset, and retrain the model with the updated training data.
This solution can address the problem of class imbalance in the dataset, which can affect
the model’s ability to learn from the rare but important positive class (fraud).
Class imbalance is a common issue in machine learning, especially for classification tasks.
It occurs when one class (usually the positive or target class) is significantly
underrepresented in the dataset compared to the other class (usually the negative or nontarget
class). For example, in the credit card fraud detection problem, the positive class
(fraud) is much less frequent than the negative class (fair transactions). This can cause the
model to be biased towards the majority class, and fail to capture the characteristics and
patterns of the minority class. As a result, the model may have a high overall accuracy, but
a low recall or true positive rate for the minority class, which means it misses many
fraudulent transactions.
SMOTE is a technique that can help mitigate the class imbalance problem by generating
synthetic samples for the minority class. SMOTE works by finding the k-nearest neighbors
of each minority class instance, and randomly creating new instances along the line
segments connecting them. This way, SMOTE can increase the number and diversity of
the minority class instances, without duplicating or losing any information. By applying
SMOTE on the minority class in the training dataset, the data scientist can balance the
classes and improve the model’s performance on the positive class1.
The other options are either ineffective or counterproductive. Applying SMOTE on the
majority class would not balance the classes, but increase the imbalance and the size of
the dataset. Undersampling the minority class would reduce the number of instances
available for the model to learn from, and potentially lose some important information.
Oversampling the majority class would also increase the imbalance and the size of the
dataset, and introduce redundancy and overfitting.
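As a quick sketch of option A using the imbalanced-learn library, assuming the features X_train and labels y_train have already been prepared from the cleaned dataset:
from imblearn.over_sampling import SMOTE
import xgboost as xgb

# Generate synthetic samples for the minority (fraud) class only
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Retrain the XGBoost classifier on the rebalanced training data
model = xgb.XGBClassifier()
model.fit(X_resampled, y_resampled)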
References:
1: SMOTE for Imbalanced Classification with Python - Machine Learning Mastery
Question # 4
A pharmaceutical company performs periodic audits of clinical trial sites to quickly resolve critical findings. The company stores audit documents in text format. Auditors have requested help from a data science team to quickly analyze the documents. The auditors need to discover the 10 main topics within the documents to prioritize and distribute the review work among the auditing team members. Documents that describe adverse events must receive the highest priority. A data scientist will use statistical modeling to discover abstract topics and to provide a list of the top words for each category to help the auditors assess the relevance of the topic. Which algorithms are best suited to this scenario? (Choose two.)
A. Latent Dirichlet allocation (LDA)
B. Random Forest classifier
C. Neural topic modeling (NTM)
D. Linear support vector machine
E. Linear regression
Answer: A,C
Explanation: The algorithms that are best suited to this scenario are latent Dirichlet
allocation (LDA) and neural topic modeling (NTM), as they are both unsupervised learning
methods that can discover abstract topics from a collection of text documents. LDA and
NTM can provide a list of the top words for each topic, as well as the topic distribution for
each document, which can help the auditors assess the relevance and priority of the
topic12.
The other options are not suitable because:
Option B: A random forest classifier is a supervised learning method that can
perform classification or regression tasks by using an ensemble of decision
trees. A random forest classifier is not suitable for discovering abstract topics from
text documents, as it requires labeled data and predefined classes3.
Option D: A linear support vector machine is a supervised learning method that
can perform classification or regression tasks by using a linear function that
separates the data into different classes. A linear support vector machine is not
suitable for discovering abstract topics from text documents, as it requires labeled
data and predefined classes4.
Option E: A linear regression is a supervised learning method that can perform
regression tasks by using a linear function that models the relationship between a
dependent variable and one or more independent variables. A linear regression is
not suitable for discovering abstract topics from text documents, as it requires
labeled data and a continuous output variable5.
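A minimal local sketch of LDA topic discovery; scikit-learn is used here for brevity as a stand-in for the managed SageMaker LDA/NTM algorithms, and the variable documents is an assumed list of audit texts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer(stop_words="english")
doc_term_matrix = vectorizer.fit_transform(documents)  # documents: assumed list of audit texts

# Discover 10 abstract topics, as the auditors requested
lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(doc_term_matrix)

# Print the top words for each topic so the auditors can assess relevance
terms = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in weights.argsort()[-10:][::-1]]
    print(topic_id, top_words)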
References:
1: Latent Dirichlet Allocation
2: Neural Topic Modeling
3: Random Forest Classifier
4: Linear Support Vector Machine
5: Linear Regression
Question # 5
A media company wants to create a solution that identifies celebrities in pictures that users upload. The company also wants to identify the IP address and the timestamp details from the users so the company can prevent users from uploading pictures from unauthorized locations. Which solution will meet these requirements with LEAST development effort?
A. Use AWS Panorama to identify celebrities in the pictures. Use AWS CloudTrail to capture IP address and timestamp details.
B. Use AWS Panorama to identify celebrities in the pictures. Make calls to the AWS Panorama Device SDK to capture IP address and timestamp details.
C. Use Amazon Rekognition to identify celebrities in the pictures. Use AWS CloudTrail to capture IP address and timestamp details.
D. Use Amazon Rekognition to identify celebrities in the pictures. Use the text detection feature to capture IP address and timestamp details.
Answer: C
Explanation: The solution C will meet the requirements with the least development effort
because it uses Amazon Rekognition and AWS CloudTrail, which are fully managed
services that can provide the desired functionality. The solution C involves the following
steps:
Use Amazon Rekognition to identify celebrities in the pictures. Amazon
Rekognition is a service that can analyze images and videos and extract insights
such as faces, objects, scenes, emotions, and more. Amazon Rekognition also
provides a feature called Celebrity Recognition, which can recognize thousands of
celebrities across a number of categories, such as politics, sports, entertainment,
and media. Amazon Rekognition can return the name, face, and confidence score
of the recognized celebrities, as well as additional information such as URLs and
biographies1.
Use AWS CloudTrail to capture IP address and timestamp details. AWS CloudTrail
is a service that can record the API calls and events made by or on behalf of AWS
accounts. AWS CloudTrail can provide information such as the source IP address,
the user identity, the request parameters, and the response elements of the API
calls. AWS CloudTrail can also deliver the event records to an Amazon S3 bucket
or an Amazon CloudWatch Logs group for further analysis and auditing2.
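For example, a minimal boto3 sketch of the celebrity recognition call; the bucket name and object key are hypothetical placeholders.
import boto3

rekognition = boto3.client("rekognition")

# Identify celebrities in an uploaded picture stored in S3 (placeholder bucket/key)
response = rekognition.recognize_celebrities(
    Image={"S3Object": {"Bucket": "example-uploads-bucket", "Name": "uploads/photo.jpg"}}
)

for celebrity in response["CelebrityFaces"]:
    print(celebrity["Name"], celebrity["MatchConfidence"])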
The other options are not suitable because:
Option A: Using AWS Panorama to identify celebrities in the pictures and using
AWS CloudTrail to capture IP address and timestamp details will not meet the
requirements effectively. AWS Panorama is a service that can extend computer
vision to the edge, where it can run inference on video streams from cameras and
other devices. AWS Panorama is not designed for identifying celebrities in
pictures, and it may not provide accurate or relevant results. Moreover, AWS
Panorama requires the use of an AWS Panorama Appliance or a compatible
device, which may incur additional costs and complexity3.
Option B: Using AWS Panorama to identify celebrities in the pictures and making
calls to the AWS Panorama Device SDK to capture IP address and timestamp
details will not meet the requirements effectively, for the same reasons as option
A. Additionally, making calls to the AWS Panorama Device SDK will require more
development effort than using AWS CloudTrail, as it will involve writing custom
code and handling errors and exceptions4.
Option D: Using Amazon Rekognition to identify celebrities in the pictures and
using the text detection feature to capture IP address and timestamp details will
not meet the requirements effectively. The text detection feature of Amazon
Rekognition is used to detect and recognize text in images and videos, such as
street names, captions, product names, and license plates. It is not suitable for
capturing IP address and timestamp details, as these are not part of the pictures
that users upload. Moreover, the text detection feature may not be accurate or
reliable, as it depends on the quality and clarity of the text in the images and videos.
Question # 6
A retail company stores 100 GB of daily transactional data in Amazon S3 at periodic intervals. The company wants to identify the schema of the transactional data. The company also wants to perform transformations on the transactional data that is in Amazon S3. The company wants to use a machine learning (ML) approach to detect fraud in the transformed data. Which combination of solutions will meet these requirements with the LEAST operational overhead? (Select THREE.)
A. Use Amazon Athena to scan the data and identify the schema.
B. Use AWS Glue crawlers to scan the data and identify the schema.
C. Use Amazon Redshift stored procedures to perform data transformations.
D. Use AWS Glue workflows and AWS Glue jobs to perform data transformations.
E. Use Amazon Redshift ML to train a model to detect fraud.
F. Use Amazon Fraud Detector to train a model to detect fraud.
Answer: B,D,F
Explanation: To meet the requirements with the least operational overhead, the company
should use AWS Glue crawlers, AWS Glue workflows and jobs, and Amazon Fraud
Detector. AWS Glue crawlers can scan the data in Amazon S3 and identify the schema,
which is then stored in the AWS Glue Data Catalog. AWS Glue workflows and jobs can
perform data transformations on the data in Amazon S3 using serverless Spark or Python
scripts. Amazon Fraud Detector can train a model to detect fraud using the transformed
data and the company’s historical fraud labels, and then generate fraud predictions using a
simple API call.
Option A is incorrect because Amazon Athena is a serverless query service that can
analyze data in Amazon S3 using standard SQL, but it does not perform data
transformations or fraud detection.
Option C is incorrect because Amazon Redshift is a cloud data warehouse that can store
and query data using SQL, but it requires provisioning and managing clusters, which adds
operational overhead. Moreover, Amazon Redshift does not provide a built-in fraud detection capability.
Option E is incorrect because Amazon Redshift ML is a feature that allows users to create,
train, and deploy machine learning models using SQL commands in Amazon Redshift.
However, using Amazon Redshift ML would require loading the data from Amazon S3 to
Amazon Redshift, which adds complexity and cost. Also, Amazon Redshift ML does not
support fraud detection as a use case.
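As a rough sketch, the crawler that identifies the schema could be created with boto3 as follows; the crawler name, role ARN, database name, and S3 path are placeholders.
import boto3

glue = boto3.client("glue")

# Crawl the daily transactional data in S3 and register the schema in the Data Catalog
glue.create_crawler(
    Name="transactions-crawler",                             # placeholder name
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",   # placeholder role ARN
    DatabaseName="transactions_db",                          # placeholder database
    Targets={"S3Targets": [{"Path": "s3://example-transactions-bucket/daily/"}]},
)
glue.start_crawler(Name="transactions-crawler")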
References:
AWS Glue Crawlers
AWS Glue Workflows and Jobs
Amazon Fraud Detector
Question # 7
An automotive company uses computer vision in its autonomous cars. The company trained its object detection models successfully by using transfer learning from a convolutional neural network (CNN). The company trained the models by using PyTorch through the Amazon SageMaker SDK. The vehicles have limited hardware and compute power. The company wants to optimize the model to reduce memory, battery, and hardware consumption without a significant sacrifice in accuracy. Which solution will improve the computational efficiency of the models?
A. Use Amazon CloudWatch metrics to gain visibility into the SageMaker training weights, gradients, biases, and activation outputs. Compute the filter ranks based on the training information. Apply pruning to remove the low-ranking filters. Set new weights based on the pruned set of filters. Run a new training job with the pruned model.
B. Use Amazon SageMaker Ground Truth to build and run data labeling workflows. Collect a larger labeled dataset with the labeling workflows. Run a new training job that uses the new labeled data with previous training data.
C. Use Amazon SageMaker Debugger to gain visibility into the training weights, gradients, biases, and activation outputs. Compute the filter ranks based on the training information. Apply pruning to remove the low-ranking filters. Set the new weights based on the pruned set of filters. Run a new training job with the pruned model.
D. Use Amazon SageMaker Model Monitor to gain visibility into the ModelLatency metric and OverheadLatency metric of the model after the company deploys the model. Increase the model learning rate. Run a new training job.
Answer: C
Explanation: The solution C will improve the computational efficiency of the models
because it uses Amazon SageMaker Debugger and pruning, which are techniques that can
reduce the size and complexity of the convolutional neural network (CNN) models. The
solution C involves the following steps:
Use Amazon SageMaker Debugger to gain visibility into the training weights,
gradients, biases, and activation outputs. Amazon SageMaker Debugger is a
service that can capture and analyze the tensors that are emitted during the
training process of machine learning models. Amazon SageMaker Debugger can
provide insights into the model performance, quality, and convergence. Amazon
SageMaker Debugger can also help to identify and diagnose issues such as
overfitting, underfitting, vanishing gradients, and exploding gradients1.
Compute the filter ranks based on the training information. Filter ranking is a
technique that can measure the importance of each filter in a convolutional layer
based on some criterion, such as the average percentage of zero activations or
the L1-norm of the filter weights. Filter ranking can help to identify the filters that
have little or no contribution to the model output, and thus can be removed without
affecting the model accuracy2.
Apply pruning to remove the low-ranking filters. Pruning is a technique that can
reduce the size and complexity of a neural network by removing the redundant or
irrelevant parts of the network, such as neurons, connections, or filters. Pruning
can help to improve the computational efficiency, memory usage, and inference speed of the model, as well as to prevent overfitting and improve generalization3.
Set the new weights based on the pruned set of filters. After pruning, the model
will have a smaller and simpler architecture, with fewer filters in each convolutional
layer. The new weights of the model can be set based on the pruned set of filters,
either by initializing them randomly or by fine-tuning them from the original
weights4.
Run a new training job with the pruned model. The pruned model can be trained
again with the same or a different dataset, using the same or a different framework
or algorithm. The new training job can use the same or a different configuration of
Amazon SageMaker, such as the instance type, the hyperparameters, or the data
ingestion mode. The new training job can also use Amazon SageMaker Debugger
to monitor and analyze the training process and the model quality5.
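As a simplified illustration of the pruning step, PyTorch's built-in structured pruning is used below as a stand-in for the filter-ranking approach described above (it masks low-ranking filters rather than physically removing them); the variable model is assumed to be the trained CNN.
import torch.nn as nn
import torch.nn.utils.prune as prune

# Zero out (mask) the lowest-ranking 30% of filters in each conv layer,
# using the L1 norm of the filter weights as a simple ranking criterion.
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.ln_structured(module, name="weight", amount=0.3, n=1, dim=0)
        prune.remove(module, "weight")  # make the pruned weights permanent

# The pruned model can then be fine-tuned in a new SageMaker training job.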
The other options are not suitable because:
Option A: Using Amazon CloudWatch metrics to gain visibility into the SageMaker
training weights, gradients, biases, and activation outputs will not be as effective
as using Amazon SageMaker Debugger. Amazon CloudWatch is a service that
can monitor and observe the operational health and performance of AWS
resources and applications. Amazon CloudWatch can provide metrics, alarms,
dashboards, and logs for various AWS services, including Amazon SageMaker.
However, Amazon CloudWatch does not provide the same level of granularity and
detail as Amazon SageMaker Debugger for the tensors that are emitted during the
training process of machine learning models. Amazon CloudWatch metrics are
mainly focused on the resource utilization and the training progress, not on the
model performance, quality, and convergence6.
Option B: Using Amazon SageMaker Ground Truth to build and run data labeling
workflows and collecting a larger labeled dataset with the labeling workflows will
not improve the computational efficiency of the models. Amazon SageMaker
Ground Truth is a service that can create high-quality training datasets for machine
learning by using human labelers. A larger labeled dataset can help to improve the
model accuracy and generalization, but it will not reduce the memory, battery, and
hardware consumption of the model. Moreover, a larger labeled dataset may
increase the training time and cost of the model7.
Option D: Using Amazon SageMaker Model Monitor to gain visibility into the
ModelLatency metric and OverheadLatency metric of the model after the company
deploys the model and increasing the model learning rate will not improve the
computational efficiency of the models. Amazon SageMaker Model Monitor is a
service that can monitor and analyze the quality and performance of machine
learning models that are deployed on Amazon SageMaker endpoints. The
ModelLatency metric and the OverheadLatency metric can measure the inference
latency of the model and the endpoint, respectively. However, these metrics do not
provide any information about the training weights, gradients, biases, and
activation outputs of the model, which are needed for pruning. Moreover,
increasing the model learning rate will not reduce the size and complexity of the
model, but it may affect the model convergence and accuracy.
References:
1: Amazon SageMaker Debugger
2: Pruning Convolutional Neural Networks for Resource Efficient Inference
3: Pruning Neural Networks: A Survey
4: Learning both Weights and Connections for Efficient Neural Networks
5: Amazon SageMaker Training Jobs
6: Amazon CloudWatch Metrics for Amazon SageMaker
7: Amazon SageMaker Ground Truth
Amazon SageMaker Model Monitor
Question # 8
A media company is building a computer vision model to analyze images that are on social media. The model consists of CNNs that the company trained by using images that the company stores in Amazon S3. The company used an Amazon SageMaker training job in File mode with a single Amazon EC2 On-Demand Instance. Every day, the company updates the model by using about 10,000 images that the company has collected in the last 24 hours. The company configures training with only one epoch. The company wants to speed up training and lower costs without the need to make any code changes. Which solution will meet these requirements?
A. Instead of File mode, configure the SageMaker training job to use Pipe mode. Ingest the data from a pipe.
B. Instead of File mode, configure the SageMaker training job to use FastFile mode with no other changes.
C. Instead of On-Demand Instances, configure the SageMaker training job to use Spot Instances. Make no other changes.
D. Instead of On-Demand Instances, configure the SageMaker training job to use Spot Instances. Implement model checkpoints.
Answer: C
Explanation: The solution C will meet the requirements because it uses Amazon
SageMaker Spot Instances, which are unused EC2 instances that are available at up to
90% discount compared to On-Demand prices. Amazon SageMaker Spot Instances can
speed up training and lower costs by taking advantage of the spare EC2 capacity. The
company does not need to make any code changes to use Spot Instances, as it can simply
enable the managed spot training option in the SageMaker training job configuration. The
company also does not need to implement model checkpoints, as it is using only one
epoch for training, which means the model will not resume from a previous state1.
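A minimal sketch of how managed spot training could be enabled on the existing estimator with configuration changes only; the training image URI, role ARN, instance type, and time limits are placeholders, not values from the question.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",                       # placeholder: the existing training image
    role="arn:aws:iam::111122223333:role/SageMakerRole",    # placeholder role ARN
    instance_count=1,
    instance_type="ml.p3.2xlarge",                          # assumed instance type
    use_spot_instances=True,   # run on Spot capacity at a discount
    max_run=3600,              # assumed maximum training time in seconds
    max_wait=7200,             # must be >= max_run; time to wait for Spot capacity
)
estimator.fit("s3://example-bucket/daily-images/")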
The other options are not suitable because:
Option A: Configuring the SageMaker training job to use Pipe mode instead of File
mode will not speed up training or lower costs significantly. Pipe mode is a data
ingestion mode that streams data directly from S3 to the training algorithm, without
copying the data to the local storage of the training instance. Pipe mode can
reduce the startup time of the training job and the disk space usage, but it does not
affect the computation time or the instance price. Moreover, Pipe mode may
require some code changes to handle the streaming data, depending on the
training algorithm2. Option B: Configuring the SageMaker training job to use FastFile mode instead of
File mode will not speed up training or lower costs significantly. FastFile mode is a
data ingestion mode that copies data from S3 to the local storage of the training
instance in parallel with the training process. FastFile mode can reduce the startup
time of the training job and the disk space usage, but it does not affect the
computation time or the instance price. Moreover, FastFile mode is only available
for distributed training jobs that use multiple instances, which is not the case for
the company3.
Option D: Configuring the SageMaker training job to use Spot Instances and
implementing model checkpoints will not meet the requirements without the need
to make any code changes. Model checkpoints are a feature that allows the
training job to save the model state periodically to S3, and resume from the latest
checkpoint if the training job is interrupted. Model checkpoints can help to avoid
losing the training progress and ensure the model convergence, but they require
some code changes to implement the checkpointing logic and the resuming logic4.
References:
1: Managed Spot Training - Amazon SageMaker
2: Pipe Mode - Amazon SageMaker
3: FastFile Mode - Amazon SageMaker
4: Checkpoints - Amazon SageMaker
Question # 9
A data scientist is building a forecasting model for a retail company by using the most recent 5 years of sales records that are stored in a data warehouse. The dataset contains sales records for each of the company's stores across five commercial regions. The data scientist creates a working dataset with StoreID, Region, Date, and Sales Amount as columns. The data scientist wants to analyze yearly average sales for each region. The scientist also wants to compare how each region performed compared to average sales across all commercial regions. Which visualization will help the data scientist better understand the data trend?
A. Create an aggregated dataset by using the Pandas GroupBy function to get average sales for each year for each store. Create a bar plot, faceted by year, of average sales for each store. Add an extra bar in each facet to represent average sales.
B. Create an aggregated dataset by using the Pandas GroupBy function to get average sales for each year for each store. Create a bar plot, colored by region and faceted by year, of average sales for each store. Add a horizontal line in each facet to represent average sales.
C. Create an aggregated dataset by using the Pandas GroupBy function to get average sales for each year for each region. Create a bar plot of average sales for each region. Add an extra bar in each facet to represent average sales.
D. Create an aggregated dataset by using the Pandas GroupBy function to get average sales for each year for each region. Create a bar plot, faceted by year, of average sales for each region. Add a horizontal line in each facet to represent average sales.
Answer: D
Explanation: The best visualization for this task is to create a bar plot, faceted by year, of
average sales for each region and add a horizontal line in each facet to represent average
sales. This way, the data scientist can easily compare the yearly average sales for each
region with the overall average sales and see the trends over time. The bar plot also allows
the data scientist to see the relative performance of each region within each year and
across years. The other options are less effective because they either do not show the
yearly trends, do not show the overall average sales, or do not group the data by region.
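For illustration, a minimal pandas/matplotlib sketch of option D; the DataFrame df is assumed to already contain the StoreID, Region, Date, and Sales Amount columns described above.
import pandas as pd
import matplotlib.pyplot as plt

# Aggregate yearly average sales per region
df["Year"] = pd.to_datetime(df["Date"]).dt.year
agg = df.groupby(["Year", "Region"], as_index=False)["Sales Amount"].mean()

years = sorted(agg["Year"].unique())
fig, axes = plt.subplots(1, len(years), figsize=(4 * len(years), 4), sharey=True)

for ax, year in zip(axes, years):
    data = agg[agg["Year"] == year]
    ax.bar(data["Region"], data["Sales Amount"])
    # Horizontal line: average sales across all regions for that year
    ax.axhline(data["Sales Amount"].mean(), color="red", linestyle="--")
    ax.set_title(str(year))

plt.show()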
Question # 10
A data scientist is training a large PyTorch model by using Amazon SageMaker. It takes 10 hours on average to train the model on GPU instances. The data scientist suspects that training is not converging and that resource utilization is not optimal. What should the data scientist do to identify and address training issues with the LEAST development effort?
A. Use CPU utilization metrics that are captured in Amazon CloudWatch. Configure a CloudWatch alarm to stop the training job early if low CPU utilization occurs.
B. Use high-resolution custom metrics that are captured in Amazon CloudWatch. Configure an AWS Lambda function to analyze the metrics and to stop the training job early if issues are detected.
C. Use the SageMaker Debugger vanishing_gradient and LowGPUUtilization built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected.
D. Use the SageMaker Debugger confusion and feature_importance_overweight built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected.
Answer: C
Explanation: The solution C is the best option to identify and address training issues with
the least development effort. The solution C involves the following steps:
Use the SageMaker Debugger vanishing_gradient and LowGPUUtilization built-in
rules to detect issues. SageMaker Debugger is a feature of Amazon SageMaker
that allows data scientists to monitor, analyze, and debug machine learning
models during training. SageMaker Debugger provides a set of built-in rules that
can automatically detect common issues and anomalies in model training, such as
vanishing or exploding gradients, overfitting, underfitting, low GPU utilization, and
more1. The data scientist can use the vanishing_gradient rule to check if the
gradients are becoming too small and causing the training to not converge. The
data scientist can also use the LowGPUUtilization rule to check if the GPU
resources are underutilized and causing the training to be inefficient2.
Launch the StopTrainingJob action if issues are detected. SageMaker Debugger
can also take actions based on the status of the rules. One of the actions is
StopTrainingJob, which can terminate the training job if a rule is in an error
state. This can help the data scientist to save time and money by stopping the
training early if issues are detected3.
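As a rough sketch using the SageMaker Python SDK, the built-in rules and a stop-training action could be attached to the estimator along these lines; the estimator arguments are placeholders, and the stop action is shown on the Debugger rule only.
from sagemaker.debugger import Rule, ProfilerRule, rule_configs
from sagemaker.pytorch import PyTorch

# Stop the training job automatically if the vanishing gradient rule fires
stop_action = rule_configs.ActionList(rule_configs.StopTraining())

rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient(), actions=stop_action),
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
]

estimator = PyTorch(
    entry_point="train.py",                                 # placeholder training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",    # placeholder role ARN
    instance_count=1,
    instance_type="ml.p3.2xlarge",                          # assumed instance type
    framework_version="1.13",                               # assumed PyTorch version
    py_version="py39",
    rules=rules,
)
estimator.fit("s3://example-bucket/training-data/")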
The other options are not suitable because:
Option A: Using CPU utilization metrics that are captured in Amazon CloudWatch
and configuring a CloudWatch alarm to stop the training job early if low CPU
utilization occurs will not identify and address training issues effectively. CPU
utilization is not a good indicator of model training performance, especially for GPU
instances. Moreover, CloudWatch alarms can only trigger actions based on simple
thresholds, not complex rules or conditions4.
Option B: Using high-resolution custom metrics that are captured in Amazon
CloudWatch and configuring an AWS Lambda function to analyze the metrics and
to stop the training job early if issues are detected will incur more development effort than using SageMaker Debugger. The data scientist will have to write the
code for capturing, sending, and analyzing the custom metrics, as well as for
invoking the Lambda function and stopping the training job. Moreover, this solution
may not be able to detect all the issues that SageMaker Debugger can5.
Option D: Using the SageMaker Debugger confusion and
feature_importance_overweight built-in rules and launching the StopTrainingJob
action if issues are detected will not identify and address training issues effectively.
The confusion rule is used to monitor the confusion matrix of a classification
model, which is not relevant for a regression model that predicts prices. The
feature_importance_overweight rule is used to check if some features have too
much weight in the model, which may not be related to the convergence or
resource utilization issues2.
References:
1: Amazon SageMaker Debugger
2: Built-in Rules for Amazon SageMaker Debugger
3: Actions for Amazon SageMaker Debugger
4: Amazon CloudWatch Alarms
5: Amazon CloudWatch Custom Metrics
Question # 11
A company builds computer-vision models that use deep learning for the autonomous vehicle industry. A machine learning (ML) specialist uses an Amazon EC2 instance that has a CPU:GPU ratio of 12:1 to train the models. The ML specialist examines the instance metric logs and notices that the GPU is idle half of the time. The ML specialist must reduce training costs without increasing the duration of the training jobs. Which solution will meet these requirements?
A. Switch to an instance type that has only CPUs.
B. Use a heterogeneous cluster that has two different instance groups.
C. Use memory-optimized EC2 Spot Instances for the training jobs.
D. Switch to an instance type that has a CPU:GPU ratio of 6:1.
Answer: D
Explanation: Switching to an instance type that has a CPU: GPU ratio of 6:1 will reduce
the training costs by using fewer CPUs and GPUs, while maintaining the same level of
performance. The GPU idle time indicates that the CPU is not able to feed the GPU with
enough data, so reducing the CPU: GPU ratio will balance the workload and improve the
GPU utilization. A lower CPU: GPU ratio also means less overhead for inter-process
communication and synchronization between the CPU and GPU processes.
References:
Optimizing GPU utilization for AI/ML workloads on Amazon EC2
Analyze CPU vs. GPU Performance for AWS Machine Learning
Question # 12
An engraving company wants to automate its quality control process for plaques. The company performs the process before mailing each customized plaque to a customer. The company has created an Amazon S3 bucket that contains images of defects that should cause a plaque to be rejected. Low-confidence predictions must be sent to an internal team of reviewers who are using Amazon Augmented AI (Amazon A2I). Which solution will meet these requirements?
A. Use Amazon Textract for automatic processing. Use Amazon A2I with Amazon Mechanical Turk for manual review.
B. Use Amazon Rekognition for automatic processing. Use Amazon A2I with a private workforce option for manual review.
C. Use Amazon Transcribe for automatic processing. Use Amazon A2I with a private workforce option for manual review.
D. Use AWS Panorama for automatic processing. Use Amazon A2I with Amazon Mechanical Turk for manual review.
Answer: B
Explanation: Amazon Rekognition is a service that provides computer vision capabilities
for image and video analysis, such as object, scene, and activity detection, face and text
recognition, and custom label detection. Amazon Rekognition can be used to automate the
quality control process for plaques by comparing the images of the plaques with the images
of defects in the Amazon S3 bucket and returning a confidence score for each defect.
Amazon A2I is a service that enables human review of machine learning predictions, such
as low-confidence predictions from Amazon Rekognition. Amazon A2I can be integrated
with a private workforce option, which allows the engraving company to use its own internal
team of reviewers to manually inspect the plaques that are flagged by Amazon
Rekognition. This solution meets the requirements of automating the quality control
process, sending low-confidence predictions to an internal team of reviewers, and using Amazon A2I for manual review.
References:
1: Amazon Rekognition documentation
2: Amazon A2I documentation
3: Amazon Rekognition Custom Labels documentation
4: Amazon A2I Private Workforce documentation
Question # 13
An Amazon SageMaker notebook instance is launched into Amazon VPC. The SageMaker notebook references data contained in an Amazon S3 bucket in another account. The bucket is encrypted using SSE-KMS. The instance returns an access denied error when trying to access data in Amazon S3. Which of the following are required to access the bucket and avoid the access denied error? (Select THREE.)
A. An AWS KMS key policy that allows access to the customer master key (CMK)
B. A SageMaker notebook security group that allows access to Amazon S3
C. An IAM role that allows access to the specific S3 bucket
D. A permissive S3 bucket policy
E. An S3 bucket owner that matches the notebook owner
F. A SageMaker notebook subnet ACL that allows traffic to Amazon S3
Answer: A,B,C
Explanation: To access an Amazon S3 bucket in another account that is encrypted using
SSE-KMS, the following are required:
A. An AWS KMS key policy that allows access to the customer master key (CMK).
The CMK is the encryption key that is used to encrypt and decrypt the data in the
S3 bucket. The KMS key policy defines who can use and manage the CMK. To
allow access to the CMK from another account, the key policy must include a
statement that grants the necessary permissions (such as kms:Decrypt) to the
principal from the other account (such as the SageMaker notebook IAM role).
B. A SageMaker notebook security group that allows access to Amazon S3. A
security group is a virtual firewall that controls the inbound and outbound traffic for
the SageMaker notebook instance. To allow the notebook instance to access the
S3 bucket, the security group must have a rule that allows outbound traffic to the
S3 endpoint on port 443 (HTTPS).
C. An IAM role that allows access to the specific S3 bucket. An IAM role is an
identity that can be assumed by the SageMaker notebook instance to access AWS
resources. The IAM role must have a policy that grants the necessary permissions
(such as s3:GetObject) to access the specific S3 bucket. The policy must also
include a condition that allows access to the CMK in the other account.
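As an illustration of requirements A and C, the relevant policy statements might look roughly like the following; the account IDs, role name, and bucket name are hypothetical placeholders.
# Hypothetical KMS key policy statement (requirement A) in the bucket owner's account,
# allowing the notebook's IAM role in the other account to decrypt the data
kms_key_policy_statement = {
    "Sid": "AllowCrossAccountDecrypt",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::222233334444:role/SageMakerNotebookRole"},  # placeholder
    "Action": ["kms:Decrypt", "kms:DescribeKey"],
    "Resource": "*",
}

# Hypothetical IAM policy statement (requirement C) attached to the notebook's role
s3_access_statement = {
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
        "arn:aws:s3:::example-shared-bucket",
        "arn:aws:s3:::example-shared-bucket/*",
    ],
}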
The following are not required or correct:
D. A permissive S3 bucket policy. A bucket policy is a resource-based policy that
defines who can access the S3 bucket and what actions they can perform. A
permissive bucket policy is not required and not recommended, as it can expose
the bucket to unauthorized access. A bucket policy should follow the principle of
least privilege and grant the minimum permissions necessary to the specific
principals that need access.
E. An S3 bucket owner that matches the notebook owner. The S3 bucket owner
and the notebook owner do not need to match, as long as the bucket owner grants
cross-account access to the notebook owner through the KMS key policy and the
bucket policy (if applicable).
F. A SegaMaker notebook subnet ACL that allow traffic to Amazon S3. A subnet
ACL is a network access control list that acts as an optional layer of security for
the SageMaker notebook instance’s subnet. A subnet ACL is not required to
access the S3 bucket, as the security group is sufficient to control the traffic.
However, if a subnet ACL is used, it must not block the traffic to the S3 endpoint.
Question # 14
A machine learning (ML) engineer has created a feature repository in Amazon SageMaker Feature Store for the company. The company has AWS accounts for development, integration, and production. The company hosts a feature store in the development account. The company uses Amazon S3 buckets to store feature values offline. The company wants to share features and to allow the integration account and the production account to reuse the features that are in the feature repository. Which combination of steps will meet these requirements? (Select TWO.)
A. Create an IAM role in the development account that the integration account and production account can assume. Attach IAM policies to the role that allow access to the feature repository and the S3 buckets.
B. Share the feature repository that is associated with the S3 buckets from the development account to the integration account and the production account by using AWS Resource Access Manager (AWS RAM).
C. Use AWS Security Token Service (AWS STS) from the integration account and the production account to retrieve credentials for the development account.
D. Set up S3 replication between the development S3 buckets and the integration and production S3 buckets.
E. Create an AWS PrivateLink endpoint in the development account for SageMaker.
Answer: A,B
Explanation:
The combination of steps that will meet the requirements are to create an IAM role in the
development account that the integration account and production account can assume,
attach IAM policies to the role that allow access to the feature repository and the S3
buckets, and share the feature repository that is associated with the S3 buckets from the
development account to the integration account and the production account by using AWS
Resource Access Manager (AWS RAM). This approach will enable cross-account access
and sharing of the features stored in Amazon SageMaker Feature Store and Amazon S3.
Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store,
update, search, and share curated data used in training and prediction workflows. The
service provides feature management capabilities such as enabling easy feature reuse, low
latency serving, time travel, and ensuring consistency between features used in training
and inference workflows. A feature group is a logical grouping of ML features whose
organization and structure is defined by a feature group schema. A feature group schema
consists of a list of feature definitions, each of which specifies the name, type, and
metadata of a feature. Amazon SageMaker Feature Store stores the features in both an
online store and an offline store. The online store is a low-latency, high-throughput store
that is optimized for real-time inference. The offline store is a historical store that is backed
by an Amazon S3 bucket and is optimized for batch processing and model training1.
AWS Identity and Access Management (IAM) is a web service that helps you securely
control access to AWS resources for your users. You use IAM to control who can use your
AWS resources (authentication) and what resources they can use and in what ways
(authorization). An IAM role is an IAM identity that you can create in your account that has
specific permissions. You can use an IAM role to delegate access to users, applications, or
services that don’t normally have access to your AWS resources. For example, you can create an IAM role in your development account that allows the integration account and the
production account to assume the role and access the resources in the development
account. You can attach IAM policies to the role that specify the permissions for the feature
repository and the S3 buckets. You can also use IAM conditions to restrict the access
based on the source account, IP address, or other factors2.
AWS Resource Access Manager (AWS RAM) is a service that enables you to easily and
securely share AWS resources with any AWS account or within your AWS Organization.
You can share AWS resources that you own with other accounts using resource shares. A
resource share is an entity that defines the resources that you want to share, and the
principals that you want to share with. For example, you can share the feature repository
that is associated with the S3 buckets from the development account to the integration
account and the production account by creating a resource share in AWS RAM. You can
specify the feature group ARN and the S3 bucket ARN as the resources, and the
integration account ID and the production account ID as the principals. You can also use
IAM policies to further control the access to the shared resources3.
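For illustration, a rough boto3 sketch of creating the resource share described above; whether a particular resource type can be shared this way depends on AWS RAM support for it, and every ARN, name, and account ID below is a placeholder.
import boto3

ram = boto3.client("ram")

ram.create_resource_share(
    name="feature-store-share",   # placeholder share name
    resourceArns=[
        # Placeholder feature group ARN in the development account
        "arn:aws:sagemaker:us-east-1:111122223333:feature-group/transactions-features",
    ],
    principals=["444455556666", "777788889999"],  # integration and production account IDs (placeholders)
    allowExternalPrincipals=False,
)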
The other options are either incorrect or unnecessary. Using AWS Security Token Service
(AWS STS) from the integration account and the production account to retrieve credentials
for the development account is not required, as the IAM role in the development account
can provide temporary security credentials for the cross-account access. Setting up S3
replication between the development S3 buckets and the integration and production S3
buckets would introduce redundancy and inconsistency, as the S3 buckets are already
shared through AWS RAM. Creating an AWS PrivateLink endpoint in the development
account for SageMaker is not relevant, as it is used to securely connect to SageMaker
services from a VPC, not from another account.
References:
1: Amazon SageMaker Feature Store – Amazon Web Services
2: What Is IAM? - AWS Identity and Access Management
3: What Is AWS Resource Access Manager? - AWS Resource Access Manager
Question # 15
A network security vendor needs to ingest telemetry data from thousands of endpoints that run all over the world. The data is transmitted every 30 seconds in the form of records that contain 50 fields. Each record is up to 1 KB in size. The security vendor uses Amazon Kinesis Data Streams to ingest the data. The vendor requires hourly summaries of the records that Kinesis Data Streams ingests. The vendor will use Amazon Athena to query the records and to generate the summaries. The Athena queries will target 7 to 12 of the available data fields. Which solution will meet these requirements with the LEAST amount of customization to transform and store the ingested data?
A. Use AWS Lambda to read and aggregate the data hourly. Transform the data and store it in Amazon S3 by using Amazon Kinesis Data Firehose.
B. Use Amazon Kinesis Data Firehose to read and aggregate the data hourly. Transform the data and store it in Amazon S3 by using a short-lived Amazon EMR cluster.
C. Use Amazon Kinesis Data Analytics to read and aggregate the data hourly. Transform the data and store it in Amazon S3 by using Amazon Kinesis Data Firehose.
D. Use Amazon Kinesis Data Firehose to read and aggregate the data hourly. Transform the data and store it in Amazon S3 by using AWS Lambda.
Answer: C
Explanation: The solution that will meet the requirements with the least amount of
customization to transform and store the ingested data is to use Amazon Kinesis Data
Analytics to read and aggregate the data hourly, transform the data and store it in Amazon
S3 by using Amazon Kinesis Data Firehose. This solution leverages the built-in features of
Kinesis Data Analytics to perform SQL queries on streaming data and generate hourly
summaries. Kinesis Data Analytics can also output the transformed data to Kinesis Data
Firehose, which can then deliver the data to S3 in a specified format and partitioning
scheme. This solution does not require any custom code or additional infrastructure to
process the data. The other solutions either require more customization (such as using
Lambda or EMR) or do not meet the requirement of aggregating the data hourly (such as
using Lambda to read the data from Kinesis Data Streams).
References:
1: Boosting Resiliency with an ML-based Telemetry Analytics Architecture | AWS
Architecture Blog
2: AWS Cloud Data Ingestion Patterns and Practices
3: IoT ingestion and Machine Learning analytics pipeline with AWS IoT …
4: AWS IoT Data Ingestion Simplified 101: The Complete Guide - Hevo Data
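For a rough idea of the hourly aggregation described above, a Kinesis Data Analytics SQL application over a tumbling window might look like the sketch below; the stream and column names are placeholders, and the query is shown as a Python string for reference only.
hourly_summary_sql = """
CREATE OR REPLACE STREAM "HOURLY_SUMMARY" (endpoint_id VARCHAR(64), event_count INTEGER, avg_score DOUBLE);

CREATE OR REPLACE PUMP "SUMMARY_PUMP" AS
INSERT INTO "HOURLY_SUMMARY"
SELECT STREAM "endpoint_id",
       COUNT(*) AS event_count,
       AVG("score") AS avg_score
FROM "SOURCE_SQL_STREAM_001"
GROUP BY "endpoint_id",
         STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' MINUTE);
"""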
Question # 16
A data scientist is building a linear regression model. The scientist inspects the dataset and notices that the mode of the distribution is lower than the median, and the median is lower than the mean. Which data transformation will give the data scientist the ability to apply a linear regression model?
A. Exponential transformation
B. Logarithmic transformation
C. Polynomial transformation
D. Sinusoidal transformation
Answer: B
Explanation: A logarithmic transformation is a suitable data transformation for a linear
regression model when the data has a skewed distribution, such as when the mode is
lower than the median and the median is lower than the mean. A logarithmic transformation
can reduce the skewness and make the data more symmetric and normally distributed,
which are desirable properties for linear regression. A logarithmic transformation can also
reduce the effect of outliers and heteroscedasticity (unequal variance) in the data. An
exponential transformation would have the opposite effect of increasing the skewness and
making the data more asymmetric. A polynomial transformation may not be able to capture
the nonlinearity in the data and may introduce multicollinearity among the transformed
variables. A sinusoidal transformation is not appropriate for data that does not have a
periodic pattern.
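A quick numeric sketch of the effect on synthetic right-skewed data, for illustration only:
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # synthetic right-skewed data

print(stats.skew(skewed))            # strongly positive skew before the transformation
print(stats.skew(np.log(skewed)))    # roughly zero after the logarithmic transformation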
References:
Data Transformation - Scaler Topics
Linear Regression - GeeksforGeeks
Linear Regression - Scribbr
Question # 17
A car company is developing a machine learning solution to detect whether a car is present in an image. The image dataset consists of one million images. Each image in the dataset is 200 pixels in height by 200 pixels in width. Each image is labeled as either having a car or not having a car. Which architecture is MOST likely to produce a model that detects whether a car is present in an image with the highest accuracy?
A. Use a deep convolutional neural network (CNN) classifier with the images as input. Include a linear output layer that outputs the probability that an image contains a car.
B. Use a deep convolutional neural network (CNN) classifier with the images as input. Include a softmax output layer that outputs the probability that an image contains a car.
C. Use a deep multilayer perceptron (MLP) classifier with the images as input. Include a linear output layer that outputs the probability that an image contains a car.
D. Use a deep multilayer perceptron (MLP) classifier with the images as input. Include a softmax output layer that outputs the probability that an image contains a car.
Answer: A
Explanation: A deep convolutional neural network (CNN) classifier is a suitable
architecture for image classification tasks, as it can learn features from the images and
reduce the dimensionality of the input. A linear output layer that outputs the probability that
an image contains a car is appropriate for a binary classification problem, as it can produce
a single scalar value between 0 and 1. A softmax output layer is more suitable for a multiclass
classification problem, as it can produce a vector of probabilities that sum up to 1. A
deep multilayer perceptron (MLP) classifier is not as effective as a CNN for image
classification, as it does not exploit the spatial structure of the images and requires a large
number of parameters to process the high-dimensional input.
References:
AWS Whitepaper - An Overview of Machine Learning on AWS
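To make the architecture concrete, here is a minimal, hedged Keras sketch (assuming TensorFlow is available; layer sizes are illustrative, not from the exam source) of a small CNN for the 200x200 binary car/no-car task, ending in a single output unit whose sigmoid squashes the linear score to a probability.

```python
# Hedged sketch of a binary image classifier; layer sizes are illustrative only.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(200, 200, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    # Single output unit: the sigmoid turns the linear score into P(car).
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```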
Question # 18
A university wants to develop a targeted recruitment strategy to increase new student enrollment. A data scientist gathers information about the academic performance history of students. The data scientist wants to use the data to build student profiles. The university will use the profiles to direct resources to recruit students who are likely to enroll in the university. Which combination of steps should the data scientist take to predict whether a particular student applicant is likely to enroll in the university? (Select TWO.)
A. Use Amazon SageMaker Ground Truth to sort the data into two groups named "enrolled" or "not enrolled." B. Use a forecasting algorithm to run predictions. C. Use a regression algorithm to run predictions. D. Use a classification algorithm to run predictions. E. Use the built-in Amazon SageMaker k-means algorithm to cluster the data into two groups named "enrolled" or "not enrolled."
Answer: A,D
Explanation: The data scientist should use Amazon SageMaker Ground Truth to sort the
data into two groups named “enrolled” or “not enrolled.” This will create a labeled dataset
that can be used for supervised learning. The data scientist should then use a classification
algorithm to run predictions on the test data. A classification algorithm is a suitable choice
for predicting a binary outcome, such as enrollment status, based on the input features,
such as academic performance. A classification algorithm will output a probability for each
class label and assign the most likely label to each observation.
References:
Use Amazon SageMaker Ground Truth to Label Data
Classification Algorithm in Machine Learning
Question # 19
An insurance company developed a new experimental machine learning (ML) model to replace an existing model that is in production. The company must validate the quality of predictions from the new experimental model in a production environment before the company uses the new experimental model to serve general user requests. Only one model can serve user requests at a time. The company must measure the performance of the new experimental model without affecting the current live traffic. Which solution will meet these requirements?
A. A/B testing B. Canary release C. Shadow deployment D. Blue/green deployment
Answer: C
Explanation: The best solution for this scenario is to use shadow deployment, which is a technique that allows the company to run the new experimental model in parallel with the
existing model, without exposing it to the end users. In shadow deployment, the company
can route the same user requests to both models, but only return the responses from the
existing model to the users. The responses from the new experimental model are logged
and analyzed for quality and performance metrics, such as accuracy, latency, and resource
consumption12. This way, the company can validate the new experimental model in a
production environment, without affecting the current live traffic or user experience.
The other solutions are not suitable, because they have the following drawbacks:
A: A/B testing is a technique that involves splitting the user traffic between two or
more models, and comparing their outcomes based on predefined
metrics. However, this technique exposes the new experimental model to a portion
of the end users, which might affect their experience if the model is not reliable or
consistent with the existing model3.
B: Canary release is a technique that involves gradually rolling out the new
experimental model to a small subset of users, and monitoring its performance and
feedback. However, this technique also exposes the new experimental model to
some end users, and requires careful selection and segmentation of the user
groups4.
D: Blue/green deployment is a technique that involves switching the user traffic
from the existing model (blue) to the new experimental model (green) at once,
after testing and verifying the new model in a separate environment. However, this
technique does not allow the company to validate the new experimental model in a
production environment, and might cause service disruption or inconsistency if the
new model is not compatible or stable5.
References:
1: Shadow Deployment: A Safe Way to Test in Production | LaunchDarkly Blog
2: Shadow Deployment: A Safe Way to Test in Production | LaunchDarkly Blog
3: A/B Testing for Machine Learning Models | AWS Machine Learning Blog
4: Canary Releases for Machine Learning Models | AWS Machine Learning Blog
5: Blue-Green Deployments for Machine Learning Models | AWS Machine
Learning Blog
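As an illustration of the recommended approach, the following is a hedged boto3 sketch (endpoint, model, and instance names are hypothetical) that creates an endpoint configuration with a shadow variant: production traffic is answered by the existing model, while the same requests are mirrored to the experimental model, whose responses are only captured for analysis.

```python
# Hedged sketch: shadow testing with SageMaker shadow variants (names are hypothetical).
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="claims-shadow-config",
    ProductionVariants=[{
        "VariantName": "production",
        "ModelName": "existing-model",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
    }],
    # Requests are mirrored here; responses are logged, never returned to users.
    ShadowProductionVariants=[{
        "VariantName": "shadow",
        "ModelName": "experimental-model",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,   # mirror 100% of the live traffic
    }],
)
sm.create_endpoint(
    EndpointName="claims-endpoint",
    EndpointConfigName="claims-shadow-config",
)
```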
Question # 20
A company wants to detect credit card fraud. The company has observed that an average of 2% of credit card transactions are fraudulent. A data scientist trains a classifier on a year's worth of credit card transaction data. The classifier needs to identify the fraudulent transactions. The company wants to accurately capture as many fraudulent transactions as possible. Which metrics should the data scientist use to optimize the classifier? (Select TWO.)
A. Specificity B. False positive rate C. Accuracy D. F1 score E. True positive rate
Answer: D,E
Explanation: The F1 score is a measure of the harmonic mean of precision and recall,
which are both important for fraud detection. Precision is the ratio of true positives to all
predicted positives, and recall is the ratio of true positives to all actual positives. A high F1
score indicates that the classifier can correctly identify fraudulent transactions and avoid
false negatives. The true positive rate is another name for recall, and it measures the
proportion of fraudulent transactions that are correctly detected by the classifier. A high true
positive rate means that the classifier can capture as many fraudulent transactions as
possible.
References:
Fraud Detection Using Machine Learning | Implementations | AWS Solutions
Detect fraudulent transactions using machine learning with Amazon SageMaker |
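A short, hedged scikit-learn sketch (toy labels, not exam material) shows how the two selected metrics are computed; recall is the true positive rate.

```python
# Hedged illustration: F1 score and recall (true positive rate) on toy labels.
from sklearn.metrics import f1_score, recall_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]   # 1 = fraudulent transaction
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]   # classifier output

print("F1 score:", f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print("Recall (TPR):", recall_score(y_true, y_pred))   # share of fraud actually caught
```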
Question # 21
A company deployed a machine learning (ML) model on the company website to predict real estate prices. Several months after deployment, an ML engineer notices that the accuracy of the model has gradually decreased. The ML engineer needs to improve the accuracy of the model. The engineer also needs to receive notifications for any future performance issues. Which solution will meet these requirements?
A. Perform incremental training to update the model. Activate Amazon SageMaker Model Monitor to detect model performance issues and to send notifications. B. Use Amazon SageMaker Model Governance. Configure Model Governance to automatically adjust model hyperparameters. Create a performance threshold alarm in Amazon CloudWatch to send notifications. C. Use Amazon SageMaker Debugger with appropriate thresholds. Configure Debugger to send Amazon CloudWatch alarms to alert the team. Retrain the model by using only data from the previous several months. D. Use only data from the previous several months to perform incremental training to update the model. Use Amazon SageMaker Model Monitor to detect model performance issues and to send notifications.
Answer: A
Explanation: The best solution to improve the accuracy of the model and receive
notifications for any future performance issues is to perform incremental training to update
the model and activate Amazon SageMaker Model Monitor to detect model performance
issues and to send notifications. Incremental training is a technique that allows you to
update an existing model with new data without retraining the entire model from scratch.
This can save time and resources, and help the model adapt to changing data patterns.
Amazon SageMaker Model Monitor is a feature that continuously monitors the quality of
machine learning models in production and notifies you when there are deviations in the
model quality, such as data drift and anomalies. You can set up alerts that trigger actions,
such as sending notifications to Amazon Simple Notification Service (Amazon SNS) topics,
when certain conditions are met.
Option B is incorrect because Amazon SageMaker Model Governance is a set of tools that
help you implement ML responsibly by simplifying access control and enhancing
transparency. It does not provide a mechanism to automatically adjust model
hyperparameters or improve model accuracy.
Option C is incorrect because Amazon SageMaker Debugger is a feature that helps you
debug and optimize your model training process by capturing relevant data and providing
real-time analysis. However, using Debugger alone does not update the model or monitor
its performance in production. Also, retraining the model by using only data from the
previous several months may not capture the full range of data variability and may
introduce bias or overfitting.
Option D is incorrect because using only data from the previous several months to perform
incremental training may not be sufficient to improve the model accuracy, as explained
above. Moreover, this option does not specify how to activate Amazon SageMaker Model
Monitor or configure the alerts and notifications.
References:
Incremental training
Amazon SageMaker Model Monitor
Amazon SageMaker Model Governance
Amazon SageMaker Debugger
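The monitoring half of the recommended answer can be sketched with the SageMaker Python SDK. The role ARN, S3 paths, and endpoint name below are hypothetical placeholders; the sketch suggests a baseline from the training data and then schedules hourly data-quality monitoring against the live endpoint.

```python
# Hedged sketch: enabling SageMaker Model Monitor on an existing endpoint.
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Compute baseline statistics and constraints from the training data.
monitor.suggest_baseline(
    baseline_dataset="s3://example-bucket/training/train.csv",  # hypothetical path
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://example-bucket/monitoring/baseline",
)

# Compare hourly captures from the endpoint against the baseline; violations can
# then drive a CloudWatch alarm and an Amazon SNS notification.
monitor.create_monitoring_schedule(
    monitor_schedule_name="real-estate-data-quality",
    endpoint_input="real-estate-endpoint",                      # hypothetical endpoint
    output_s3_uri="s3://example-bucket/monitoring/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```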
Question # 22
A retail company wants to build a recommendation system for the company's website. The system needs to provide recommendations for existing users and needs to base those recommendations on each user's past browsing history. The system also must filter out any items that the user previously purchased. Which solution will meet these requirements with the LEAST development effort?
A. Train a model by using a user-based collaborative filtering algorithm on Amazon SageMaker. Host the model on a SageMaker real-time endpoint. Configure an Amazon API Gateway API and an AWS Lambda function to handle real-time inference requests that the web application sends. Exclude the items that the user previously purchased from the results before sending the results back to the web application. B. Use an Amazon Personalize PERSONALIZED_RANKING recipe to train a model. Create a real-time filter to exclude items that the user previously purchased. Create and deploy a campaign on Amazon Personalize. Use the GetPersonalizedRanking API operation to get the real-time recommendations. C. Use an Amazon Personalize USER_PERSONALIZATION recipe to train a model. Create a real-time filter to exclude items that the user previously purchased. Create and deploy a campaign on Amazon Personalize. Use the GetRecommendations API operation to get the real-time recommendations. D. Train a neural collaborative filtering model on Amazon SageMaker by using GPU instances. Host the model on a SageMaker real-time endpoint. Configure an Amazon API Gateway API and an AWS Lambda function to handle real-time inference requests that the web application sends. Exclude the items that the user previously purchased from the results before sending the results back to the web application.
Answer: C
Explanation: Amazon Personalize is a fully managed machine learning service that makes
it easy for developers to create personalized user experiences at scale. It uses the same
recommender system technology that Amazon uses to create its own personalized
recommendations. Amazon Personalize provides several pre-built recipes that can be used
to train models for different use cases. The USER_PERSONALIZATION recipe is designed
to provide personalized recommendations for existing users based on their past
interactions with items. The PERSONALIZED_RANKING recipe is designed to re-rank a
list of items for a user based on their preferences. The USER_PERSONALIZATION recipe
is more suitable for this use case because it can generate recommendations for each user
without requiring a list of candidate items. To filter out the items that the user previously
purchased, a real-time filter can be created and applied to the campaign. A real-time filter is
a dynamic filter that uses the latest interaction data to exclude items from the
recommendations. By using Amazon Personalize, the development effort is minimized
because it handles the data processing, model training, and deployment automatically. The
web application can use the GetRecommendations API operation to get the real-time
recommendations from the campaign. References:
Amazon Personalize
What is Amazon Personalize?
USER_PERSONALIZATION recipe
PERSONALIZED_RANKING recipe
Filtering recommendations
GetRecommendations API operation
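A hedged boto3 sketch of the retrieval call (the campaign ARN, filter ARN, and user ID are hypothetical) shows how the web application would request filtered recommendations at runtime.

```python
# Hedged sketch: real-time, filtered recommendations from an Amazon Personalize campaign.
import boto3

personalize_runtime = boto3.client("personalize-runtime")

response = personalize_runtime.get_recommendations(
    campaignArn="arn:aws:personalize:us-east-1:123456789012:campaign/site-recs",  # hypothetical
    userId="user-123",
    numResults=10,
    # Filter created in Personalize, e.g. EXCLUDE ItemID WHERE Interactions.EVENT_TYPE IN ("purchase")
    filterArn="arn:aws:personalize:us-east-1:123456789012:filter/exclude-purchased",  # hypothetical
)

for item in response["itemList"]:
    print(item["itemId"], item.get("score"))
```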
Question # 23
A machine learning (ML) specialist is using Amazon SageMaker hyperparameter optimization (HPO) to improve a model's accuracy. The learning rate parameter is specified in the following HPO configuration:
During the results analysis, the ML specialist determines that most of the training jobs had a learning rate between 0.01 and 0.1. The best result had a learning rate of less than 0.01. Training jobs need to run regularly over a changing dataset. The ML specialist needs to find a tuning mechanism that uses different learning rates more evenly from the provided range between MinValue and MaxValue. Which solution provides the MOST accurate result?
A. Modify the HPO configuration as follows:
Select the most accurate hyperparameter configuration from this HPO job. B. Run three different HPO jobs that use different learning rates from the following intervals for MinValue and MaxValue while using the same number of training jobs for each HPO job: [0.01, 0.1], [0.001, 0.01], [0.0001, 0.001]. Select the most accurate hyperparameter configuration from these three HPO jobs. C. Modify the HPO configuration as follows:
Select the most accurate hyperparameter configuration from this training job. D. Run three different HPO jobs that use different learning rates from the following intervals for MinValue and MaxValue. Divide the number of training jobs for each HPO job by three: [0.01, 0.1], [0.001, 0.01], [0.0001, 0.001]. Select the most accurate hyperparameter configuration from these three HPO jobs.
Answer: C
Explanation: The solution C modifies the HPO configuration to use a logarithmic scale for
the learning rate parameter. This means that the values of the learning rate are sampled
from a log-uniform distribution, which gives more weight to smaller values. This can help to
explore the lower end of the range more evenly and find the optimal learning rate more
efficiently. The other solutions either use a linear scale, which may not sample enough
values from the lower end, or divide the range into sub-intervals, which may miss some combinations of hyperparameters.
References:
How Hyperparameter Tuning Works - Amazon SageMaker
Tuning Hyperparameters - Amazon SageMaker
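The relevant change is the scaling type of the learning-rate range. Below is a hedged sketch of that fragment of a create_hyper_parameter_tuning_job configuration; the parameter name comes from the question, while the bounds shown are illustrative.

```python
# Hedged sketch: sample the learning rate on a logarithmic scale so values
# near the lower bound are explored as evenly as values near the upper bound.
parameter_ranges = {
    "ContinuousParameterRanges": [
        {
            "Name": "learning_rate",
            "MinValue": "0.0001",
            "MaxValue": "0.1",
            "ScalingType": "Logarithmic",   # instead of the default linear/auto scaling
        }
    ]
}
print(parameter_ranges)
```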
Question # 24
A data engineer is preparing a dataset that a retail company will use to predict the number of visitors to stores. The data engineer created an Amazon S3 bucket. The engineer subscribed the S3 bucket to an AWS Data Exchange data product for general economic indicators. The data engineer wants to join the economic indicator data to an existing table in Amazon Athena to merge with the business data. All these transformations must finish running in 30-60 minutes. Which solution will meet these requirements MOST cost-effectively?
A. Configure the AWS Data Exchange product as a producer for an Amazon Kinesis data stream. Use an Amazon Kinesis Data Firehose delivery stream to transfer the data to Amazon S3. Run an AWS Glue job that will merge the existing business data with the Athena table. Write the result set back to Amazon S3. B. Use an S3 event on the AWS Data Exchange S3 bucket to invoke an AWS Lambda function. Program the Lambda function to use Amazon SageMaker Data Wrangler to merge the existing business data with the Athena table. Write the result set back to Amazon S3. C. Use an S3 event on the AWS Data Exchange S3 bucket to invoke an AWS Lambda function. Program the Lambda function to run an AWS Glue job that will merge the existing business data with the Athena table. Write the results back to Amazon S3. D. Provision an Amazon Redshift cluster. Subscribe to the AWS Data Exchange product and use the product to create an Amazon Redshift table. Merge the data in Amazon Redshift. Write the results back to Amazon S3.
Answer: B
Explanation: The most cost-effective solution is to use an S3 event to trigger a Lambda function that uses SageMaker Data Wrangler to merge the data. This solution avoids the
need to provision and manage any additional resources, such as Kinesis streams, Firehose
delivery streams, Glue jobs, or Redshift clusters. SageMaker Data Wrangler provides a
visual interface to import, prepare, transform, and analyze data from various sources,
including AWS Data Exchange products. It can also export the data preparation workflow to
a Python script that can be executed by a Lambda function. This solution can meet the time
requirement of 30-60 minutes, depending on the size and complexity of the data.
References:
Using Amazon S3 Event Notifications
Prepare ML Data with Amazon SageMaker Data Wrangler
AWS Lambda Function
Question # 25
An online delivery company wants to choose the fastest courier for each delivery at the moment an order is placed. The company wants to implement this feature for existing users and new users of its application. Data scientists have trained separate models with XGBoost for this purpose, and the models are stored in Amazon S3. There is one model for each city where the company operates. The engineers are hosting these models in Amazon EC2 for responding to the web client requests, with one instance for each model, but the instances have only a 5% utilization in CPU and memory. The operations engineers want to avoid managing unnecessary resources. Which solution will enable the company to achieve its goal with the LEAST operational overhead?
A. Create an Amazon SageMaker notebook instance for pulling all the models from Amazon S3 using the boto3 library. Remove the existing instances and use the notebook to perform a SageMaker batch transform for performing inferences offline for all the possible users in all the cities. Store the results in different files in Amazon S3. Point the web client to the files. B. Prepare an Amazon SageMaker Docker container based on the open-source multi-model server. Remove the existing instances and create a multi-model endpoint in SageMaker instead, pointing to the S3 bucket containing all the models. Invoke the endpoint from the web client at runtime, specifying the TargetModel parameter according to the city of each request. C. Keep only a single EC2 instance for hosting all the models. Install a model server in the instance and load each model by pulling it from Amazon S3. Integrate the instance with the web client using Amazon API Gateway for responding to the requests in real time, specifying the target resource according to the city of each request. D. Prepare a Docker container based on the prebuilt images in Amazon SageMaker. Replace the existing instances with separate SageMaker endpoints, one for each city where the company operates. Invoke the endpoints from the web client, specifying the URL and EndpointName parameter according to the city of each request.
Answer: B
Explanation: The best solution for this scenario is to use a multi-model endpoint in
Amazon SageMaker, which allows hosting multiple models on the same endpoint and
invoking them dynamically at runtime. This way, the company can reduce the operational
overhead of managing multiple EC2 instances and model servers, and leverage the
scalability, security, and performance of SageMaker hosting services. By using a multimodel
endpoint, the company can also save on hosting costs by improving endpoint
utilization and paying only for the models that are loaded in memory and the API calls that
are made. To use a multi-model endpoint, the company needs to prepare a Docker
container based on the open-source multi-model server, which is a framework-agnostic
library that supports loading and serving multiple models from Amazon S3. The company
can then create a multi-model endpoint in SageMaker, pointing to the S3 bucket containing
all the models, and invoke the endpoint from the web client at runtime, specifying the
TargetModel parameter according to the city of each request. This solution also enables
the company to add or remove models from the S3 bucket without redeploying the
endpoint, and to use different versions of the same model for different cities if
needed. References:
Use Docker containers to build models
Host multiple models in one container behind one endpoint
Multi-model endpoints using Scikit Learn
Multi-model endpoints using XGBoost
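Once the multi-model endpoint is in place, each request selects its city's model artifact via the TargetModel parameter. A hedged boto3 sketch (endpoint name, artifact key, and payload are hypothetical) is shown below.

```python
# Hedged sketch: invoking one model of a SageMaker multi-model endpoint.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="courier-multi-model-endpoint",   # hypothetical endpoint
    TargetModel="madrid.tar.gz",                   # model artifact key under the S3 prefix
    ContentType="text/csv",
    Body="3.2,14.0,1,0.7",                         # illustrative feature vector
)
print(response["Body"].read().decode("utf-8"))
```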
Question # 26
A company is using Amazon Polly to translate plaintext documents to speech for automated company announcements. However, company acronyms are being mispronounced in the current documents. How should a Machine Learning Specialist address this issue for future documents?
A. Convert current documents to SSML with pronunciation tags B. Create an appropriate pronunciation lexicon. C. Output speech marks to guide in pronunciation D. Use Amazon Lex to preprocess the text files for pronunciation
Answer: B
Explanation: A pronunciation lexicon is a file that defines how words or phrases should be
pronounced by Amazon Polly. A lexicon can help customize the speech output for words
that are uncommon, foreign, or have multiple pronunciations. A lexicon must conform to the
Pronunciation Lexicon Specification (PLS) standard and can be stored in an AWS region
using the Amazon Polly API. To apply a lexicon when synthesizing speech, the lexicon name is passed in the LexiconNames parameter of the SynthesizeSpeech API call. For example, a lexicon can define an alias so that the acronym "W3C" is spoken as "World Wide Web Consortium."
References:
Customize pronunciation using lexicons in Amazon Polly: A blog post that explains
how to use lexicons for creating custom pronunciations.
Managing Lexicons: A documentation page that describes how to store and
retrieve lexicons using the Amazon Polly API.
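A hedged boto3 sketch (the lexicon name and alias are illustrative) of uploading a PLS lexicon and then synthesizing speech with it:

```python
# Hedged sketch: store a pronunciation lexicon and use it when synthesizing speech.
import boto3

polly = boto3.client("polly")

lexicon_content = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
</lexicon>"""

polly.put_lexicon(Name="acronyms", Content=lexicon_content)   # hypothetical lexicon name

response = polly.synthesize_speech(
    Text="The W3C sets web standards.",
    VoiceId="Joanna",
    OutputFormat="mp3",
    LexiconNames=["acronyms"],   # apply the custom pronunciations
)
with open("announcement.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```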
Question # 27
A company wants to predict the classification of documents that are created from an application. New documents are saved to an Amazon S3 bucket every 3 seconds. The company has developed three versions of a machine learning (ML) model within Amazon SageMaker to classify document text. The company wants to deploy these three versions to predict the classification of each document. Which approach will meet these requirements with the LEAST operational overhead?
A. Configure an S3 event notification that invokes an AWS Lambda function when new documents are created. Configure the Lambda function to create three SageMaker batch transform jobs, one batch transform job for each model for each document. B. Deploy all the models to a single SageMaker endpoint. Treat each model as a production variant. Configure an S3 event notification that invokes an AWS Lambda function when new documents are created. Configure the Lambda function to call each production variant and return the results of each model. C. Deploy each model to its own SageMaker endpoint. Configure an S3 event notification that invokes an AWS Lambda function when new documents are created. Configure the Lambda function to call each endpoint and return the results of each model. D. Deploy each model to its own SageMaker endpoint. Create three AWS Lambda functions. Configure each Lambda function to call a different endpoint and return the results. Configure three S3 event notifications to invoke the Lambda functions when new documents are created.
Answer: B
Explanation: The approach that will meet the requirements with the least operational
overhead is to deploy all the models to a single SageMaker endpoint, treat each model as
a production variant, configure an S3 event notification that invokes an AWS Lambda
function when new documents are created, and configure the Lambda function to call each
production variant and return the results of each model. This approach involves the
following steps:
Deploy all the models to a single SageMaker endpoint. Amazon SageMaker is a
service that can build, train, and deploy machine learning models. Amazon
SageMaker can deploy multiple models to a single endpoint, which is a web
service that can serve predictions from the models. Each model can be treated as
a production variant, which is a version of the model that runs on one or more
instances. Amazon SageMaker can distribute the traffic among the production
variants according to the specified weights1.
Treat each model as a production variant. Hosting the three versions as variants of the same endpoint keeps a single resource to manage while still allowing each version to be invoked individually1.
Configure an S3 event notification that invokes an AWS Lambda function when
new documents are created. Amazon S3 is a service that can store and retrieve
any amount of data. Amazon S3 can send event notifications when certain actions
occur on the objects in a bucket, such as object creation, deletion, or modification.
Amazon S3 can invoke an AWS Lambda function as a destination for the event
notifications. AWS Lambda is a service that can run code without provisioning or managing servers2.
Configure the Lambda function to call each production variant and return the
results of each model. AWS Lambda can execute the code that can call the
SageMaker endpoint and specify the production variant to invoke. AWS Lambda
can use the AWS SDK or the SageMaker Runtime API to send requests to the
endpoint and receive the predictions from the models. AWS Lambda can return
the results of each model as a response to the event notification3.
The other options are not suitable because:
Option A: Configuring an S3 event notification that invokes an AWS Lambda
function when new documents are created, configuring the Lambda function to
create three SageMaker batch transform jobs, one batch transform job for each
model for each document, will incur more operational overhead than using a single
SageMaker endpoint. Amazon SageMaker batch transform is a service that can
process large datasets in batches and store the predictions in Amazon S3.
Amazon SageMaker batch transform is not suitable for real-time inference, as it
introduces a delay between the request and the response. Moreover, creating
three batch transform jobs for each document will increase the complexity and cost
of the solution4.
Option C: Deploying each model to its own SageMaker endpoint, configuring an
S3 event notification that invokes an AWS Lambda function when new documents
are created, configuring the Lambda function to call each endpoint and return the
results of each model, will incur more operational overhead than using a single
SageMaker endpoint. Deploying each model to its own endpoint will increase the
number of resources and endpoints to manage and monitor. Moreover, calling
each endpoint separately will increase the latency and network traffic of the
solution5.
Option D: Deploying each model to its own SageMaker endpoint, creating three
AWS Lambda functions, configuring each Lambda function to call a different
endpoint and return the results, configuring three S3 event notifications to invoke
the Lambda functions when new documents are created, will incur more
operational overhead than using a single SageMaker endpoint and a single
Lambda function. Deploying each model to its own endpoint will increase the
number of resources and endpoints to manage and monitor. Creating three
Lambda functions will increase the complexity and cost of the solution. Configuring
three S3 event notifications will increase the number of triggers and destinations to
manage and monitor6.
References:
1: Deploying Multiple Models to a Single Endpoint - Amazon SageMaker
4: Get Inferences for an Entire Dataset with Batch Transform - Amazon
SageMaker
5: Deploy a Model - Amazon SageMaker
6: AWS Lambda
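A hedged boto3 sketch (model, endpoint, and variant names are hypothetical) of deploying the three versions as production variants of one endpoint and invoking a specific variant from the Lambda function:

```python
# Hedged sketch: three model versions as production variants of one endpoint,
# then invoking a specific variant by name.
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

sm.create_endpoint_config(
    EndpointConfigName="doc-classifier-config",
    ProductionVariants=[
        {"VariantName": f"v{i}", "ModelName": f"doc-classifier-v{i}",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 1.0}
        for i in (1, 2, 3)
    ],
)
sm.create_endpoint(EndpointName="doc-classifier",
                   EndpointConfigName="doc-classifier-config")

# Inside the Lambda function: call every variant for the new document.
def classify(document_text: str) -> dict:
    results = {}
    for variant in ("v1", "v2", "v3"):
        response = runtime.invoke_endpoint(
            EndpointName="doc-classifier",
            TargetVariant=variant,          # route to one specific production variant
            ContentType="text/csv",
            Body=document_text,
        )
        results[variant] = response["Body"].read().decode("utf-8")
    return results
```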
Question # 28
A company wants to create an artificial intelligence (AI) yoga instructor that can lead large classes of students. The company needs to create a feature that can accurately count the number of students who are in a class. The company also needs a feature that can differentiate students who are performing a yoga stretch correctly from students who are performing a stretch incorrectly. To determine whether students are performing a stretch correctly, the solution needs to measure the location and angle of each student's arms and legs. A data scientist must use Amazon SageMaker to process video footage of a yoga class by extracting image frames and applying computer vision models. Which combination of models will meet these requirements with the LEAST effort? (Select TWO.)
A. Image Classification B. Optical Character Recognition (OCR) C. Object Detection D. Pose estimation E. Image Generative Adversarial Networks (GANs)
Answer: C,D
Explanation: To count the number of students who are in a class, the solution needs to
detect and locate each student in the video frame. Object detection is a computer vision
model that can identify and locate multiple objects in an image. To differentiate students
who are performing a stretch correctly from students who are performing a stretch
incorrectly, the solution needs to measure the location and angle of each student’s arms
and legs. Pose estimation is a computer vision model that can estimate the pose of a person by detecting the position and orientation of key body parts. Image classification,
OCR, and image GANs are not relevant for this use case. References:
Object Detection: A computer vision technique that identifies and locates objects
within an image or video.
Pose Estimation: A computer vision technique that estimates the pose of a person
by detecting the position and orientation of key body parts.
Amazon SageMaker: A fully managed service that provides every developer and
data scientist with the ability to build, train, and deploy machine learning (ML)
models quickly.
Question # 29
A data scientist is working on a public sector project for an urban traffic system. While studying the traffic patterns, it is clear to the data scientist that the traffic behavior at each light is correlated, subject to a small stochastic error term. The data scientist must model the traffic behavior to analyze the traffic patterns and reduce congestion. How will the data scientist MOST effectively model the problem?
A. The data scientist should obtain a correlated equilibrium policy by formulating this problem as a multi-agent reinforcement learning problem. B. The data scientist should obtain the optimal equilibrium policy by formulating this problem as a single-agent reinforcement learning problem. C. Rather than finding an equilibrium policy, the data scientist should obtain accurate predictors of traffic flow by using historical data through a supervised learning approach. D. Rather than finding an equilibrium policy, the data scientist should obtain accurate predictors of traffic flow by using unlabeled simulated data representing the new traffic patterns in the city and applying an unsupervised learning approach.
Answer: A
Explanation: The data scientist should obtain a correlated equilibrium policy by formulating
this problem as a multi-agent reinforcement learning problem. This is because:
Multi-agent reinforcement learning (MARL) is a subfield of reinforcement learning
that deals with learning and coordination of multiple agents that interact with each
other and the environment 1. MARL can be applied to problems that involve
distributed decision making, such as traffic signal control, where each traffic light
can be modeled as an agent that observes the traffic state and chooses an action
(e.g., changing the signal phase) to optimize a reward function (e.g., minimizing
the delay or congestion) 2.
A correlated equilibrium is a solution concept in game theory that generalizes the
notion of Nash equilibrium. It is a probability distribution over the joint actions of
the agents that satisfies the following condition: no agent can improve its expected
payoff by deviating from the distribution, given that it knows the distribution and the
actions of the other agents 3. A correlated equilibrium can capture the correlation
among the agents’ actions, which is useful for modeling the traffic behavior at each
light that is subject to a small stochastic error term.
A correlated equilibrium policy is a policy that induces a correlated equilibrium in a
MARL setting. It can be obtained by using various methods, such as policy
gradient, actor-critic, or Q-learning algorithms, that can learn from the feedback of the environment and the communication among the agents 4. A correlated
equilibrium policy can achieve a better performance than a Nash equilibrium
policy, which assumes that the agents act independently and ignore the correlation
among their actions 5.
Therefore, by obtaining a correlated equilibrium policy by formulating this problem as a
MARL problem, the data scientist can most effectively model the traffic behavior and
reduce congestion.
References:
Multi-Agent Reinforcement Learning
Multi-Agent Reinforcement Learning for Traffic Signal Control: A Survey
Correlated Equilibrium
Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments
Correlated Q-Learning
Question # 30
An ecommerce company wants to use machine learning (ML) to monitor fraudulent transactions on its website. The company is using Amazon SageMaker to research, train, deploy, and monitor the ML models. The historical transactions data is in a .csv file that is stored in Amazon S3. The data contains features such as the user's IP address, navigation time, average time on each page, and the number of clicks for each session. There is no label in the data to indicate if a transaction is anomalous. Which models should the company use in combination to detect anomalous transactions? (Select TWO.)
A. IP Insights B. K-nearest neighbors (k-NN) C. Linear learner with a logistic function D. Random Cut Forest (RCF) E. XGBoost
Answer: D,E
Explanation: To detect anomalous transactions, the company can use a combination of
Random Cut Forest (RCF) and XGBoost models. RCF is an unsupervised algorithm that
can detect outliers in the data by measuring the depth of each data point in a collection of
random decision trees. XGBoost is a supervised algorithm that can learn from the labeled
data points generated by RCF and classify them as normal or anomalous. RCF can also
provide anomaly scores that can be used as features for XGBoost to improve the accuracy
of the classification.
References:
1: Amazon SageMaker Random Cut Forest
2: Amazon SageMaker XGBoost Algorithm
3: Anomaly Detection with Amazon SageMaker Random Cut Forest and Amazon
SageMaker XGBoost
Question # 31
A company wants to predict stock market price trends. The company stores stock market data each business day in Amazon S3 in Apache Parquet format. The company stores 20 GB of data each day for each stock code. A data engineer must use Apache Spark to perform batch preprocessing data transformations quickly so the company can complete prediction jobs before the stock market opens the next day. The company plans to track more stock market codes and needs a way to scale the preprocessing data transformations. Which AWS service or feature will meet these requirements with the LEAST development effort over time?
A. AWS Glue jobs B. Amazon EMR cluster C. Amazon Athena D. AWS Lambda
Answer: A
Explanation: AWS Glue jobs is the AWS service or feature that will meet the requirements
with the least development effort over time. AWS Glue jobs is a fully managed service that
enables data engineers to run Apache Spark applications on a serverless Spark
environment. AWS Glue jobs can perform batch preprocessing data transformations on
large datasets stored in Amazon S3, such as converting data formats, filtering data, joining
data, and aggregating data. AWS Glue jobs can also scale the Spark environment
automatically based on the data volume and processing needs, without requiring any
infrastructure provisioning or management. AWS Glue jobs can reduce the development
effort and time by providing a graphical interface to create and monitor Spark applications,
as well as a code generation feature that can generate Scala or Python code based on the
data sources and targets. AWS Glue jobs can also integrate with other AWS services, such
as Amazon Athena, Amazon EMR, and Amazon SageMaker, to enable further data
analysis and machine learning tasks1.
The other options are either more complex or less scalable than AWS Glue jobs. Amazon
EMR cluster is a managed service that enables data engineers to run Apache Spark
applications on a cluster of Amazon EC2 instances. However, Amazon EMR cluster
requires more development effort and time than AWS Glue jobs, as it involves setting up,
configuring, and managing the cluster, as well as writing and deploying the Spark
code. Amazon EMR cluster also does not scale automatically, but requires manual or
scheduled resizing of the cluster based on the data volume and processing needs2.
Amazon Athena is a serverless interactive query service that enables data engineers to
analyze data stored in Amazon S3 using standard SQL. However, Amazon Athena is not
suitable for performing complex data transformations, such as joining data from multiple
sources, aggregating data, or applying custom logic. Amazon Athena is also not designed
for running Spark applications, but only supports SQL queries3. AWS Lambda is a
serverless compute service that enables data engineers to run code without provisioning or
managing servers. However, AWS Lambda is not optimized for running Spark applications,
as it has limitations on the execution time, memory size, and concurrency of the functions.
AWS Lambda also provides no managed Spark runtime, so the data engineer would have to reimplement the transformations and the S3 reads and writes in custom code.
References:
1: AWS Glue - Fully Managed ETL Service - Amazon Web Services
2: Amazon EMR - Amazon Web Services
3: Amazon Athena – Interactive SQL Queries for Data in Amazon S3
[4]: AWS Lambda – Serverless Compute - Amazon Web Services
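A hedged boto3 sketch of creating such a Glue Spark job (the job name, role, script location, and worker settings are hypothetical) illustrates how the preprocessing can be scaled by simply raising the worker count:

```python
# Hedged sketch: define a serverless Glue Spark job for the daily preprocessing.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="stock-preprocessing",                                   # hypothetical job name
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",        # hypothetical role
    Command={
        "Name": "glueetl",                                        # Spark ETL job
        "ScriptLocation": "s3://example-bucket/scripts/preprocess.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=10,        # scale up as more stock codes are tracked
    Timeout=60,                # minutes
)

# Kick off the nightly run (typically scheduled with a Glue trigger or EventBridge).
glue.start_job_run(JobName="stock-preprocessing")
```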
Question # 32
A company wants to forecast the daily price of newly launched products based on 3 years of data for older product prices, sales, and rebates. The time-series data has irregular timestamps and is missing some values. A data scientist must build a dataset to replace the missing values. The data scientist needs a solution that resamples the data daily and exports the data for further modeling. Which solution will meet these requirements with the LEAST implementation effort?
A. Use Amazon EMR Serverless with PySpark. B. Use AWS Glue DataBrew. C. Use Amazon SageMaker Studio Data Wrangler. D. Use Amazon SageMaker Studio Notebook with Pandas.
Answer: C
Explanation: Amazon SageMaker Studio Data Wrangler is a visual data preparation tool
that enables users to clean and normalize data without writing any code. Using Data
Wrangler, the data scientist can easily import the time-series data from various sources,
such as Amazon S3, Amazon Athena, or Amazon Redshift. Data Wrangler can
automatically generate data insights and quality reports, which can help identify and fix
missing values, outliers, and anomalies in the data. Data Wrangler also provides over 250
built-in transformations, such as resampling, interpolation, aggregation, and filtering, which
can be applied to the data with a point-and-click interface. Data Wrangler can also export
the prepared data to different destinations, such as Amazon S3, Amazon SageMaker
Feature Store, or Amazon SageMaker Pipelines, for further modeling and analysis. Data
Wrangler is integrated with Amazon SageMaker Studio, a web-based IDE for machine
learning, which makes it easy to access and use the tool. Data Wrangler is a serverless
and fully managed service, which means the data scientist does not need to provision,
configure, or manage any infrastructure or clusters.
Option A is incorrect because Amazon EMR Serverless is a serverless option for running big data analytics applications using open-source frameworks, such as Apache Spark.
However, using Amazon EMR Serverless would require the data scientist to write PySpark
code to perform the data preparation tasks, such as resampling, imputation, and
aggregation. This would require more implementation effort than using Data Wrangler,
which provides a visual and code-free interface for data preparation.
Option B is incorrect because AWS Glue DataBrew is another visual data preparation tool
that can be used to clean and normalize data without writing code. However, DataBrew
does not support time-series data as a data type, and does not provide built-in
transformations for resampling, interpolation, or aggregation of time-series data. Therefore,
using DataBrew would not meet the requirements of the use case.
Option D is incorrect because using Amazon SageMaker Studio Notebook with Pandas
would also require the data scientist to write Python code to perform the data preparation
tasks. Pandas is a popular Python library for data analysis and manipulation, which
supports time-series data and provides various methods for resampling, interpolation, and
aggregation. However, using Pandas would require more implementation effort than using
Data Wrangler, which provides a visual and code-free interface for data preparation.
References:
1: Amazon SageMaker Data Wrangler documentation
2: Amazon EMR Serverless documentation
3: AWS Glue DataBrew documentation
4: Pandas documentation
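Although the recommended answer keeps the work in Data Wrangler's no-code interface, the underlying transformation is daily resampling plus interpolation of missing values, which can be illustrated with a short, hedged pandas sketch on toy data:

```python
# Hedged illustration: resample irregular timestamps to a daily frequency and
# fill the gaps by interpolation (the same kind of transform Data Wrangler applies).
import pandas as pd

prices = pd.DataFrame(
    {"price": [10.0, 10.4, None, 11.2]},
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-05", "2023-01-08"]),
)

daily = prices.resample("D").mean().interpolate(method="linear")
print(daily)
```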
Question # 33
A company operates large cranes at a busy port. The company plans to use machine learning (ML) for predictive maintenance of the cranes to avoid unexpected breakdowns and to improve productivity. The company already uses sensor data from each crane to monitor the health of the cranes in real time. The sensor data includes rotation speed, tension, energy consumption, vibration, pressure, and temperature for each crane. The company contracts AWS ML experts to implement an ML solution. Which potential findings would indicate that an ML-based solution is suitable for this scenario? (Select TWO.)
A. The historical sensor data does not include a significant number of data points and attributes for certain time periods. B. The historical sensor data shows that simple rule-based thresholds can predict crane failures. C. The historical sensor data contains failure data for only one type of crane model that is in operation and lacks failure data for most other types of crane that are in operation. D. The historical sensor data from the cranes is available with high granularity for the last 3 years. E. The historical sensor data contains the most common types of crane failures that the company wants to predict.
Answer: D,E
Explanation: The best indicators that an ML-based solution is suitable for this scenario are
D and E, because they imply that the historical sensor data is sufficient and relevant for building a predictive maintenance model. This model can use machine learning techniques
such as regression, classification, or anomaly detection to learn from the past data and
forecast future failures or issues12. Having high granularity and diversity of data can
improve the accuracy and generalization of the model, as well as enable the detection of
complex patterns and relationships that are not captured by simple rule-based thresholds3.
The other options are not good indicators that an ML-based solution is suitable, because
they suggest that the historical sensor data is incomplete, inconsistent, or inadequate for
building a predictive maintenance model. These options would require additional data
collection, preprocessing, or augmentation to overcome the data quality issues and ensure
that the model can handle different scenarios and types of cranes4 .
References:
1: Machine Learning Techniques for Predictive Maintenance
2: A Guide to Predictive Maintenance & Machine Learning
3: Machine Learning for Predictive Maintenance: Reinventing Asset Upkeep
4: Predictive Maintenance with Machine Learning: A Complete Guide
Question # 34
A company is creating an application to identify, count, and classify animal images that are uploaded to the company's website. The company is using the Amazon SageMaker image classification algorithm with an ImageNetV2 convolutional neural network (CNN). The solution works well for most animal images but does not recognize many animal species that are less common. The company obtains 10,000 labeled images of less common animal species and stores the images in Amazon S3. A machine learning (ML) engineer needs to incorporate the images into the model by using Pipe mode in SageMaker. Which combination of steps should the ML engineer take to train the model? (Choose two.)
A. Use a ResNet model. Initiate full training mode by initializing the network with random weights. B. Use an Inception model that is available with the SageMaker image classification algorithm. C. Create a .lst file that contains a list of image files and corresponding class labels. Upload the .lst file to Amazon S3. D. Initiate transfer learning. Train the model by using the images of less common species. E. Use an augmented manifest file in JSON Lines format.
Answer: C,D
Explanation: The combination of steps that the ML engineer should take to train the model
are to create a .lst file that contains a list of image files and corresponding class labels,
upload the .lst file to Amazon S3, and initiate transfer learning by training the model using
the images of less common species. This approach will allow the ML engineer to leverage
the existing ImageNetV2 CNN model and fine-tune it with the new data using Pipe mode in
SageMaker.
A .lst file is a text file that contains a list of image files and corresponding class labels,
separated by tabs. The .lst file format is required for using the SageMaker image
classification algorithm with Pipe mode. Pipe mode is a feature of SageMaker that enables
streaming data directly from Amazon S3 to the training instances, without downloading the
data first. Pipe mode can reduce the startup time, improve the I/O throughput, and enable
training on large datasets that exceed the disk size limit. To use Pipe mode, the ML
engineer needs to upload the .lst file to Amazon S3 and specify the S3 path as the input
data channel for the training job1.
Transfer learning is a technique that enables reusing a pre-trained model for a new task by
fine-tuning the model parameters with new data. Transfer learning can save time and
computational resources, as well as improve the performance of the model, especially
when the new task is similar to the original task. The SageMaker image classification
algorithm supports transfer learning by allowing the ML engineer to specify the number of
output classes and the number of layers to be retrained. The ML engineer can use the
existing ImageNetV2 CNN model, which is trained on 1,000 classes of common objects,
and fine-tune it with the new data of less common animal species, which is a similar task2.
The other options are either less effective or not supported by the SageMaker image
classification algorithm. Using a ResNet model and initiating full training mode would
require training the model from scratch, which would take more time and resources than
transfer learning. Using an Inception model is not possible, as the SageMaker image
classification algorithm only supports ResNet and ImageNetV2 models. Using an
augmented manifest file in JSON Lines format is not compatible with Pipe mode, as Pipe
mode only supports .lst files for image classification1.
References:
1: Using Pipe input mode for Amazon SageMaker algorithms | AWS Machine Learning Blog
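A hedged sketch of building the .lst manifest (tab-separated index, class label, and relative image path; file names, class IDs, and bucket name are illustrative) before uploading it to Amazon S3 for Pipe mode training:

```python
# Hedged sketch: write a .lst file (index \t class_label \t relative_path) for
# the SageMaker image classification algorithm, then upload it to S3.
import boto3

samples = [
    ("okapi/img_0001.jpg", 0),       # illustrative file names and class IDs
    ("quokka/img_0002.jpg", 1),
    ("pangolin/img_0003.jpg", 2),
]

with open("train.lst", "w") as f:
    for index, (path, label) in enumerate(samples):
        f.write(f"{index}\t{label}\t{path}\n")

boto3.client("s3").upload_file("train.lst", "example-bucket", "train/train.lst")
```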
Question # 35
A machine learning (ML) specialist is using the Amazon SageMaker DeepAR forecasting algorithm to train a model on CPU-based Amazon EC2 On-Demand Instances. The model currently takes multiple hours to train. The ML specialist wants to decrease the training time of the model. Which approaches will meet this requirement? (Select TWO.)
A. Replace On-Demand Instances with Spot Instances. B. Configure model auto scaling dynamically to adjust the number of instances automatically. C. Replace CPU-based EC2 instances with GPU-based EC2 instances. D. Use multiple training instances. E. Use a pre-trained version of the model. Run incremental training.
Answer: C,D
Explanation: The best approaches to decrease the training time of the model are C and D,
because they can improve the computational efficiency and parallelization of the training
process. These approaches have the following benefits:
C: Replacing CPU-based EC2 instances with GPU-based EC2 instances can
speed up the training of the DeepAR algorithm, as it can leverage the parallel
processing power of GPUs to perform matrix operations and gradient
computations faster than CPUs12. The DeepAR algorithm supports GPU-based
EC2 instances such as ml.p2 and ml.p33.
D: Using multiple training instances can also reduce the training time of the
DeepAR algorithm, as it can distribute the workload across multiple nodes and
perform data parallelism4. The DeepAR algorithm supports distributed training with
multiple CPU-based or GPU-based EC2 instances3.
The other options are not effective or relevant, because they have the following drawbacks:
A: Replacing On-Demand Instances with Spot Instances can reduce the cost of
the training, but not necessarily the time, as Spot Instances are subject to
interruption and availability5. Moreover, the DeepAR algorithm does not support
checkpointing, which means that the training cannot resume from the last saved
state if the Spot Instance is terminated3.
B: Configuring model auto scaling dynamically to adjust the number of instances
automatically is not applicable, as this feature is only available for inference
endpoints, not for training jobs6.
E: Using a pre-trained version of the model and running incremental training is not
possible, as the DeepAR algorithm does not support incremental training or
transfer learning3. The DeepAR algorithm requires a full retraining of the model
whenever new data is added or the hyperparameters are changed7.
References:
1: GPU vs CPU: What Matters Most for Machine Learning? | by Louis (What's AI)
Bouchard | Towards Data Science
2: How GPUs Accelerate Machine Learning Training | NVIDIA Developer Blog
7: How the DeepAR Algorithm Works - Amazon SageMaker
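A hedged SageMaker Python SDK sketch of the two selected approaches (role ARN, S3 paths, and hyperparameter values are hypothetical placeholders): the built-in DeepAR container is trained on two GPU instances instead of a single CPU instance.

```python
# Hedged sketch: distributed, GPU-based training for the built-in DeepAR algorithm.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image_uri = sagemaker.image_uris.retrieve("forecasting-deepar", session.boto_region_name)

estimator = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # hypothetical role
    instance_count=2,                                      # D: multiple training instances
    instance_type="ml.p3.2xlarge",                         # C: GPU-based instead of CPU-based
    output_path="s3://example-bucket/deepar/output",
    sagemaker_session=session,
)
estimator.set_hyperparameters(
    time_freq="D", context_length=30, prediction_length=14, epochs=100,
)
estimator.fit({"train": "s3://example-bucket/deepar/train/"})
```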
Question # 36
A manufacturing company has a production line with sensors that collect hundreds of quality metrics. The company has stored sensor data and manual inspection results in a data lake for several months. To automate quality control, the machine learning team must build an automated mechanism that determines whether the produced goods are good quality, replacement market quality, or scrap quality based on the manual inspection results. Which modeling approach will deliver the MOST accurate prediction of product quality?
A. Amazon SageMaker DeepAR forecasting algorithm B. Amazon SageMaker XGBoost algorithm C. Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm D. A convolutional neural network (CNN) and ResNet
Answer: D
Explanation: A convolutional neural network (CNN) is a type of deep learning model that
can learn to extract features from images and perform tasks such as classification,
segmentation, and detection1. ResNet is a popular CNN architecture that uses residual
connections to overcome the problem of vanishing gradients and enable very deep
networks2. For the task of predicting product quality based on sensor data, a CNN and
ResNet approach can leverage the spatial structure of the data and learn complex patterns
that distinguish different quality levels.
References:
Convolutional Neural Networks (CNNs / ConvNets)
PyTorch ResNet: The Basics and a Quick Tutorial
Question # 37
A data scientist at a financial services company used Amazon SageMaker to train and deploy a model that predicts loan defaults. The model analyzes new loan applications and predicts the risk of loan default. To train the model, the data scientist manually extracted loan data from a database. The data scientist performed the model training and deployment steps in a Jupyter notebook that is hosted on SageMaker Studio notebooks. The model's prediction accuracy is decreasing over time. Which combination of steps should the data scientist take to maintain the model's accuracy in the MOST operationally efficient way? (Select TWO.)
A. Use SageMaker Pipelines to create an automated workflow that extracts fresh data, trains the model, and deploys a new version of the model. B. Configure SageMaker Model Monitor with an accuracy threshold to check for model drift. Initiate an Amazon CloudWatch alarm when the threshold is exceeded. Connect the workflow in SageMaker Pipelines with the CloudWatch alarm to automatically initiate retraining. C. Store the model predictions in Amazon S3. Create a daily SageMaker Processing job that reads the predictions from Amazon S3, checks for changes in model prediction accuracy, and sends an email notification if a significant change is detected. D. Rerun the steps in the Jupyter notebook that is hosted on SageMaker Studio notebooks to retrain the model and redeploy a new version of the model. E. Export the training and deployment code from the SageMaker Studio notebooks into a Python script. Package the script into an Amazon Elastic Container Service (Amazon ECS) task that an AWS Lambda function can initiate.
Answer: A,B
Explanation:
Option A is correct because SageMaker Pipelines is a service that enables you to
create and manage automated workflows for your machine learning projects. You
can use SageMaker Pipelines to orchestrate the steps of data extraction, model
training, and model deployment in a repeatable and scalable way1.
Option B is correct because SageMaker Model Monitor is a service that monitors
the quality of your models in production and alerts you when there are deviations
in the model quality. You can use SageMaker Model Monitor to set an accuracy
threshold for your model and configure a CloudWatch alarm that triggers when the
threshold is exceeded. You can then connect the alarm to the workflow in
SageMaker Pipelines to automatically initiate retraining and deployment of a new
version of the model2.
Option C is incorrect because it is not the most operationally efficient way to
maintain the model’s accuracy. Creating a daily SageMaker Processing job that
reads the predictions from Amazon S3 and checks for changes in model prediction
accuracy is a manual and time-consuming process. It also requires you to write
custom code to perform the data analysis and send the email notification.
Moreover, it does not automatically retrain and deploy the model when the
accuracy drops.
Option D is incorrect because it is not the most operationally efficient way to
maintain the model’s accuracy. Rerunning the steps in the Jupyter notebook that is
hosted on SageMaker Studio notebooks to retrain the model and redeploy a new
version of the model is a manual and error-prone process. It also requires you to
monitor the model’s performance and initiate the retraining and deployment steps
yourself. Moreover, it does not leverage the benefits of SageMaker Pipelines and
SageMaker Model Monitor to automate and streamline the workflow.
Option E is incorrect because it is not the most operationally efficient way to
maintain the model’s accuracy. Exporting the training and deployment code from
the SageMaker Studio notebooks into a Python script and packaging the script into
an Amazon ECS task that an AWS Lambda function can initiate is a complex and
cumbersome process. It also requires you to manage the infrastructure and
resources for the Amazon ECS task and the AWS Lambda function. Moreover, it
does not leverage the benefits of SageMaker Pipelines and SageMaker Model
Monitor to automate and streamline the workflow. References:
1: SageMaker Pipelines - Amazon SageMaker
2: Monitor data and model quality - Amazon SageMaker
Question # 38
A data scientist uses Amazon SageMaker Data Wrangler to define and perform transformations and feature engineering on historical data. The data scientist saves the transformations to SageMaker Feature Store. The historical data is periodically uploaded to an Amazon S3 bucket. The data scientist needs to transform the new historic data and add it to the online feature store. The data scientist needs to prepare the new historic data for training and inference by using native integrations. Which solution will meet these requirements with the LEAST development effort?
A. Use AWS Lambda to run a predefined SageMaker pipeline to perform the transformations on each new dataset that arrives in the S3 bucket.
B. Run an AWS Step Functions step and a predefined SageMaker pipeline to perform the transformations on each new dataset that arrives in the S3 bucket.
C. Use Apache Airflow to orchestrate a set of predefined transformations on each new dataset that arrives in the S3 bucket.
D. Configure Amazon EventBridge to run a predefined SageMaker pipeline to perform the transformations when new data is detected in the S3 bucket.
Answer: D
Explanation: The best solution is to configure Amazon EventBridge to run a predefined
SageMaker pipeline to perform the transformations when new data is detected in the S3
bucket. This solution requires the least development effort because it leverages the native
integration between EventBridge and SageMaker Pipelines, which allows you to trigger a
pipeline execution based on an event rule. EventBridge can monitor the S3 bucket for new
data uploads and invoke the pipeline that contains the same transformations and feature
engineering steps that were defined in SageMaker Data Wrangler. The pipeline can then
ingest the transformed data into the online feature store for training and inference.
The other solutions are less optimal because they require more development effort and
additional services. Using AWS Lambda or AWS Step Functions would require writing
custom code to invoke the SageMaker pipeline and handle any errors or retries. Using
Apache Airflow would require setting up and maintaining an Airflow server and DAGs, as
well as integrating with the SageMaker API.
References:
Amazon EventBridge and Amazon SageMaker Pipelines integration
Create a pipeline using a JSON specification
Ingest data into a feature group
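To make the native integration in option D concrete, here is a hedged boto3 sketch that creates an EventBridge rule for new objects in the bucket and registers the SageMaker pipeline as the rule's target. The bucket name, pipeline ARN, and role ARN are placeholders, and the rule assumes EventBridge notifications are enabled on the bucket.
```python
import json
import boto3

events = boto3.client("events")

# Rule that fires when a new object lands in the bucket (names are hypothetical).
events.put_rule(
    Name="new-historic-data-rule",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["historic-data-bucket"]}},
    }),
    State="ENABLED",
)

# Register the predefined SageMaker pipeline as the rule target.
events.put_targets(
    Rule="new-historic-data-rule",
    Targets=[{
        "Id": "run-transform-pipeline",
        "Arn": "arn:aws:sagemaker:us-east-1:111122223333:pipeline/feature-transform",
        "RoleArn": "arn:aws:iam::111122223333:role/EventBridgeSageMakerRole",
        "SageMakerPipelineParameters": {"PipelineParameterList": []},
    }],
)
```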
Question # 39
A financial services company wants to automate its loan approval process by building a machine learning (ML) model. Each loan data point contains credit history from a third-party data source and demographic information about the customer. Each loan approval prediction must come with a report that contains an explanation for why the customer was approved for a loan or was denied for a loan. The company will use Amazon SageMaker to build the model.
Which solution will meet these requirements with the LEAST development effort?
A. Use SageMaker Model Debugger to automatically debug the predictions, generate the explanation, and attach the explanation report.
B. Use AWS Lambda to provide feature importance and partial dependence plots. Use the plots to generate and attach the explanation report.
C. Use SageMaker Clarify to generate the explanation report. Attach the report to the predicted results.
D. Use custom Amazon CloudWatch metrics to generate the explanation report. Attach the report to the predicted results.
Answer: C
Explanation:
The best solution for this scenario is to use SageMaker Clarify to generate the explanation
report and attach it to the predicted results. SageMaker Clarify provides tools to help
explain how machine learning (ML) models make predictions using a model-agnostic
feature attribution approach based on SHAP values. It can also detect and measure
potential bias in the data and the model. SageMaker Clarify can generate explanation
reports during data preparation, model training, and model deployment. The reports include
metrics, graphs, and examples that help understand the model behavior and predictions.
The reports can be attached to the predicted results using the SageMaker SDK or the SageMaker API.
The other solutions are less optimal because they require more development effort and
additional services. Using SageMaker Model Debugger would require modifying the
training script to save the model output tensors and writing custom rules to debug and
explain the predictions. Using AWS Lambda would require writing code to invoke the ML
model, compute the feature importance and partial dependence plots, and generate and
attach the explanation report. Using custom Amazon CloudWatch metrics would require
writing code to publish the metrics, create dashboards, and generate and attach the
explanation report.
References:
Bias Detection and Model Explainability - Amazon SageMaker Clarify - AWS
Amazon SageMaker Clarify Model Explainability
Amazon SageMaker Clarify: Machine Learning Bias Detection and Explainability
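A minimal sketch of how option C could look with the SageMaker Python SDK is shown below. The S3 paths, column names, model name, and SHAP baseline values are hypothetical; in practice they come from the company's dataset and deployed model.
```python
from sagemaker import Session, get_execution_role
from sagemaker.clarify import (
    DataConfig, ModelConfig, SHAPConfig, SageMakerClarifyProcessor,
)

session = Session()
role = get_execution_role()

processor = SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

# Input data and output location for the explainability report (paths are placeholders).
data_config = DataConfig(
    s3_data_input_path="s3://my-bucket/loans/train.csv",
    s3_output_path="s3://my-bucket/clarify-output/",
    label="approved",
    headers=["approved", "credit_score", "income", "age"],
    dataset_type="text/csv",
)

# The SageMaker model whose predictions should be explained (name is a placeholder).
model_config = ModelConfig(
    model_name="loan-approval-model",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
)

# SHAP baseline with one row of representative feature values (values are assumptions).
shap_config = SHAPConfig(baseline=[[650, 50000, 35]], num_samples=100, agg_method="mean_abs")

processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
```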
Question # 40
A manufacturing company has structured and unstructured data stored in an Amazon S3 bucket. A Machine Learning Specialist wants to use SQL to run queries on this data.
Which solution requires the LEAST effort to be able to query this data?
A. Use AWS Data Pipeline to transform the data and Amazon RDS to run queries.
B. Use AWS Glue to catalogue the data and Amazon Athena to run queries.
C. Use AWS Batch to run ETL on the data and Amazon Aurora to run the queries.
D. Use AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run queries.
Answer: B
Explanation: Using AWS Glue to catalogue the data and Amazon Athena to run queries is
the solution that requires the least effort to be able to query the data stored in an Amazon
S3 bucket using SQL. AWS Glue is a service that provides a serverless data integration
platform for data preparation and transformation. AWS Glue can automatically discover,
crawl, and catalogue the data stored in various sources, such as Amazon S3, Amazon
RDS, Amazon Redshift, etc. AWS Glue can also use AWS KMS to encrypt the data at rest
on the Glue Data Catalog and Glue ETL jobs. AWS Glue can handle both structured and
unstructured data, and support various data formats, such as CSV, JSON, Parquet,
etc. AWS Glue can also use built-in or custom classifiers to identify and parse the data
schema and format1 Amazon Athena is a service that provides an interactive query engine
that can run SQL queries directly on data stored in Amazon S3. Amazon Athena can
integrate with AWS Glue to use the Glue Data Catalog as a central metadata repository for
the data sources and tables. Amazon Athena can also use AWS KMS to encrypt the data
at rest on Amazon S3 and the query results. Amazon Athena can query both structured
and unstructured data, and support various data formats, such as CSV, JSON, Parquet,
etc. Amazon Athena can also use partitions and compression to optimize the query
performance and reduce the query cost23
The other options are not valid or require more effort to query the data stored in an Amazon
S3 bucket using SQL. Using AWS Data Pipeline to transform the data and Amazon RDS to
run queries is not a good option, as it involves moving the data from Amazon S3 to
Amazon RDS, which can incur additional time and cost. AWS Data Pipeline is a service that can orchestrate and automate data movement and transformation across various AWS
services and on-premises data sources. AWS Data Pipeline can be integrated with
Amazon EMR to run ETL jobs on the data stored in Amazon S3. Amazon RDS is a service
that provides a managed relational database service that can run various database
engines, such as MySQL, PostgreSQL, Oracle, etc. Amazon RDS can use AWS KMS to
encrypt the data at rest and in transit. Amazon RDS can run SQL queries on the data
stored in the database tables45 Using AWS Batch to run ETL on the data and Amazon
Aurora to run the queries is not a good option, as it also involves moving the data from
Amazon S3 to Amazon Aurora, which can incur additional time and cost. AWS Batch is a
service that can run batch computing workloads on AWS. AWS Batch can be integrated
with AWS Lambda to trigger ETL jobs on the data stored in Amazon S3. Amazon Aurora is
a service that provides a compatible and scalable relational database engine that can run
MySQL or PostgreSQL. Amazon Aurora can use AWS KMS to encrypt the data at rest and
in transit. Amazon Aurora can run SQL queries on the data stored in the database tables.
Using AWS Lambda to transform the data and Amazon Kinesis Data Analytics to run
queries is not a good option, as it is not suitable for querying data stored in Amazon S3
using SQL. AWS Lambda is a service that can run serverless functions on AWS. AWS
Lambda can be integrated with Amazon S3 to trigger data transformation functions on the
data stored in Amazon S3. Amazon Kinesis Data Analytics is a service that can analyze
streaming data using SQL or Apache Flink. Amazon Kinesis Data Analytics can be
integrated with Amazon Kinesis Data Streams or Amazon Kinesis Data Firehose to ingest
streaming data sources, such as web logs, social media, IoT devices, etc. Amazon Kinesis
Data Analytics is not designed for querying data stored in Amazon S3 using SQL.
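For illustration, the hedged boto3 sketch below shows the two moving parts of option B: a Glue crawler (assumed to already exist) that catalogues the S3 data, and an Athena query against the resulting table. The crawler name, database, table, columns, SQL, and output location are all placeholders.
```python
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# The crawler (assumed to exist) populates the Glue Data Catalog from the S3 bucket.
glue.start_crawler(Name="manufacturing-data-crawler")

# Once the table is catalogued, Athena can query it with standard SQL.
response = athena.start_query_execution(
    QueryString=(
        "SELECT machine_id, AVG(temperature) AS avg_temp "
        "FROM sensor_readings GROUP BY machine_id"
    ),
    QueryExecutionContext={"Database": "manufacturing_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])
```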
Question # 41
A data scientist has been running an Amazon SageMaker notebook instance for a few weeks. During this time, a new version of Jupyter Notebook was released along with additional software updates. The security team mandates that all running SageMaker notebook instances use the latest security and software updates provided by SageMaker.
How can the data scientist meet these requirements?
A. Call the CreateNotebookInstanceLifecycleConfig API operation.
B. Create a new SageMaker notebook instance and mount the Amazon Elastic Block Store (Amazon EBS) volume from the original instance.
C. Stop and then restart the SageMaker notebook instance.
D. Call the UpdateNotebookInstanceLifecycleConfig API operation.
Answer: C
Explanation: The correct solution for updating the software on a SageMaker notebook
instance is to stop and then restart the notebook instance. This will automatically apply the
latest security and software updates provided by SageMaker1
The other options are incorrect because they either do not update the software or require
unnecessary steps. For example:
Option A calls the CreateNotebookInstanceLifecycleConfig API operation. This
operation creates a lifecycle configuration, which is a set of shell scripts that run
when a notebook instance is created or started. A lifecycle configuration can be
used to customize the notebook instance, such as installing additional libraries or
packages. However, it does not update the software on the notebook instance2
Option B creates a new SageMaker notebook instance and mounts the Amazon
Elastic Block Store (Amazon EBS) volume from the original instance. This option
will create a new notebook instance with the latest software, but it will also incur
additional costs and require manual steps to transfer the data and settings from
the original instance3
Option D calls the UpdateNotebookInstanceLifecycleConfig API operation. This
operation updates an existing lifecycle configuration. As explained in option A, a
lifecycle configuration does not update the software on the notebook instance4
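If the data scientist prefers to script the stop/start cycle instead of using the console, a hedged boto3 sketch could look like the following; the notebook instance name is hypothetical.
```python
import boto3

sm = boto3.client("sagemaker")
name = "my-notebook-instance"  # hypothetical instance name

# Stop the instance and wait until it is fully stopped.
sm.stop_notebook_instance(NotebookInstanceName=name)
sm.get_waiter("notebook_instance_stopped").wait(NotebookInstanceName=name)

# Restarting applies the latest security patches and software updates.
sm.start_notebook_instance(NotebookInstanceName=name)
sm.get_waiter("notebook_instance_in_service").wait(NotebookInstanceName=name)
```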
Question # 42
A large company has developed a BI application that generates reports and dashboards using data collected from various operational metrics. The company wants to provide executives with an enhanced experience so they can use natural language to get data from the reports. The company wants the executives to be able to ask questions using written and spoken interfaces.
Which combination of services can be used to build this conversational interface? (Select THREE.)
A. Alexa for Business
B. Amazon Connect
C. Amazon Lex
D. Amazon Polly
E. Amazon Comprehend
F. Amazon Transcribe
Answer: C,E,F
Explanation:
To build a conversational interface that can use natural language to get data from
the reports, the company can use a combination of services that can handle both
written and spoken inputs, understand the user’s intent and query, and extract the
relevant information from the reports. The services that can be used for this
purpose are Amazon Lex, Amazon Comprehend, and Amazon Transcribe.
Amazon Transcribe converts the executives’ spoken questions into text, Amazon
Lex provides the conversational interface that interprets both written and
transcribed questions and manages the dialog, and Amazon Comprehend extracts
the entities and key phrases needed to retrieve the requested data from the reports.
References:
What Is Amazon Lex?
What Is Amazon Comprehend?
What Is Amazon Transcribe?
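A hedged boto3 sketch of how the three services could fit together is shown below; the transcription job name, S3 URI, bot name, and sample question are all placeholders, and the Transcribe job runs asynchronously (its output would be polled for before being passed on).
```python
import boto3

transcribe = boto3.client("transcribe")
lex = boto3.client("lex-runtime")
comprehend = boto3.client("comprehend")

# Spoken question: Transcribe converts the audio file to text (asynchronous job).
transcribe.start_transcription_job(
    TranscriptionJobName="exec-question-001",
    Media={"MediaFileUri": "s3://bi-voice-input/question.wav"},
    MediaFormat="wav",
    LanguageCode="en-US",
)

question = "What was revenue by region last quarter?"

# Written (or transcribed) question: Lex resolves the intent and slots.
lex_response = lex.post_text(
    botName="BIReportBot", botAlias="prod", userId="exec-42", inputText=question,
)

# Comprehend extracts entities (metrics, dates, regions) from the free text.
entities = comprehend.detect_entities(Text=question, LanguageCode="en")
print(lex_response.get("intentName"), [e["Text"] for e in entities["Entities"]])
```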
Question # 43
A manufacturing company needs to identify returned smartphones that have been damaged by moisture. The company has an automated process that produces 2,000 diagnostic values for each phone. The database contains more than five million phone evaluations. The evaluation process is consistent, and there are no missing values in the data. A machine learning (ML) specialist has trained an Amazon SageMaker linear learner ML model to classify phones as moisture damaged or not moisture damaged by using all available features. The model's F1 score is 0.6.
What changes in model training would MOST likely improve the model's F1 score? (Select TWO.)
A. Continue to use the SageMaker linear learner algorithm. Reduce the number of features with the SageMaker principal component analysis (PCA) algorithm.
B. Continue to use the SageMaker linear learner algorithm. Reduce the number of features with the scikit-learn multi-dimensional scaling (MDS) algorithm.
C. Continue to use the SageMaker linear learner algorithm. Set the predictor type to regressor.
D. Use the SageMaker k-means algorithm with k of less than 1,000 to train the model.
E. Use the SageMaker k-nearest neighbors (k-NN) algorithm. Set a dimension reduction target of less than 1,000 to train the model.
Answer: A,E
Explanation:
Option A is correct because reducing the number of features with the SageMaker
PCA algorithm can help remove noise and redundancy from the data and improve
the model’s performance. PCA is a dimensionality reduction technique that
transforms the original features into a smaller set of linearly uncorrelated features
called principal components. PCA is available as a SageMaker built-in algorithm,
and its output can be used to train the linear learner on the reduced feature set.
Option E is correct because using the SageMaker k-NN algorithm with a
dimension reduction target of less than 1,000 can help the model learn from the
similarity of the data points, and improve the model’s performance. k-NN is a nonparametric
algorithm that classifies an input based on the majority vote of its k
nearest neighbors in the feature space. The SageMaker k-NN algorithm supports
dimension reduction as a built-in feature transformation option.
Option B is incorrect because using the scikit-learn MDS algorithm to reduce the
number of features is not a feasible option, as MDS is a computationally expensive
technique that does not scale well to large datasets. MDS is a dimensionality
reduction technique that tries to preserve the pairwise distances between the
original data points in a lower-dimensional space.
Option C is incorrect because setting the predictor type to regressor would change
the model’s objective from classification to regression, which is not suitable for the given problem. A regressor model would output a continuous value instead of a
binary label for each phone.
Option D is incorrect because using the SageMaker k-means algorithm with k of
less than 1,000 would not help the model classify the phones, as k-means is a
clustering algorithm that groups the data points into k clusters based on their
similarity, without using any labels. A clustering model would not output a binary label that indicates whether a phone is moisture damaged.
Question # 44
A beauty supply store wants to understand some characteristics of visitors to the store. The store has security video recordings from the past several years. The store wants to generate a report of hourly visitors from the recordings. The report should group visitors by hair style and hair color.
Which solution will meet these requirements with the LEAST amount of effort?
A. Use an object detection algorithm to identify a visitor’s hair in video frames. Pass the identified hair to a ResNet-50 algorithm to determine hair style and hair color.
B. Use an object detection algorithm to identify a visitor’s hair in video frames. Pass the identified hair to an XGBoost algorithm to determine hair style and hair color.
C. Use a semantic segmentation algorithm to identify a visitor’s hair in video frames. Pass the identified hair to a ResNet-50 algorithm to determine hair style and hair color.
D. Use a semantic segmentation algorithm to identify a visitor’s hair in video frames. Pass the identified hair to an XGBoost algorithm to determine hair style and hair color.
Answer: C
Explanation: The solution that will meet the requirements with the least amount of effort is
to use a semantic segmentation algorithm to identify a visitor’s hair in video frames, and
pass the identified hair to a ResNet-50 algorithm to determine hair style and hair color.
This solution can leverage the existing Amazon SageMaker algorithms and frameworks to
perform the tasks of hair segmentation and classification.
Semantic segmentation is a computer vision technique that assigns a class label to every
pixel in an image, such that pixels with the same label share certain characteristics.
Semantic segmentation can be used to identify and isolate different objects or regions in an
image, such as a visitor’s hair in a video frame. Amazon SageMaker provides a built-in
semantic segmentation algorithm that can train and deploy models for semantic
segmentation tasks. The algorithm supports three state-of-the-art network architectures:
Fully Convolutional Network (FCN), Pyramid Scene Parsing Network (PSP), and DeepLab
v3. The algorithm can also use pre-trained or randomly initialized ResNet-50 or ResNet-
101 as the backbone network. The algorithm can be trained using P2/P3 type Amazon EC2
instances in single machine configurations1.
ResNet-50 is a convolutional neural network that is 50 layers deep and can classify images
into 1000 object categories. ResNet-50 is trained on more than a million images from the
ImageNet database and can achieve high accuracy on various image recognition tasks.
ResNet-50 can be used to determine hair style and hair color from the segmented hair
regions in the video frames. Amazon SageMaker provides a built-in image classification
algorithm that can use ResNet-50 as the network architecture. The algorithm can also
perform transfer learning by fine-tuning the pre-trained ResNet-50 model with new
data. The algorithm can be trained using P2/P3 type Amazon EC2 instances in single or
multiple machine configurations2.
The other options are either less effective or more complex to implement. Using an object
detection algorithm to identify a visitor’s hair in video frames would not segment the hair at
the pixel level, but only draw bounding boxes around the hair regions. This could result in
inaccurate or incomplete hair segmentation, especially if the hair is occluded or has
irregular shapes. Using an XGBoost algorithm to determine hair style and hair color would
require transforming the segmented hair images into numerical features, which could lose
some information or introduce noise. XGBoost is also not designed for image classification
tasks, and may not achieve high accuracy or performance.
Question # 45
Each morning, a data scientist at a rental car company creates insights about the previous day’s rental car reservation demands. The company needs to automate this process by streaming the data to Amazon S3 in near real time. The solution must detect high-demand rental cars at each of the company’s locations. The solution also must create a visualization dashboard that automatically refreshes with the most recent data.
Which solution will meet these requirements with the LEAST development time?
A. Use Amazon Kinesis Data Firehose to stream the reservation data directly to Amazon S3. Detect high-demand outliers by using Amazon QuickSight ML Insights. Visualize the data in QuickSight.
B. Use Amazon Kinesis Data Streams to stream the reservation data directly to Amazon S3. Detect high-demand outliers by using the Random Cut Forest (RCF) trained model in Amazon SageMaker. Visualize the data in Amazon QuickSight.
C. Use Amazon Kinesis Data Firehose to stream the reservation data directly to Amazon S3. Detect high-demand outliers by using the Random Cut Forest (RCF) trained model in Amazon SageMaker. Visualize the data in Amazon QuickSight.
D. Use Amazon Kinesis Data Streams to stream the reservation data directly to Amazon S3. Detect high-demand outliers by using Amazon QuickSight ML Insights. Visualize the data in QuickSight.
Answer: A
Explanation: The solution that will meet the requirements with the least development time
is to use Amazon Kinesis Data Firehose to stream the reservation data directly to Amazon
S3, detect high-demand outliers by using Amazon QuickSight ML Insights, and visualize
the data in QuickSight. This solution does not require any custom development or ML
domain expertise, as it leverages the built-in features of QuickSight ML Insights to
automatically run anomaly detection and generate insights on the streaming data.
QuickSight ML Insights can also create a visualization dashboard that automatically
refreshes with the most recent data, and allows the data scientist to explore the outliers
and their key drivers. References:
1: Simplify and automate anomaly detection in streaming data with Amazon
Lookout for Metrics | AWS Machine Learning Blog
2: Detecting outliers with ML-powered anomaly detection - Amazon QuickSight
3: Real-time Outlier Detection Over Streaming Data - IEEE Xplore
4: Towards a deep learning-based outlier detection … - Journal of Big Data
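For reference, streaming one reservation event through the Kinesis Data Firehose delivery stream in option A could look like this hedged boto3 sketch; the delivery stream name and record fields are hypothetical.
```python
import json
import boto3

firehose = boto3.client("firehose")

# Hypothetical reservation event; the delivery stream writes it to Amazon S3.
record = {
    "location_id": "SEA-01",
    "car_class": "SUV",
    "reserved_at": "2024-05-01T09:15:00Z",
}
firehose.put_record(
    DeliveryStreamName="reservation-events-to-s3",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```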
Question # 46
A company wants to conduct targeted marketing to sell solar panels to homeowners. The company wants to use machine learning (ML) technologies to identify which houses already have solar panels. The company has collected 8,000 satellite images as training data and will use Amazon SageMaker Ground Truth to label the data.
The company has a small internal team that is working on the project. The internal team has no ML expertise and no ML experience.
Which solution will meet these requirements with the LEAST amount of effort from the internal team?
A. Set up a private workforce that consists of the internal team. Use the private workforce and the SageMaker Ground Truth active learning feature to label the data. Use Amazon Rekognition Custom Labels for model training and hosting.
B. Set up a private workforce that consists of the internal team. Use the private workforce to label the data. Use Amazon Rekognition Custom Labels for model training and hosting.
C. Set up a private workforce that consists of the internal team. Use the private workforce and the SageMaker Ground Truth active learning feature to label the data. Use the SageMaker Object Detection algorithm to train a model. Use SageMaker batch transform for inference.
D. Set up a public workforce. Use the public workforce to label the data. Use the SageMaker Object Detection algorithm to train a model. Use SageMaker batch transform for inference.
Answer: A
Explanation: The solution A will meet the requirements with the least amount of effort
from the internal team because it uses Amazon SageMaker Ground Truth and Amazon
Rekognition Custom Labels, which are fully managed services that can provide the desired
functionality. The solution A involves the following steps:
Set up a private workforce that consists of the internal team. Use the private
workforce and the SageMaker Ground Truth active learning feature to label the
data. Amazon SageMaker Ground Truth is a service that can create high-quality
training datasets for machine learning by using human labelers. A private
workforce is a group of labelers that the company can manage and control. The
internal team can use the private workforce to label the satellite images as having
solar panels or not. The SageMaker Ground Truth active learning feature can
reduce the labeling effort by using a machine learning model to automatically label
the easy examples and only send the difficult ones to the human labelers1.
Use Amazon Rekognition Custom Labels for model training and hosting. Amazon
Rekognition Custom Labels is a service that can train and deploy custom machine
learning models for image analysis. Amazon Rekognition Custom Labels can use
the labeled data from SageMaker Ground Truth to train a model that can detect
solar panels in satellite images. Amazon Rekognition Custom Labels can also host
the model and provide an API endpoint for inference2.
The other options are not suitable because:
Option B: Setting up a private workforce that consists of the internal team, using
the private workforce to label the data, and using Amazon Rekognition Custom
Labels for model training and hosting will incur more effort from the internal team than using SageMaker Ground Truth active learning feature. The internal team will
have to label all the images manually, without the assistance of the machine
learning model that can automate some of the labeling tasks1.
Option C: Setting up a private workforce that consists of the internal team, using
the private workforce and the SageMaker Ground Truth active learning feature to
label the data, using the SageMaker Object Detection algorithm to train a model,
and using SageMaker batch transform for inference will incur more operational
overhead than using Amazon Rekognition Custom Labels. The company will have
to manage the SageMaker training job, the model artifact, and the batch transform
job. Moreover, SageMaker batch transform is not suitable for real-time inference,
as it processes the data in batches and stores the results in Amazon S33.
Option D: Setting up a public workforce, using the public workforce to label the
data, using the SageMaker Object Detection algorithm to train a model, and using
SageMaker batch transform for inference will incur more operational overhead and
cost than using a private workforce and Amazon Rekognition Custom Labels. A
public workforce is a group of labelers from Amazon Mechanical Turk, a
crowdsourcing marketplace. The company will have to pay the public workforce for
each labeling task, and it may not have full control over the quality and security of
the labeled data. The company will also have to manage the SageMaker training
job, the model artifact, and the batch transform job, as explained in option C4.
References:
1: Amazon SageMaker Ground Truth
2: Amazon Rekognition Custom Labels
3: Amazon SageMaker Object Detection
4: Amazon Mechanical Turk
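Once the Rekognition Custom Labels model is trained and its project version is started, getting predictions for a new satellite image could look like the hedged boto3 sketch below; the project version ARN, bucket, and object key are placeholders.
```python
import boto3

rekognition = boto3.client("rekognition")

# Detect whether the trained Custom Labels model sees solar panels in one image.
response = rekognition.detect_custom_labels(
    ProjectVersionArn=(
        "arn:aws:rekognition:us-east-1:111122223333:"
        "project/solar-panels/version/v1/1234567890"
    ),
    Image={"S3Object": {"Bucket": "satellite-images", "Name": "tiles/house_0001.png"}},
    MinConfidence=80,
)
for label in response["CustomLabels"]:
    print(label["Name"], round(label["Confidence"], 1))
```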
Question # 47
A finance company needs to forecast the price of a commodity. The company has compiled a dataset of historical daily prices. A data scientist must train various forecasting models on 80% of the dataset and must validate the efficacy of those models on the remaining 20% of the dataset.
How should the data scientist split the dataset into a training dataset and a validation dataset to compare model performance?
A. Pick a date so that 80% of the data points precede the date. Assign that group of data points as the training dataset. Assign all the remaining data points to the validation dataset.
B. Pick a date so that 80% of the data points occur after the date. Assign that group of data points as the training dataset. Assign all the remaining data points to the validation dataset.
C. Starting from the earliest date in the dataset, pick eight data points for the training dataset and two data points for the validation dataset. Repeat this stratified sampling until no data points remain.
D. Sample data points randomly without replacement so that 80% of the data points are in the training dataset. Assign all the remaining data points to the validation dataset.
Answer: A
Explanation: The best way to split the dataset into a
training dataset and a validation dataset is to pick a date so that 80% of the data points
precede the date and assign that group of data points as the training dataset. This method
preserves the temporal order of the data and ensures that the validation dataset reflects
the most recent trends and patterns in the commodity price. This is important for
forecasting models that rely on time series analysis and sequential data. The other
methods would either introduce bias or lose information by ignoring the temporal structure
of the data.
References:
Time Series Forecasting - Amazon SageMaker
Time Series Splitting - scikit-learn
Time Series Forecasting - Towards Data Science
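A hedged pandas sketch of the time-ordered split described in option A is shown below; the file name and the date column are hypothetical.
```python
import pandas as pd

# Hypothetical daily price file with a 'date' column.
df = (
    pd.read_csv("commodity_prices.csv", parse_dates=["date"])
    .sort_values("date")
    .reset_index(drop=True)
)

# Choose the date at the 80% position so that earlier points form the training set.
split_idx = int(len(df) * 0.8)
cutoff_date = df.loc[split_idx - 1, "date"]

train = df.iloc[:split_idx]
validation = df.iloc[split_idx:]
print(f"training up to {cutoff_date.date()}, validation rows: {len(validation)}")
```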
Question # 48
A chemical company has developed several machine learning (ML) solutions to identify chemical process abnormalities. The time series values of independent variables and the labels are available for the past 2 years and are sufficient to accurately model the problem. The regular operation label is marked as 0. The abnormal operation label is marked as 1. Process abnormalities have a significant negative effect on the company's profits. The company must avoid these abnormalities.
Which metrics will indicate an ML solution that will provide the GREATEST probability of detecting an abnormality?
A. Precision = 0.91, Recall = 0.6
B. Precision = 0.61, Recall = 0.98
C. Precision = 0.7, Recall = 0.9
D. Precision = 0.98, Recall = 0.8
Answer: B
Explanation: The metrics that will indicate an ML solution that will provide the greatest
probability of detecting an abnormality are precision and recall. Precision is the ratio of true
positives (TP) to the total number of predicted positives (TP + FP), where FP is false
positives. Recall is the ratio of true positives (TP) to the total number of actual positives (TP
+ FN), where FN is false negatives. A high precision means that the ML solution has a low
rate of false alarms, while a high recall means that the ML solution has a high rate of true
detections. For the chemical company, the goal is to avoid process abnormalities, which
are marked as 1 in the labels. Therefore, the company needs an ML solution that has a
high recall for the positive class, meaning that it can detect most of the abnormalities and
minimize the false negatives. Among the four options, option B has the highest recall for
the positive class, which is 0.98. This means that the ML solution can detect 98% of the
abnormalities and miss only 2%. Option B also has a reasonable precision for the positive
class, which is 0.61. This means that the ML solution has a false alarm rate of 39%, which
may be acceptable for the company, depending on the cost and benefit analysis. The other options have lower recall for the positive class, which means that they have higher false
negative rates, which can be more detrimental for the company than false positive rates.
References:
3: AWS Whitepaper - An Overview of Machine Learning on AWS
4: Precision and recall
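The two metrics can be computed directly with scikit-learn, as in this small sketch with made-up labels (1 = abnormal operation, 0 = regular operation):
```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground-truth labels and model predictions.
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 1, 0, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```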
Question # 49
A machine learning (ML) specialist uploads 5 TB of data to an Amazon SageMaker Studio environment. The ML specialist performs initial data cleansing. Before the ML specialist begins to train a model, the ML specialist needs to create and view an analysis report that details potential bias in the uploaded data.
Which combination of actions will meet these requirements with the LEAST operational overhead? (Choose two.)
A. Use SageMaker Clarify to automatically detect data bias.
B. Turn on the bias detection option in SageMaker Ground Truth to automatically analyze data features.
C. Use SageMaker Model Monitor to generate a bias drift report.
D. Configure SageMaker Data Wrangler to generate a bias report.
E. Use SageMaker Experiments to perform a data check.
Answer: A,D
Explanation: The combination of actions that will meet the requirements with the least
operational overhead is to use SageMaker Clarify to automatically detect data bias and to
configure SageMaker Data Wrangler to generate a bias report. SageMaker Clarify is a
feature of Amazon SageMaker that provides machine learning (ML) developers with tools
to gain greater insights into their ML training data and models. SageMaker Clarify can
detect potential bias during data preparation, after model training, and in your deployed
model. For instance, you can check for bias related to age in your dataset or in your trained
model and receive a detailed report that quantifies different types of potential bias1.
SageMaker Data Wrangler is another feature of Amazon SageMaker that enables you to
prepare data for machine learning (ML) quickly and easily. You can use SageMaker Data
Wrangler to identify potential bias during data preparation without having to write your own
code. You specify input features, such as gender or age, and SageMaker Data Wrangler
runs an analysis job to detect potential bias in those features. SageMaker Data Wrangler
then provides a visual report with a description of the metrics and measurements of
potential bias so that you can identify steps to remediate the bias2. The other actions either
require more customization (such as using SageMaker Model Monitor or SageMaker
Experiments) or do not meet the requirement of detecting data bias (such as using
SageMaker Ground Truth). References:
1: Bias Detection and Model Explainability – Amazon Web Services
2: Amazon SageMaker Data Wrangler – Amazon Web Services
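A hedged sketch of running a SageMaker Clarify pre-training bias job with the SageMaker Python SDK is shown below; the S3 paths, column names, positive label value, and facet are assumptions.
```python
from sagemaker import Session, get_execution_role
from sagemaker.clarify import BiasConfig, DataConfig, SageMakerClarifyProcessor

session = Session()
processor = SageMakerClarifyProcessor(
    role=get_execution_role(), instance_count=1,
    instance_type="ml.m5.xlarge", sagemaker_session=session,
)

# Input data and output location for the bias report (paths/columns are placeholders).
data_config = DataConfig(
    s3_data_input_path="s3://my-bucket/cleansed-data/",
    s3_output_path="s3://my-bucket/bias-report/",
    label="target",
    headers=["target", "age", "income", "feature_a"],
    dataset_type="text/csv",
)

# Which label value is the positive outcome and which feature is the sensitive facet.
bias_config = BiasConfig(label_values_or_threshold=[1], facet_name="age")

processor.run_pre_training_bias(
    data_config=data_config, data_bias_config=bias_config, methods="all",
)
```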
Question # 50
A company uses sensors on devices such as motor engines and factory machines to measure parameters such as temperature and pressure. The company wants to use the sensor data to predict equipment malfunctions and reduce service outages.
The machine learning (ML) specialist needs to gather the sensor data to train a model to predict device malfunctions. The ML specialist must ensure that the data does not contain outliers before training the model.
How can the ML specialist meet these requirements with the LEAST operational overhead?
A. Load the data into an Amazon SageMaker Studio notebook. Calculate the first and third quartiles. Use a SageMaker Data Wrangler data flow to remove only values that are outside of those quartiles.
B. Use an Amazon SageMaker Data Wrangler bias report to find outliers in the dataset. Use a Data Wrangler data flow to remove outliers based on the bias report.
C. Use an Amazon SageMaker Data Wrangler anomaly detection visualization to find outliers in the dataset. Add a transformation to a Data Wrangler data flow to remove outliers.
D. Use Amazon Lookout for Equipment to find and remove outliers from the dataset.
Answer: C
Explanation: Amazon SageMaker Data Wrangler is a tool that helps data scientists and
ML developers to prepare data for ML. One of the features of Data Wrangler is the anomaly
detection visualization, which uses an unsupervised ML algorithm to identify outliers in the
dataset based on statistical properties. The ML specialist can use this feature to quickly
explore the sensor data and find any anomalous values that may affect the model
performance. The ML specialist can then add a transformation to a Data Wrangler data
flow to remove the outliers from the dataset. The data flow can be exported as a script or a
pipeline to automate the data preparation process. This option requires the least
operational overhead compared to the other options.
References:
Amazon SageMaker Data Wrangler - Amazon Web Services (AWS)
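The Data Wrangler transform is configured visually, but the underlying idea of unsupervised outlier removal can be illustrated with the hedged sketch below, which uses scikit-learn's IsolationForest as a stand-in (not the exact algorithm Data Wrangler uses); the file name, columns, and contamination rate are hypothetical.
```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical sensor export with temperature and pressure readings.
df = pd.read_csv("sensor_readings.csv")
features = df[["temperature", "pressure"]]

# Flag roughly 1% of the rows as outliers (contamination is an assumption).
detector = IsolationForest(contamination=0.01, random_state=42)
df["is_outlier"] = detector.fit_predict(features) == -1

cleaned = df[~df["is_outlier"]].drop(columns="is_outlier")
print(f"removed {df['is_outlier'].sum()} outlier rows")
```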
Question # 51
A data scientist wants to use Amazon Forecast to build a forecasting model for inventory demand for a retail company. The company has provided a dataset of historic inventory demand for its products as a .csv file stored in an Amazon S3 bucket. The table below shows a sample of the dataset.
How should the data scientist transform the data?
A. Use ETL jobs in AWS Glue to separate the dataset into a target time series dataset and an item metadata dataset. Upload both datasets as .csv files to Amazon S3.
B. Use a Jupyter notebook in Amazon SageMaker to separate the dataset into a related time series dataset and an item metadata dataset. Upload both datasets as tables in Amazon Aurora.
C. Use AWS Batch jobs to separate the dataset into a target time series dataset, a related time series dataset, and an item metadata dataset. Upload them directly to Forecast from a local machine.
D. Use a Jupyter notebook in Amazon SageMaker to transform the data into the optimized protobuf recordIO format. Upload the dataset in this format to Amazon S3.
Answer: A
Explanation: Amazon Forecast requires the input data to be in a specific format. The data
scientist should use ETL jobs in AWS Glue to separate the dataset into a target time series
dataset and an item metadata dataset. The target time series dataset should contain the
timestamp, item_id, and demand columns, while the item metadata dataset should contain
the item_id, category, and lead_time columns. Both datasets should be uploaded as .csv
files to Amazon S3 . References:
How Amazon Forecast Works - Amazon Forecast
Choosing Datasets - Amazon Forecast
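Outside of the Glue job itself, the column split can be illustrated with a short pandas sketch; the column and file names follow the ones mentioned in the explanation and are otherwise hypothetical, and the header row is omitted because Forecast takes the schema from the dataset definition rather than from the CSV.
```python
import pandas as pd

# Hypothetical combined export of the historic demand data.
df = pd.read_csv("historic_demand.csv")

# Target time series dataset: timestamp, item_id, demand.
target_ts = df[["timestamp", "item_id", "demand"]]
target_ts.to_csv("target_time_series.csv", index=False, header=False)

# Item metadata dataset: one row of static attributes per item.
item_metadata = df[["item_id", "category", "lead_time"]].drop_duplicates()
item_metadata.to_csv("item_metadata.csv", index=False, header=False)
```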
Question # 52
The chief editor for a product catalog wants the research and development team to build a machine learning system that can be used to detect whether or not individuals in a collection of images are wearing the company's retail brand. The team has a set of training data.
Which machine learning algorithm should the researchers use that BEST meets their requirements?
A. Latent Dirichlet Allocation (LDA)
B. Recurrent neural network (RNN)
C. K-means
D. Convolutional neural network (CNN)
Answer: D
Explanation: The problem of detecting whether or not individuals in a collection of images
are wearing the company’s retail brand is an example of image recognition, which is a type
of machine learning task that identifies and classifies objects in an image. Convolutional
neural networks (CNNs) are a type of machine learning algorithm that are well-suited for
image recognition, as they can learn to extract features from images and handle variations
in size, shape, color, and orientation of the objects. CNNs consist of multiple layers that
perform convolution, pooling, and activation operations on the input images, resulting in a
high-level representation that can be used for classification or detection. Therefore, option
D is the best choice for the machine learning algorithm that meets the requirements of the
chief editor.
Option A is incorrect because latent Dirichlet allocation (LDA) is a type of machine learning
algorithm that is used for topic modeling, which is a task that discovers the hidden themes
or topics in a collection of text documents. LDA is not suitable for image recognition, as it
does not preserve the spatial information of the pixels. Option B is incorrect because
recurrent neural networks (RNNs) are a type of machine learning algorithm that are used
for sequential data, such as text, speech, or time series. RNNs can learn from the temporal
dependencies and patterns in the input data, and generate outputs that depend on the
previous states. RNNs are not suitable for image recognition, as they do not capture the
spatial dependencies and patterns in the input images. Option C is incorrect because k-means
is a type of machine learning algorithm that is used for clustering, which is a task
that groups similar data points together based on their features. K-means is not suitable for
image recognition, as it does not perform classification or detection of the objects in the
images.
References:
Image Recognition Software - ML Image & Video Analysis - Amazon …
Image classification and object detection using Amazon Rekognition … AWS
Amazon Rekognition - Deep Learning Face and Image Recognition …
GitHub - awslabs/aws-ai-solution-kit: Machine Learning APIs for common …
Meet iNaturalist, an AWS-powered nature app that helps you identify …
Question # 53
A wildlife research company has a set of images of lions and cheetahs. The company created a dataset of the images. The company labeled each image with a binary label that indicates whether an image contains a lion or a cheetah. The company wants to train a model to identify whether new images contain a lion or a cheetah.
Which Amazon SageMaker algorithm will meet this requirement?
A. XGBoost
B. Image Classification - TensorFlow
C. Object Detection - TensorFlow
D. Semantic segmentation - MXNet
Answer: B
Explanation: The best Amazon SageMaker algorithm for this task is Image Classification -
TensorFlow. This algorithm is a supervised learning algorithm that supports transfer
learning with many pretrained models from the TensorFlow Hub. Transfer learning allows
the company to fine-tune one of the available pretrained models on their own dataset, even
if a large amount of image data is not available. The image classification algorithm takes an
image as input and outputs a probability for each provided class label. The company can
choose from a variety of models, such as MobileNet, ResNet, or Inception, depending on
their accuracy and speed requirements. The algorithm also supports distributed training.
References:
Amazon SageMaker Provides New Built-in TensorFlow Image Classification
Algorithm
Image Classification with ResNet :: Amazon SageMaker Workshop
Image classification on Amazon SageMaker | by Julien Simon - Medium
Question # 54
A company’s data scientist has trained a new machine learning model that performs better on test data than the company’s existing model performs in the production environment. The data scientist wants to replace the existing model that runs on an Amazon SageMaker endpoint in the production environment. However, the company is concerned that the new model might not work well on the production environment data.
The data scientist needs to perform A/B testing in the production environment to evaluate whether the new model performs well on production environment data.
Which combination of steps must the data scientist take to perform the A/B testing? (Choose two.)
A. Create a new endpoint configuration that includes a production variant for each of the two models.
B. Create a new endpoint configuration that includes two target variants that point to different endpoints.
C. Deploy the new model to the existing endpoint.
D. Update the existing endpoint to activate the new model.
E. Update the existing endpoint to use the new endpoint configuration.
Answer: A,E
Explanation: The combination of steps that the data scientist must take to perform the A/B
testing are to create a new endpoint configuration that includes a production variant for
each of the two models, and update the existing endpoint to use the new endpoint
configuration. This approach will allow the data scientist to deploy both models on the same endpoint and split the inference traffic between them based on a specified
distribution.
Amazon SageMaker is a fully managed service that provides developers and data
scientists the ability to quickly build, train, and deploy machine learning models. Amazon
SageMaker supports A/B testing on machine learning models by allowing the data scientist
to run multiple production variants on an endpoint. A production variant is a version of a
model that is deployed on an endpoint. Each production variant has a name, a machine
learning model, an instance type, an initial instance count, and an initial weight. The initial
weight determines the percentage of inference requests that the variant will handle. For
example, if there are two variants with weights of 0.5 and 0.5, each variant will handle 50%
of the requests. The data scientist can use production variants to test models that have
been trained using different training datasets, algorithms, and machine learning
frameworks; test how they perform on different instance types; or a combination of all of the
above1.
To perform A/B testing on machine learning models, the data scientist needs to create a
new endpoint configuration that includes a production variant for each of the two models.
An endpoint configuration is a collection of settings that define the properties of an
endpoint, such as the name, the production variants, and the data capture configuration.
The data scientist can use the Amazon SageMaker console, the AWS CLI, or the AWS
SDKs to create a new endpoint configuration. The data scientist needs to specify the name,
model name, instance type, initial instance count, and initial variant weight for each
production variant in the endpoint configuration2.
After creating the new endpoint configuration, the data scientist needs to update the
existing endpoint to use the new endpoint configuration. Updating an endpoint is the
process of deploying a new endpoint configuration to an existing endpoint. Updating an
endpoint does not affect the availability or scalability of the endpoint, as Amazon
SageMaker creates a new endpoint instance with the new configuration and switches the
DNS record to point to the new instance when it is ready. The data scientist can use the
Amazon SageMaker console, the AWS CLI, or the AWS SDKs to update an endpoint. The
data scientist needs to specify the name of the endpoint and the name of the new endpoint
configuration to update the endpoint3.
The other options are either incorrect or unnecessary. Creating a new endpoint
configuration that includes two target variants that point to different endpoints is not
possible, as target variants are only used to invoke a specific variant on an endpoint, not to
define an endpoint configuration. Deploying the new model to the existing endpoint would
replace the existing model, not run it side-by-side with the new model. Updating the
existing endpoint to activate the new model is not a valid operation, as there is no
activation parameter for an endpoint.
References:
1: A/B Testing ML models in production using Amazon SageMaker | AWS Machine Learning Blog
2: Create an Endpoint Configuration - Amazon SageMaker
3: Update an Endpoint - Amazon SageMaker
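A hedged boto3 sketch of steps A and E is shown below: create an endpoint configuration with one production variant per model, then switch the existing endpoint to that configuration. The model, configuration, and endpoint names and the 90/10 traffic split are placeholders.
```python
import boto3

sm = boto3.client("sagemaker")

# New endpoint configuration with a production variant for each model.
sm.create_endpoint_config(
    EndpointConfigName="prod-ab-test-config",
    ProductionVariants=[
        {
            "VariantName": "existing-model",
            "ModelName": "prod-model-v1",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.9,  # 90% of traffic stays on the current model
        },
        {
            "VariantName": "new-model",
            "ModelName": "prod-model-v2",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,  # 10% of traffic goes to the candidate model
        },
    ],
)

# Point the existing endpoint at the new configuration to start the A/B test.
sm.update_endpoint(
    EndpointName="prod-inference-endpoint",
    EndpointConfigName="prod-ab-test-config",
)
```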
Question # 55
A data science team is working with a tabular dataset that the team stores in Amazon S3. The team wants to experiment with different feature transformations such as categorical feature encoding. Then the team wants to visualize the resulting distribution of the dataset. After the team finds an appropriate set of feature transformations, the team wants to automate the workflow for feature transformations.
Which solution will meet these requirements with the MOST operational efficiency?
A. Use Amazon SageMaker Data Wrangler preconfigured transformations to explore feature transformations. Use SageMaker Data Wrangler templates for visualization. Export the feature processing workflow to a SageMaker pipeline for automation.
B. Use an Amazon SageMaker notebook instance to experiment with different feature transformations. Save the transformations to Amazon S3. Use Amazon QuickSight for visualization. Package the feature processing steps into an AWS Lambda function for automation.
C. Use AWS Glue Studio with custom code to experiment with different feature transformations. Save the transformations to Amazon S3. Use Amazon QuickSight for visualization. Package the feature processing steps into an AWS Lambda function for automation.
D. Use Amazon SageMaker Data Wrangler preconfigured transformations to experiment with different feature transformations. Save the transformations to Amazon S3. Use Amazon QuickSight for visualization. Package each feature transformation step into a separate AWS Lambda function. Use AWS Step Functions for workflow automation.
Answer: A
Explanation: The solution A will meet the requirements with the most operational
efficiency because it uses Amazon SageMaker Data Wrangler, which is a service that
simplifies the process of data preparation and feature engineering for machine learning.
The solution A involves the following steps:
Use Amazon SageMaker Data Wrangler preconfigured transformations to explore
feature transformations. Amazon SageMaker Data Wrangler provides a visual
interface that allows data scientists to apply various transformations to their tabular
data, such as encoding categorical features, scaling numerical features, imputing
missing values, and more. Amazon SageMaker Data Wrangler also supports
custom transformations using Python code or SQL queries1.
Use SageMaker Data Wrangler templates for visualization. Amazon SageMaker
Data Wrangler also provides a set of templates that can generate visualizations of
the data, such as histograms, scatter plots, box plots, and more. These
visualizations can help data scientists to understand the distribution and
characteristics of the data, and to compare the effects of different feature
transformations1.
Export the feature processing workflow to a SageMaker pipeline for automation.
Amazon SageMaker Data Wrangler can export the feature processing workflow as
a SageMaker pipeline, which is a service that orchestrates and automates
machine learning workflows. A SageMaker pipeline can run the feature processing
steps as a preprocessing step, and then feed the output to a training step or an
inference step. This can reduce the operational overhead of managing the feature
processing workflow and ensure its consistency and reproducibility2.
The other options are not suitable because:
Option B: Using an Amazon SageMaker notebook instance to experiment with
different feature transformations, saving the transformations to Amazon S3, using
Amazon QuickSight for visualization, and packaging the feature processing steps
into an AWS Lambda function for automation will incur more operational overhead
than using Amazon SageMaker Data Wrangler. The data scientist will have to
write the code for the feature transformations, the data storage, the data
visualization, and the Lambda function. Moreover, AWS Lambda has limitations on
the execution time, memory size, and package size, which may not be sufficient
for complex feature processing tasks3.
Option C: Using AWS Glue Studio with custom code to experiment with different
feature transformations, saving the transformations to Amazon S3, using Amazon
QuickSight for visualization, and packaging the feature processing steps into an
AWS Lambda function for automation will incur more operational overhead than
using Amazon SageMaker Data Wrangler. AWS Glue Studio is a visual interface
that allows data engineers to create and run extract, transform, and load (ETL)
jobs on AWS Glue. However, AWS Glue Studio does not provide preconfigured
transformations or templates for feature engineering or data visualization. The data
scientist will have to write custom code for these tasks, as well as for the Lambda
function. Moreover, AWS Glue Studio is not integrated with SageMaker pipelines,
and it may not be optimized for machine learning workflows4.
Option D: Using Amazon SageMaker Data Wrangler preconfigured
transformations to experiment with different feature transformations, saving the
transformations to Amazon S3, using Amazon QuickSight for visualization, packaging each feature transformation step into a separate AWS Lambda function,
and using AWS Step Functions for workflow automation will incur more operational
overhead than using Amazon SageMaker Data Wrangler. The data scientist will
have to create and manage multiple AWS Lambda functions and AWS Step
Functions, which can increase the complexity and cost of the solution. Moreover,
AWS Lambda and AWS Step Functions may not be compatible with SageMaker
pipelines, and they may not be optimized for machine learning workflows5.
References:
1: Amazon SageMaker Data Wrangler
2: Amazon SageMaker Pipelines
3: AWS Lambda
4: AWS Glue Studio
5: AWS Step Functions
Question # 56
A Machine Learning Specialist is training a model to identify the make and model of vehicles in images. The Specialist wants to use transfer learning and an existing model trained on images of general objects. The Specialist collated a large custom dataset of pictures containing different vehicle makes and models.
What should the Specialist do to initialize the model to re-train it with the custom data?
A. Initialize the model with random weights in all layers including the last fully connected layer.
B. Initialize the model with pre-trained weights in all layers and replace the last fully connected layer.
C. Initialize the model with random weights in all layers and replace the last fully connected layer.
D. Initialize the model with pre-trained weights in all layers including the last fully connected layer.
Answer: B
Explanation: Transfer learning is a technique that allows us to use a model trained for a
certain task as a starting point for a machine learning model for a different task. For image
classification, a common practice is to use a pre-trained model that was trained on a large
and general dataset, such as ImageNet, and then customize it for the specific task. One
way to customize the model is to replace the last fully connected layer, which is responsible
for the final classification, with a new layer that has the same number of units as the
number of classes in the new task. This way, the model can leverage the features learned
by the previous layers, which are generic and useful for many image recognition tasks, and
learn to map them to the new classes. The new layer can be initialized with random
weights, and the rest of the model can be initialized with the pre-trained weights. This
method is also known as feature extraction, as it extracts meaningful features from the pretrained
model and uses them for the new task. References:
Transfer learning and fine-tuning
Deep transfer learning for image classification: a survey
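Option B can be illustrated with a short TensorFlow/Keras sketch; the framework, the ResNet-50 backbone, and the number of vehicle classes are assumed examples rather than the Specialist's actual setup. The pre-trained weights are loaded into every layer except the classifier, which is replaced by a new fully connected layer sized for the new task.
```python
import tensorflow as tf

NUM_CLASSES = 40  # hypothetical number of vehicle make/model classes

# Pre-trained ImageNet weights in all layers except the original classifier head.
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False  # optionally freeze the feature extractor at first

# Replace the last fully connected layer with one sized for the new task.
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(base.output)
model = tf.keras.Model(inputs=base.input, outputs=outputs)

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(train_dataset, validation_data=val_dataset, epochs=5)
```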