
Tech News - Data

1209 Articles

Tableau migrates to the cloud: how we evaluated our modernization from What's New

Anonymous
26 Oct 2020
7 min read
Erin Gengo, Manager, Analytics Platforms, Tableau

Cloud technologies make it easier, faster, and more reliable to ingest, store, analyze, and share data sets that range in type and size. The cloud also provides strong tools for governance and security, which enable organizations to move faster on analytics initiatives. Like many of our customers, we at Tableau wanted to realize these benefits. Having done the heavy lifting to move our data into the cloud, we now have the opportunity to reflect and share our migration story.

As we embarked on the journey of selecting and moving to a cloud-driven data platform from a conventional on-premises solution, we were in a unique position. With our mission to help people see and understand data, we’ve always encouraged employees to use Tableau to make faster, better decisions. Between our culture of democratizing data and rapid, significant growth, we consequently had servers running under people’s desks, powering data sources that were often in conflict. It also created a messy server environment where we struggled to maintain proper documentation, apply standard governance practices, and manage downstream data to avoid duplication. When it came time to migrate, this put pressure on analysts and strained resources. Despite some of our unique circumstances, we know Tableau isn’t alone in facing some of these challenges—from deciding what and when to migrate to the cloud, to how to better govern self-service analytics and arrive at a single source of truth. We’re pleased to share our lessons learned so customers can make informed decisions along their own cloud journeys.

Our cloud evaluation measures

Because the cloud is now the preferred place for businesses to run their IT infrastructure, choosing to shift our enterprise analytics to a SaaS environment (Tableau Online) was a key first step. After that, we needed to carefully evaluate cloud platforms and choose the best solution for hosting our data. The top criteria we focused on during the evaluation phase were:

- Performance: The platform had to be highly performant to support everything from ad-hoc analysis to high-volume, regular reporting across diverse use cases. We wanted fewer “knobs” to turn and an infrastructure that adapted to usage patterns, responded dynamically, and included automatic encryption.
- Scale: We wanted scalable compute and storage that would adjust to changes in demand—whether we were in a busy time of closing financial books for the quarter or faced with quickly and unpredictably shifting needs—like an unexpected pandemic. Whatever we chose needed compute power that scaled to match our data workloads.
- Governance and security: We’re a data-driven organization, but because much of that data wasn’t always effectively governed, we knew we were missing out on value that the data held. Thus, we required technology that supported enterprise governance as well as the increased security that our growing business demands.
- Flexibility: We needed the ability to scale infrastructure up or down to meet performance and cost needs. We also wanted a cloud platform that matched Tableau’s handling of structured, unstructured, or semi-structured data types to increase performance across our variety of analytics use cases.
- Simplicity: Tableau sought a solution that was easy to use and manage across skill levels, including teams with seasoned engineers or teams without them that managed their data pipelines through Tableau Prep. If they quickly saw the benefit of the cloud architecture to streamline workflows and reduce their time to insight, it would help them focus on creating data context and support governance that enabled self-service—a win-win for all.
- Cost-efficiency: A fixed database infrastructure can create large overhead costs. Knowing many companies purchase their data warehouse to meet the highest-demand timeframes, we needed high performance and capacity, but not 24/7. That could cost Tableau millions of dollars of unused capacity.

Measurement and testing considerations

We needed to deploy at scale and account for diverse use cases, as well as quickly get our people answers from their data to make important, in-the-moment decisions. After narrowing our choices, we followed that with testing to ensure the cloud solution performed as efficiently as we needed it to. We tested:

- Dashboard load times (across more than 20,000 Tableau vizzes)
- Data import speeds
- Compute power
- Extract refreshes
- How fast the solution allows our London and Singapore data centers to access data that we have stored in our US-West-2a regional data center

We advise similar testing for organizations like us, but we also suggest asking some other questions to guarantee the solution aligns with your top priorities and concerns:

- What could the migration path look like from your current solution to the cloud? (For us, SQL Server to Snowflake.)
- What’s the learning curve like for data engineers—both for migration and afterward?
- Is the cost structure of the cloud solution transparent, so you can somewhat accurately forecast or estimate your costs?
- Will the solution lower administration and maintenance?
- How does the solution fit with your current development practices and methods, and what is the impact on processes that may have to change?
- How will you handle authentication?
- How will this solution fit with your larger vendor and partner ecosystem?

Tableau’s choice: Snowflake

There isn’t a one-size-fits-all approach, and it’s worth exploring various cloud data platforms. We found that by prioritizing requirements and making careful, conscious choices about where we wouldn’t make any sacrifices, a few vendors rose to the top as our shortlist for evaluation. In our data-heavy, dynamic environment where needs and situations change on a dime, we found Snowflake met our needs and then some. It is feature-rich, with a dynamic, collaborative environment that brings Tableau together—sales, marketing, finance, product development, and executives who must quickly make decisions for the health, safety, and progress of the business.

“This process had a transformational effect on my team, who spent years saying ‘no’ when we couldn’t meet analytics demands across Tableau,” explained Phillip Cheung, a product manager who helped drive the evaluation and testing process. “Now we can easily respond to any request for data in a way that fully supports self-service analytics with Tableau.”

Cloud adoption, accelerated

With disruption on a global scale, the business landscape is changing like we’ve never experienced. Every organization, government agency, and individual has been impacted by COVID-19. We’re all leaning into data for answers and clarity to move ahead. And through these times of rapid change, the cloud has proven even more important than we thought. As a result of the pandemic, organizations are accelerating and prioritizing cloud adoption and migration efforts.
According to a recent IDC survey, almost 50 percent of technology decision makers expect to moderately or significantly increase demand for cloud computing as a result of the pandemic. Meredith Whalen, chief research officer, said, “A number of CIOs tell us their cloud migration investments paid off during the pandemic as they were able to easily scale up or down.” (Source: IDC. COVID-19 Brings New C-Suite Priorities, May 2020.) We know that many of our customers are considering or already increasing their cloud investments. And we hope our lessons learned will help others gain useful perspective in moving to the cloud, and to ultimately grow more adaptive, resilient, and successful as they plan for the future. So stay tuned—as part of this continued series, we’ll also be sharing takeaways and experiences from IT and end users during key milestones as we moved our data and analytics to the cloud.
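The team's own load-test harness isn't published, but the kind of measurement described earlier (timing dashboard and query loads and comparing candidates) can be sketched in a few lines of Python. This is a minimal, hedged sketch: sqlite3 stands in for a real warehouse connection such as Snowflake's Python connector, and the table and queries are invented for illustration.

import sqlite3
import statistics
import time

def time_queries(conn, queries, repeats=5):
    """Run each query several times and report median and worst latency in milliseconds."""
    report = {}
    for name, sql in queries.items():
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            conn.execute(sql).fetchall()
            samples.append((time.perf_counter() - start) * 1000)
        report[name] = {"median_ms": round(statistics.median(samples), 2),
                        "worst_ms": round(max(samples), 2)}
    return report

# sqlite3 is only a stand-in for the candidate platform's connection object.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("emea", 10.0), ("apac", 12.5), ("amer", 9.9)] * 1000)

print(time_queries(conn, {
    "regional_rollup": "SELECT region, SUM(amount) FROM sales GROUP BY region",
    "row_count": "SELECT COUNT(*) FROM sales",
}))

Running the same query set against each shortlisted platform, and against each remote region, gives comparable numbers for the sort of side-by-side evaluation the article describes.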


Weekly Digest, October 19 from Featured Blog Posts - Data Science Central

Matthew Emerick
18 Oct 2020
1 min read
Monday newsletter published by Data Science Central. Previous editions can be found here. The contribution flagged with a + is our selection for the picture of the week. To subscribe, follow this link.

Featured Resources and Technical Contributions

- Best Models For Multi-step Time Series Modeling
- Types of Variables in Data Science in One Picture
- A quick demonstration of polling confidence interval calculations using simulation
- Why you should NEVER run a Logistic Regression (unless you have to)
- Cross-validation and hyperparameter tuning
- 5 Great Data Science Courses
- Complete Hands-Off Automated Machine Learning
- Why You Should Learn Sitecore CMS?

Featured Articles

- AI is Driving Software 2.0… with Minimal Human Intervention
- Data Observability: How to Fix Your Broken Data Pipelines
- Applications of Machine Learning in FinTech
- Where synthetic data brings value
- Why Fintech is the Future of Banking?
- Real Estate: How it is Impacted by Business Intelligence
- Determining How Cloud Computing Benefits Data Science
- Advantages And Disadvantages Of Mobile Banking

Picture of the Week

Source: article flagged with a +

To make sure you keep getting these emails, please add mail@newsletter.datasciencecentral.com to your address book or whitelist us. To subscribe, click here. Follow us: Twitter | Facebook.


Genius Tool to Compare Best Time-Series Models For Multi-step Time Series Modeling from Featured Blog Posts - Data Science Central

Matthew Emerick
18 Oct 2020
17 min read
Predict the Number of Active Cases in the Covid-19 Pandemic based on Medical Facilities (Volume of Testing, ICU beds, Ventilators, Isolation Units, etc.) using Multi-variate, LSTM-based Multi-Step Forecasting Models

Introduction and Motivation

The intensity of the growth of the Covid-19 pandemic worldwide has propelled researchers to evaluate the best machine learning model that could predict the number of people affected in the distant future by considering the current statistics and predicting the near-future terms in subsequent stages. While different univariate models like ARIMA/SARIMA and traditional time-series models are capable of predicting the number of active cases, daily recoveries, and number of deaths, they do not take into consideration other time-varying factors like medical facilities (volume of testing, ICU beds, hospital admissions, ventilators, isolation units, quarantine centres, etc.). As these factors become important, we build a predictive model that can predict the Number of Active Cases, Deaths, and Recoveries based on the change in medical facilities as well as other changes in infrastructure. Here in this blog, we try to model multi-step time-series prediction using deep learning models on the basis of medical information available for different states of India.

Multi-Step Time Series Prediction

A typical multi-step predictive model looks like the figure below, where each of the predicted outcomes from the previous state is treated as the next state's input to derive the outcome for the second state, and so forth. (Figure: multi-step prediction, from the TensorFlow structured-data time-series tutorial.)

Deep Learning-based Multi-variate Time Series Training and Prediction

The following steps are involved in selecting the best deep learning model for time-series based single/multi-step prediction:

- Feeding multi-variate data from a single source, or from aggregated sources available directly from the cloud or other third-party providers, into the ML modeling data ingestion system.
- Cleaning, preprocessing, and feature engineering of the data, involving scaling and normalization.
- Conversion of the data to a supervised time-series.
- Feeding the data to a deep learning training source that can train different time-series models like LSTM, CNN, BI-LSTM, and CNN+LSTM using different combinations of hidden layers, neurons, batch size, and other hyper-parameters.
- Forecasting the near term or the far distant term in the future, using Single-Step or Multi-Step Forecasting respectively.
- Evaluation of error metrics (MAPE, MAE, ME, RMSE, MPE) by comparing predictions with the actual data, when it comes in.
- Re-training the model and continuous improvement when the error exceeds its threshold.

Import Necessary Tensorflow Libraries

The code snippet gives an overview of the necessary libraries required for TensorFlow:

from tensorflow.python.keras.layers import Dense, LSTM, RepeatVector, TimeDistributed, Flatten, Bidirectional
from tensorflow.python.keras import Sequential
from tensorflow.python.keras.layers.convolutional import Conv1D, Conv2D, MaxPooling1D, ConvLSTM2D

Data Loading and Selecting Features

As Delhi had high Covid-19 case counts, here we model different DL models for the DELHI state (the National Capital of India). Further, we keep the scope of dates from 25th March to 6th June 2020. Data till 29th April has been used for training, whereas data from 30th April to 6th June has been used for testing/prediction.
The test data has been used to predict for 7 days, across 3 subsequent stages of prediction. The code below first splits the data into a 70:30 ratio between training and testing (by finding the closest multiple of 7), and then restructures each set into weekly samples of data.

def split_dataset(data):
    # split into standard weeks
    print(np.shape(data))
    split_factor = int((np.shape(data)[0] * 0.7))
    print("Split Factor no is", split_factor)
    m = 7
    trn_close_no = closestNumber(split_factor, m)
    te_close_no = closestNumber((np.shape(data)[0] - split_factor), m)
    train, test = data[0:trn_close_no], data[trn_close_no:(trn_close_no + te_close_no)]
    print("Initials Train-Test Split --", np.shape(train), np.shape(test))
    len_train = np.shape(train)[0]
    len_test = np.shape(test)[0]
    # restructure into windows of weekly data
    train = array(split(train[0:len_train], len(train[0:len_train]) / 7))
    test = array(split(test, len(test) / 7))
    print("Final Train-Test Split --", np.shape(train), np.shape(test))
    return train, test

Initials Train-Test Split -- (49, 23) (21, 23) ----- Training and test data sets
Final Train-Test Split -- (7, 7, 23) (3, 7, 23) ----- Train and test data sets arranged into 7 and 3 weekly samples respectively

The data set and the features have been scaled using Min-Max Scaler:

scaler = MinMaxScaler(feature_range=(0, 1))
scaled_dataset = scaler.fit_transform(dataset)

Convert Time-Series to a Supervised DataSet

The tricky part in converting the time-series to a supervised time-series for multi-step prediction lies in incorporating the number of past days (i.e., the historic data) that the weekly data has to consider. The series derived by considering historic data is considered 7 times during training iterations and 3 times during testing iterations (as it got split into (7, 7, 23) and (3, 7, 23), where 22 is the number of input features with one predicted output). This series built using historic data helps the model learn and predict any day of the week.

Note 1: This is the most important step in formulating time-series data for a multi-step model. The snippet below demonstrates what is described above.

# convert history into inputs and outputs
def to_supervised(train, n_input, n_out=7):
    # flatten data
    data = train.reshape((train.shape[0] * train.shape[1], train.shape[2]))
    X, y = list(), list()
    in_start = 0
    # step over the entire history one time step at a time
    for _ in range(len(data)):
        # define the end of the input sequence
        in_end = in_start + n_input
        out_end = in_end + n_out
        # ensure we have enough data for this instance
        if out_end <= len(data):
            X.append(data[in_start:in_end, :])
            y.append(data[in_end:out_end, 0])
        # move along one time step
        in_start += 1
    return array(X), array(y)

Training Different Deep Learning Models using Tensorflow

In this section, we describe how we train different DL models using TensorFlow's Keras APIs.

Convolution Neural Network (CNN Model)

The following figure recalls the structure of a Convolution Neural Network (CNN); the code snippet shows how a 1D CNN with 16 filters and a kernel size of 3 has been used to train the network over 7 steps, where each step covers one week of 7 days.
Source # train CNN model def build_model_cnn(train, n_input): # prepare data train_x, train_y = to_supervised(train, n_input) # define parameters verbose, epochs, batch_size = 0, 200, 4 n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1] # define model model = Sequential() model.add(Conv1D(filters=16, kernel_size=3, activation='relu', input_shape=(n_timesteps,n_features))) model.add(MaxPooling1D(pool_size=2))model.add(Flatten()) model.add(Dense(10, activation='relu')) model.add(Dense(n_outputs)) model.compile(loss='mse', optimizer='adam') # fit network model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose) return model                                                                  CNN LSTM The following code snippet demonstrates how we train an LSTM model, plot the training and validation loss, before making a prediction. # train LSTM model def build_model_lstm(train, n_input): # prepare data train_x, train_y = to_supervised(train, n_input) print(np.shape(train_x)) print(np.shape(train_y)) # define parameters verbose, epochs, batch_size = 0, 50, 16 n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1] # reshape output into [samples, timesteps, features] train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1)) # define model model = Sequential() model.add(LSTM(200, activation='relu', input_shape=(n_timesteps, n_features))) model.add(RepeatVector(n_outputs)) model.add(LSTM(200, activation='relu', return_sequences=True)) model.add(TimeDistributed(Dense(100, activation='relu'))) model.add(TimeDistributed(Dense(1))) model.compile(loss='mse', optimizer='adam') # fit network model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose) return model The below figure illustrates the Actual vs Predicted Outcome of the Multi-Step LSTM model after the predicted outcome has been inverse-transformed (to remove the effect of scaling).                                                                       LSTM Bi-Directional LSTM The following code snippet demonstrates how we train a BI-LSTM model, plot the training and validation loss, before making a prediction. Source # train Bi-Directionsl LSTM model def build_model_bi_lstm(train, n_input):# prepare data train_x, train_y = to_supervised(train, n_input) print(np.shape(train_x)) print(np.shape(train_y)) # define parameters verbose, epochs, batch_size = 0, 50, 16 n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]# reshape output into [samples, timesteps, features] train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1)) # define model model = Sequential() model.add(Bidirectional(LSTM(200, activation='relu', input_shape=(n_timesteps, n_features)))) model.add(RepeatVector(n_outputs)) model.add(Bidirectional(LSTM(200, activation='relu', return_sequences=True))) model.add(TimeDistributed(Dense(100, activation='relu'))) model.add(TimeDistributed(Dense(1))) model.compile(loss='mse', optimizer='adam') # fit network model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose) return model The below figure illustrates the Actual vs Predicted Outcome of Multi-Step Bi-LSTM model after the predicted outcome has been inverse-transformed (to remove the effect of scaling). BI-LSTM Stacked LSTM + CNN Here we have used Conv1d with TimeDistributed Layer, which is then fed to a single layer of LSTM, to predicted different sequences, as illustrated by the figure below. 
The CNN model is built first, then added to the LSTM model by wrapping the entire sequence of CNN layers in a TimeDistributed layer. Source # train Stacked CNN + LSTM model def build_model_cnn_lstm(train, n_input): # prepare data train_x, train_y = to_supervised(train, n_input) # define parameters verbose, epochs, batch_size = 0, 500, 16 n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1] # reshape output into [samples, timesteps, features] train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1)) # define model model = Sequential() model.add(Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(n_timesteps, n_features))) model.add(Conv1D(filters=64, kernel_size=3, activation='relu')) model.add(MaxPooling1D(pool_size=2)) model.add(Flatten()) model.add(RepeatVector(n_outputs)) model.add(LSTM(200, activation='relu', return_sequences=True)) model.add(TimeDistributed(Dense(100, activation='relu'))) model.add(TimeDistributed(Dense(1))) model.compile(loss='mse', optimizer='adam') # fit network model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose) return model The prediction and inverse scaling help to yield the actual predicted outcomes, as illustrated below.                                                           LSTM With CNN Multi-Step Forecasting and Evaluation The below snippet states how the data is properly reshaped into (1, n_input, n) to forecast for the following week. For the multi-variate time-series (of 23 features) with test data of 23 samples (with predicted output from previous steps i.e. 21+2) for 3 weeks is reshaped from (7,7,23), (8,7,23) and (9,7,23) as (49,23), (56,23) and (63, 23)  Prediction for 3 weeks, by taking the predicted output from previous weeks # make a forecast def forecast(model, history, n_input): # flatten data data = array(history) data = data.reshape((data.shape[0]*data.shape[1], data.shape[2])) # retrieve last observations for input data input_x = data[-n_input:, :] # reshape into [1, n_input, n] input_x = input_x.reshape((1, input_x.shape[0], input_x.shape[1])) # forecast the next week yhat = model.predict(input_x, verbose=0) # we only want the vector forecast yhat = yhat[0] return yhat Note 2: If you wish to see the evaluation results and plots for each step as stated below, please check the notebook at Github (https://github.com/sharmi1206/covid-19-analysis Notebook ts_dlearn_mstep_forecats.ipynb) Here at each step at the granularity of every week, we evaluate the model and compare it against the actual output. 
# evaluate one or more weekly forecasts against expected valuesdef evaluate_forecasts(actual, predicted):print("Actual Results", np.shape(actual)) print("Predicted Results", np.shape(predicted))scores = list() # calculate an RMSE score for each day for i in range(actual.shape[1]): # calculate mse mse = mean_squared_error(actual[:, i], predicted[:, i])# calculate rmse rmse = sqrt(mse) # store scores.append(rmse) plt.figure(figsize=(14, 12)) plt.plot(actual[:, i], label='actual') plt.plot(predicted[:, i], label='predicted') plt.title(ModelType + ' based Multi-Step Time Series Active Cases Prediction for step ' + str(i)) plt.legend() plt.show() # calculate overall RMSE s = 0 for row in range(actual.shape[0]): for col in range(actual.shape[1]): s += (actual[row, col] - predicted[row, col]) ** 2 score = sqrt(s / (actual.shape[0] * actual.shape[1])) return score, scores # evaluate a single model def evaluate_model(train, test, n_input): model = None # fit model if(ModelType == 'LSTM'): print('lstm') model = build_model_lstm(train, n_input) elif(ModelType == 'BI_LSTM'): print('bi_lstm') model = build_model_bi_lstm(train, n_input) elif(ModelType == 'CNN'): print('cnn') model = build_model_cnn(train, n_input) elif(ModelType == 'LSTM_CNN'): print('lstm_cnn') model = build_model_cnn_lstm(train, n_input) # history is a list of weekly data history = [x for x in train] # walk-forward validation over each week predictions = list() for i in range(len(test)): # predict the week yhat_sequence = forecast(model, history, n_input) # store the predictions predictions.append(yhat_sequence) # get real observation and add to history for predicting the next week history.append(test[i, :]) # evaluate predictions days for each week predictions = array(predictions) score, scores = evaluate_forecasts(test[:, :, 0], predictions) return score, scores, test[:, :, 0], predictions Here we show a univariate and multi-variate, multi-step time-series prediction. Multi-Step Conv2D + LSTM (Uni-variate & Multi-Variate) based Prediction for State Delhi Source A type of CNN-LSTM is the ConvLSTM (primarily for two-dimensional spatial-temporal data), where the convolutional reading of input is built directly into each LSTM unit. 
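Because the 5-D input layout is the part of ConvLSTM2D that most often trips people up, here is a minimal, self-contained sketch of the layout it expects. The dimensions below are illustrative stand-ins, not the article's exact configuration, and the synthetic data exists only to show the shapes.

import numpy as np
from tensorflow.keras.layers import ConvLSTM2D, Dense, Flatten
from tensorflow.keras.models import Sequential

# Illustrative dimensions: 2 sub-sequences of 7 days each, 1 "row", 1 feature, 7 outputs.
n_steps, n_length, n_features, n_outputs = 2, 7, 1, 7

model = Sequential([
    # ConvLSTM2D expects 5-D input: (samples, time, rows, cols, channels)
    ConvLSTM2D(filters=64, kernel_size=(1, 3), activation="relu",
               input_shape=(n_steps, 1, n_length, n_features)),
    Flatten(),
    Dense(n_outputs),
])
model.compile(loss="mse", optimizer="adam")

# 16 synthetic samples reshaped into the 5-D layout the layer expects.
x = np.random.rand(16, n_steps, 1, n_length, n_features)
y = np.random.rand(16, n_outputs)
model.fit(x, y, epochs=1, verbose=0)
print(model.output_shape)  # (None, 7)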
Here for this particular univariate time series, we have the input vector as [timesteps=14, rows=1, columns=7, features=2 (input and output)] # train CONV LSTM2D model def build_model_cnn_lstm_2d(train, n_steps, n_length, n_input): # prepare data train_x, train_y = to_supervised_2cnn_lstm(train, n_input) # define parameters verbose, epochs, batch_size = 0, 750, 16 n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1] # reshape into subsequences [samples, time steps, rows, cols, channels] train_x = train_x.reshape((train_x.shape[0], n_steps, 1, n_length, n_features)) # reshape output into [samples, timesteps, features] train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1)) # define model model = Sequential() model.add(ConvLSTM2D(filters=64, kernel_size=(1,3), activation='relu', input_shape=(n_steps, 1, n_length, n_features))) model.add(Flatten()) model.add(RepeatVector(n_outputs)) model.add(LSTM(200, activation='relu', return_sequences=True)) model.add(TimeDistributed(Dense(100, activation='relu'))) model.add(TimeDistributed(Dense(1))) model.compile(loss='mse', optimizer='adam') # fit network model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose) return model # convert history into inputs and outputs def to_supervised_2cnn_lstm(train, n_input, n_out=7): # flatten data data = train.reshape((train.shape[0]*train.shape[1], train.shape[2])) X, y = list(), list() in_start = 0 # step over the entire history one time step at a time for _ in range(len(data)): # define the end of the input sequence in_end = in_start + n_input out_end = in_end + n_out # ensure we have enough data for this instance if out_end <= len(data): x_input = data[in_start:in_end, 0] x_input = x_input.reshape((len(x_input), 1)) X.append(x_input) y.append(data[in_end:out_end, 0]) # move along one time step in_start += 1 return array(X), array(y) # make a forecast def forecast_2cnn_lstm(model, history, n_steps, n_length, n_input): # flatten data data = array(history) data = data.reshape((data.shape[0]*data.shape[1], data.shape[2])) # retrieve last observations for input data input_x = data[-n_input:, 0] # reshape into [samples, time steps, rows, cols, channels] input_x = input_x.reshape((1, n_steps, 1, n_length, 1)) # forecast the next week yhat = model.predict(input_x, verbose=0) # we only want the vector forecast yhat = yhat[0] return yhat # evaluate a single model def evaluate_model_2cnn_lstm(train, test, n_steps, n_length, n_input): # fit model model = build_model_cnn_lstm_2d(train, n_steps, n_length, n_input) # history is a list of weekly data history = [x for x in train] # walk-forward validation over each week predictions = list() for i in range(len(test)): # predict the week yhat_sequence = forecast_2cnn_lstm(model, history, n_steps, n_length, n_input) # store the predictions predictions.append(yhat_sequence) # get real observation and add to history for predicting the next week history.append(test[i, :]) # evaluate predictions days for each week predictions = array(predictions) score, scores = evaluate_forecasts(test[:, :, 0], predictions) return score, scores, test[:, :, 0], predictions Reading State-wise data and Indexing time columns: df_state_all = pd.read_csv('all_states/all.csv') df_state_all = df_state_all.drop(columns=['Latitude', 'Longitude', 'index']) stateName = unique_states[8] dataset = df_state_all[df_state_all['Name of State / UT'] == unique_states[8]] dataset = dataset.sort_values(by='Date', ascending=True) dataset = 
dataset[(dataset['Date'] >= '2020-03-25') & (dataset['Date'] <= '2020-06-06')]
print(np.shape(dataset))
daterange = dataset['Date'].values
no_Dates = len(daterange)
dateStart = daterange[0]
dateEnd = daterange[no_Dates - 1]
print(dateStart)
print(dateEnd)
dataset = dataset.drop(columns=['Unnamed: 0', 'Date', 'source1', 'state', 'Name of State / UT', 'tagpeopleinquarantine', 'tagtotaltested'])
print(np.shape(dataset))
n = np.shape(dataset)[0]
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_dataset = scaler.fit_transform(dataset)
# split into train and test
train, test = split_dataset(scaled_dataset)
# define the number of subsequences and the length of subsequences
n_steps, n_length = 2, 7
# define the total days to use as input
n_input = n_length * n_steps
score, scores, actual, predicted = evaluate_model_2cnn_lstm(train, test, n_steps, n_length, n_input)
# summarize scores
summarize_scores(ModelType, score, scores)

The model parameters can be summarized in the Conv2D + LSTM model summary. The evaluate_model function appends the model forecasting score at each step and returns it at the end. The below figure illustrates the Actual vs Predicted Outcome of the Multi-Step ConvLSTM2D model after the predicted outcome has been inverse-transformed (to remove the effect of scaling). (Figure: Uni-Variate ConvLSTM2D.)

For a multi-variate time series with 22 input features and one output prediction, we take into consideration the following changes. In function forecast_2cnn_lstm we replace the input data shaping to constitute the multi-variate features:

# In function forecast_2cnn_lstm
input_x = data[-n_input:, :]  # replacing 0 with :
# reshape into [samples, time steps, rows, cols, channels]
input_x = input_x.reshape((1, n_steps, 1, n_length, data.shape[1]))  # replacing 1 with data.shape[1] for multi-variate

Further, in function to_supervised_2cnn_lstm, we replace x_input's feature size from 0 to : and 1 with 23 features as follows:

x_input = data[in_start:in_end, :]
x_input = x_input.reshape((len(x_input), x_input.shape[1]))

(Figure: Multi-Variate ConvLSTM2D.)

Conv2D + BI_LSTM

We can further try out a Bi-Directional LSTM with a 2D Convolution Layer, as depicted in the figure below. The model stacking and subsequent layers remain the same as in the previous step, with the exception of using a BI-LSTM in place of a single LSTM.

Comparison of Model Metrics on the test data set

Deep Learning Method | RMSE
LSTM | 912.224
BI LSTM | 1317.841
CNN | 1021.518
LSTM + CNN | 891.076
Conv2D + LSTM (Uni-Variate Single-Step) | 1288.416
Conv2D + LSTM (Multi-Variate Multi-Step) | 863.163

Conclusion

In this blog, I have discussed multi-step time-series prediction using deep learning mechanisms and compared/evaluated them based on RMSE. Here, we notice that for a forecasting time period of 7 days, stacked ConvLSTM2D works the best, followed by LSTM with CNN, CNN, and LSTM networks. More extensive model evaluation with different hidden layers and neurons, with efficient hyperparameter tuning, can further improve accuracy. Though we see model accuracy decrease for multi-step models, this can be a useful tool for long-term forecasts, where predicted outcomes from the previous week play a dominant role in the predicted outputs. For the complete source code, check out https://github.com/sharmi1206/covid-19-analysis

Acknowledgments

Special thanks to machinelearningmastery.com, as some of the concepts have been taken from there.
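To make the windowing and walk-forward ideas above concrete outside the author's notebook, here is a tiny self-contained sketch. The series, window sizes, and loop bounds are synthetic and illustrative only; the model call is left as a comment since any of the networks above could be dropped in.

import numpy as np

def to_windows(series, n_input, n_out):
    """Slide over a 1-D series and emit (input window, output window) pairs."""
    X, y = [], []
    for start in range(len(series) - n_input - n_out + 1):
        X.append(series[start:start + n_input])
        y.append(series[start + n_input:start + n_input + n_out])
    return np.array(X), np.array(y)

# Synthetic daily series standing in for scaled active-case counts.
series = np.arange(70, dtype=float)
X, y = to_windows(series, n_input=14, n_out=7)
print(X.shape, y.shape)  # (50, 14) (50, 7)

# Walk-forward idea: predict a week, then append the *observed* week to history,
# exactly as the evaluate_model loop above does with test[i, :].
history = list(series[:49])
for week_start in range(49, 70, 7):
    # yhat = model.predict(...) would go here
    observed_week = series[week_start:week_start + 7]
    history.extend(observed_week)
print(len(history))  # 70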
References https://arxiv.org/pdf/1801.02143.pdf https://github.com/sharmi1206/covid-19-analysis https://machinelearningmastery.com/multi-step-time-series-forecasting/ https://machinelearningmastery.com/multi-step-time-series-forecasting-with-machine-learning-models-for-household-electricity-consumption/ https://machinelearningmastery.com/how-to-develop-lstm-models-for-multi-step-time-series-forecasting-of-household-power-consumption/ https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/ https://www.tensorflow.org/tutorials/structured_data/time_series https://www.aiproblog.com/index.php/2018/11/13/how-to-develop-lstm-models-for-time-series-forecasting/


Types of Variables in Data Science in One Picture from Featured Blog Posts - Data Science Central

Matthew Emerick
18 Oct 2020
1 min read
While there are several dozen different types of possible variables, all can be categorized into a few basic areas. This simple graphic shows you how they are related, with a few examples of each type.  More info: Types of variables in statistics and research  
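The referenced graphic is not reproduced here, but the basic split it describes can also be seen programmatically. A quick pandas illustration, with column names and values invented for the example:

import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 41],                         # numeric, discrete
    "income": [52000.0, 61500.5, 48000.0],       # numeric, continuous
    "city": ["Delhi", "Pune", "Delhi"],          # categorical, nominal
    "satisfaction": pd.Categorical(
        ["low", "high", "medium"],
        categories=["low", "medium", "high"],
        ordered=True),                           # categorical, ordinal
})

for col in df.columns:
    kind = "numeric" if pd.api.types.is_numeric_dtype(df[col]) else "categorical"
    print(f"{col}: {df[col].dtype} -> {kind}")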


Thursday News, October 15 from Featured Blog Posts - Data Science Central

Matthew Emerick
15 Oct 2020
1 min read
Here is our selection of articles and technical contributions featured on DSC since Monday:

Announcements
- Penn State Master’s in Data Analytics – 100% Online
- eBook: Data Preparation for Dummies

Technical Contributions
- A quick demonstration of polling confidence interval calculations using simulation
- Why you should NEVER run a Logistic Regression (unless you have to)
- Cross-validation and hyperparameter tuning
- Why You Should Learn Sitecore CMS?

Articles
- AI is Driving Software 2.0… with Minimal Human Intervention
- Applications of Machine Learning in FinTech
- Why Fintech is the Future of Banking?
- Real Estate: How it is Impacted by Business Intelligence
- Determining How Cloud Computing Benefits Data Science

Enjoy the reading!


AI is Driving Software 2.0… with Minimal Human Intervention from Featured Blog Posts - Data Science Central

Matthew Emerick
15 Oct 2020
6 min read
The future of software development will be model-driven, not code-driven. Now that my 4th book (“The Economics of Data, Analytics and Digital Transformation”) is in the hands of my publisher, it’s time to get back to work investigating and sharing new learnings.  In this blog I’ll take on the subject of Software 2.0.  And thanks Jens for the push in this direction! Imagine trying to distinguish a dog from other animals in a photo coding in if-then statements: If the animal has four legs (except when it only has 3 legs due to an accident), and if the animal has short fur (except when it is a hair dog or a chihuahua with no fur), and if the animal has medium length ears (except when the dog is a bloodhound), and if the animal has a medium length legs (except when it’s a bull dog), and if… Well, you get the point.  In fact, it is probably impossible to distinguish a dog from other animals coding in if-then statements. And that’s where the power of model-based (AI and Deep Learning) programming shows its strength; to tackle programming problems – such as facial recognition, natural language processing, real-time dictation, image recognition – that are nearly impossible to address using traditional rule-based programming (see Figure 1). Figure 1:  How Deep Learning Works As discussed in “2020 Challenge: Unlearn to Change Your Frame”, most traditional analytics are rule based; the analytics make decisions guided by a pre-determined set of business or operational rules. However, AI and Deep Learning make decisions based upon the "learning" gleaned from the data. Deep Learning “learns” the characteristics of entities in order to distinguish cats from dogs, tanks from trucks, or healthy cells from cancerous cells (see Figure 2). Figure 2: Rules-based versus Learning-based Programing This learning amplifies when there is a sharing of the learnings across a collection of similar assets – vehicles, trains, airplanes, compressors, turbines, motors, elevators, cranes – so that the learnings of one asset can be aggregated and backpropagated to the cohort of assets. The Uncertain Future of Programming A recent announcement from NVIDIA has the AI community abuzz, and software developers worrying about their future.  NVIDIA researchers recently used AI to recreate the classic video game Pac-Man.  NVIDIA created an AI model using Generative Adversarial Networks (GANs) (called NVIDIA GameGAN) that can generate a fully functional version of Pac-Man without the coding associated with building the underlying game engine.  The AI model was able to recreate the game without having to “code” the game’s fundamental rules (see Figure 3). Figure 3: “How GANs and Adaptive Content Will Change Learning, Entertainment and More” Using AI and Machine Learning (ML) to create software without the need to code the software is driving the "Software 2.0" phenomena.  And it is impressive.  An outstanding presentation from Kunle Olukotun titled “Designing Computer Systems for Software 2.0” discussed the potential of Software 2.0 to use machine learning to generate models from data and replace traditional software development (coding) for many applications. Software 2.0[1] Due to the stunning growth of Big Data and IOT, Neural Networks now have access to enough detailed, granular data to surpass conventional coded algorithms in the predictive accuracy of complex models in areas such as image recognition, natural language processing, autonomous vehicles, and personalized medicine. 
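To make the if-then versus learned-model contrast from the dog example concrete, here is a toy sketch. The features, thresholds, and labels are invented purely for illustration and are not from the article; the point is only that the rule set is brittle while the fitted model derives its boundary from data.

from sklearn.linear_model import LogisticRegression

# Rule-based attempt: brittle if-then logic over hand-picked attributes.
def is_dog_rules(legs, fur_length_cm, ear_length_cm):
    if legs != 4:
        return False
    if fur_length_cm > 12:      # fails for long-haired breeds
        return False
    if ear_length_cm > 15:      # fails for bloodhounds
        return False
    return True

# Learning-based attempt: the same attributes, but the boundary is fit from (toy) data.
X = [[4, 3, 8], [4, 14, 10], [4, 2, 20], [2, 0, 4], [4, 1, 3], [6, 0, 1]]
y = [1, 1, 1, 0, 0, 0]  # 1 = dog, 0 = not a dog (illustrative labels)
clf = LogisticRegression().fit(X, y)

print(is_dog_rules(4, 14, 10))        # False: the rules reject a long-haired dog
print(clf.predict([[4, 14, 10]])[0])  # the learned model classifies it from the data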
Instead of coding software algorithms in the traditional development manner, you train a Neural Network – leveraging backpropagation and stochastic gradient descent – to optimize the neural network nodes’ weights to deliver the desired outputs or outcomes (see Figure 4). Figure 4: “Neural Networks: Is Meta-learning the New Black?” With model-driven software development, it is often easier to train a model than to manually code an algorithm, especially for complex applications like Natural Language Processing (NLP) and image recognition. Plus, model-driven software development is often more predictable in terms of runtimes and memory usage compared to conventional algorithms. For example, Google’s Jeff Dean reported that 500 lines of TensorFlow code replaced 500,000 lines of code in Google Translate. And while a thousand-fold reduction is huge, what’s more significant is how this code works: rather than half a million lines of static code, the neural network can learn and adapt as biases and prejudices in the data are discovered.

Software 2.0 Challenge: Data Generation

In the article “What machine learning means for software development”, Andrej Karpathy states that neural networks have proven they can perform almost any task for which there is sufficient training data. Training Neural Networks to beat Go or Chess or StarCraft is possible because of the large volume of associated training data. It’s easy to collect training data for Go or Chess as there is over 150 years of data from which to train the models. And training image recognition programs is facilitated by the 14 million labeled images available on ImageNet. However, there is not always sufficient data to train neural network models in all cases. Significant effort must be invested to create and engineer training data, using techniques such as noisy labeling schemes, data augmentation, data engineering, and data reshaping, to power the model-based neural network applications.

Welcome to Snorkel. Snorkel (damn cool name) is a system for programmatically building and managing training datasets without manual labeling. Snorkel can automatically develop, clean and integrate large training datasets using three different programmatic operations (see Figure 5):

- Labeling data through the use of heuristic rules or distant supervision techniques
- Transforming or augmenting the data by rotating or stretching images
- Slicing data into different subsets for monitoring or targeted improvement

Figure 5: Programmatically Building and Managing Training Data with Snorkel

Snorkel is a powerful tool for data labeling and data synthesis. Labeling data manually is very time-consuming; Snorkel can address this issue programmatically, and the resulting data can be validated by human beings by looking at samples of the data. See “Snorkel Intro Tutorial: Data Augmentation” for more information on its workings.

Software 2.0 Summary

There are certain complex programming problems – facial recognition, natural language processing, real-time dictation, image recognition, autonomous vehicles, precision medicine – that are nearly impossible to address using traditional rule-based programming. In these cases, it is easier to create AI, Deep Learning, and Machine Learning models that can be trained (with large data sets) to deliver the right actions versus being coded to deliver the right actions. This is the philosophy of Software 2.0. Instead of coding software algorithms in the traditional development manner, you train a Neural Network to optimize the neural network nodes’ weights to deliver the desired outputs or outcomes. And model-driven programs have the added advantage of being able to learn and adapt… the neural network can learn and adapt as biases and prejudices in the data are discovered. However, there is not always sufficient data to train neural network models in all cases. In those cases, new tools like Snorkel can help… Snorkel can automatically develop, clean and integrate large training datasets.

The future of software development will be model-driven, not code-driven.

Article Sources:
- Machine Learning vs Traditional Programming
- Designing Computer Systems for Software 2.0 (PDF)
- Software Ate the World, Now AI Is Eating Software: The road to Software 2.0

[1] Kunle Olukotun’s presentation and video.
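As a footnote to the Snorkel discussion above, here is a minimal sketch of what programmatic labeling looks like. It relies on the labeling_function decorator and PandasLFApplier from the snorkel.labeling package; the sentiment task, keywords, and example texts are invented purely for illustration.

import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

@labeling_function()
def lf_contains_refund(x):
    # Heuristic rule: refund requests are treated as negative sentiment.
    return NEGATIVE if "refund" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_contains_thanks(x):
    # Heuristic rule: thank-you messages are treated as positive sentiment.
    return POSITIVE if "thank" in x.text.lower() else ABSTAIN

df = pd.DataFrame({"text": [
    "Thanks, the update fixed everything!",
    "I want a refund for this order.",
    "Shipment arrived on time.",
]})

applier = PandasLFApplier(lfs=[lf_contains_refund, lf_contains_thanks])
label_matrix = applier.apply(df)   # one weak label per (row, labeling function)
print(label_matrix)

In a full pipeline these weak labels would then be combined (for example with Snorkel's label model or a simple majority vote) into a single training label per row.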

How-to: Index Data from S3 via NiFi Using CDP Data Hubs from Cloudera Blog

Matthew Emerick
15 Oct 2020
10 min read
About this Blog Data Discovery and Exploration (DDE) was recently released in tech preview in Cloudera Data Platform in public cloud. In this blog we will go through the process of indexing data from S3 into Solr in DDE with the help of NiFi in Data Flow. The scenario is the same as it was in the previous blog but the ingest pipeline differs. Spark as the ingest pipeline tool for Search (i.e. Solr) is most commonly used for batch indexing data residing in cloud storage, or if you want to do heavy transformations of the data as a pre-step before sending it to indexing for easy exploration. NiFi (as depicted in this blog) is used for real time and often voluminous incoming event streams that need to be explorable (e.g. logs, twitter feeds, file appends etc). Our ambition is not to use any terminal or a single shell command to achieve this. We have a UI tool for every step we need to take.  Assumptions The prerequisites to pull this feat are pretty similar to the ones in our previous blog post, minus the command line access: You have a CDP account already and have power user or admin rights for the environment in which you plan to spin up the services. If you do not have a CDP AWS account, please contact your favorite Cloudera representative, or sign up for a CDP trial here. You have environments and identities mapped and configured. More explicitly, all you need is to have the mapping of the CDP User to an AWS Role which grants access to the specific S3 bucket you want to read from (and write to). You have a workload (FreeIPA) password already set. You have  DDE and  Flow Management Data Hub clusters running in your environment. You can also find more information about using templates in CDP Data Hub here. You have AWS credentials to be able to access an S3 bucket from Nifi. Here is documentation on how to acquire AWS credentials and how to create a bucket and upload files to it. You have a sample file in an S3 bucket that is accessible for your CDP user.  If you don’t have a sample file, here is a link to the one we used. Note: the workflow discussed in this blog was written with the linked ‘films.csv’ file in mind. If you use a different one, you might need to do things slightly differently, e.g. when creating the Solr collection) Pro Tip for the novice user: to download a CSV file from GitHub, view it by clicking the RAW button and then use the Save As option in the browser File menu. Workflow To replicate what we did, you need to do the following: Create a collection using Hue. Build a dataflow in NiFi. Run the NiFi flow. Check if everything went well NiFi logs and see the indexed data on Hue. Create a collection using Hue You can create a collection using the solrctrl CLI. Here we chose to use HUE in the DDE Data Hub cluster: 1.In the Services section of the DDE cluster details page, click the Hue shortcut. 2. On the Hue webUI select Indexes> + ‘Create index’ > from the Type drop down select ‘Manually’> Click Next. 3. Provide a collection Name under Destination (in this example, we named it ‘solr-nifi-demo’). 4. Add the following  Fields, using the + Add Field button: Name Type name text_general initial_release_date date 5. Click Submit. 6. To check that the collection has indeed been created, go to the Solr webUI by clicking the Solr Server shortcut on the DDE cluster details page. 7. 
Once there, you can either click on the Collections sidebar option or click Select an option > in the drop down you will find the collection you have just created (‘solr-nifi-demo’ in our example) > click the collection > click Query > Execute Query. You should get something very similar: {  "responseHeader":{    "zkConnected":true,    "status":0,    "QTime":0,    "params":{      "q":"*:*",      "doAs":"<querying user>",      "_forwardedCount":"1",      "_":"1599835760799"}},  "response":{"numFound":0,"start":0,"docs":[]   }} That is, you have successfully created an empty collection. Build a flow in NiFi Once you are done with collection creation, move over to Flow management Data Hub cluster. In the Services section of the Flow Management cluster details page, click the NiFi shortcut. Add processors Start adding processors by dragging the ‘Processor’ button to the NiFi canvas. To build the example workflow we did, add the following processors: 1. ListS3 This processor reads the content of the S3 bucket linked to your environment. Configuration: Config name Config value Comments Name Check for new Input Optional Bucket nifi-solr-demo The S3 bucket where you uploaded your sample file Access Key ID <my access key> This value is generated for AWS users. You may generate and download a new one from AWS Management Console > Services > IAM > Users > Select your user > Security credentials > Create access key. Secret Access Key <my secret access key> This value is generated for AWS users, together with the Access Key ID. Prefix input-data/ The folder inside the bucket where the input CSV is located. Be careful of the “/” at the end. It is required to make this work. You may need to fill in or change additional properties beside these such as region, scheduling etc. (Based on your preferences and your AWS configuration) 2. RouteOnAttribute This processor filters objects read in the previous step, and makes sure only CSV files reach the next processor. Configuration: Config name Config value Comments Name Filter CSVs Optional csv_file ${filename:toUpper():endsWith(‘CSV’)} This attribute is added with the ‘Add Property’ option. The routing will be based on this property. See in the connections section. 3.  FetchS3Object FetchS3 object reads the content of the CSV files it receives. Configuration Config name Config value Comments Name Fetch CSV from S3 Optional Bucket nifi-solr-demo The same as provided for the ListS3 processor Object Key ${filename} It’s coming from the Flow File Access Key ID <My Access Key Id> The same as provided for the ListS3 processor Secret Access Key <My Secret Access Key> The same as provided for the ListS3 processor The values for Bucket, Access Key, and Secret Key are the same as in case of the List3 processor. The Object key is autofilled by NiFi, It comes as an input from the previous processors. 4. PutSolrContentStream Configuration Config name Config value Comments Name Index Data to DDE Optional Solr Type Cloud We will provide ZK ensemble as Solr location so this is required to be set to Cloud. Solr Location <ZK_ENSEMBLE> You find this value on the Dashboard of the Solr webUI, as the zkHost parameter value. Collection solr-nifi-demo-collection Here we use the collection which has been created above. If you specified a different name there then put the same here. Content Stream Path /update Be careful of the leading “/”. Content-Type application/csv Any content type that Solr can process may be provided here. In this example we use CSV. 
Kerberos principal <my kerberos username> Since we use direct URL to Solr, Kerberos authentication needs to be used here. Kerberos password <my kerberos password> Password for the Kerberos principal. SSL Context Service Default NiFi SSL Context Service Just choose it from the drop down. The service is created by default from the Workflow Management template. 5. LogMessage (x4) We created four LogMessage processors too to track if everything happens as expected. a) Log Check Log message Object checked out: ${filename} b) Log Ignore Log message File is not csv. Ignored: ${filename} c) Log Fetch Log message Object fetched: ${filename} d) Log Index Log message Data indexed from: ${filename} 6. In this workflow, the log processors are the dead ends, so pick the “Automatically Terminate Relationships” option on them like this: In this example, all properties not mentioned above were left with their default values during processor setup. Depending on your AWS and environment setup, you may need to set things differently.  After setting up the processors you shall see something like this: Create connections Use your mouse to create flow between the processors. The connections between the boxes are the successful paths, except for the RouteOnAttribute processor: It has the csv_file and the unmatched routes. The FetchS3Object and the PutSolrContentStream processors have failure paths as well: direct them back to themselves, creating a retry mechanism on failure. This may not be the most sophisticated, but it serves its purpose.  This is what your flow will look like after setting the connections: Run the NiFi Flow You may start the processors one by one, or you may start the entire flow at once. If no processor is selected, by clicking the “Play” icon on the left side in the NiFi Operate Palette starts the flow. If you did the setup exactly as it is in the beginning of this post, two object are almost instantly checked out (depending, of course, on your scheduling settings if you set those too):  input-data/ – The input folder also matches with the prefix provided for the ListS3 processor. But no worries, as in the next step it will be filtered out so it won’t go further as it’s not a CSV file. films.csv – this goes to our collection if you did everything right. After starting your flow the ListS3 command based on the scheduling polls your S3 bucket and searches for changes based on the “Last modified” timestamp. So if you put something new in your input-data folder it will be automatically processed. Also if a file changes it’s rechecked too. Check the results After the CSV has been processed, you can check your logs and collection for the expected result. Logs 1. In the Services section of the Flow Management cluster details page, click the Cloudera Manager shortcut. 2. Click on the name of your compute cluster >Click NiFi in the Compute  Cluster box. > Under Status Summary  click NiFi Node  > Click on one of the nodes and click Log Files in the top menu bar. > Select Role Log File. If everything went well you will see similar log messages: Indexed data Indexed data appears in our collection. Here is what you should see on Hue:  Summary In this post, we demonstrated how Cloudera Data Platform components can collaborate with each other, while still being resource isolated and managed separately. We created a Solr collection via Hue, built a data ingest workflow in NiFi to connect our S3 bucket with Solr, and in the end, we have the indexed data ready for searching. 
There is no terminal magic in this scenario, we’ve only used comfortable UI features. Having our indexing flow and our Solr sitting in separate clusters, we have more options in areas like scalability, the flexibility of routing, and decorating data pipelines for multiple consuming workloads, and yet with consistent security and governance across. Remember, this was only one simple example. This basic setup, however, offers endless opportunities to implement way more complex solutions. Feel free to try Data Discovery and Exploration in CDP on your own and play around with more advanced pipelines and let us know how it goes! Alternatively, contact us for more information. The post How-to: Index Data from S3 via NiFi Using CDP Data Hubs appeared first on Cloudera Blog.
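To complement the Hue-based check above, the indexed documents can also be verified by calling Solr's select handler directly. The sketch below is a hedged example only: the host, port, and credentials are placeholders, and a Kerberized DDE cluster would need SPNEGO/GSSAPI authentication rather than the basic auth shown here.

import requests

SOLR_URL = "https://<solr-host>:<solr-port>/solr"   # placeholder endpoint
COLLECTION = "solr-nifi-demo"

resp = requests.get(
    f"{SOLR_URL}/{COLLECTION}/select",
    params={"q": "*:*", "rows": 5, "wt": "json"},
    auth=("<workload-user>", "<workload-password>"),  # placeholder credentials
    verify=False,  # or point this at the cluster's CA bundle
)
resp.raise_for_status()
body = resp.json()

print("docs indexed:", body["response"]["numFound"])
for doc in body["response"]["docs"]:
    print(doc.get("name"), doc.get("initial_release_date"))

A non-zero numFound after the NiFi flow has run confirms that the CSV rows from S3 reached the collection.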


Image Processing Techniques That You Can Use in Machine Learning Projects - Planet SciPy

Matthew Emerick
15 Oct 2020
1 min read
Image processing is a method to perform operations on an image to extract information from it or enhance it. Digital image processing... The post Image Processing Techniques That You Can Use in Machine Learning Projects appeared first on neptune.ai.
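The linked post covers the techniques in depth; as a minimal taste, here is a short OpenCV sketch of three of the most common operations: grayscale conversion, blurring, and edge detection. The file names are placeholders and the thresholds are illustrative.

import cv2

img = cv2.imread("input.jpg")          # placeholder path; any local image works
if img is None:
    raise FileNotFoundError("input.jpg not found")

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # drop colour information
blurred = cv2.GaussianBlur(gray, (5, 5), 0)    # suppress noise before edge detection
edges = cv2.Canny(blurred, 50, 150)            # extract an edge map

cv2.imwrite("edges.jpg", edges)
print("input shape:", img.shape, "-> edge map shape:", edges.shape)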


Apache Spark on Kubernetes: How Apache YuniKorn (Incubating) helps from Cloudera Blog

Matthew Emerick
14 Oct 2020
10 min read
Background Why choose K8s for Apache Spark Apache Spark unifies batch processing, real-time processing, stream analytics, machine learning, and interactive query in one-platform. While Apache Spark provides a lot of capabilities to support diversified use cases, it comes with additional complexity and high maintenance costs for cluster administrators. Let’s look at some of the high-level requirements for the underlying resource orchestrator to empower Spark as a one-platform: Containerized Spark compute to provide shared resources across different ML and ETL jobs Support for multiple Spark versions, Python versions, and version-controlled containers on the shared K8s clusters for both faster iteration and stable production A single, unified infrastructure for both majority of batch workloads and microservices Fine-grained access controls on shared clusters Kubernetes as a de-facto standard for service deployment offers finer control on all of the above aspects compared to other resource orchestrators. Kubernetes offers a simplified way to manage infrastructure and applications with a practical approach to isolate workloads, limit the use of resources, deploy on-demand resources, and auto-scaling capabilities as needed. Scheduling challenges to run Apache Spark on K8s Kubernetes default scheduler has gaps in terms of deploying batch workloads efficiently in the same cluster where long-running services are also to be scheduled. Batch workloads need to be scheduled mostly together and much more frequently due to the nature of compute parallelism required. Let’s look at some of those gaps in detail. Lack of first-class application concept  Batch jobs often need to be scheduled in a sequential manner based on types of container deployment. For instance, Spark driver pods need to be scheduled earlier than worker pods. A clear first-class application concept could help with ordering or queuing each container deployment. Also, such a concept helps admin to visualize the jobs which are scheduled for debugging purposes. Lack of efficient capacity/quota management capability  Kubernetes namespace resource quota can be used to manage resources while running a Spark workload in multi-tenant use cases. However, there are few challenges in achieving this, Apache Spark jobs are dynamic in nature with regards to their resource usage. Namespace quotas are fixed and checked during the admission phase. The pod request is rejected if it does not fit into the namespace quota. This requires the Apache Spark job to implement a retry mechanism for pod requests instead of queueing the request for execution inside Kubernetes itself.  The namespace resource quota is flat, it doesn’t support hierarchy resource quota management. Also many a time, user’s could starve to run the batch workloads as Kubernetes namespace quotas often do not match the organizational hierarchy based capacity distribution plan. An elastic and hierarchical priority management for jobs in K8s is missing today. Lack of resource fairness between tenants In a production environment, it is often found that Kubernetes default scheduler could not efficiently manage diversified workloads and provide resource fairness for their workloads. Some of the key reasons are: Batch workload management in a production environment will often be running with a large number of users. In a dense production environment where different types of workloads are running, it is highly possible that Spark driver pods could occupy all resources in a namespace. 
Such scenarios pose a big challenge to effective resource sharing. Abusive or corrupted jobs can also easily steal resources and impact production workloads.

Strict SLA requirements and scheduling latency

Busy production clusters dedicated to batch workloads often run thousands of jobs with hundreds of thousands of tasks every day. These workloads require a large number of parallel container deployments, and the lifetime of such containers is often short (seconds to hours). This produces a demand for thousands of pods or containers waiting to be scheduled; using the Kubernetes default scheduler can introduce additional delays, which can lead to missed SLAs.

How Apache YuniKorn (Incubating) can help

Overview of Apache YuniKorn (Incubating)

YuniKorn is an enhanced Kubernetes scheduler for both services and batch workloads. YuniKorn can replace the Kubernetes default scheduler, or work alongside it, depending on the deployment use case. YuniKorn brings a unified, cross-platform scheduling experience for mixed workloads consisting of stateless batch workloads and stateful services.

YuniKorn vs. the Kubernetes default scheduler: a comparison

Feature | Default scheduler | YuniKorn | Note
Application concept | x | √ | Applications are a first-class citizen in YuniKorn. YuniKorn schedules apps with respect to, e.g., their submission order, priority, and resource usage.
Job ordering | x | √ | YuniKorn supports FIFO/FAIR/Priority (WIP) job ordering policies.
Fine-grained resource capacity management | x | √ | Manage cluster resources with hierarchical queues. Queues provide guaranteed resources (min) and a resource quota limit (max).
Resource fairness | x | √ | Resource fairness across applications and queues to get an ideal allocation for all running applications.
Native support for Big Data workloads | x | √ | The default scheduler focuses on long-running services. YuniKorn is designed for Big Data workloads and natively supports running Spark/Flink/TensorFlow, etc. efficiently on K8s.
Scale & performance | x | √ | YuniKorn is optimized for performance and is suitable for high-throughput, large-scale environments.

How YuniKorn helps to run Spark on K8s

YuniKorn has a rich set of features that help run Apache Spark efficiently on Kubernetes. Detailed steps can be found here to run Spark on K8s with YuniKorn. More detail on how YuniKorn empowers running Spark on K8s is available in the Spark & AI Summit 2020 talk Cloud-Native Spark Scheduling with YuniKorn Scheduler. Let's take a look at some of the use cases and how YuniKorn helps achieve better resource scheduling for Spark in these scenarios.

Multiple (noisy) users running different Spark workloads together

As more users start to run jobs together, it becomes very difficult to isolate them and provide the required resources for each job with resource fairness, priority, and so on. The YuniKorn scheduler provides an optimal solution to manage resource quotas by using resource queues.

In an example queue structure in YuniKorn (shown as a figure in the original post), namespaces defined in Kubernetes are mapped to queues under a Namespaces parent queue using a placement policy. The Test and Development queues have fixed resource limits. All other queues are limited only by the size of the cluster. Resources are distributed using a Fair policy between the queues, and jobs are scheduled FIFO in the production queue.

Some of the high-level advantages are:
- One YuniKorn queue can map automatically to one namespace in Kubernetes.
- Queue capacity is elastic in nature, providing a resource range from a configured minimum to a maximum value.
- Resource fairness is honored, which avoids possible resource starvation.

YuniKorn provides a seamless way to manage resource quotas for a Kubernetes cluster and can act as a replacement for namespace resource quotas. YuniKorn resource quota management allows queuing of pod requests and sharing of limited resources between jobs based on pluggable scheduling policies. All of this can be achieved without any additional requirements on Apache Spark, such as retrying pod submissions. A rough sketch of how elastic, fair sharing between queues can work is shown below.
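The following is a minimal, illustrative sketch of the idea behind elastic queue capacity and fair sharing, not YuniKorn's actual algorithm or configuration format: each queue has a guaranteed minimum, a maximum cap, and a current demand, and spare cluster capacity is handed out fairly among queues that still have demand. The queue names and numbers are made up for illustration.

```python
# Toy illustration of hierarchical-queue style fair sharing (not YuniKorn code).
# Each queue has a guaranteed minimum, a maximum cap, and a current demand (in vcores).

def fair_share(cluster_capacity, queues):
    # Start every queue at the smaller of its guaranteed min and its demand.
    alloc = {q: min(cfg["min"], cfg["demand"]) for q, cfg in queues.items()}
    remaining = cluster_capacity - sum(alloc.values())

    # Hand out the remaining capacity one unit at a time to the queue that is
    # furthest below its share, respecting each queue's max cap and demand.
    while remaining > 0:
        eligible = [q for q, cfg in queues.items()
                    if alloc[q] < min(cfg["max"], cfg["demand"])]
        if not eligible:
            break
        # "Fairness" here simply prefers the eligible queue with the lowest allocation.
        q = min(eligible, key=lambda name: alloc[name])
        alloc[q] += 1
        remaining -= 1
    return alloc

if __name__ == "__main__":
    queues = {
        "root.sandbox":    {"min": 10, "max": 40,  "demand": 50},
        "root.production": {"min": 30, "max": 100, "demand": 80},
        "root.test":       {"min": 5,  "max": 10,  "demand": 20},  # small, fixed limit
    }
    print(fair_share(cluster_capacity=100, queues=queues))
```

In YuniKorn itself, the equivalent behavior comes from queue configuration (guaranteed and maximum resources per queue and the queue's sorting policy) rather than from application code.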
Setting up the cluster for an organization-hierarchy-based resource allocation model

In a large production environment, multiple users run various types of workloads together, and these users are often bound to consume resources according to budget constraints that follow the organization's team hierarchy. Such a production setup helps achieve efficient cluster resource usage within resource quota boundaries.

YuniKorn provides the ability to manage resources in a cluster with a hierarchy of queues. Fine-grained resource capacity management for a multi-tenant environment becomes possible by using resource queues with a clear hierarchy (such as the organization hierarchy). YuniKorn queues can be statically configured or dynamically managed; with the dynamic queue management feature, users can set up placement rules to delegate queue management.

Better Spark job SLAs in a multi-tenant cluster

Normal ETL workloads running in a multi-tenant cluster require an easy way to define fine-grained policies that run jobs in the desired organizational queue hierarchy. Such policies often help define stricter SLAs for job execution. YuniKorn empowers administrators to enable job ordering in queues based on simple policies such as FIFO and FAIR. The StateAware app sorting policy orders jobs in a queue in FIFO order and schedules them one by one, subject to conditions. This avoids the common race condition that occurs when submitting many batch jobs, e.g. Spark jobs, to a single namespace (or cluster). By enforcing a specific ordering of jobs, it also makes scheduling more predictable.

Enabling various K8s feature sets for Apache Spark job scheduling

YuniKorn is fully compatible with the major released K8s versions. Users can swap the scheduler transparently on an existing K8s cluster. YuniKorn fully supports all the native K8s semantics that can be used during scheduling, such as label selectors, pod affinity/anti-affinity, taints/tolerations, and PV/PVCs. YuniKorn is also compatible with management commands and utilities, such as cordoning nodes and retrieving events via kubectl.

Apache YuniKorn (Incubating) in CDP

Cloudera's CDP platform offers the Cloudera Data Engineering experience, which is powered by Apache YuniKorn (Incubating). Some of the high-level use cases YuniKorn solves at Cloudera are:
- Providing resource quota management for CDE virtual clusters
- Providing advanced job scheduling capabilities for Spark
- Scheduling both microservices and batch jobs
- Running in the cloud with auto-scaling enabled
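To tie the Spark-on-K8s pieces above together, here is a rough, illustrative sketch (not from the original post) of configuring a PySpark session against a Kubernetes cluster and tagging its pods for a YuniKorn queue. The API server address, image, namespace, and the "queue" label key are assumptions; the exact labels or annotations YuniKorn expects, and how pods get bound to the YuniKorn scheduler (for example via a pod template that sets schedulerName, or YuniKorn's admission controller), should be checked against the YuniKorn documentation.

```python
from pyspark.sql import SparkSession

# Hypothetical values: replace the API server, image, and namespace with real ones.
spark = (
    SparkSession.builder
    .appName("yunikorn-queue-demo")
    # Run against a Kubernetes cluster (client mode when built this way).
    .master("k8s://https://<k8s-api-server>:6443")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.kubernetes.container.image", "<registry>/spark-py:3.0.1")
    .config("spark.executor.instances", "4")
    # Label the driver/executor pods so the scheduler can place them in a queue.
    # The label key "queue" is an assumption -- check the YuniKorn docs for the
    # keys your YuniKorn version recognizes.
    .config("spark.kubernetes.driver.label.queue", "root.sandbox")
    .config("spark.kubernetes.executor.label.queue", "root.sandbox")
    # Pods must also be assigned to the YuniKorn scheduler, e.g. via a pod template
    # that sets schedulerName: yunikorn, or via YuniKorn's admission controller:
    # .config("spark.kubernetes.driver.podTemplateFile", "driver-pod-template.yaml")
    # .config("spark.kubernetes.executor.podTemplateFile", "executor-pod-template.yaml")
    .getOrCreate()
)

spark.range(1_000_000).selectExpr("sum(id)").show()
spark.stop()
```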
Future roadmap to better support Spark workloads

The YuniKorn community is actively looking into several core feature enhancements to support Spark workload execution. Some of the high-level features are:

Gang scheduling
For Spark workloads, it is essential that a minimum number of driver and worker pods be allocated for efficient execution. Gang scheduling helps ensure that the required number of pods is allocated before Spark job execution starts. Such a feature is very helpful in noisy multi-tenant cluster deployments. The YUNIKORN-2 Jira is tracking progress on this feature.

Job/task priority support
Job-level priority ordering helps admins prioritize jobs and direct YuniKorn to provision the required resources for executions with strict SLAs. It also gives more flexibility for effective use of cluster resources. The YUNIKORN-1 Jira is tracking progress on this feature.

Distributed tracing
YUNIKORN-387 leverages OpenTracing to improve the overall observability of the scheduler. With this feature, the critical traces through the core scheduling cycle can be collected and persisted for troubleshooting, system profiling, and monitoring.

Summary

YuniKorn helps achieve fine-grained, efficient resource sharing for various Spark workloads, both in large-scale multi-tenant environments and in dynamically provisioned cloud-native environments. YuniKorn thus empowers Apache Spark to become an essential enterprise-grade platform, offering a robust foundation for applications ranging from large-scale data transformation to analytics to machine learning.

Acknowledgments

Thanks to Shaun Ahmadian and Dale Richardson for reviewing and sharing comments. A huge thanks to the YuniKorn open source community members who helped get these features into the latest Apache release.

The post Apache Spark on Kubernetes: How Apache YuniKorn (Incubating) helps appeared first on Cloudera Blog.


Using Cloudera Machine Learning to Build a Predictive Maintenance Model for Jet Engines from Cloudera Blog

Matthew Emerick
14 Oct 2020
6 min read
Introduction

Running a large commercial airline requires the complex management of critical components, including fuel futures contracts, aircraft maintenance, and customer expectations. Airlines in the U.S. alone average about 45,000 daily flights, transporting over 10 million passengers a year (source: FAA). Airlines typically operate on very thin margins, and any schedule delay immediately angers or frustrates customers. Flying is not inherently dangerous, but the consequence of a failure is catastrophic. Airlines have built a sophisticated business model that encompasses a culture of streamlined supply chains, predictive maintenance, and unwavering customer satisfaction. To maximize safety for all passengers and crew members, while also delivering profits, airlines have heavily invested in predictive analytics to gain insight into the most cost-effective way to maintain real-time engine performance. Additionally, airlines ensure the availability and reliability of their fleets by leveraging maintenance, repair, and overhaul (MRO) organizations, such as Lufthansa Technik.

Lufthansa Technik worked with Cloudera to build a predictive maintenance platform to service its fleet of 5,000 aircraft throughout its global network of 800 MRO facilities. Lufthansa Technik extended the standard practice of placing sensors on aircraft engines and enabled predictive maintenance to automate fulfilment solutions. By combining profound airline operations expertise, data science, and engine analytics into a predictive maintenance schedule, Lufthansa Technik can now ensure that critical parts are on the ground (OTG) when needed, instead of the entire aircraft being OTG and not producing revenue.

The objective of this blog is to show how to use Cloudera Machine Learning (CML), running on Cloudera Data Platform (CDP), to build a predictive maintenance model based on advanced machine learning concepts.

The Process

Many companies build machine learning models using libraries, whether they are building perception layers that allow autonomous vehicle operation or modeling a complex jet engine. Kaggle, a site that provides test and training data sets for building machine learning models, hosts simulation data sets from NASA that measure engine component degradation for turbofan jet engines. The models in this blog are built on CML and are based on inputting various engine parameters showing typical sensor values for engine temperature, fuel consumption, vibration, or fuel-to-oxygen mixture (see Fig 1). One item to note is that the term "failure" does not imply catastrophic failure, but rather that one of the engine's components (pumps, valves, etc.) is not operating to specification. Airlines design their aircraft to operate at 99.999% reliability.

Fig 1: Turbofan jet engine

Step 1: Using the training data to create a model/classifier

First, four test and training data sets for varying conditions and failure modes were organized in preparation for CML (see box 1 in Fig 2). Each set of training data shows the engine parameters per flight, with each engine "flown" until an engine component signals failure. This is done at both sea level and all flight conditions. This data will be used to train the model that can predict how many flights a given engine has until failure. For each training set, there is a corresponding test data set that provides data on 100 jet engines at various stages of life, with actual values against which to test the predictive model for accuracy.

Fig 2: Diagram showing how CML is used to build ML training models
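As an illustrative sketch only (not the code from the CML project), the general shape of Step 1 could look like the following in Python. It assumes the Kaggle/NASA turbofan files are whitespace-delimited with unit, cycle, operational-setting, and sensor columns; the file names, column layout, alignment of the RUL file with test-engine order, and the model choice are all assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Assumed CMAPSS-style layout: unit id, cycle, 3 operational settings, 21 sensors.
cols = ["unit", "cycle", "op1", "op2", "op3"] + [f"s{i}" for i in range(1, 22)]
train = pd.read_csv("train_FD001.txt", sep=r"\s+", header=None, names=cols)
test = pd.read_csv("test_FD001.txt", sep=r"\s+", header=None, names=cols)
# Assumed: one true remaining-useful-life value per test engine, in engine order.
rul_truth = pd.read_csv("RUL_FD001.txt", header=None, names=["rul"])

# Remaining useful life (RUL) for training: cycles left before each engine's last recorded cycle.
train["rul"] = train.groupby("unit")["cycle"].transform("max") - train["cycle"]

# Binary target: will this engine fail within the next 40 flights?
train["fail_within_40"] = (train["rul"] <= 40).astype(int)

features = [c for c in cols if c not in ("unit", "cycle")]
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(train[features], train["fail_within_40"])

# Evaluate on the last observed cycle of each test engine, where the true RUL is provided.
last = test.groupby("unit").tail(1).reset_index(drop=True)
y_true = (rul_truth["rul"] <= 40).astype(int)
y_pred = model.predict(last[features])
print(classification_report(y_true, y_pred, target_names=["healthy > 40 flights", "fails within 40"]))
```

A natural extension, in the spirit of Step 3 below, is to weight false negatives (missed failures) far more heavily than false positives when choosing the decision threshold.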
Step 2: Iterate on the model to validate and improve effectiveness

CML was used to create a model that estimates the amount of remaining useful life (RUL) for a given engine, using the provided test and training data sets. A threshold of one week (the time allowance to place parts on the ground) was planned for a scenario that alerts an airline before a potential engine component failure. Assuming four flights daily, this means the airline would like to know with confidence whether an engine is going to fail within 40 flights. The model was tested for each engine, and the results were classified as true or false for potential failure within 40 flights (see Table 1).

Table 1: Data in the table is based on one week of data of 40 flights.

Step 3: Apply an added cost value to the results

With no preventative maintenance, an engine that runs out of life or fails can compromise safety and cost millions of dollars more to replace. If an engine is maintained or overhauled before it runs out of life, the cost of the overhaul is significantly lower. However, if the engine is overhauled too early, there is potential engine life that could still have been utilized. The estimated cost in this model for each of these overhaul outcomes can be seen below (see Fig 3).

Fig 3: Cost-benefit confusion matrix

Conclusion

Using Cloudera Machine Learning to analyze the NASA jet engine simulation data provided by Kaggle, our predictive maintenance model predicted with very high accuracy when an engine was likely to fail or when it required an overhaul. Combining the cost-benefit analysis with this predictive model against the test data sets suggested significant savings across all applied scenarios. Airline decisions are always made with consideration to safety first and profit second. Predictive maintenance is preferred because it is always the safest choice, and it delivers drastically lower maintenance costs than reactive (engine replacement after failure) or proactive (replacing components before engine replacement) approaches.

Next Steps

To see all of this in action, please use the links below to a few different sources showcasing the process that was created.

- Video – If you'd like to see and hear how this was built, see the video at the link.
- Tutorials – If you'd like to go at your own pace, see a detailed walkthrough with screenshots and line-by-line instructions for how to set this up and execute it.
- Meetup – If you want to talk directly with experts from Cloudera, please join a virtual meetup to see a live-stream presentation. There will be time for direct Q&A at the end.
- CDP Users Page – To learn about other CDP resources built for users, including additional videos, tutorials, blogs, and events, click on the link.

The post Using Cloudera Machine Learning to Build a Predictive Maintenance Model for Jet Engines appeared first on Cloudera Blog.

Election 2020: How data can show what’s driving the Trump vs. Biden polls from What's New

Matthew Emerick
14 Oct 2020
8 min read
Steve Schwartz, Director, Public Affairs at Tableau | October 14, 2020

It's October, which means there is officially less than one month until the 2020 Presidential election on November 3. Opinion polls on the race between current President Donald Trump and the challenger, former Vice President Joe Biden, are everywhere.

Although many people see public opinion polls as a way to anticipate the outcome of the election, they are most valuable when considered as a snapshot of people's beliefs at a given moment in time. Through our partnership with SurveyMonkey and Axios to collect and share data on the 2020 Presidential race, we've created a dashboard where you can track how survey respondents are feeling about the candidates. But looking at candidate preference data alone doesn't answer the critical question of this year's election: What is driving voter preference?

This year, that's an especially tricky question. There are the major issues confronting the country, from challenges like the COVID-19 pandemic, to the disease's impact on the national and global economies, to the nationwide protests for racial justice and equity. And there's also the news cycle, which seemingly tosses another knuckleball at voters before they've had a chance to process the last one.

By partnering with SurveyMonkey, we've been able to tap into their vast market research technologies to reach the public and visualize their answers to these critical questions. Through our Election 2020 platform, you can dig into this data and expand your understanding of not only what the topline polls are saying, but what is top-of-mind for the voters making the decision this year. We'll walk you through some of the key data you can find on our Election 2020 pages, and why it's so critical to understanding this year's political landscape.

Preferences by demographics

Understanding the way different demographic groups vote is critical. It's very common for pollsters to break down results by categories like age bracket, race, and gender. Disaggregating data offers valuable insights into trends among voter groups that can inform understanding of potential election results. But the way the data is often presented, either in static crosstabs or in percentage points scattered throughout an analysis, doesn't really give people insight into voters' intersectionalities and how they play out in the data.

SurveyMonkey wanted to give people a way to explore demographic data in a more granular and comprehensive way. They've broken down data on candidate preference by five different demographic categories (age, race, gender, education level, and party ID), and in a Tableau dashboard, anyone can choose which categories to combine to see more nuanced voter preferences.

For instance, if one were to look just at gender, the breakdown would be pretty clear: 52% of men support Trump, and 56% of women support Biden. But in this dashboard, you can also choose to layer in race. Suddenly, the picture becomes much more complex: 87% of Black women support Biden, and 76% of Black men support Biden. On the flip side, just 38% of white men support Biden, and 44% of white women support Trump. If you add in another dimension, like education, the numbers become even clearer: by far, the group that most strongly supports Trump (at 70%) is white men with a high school degree or less, and the group with the strongest across-the-board support for Biden is Black women with a postgraduate degree (91%).
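As a small illustration of what this kind of disaggregation looks like in code (not the actual SurveyMonkey/Tableau pipeline), the sketch below computes candidate support within combinations of demographic columns; the DataFrame contents and column names are hypothetical, and real survey data would also apply respondent weights.

```python
import pandas as pd

# Hypothetical respondent-level data; real polling data would carry survey weights.
responses = pd.DataFrame({
    "gender":    ["Woman", "Man", "Woman", "Man", "Woman", "Man"],
    "race":      ["Black", "White", "White", "Black", "Black", "White"],
    "education": ["Postgraduate", "High school or less", "College", "College",
                  "Postgraduate", "High school or less"],
    "candidate": ["Biden", "Trump", "Biden", "Biden", "Biden", "Trump"],
})

def support_by(df, dims, candidate="Biden"):
    """Percent support for a candidate within each combination of the given dimensions."""
    grouped = df.groupby(dims)["candidate"]
    pct = grouped.apply(lambda s: round(100 * (s == candidate).mean(), 1))
    return pct.rename(f"% {candidate}").reset_index()

# One dimension, then progressively more granular combinations.
print(support_by(responses, ["gender"]))
print(support_by(responses, ["gender", "race"]))
print(support_by(responses, ["gender", "race", "education"]))
```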
"From the perspective of someone who's immersed in crosstabs and bar charts every day, this visualization is the clearest example yet of the value of pairing data collected through SurveyMonkey's mighty scale with visual storytelling tools from Tableau. The fact that it's highly interactive and responsive really brings the data to life in a way that isn't possible using standard tools,” Wronski says. The COVID-19 pandemic Let’s start with the big one. COVID-19 has posed one of the most significant challenges to the United States and its citizens in recent memory. Over 200,000 people have died, and the economy has recorded its steepest-ever drop on record, with the GDP declining more than 9%. As we near the Election, the virus is not showing signs of abating (for the latest data on COVID-19, you can visit Tableau’s COVID-19 Data Hub). Our partners at SurveyMonkey have been tracking public sentiment around the pandemic since February, as it’s impacting the lives of nearly everyone in the United States this year. "The coronavirus pandemic has infiltrated every aspect of life for the past eight months, and it will continue to do so for the foreseeable future. We wanted to make sure to start measuring concerns early on, and we're committed to tracking public sentiment on this topic for as long as necessary,” says Laura Wronski, research science manager at SurveyMonkey. Through our Election 2020 portal, you can analyze data on how the public is feeling about the pandemic in the leadup to the election. SurveyMonkey has asked respondents about their personal concerns around the virus—if they are worried about contracting it themselves, or someone in their family being affected, and if they are worried about the pandemic’s impact on the economy. Because SurveyMonkey’s Tableau dashboards make it easy to filter these responses by a number of demographic factors—from age to political affiliation—you can begin to see patterns in the data, and understand how concerns around COVID-19 could be a key factor in shaping the outcome of the election. Government leadership Elections are nearly always a referendum on leadership, and this year is no different. However, the pandemic is adding a new layer to how voters assess their elected leaders across the country. "As the election approaches, politicians who are on the ballot at every level will be judged by how well they responded to the coronavirus this year, both in terms of its effect on the economy through lost jobs and shuttered businesses and in terms of the public health infrastructure's response,” Wronski says. Digging into the data, you can see virtually no difference along any demographic breakdown between people’s assessment of Trump as a leader overall, and people’s opinion of how he is handling the federal response to COVID-19. That can tell you several things: That voters’ opinions are, at this point, fairly solidified, and also that COVID-19 is a significant driver of that opinion. Digging into the data on how respondents feel about their state government’s response to COVID-19 shows some interesting trends. The clearest split, in many states, seems to be along party lines. In Pennsylvania, for instance, 82% of Democrats approve of the state response, while 71% of Republicans disapprove. In South Carolina, 73% of Republicans approve of the response, and 74% of Republicans disapprove. It gets much more interesting along other demographic lines, though. 
Here's the opinion split along gender lines in Pennsylvania: 60% of women approve, and 49% of men disapprove. And in South Carolina, 54% of women disapprove, and 54% of men approve.

"Like so much else these days, Republicans and Democrats are split in their views of how worrisome the coronavirus is and how well we've responded to it. Those partisan effects far outweigh any differences by age, gender, race, or other demographic characteristics," Wronski says.

Voting

COVID-19 has complicated nearly every aspect of the 2020 election, including voting. Multiple news outlets are reporting a sharp uptick in requests to vote by mail this year, due to concerns about gathering in public amid a pandemic. Through their data on how likely people are to vote by mail, SurveyMonkey is able to show a clear split along party lines: overall, 70% of Democratic respondents say they're likely to vote by mail, and 72% of Republican respondents say the opposite. Axios, our media partner in the Election 2020 initiative, has analyzed what this means in the context of the potential outcome, and what the implications could be if mail-in ballots are disqualified due to complications with the system.

"More people will vote by mail in this election than in any previous election, and that will reshape the logistics of the electoral tallying process and the entire narrative that we see on the news on Election Day. It's important for us to understand those dynamics early on so that we can help explain those changes to the public," Wronski says.

Exploring with data

Now that you have a sense of the information SurveyMonkey is polling for and why, and how to discover it in Tableau, we hope you take some time to dig into the data and gather your own insights. As the election nears, SurveyMonkey, Tableau, and Axios will continue to deliver more data and analysis around the political landscape, so keep checking back to the Election 2020 page for the latest.


Why you should NEVER run a Logistic Regression (unless you have to) from Featured Blog Posts - Data Science Central

Matthew Emerick
14 Oct 2020
2 min read
Hello fellow Data Science Centralists!

I wrote a post on my LinkedIn about why you should NEVER run a Logistic Regression (unless you really have to). The main thrust is:

- There is no theoretical reason why a least squares estimator can't work on a 0/1 outcome. There are very, very narrow theoretical reasons to run a logistic, and unless you fall into those categories it's not worth the time.
- The run time of a logistic can be up to 100x longer than an OLS model. If you are doing v-fold cross-validation, save yourself some time.
- The XB's are exactly the same whether you use a logistic or a linear regression. The model specification (features, feature engineering, feature selection, interaction terms) is identical -- and this is what you should be focused on anyway.
- Myth: linear regression can only run linear models.
- There is *one* practical reason to run a logistic: if the results are all very close to 0 or to 1, and you can't hard-code your prediction to 0 or 1 when the linear model falls outside the normal probability range, then use the logistic. So if you are pricing an insurance policy based on risk, you can't have a hard-coded 0.000% prediction because you can't price that correctly.

See the video here and slides here. I think it'd be nice to start a debate on this topic! A quick sketch of the comparison is below.
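To make the claim concrete, here is a small, hedged sketch (my illustration, not the author's code) comparing an ordinary least squares "linear probability model" against logistic regression on a synthetic 0/1 outcome, looking at fit time and how closely the two sets of predictions track each other. Exact timings and correlations will vary by data, solver, and library version.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic binary outcome with a handful of informative features.
X, y = make_classification(n_samples=200_000, n_features=20, n_informative=10, random_state=0)

t0 = time.perf_counter()
lpm = LinearRegression().fit(X, y)          # OLS on the 0/1 outcome (linear probability model)
t_lpm = time.perf_counter() - t0

t0 = time.perf_counter()
logit = LogisticRegression(max_iter=1000).fit(X, y)
t_logit = time.perf_counter() - t0

lpm_scores = lpm.predict(X)                  # can fall outside [0, 1]
logit_probs = logit.predict_proba(X)[:, 1]

print(f"OLS fit time:      {t_lpm:.3f}s")
print(f"Logistic fit time: {t_logit:.3f}s")
print(f"Correlation of predicted scores: {np.corrcoef(lpm_scores, logit_probs)[0, 1]:.3f}")
print(f"Agreement at a 0.5 cutoff: {np.mean((lpm_scores > 0.5) == (logit_probs > 0.5)):.3%}")
```

In practice the two tend to rank observations almost identically; the logistic mainly matters when you need well-calibrated probabilities near 0 or 1, which matches the one practical exception the author calls out.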


Applications of Machine Learning in FinTech from Featured Blog Posts - Data Science Central

Matthew Emerick
14 Oct 2020
1 min read
Machine learning is a type of artificial intelligence that provides computers with the ability to learn without being explicitly programmed. The science behind machine learning is interesting and application-oriented. Many startups have disrupted the FinTech ecosystem with machine learning as their key technology. There are various applications of machine learning used by FinTech companies, falling under different subcategories. Let us look at some of these applications and the companies using them.

Table of contents
- Predictive Analysis for Credit Scores and Bad Loans
- Accurate Decision-Making
- Content/Information Extraction
- Fraud Detection and Identity Management

To read the whole article, with each point detailed, click here.

Introducing Data Literacy for All: Free data skills training for individuals and organizations from What's New

Matthew Emerick
13 Oct 2020
4 min read
Courtney Totten, Director, Academic Programs | October 13, 2020

Data is becoming more pervasive at work and in our everyday lives. Whether you're optimizing your sales organization or your fantasy football team, data is a key ingredient of success.

Although more people are familiar with data, many are still struggling with fundamental data literacy: the ability to explore, understand, and communicate with data. This is a problem, particularly since data skills are now a prerequisite for many jobs, and the demand is growing. In fact, in 2020, LinkedIn listed data-driven decision-making skills like analytical reasoning and business analysis as two of the top ten most in-demand hard skills on their Top Skills Companies Need Most list.

To help address this demand, we are excited to announce the launch of Data Literacy for All, a free eLearning program that includes over five hours of training to help anyone learn foundational data skills. Whether you are new to data, looking to accelerate your career, or seeking a new career path, Data Literacy for All can help you develop a foundational data skillset.

We hear from customers time and time again that developing data skills for employees is one of the main challenges they face when deploying analytics. Whether it's hiring new talent with data skills on their resume or reskilling existing employees, having a baseline of data literacy across the organization is a critical component of a Data Culture. Helping our customers fill their talent pipeline with trained candidates has been a focus of ours for many years. Since its inception in 2011, the Tableau Academic Programs team has been driving data literacy efforts in higher education, offering free software and learning resources to enable and empower future data workers. Through these efforts, we have provided more than 1.3 million students and instructors with access to software and data skills. We will continue investing in future generations, but we also recognized the opportunity to do more. We are thrilled to expand our data literacy work beyond the classroom.

Developing the Data Literacy for All program and coursework

Data Literacy for All fills common knowledge gaps and 'learning pain points' to allow anyone to begin and continue their data journey. The Data Literacy for All program includes the following courses:

- Introduction to Data Literacy
- Recognizing Well-Structured Data
- Exploring Variables and Field Types
- Exploring Aggregation and Granularity
- Understanding Distributions
- Understanding Variation for Wise Comparisons
- Using Correlation and Regression to Examine Relationships

One of our long-term goals for this program is to open doors to more diversity within formal and informal data roles. We believe that creating a more data literate world begins with education, and that foundational data skills are building blocks for our future. Making these foundational skills easy and accessible to anyone and everyone around the world is a start.

When it came to developing data literacy resources for all types of learners, we found inspiration in our Academic Programs and our existing instructor relationships. Through Tableau for Teaching, we have worked closely with instructors around the world who were building their own analytics programs within their institutions. So, to tackle the challenge of data literacy, we took a unique approach: we hired Dr. Sue Kraemer as our first Academic Program Instructional Designer.
Sue was brought on board to help us serve academia's growing needs and to drive development of this education in support of our instructors in higher education. Prior to joining Tableau, Sue was an instructor at the University of Washington Bothell, where she taught Statistics and Data Visualization courses in Health Studies. Thanks to her academic experience, we were able to create training that bridges foundational skills and practical business needs, a necessary balance for today's knowledge workers.

Access data literacy courses for free, starting today!

Now anyone can access this training for free. We are excited about this program, as it is a critical part of helping people see and understand data. But this is just the beginning! At Tableau, we will continue to look for additional ways to help more people become data rockstars, because we believe that access to data and the right skills can truly change the world. Start your data literacy journey today!


What you need to know to begin your journey to CDP from Cloudera Blog

Matthew Emerick
13 Oct 2020
5 min read
Recently, my colleague published the blog Build on your investment by Migrating or Upgrading to CDP Data Center, which articulates the great CDP Private Cloud Base features. Existing CDH and HDP customers can immediately benefit from this new functionality. This blog focuses on the process of accelerating your journey to CDP Private Cloud Base, for both professional services engagements and self-service upgrades.

Upgrade with Confidence with Cloudera Professional Services

Cloudera recommends working with Cloudera Professional Services to simplify your journey to CDP Private Cloud Base and get faster time to value. Cloudera PS offers SmartUpgrade to help you efficiently upgrade or migrate to CDP Private Cloud Base with minimal disruption to your SLAs.

Preparing for an Upgrade

Whether you choose to manage your own upgrade process or leverage our Professional Services organization, Cloudera provides the tools you need to get started.

Before you Begin

1. Contact your account team to start the process.
2. Generate a diagnostic bundle to send information about your cluster to Cloudera Support for analysis. The diagnostic bundle consists of information about the health and performance of the cluster. Learn more about how to send diagnostic bundles.
   - On a CDH cluster, use Cloudera Manager.
   - On an HDP cluster, use SmartSense.
3. Gather information that the diagnostic tool cannot obtain automatically:
   - What is the primary purpose of the cluster?
   - HDP customers only: Which relational database and version are used? How many database objects do you have?
   - Which external APIs are you using?
   - Which third-party software do you use with this cluster?

Create an Upgrade Planning Case

To manage your own upgrade process, follow these steps to file an upgrade planning case and ensure a smooth upgrade experience:

1. Go to the Cloudera Support Hub and click Create Case.
2. Select Upgrade Planning.
3. In Product to Upgrade, select a product from the list. Choices are: Ambari, HDP, HDP & Ambari, CDH, Cloudera Manager, CDH & Cloudera Manager.
4. Are you upgrading to CDP Private Cloud Base? Select Yes or No.
5. What is your target version? Select the version of the product and the version of Cloudera Manager or Ambari.
6. Complete the information about your assets and timeline.
7. Attach the diagnostic bundle you created. Diagnostics will run through your bundle data to identify potential issues that need to be addressed prior to an upgrade.
8. Include the information that you gathered earlier in "Before you Begin".

A case is created.

CDP Upgrade Advisor

The CDP Upgrade Advisor is a utility available on my.cloudera.com for Cloudera customers. This tool evaluates diagnostic data to determine the CDP readiness of your CDH or HDP cluster environment. Running the Upgrade Advisor against the cluster in question is one of your first steps in adopting CDP, followed by an in-depth conversation with your Cloudera account team to review the specific results. This utility raises awareness of clusters that may present risks during an upgrade to CDP due to, for example, an unsupported version of the operating system currently in use. The Upgrade Advisor focuses on the environment and platform in use, but does not take into consideration use cases, the actual cluster data, or the workflows in use. Analysis of these critical areas occurs as part of your CDP Journey Workshop with your Cloudera account team and Professional Services.
To run the Upgrade Advisor:

1. Click Upgrade Path to begin the evaluation based on your diagnostic data. The first thing you'll see is a list of your active assets (CDH, DataFlow, HDP, Key Trustee, and CDP assets). The Upgrade Advisor is available only for CDH and HDP environments.
2. Click the CDP Upgrade Advisor link on the right-hand side of a CDH or HDP asset to obtain the evaluation results. The Upgrade Advisor determines a recommended upgrade path for the asset in question. You may see a recommendation to upgrade to CDP Data Center (Private Cloud Base) or Public Cloud, or not to upgrade at this time due to the environmental failures identified. Beneath the recommendations are details of the cluster asset being evaluated, along with contact details for your Cloudera account team.

The Evaluation Details section includes the results of the validation checks performed against your diagnostic data. This includes risks and recommendations, such as a particular service or version of third-party software that will not be supported after an upgrade to CDP. Each category of the evaluation details also features icons that will take you to the relevant CDP documentation. You can also view a video (recommended) about the Upgrade Advisor.

Validate partner certifications

For partner ecosystem support for CDP, you can validate your partner application certifications with the blog Certified technical partner solutions help customers succeed with Cloudera Data Platform. Please also work with your account team for partner technology applications that are not currently on the certified list.

Learn from Customer Success Stories

Take a deeper look at one customer's journey to CDP in this blog. A financial services customer upgraded their environment from CDH to CDP with Cloudera Professional Services in order to modernize their architecture, ingest data in real time using the new streaming features available in CDP, and make the data available to their users faster than ever before.

Summary

Take the next steps on your journey to CDP now by visiting my.cloudera.com to assess your clusters in the Upgrade Advisor, and sign up for a trial of CDP Private Cloud Base. To learn more about CDP, please check out the CDP Resources page.

The post What you need to know to begin your journey to CDP appeared first on Cloudera Blog.