Tech News

3711 Articles

Opsera Simplifies Building of DevOps Pipelines from DevOps.com

Matthew Emerick
15 Oct 2020
1 min read
Fresh off raising $4.3 million in funding, Opsera today launched a namesake platform that enables IT teams to orchestrate both the tools employed by developers and the pipelines that make up a DevOps process. Company co-founder Chandra Ranganathan said the Opsera platform automates setup of DevOps pipelines using a declarative approach that doesn’t […] The post Opsera Simplifies Building of DevOps Pipelines appeared first on DevOps.com.

Scott Mead: Tracking Postgres Stats from Planet PostgreSQL

Matthew Emerick
15 Oct 2020
9 min read
Database applications are living, [(sometimes) fire-]breathing systems that behave in unexpected ways. As a purveyor of the pgCraft, it’s important to understand how to interrogate a Postgres instance and learn about the workload. This is critical for lots of reasons:

- Understanding how the app is using the database
- Understanding what risks there are in the data model
- Designing a data lifecycle management plan (i.e. partitions, archiving)
- Learning how the ORM is behaving towards the database
- Building a VACUUM strategy

There are lots of other reasons this data is useful, but let’s take a look at some examples and get down to a few scripts you can use to pull this together into something useful.

First, take a visit to the pgCraftsman’s toolbox to find an easy-to-use snapshot script. This script is designed to be completely self-contained. It will run at whatever frequency you’d like and will save snapshots of the critical monitoring tables right inside your database. There are even a few reporting functions included to help you look at stats over time.

What to Watch

There are a number of critical tables and views to keep an eye on in the Postgres catalog. This isn’t an exhaustive list, but a quick set that the toolbox script already watches:

- pg_stat_activity
- pg_locks
- pg_stat_all_tables
- pg_statio_all_tables
- pg_stat_all_indexes
- pg_stat_database

These tables and views provide runtime stats on how your application is behaving in regards to the data model. The problem with many of these is that they’re either point-in-time (like pg_stat_activity) or cumulative (pg_stat_all_tables.n_tup_ins contains the cumulative number of inserts since pg_stat_database.stats_reset). In order to glean anything useful from these runtime performance views, you should be snapshot-ing them periodically and saving the results.

I’ve seen (and built) lots of interesting ways to do this over the years, but the simplest way to generate some quick stats over time is with the PgCraftsman Toolbox script: pgcraftsman-snapshots.sql. This approach is great, but as you can guess, a small SQL script doesn’t solve all the world’s database problems. True, this script does solve 80% of them, but that’s why it only took me 20% of the time.

Let’s say I have a workload that I know nothing about. Let’s use pgcraftsman-snapshots.sql to learn about the workload and determine the best way to deal with it.

Snapshots

In order to build actionable monitoring out of the cumulative or point-in-time monitoring views, we need to snapshot the data periodically and compare between those snapshots. This is exactly what the pgcraftsman-snapshots.sql script does. All of the snapshots are saved in appropriate tables in a new ‘snapshots’ schema. The ‘snapshot’ function simply runs an INSERT as SELECT from each of the monitoring views. Each row is associated with the id of the snapshot being taken (snap_id). When it’s all put together, we can easily see the number of inserts that took place in a given table between two snapshots, the growth (in bytes) of a table over snapshots, or the number of index scans against a particular index. Essentially, any data in any of the monitoring views we are snapshot-ing.
1. Install pgcraftsman-snapshots.sql

❯ psql -h your.db.host.name -U postgres -d postgres -f pgcraftsman-snapshots.sql
SET
CREATE SCHEMA
SELECT 92
CREATE INDEX
SELECT 93
CREATE INDEX
SELECT 6
CREATE INDEX
SELECT 7
CREATE INDEX
CREATE INDEX
CREATE INDEX
SELECT 145
CREATE INDEX
SELECT 3
CREATE INDEX
SELECT 269
CREATE INDEX
CREATE INDEX
CREATE INDEX
SELECT 1
CREATE INDEX
CREATE INDEX
CREATE TABLE
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE TABLE
CREATE INDEX
CREATE TABLE
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE TABLE
CREATE INDEX
CREATE INDEX
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
CREATE SEQUENCE
CREATE FUNCTION
 save_snap
-----------
         2
(1 row)

CREATE FUNCTION
CREATE TYPE
CREATE FUNCTION
CREATE TYPE
CREATE FUNCTION
CREATE TYPE
CREATE FUNCTION

In addition to installing the snapshot schema, this script takes two initial snapshots for you. You can monitor the snapshots by running:

postgres=# select * from snapshots.snap;
 snap_id |             dttm
---------+-------------------------------
       1 | 2020-10-15 10:32:54.31244-04
       2 | 2020-10-15 10:32:54.395929-04
(2 rows)

You can also get a good look at the schema:

postgres=# set search_path=snapshots;
SET
postgres=# \dt+
                               List of relations
  Schema   |          Name          | Type  |  Owner   |    Size    | Description
-----------+------------------------+-------+----------+------------+-------------
 snapshots | snap                   | table | postgres | 8192 bytes |
 snapshots | snap_all_tables        | table | postgres | 96 kB      |
 snapshots | snap_cpu               | table | postgres | 8192 bytes |
 snapshots | snap_databases         | table | postgres | 8192 bytes |
 snapshots | snap_indexes           | table | postgres | 120 kB     |
 snapshots | snap_iostat            | table | postgres | 8192 bytes |
 snapshots | snap_load_avg          | table | postgres | 8192 bytes |
 snapshots | snap_mem               | table | postgres | 8192 bytes |
 snapshots | snap_pg_locks          | table | postgres | 16 kB      |
 snapshots | snap_settings          | table | postgres | 32 kB      |
 snapshots | snap_stat_activity     | table | postgres | 16 kB      |
 snapshots | snap_statio_all_tables | table | postgres | 72 kB      |
(12 rows)

postgres=# reset search_path;
RESET
postgres=#

There are a few tables here (snap_cpu, snap_load_avg, snap_mem) that seem interesting, eh? I’ll cover these in a future post; we can’t get that data from within a Postgres instance without a special extension installed or some external driver collecting it. For now, those tables will remain unused.

2. Take a snapshot

The snapshots.save_snap() function included with pgcraftsman-snapshots.sql does a quick save of all the metadata and assigns it all a new snap_id:

postgres=# select snapshots.save_snap();
 save_snap
-----------
         3
(1 row)

postgres=#

The output row is the snap_id that was just generated and saved. Every time you want to create a snapshot, just call:

select snapshots.save_snap();

The easiest way to do this is via cron or another similar job scheduler (pg_cron). I find it best to schedule these before and after large workload windows. If you have a 24-hour workload, find inflection points that you’re looking to differentiate between.

Snapshot Performance

Questions here about the performance of a snapshot make lots of sense. If you look at the save_snap() code, you’ll see that the runtime of the process is going to depend on the number of rows in each of the catalog tables.
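To get a rough feel for how long a snapshot takes on your own instance, you could time a few save_snap() calls from outside the database. This is only an illustrative sketch (it assumes psycopg2 is installed and that you adjust the connection parameters); it is not part of the pgCraftsman toolbox.

# Rough timing harness for snapshots.save_snap() -- illustrative sketch only.
# Assumes: psycopg2 installed, network access to the instance, and the
# snapshots schema from pgcraftsman-snapshots.sql already installed.
import time
import psycopg2

# Adjust connection parameters (host, user, password handling) for your environment.
conn = psycopg2.connect(host="your.db.host.name", dbname="postgres", user="postgres")
conn.autocommit = True

with conn.cursor() as cur:
    for _ in range(3):
        started = time.monotonic()
        cur.execute("SELECT snapshots.save_snap();")
        snap_id = cur.fetchone()[0]
        print(f"snap_id={snap_id} took {time.monotonic() - started:.3f}s")

conn.close()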
Specifically, the runtime will depend on:

- pg_stat_activity <- number of connections to the instance
- pg_locks <- number of locks
- pg_stat_all_tables <- number of tables in the database
- pg_statio_all_tables <- number of tables in the database
- pg_stat_all_indexes <- number of indexes in the database
- pg_stat_database <- number of databases in the instance

For databases with thousands of objects, snapshots should be pruned frequently so that the snapshot mechanism itself does not cause performance problems.

Pruning old snapshots

Pruning old snapshots with this script is really easy. There is a relationship between the snapshots.snap table and all the others, so a simple ‘DELETE FROM snapshots.snap WHERE snap_id = x;’ will delete all the rows for the given snap_id.

3. Let the workload run

Let’s learn a little bit about the workload that is running in the database. Now that we have taken a snapshot (snap_id = 3) before the workload, we’re going to let the workload run for a bit, then take another snapshot and compare the difference. (Note: snapshots just read the few catalog tables I noted above and save the data. They don’t start a process or run anything. The only thing that’ll make your snapshots run long is if you have a large number of objects (schema, table, index) in the database.)

4. Take a ‘post-workload’ snapshot

After we’ve let the workload run for a while (5 minutes, 2 hours, 2 days… whatever you think will give the best approximation for your workload), take a new snapshot. This will save the new state of data and let us compare the before and after stats:

postgres=# select snapshots.save_snap();
 save_snap
-----------
         4
(1 row)

postgres=#

5. Analyze the report

There are two included functions for reporting across the workload:

select * from snapshots.report_tables(start_snap_id, end_snap_id);
select * from snapshots.report_indexes(start_snap_id, end_snap_id);

Both of these reports need a starting and an ending snap_id. You can get these by examining the snapshots.snap table:

postgres=# select * from snapshots.snap;
 snap_id |             dttm
---------+-------------------------------
       1 | 2020-10-15 10:32:54.31244-04
       2 | 2020-10-15 10:32:54.395929-04
       3 | 2020-10-15 10:56:56.894127-04
       4 | 2020-10-15 13:30:47.951223-04
(4 rows)

postgres=#

Our pre-workload snapshot was snap_id = 3 and our post-workload snapshot was snap_id = 4. Since we are reporting between two snapshots, we can see exactly what occurred between them: the number of inserts / updates / deletes / sequential scans / index scans, and even table growth (bytes and human readable). The key is that this is just what took place between the snapshots. You can take a snapshot at any time and report across any number of them.
(Note: You may need to side-scroll to see the full output. I highly recommend it.)

postgres=# select * from snapshots.report_tables(3,4);
 time_window | relname | ins | upd | del | index_scan | seqscan | relsize_growth_bytes | relsize_growth | total_relsize_growth_bytes | total_relsize_growth | total_relsize | total_relsize_bytes
-----------------+-------------------------+--------+--------+-----+------------+---------+----------------------+----------------+----------------------------+----------------------+---------------+---------------------
 02:33:51.057096 | pgbench_accounts | 0 | 588564 | 0 | 1177128 | 0 | 22085632 | 21 MB | 22085632 | 21 MB | 1590083584 | 1516 MB
 02:33:51.057096 | pgbench_tellers | 0 | 588564 | 0 | 588564 | 0 | 1269760 | 1240 kB | 1597440 | 1560 kB | 1720320 | 1680 kB
 02:33:51.057096 | pgbench_history | 588564 | 0 | 0 | | 0 | 31244288 | 30 MB | 31268864 | 30 MB | 31268864 | 30 MB
 02:33:51.057096 | pgbench_branches | 0 | 588564 | 0 | 587910 | 655 | 1081344 | 1056 kB | 1146880 | 1120 kB | 1204224 | 1176 kB
 02:33:51.057096 | snap_indexes | 167 | 0 | 0 | 0 | 0 | 49152 | 48 kB | 65536 | 64 kB | 204800 | 200 kB
 02:33:51.057096 | snap_all_tables | 111 | 0 | 0 | 0 | 0 | 40960 | 40 kB | 40960 | 40 kB | 172032 | 168 kB
 02:33:51.057096 | snap_statio_all_tables | 111 | 0 | 0 | 0 | 0 | 24576 | 24 kB | 24576 | 24 kB | 114688 | 112 kB
 02:33:51.057096 | pg_statistic | 23 | 85 | 0 | 495 | 0 | 16384 | 16 kB | 16384 | 16 kB | 360448 | 352 kB
 02:33:51.057096 | snap_pg_locks | 39 | 0 | 0 | 0 | 0 | 8192 | 8192 bytes | 32768 | 32 kB | 98304 | 96 kB
 02:33:51.057096 | snap_stat_activity | 6 | 0 | 0 | 0 | 0 | 0 | 0 bytes | 0 | 0 bytes | 32768 | 32 kB
 02:33:51.057096 | snap | 1 | 0 | 0 | 0 | 324 | 0 | 0 bytes | 0 | 0 bytes | 57344 | 56 kB
 02:33:51.057096 | snap_settings | 1 | 0 | 0 | 1 | 1 | 0 | 0 bytes | 0 | 0 bytes | 114688 | 112 kB
 02:33:51.057096 | snap_databases | 1 | 0 | 0 | 0 | 0 | 0 | 0 bytes | 0 | 0 bytes | 24576 | 24 kB
 02:33:51.057096 | pg_class | 0 | 1 | 0 | 1448 | 200 | 0 | 0 bytes | 0 | 0 bytes | 245760 | 240 kB
 02:33:51.057096 | pg_trigger | 0 | 0 | 0 | 3 | 0 | 0 | 0 bytes | 0 | 0 bytes | 65536 | 64 kB
 02:33:51.057096 | sql_parts | 0 | 0 | 0 | | 0 | 0 | 0 bytes | 0 | 0 bytes | 49152 | 48 kB
 02:33:51.057096 | pg_event_trigger | 0 | 0 | 0 | 0 | 0 | 0 | 0 bytes | 0 | 0 bytes | 16384 | 16 kB
 02:33:51.057096 | pg_language | 0 | 0 | 0 | 1 | 0 | 0 | 0 bytes | 0 | 0 bytes | 73728 | 72 kB
 02:33:51.057096 | pg_toast_3381 | 0 | 0 | 0 | 0 | 0 | 0 | 0 bytes | 0 | 0 bytes | 8192 | 8192 bytes
 02:33:51.057096 | pg_partitioned_table | 0 | 0 | 0 | 0 | 0 | 0 | 0 bytes | 0 | 0 bytes | 8192 | 8192 bytes
 02:33:51.057096 | pg_largeobject_metadata | 0 | 0 | 0 | 0 | 0 | 0 | 0 bytes | 0 | 0 bytes | 8192 | 8192 bytes
 02:33:51.057096 | pg_toast_16612 | 0 | 0 | 0 | 0 | 0 | 0 | 0 bytes | 0 | 0 bytes | 8192 | 8192 bytes

This script is a building-block. If you have a single database that you want stats on, it’s great. If you have dozens of databases in a single instance or dozens of instances, you’re going to quickly wish you had this data in a dashboard of some kind. Hopefully this gets you started with metric building against your Postgres databases. Practice the pgCraft, submit me a pull request! Next time, we’ll look more into some of the insights we can glean from the information we assemble here.
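Picking up on the dashboard point above: if you want to pull these reports out of the database and into a spreadsheet or BI tool, a small export script is enough to get started. The sketch below is illustrative only; it assumes psycopg2 is installed and that the snapshots schema from this post exists in the target database.

# Export snapshots.report_tables() output to CSV -- illustrative sketch only.
import csv
import psycopg2

START_SNAP, END_SNAP = 3, 4  # the snap_ids you want to compare

# Adjust connection parameters for your environment.
conn = psycopg2.connect(host="your.db.host.name", dbname="postgres", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute("SELECT * FROM snapshots.report_tables(%s, %s);", (START_SNAP, END_SNAP))
    header = [col[0] for col in cur.description]
    with open(f"report_tables_{START_SNAP}_{END_SNAP}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(cur.fetchall())
conn.close()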

Making your new normal safer with reCAPTCHA Enterprise from Cloud Blog

Matthew Emerick
15 Oct 2020
4 min read
Traffic from both humans and bots is at record highs. Since March 2020, reCAPTCHA has seen a 40% increase in usage - businesses and services that previously saw most of their users in person have shifted to online-first or online-only. This increased demand for online services and transactions can expose businesses to various forms of online fraud and abuse, and without dedicated teams familiar with these attacks and how to stop them, we’ve seen hundreds of thousands of new websites come to reCAPTCHA for visibility and protection.

During COVID-19, reCAPTCHA is playing a critical role helping global public sector agencies to distribute masks and other supplies, provide up-to-date information to constituents, and secure user accounts from distributed attacks. The majority of these agencies are using the score-based detection that comes from reCAPTCHA v3 or reCAPTCHA Enterprise instead of showing the visual or audio challenges found in reCAPTCHA v2. This reduces friction for users and also gives teams flexibility on how to take action on bot requests and fraudulent activity.

reCAPTCHA Enterprise can also help protect your business. Whether you’re moving operations online for the first time or have your own team of security engineers, reCAPTCHA can help you detect new web attacks, understand the threats, and take action to keep your users safe. Many enterprises lack visibility in parts of their site, and adding reCAPTCHA helps to expose costly attacks before they happen. The console shows the risk associated with each action to help your business stay ahead.

Unlike many other abuse and fraud fighting platforms, reCAPTCHA doesn’t rely on invasive fingerprinting. These techniques can often penalize privacy-conscious users who try to keep themselves safe with tools such as private networks, and are in conflict with browsers’ pushes for privacy-by-default. Instead, we’ve shifted our focus to in-session behavioral risk analysis, detecting fraudulent behavior rather than caring about who or what is behind the network connection. We’ve found this to be extremely effective in detecting attacks in a world where adversaries have control of millions of IP addresses and compromised devices, and regularly pay real humans to manually bypass detections.

Since we released reCAPTCHA Enterprise last year, we’ve been able to work more closely with existing and new customers, collaborating on abuse problems and determining best practices in specific use cases, such as account takeovers, carding, and scraping. The more granular score distribution that comes with reCAPTCHA Enterprise gives customers more fine-tuned control over when and how to take action. reCAPTCHA Enterprise learns how to score requests specific to the use case, but the score is also best used in a context-specific way. Our most successful customers use features to delay feedback to adversaries, such as limiting capabilities of suspicious accounts, requiring additional verification for sensitive purchases, and manually moderating content likely generated by a bot.

We also recently released a report by ESG where they evaluated the effectiveness of reCAPTCHA Enterprise as deployed in a real-world hyperscale website to protect against automated credential stuffing and account takeover attacks. ESG noted: “Approximately two months after reCAPTCHA Enterprise deployment, login attempts dropped by approximately 90% while the registered user base grew organically.”

We’re continually developing new types of signals to detect abuse at scale. Across the four million sites with reCAPTCHA protections enabled, we defend everything from accounts, to e-commerce transactions, to food distribution after disasters, to voting for your favorite celebrity. Now more than ever, we’re proud to be protecting our customers and their users. To see reCAPTCHA Enterprise in action, check out our latest video. To get started with reCAPTCHA Enterprise, contact our sales team.
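To make the score-based, context-specific approach described above a little more concrete, here is a purely hypothetical sketch of how a backend might map a risk score to progressively stronger actions. The thresholds and action names are invented for illustration and are not part of the reCAPTCHA Enterprise API.

# Hypothetical score-to-action mapping -- not the reCAPTCHA Enterprise API.
# The score is whatever assessment value your backend already obtains.
def choose_action(score: float, sensitive: bool) -> str:
    """Map a 0.0 (likely automated) .. 1.0 (likely human) score to an action."""
    if score >= 0.7:
        return "allow"
    if score >= 0.3:
        # Delay feedback to adversaries instead of blocking outright.
        return "require_additional_verification" if sensitive else "limit_capabilities"
    return "queue_for_manual_moderation"

# Example: a checkout request that scored 0.4 on a sensitive purchase.
print(choose_action(0.4, sensitive=True))  # -> require_additional_verification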

Query & Interact with Apps in Android 11 with Package Visibility from Xamarin Blog

Matthew Emerick
15 Oct 2020
4 min read
Android 11 introduced several exciting updates for developers to integrate into their app experience, including new device and media controls, enhanced support for foldables, and a lot more. In addition to new features, there are also several privacy enhancements that developers need to integrate into their application when upgrading and re-targeting to Android 11. One of those enhancements is the introduction of package visibility, which alters the ability to query installed applications and packages on a user’s device.

When you want to open a browser or send an email, your application has to launch and interact with another application on the device through an Intent. Before calling StartActivity it is best practice to QueryIntentActivities or ResolveActivity to ensure there is an application that can handle the request. If you are using Xamarin.Essentials, then you may not have seen these APIs because the library handles all of the logic for you automatically for Browser (External), Email, and SMS.

Before Android 11, every app could easily query all installed applications and see if a specific Intent would open when StartActivity is called. That has all changed with Android 11 and the introduction of package visibility. You will now need to declare what intents and data schemes you want your app to be able to query when your app is targeting Android 11. Once you retarget to Android 11 and run your application on a device running Android 11, you will receive zero results if you use QueryIntentActivities. If you are using Xamarin.Essentials, you will receive a FeatureNotSupportedException when you try to call one of the APIs that needs to query activities.

Let’s say you are using the Email feature of Xamarin.Essentials. Your code may look like this:

public async Task SendEmail(string subject, string body, List<string> recipients)
{
    try
    {
        var message = new EmailMessage
        {
            Subject = subject,
            Body = body,
            To = recipients
        };
        await Email.ComposeAsync(message);
    }
    catch (FeatureNotSupportedException fbsEx)
    {
        // Email is not supported on this device
    }
    catch (Exception ex)
    {
        // Some other exception occurred
    }
}

If your app targeted Android 10 and earlier, it would just work. With package visibility in Android 11, when you try to send an email, Xamarin.Essentials will try to query for packages that support email and zero results will be returned. This results in a FeatureNotSupportedException being thrown, which is not ideal. To enable your application to get visibility into the packages, you will need to add a list of queries into your AndroidManifest.xml:

<manifest package="com.mycompany.myapp">
  <queries>
    <intent>
      <action android:name="android.intent.action.SENDTO" />
      <data android:scheme="mailto" />
    </intent>
  </queries>
</manifest>

If you need to query multiple intents or use multiple APIs, you will need to add them all into the list:

<queries>
  <intent>
    <action android:name="android.intent.action.SENDTO" />
    <data android:scheme="mailto" />
  </intent>
  <intent>
    <action android:name="android.intent.action.VIEW" />
    <data android:scheme="http"/>
  </intent>
  <intent>
    <action android:name="android.intent.action.VIEW" />
    <data android:scheme="https"/>
  </intent>
  <intent>
    <action android:name="android.intent.action.VIEW" />
    <data android:scheme="smsto"/>
  </intent>
</queries>

And there you have it: with just a small amount of configuration, your app will continue to work flawlessly when you target Android 11.

Learn More

Be sure to browse through the official Android 11 documentation on package visibility, and of course the newly updated Xamarin.Essentials documentation. Finally, be sure to read through the Xamarin.Android 11 release notes. The post Query & Interact with Apps in Android 11 with Package Visibility appeared first on Xamarin Blog.

Rookout Launches a ‘Live Debugging Heatmap’ To Find Applications On Fire from DevOps.com

Matthew Emerick
15 Oct 2020
1 min read
October 15, 2020 13:00 ET | Source: Rookout According to the 2020 State of Software Quality report, two out of three software developers estimate they spend at least a day per week troubleshooting issues in their code, and close to one-third spend even more time. DEJ’s research shows that organizations are losing $2,129,000 per month, on average, due to delays in application […] The post Rookout Launches a ‘Live Debugging Heatmap’ To Find Applications On Fire appeared first on DevOps.com.

Alexey Lesovsky: Postgres 13 Observability Updates from Planet PostgreSQL

Matthew Emerick
15 Oct 2020
2 min read
New shiny Postgres 13 has been released and now it’s time to make some updates to the “Postgres Observability” diagram. The new release includes many improvements related to monitoring, such as new stats views and new fields added to existing views. Let’s take a closer look at these.

The list of progress views has been extended with two new views. The first one is “pg_stat_progress_basebackup”, which helps to observe running base backups and estimate their progress, ETA and other properties. The second view is “pg_stat_progress_analyze”, which, as the name suggests, watches over ANALYZE operations. The third new view is called “pg_shmem_allocations”, which is supposed to be used for deeper inspection of how shared buffers are used. The fourth, and last, new view is “pg_stat_slru”, related to the inspection of SLRU caches. Both of these recently added views help to answer the question “How does Postgres spend its allocated memory?”

Other improvements are general-purpose and related to the existing views. The “pg_stat_statements” view has a few modifications:

- New fields related to planning time have been added, and due to this the existing “time” fields have been renamed to execution time. So all monitoring tools that rely on pg_stat_statements should be adjusted accordingly.
- New fields related to WAL have been added – now it’s possible to understand how much WAL has been generated by each statement.

WAL usage statistics have also been added to EXPLAIN (a WAL keyword was added), auto_explain and autovacuum. WAL usage stats are appended to the logs (that is, if log_autovacuum_min_duration is enabled).

pg_stat_activity has a new column “leader_pid”, which shows the PID of the parallel group leader and helps to explicitly identify background workers with their leader.

A huge thank you goes to the many who contributed to this new release, among whom are my colleagues Victor Yegorov and Sergei Kornilov, and also those who help to spread the word about Postgres to other communities and across geographies. The post Postgres 13 Observability Updates appeared first on Data Egret.
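As a quick illustration of the renamed and newly added pg_stat_statements columns mentioned above, here is a small query sketch; it assumes a Postgres 13 instance with the pg_stat_statements extension installed and psycopg2 available, and the connection details are placeholders.

# Peek at Postgres 13's split planning/execution timings and new WAL counters
# in pg_stat_statements -- illustrative sketch only.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="postgres", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT query, calls, total_plan_time, total_exec_time, wal_records, wal_bytes
        FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 5;
    """)
    for query, calls, plan_ms, exec_ms, wal_records, wal_bytes in cur.fetchall():
        print(f"{exec_ms:10.1f} ms exec | {plan_ms:8.1f} ms plan | "
              f"{wal_bytes:>12} WAL bytes | {calls:>6} calls | {query[:60]!r}")
conn.close()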

Thursday News, October 15 from Featured Blog Posts - Data Science Central

Matthew Emerick
15 Oct 2020
1 min read
Here is our selection of articles and technical contributions featured on DSC since Monday:

Announcements
- Penn State Master’s in Data Analytics – 100% Online
- eBook: Data Preparation for Dummies

Technical Contributions
- A quick demonstration of polling confidence interval calculations using simulation
- Why you should NEVER run a Logistic Regression (unless you have to)
- Cross-validation and hyperparameter tuning
- Why You Should Learn Sitecore CMS?

Articles
- AI is Driving Software 2.0… with Minimal Human Intervention
- Applications of Machine Learning in FinTech
- Why Fintech is the Future of Banking?
- Real Estate: How it is Impacted by Business Intelligence
- Determining How Cloud Computing Benefits Data Science

Enjoy the reading!

Automobile Repair Self-Diagnosis and Traffic Light Management Enabled by AI from AI Trends

Matthew Emerick
15 Oct 2020
5 min read
By AI Trends Staff

Looking inside and outside, AI is being applied to the self-diagnosis of automobiles and to the connection of vehicles to traffic infrastructure.

A data scientist at BMW Group in Munich, while working on his PhD, created a system for self-diagnosis called the Automated Damage Assessment Service, according to an account in Mirage. Milan Koch was completing his studies at the Leiden Institute of Advanced Computer Science in the Netherlands when he got the idea. “It should be a nice experience for customers,” he stated. The system gathers data over time from sensors in different parts of the car. “From scratch, we have developed a service idea that is about detecting damaged parts from low speed accidents,” Koch stated. “The car itself is able to detect the parts that are broken and can estimate the costs and the time of the repair.”

Milan Koch, data scientist, BMW Group, Munich

Koch developed and compared different multivariate time series methods, based on machine learning, deep learning and also state-of-the-art automated machine learning (AutoML) models. He tested different levels of complexity to find the best way to solve the time series problems. Two of the AutoML methods and his hand-crafted machine learning pipeline showed the best results. The system may have application to other multivariate time series problems, where multiple time-dependent variables must be considered, outside the automotive field.

Koch collaborated with researchers from the Leiden University Medical Center (LUMC) to use his hand-crafted pipeline to analyze Electroencephalography (EEG) data. Koch stated, “We predicted the cognition of patients based on EEG data, because an accurate assessment of cognitive function is required during the screening process for Deep Brain Stimulation (DBS) surgery. Patients with advanced cognitive deterioration are considered suboptimal candidates for DBS as cognitive function may deteriorate after surgery. However, cognitive function is sometimes difficult to assess accurately, and analysis of EEG patterns may provide additional biomarkers. Our machine learning pipeline was well suited to apply to this problem.”

He added, “We developed algorithms for the automotive domain and initially we didn’t have the intention to apply it to the medical domain, but it worked out really well.” His models are now also applied to Electromyography (EMG) data, to distinguish between people with a motor disease and healthy people. Koch intends to continue his work at BMW Group, where he will focus on customer-oriented services, predictive maintenance applications and optimization of vehicle diagnostics.

DOE Grant to Research Traffic Management Delays Aims to Reduce Emissions

Getting automobiles to talk to the traffic management infrastructure is the goal of research at the University of Tennessee at Chattanooga, which has been awarded $1.89 million from the US Department of Energy to create a new model for traffic intersections that would reduce energy consumption. The UTC Center for Urban Informatics and Progress (CUIP) will leverage its existing “smart corridor” to accommodate the new research. The smart corridor is a 1.25-mile span on a main artery in downtown Chattanooga, used as a test bed for research into smart city development and connected vehicles in a real-world environment.

“This project is a huge opportunity for us,” stated Dr. Mina Sartipi, CUIP Director and principal investigator, in a press release. “Collaborating on a project that is future-oriented, novel, and full of potential is exciting. This work will contribute to the existing body of literature and lead the way for future research.” UTC is collaborating with the University of Pittsburgh, the Georgia Institute of Technology, the Oak Ridge National Laboratory, and the City of Chattanooga on the project.

Dr. Mina Sartipi, Director, UTC Center for Urban Informatics and Progress

In the grant proposal for the DOE, the research team noted that the US transportation sector accounted for more than 69 percent of petroleum consumption, and more than 37 percent of the country’s CO2 emissions. An earlier National Traffic Signal Report Card found that inefficient traffic signals contribute to 295 million vehicle hours of traffic delay, making up to 10 percent of all traffic-related delays.

The project intends to leverage the capabilities of connected vehicles and infrastructures to optimize and manage traffic flow. While adaptive traffic control systems (ATCS) have been in use for a half century to improve mobility and traffic efficiency, they were not designed to address fuel consumption and emissions. Inefficient traffic systems increase idling time and stop-and-go traffic. The National Transportation Operations Coalition has graded the state of the nation’s traffic signals as D+.

“The next step in the evolution [of intelligent transportation systems] is the merging of these systems through AI,” noted Aleksandar Stevanovic, associate professor of civil and environmental engineering at Pitt’s Swanson School of Engineering and director of the Pittsburgh Intelligent Transportation Systems (PITTS) Lab. “Creation of such a system, especially for dense urban corridors and sprawling exurbs, can greatly improve energy and sustainability impacts. This is critical as our transportation portfolio will continue to have a heavy reliance on gasoline-powered vehicles for some time.”

The goal of the three-year project is to develop a dynamic feedback Ecological Automotive Traffic Control System (Eco-ATCS), which reduces fuel consumption and greenhouse gases while maintaining a highly operable and safe transportation environment. The integration of AI will allow additional infrastructure enhancements including emergency vehicle preemption, transit signal priority, and pedestrian safety. The ultimate goal is to reduce corridor-level fuel consumption by 20 percent.

Read the source articles and information in Mirage, and in a press release from the UTC Center for Urban Informatics and Progress.

New Dataproc optional components support Apache Flink and Docker from Cloud Blog

Matthew Emerick
15 Oct 2020
5 min read
Google Cloud’s Dataproc lets you run native Apache Spark and Hadoop clusters on Google Cloud in a simpler, more cost-effective way. In this blog, we will talk about our newest optional components available in Dataproc’s Component Exchange: Docker and Apache Flink.

Docker container on Dataproc

Docker is a widely used container technology. Since it’s now a Dataproc optional component, Docker daemons can be installed on every node of the Dataproc cluster. This gives you the ability to install containerized applications and interact with Hadoop clusters easily on the cluster. In addition, Docker is also critical to supporting these features:

- Running containers with YARN
- Portable Apache Beam jobs

Running containers on YARN allows you to manage dependencies of your YARN application separately, and also allows you to create containerized services on YARN. Get more details here. Portable Apache Beam packages jobs into Docker containers and submits them to the Flink cluster. Find more detail about Beam portability.

The Docker optional component is also configured to use Google Container Registry, in addition to the default Docker registry. This lets you use container images managed by your organization.

Here is how to create a Dataproc cluster with the Docker optional component:

gcloud beta dataproc clusters create <cluster-name> --optional-components=DOCKER --image-version=1.5

When you run the Docker application, the log will be streamed to Cloud Logging, using the gcplogs driver. If your application does not depend on any Hadoop services, check out Kubernetes and Google Kubernetes Engine to run containers natively. For more on using Dataproc, check out our documentation.

Apache Flink on Dataproc

Among streaming analytics technologies, Apache Beam and Apache Flink stand out. Apache Flink is a distributed processing engine using stateful computation. Apache Beam is a unified model for defining batch and streaming processing pipelines. Using Apache Flink as an execution engine, you can also run Apache Beam jobs on Dataproc, in addition to Google’s Cloud Dataflow service. Flink and running Beam on Flink are suitable for large-scale, continuous jobs, and provide:

- A streaming-first runtime that supports both batch processing and data streaming programs
- A runtime that supports very high throughput and low event latency at the same time
- Fault-tolerance with exactly-once processing guarantees
- Natural back-pressure in streaming programs
- Custom memory management for efficient and robust switching between in-memory and out-of-core data processing algorithms
- Integration with YARN and other components of the Apache Hadoop ecosystem

Our Dataproc team here at Google Cloud recently announced that Flink Operator on Kubernetes is now available. It allows you to run Apache Flink jobs in Kubernetes, bringing the benefits of reducing platform dependency and producing better hardware efficiency.

Basic Flink Concepts

A Flink cluster consists of a Flink JobManager and a set of Flink TaskManagers. Like similar roles in other distributed systems such as YARN, the JobManager has responsibilities such as accepting jobs, managing resources and supervising jobs. TaskManagers are responsible for running the actual tasks. When running Flink on Dataproc, we use YARN as the resource manager for Flink. You can run Flink jobs in two ways: job cluster and session cluster. For a job cluster, YARN will create a JobManager and TaskManagers for the job and will destroy the cluster once the job is finished. For session clusters, YARN will create a JobManager and a few TaskManagers. The cluster can serve multiple jobs until being shut down by the user.

How to create a cluster with Flink

Use this command to get started:

gcloud beta dataproc clusters create <cluster-name> --optional-components=FLINK --image-version=1.5

How to run a Flink job

After a Dataproc cluster with Flink starts, you can submit your Flink jobs to YARN directly using the Flink job cluster. After accepting the job, Flink will start a JobManager and slots for this job in YARN. The Flink job will be run in the YARN cluster until finished. The JobManager created will then be shut down. Job logs will be available in regular YARN logs. Try this command to run a word-counting example:

The Dataproc cluster will not start a Flink Session cluster by default. Instead, Dataproc will create the script “/usr/bin/flink-yarn-daemon,” which will start a Flink session. If you want to start a Flink session when Dataproc is created, use the metadata key to allow it: If you want to start the Flink session after Dataproc is created, you can run the following command on the master node: Submit jobs to that session cluster. You’ll need to get the Flink JobManager URL:

How to run a Java Beam job

It is very easy to run an Apache Beam job written in Java. There is no extra configuration needed. As long as you package your Beam jobs into a JAR file, you do not need to configure anything to run Beam on Flink. This is the command you can use:

How to run a Python Beam job written in Python

Beam jobs written in Python use a different execution model. To run them in Flink on Dataproc, you will also need to enable the Docker optional component. Here’s how to create a cluster: You will also need to install the necessary Python libraries needed by Beam, such as apache_beam and apache_beam[gcp]. You can pass in a Flink master URL to let it run in a session cluster. If you leave the URL out, you need to use the job cluster mode to run this job: After you’ve written your Python job, simply run it to submit: Learn more about Dataproc.
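As a rough, hedged sketch of the Python Beam path described above: a minimal word-count pipeline pointed at a Flink session cluster might look roughly like this. It assumes apache_beam is installed; the JobManager address is a placeholder, and option names such as --flink_master can vary between Beam releases, so treat this as illustrative rather than definitive.

# Minimal Beam word count targeting a Flink session cluster -- rough sketch.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder address of the Flink session cluster's JobManager.
options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=<jobmanager-host>:8081",
])

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Create lines" >> beam.Create(["to be or not to be", "that is the question"])
        | "Split words" >> beam.FlatMap(str.split)
        | "Count words" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
        | "Print" >> beam.Map(print)
    )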

Prevent planned downtime during the holiday shopping season with Cloud SQL from Cloud Blog

Matthew Emerick
15 Oct 2020
3 min read
Routine database maintenance is a way of life. Updates keep your business running smoothly and securely. And with a managed service, like Cloud SQL, your databases automatically receive the latest patches and updates, with significantly less downtime. But we get it: Nobody likes downtime, no matter how brief.

That's why we’re pleased to announce that Cloud SQL, our fully managed database service for MySQL, PostgreSQL, and SQL Server, now gives you more control over when your instances undergo routine maintenance. Cloud SQL is introducing maintenance deny period controls. With maintenance deny periods, you can prevent automatic maintenance from occurring during a 90-day time period.

This can be especially useful for the Cloud SQL retail customers about to kick off their busiest time of year, with Black Friday and Cyber Monday just around the corner. This holiday shopping season is a time of peak load that requires heightened focus on infrastructure stability, and any upgrades can put that at risk. By setting a maintenance deny period from mid-October to mid-January, these businesses can prevent planned upgrades from Cloud SQL during this critical time.

Understanding Cloud SQL maintenance

Before describing these new controls, let’s answer a few questions we often hear about the automatic maintenance that Cloud SQL performs.

What is automatic maintenance?

To keep your databases stable and secure, Cloud SQL automatically patches and updates your database instance (MySQL, Postgres, and SQL Server), including the underlying operating system. To perform maintenance, Cloud SQL must temporarily take your instances offline.

What is a maintenance window?

Maintenance windows allow you to control when maintenance occurs. Cloud SQL offers maintenance windows to minimize the impact of planned maintenance downtime to your applications and your business. Defining the maintenance window lets you set the hour and day when an update occurs, such as only when database activity is low (for example, on Saturday at midnight). Additionally, you can control the order of updates for your instance relative to other instances in the same project (“Earlier” or “Later”). Earlier timing is useful for test instances, allowing you to see the effects of an update before it reaches your production instances.

What are the new maintenance deny period controls?

You can now set a single deny period, configurable from 1 to 90 days, each year. During the deny period, Cloud SQL will not perform maintenance that causes downtime on your database instance. Deny periods can be set to reduce the likelihood of downtime during the busy holiday season, your next product launch, end-of-quarter financial reporting, or any other important time for your business. Paired with Cloud SQL’s existing maintenance notification and rescheduling functionality, deny periods give you even more flexibility and control. After receiving a notification of upcoming maintenance, you can reschedule ad hoc, or if you want to prevent maintenance longer, set a deny period.

Getting started with Cloud SQL’s new maintenance control

Review our documentation to learn more about maintenance deny periods and, when you're ready, start configuring them for your database instances.

What’s next for Cloud SQL

Support for additional maintenance controls continues to be a top request from users. These new deny periods are an addition to the list of existing maintenance controls for Cloud SQL. Have more ideas? Let us know what other features and capabilities you need with our Issue Tracker and by joining the Cloud SQL discussion group. We’re glad you’re along for the ride, and we look forward to your feedback!
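For readers who prefer scripting to the console, the Cloud SQL Admin API exposes the deny period as part of instance settings. The sketch below uses the Python API client, and the exact field names (denyMaintenancePeriods, startDate, endDate, time) are assumptions from memory of that API; check the current API reference before relying on it.

# Set a maintenance deny period via the Cloud SQL Admin API -- hedged sketch.
# Field names below are assumed and should be verified against the current
# API reference. Uses Application Default Credentials for authentication.
from googleapiclient import discovery

sqladmin = discovery.build("sqladmin", "v1beta4")
body = {
    "settings": {
        "denyMaintenancePeriods": [
            {"startDate": "2020-10-15", "endDate": "2021-01-15", "time": "00:00:00"}
        ]
    }
}
request = sqladmin.instances().patch(project="my-project", instance="my-instance", body=body)
print(request.execute())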

AI is Driving Software 2.0… with Minimal Human Intervention from Featured Blog Posts - Data Science Central

Matthew Emerick
15 Oct 2020
6 min read
The future of software development will be model-driven, not code-driven.

Now that my 4th book (“The Economics of Data, Analytics and Digital Transformation”) is in the hands of my publisher, it’s time to get back to work investigating and sharing new learnings. In this blog I’ll take on the subject of Software 2.0. And thanks Jens for the push in this direction!

Imagine trying to distinguish a dog from other animals in a photo by coding if-then statements: If the animal has four legs (except when it only has 3 legs due to an accident), and if the animal has short fur (except when it is a hair dog or a chihuahua with no fur), and if the animal has medium length ears (except when the dog is a bloodhound), and if the animal has medium length legs (except when it’s a bull dog), and if… Well, you get the point. In fact, it is probably impossible to distinguish a dog from other animals coding in if-then statements. And that’s where the power of model-based (AI and Deep Learning) programming shows its strength: to tackle programming problems – such as facial recognition, natural language processing, real-time dictation, image recognition – that are nearly impossible to address using traditional rule-based programming (see Figure 1).

Figure 1: How Deep Learning Works

As discussed in “2020 Challenge: Unlearn to Change Your Frame”, most traditional analytics are rule based; the analytics make decisions guided by a pre-determined set of business or operational rules. However, AI and Deep Learning make decisions based upon the “learning” gleaned from the data. Deep Learning “learns” the characteristics of entities in order to distinguish cats from dogs, tanks from trucks, or healthy cells from cancerous cells (see Figure 2).

Figure 2: Rules-based versus Learning-based Programming

This learning amplifies when there is a sharing of the learnings across a collection of similar assets – vehicles, trains, airplanes, compressors, turbines, motors, elevators, cranes – so that the learnings of one asset can be aggregated and backpropagated to the cohort of assets.

The Uncertain Future of Programming

A recent announcement from NVIDIA has the AI community abuzz, and software developers worrying about their future. NVIDIA researchers recently used AI to recreate the classic video game Pac-Man. NVIDIA created an AI model using Generative Adversarial Networks (GANs) (called NVIDIA GameGAN) that can generate a fully functional version of Pac-Man without the coding associated with building the underlying game engine. The AI model was able to recreate the game without having to “code” the game’s fundamental rules (see Figure 3).

Figure 3: “How GANs and Adaptive Content Will Change Learning, Entertainment and More”

Using AI and Machine Learning (ML) to create software without the need to code the software is driving the “Software 2.0” phenomenon. And it is impressive. An outstanding presentation from Kunle Olukotun titled “Designing Computer Systems for Software 2.0” discussed the potential of Software 2.0 to use machine learning to generate models from data and replace traditional software development (coding) for many applications.

Software 2.0[1]

Due to the stunning growth of Big Data and IoT, Neural Networks now have access to enough detailed, granular data to surpass conventional coded algorithms in the predictive accuracy of complex models in areas such as image recognition, natural language processing, autonomous vehicles, and personalized medicine. Instead of coding software algorithms in the traditional development manner, you train a Neural Network – leveraging backpropagation and stochastic gradient descent – to optimize the neural network nodes’ weights to deliver the desired outputs or outcomes (see Figure 4).

Figure 4: “Neural Networks: Is Meta-learning the New Black?”

With model-driven software development, it is often easier to train a model than to manually code an algorithm, especially for complex applications like Natural Language Processing (NLP) and image recognition. Plus, model-driven software development is often more predictable in terms of runtimes and memory usage compared to conventional algorithms. For example, Google’s Jeff Dean reported that 500 lines of TensorFlow code replaced 500,000 lines of code in Google Translate. And while a thousand-fold reduction is huge, what’s more significant is how this code works: rather than half a million lines of static code, the neural network can learn and adapt as biases and prejudices in the data are discovered.

Software 2.0 Challenge: Data Generation

In the article “What machine learning means for software development”, Andrew Karpathy states that neural networks have proven they can perform almost any task for which there is sufficient training data. Training Neural Networks to beat Go or Chess or StarCraft is possible because of the large volume of associated training data. It’s easy to collect training data for Go or Chess as there is over 150 years of data from which to train the models. And training image recognition programs is facilitated by the 14 million labeled images available on ImageNet.

However, there is not always sufficient data to train neural network models. Significant effort must be invested to create and engineer training data, using techniques such as noisy labeling schemes, data augmentation, data engineering, and data reshaping, to power the model-based neural network applications.

Welcome to Snorkel. Snorkel (damn cool name) is a system for programmatically building and managing training datasets without manual labeling. Snorkel can automatically develop, clean and integrate large training datasets using three different programmatic operations (see Figure 5):

- Labeling data through the use of heuristic rules or distant supervision techniques
- Transforming or augmenting the data by rotating or stretching images
- Slicing data into different subsets for monitoring or targeted improvement

Figure 5: Programmatically Building and Managing Training Data with Snorkel

Snorkel is a powerful tool for data labeling and data synthesis. Labeling data manually is very time-consuming, and Snorkel can address this issue programmatically; the resulting data can be validated by human beings by looking at some samples of the data. See “Snorkel Intro Tutorial: Data Augmentation” for more information on its workings.

Software 2.0 Summary

There are certain complex programming problems – facial recognition, natural language processing, real-time dictation, image recognition, autonomous vehicles, precision medicine – that are nearly impossible to address using traditional rule-based programming. In these cases, it is easier to create AI, Deep Learning and Machine Learning models that can be trained (with large data sets) to deliver the right actions versus being coded to deliver the right actions. This is the philosophy of Software 2.0.

Instead of coding software algorithms in the traditional development manner, you train a Neural Network to optimize the neural network nodes’ weights to deliver the desired outputs or outcomes. And model-driven programs have the added advantage of being able to learn and adapt… the neural network can learn and adapt as biases and prejudices in the data are discovered. However, there is not always sufficient data to train neural network models. In those cases, new tools like Snorkel can help… Snorkel can automatically develop, clean and integrate large training datasets.

The future of software development will be model-driven, not code-driven.

Article Sources:
- Machine Learning vs Traditional Programming
- Designing Computer Systems for Software 2.0 (PDF)
- Software Ate the World, Now AI Is Eating Software: The road to Software 2.0

[1] Kunle Olukotun’s presentation and video.
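To ground the Snorkel discussion above, here is a small sketch of programmatic labeling in roughly the style of Snorkel's Python API. The imports and class names reflect the snorkel 0.9-era labeling package as I recall it, so treat them as assumptions to verify against the current documentation.

# Programmatic labeling with Snorkel-style labeling functions -- hedged sketch.
# Verify imports/classes against the current Snorkel documentation before use.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

SPAM, HAM, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_contains_link(x):
    # Heuristic rule: messages with links look like spam.
    return SPAM if "http://" in x.text or "https://" in x.text else ABSTAIN

@labeling_function()
def lf_short_message(x):
    # Heuristic rule: very short messages look like normal replies.
    return HAM if len(x.text.split()) < 5 else ABSTAIN

df_train = pd.DataFrame({"text": ["check out https://example.com", "ok thanks", "see you soon friend"]})
applier = PandasLFApplier(lfs=[lf_contains_link, lf_short_message])
L_train = applier.apply(df=df_train)

# Combine the noisy labels into probabilistic training labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=200)
print(label_model.predict(L=L_train))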

Introducing .NET Live TV – Daily Developer Live Streams from .NET Blog

Matthew Emerick
15 Oct 2020
4 min read
Today, we are launching .NET Live TV, your one stop shop for all .NET and Visual Studio live streams across Twitch and YouTube. We are always looking for new ways to bring great content to the developer community and innovate in ways to interact with you in real-time. Live streaming gives us the opportunity to deliver more content where everyone can ask questions and interact with the product teams.

We started our journey several years ago with the .NET Community Standup series. It’s a weekly “behind the scenes” live stream that shows you what goes into building the runtimes, languages, frameworks, and tools we all love. As it grew, so did our dreams of delivering even more awesome .NET live stream content. .NET Live TV takes things to a whole new level with the introduction of new shows and a new website. It is a single place to bookmark so you can stay up to date with live streams across several Twitch and YouTube channels and with a single click can join in the conversation. Here are some of the new shows that recently launched that you can look forward to:

Expanded .NET Community Standups

What started as the ASP.NET Community Standup has grown into 7 unique shows throughout the month! Here is a quick guide to the schedule of the shows, which all start at 10:00 AM Pacific:

- Tuesday: ASP.NET hosted by Jon Galloway, with a monthly Blazor focus hosted by Safia Abdalla kicking off Nov 17!
- Wednesday: Rotating – Entity Framework hosted by Jeremy Likness & Machine Learning hosted by Bri Achtman
- 1st Thursday: Xamarin hosted by Maddy Leger
- 2nd Thursday: Languages & Runtime hosted by Immo Landwerth
- 3rd Thursday: .NET Tooling hosted by Kendra Havens
- 4th Thursday: .NET Desktop hosted by Olia Gavrysh

Packed Week of .NET Shows!

All times Pacific:

- Monday 6:00 AM: Join Jeff Fritz (csharpfritz) in this start-from-the-beginning series to learn C# in this talk-show format that answers viewers’ questions and provides interactive samples with every episode.
- Monday 9:00 AM: Join David Pine, Scott Addie, Cam Soper, and more from the Developer Relations team at Microsoft each week as they highlight the amazing community members in the .NET community.
- Monday 11:00 AM: A weekly show dedicated to the topic of working from home. Mads Kristensen from the Visual Studio team invites guests onto the show for conversations about anything and everything related to Visual Studio and working from home.
- Tuesday 10:00 AM: Join members from the ASP.NET teams for our community standup covering great community contributions for ASP.NET, ASP.NET Core, and more.
- Tuesday 12:00 PM: Join Instafluff (Raphael) each week live on Twitch as he works on fun C# game related projects from his C# corner.
- Wednesday 10:00 AM: Join the Entity Framework and the Machine Learning teams for their community standups covering great community contributions.
- Thursday 10:00 AM: Join the Xamarin, Languages & Runtime, .NET Tooling, and .NET Desktop teams covering great community contributions for each subject.
- Thursday 2:00 PM: Join Cecil Phillip as he hosts and interviews amazing .NET contributors from the .NET teams and the community.
- Friday 2:00 PM: Join Mads Kristensen from the Visual Studio team each week as he builds extensions for Visual Studio live!

More to come!

Be sure to bookmark live.dot.net as we are live streaming 5 days a week and adding even more shows soon! If you are looking for even more great developer video content, be sure to check out Microsoft Learn TV, where in addition to some of the shows from .NET Live TV you will find 24-hour-a-day streaming content on all topics. The post Introducing .NET Live TV – Daily Developer Live Streams appeared first on .NET Blog.

How-to: Index Data from S3 via NiFi Using CDP Data Hubs from Cloudera Blog

Matthew Emerick
15 Oct 2020
10 min read
About this Blog Data Discovery and Exploration (DDE) was recently released in tech preview in Cloudera Data Platform in public cloud. In this blog we will go through the process of indexing data from S3 into Solr in DDE with the help of NiFi in Data Flow. The scenario is the same as it was in the previous blog but the ingest pipeline differs. Spark as the ingest pipeline tool for Search (i.e. Solr) is most commonly used for batch indexing data residing in cloud storage, or if you want to do heavy transformations of the data as a pre-step before sending it to indexing for easy exploration. NiFi (as depicted in this blog) is used for real time and often voluminous incoming event streams that need to be explorable (e.g. logs, twitter feeds, file appends etc). Our ambition is not to use any terminal or a single shell command to achieve this. We have a UI tool for every step we need to take.  Assumptions The prerequisites to pull this feat are pretty similar to the ones in our previous blog post, minus the command line access: You have a CDP account already and have power user or admin rights for the environment in which you plan to spin up the services. If you do not have a CDP AWS account, please contact your favorite Cloudera representative, or sign up for a CDP trial here. You have environments and identities mapped and configured. More explicitly, all you need is to have the mapping of the CDP User to an AWS Role which grants access to the specific S3 bucket you want to read from (and write to). You have a workload (FreeIPA) password already set. You have  DDE and  Flow Management Data Hub clusters running in your environment. You can also find more information about using templates in CDP Data Hub here. You have AWS credentials to be able to access an S3 bucket from Nifi. Here is documentation on how to acquire AWS credentials and how to create a bucket and upload files to it. You have a sample file in an S3 bucket that is accessible for your CDP user.  If you don’t have a sample file, here is a link to the one we used. Note: the workflow discussed in this blog was written with the linked ‘films.csv’ file in mind. If you use a different one, you might need to do things slightly differently, e.g. when creating the Solr collection) Pro Tip for the novice user: to download a CSV file from GitHub, view it by clicking the RAW button and then use the Save As option in the browser File menu. Workflow To replicate what we did, you need to do the following: Create a collection using Hue. Build a dataflow in NiFi. Run the NiFi flow. Check if everything went well NiFi logs and see the indexed data on Hue. Create a collection using Hue You can create a collection using the solrctrl CLI. Here we chose to use HUE in the DDE Data Hub cluster: 1.In the Services section of the DDE cluster details page, click the Hue shortcut. 2. On the Hue webUI select Indexes> + ‘Create index’ > from the Type drop down select ‘Manually’> Click Next. 3. Provide a collection Name under Destination (in this example, we named it ‘solr-nifi-demo’). 4. Add the following  Fields, using the + Add Field button: Name Type name text_general initial_release_date date 5. Click Submit. 6. To check that the collection has indeed been created, go to the Solr webUI by clicking the Solr Server shortcut on the DDE cluster details page. 7. 
7. Once there, either click the Collections option in the sidebar, or click Select an option and pick the collection you have just created ('solr-nifi-demo' in our example) from the drop-down, then click the collection > Query > Execute Query. You should get something very similar to:

   {
     "responseHeader":{
       "zkConnected":true,
       "status":0,
       "QTime":0,
       "params":{
         "q":"*:*",
         "doAs":"<querying user>",
         "_forwardedCount":"1",
         "_":"1599835760799"}},
     "response":{"numFound":0,"start":0,"docs":[]
     }}

That is, you have successfully created an empty collection.
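If you would rather double-check the new collection from a script than from the webUIs, a few lines of Python can run the same query. Treat this as a sketch only: the Solr endpoint, port, and basic-auth credentials below are placeholders, and DDE typically protects Solr (for example with Kerberos or Knox), so adjust the authentication to whatever your environment actually uses.

```python
# Minimal sketch: confirm the new collection exists and is still empty.
# <solr-host>, <port> and the credentials are placeholders, not real values.
import requests

SOLR_URL = "https://<solr-host>:<port>/solr"
COLLECTION = "solr-nifi-demo"
AUTH = ("<workload-user>", "<workload-password>")  # assumption: basic auth; adapt for Kerberos/Knox

resp = requests.get(
    f"{SOLR_URL}/{COLLECTION}/select",
    params={"q": "*:*", "rows": 0},
    auth=AUTH,
    verify=False,  # only for ad-hoc testing without the cluster's CA certificate
)
resp.raise_for_status()
print("numFound:", resp.json()["response"]["numFound"])  # expect 0 at this point
```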
Build a flow in NiFi

Once you are done with collection creation, move over to the Flow Management Data Hub cluster. In the Services section of the Flow Management cluster details page, click the NiFi shortcut.

Add processors

Start adding processors by dragging the 'Processor' button to the NiFi canvas. To build the example workflow we did, add the following processors:

1. ListS3

This processor reads the content of the S3 bucket linked to your environment. Configuration:

   Config name | Config value | Comments
   Name | Check for new Input | Optional
   Bucket | nifi-solr-demo | The S3 bucket where you uploaded your sample file
   Access Key ID | <my access key> | This value is generated for AWS users. You may generate and download a new one from AWS Management Console > Services > IAM > Users > Select your user > Security credentials > Create access key.
   Secret Access Key | <my secret access key> | This value is generated for AWS users, together with the Access Key ID.
   Prefix | input-data/ | The folder inside the bucket where the input CSV is located. Be careful of the "/" at the end. It is required to make this work.

You may need to fill in or change additional properties besides these, such as region, scheduling, etc., based on your preferences and your AWS configuration.

2. RouteOnAttribute

This processor filters the objects read in the previous step and makes sure only CSV files reach the next processor. Configuration:

   Config name | Config value | Comments
   Name | Filter CSVs | Optional
   csv_file | ${filename:toUpper():endsWith('CSV')} | This attribute is added with the 'Add Property' option. The routing will be based on this property. See the connections section.

3. FetchS3Object

This processor reads the content of the CSV files it receives. Configuration:

   Config name | Config value | Comments
   Name | Fetch CSV from S3 | Optional
   Bucket | nifi-solr-demo | The same as provided for the ListS3 processor
   Object Key | ${filename} | It comes from the FlowFile
   Access Key ID | <My Access Key Id> | The same as provided for the ListS3 processor
   Secret Access Key | <My Secret Access Key> | The same as provided for the ListS3 processor

The values for Bucket, Access Key, and Secret Key are the same as for the ListS3 processor. The Object Key is autofilled by NiFi; it comes as an input from the previous processors.

4. PutSolrContentStream

Configuration:

   Config name | Config value | Comments
   Name | Index Data to DDE | Optional
   Solr Type | Cloud | We will provide the ZK ensemble as the Solr location, so this needs to be set to Cloud.
   Solr Location | <ZK_ENSEMBLE> | You find this value on the Dashboard of the Solr webUI, as the zkHost parameter value.
   Collection | solr-nifi-demo-collection | Here we use the collection which has been created above. If you specified a different name there, put the same name here.
   Content Stream Path | /update | Be careful of the leading "/".
   Content-Type | application/csv | Any content type that Solr can process may be provided here. In this example we use CSV.
   Kerberos principal | <my kerberos username> | Since we use a direct URL to Solr, Kerberos authentication needs to be used here.
   Kerberos password | <my kerberos password> | Password for the Kerberos principal.
   SSL Context Service | Default NiFi SSL Context Service | Just choose it from the drop-down. The service is created by default from the Flow Management template.

5. LogMessage (x4)

We also created four LogMessage processors to track whether everything happens as expected:

   Processor | Log message
   Log Check | Object checked out: ${filename}
   Log Ignore | File is not csv. Ignored: ${filename}
   Log Fetch | Object fetched: ${filename}
   Log Index | Data indexed from: ${filename}

6. In this workflow, the log processors are the dead ends, so enable the "Automatically Terminate Relationships" option on them.

In this example, all properties not mentioned above were left at their default values during processor setup. Depending on your AWS and environment setup, you may need to set things differently. After setting up the processors, all of them should appear on the NiFi canvas.

Create connections

Use your mouse to create connections between the processors. The connections between the boxes are the success paths, except for the RouteOnAttribute processor, which has the csv_file and the unmatched routes. The FetchS3Object and PutSolrContentStream processors have failure paths as well: direct them back to themselves, creating a retry mechanism on failure. This may not be the most sophisticated approach, but it serves its purpose. With these connections in place, the flow is complete.

Run the NiFi Flow

You may start the processors one by one, or you may start the entire flow at once: if no processor is selected, clicking the "Play" icon on the left side of the NiFi Operate palette starts the flow. If you did the setup exactly as described at the beginning of this post, two objects are almost instantly checked out (depending, of course, on your scheduling settings if you set those too):

- input-data/ - the input folder, which also matches the prefix provided for the ListS3 processor. No worries: in the next step it is filtered out and goes no further, as it is not a CSV file.
- films.csv - this goes to our collection if you did everything right.

After starting your flow, the ListS3 processor polls your S3 bucket according to its scheduling and looks for changes based on the "Last modified" timestamp. So if you put something new in your input-data folder, it is processed automatically; if a file changes, it is rechecked too.

Check the results

After the CSV has been processed, you can check your logs and collection for the expected result.

Logs

1. In the Services section of the Flow Management cluster details page, click the Cloudera Manager shortcut.
2. Click the name of your compute cluster > click NiFi in the Compute Cluster box > under Status Summary click NiFi Node > click one of the nodes and click Log Files in the top menu bar > select Role Log File.

If everything went well, you will see log messages matching the LogMessage processors above (for example, "Object fetched: films.csv" and "Data indexed from: films.csv").

Indexed data

The indexed data appears in our collection. You can browse it from the Indexes page in Hue, or query it from the Solr webUI as before.
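The post deliberately sticks to UI tools, but if you ever want a quick, scriptable sanity check of the same steps (list the bucket, skip non-CSV objects, fetch the file, stream it to Solr's /update handler), the rough Python sketch below mirrors the processors above. Every value in angle brackets is a placeholder, and the basic-auth call is an assumption made for brevity; the NiFi flow itself authenticates with Kerberos, so adapt the authentication to your setup.

```python
# Rough, illustrative approximation of the flow above:
# ListS3 -> RouteOnAttribute -> FetchS3Object -> PutSolrContentStream.
# Placeholders throughout; this is a sanity-check script, not the recommended pipeline.
import boto3
import requests

BUCKET = "nifi-solr-demo"
PREFIX = "input-data/"
SOLR_UPDATE = "https://<solr-host>:<port>/solr/<your-collection>/update"
AUTH = ("<workload-user>", "<workload-password>")  # assumption: basic auth; the flow uses Kerberos

# Uses the default AWS credential chain; you can also pass aws_access_key_id / aws_secret_access_key.
s3 = boto3.client("s3")

# "ListS3": enumerate objects under the prefix.
for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", []):
    key = obj["Key"]
    # "RouteOnAttribute": only CSV files go further.
    if not key.upper().endswith("CSV"):
        print(f"File is not csv. Ignored: {key}")
        continue
    # "FetchS3Object": read the CSV content.
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    # "PutSolrContentStream": send the CSV to Solr's /update handler and commit.
    resp = requests.post(
        SOLR_UPDATE,
        params={"commit": "true"},
        data=body,
        headers={"Content-Type": "application/csv"},
        auth=AUTH,
        verify=False,  # only for ad-hoc testing
    )
    resp.raise_for_status()
    print(f"Data indexed from: {key}")
```

If it worked, re-running the query from the collection-creation step should now return a non-zero numFound.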
Summary

In this post, we demonstrated how Cloudera Data Platform components can collaborate with each other while still being resource-isolated and managed separately. We created a Solr collection via Hue, built a data ingest workflow in NiFi to connect our S3 bucket with Solr, and in the end we have the indexed data ready for searching.

There is no terminal magic in this scenario; we have only used comfortable UI features. Keeping our indexing flow and our Solr service in separate clusters gives us more options around scalability, routing flexibility, and decorating data pipelines for multiple consuming workloads, yet with consistent security and governance across them. Remember, this was only one simple example. This basic setup, however, offers endless opportunities to implement far more complex solutions. Feel free to try Data Discovery and Exploration in CDP on your own, play around with more advanced pipelines, and let us know how it goes! Alternatively, contact us for more information. The post How-to: Index Data from S3 via NiFi Using CDP Data Hubs appeared first on Cloudera Blog.
Read more
  • 0
  • 0
  • 1621
article-image-jumpcloud-launches-new-integrations-with-slack-salesforce-github-atlassian-and-aws-from-devops-com
Matthew Emerick
15 Oct 2020
1 min read
Save for later

JumpCloud Launches New Integrations with Slack, Salesforce, GitHub, Atlassian, and AWS from DevOps.com

Matthew Emerick
15 Oct 2020
1 min read
User identity lifecycle management across multiple apps from a single cloud directory platform saves IT hours of onboarding/offboarding work

LOUISVILLE, CO – Oct. 15, 2020 – JumpCloud today announced new integrations that provide IT admins easier user identity lifecycle management across multiple applications from a single platform. These new integrations with Slack, Salesforce, Atlassian, GitHub, and AWS provide streamlined user management […] The post JumpCloud Launches New Integrations with Slack, Salesforce, GitHub, Atlassian, and AWS appeared first on DevOps.com.
Read more
  • 0
  • 0
  • 1464

article-image-data-governance-in-operations-needed-to-ensure-clean-data-for-ai-projects-from-ai-trends
Matthew Emerick
15 Oct 2020
5 min read
Save for later

Data Governance in Operations Needed to Ensure Clean Data for AI Projects from AI Trends

Matthew Emerick
15 Oct 2020
5 min read
By AI Trends Staff

Data governance in data-driven organizations is a set of practices and guidelines that define where responsibility for data quality lives. The guidelines support the operation's business model, especially if AI and machine learning applications are at work.

Data governance is an operations issue, existing between strategy and the daily management of operations, suggests a recent account in the MIT Sloan Management Review.

"Data governance should be a bridge that translates a strategic vision acknowledging the importance of data for the organization and codifying it into practices and guidelines that support operations, ensuring that products and services are delivered to customers," stated author Gregory Vial, an assistant professor of IT at HEC Montréal.

To prevent data governance from being limited to a plan that nobody reads, "governing" data needs to be a verb, not a noun phrase as in "data governance." Vial writes, "The difference is subtle but ties back to placing governance between strategy and operations — because these activities bridge and evolve in step with both."

Gregory Vial, assistant professor of IT at HEC Montréal

An overall framework for data governance was proposed by Vijay Khatri and Carol V. Brown in a piece in Communications of the ACM published in 2010. The two suggested a strategy based on five dimensions that represent a combination of structural, operational, and relational mechanisms. The five dimensions are:

- Principles, at the foundation of the framework, that relate to the role of data as an asset for the organization;
- Quality, to define the requirements for data to be usable and the mechanisms in place to assess that those requirements are met;
- Metadata, to define the semantics crucial for interpreting and using data, for example those found in a data catalog that data scientists use to work with large data sets hosted on a data lake;
- Accessibility, to establish the requirements related to gaining access to data, including security requirements and risk mitigation procedures;
- Life cycle, to support the production, retention, and disposal of data on the basis of organizational and/or legal requirements.

"Governing data is not easy, but it is well worth the effort," stated Vial. "Not only does it help an organization keep up with the changing legal and ethical landscape of data production and use; it also helps safeguard a precious strategic asset while supporting digital innovation."

Master Data Management Seen as a Path to Clean Data Governance

Once the organization commits to data quality, what's the best way to get there? Naturally, entrepreneurs are in a position to step forward with suggestions. Some of them center on master data management (MDM), a discipline where business and IT work together to ensure the accuracy and consistency of the enterprise's master data assets.

Organizations starting down the path with AI and machine learning may be tempted to clean only the data that feeds a specific application project, a costly approach in the long run, suggests one expert.

"A better, more sustainable way is to continuously cure the data quality issues by using a capable data management technology. This will result in your training data sets becoming rationalized production data with the same master data foundation," suggests Bill O'Kane, author of a recent account from tdwi.org on master data management. Formerly an analyst with Gartner, O'Kane is now the VP and MDM strategist at Profisee, a firm offering an MDM solution.
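To make the idea of codifying quality requirements concrete, here is a minimal, purely illustrative sketch of what an automated data-quality gate might look like: a few rules (completeness, uniqueness, freshness) expressed as executable checks rather than as a document nobody reads. The field names, thresholds, and records are assumptions for the example and are not tied to Profisee or any other MDM product.

```python
# Illustrative only: a tiny data-quality gate that codifies a few requirements
# (completeness, uniqueness, freshness) as executable checks. Field names and
# thresholds are assumptions for the example, not part of any product.
from datetime import datetime, timedelta

def check_customer_records(records, max_age_days=365):
    """Return a list of human-readable data-quality violations."""
    violations = []
    seen_emails = set()
    cutoff = datetime.utcnow() - timedelta(days=max_age_days)

    for i, rec in enumerate(records):
        # Completeness: required attributes must be present and non-empty.
        for field in ("customer_id", "email", "updated_at"):
            if not rec.get(field):
                violations.append(f"record {i}: missing {field}")
        # Uniqueness: the same email should not appear on two master records.
        email = (rec.get("email") or "").strip().lower()
        if email and email in seen_emails:
            violations.append(f"record {i}: duplicate email {email}")
        seen_emails.add(email)
        # Freshness: stale records are flagged for review.
        updated = rec.get("updated_at")
        if isinstance(updated, datetime) and updated < cutoff:
            violations.append(f"record {i}: not updated since {updated.date()}")
    return violations

# Example usage with made-up records:
sample = [
    {"customer_id": 1, "email": "a@example.com", "updated_at": datetime(2020, 9, 1)},
    {"customer_id": 2, "email": "A@example.com", "updated_at": datetime(2017, 1, 1)},
]
print(check_customer_records(sample))
```

In a real MDM deployment, rules like these would live inside the data management platform itself; the point is simply that quality requirements can be executable, continuously applied checks rather than shelfware.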
If the data feeding into the AI system is not unique, accurate, consistent, and timely, the models will not produce reliable results and are likely to lead to unwanted business outcomes. These could include different decisions being made on two customer records that are thought to represent different people but in fact describe the same person, or recommending a product to a customer who previously returned it or filed a complaint about it.

Perceptilabs Tries to Get in the Head of the Machine Learning Scientist

Getting inside the head of a machine learning scientist might be helpful in understanding how a highly trained expert builds and trains complex mathematical models. "This is a complex, time-consuming process, involving thousands of lines of code," writes Martin Isaksson, co-founder and CEO of Perceptilabs, in a recent account in VentureBeat. Perceptilabs offers a product to help automate the building of machine learning models, what it calls a "GUI for TensorFlow."

Martin Isaksson, co-founder and CEO, Perceptilabs

"As AI and ML took hold and the experience levels of AI practitioners diversified, efforts to democratize ML materialized into a rich set of open source frameworks like TensorFlow and datasets. Advanced knowledge is still required for many of these offerings, and experts are still relied upon to code end-to-end ML solutions," Isaksson wrote.

AutoML tools have emerged to help adjust parameters and train machine learning models so that they are deployable. Perceptilabs is adding a visual modeler to the mix. The company designed its tool as a visual API on top of TensorFlow, which it acknowledges as the most popular ML framework. The approach gives developers access to the low-level TensorFlow API and the ability to pull in other Python modules. It also gives users transparency into how the model is architected and a view into how it performs.

Read the source articles in the MIT Sloan Management Review, Communications of the ACM, tdwi.org and VentureBeat.
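For readers who have not seen the hand-written side of this, below is a deliberately small, generic TensorFlow/Keras sketch of the kind of model definition and training code that visual modelers and AutoML tools aim to generate or automate. It is illustrative only, uses random stand-in data, and is not Perceptilabs' API or output.

```python
# A deliberately small example of hand-written TensorFlow/Keras model code.
# Generic and illustrative only; the data here is random stand-in data.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Toy dataset: 1000 samples with 20 features and binary labels.
x_train = np.random.rand(1000, 20).astype("float32")
y_train = np.random.randint(0, 2, size=(1000,)).astype("float32")

# A simple binary classifier: two dense layers with dropout in between.
model = tf.keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.2)
```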
Read more
  • 0
  • 0
  • 1977