Data | 1210 articles | Tech News, Tutorials & Expert Insights

article-image-machine-learning-ethics-what-you-need-to-know-and-what-you-can-do

23 Sep 2019

10 min read

Machine learning ethics: what you need to know and what you can do

23 Sep 2019

0
0
14252

article-image-bad-metadata-can-get-you-in-legal-hot-water

Guest Contributor

21 Sep 2019

6 min read

Bad Metadata can get you in legal hot water

Guest Contributor

21 Sep 2019

6 min read

Metadata isn't just something that concerns business intelligence and IT teams; but lawyers are extremely interested in it as well. Metadata, it turns out, can win or lose lawsuits, send politicians to jail, and even decide medical malpractice cases. It's not uncommon for attorneys who conduct discovery of electronic records in organizations to find that the claims of plaintiffs or defendants are contradicted by metadata, like time and date, type of data, etc. If a discovery process is initiated against them, an organization had better be sure that its metadata is in order. All it would take for an organization to lose a case would be for an attorney to discover a discrepancy in different databases – a different time stamp on some communication, a different job title for a principal in the case. Such discrepancies could lead to accusations of data tampering, fraud, or worse – and would most definitely put the organization in a very tough position versus a judge or jury. Metadata errors are difficult to spot The problem with that, of course, is that catching metadata errors is extremely difficult. In large organizations, data is stored in repositories that are spread throughout the organization, maybe even the world – in different departments. Each department is responsible for maintaining its own database, and the metadata in it; and on different cloud storage repositories, which may have their own system of classifying data. An enterprising attorney could have a field day with the different categories and tags data is stored under, making claims that the organization is trying to “hide something.” The organization's only defense: We're poor administrators. That may not be enough to impress the court. Types of Metadata Metadata is “data about data,” and comes in three flavors: System Metadata, which is data that is automatically generated from the computer and includes specific labeled criteria, like the date and time of creation and date a document was modified, etc. Substantive Metadata reflects changes to a document, like tracked changes. Embedded metadata is data entered into a document or file but not normally visible, such as formulas in cells in an Excel spreadsheet. All of these have increasingly become targets for attorneys in recent years. Metadata has been used in thousands of cases – medical, financial, patent and trademark law, product liability, civil rights, and many more. Metadata is both discoverable and admissible as evidence. According to one New York court, “General information about the creation of a document, including who authored a document and when it was created, is pedigree information often important for purposes of determining admissibility at trial.” According to legal experts, “from a legal standpoint metadata is evidence… that describes the characteristics, origins, usage, and validity of other electronic evidence.” The biggest metadata-linked payout until now - $10.8 million – occurred in 2017, when a jury awarded a plaintiff $8 million (eventually this was increased to nearly $11 million) after claiming he was fired from a biotechnology company after telling authorities about potential bribery in China. The key piece of evidence was the metadata timestamp on a performance review that was written after the plaintiff was fired; with that evidence, the court increased the defendant's payout for violating laws against firing whistleblowers. In that case, records claiming that the employee was fired for cause were belied by the metadata in the performance review. That, of course, was a case in which there was clear wrongdoing by an organization. But the same metadata errors could have cropped up in any number of scenarios, even if no laws were broken. The precedent in this case, and others like it, might be enough to convince the court to penalize an organization based on claims of a plaintiff. How can organizations defend themselves from this legal bind of metadata The answer would seem obvious; get control of your metadata and make sure it corresponds to the data it represents. With that kind of control over data, organizations would discover for themselves if something was amiss that could cost them in a settlement later. But execution of that obvious answer is a different story. With reams of data to pore through, it would take an organization's business intelligence team months, or even years, to manually sift through the databases. And because to err is human, there would be no guarantee they hadn't missed something. Clearly Business Intelligence and Data Analysis teams need some help in doing this. One solution would be to hire more staff, expanding teams at least temporarily to make sense of the data and metadata that could prove problematic. There are services that will lend their staff to an organization to do just that, and for companies that prefer the “human touch,” adding that temporary staff may be the best solution. Another idea is to automate the process, with advanced tools that will do a full examination of data, both across systems and within repositories themselves. Such automated tools would examine the data in the various repositories and find where the metadata for the same information is different – pointing BI teams in the right direction and cutting down on the time needed to determine what needs to be fixed. Using automated metadata management tools, companies can ensure that they remain secure. If a company is being sued and discovery has commenced, it will be too late for the organization to fix anything. Honest mistakes or disorganized file keeping can no longer be corrected, and the fate of the organization will be in the hands of a jury or a judge. Automated metadata management tools can help Business Intelligence and Data Analysis teams figure out which metadata entries are not consistent across the repositories, ensuring that things are fixed before discovery takes place. There are a variety of tools on the market, with various strengths and weaknesses. Companies will need to decide whether a data dictionary, a business glossary, or a more all-encompassing product best answers their needs. They’ll also need to make sure the enterprise software they currently use is supported by the metadata management solution they are after. As the market develops, AI will be a huge distinguishing factor between metadata solutions, as machine learning will reduce the cost and manpower investment of solution onboarding significantly. With the success of recent metadata-based lawsuits, you can be sure more attorneys will be using metadata in their discovery processes. Organizations that want to defend themselves need to get their data in order, and ensure that they won't end up losing lots of money because of their own errors. Author Bio Amnon Drori is the Co-Founder and CEO of Octopai and has over 20 years of leadership experience in technology companies. Before co-founding Octopai he led sales efforts at companies like Panaya (Acquired by Infosys), Zend Technologies (Acquired by Rogue Wave Software), ModusNovo and Alvarion. Other interesting news in Tech Media manipulation by Deepfakes and cheap fakes require both AI and social fixes, finds a Data and Society report. Open AI researchers advance multi-agent competition by training AI agents in a hide and seek environment. France and Germany reaffirm blocking Facebook’s Libra cryptocurrency

0
0
4965

article-image-how-to-handle-categorical-data-for-machine-learning-algorithms

Packt Editorial Staff

20 Sep 2019

9 min read

How to handle categorical data for machine learning algorithms

Packt Editorial Staff

20 Sep 2019

9 min read

The quality of data and the amount of useful information are key factors that determine how well a machine learning algorithm can learn. Therefore, it is absolutely critical that we make sure to encode categorical variables correctly, before we feed data into a machine learning algorithm. In this article, with simple yet effective examples we will explain how to deal with categorical data in computing machine learning algorithms and how we to map ordinal and nominal feature values to integer representations. The article is an excerpt from the book Python Machine Learning - Third Edition by Sebastian Raschka and Vahid Mirjalili. This book is a comprehensive guide to machine learning and deep learning with Python. It acts as both a clear step-by-step tutorial, and a reference you’ll keep coming back to as you build your machine learning systems. It is not uncommon that real-world datasets contain one or more categorical feature columns. When we are talking about categorical data, we have to further distinguish between nominal and ordinal features. Ordinal features can be understood as categorical values that can be sorted or ordered. For example, t-shirt size would be an ordinal feature, because we can define an order XL > L > M. In contrast, nominal features don't imply any order and, to continue with the previous example, we could think of t-shirt color as a nominal feature since it typically doesn't make sense to say that, for example, red is larger than blue. Categorical data encoding with pandas Before we explore different techniques to handle such categorical data, let's create a new DataFrame to illustrate the problem: >>> import pandas as pd >>> df = pd.DataFrame([ ... ['green', 'M', 10.1, 'class1'], ... ['red', 'L', 13.5, 'class2'], ... ['blue', 'XL', 15.3, 'class1']]) >>> df.columns = ['color', 'size', 'price', 'classlabel'] >>> df color size price classlabel 0 green M 10.1 class1 1 red L 13.5 class2 2 blue XL 15.3 class1 As we can see in the preceding output, the newly created DataFrame contains a nominal feature (color), an ordinal feature (size), and a numerical feature (price) column. The class labels (assuming that we created a dataset for a supervised learning task) are stored in the last column. Mapping ordinal features To make sure that the learning algorithm interprets the ordinal features correctly, we need to convert the categorical string values into integers. Unfortunately, there is no convenient function that can automatically derive the correct order of the labels of our size feature, so we have to define the mapping manually. In the following simple example, let's assume that we know the numerical difference between features, for example, XL = L + 1 = M + 2: >>> size_mapping = { ... 'XL': 3, ... 'L': 2, ... 'M': 1} >>> df['size'] = df['size'].map(size_mapping) >>> df color size price classlabel 0 green 1 10.1 class1 1 red 2 13.5 class2 2 blue 3 15.3 class1 If we want to transform the integer values back to the original string representation at a later stage, we can simply define a reverse-mapping dictionary inv_size_mapping = {v: k for k, v in size_mapping.items()} that can then be used via the pandas map method on the transformed feature column, similar to the size_mapping dictionary that we used previously. We can use it as follows: >>> inv_size_mapping = {v: k for k, v in size_mapping.items()} >>> df['size'].map(inv_size_mapping) 0 M 1 L 2 XL Name: size, dtype: object Encoding class labels Many machine learning libraries require that class labels are encoded as integer values. Although most estimators for classification in scikit-learn convert class labels to integers internally, it is considered good practice to provide class labels as integer arrays to avoid technical glitches. To encode the class labels, we can use an approach similar to the mapping of ordinal features discussed previously. We need to remember that class labels are not ordinal, and it doesn't matter which integer number we assign to a particular string label. Thus, we can simply enumerate the class labels, starting at 0: >>> import numpy as np >>> class_mapping = {label:idx for idx,label in ... enumerate(np.unique(df['classlabel']))} >>> class_mapping {'class1': 0, 'class2': 1} Next, we can use the mapping dictionary to transform the class labels into integers: >>> df['classlabel'] = df['classlabel'].map(class_mapping) >>> df color size price classlabel 0 green 1 10.1 0 1 red 2 13.5 1 2 blue 3 15.3 0 We can reverse the key-value pairs in the mapping dictionary as follows to map the converted class labels back to the original string representation: >>> inv_class_mapping = {v: k for k, v in class_mapping.items()} >>> df['classlabel'] = df['classlabel'].map(inv_class_mapping) >>> df color size price classlabel 0 green 1 10.1 class1 1 red 2 13.5 class2 2 blue 3 15.3 class1 Alternatively, there is a convenient LabelEncoder class directly implemented in scikit-learn to achieve this: >>> from sklearn.preprocessing import LabelEncoder >>> class_le = LabelEncoder() >>> y = class_le.fit_transform(df['classlabel'].values) >>> y array([0, 1, 0]) Note that the fit_transform method is just a shortcut for calling fit and transform separately, and we can use the inverse_transform method to transform the integer class labels back into their original string representation: >>> class_le.inverse_transform(y) array(['class1', 'class2', 'class1'], dtype=object) Performing a technique ‘one-hot encoding’ on nominal features In the Mapping ordinal features section, we used a simple dictionary-mapping approach to convert the ordinal size feature into integers. Since scikit-learn's estimators for classification treat class labels as categorical data that does not imply any order (nominal), we used the convenient LabelEncoder to encode the string labels into integers. It may appear that we could use a similar approach to transform the nominal color column of our dataset, as follows: >>> X = df[['color', 'size', 'price']].values >>> color_le = LabelEncoder() >>> X[:, 0] = color_le.fit_transform(X[:, 0]) >>> X array([[1, 1, 10.1], [2, 2, 13.5], [0, 3, 15.3]], dtype=object) After executing the preceding code, the first column of the NumPy array X now holds the new color values, which are encoded as follows: blue = 0 green = 1 red = 2 If we stop at this point and feed the array to our classifier, we will make one of the most common mistakes in dealing with categorical data. Can you spot the problem? Although the color values don't come in any particular order, a learning algorithm will now assume that green is larger than blue, and red is larger than green. Although this assumption is incorrect, the algorithm could still produce useful results. However, those results would not be optimal. A common workaround for this problem is to use a technique called one-hot encoding. The idea behind this approach is to create a new dummy feature for each unique value in the nominal feature column. Here, we would convert the color feature into three new features: blue, green, and red. Binary values can then be used to indicate the particular color of an example; for example, a blue example can be encoded as blue=1, green=0, red=0. To perform this transformation, we can use the OneHotEncoder that is implemented in scikit-learn's preprocessing module: >>> from sklearn.preprocessing import OneHotEncoder >>> X = df[['color', 'size', 'price']].values >>> color_ohe = OneHotEncoder() >>> color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray() array([[0., 1., 0.], [0., 0., 1.], [1., 0., 0.]]) Note that we applied the OneHotEncoder to a single column (X[:, 0].reshape(-1, 1))) only, to avoid modifying the other two columns in the array as well. If we want to selectively transform columns in a multi-feature array, we can use the ColumnTransformer that accepts a list of (name, transformer, column(s)) tuples as follows: >>> from sklearn.compose import ColumnTransformer >>> X = df[['color', 'size', 'price']].values >>> c_transf = ColumnTransformer([ ... ('onehot', OneHotEncoder(), [0]), ... ('nothing', 'passthrough', [1, 2]) ... ]) >>> c_transf.fit_transform(X) .astype(float) array([[0.0, 1.0, 0.0, 1, 10.1], [0.0, 0.0, 1.0, 2, 13.5], [1.0, 0.0, 0.0, 3, 15.3]]) In the preceding code example, we specified that we only want to modify the first column and leave the other two columns untouched via the 'passthrough' argument. An even more convenient way to create those dummy features via one-hot encoding is to use the get_dummies method implemented in pandas. Applied to a DataFrame, the get_dummies method will only convert string columns and leave all other columns unchanged: >>> pd.get_dummies(df[['price', 'color', 'size']]) price size color_blue color_green color_red 0 10.1 1 0 1 0 1 13.5 2 0 0 1 2 15.3 3 1 0 0 When we are using one-hot encoding datasets, we have to keep in mind that it introduces multicollinearity, which can be an issue for certain methods (for instance, methods that require matrix inversion). If features are highly correlated, matrices are computationally difficult to invert, which can lead to numerically unstable estimates. To reduce the correlation among variables, we can simply remove one feature column from the one-hot encoded array. Note that we do not lose any important information by removing a feature column, though; for example, if we remove the column color_blue, the feature information is still preserved since if we observe color_green=0 and color_red=0, it implies that the observation must be blue. If we use the get_dummies function, we can drop the first column by passing a True argument to the drop_first parameter, as shown in the following code example: >>> pd.get_dummies(df[['price', 'color', 'size']], ... drop_first=True) price size color_green color_red 0 10.1 1 1 0 1 13.5 2 0 1 2 15.3 3 0 0 In order to drop a redundant column via the OneHotEncoder , we need to set drop='first' and set categories='auto' as follows: >>> color_ohe = OneHotEncoder(categories='auto', drop='first') >>> c_transf = ColumnTransformer([ ... ('onehot', color_ohe, [0]), ... ('nothing', 'passthrough', [1, 2]) ... ]) >>> c_transf.fit_transform(X).astype(float) array([[ 1. , 0. , 1. , 10.1], [ 0. , 1. , 2. , 13.5], [ 0. , 0. , 3. , 15.3]]) In this article, we have gone through some of the methods to deal with categorical data in datasets. We distinguished between nominal and ordinal features, and with examples we explained how they can be handled. To harness the power of the latest Python open source libraries in machine learning check out this book Python Machine Learning - Third Edition, written by Sebastian Raschka and Vahid Mirjalili. Other interesting read in data! The best business intelligence tools 2019: when to use them and how much they cost Introducing Microsoft’s AirSim, an open-source simulator for autonomous vehicles built on Unreal Engine Media manipulation by Deepfakes and cheap fakes require both AI and social fixes, finds a Data & Society report

0
0
17498

article-image-github-acquires-semmle-to-secure-open-source-supply-chain-attains-cve-numbering-authority-status

Savia Lobo

19 Sep 2019

5 min read

GitHub acquires Semmle to secure open-source supply chain; attains CVE Numbering Authority status

Savia Lobo

19 Sep 2019

5 min read

Yesterday, GitHub announced that it has acquired Semmle, a code analysis platform provider and also that it is now a Common Vulnerabilities and Exposures (CVE) Numbering Authority. https://twitter.com/github/status/1174371016497405953 The Semmle acquisition is a part of the plan to securing the open-source supply chain, Nat Friedman explains in his blog post. Semmle provides a code analysis engine, named QL, which allows developers to write queries that identify code patterns in large codebases and search for vulnerabilities and their variants. Security researchers use Semmle to quickly find vulnerabilities in code with simple declarative queries. “Semmle is trusted by security teams at Uber, NASA, Microsoft, Google, and has helped find thousands of vulnerabilities in some of the largest codebases in the world, as well as over 100 CVEs in open source projects to date,” Friedman writes. Also Read: GitHub now supports two-factor authentication with security keys using the WebAuthn API Semmle originally spun out of research at Oxford in 2006 announced a $21 million Series B investment led by Accel Partners, last year. “In total, the company raised $31 million before this acquisition,” Techcrunch reports. Shanku Niyogi, Senior Vice President of Product at GitHub, in his blog post writes, “An important measure of the success of Semmle’s approach is the number of vulnerabilities that have been identified and disclosed through their technology. Today, over 100 CVEs in open source projects have been found using Semmle, including high-profile projects like Apache Struts, Apple’s XNU, the Linux Kernel, Memcached, U-Boot, and VLC. No other code analysis tool has a similar success rate.” GitHub also announced that it has been approved as a CVE Numbering Authority for open source projects. Now, GitHub will be able to issue CVEs for security advisories opened on GitHub, allowing for even broader awareness across the industry. With Semmle integration, every CVE-ID can be associated with a Semmle QL query, which can then be shared and tracked by the broader developer community. The CVE approval will make it easier for project maintainers to report security flaws directly from their repositories. Also, GitHub can assign CVE identifiers directly and post them to the CVE List and the National Vulnerability Database (NVD). Earlier this year, GitHub acquired Dependabot, to provide automatic security fixes natively within GitHub. With automatic security fixes, developers no longer need to manually patch their dependencies. When a vulnerability is found in a dependency, GitHub will automatically issue a pull request on downstream repositories with the information needed to accept the patch. In August, GitHub was in the limelight for being a part of the Capital One data breach that affected 106 million users in the US and Canada. The law firm Tycko & Zavareei LLP filed a lawsuit in California’s federal district court on behalf of their plaintiffs Seth Zielicke and Aimee Aballo. Also Read: GitHub acquires Spectrum, a community-centric conversational platform Both plaintiffs claimed Capital One and GitHub were unable to protect user’s personal data. The complaint highlighted that Paige A. Thompson, the alleged hacker stole the data in March, posted about the theft on GitHub in April. According to the lawsuit, “As a result of GitHub’s failure to monitor, remove, or otherwise recognize and act upon obviously-hacked data that was displayed, disclosed, and used on or by GitHub and its website, the Personal Information sat on GitHub.com for nearly three months.” The Semmle acquisition may be GitHub’s move to improve security for users in the future. It would be interesting to know how GitHub will mold security for users with additional CVE approval. A user on Reddit writes, “I took part in a tutorial session Semmle held at a university CS society event, where we were shown how to use their system to write semantic analysis passes to look for things like use-after-free and null pointer dereferences. It was only an hour and a bit long, but I found the query language powerful & intuitive and the platform pretty effective. At the time, you could set up your codebase to run Semmle passes on pre-commit hooks or CI deployments etc. and get back some pretty smart reporting if you had introduced a bug.” The user further writes, “The session focused on Java, but a few other languages were supported as first-class, iirc. It felt kinda like writing an SQL query, but over AST rather than tuples in a table, and using modal logic to choose the selections. It took a little while to first get over the 'wut' phase (like 'how do I even express this'), but I imagine that a skilled team, once familiar with the system, could get a lot of value out of Semmle's QL/semantic analysis, especially for large/enterprise-scale codebases.” https://twitter.com/kurtseifried/status/1174395660960796672 https://twitter.com/timneutkens/status/1174598659310313472 To know more about this announcement in detail, read GitHub’s official blog post. Other news in Data Keras 2.3.0, the first release of multi-backend Keras with TensorFlow 2.0 support is now out Introducing Microsoft’s AirSim, an open-source simulator for autonomous vehicles built on Unreal Engine GQL (Graph Query Language) joins SQL as a Global Standards Project and is now the international standard declarative query language for graphs

0
0
2840

article-image-the-best-business-intelligence-tools-2019-when-to-use-them-and-how-much-they-cost

Richard Gall

19 Sep 2019

11 min read

The best business intelligence tools 2019: when to use them and how much they cost

Richard Gall

19 Sep 2019

11 min read

0
0
8663

article-image-gql-graph-query-language-joins-sql-as-a-global-standards-project-and-is-now-the-international-standard-declarative-query-language-for-graphs

Amrata Joshi

19 Sep 2019

6 min read

GQL (Graph Query Language) joins SQL as a Global Standards Project and will be the international standard declarative query language for graphs

Amrata Joshi

19 Sep 2019

6 min read

On Tuesday, the team at Neo4j, the graph database management system announced that the international committees behind the development of the SQL standard have voted to initiate GQL (Graph Query Language) as the new database query language. GQL is now going to be the international standard declarative query language for property graphs and it is also a Global Standards Project. GQL is developed and maintained by the same international group that maintains the SQL standard. How did the proposal for GQL pass? Last year in May, the initiative for GQL was first time processed in the GQL Manifesto. This year in June, the national standards bodies across the world from the ISO/IEC’s Joint Technical Committee 1 (responsible for IT standards) started voting on the GQL project proposal. The ballot closed earlier this week and the proposal was passed wherein ten countries including Germany, Korea, United States, UK, and China voted in favor. And seven countries agreed to put forward their experts to work on this project. Japan was the only country to vote against in the ballot because according to Japan, existing languages already do the job, and SQL/Property Graph Query extensions along with the rest of the SQL standard can do the same job. According to the Neo4j team, the GQL project will initiate development of next-generation technology standards for accessing data. Its charter mandates building on core foundations that are established by SQL and ongoing collaboration in order to ensure SQL and GQL interoperability and compatibility. GQL would reflect rapid growth in the graph database market by increasing adoption of the Cypher language. Stefan Plantikow, GQL project lead and editor of the planned GQL specification, said, “I believe now is the perfect time for the industry to come together and define the next generation graph query language standard.” Plantikow further added, “It’s great to receive formal recognition of the need for a standard language. Building upon a decade of experience with property graph querying, GQL will support native graph data types and structures, its own graph schema, a pattern-based approach to data querying, insertion and manipulation, and the ability to create new graphs, and graph views, as well as generate tabular and nested data. Our intent is to respect, evolve, and integrate key concepts from several existing languages including graph extensions to SQL.” Keith Hare, who has served as the chair of the international SQL standards committee for database languages since 2005, charted the progress toward GQL, said, “We have reached a balance of initiating GQL, the database query language of the future whilst preserving the value and ubiquity of SQL.” Hare further added, “Our committee has been heartened to see strong international community participation to usher in the GQL project. Such support is the mark of an emerging de jure and de facto standard .” The need for a graph-specific query language Researchers and vendors needed a graph-specific query language because of the following limitations: SQL/PGQ language is restricted to read-only queries SQL/PGQ cannot project new graphs The SQL/PGQ language can only access those graphs that are based on taking a graph view over SQL tables. Researchers and vendors needed a language like Cypher that would cover insertion and maintenance of data and not just data querying. But SQL wasn’t the apt model for a graph-centric language that takes graphs as query inputs and outputs a graph as a result. But GQL, on the other hand, builds in openCypher, a project that brings Cypher to Apache Spark and gives users a composable graph query language. SQL and GQL can work together According to most of the companies and national standards bodies that are supporting the GQL initiative, GQL and SQL are not competitors. Instead, these languages can complement each other via interoperation and shared foundations. Alastair Green, Query Languages Standards & Research Lead at Neo4j writes, “A SQL/PGQ query is in fact a SQL sub-query wrapped around a chunk of proto-GQL.” SQL is a language that is built around tables whereas GQL is built around graphs. Users can use GQL to find and project a graph from a graph. Green further writes, “I think that the SQL standards community has made the right decision here: allow SQL, a language built around tables, to quote GQL when the SQL user wants to find and project a table from a graph, but use GQL when the user wants to find and project a graph from a graph. Which means that we can produce and catalog graphs which are not just views over tables, but discrete complex data objects.” It is still not clear when will the first implementation version of GQL will be out. The official page reads, “The work of the GQL project starts in earnest at the next meeting of the SQL/GQL standards committee, ISO/IEC JTC 1 SC 32/WG3, in Arusha, Tanzania, later this month. It is impossible at this stage to say when the first implementable version of GQL will become available, but it is highly likely that some reasonably complete draft will have been created by the second half of 2020.” Developer community welcomes the new addition Users are excited to see how GQL will incorporate Cypher, a user commented on HackerNews, “It's been years since I've worked with the product and while I don't miss Neo4j, I do miss the query language. It's a little unclear to me how GQL will incorporate Cypher but I hope the initiative is successful if for no other reason than a selfish one: I'd love Cypher to be around if I ever wind up using a GraphDB again.” Few others mistook GQL to be Facebook’s GraphQL and are sceptical about the name. A comment on HackerNews reads, “Also, the name is of course justified, but it will be a mess to search for due to (Facebook) GraphQL.” A user commented, “I read the entire article and came away mistakenly thinking this was the same thing as GraphQL.” Another user commented, “That's quiet an unfortunate name clash with the existing GraphQL language in a similar domain.” Other interesting news in Data Media manipulation by Deepfakes and cheap fakes refquire both AI and social fixes, finds a Data & Society report Percona announces Percona Distribution for PostgreSQL to support open source databases Keras 2.3.0, the first release of multi-backend Keras with TensorFlow 2.0 support is now out

0
0
6640

article-image-4-important-business-intelligence-considerations-for-the-rest-of-2019

Richard Gall

16 Sep 2019

7 min read

4 important business intelligence considerations for the rest of 2019

Richard Gall

16 Sep 2019

7 min read

0
0
6440

article-image-how-artificial-intelligence-and-machine-learning-can-help-us-tackle-the-climate-change-emergency

Vincy Davis

16 Sep 2019

14 min read

How artificial intelligence and machine learning can help us tackle the climate change emergency

Vincy Davis

16 Sep 2019

14 min read

“I don’t want you to be hopeful. I want you to panic. I want you to feel the fear I feel every day. And then I want you to act on changing the climate”- Greta Thunberg Greta Thunberg is a 16-year-old Swedish schoolgirl, who is famously called as a climate change warrior. She has started an international youth movement against climate change and has been nominated as a candidate for the Nobel Peace Prize 2019 for climate activism. According to a recent report by the Intergovernmental Panel (IPCC), climate change is seen as the top global threat by many countries. The effects of climate change is going to make 1 million species go extinct, warns a UN report. The Earth’s rising temperatures are fueling longer and hotter heat waves, more frequent droughts, heavier rainfall, and more powerful hurricanes. Antarctica is breaking. Indonesia, the world's 4th most populous country, just shifted its capital from Jakarta because it's sinking. Singapore's worried investments are moving away. Last year, Europe experienced an 'extreme year' for unusual weather events. After a couple of months of extremely cold weather, heat and drought plagued spring and summer with temperatures well above average in most of the northern and western areas. The UK Parliament has declared ‘climate change emergency’ after a series of intense protests earlier this month. More than 1,200 people were killed across South Asia due to heavy monsoon rains and intense flooding (in some places it was the worst in nearly 30 years). The CampFire, in November 2018, was the deadliest and most destructive in California’s history, causing the death of at least 85 people and destroying about 14,000 homes. Australia’s most populous state New South Wales suffered from an intense drought in 2018. According to a report released by the UN last year, there are “Only 11 Years Left to Prevent Irreversible Damage from Climate Change”. Addressing climate change: How ARTIFICIAL INTELLIGENCE (AI) can help? As seen above, environmental impacts due to climate changes are clear, the list is vast and depressing. It is important to address climate change issues as they play a key role in the workings of a natural ecosystem like change in the nature of global rainfall, diminishing ice-sheets, and other factors on which the human economy and the civilization depends on. With the help of Artificial Intelligence (AI), we can increase our probability of becoming efficient, or at least slow down the damage caused by climate change. In the recently held ICLR 2019 (International Conference on Learning Representations), Emily Shuckburgh, a Climate scientist and deputy head of the Polar Oceans team at the British Antarctic Survey highlighted the need of actionable information on climate risk. It elaborated on how we can monitor, treat and find a solution to the climate changes using machine learning. Also mentioned is, how AI can synthesize and interpolate different datasets within a framework that will allow easy interrogation by users and near-real time ingestion of new data. According to MIT tech review on climate changes, there are three approaches to address climate change: mitigation, navigation and suffering. Technologies generally concentrate on mitigation, but it’s high time that we give more focus to the other two approaches. In a catastrophically altered world, it would be necessary to concentrate on adaptation and suffering. This review states that, the mitigation steps have had almost no help in preserving fossil fuels. Thus it is important for us to learn to adapt to these changes. Building predictive models by relying on masses of data will also help in providing a better idea of how bad the effect of a disaster can be and help us to visualize the suffering. By implementing Artificial Intelligence in these approaches, it will help not only to reduce the causes but also to adapt to these climate changes. Using AI, we can predict the accurate status of climate change, which will help create better futuristic climate models. These predictions can be used to identify our biggest vulnerabilities and risk zones. This will help us to respond in a better way to the impact of climate change such as hurricanes, rising sea levels, and higher temperatures. Let’s see how Artificial Intelligence is being used in all the three approaches - Mitigation: Reducing the severity of climate change Looking at the extreme climatic changes, many researchers have started exploring how AI can step-in to reduce the effects of climate change. These include ways to reduce greenhouse gas emissions or enhance the removal of these gases from the atmosphere. In view of consuming less energy, there has been an active increase in technologies to use energy smartly. One such startup is the ‘Verv’. It is an intelligent IoT hub which uses patented AI technology to give users the authority to take control of their energy usage. This home energy system provides you with information about your home appliances and other electricity data directly from the mains, which helps to reduce your electricity bills and lower your carbon footprint. ‘Igloo Energy’ is another system which helps customers use energy efficiently and save money. It uses smart meters to analyse behavioural, property occupancy and surrounding environmental data inputs to lower the energy consumption of users. ‘Nnergix’ is a weather analytics startup focused in the renewable energy industry. It collects weather and energy data from multiple sources from the industry in order to feed machine learning based algorithms to run several analytic solutions with the main goal to help any system become more efficient during operations and reduce costs. Recently, Google announced that by using Artificial Intelligence, it’s wind energy has boosted up to 20 percent. A neural network is trained on the widely available weather forecasts and historical turbine data. The DeepMind system is configured to predict the wind power output 36 hours ahead of actual generation. The model then recommends to make hourly delivery commitments to the power grid a full day in advance, based on the predictions. Large industrial systems are the cause of 54% of global energy consumption. This high-level of energy consumption is the primary contributor to greenhouse gas emissions. In 2016, Google’s ‘DeepMind’ was able to reduce the energy required to cool Google Data Centers by 30%. Initially, the team made a general purpose learning algorithm which was developed into a full-fledged AI system with features including continuous monitoring and human override. Just last year, Google has put an AI system in charge of keeping its data centers cool. Every five minutes, AI pulls a snapshot of the data center cooling system from thousands of sensors. This data is fed into deep neural networks, which predicts how different choices will affect future energy consumption. The neural networks are trained to maintain the future PUE (Power Usage Effectiveness) and to predict the future temperature and pressure of the data centre over the next hour, to ensure that any tweaks did not take the data center beyond its operating limits. Google has found that the machine learning systems were able to consistently achieve a 30 percent reduction in the amount of energy used for cooling, the equivalent of a 15 percent reduction in overall PUE. As seen, there are many companies trying to reduce the severity of climate change. Navigation: Adapting to current conditions Though there have been brave initiatives to reduce the causes of climate change, they have failed to show any major results. This could be due to the increasing demand for energy resources, which is expected to grow immensely globally. It is now necessary to concentrate more on adapting to climate change, as we are in a state where it is almost impossible to undo its effects. Thus, it is better to learn and navigate through this climate change. A startup in Berlin, called ‘GreenAdapt’ has created a software using AI, which can tackle local impacts induced both by gradual changes and changes of extreme weather events such as storms. It identifies effects of climatic changes and proposes adequate adaptation measures. Another startup called ‘Zuli’ has a smartplug that reduces energy use. It contains sensors that can estimate energy usage, wirelessly communicate with your smartphone, and accurately sense your location. A firm called ‘Gridcure’ provides real-time analytics and insights for energy and utilities. It helps power companies recover losses and boost revenue by operating more efficiently. It also helps them provide better delivery to consumers, big reductions in energy waste, and increased adoption of clean technologies. With mitigation and navigation being pursued enough, let’s see how firms are working on futuristic goals. Visualization: Predicting the future It is also equally important to visualize accurate climate models, which will help humans to cope up with the aftereffects of climate change. Climate models are mathematical representations of the Earth's climate system, which takes into account humidity, temperature, air pressure, wind speed and direction, as well as cloud cover and predict future weather conditions. This can help in tackling disasters. It’s also imperative to fervently increase our information on global climate changes which will help to create more accurate models. A startup modeling firm called ‘Jupiter’ is trying to better the accuracy of predictions regarding climate changes. It makes physics-based and Artificial Intelligence-powered decisions using data from millions of ground-based and orbital sensors. Another firm, ‘BioCarbon Engineering’ plans to use drones which will fly over potentially suitable areas and compile 3D maps. Then, it will scatter small containers over the best areas containing fertilized seeds as well as nutrients and moisture gel. In this way, 36,000 trees can be planted every day in a way that is cheaper than other methods. After planting, drones will continue to monitor the germinating seeds and deliver further nutrients when necessary to ensure their healthy growth. This could help to absorb carbon dioxide from the atmosphere. Another initiative is by a ETH doctoral student at the Functional Materials Laboratory, who has developed a cooling curtain made of a porous triple-layer membrane as an alternative to electrically powered air conditioning. In 2017, Microsoft came up with ‘AI for Earth’ initiative, which primarily focuses on climate conservation, biodiversity, etc. AI for Earth awards grants to projects that use artificial intelligence to address critical areas that are vital for building a sustainable future. Microsoft is also using its cloud computing service Azure, to give computing resources to scientists working on environmental sustainability programs. Intel has deployed Artificial Intelligence-equipped Drones in Costa Rica to construct models of the forest terrain and calculate the amount of carbon being stored based on tree height, health, biomass, and other factors. The collected data about carbon capture can enhance management and conservation efforts, support scientific research projects on forest health and sustainability, and enable many other kinds of applications. The ‘Green Horizon Project from IBM’ analyzes environmental data and predicts pollution as well as tests scenarios that involve pollution-reducing tactics. IBM's Deep Thunder’ group works with research centers in Brazil and India to accurately predict flooding and potential mudslides due to the severe storms. As seen above, there are many organizations and companies ranging from startups to big tech who have understood the adverse effects of climate change and are taking steps to address them. However, there are certain challenges/limitations acting as a barrier for these systems to be successful. What do big tech firms and startups lack? Though many big tech and influential companies boast of immense contribution to fighting climate change, there have been instances where these firms get into lucrative deals with oil companies. Just last year, Amazon, Google and Microsoft struck deals with oil companies to provide cloud, automation, and AI services to them. These deals were published openly by Gizmodo and yet didn’t attract much criticism. This trend of powerful companies venturing into oil businesses even after knowing the effects of dangerous climate changes is depressing. Last year, Amazon quietly launched the ‘Amazon Sustainability Data Initiative’.It helps researchers store many weather observations and forecasts, satellite images and metrics about oceans, air quality so that they can be used for modeling and analysis. This encourages organizations to use the data to make decisions which will encourage sustainable development. This year, Amazon has expanded its vision by announcing ‘Shipment Zero’ to make all Amazon shipments with 50% net zero by 2030, with a wider aim to make it 100% in the future. However, Shipment Zero only commits to net carbon reductions. Recently, Amazon ordered 20,000 diesel vans whose emissions will need to be offset with carbon credits. Offsets can entail forest management policies that displace indigenous communities, and they do nothing to reduce diesel pollution which disproportionately harms communities of color. Some in the industry expressed disappointment that Amazon’s order is for 20,000 diesel vans — not a single electric vehicle. In April, Over 4,520 Amazon employees organized against Amazon’s continued profiting from climate devastation. They signed an open letter addressed to Jeff Bezos and Amazon board of directors asking for a company-wide action plan to address climate change and an end to the company’s reliance on dirty energy resources. Recently, Microsoft doubled its internal carbon fee to $15 per metric ton on all carbon emissions. The funds from this higher fee will maintain Microsoft’s carbon neutrality and help meet their sustainability goals. On the other hand, Microsoft is also two years into a seven-year deal—rumored to be worth over a billion dollars—to help Chevron, one of the world’s largest oil companies, better extract and distribute oil. Microsoft Azure has also partnered with Equinor, a multinational energy company to provide data services in a deal worth hundreds of millions of dollars. Instead of gaining profit from these deals, Microsoft could have taken a stand by ending partnerships with these fossil fuel companies which accelerate oil and gas exploration and extraction. With respect to smaller firms, often it is difficult for a climate-focused conservative startup to survive due to the dearth of finance. Many such organizations are small and relatively weak as they struggle to rise in a sector with little apathy and lack of steady financing. Also startups being non-famous, it is difficult for them to market their ideas and convince people to try their systems. They always need a commercial boost to find more takers. Pitfalls of using Artificial Intelligence for climate preservation Though AI has enormous potential to help us create a sustainable future, it is only part of a bigger set of tools and pathways needed to reach the goal. It also comes with its own limitations and side effects. An inability to control malicious AI can cause unexpected outcomes. Hackers can use AI to develop smart malware that interfere with early warnings, enable bad actors to control energy, transportation or other critical systems and could also get them access to sensitive data. This could result in unexpected outcomes at crucial output points for AI systems. AI bias, is another dangerous phenomena, that can give an irrational result to a working system. Bias in an AI system mainly occurs in the data or in the system’s algorithmic model which may produce incorrect results in its functions and security. [dropcap]M[/dropcap]ore importantly, we should not rely on Artificial Intelligence alone to fight the effects of climate change. Our focus should be to work on the causes of climate change and try to minimize it, from an individual level. Even governments in every country must contribute, by initiating “climate policies” which will help its citizens in the long run. One vital task would be to implement quick responses in case of climate emergencies. Like the recent case of Odisha storms, the pinpoint accuracy by the Indian weather association helped to move millions of people to safe spaces, resulting in minimum casualties. Next up in Climate Amazon employees plan to walkout for climate change during the Sept 20th Global Climate Strike Machine learning experts on how we can use machine learning to mitigate and adapt to the changing climate Now there’s a CycleGAN to visualize the effects of climate change. But is this enough to mobilize action?

0
0
6162

article-image-the-cap-theorem-in-practice-the-consistency-vs-availability-trade-off-in-distributed-databases

Richard Gall

12 Sep 2019

7 min read

The CAP Theorem in practice: The consistency vs. availability trade-off in distributed databases

Richard Gall

12 Sep 2019

7 min read

When you choose a database you are making a design decision. One of the best frameworks for understanding what this means in practice is the CAP Theorem. What is the CAP Theorem? The CAP Theorem, developed by computer scientist Eric Brewer in the late nineties, states that databases can only ever fulfil two out of three elements: Consistency - that reads are always up to date, which means any client making a request to the database will get the same view of data. Availability - database requests always receive a response (when valid). Partition tolerance - that a network fault doesn’t prevent messaging between nodes. In the context of distributed (NoSQL) databases, this means there is always going to be a trade-off between consistency and availability. This is because distributed systems are always necessarily partition tolerant (ie. it simply wouldn’t be a distributed database if it wasn’t partition tolerant.) Read next: Different types of NoSQL databases and when to use them How do you use the CAP Theorem when making database decisions? Although the CAP Theorem can feel quite abstract, it has practical, real-world consequences. From both a technical and business perspective the trade-offs will lead you to some very important questions. There are no right answers. Ultimately it will be all about the context in which your database is operating, the needs of the business, and the expectations and needs of users. You will have to consider things like: Is it important to avoid throwing up errors in the client? Or are we willing to sacrifice the visible user experience to ensure consistency? Is consistency an actual important part of the user’s experience Or can we actually do what we want with a relational database and avoid the need for partition tolerance altogether? As you can see, these are ultimately user experience questions. To properly understand those, you need to be sensitive to the overall goals of the project, and, as said above, the context in which your database solution is operating. (Eg. Is it powering an internal analytics dashboard? Or is it supporting a widely used external-facing website or application?) And, as the final bullet point highlights, it’s always worth considering whether the consistency v availability trade-off should matter at all. Avoid the temptation to think a complex database solution will always be better when a simple, more traditional solution will do the job. Of course, it’s important to note that systems that aren’t partition tolerant are a single point of failure in a system. That introduces the potential for unreliability. Prioritizing consistency in a distributed database It’s possible to get into a lot of technical detail when talking about consistency and availability, but at a really fundamental level the principle is straightforward: you need consistency (or what is called a CP database) if the data in the database must always be up to date and aligned, even in the instance of a network failure (eg. the partitioned nodes are unable to communicate with one another for whatever reason). Particular use cases where you would prioritize consistency is when you need multiple clients to have the same view of the data. For example, where you’re dealing with financial information, personal information, using a database that gives you consistency and confidence that data you are looking at is up to date in a situation where the network is unreliable or fails. Examples of CP databases MongoDB Learning MongoDB 4 [Video] MongoDB 4 Quick Start Guide MongoDB, Express, Angular, and Node.js Fundamentals Redis Build Complex Express Sites with Redis and Socket.io [Video] Learning Redis HBase Learn by Example : HBase - The Hadoop Database [Video] HBase Design Patterns Prioritizing availability in a distributed database Availability is essential when data accumulation is a priority. Think here of things like behavioral data or user preferences. In scenarios like these, you will want to capture as much information as possible about what a user or customer is doing, but it isn’t critical that the database is constantly up to date. It simply just needs to be accessible and available even when network connections aren’t working. The growing demand for offline application use is also one reason why you might use a NoSQL database that prioritizes availability over consistency. Examples of AP databases Cassandra Learn Apache Cassandra in Just 2 Hours [Video] Mastering Apache Cassandra 3.x - Third Edition DynamoDB Managed NoSQL Database In The Cloud - Amazon AWS DynamoDB [Video] Hands-On Amazon DynamoDB for Developers [Video] Limitations and criticisms of CAP Theorem It’s worth noting that the CAP Theorem can pose problems. As with most things, in truth, things are a little more complicated. Even Eric Brewer is circumspect about the theorem, especially as what we expect from distributed databases. Back in 2012, twelve years after he first put his theorem into the world, he wrote that: “Although designers still need to choose between consistency and availability when partitions are present, there is an incredible range of flexibility for handling partitions and recovering from them. The modern CAP goal should be to maximize combinations of consistency and availability that make sense for the specific application. Such an approach incorporates plans for operation during a partition and for recovery afterward, thus helping designers think about CAP beyond its historically perceived limitations.” So, this means we must think about the trade-off between consistency and availability as a balancing act, rather than a binary design decision. Elsewhere, there have been more robust criticisms of CAP Theorem. Software engineer Martin Kleppmann, for example, pleaded Please stop calling databases CP or AP in 2015. In a blog post he argues that CAP Theorem only works if you adhere to specific definitions of consistency, availability, and partition tolerance. “If your use of words matches the precise definitions of the proof, then the CAP theorem applies to you," he writes. “But if you’re using some other notion of consistency or availability, you can’t expect the CAP theorem to still apply.” The consequences of this are much like those described in Brewer’s piece from 2012. You need to take a nuanced approach to database trade-offs in which you think them through on your own terms and up against your own needs. The PACELC Theorem One of the developments of this line of argument is an extension to the CAP Theorem: the PACELC Theorem. This moves beyond thinking about consistency and availability and instead places an emphasis on the trade-off between consistency and latency. The PACELC Theorem builds on the CAP Theorem (the ‘PAC’) and adds an else (the ‘E’). What this means is that while you need to choose between availability and consistency if communication between partitions has failed in a distributed system, even if things are running properly and there are no network issues, there is still going to be a trade-off between consistency and latency (the ‘LC’). Conclusion: Learn to align context with technical specs Although the CAP Theorem might seem somewhat outdated, it is valuable in providing a way to think about database architecture design. It not only forces engineers and architects to ask questions about what they want from the technologies they use, but it also forces them to think carefully about the requirements of a given project. What are the business goals? What are user expectations? The PACELC Theorem builds on CAP in an effective way. However, the most important thing about these frameworks is how they help you to think about your problems. Of course the CAP Theorem has limitations. Because it abstracts a problem it is necessarily going to lack nuance. There are going to be things it simplifies. It’s important, as Kleppmann reminds us - to be mindful of these nuances. But at the same time, we shouldn’t let an obsession with nuance and detail allow us to miss the bigger picture.

0
0
24069

article-image-different-types-of-nosql-databases-and-when-to-use-them

Richard Gall

10 Sep 2019

8 min read

Different types of NoSQL databases and when to use them

Richard Gall

10 Sep 2019

8 min read

Why NoSQL databases? The popularity of NoSQL databases over the last decade or so has been driven by an explosion of data. Before what’s commonly described as ‘the big data revolution’, relational databases were the norm - these are databases that contain structured data. Structured data can only be structured if it is based on an existing schema that defines the relationships (hence relational) between the data inside the database. However, with the vast quantities of data that are now available to just about every business with an internet connection, relational databases simply aren’t equipped to handle the complexity and scale of large datasets. Why not SQL databases? This is for a couple of reasons. The defined schemas that are a necessary component of every relational database will not only undermine the richness and integrity of the data you’re working with, relational databases are also hard to scale. Relational databases can only scale vertically, not horizontally. That’s fine to a certain extent, but when you start getting high volumes of data - such as when millions of people use a web application, for example - things get really slow and you need more processing power. You can do this by upgrading your hardware, but that isn’t really sustainable. By scaling out, as you can with NoSQL databases, you can use a distributed network of computers to handle data. That gives you more speed and more flexibility. This isn’t to say that relational and SQL databases have had their day. They still fulfil many use cases. The only difference is that NoSQL can offers a level of far greater power and control for data intensive use cases. Indeed, using a NoSQL database when SQL will do is only going to add more complexity to something that just doesn’t need it. Seven NoSQL Databases in a Week Different types of NoSQL databases and when to use them So, now we’ve looked at why NoSQL databases have grown in popularity in recent years, lets dig into some of the different options available. There are a huge number of NoSQL databases out there - some of them open source, some premium products - many of them built for very different purposes. Broadly speaking there are 4 different models of NoSQL databases: Key-Value pair-based databases Column-based databases Document-oriented databases Graph databases Let’s take a look at these four models, how they’re different from one another, and some examples of the product options in each. Key Value pair-based NoSQL database management systems Key/Value pair based NoSQL databases store data in, as you might expect, pairs of keys and values. Data is stored with a matching key - keys have no relation or structure (so, keys could be height, age, hair color, for example). When should you use a key/value pair-based NoSQL DBMS? Key/value pair based NoSQL databases are the most basic type of NoSQL database. They’re useful for storing fairly basic information, like details about a customer. Which key/value pair-based DBMS should you use? There are a number of different key/value pair databases. The most popular is Redis. Redis is incredibly fast and very flexible in terms of the languages and tools it can be used with. It can be used for a wide variety of purposes - one of the reasons high-profile organizations use it, including Verizon, Atlassian, and Samsung. It’s also open source with enterprise options available for users with significant requirements. Redis 4.x Cookbook Other than Redis, other options include Memcached and Ehcache. As well as those, there are a number of other multi-model options (which will crop up later, no doubt) such as Amazon DynamoDB, Microsoft’s Cosmos DB, and OrientDB. Hands-On Amazon DynamoDB for Developers [Video] RDS PostgreSQL and DynamoDB CRUD: AWS with Python and Boto3 [Video] Column-based NoSQL database management systems Column-based databases separate data into discrete columns. Instead of using rows - whereby the row ID is the main key - column-based database systems flip things around to make the data the main key. By using columns you can gain much greater speed when querying data. Although it’s true that querying a whole row of data would take longer in a column-based DBMS, the use cases for column based databases mean you probably won’t be doing this. Instead you’ll be querying a specific part of the data rather than the whole row. When should you use a column-based NoSQL DBMS? Column-based systems are most appropriate for big data and instances where data is relatively simple and consistent (they don’t particularly handle volatility that well). Which column-based NoSQL DBMS should you use? The most popular column-based DBMS is Cassandra. The software prizes itself on its performance, boasting 100% availability thanks to lacking a single point of failure, and offering impressive scalability at a good price. Cassandra’s popularity speaks for itself - Cassandra is used by 40% of the Fortune 100. Mastering Apache Cassandra 3.x - Third Edition Learn Apache Cassandra in Just 2 Hours [Video] There are other options available, such as HBase and Cosmos DB. HBase High Performance Cookbook Document-oriented NoSQL database management systems Document-oriented NoSQL systems are very similar to key/value pair database management systems. The only difference is that the value that is paired with a key is stored as a document. Each document is self-contained, which means no schema is required - giving a significant degree of flexibility over the data you have. For software developers, this is essential - it’s for this reason that document-oriented databases such as MongoDB and CouchDB are useful components of the full-stack development tool chain. Some search platforms such as ElasticSearch use mechanisms similar to standard document-oriented systems - so they could be considered part of the same family of database management systems. When should you use a document-oriented DBMS? Document-oriented databases can help power many different types of websites and applications - from stores to content systems. However, the flexibility of document-oriented systems means they are not built for complex queries. Which document-oriented DBMS should you use? The leader in this space is, MongoDB. With an amazing 40 million downloads (and apparently 30,000 more every single day), it’s clear that MongoDB is a cornerstone of the NoSQL database revolution. MongoDB 4 Quick Start Guide MongoDB Administrator's Guide MongoDB Cookbook - Second Edition There are other options as well as MongoDB - these include CouchDB, CouchBase, DynamoDB and Cosmos DB. Learning Azure Cosmos DB Guide to NoSQL with Azure Cosmos DB Graph-based NoSQL database management systems The final type of NoSQL database is graph-based. The notable distinction about graph-based NoSQL databases is that they contain the relationships between different data. Subsequently, graph databases look quite different to any of the other databases above - they store data as nodes, with the ‘edges’ of the nodes describing their relationship to other nodes. Graph databases, compared to relational databases, are multidimensional in nature. They display not just basic relationships between tables and data, but more complex and multifaceted ones. When should you use a graph database? Because graph databases contain the relationships between a set of data (customers, products, price etc.) they can be used to build and model networks. This makes graph databases extremely useful for applications ranging from fraud detection to smart homes to search. Which graph database should you use? The world’s most popular graph database is Neo4j. It’s purpose built for data sets that contain strong relationships and connections. Widely used in the industry in companies such as eBay and Walmart, it has established its reputation as one of the world’s best NoSQL database products. Back in 2015 Packt’s Data Scientist demonstrated how he used Neo4j to build a graph application. Read more. Learning Neo4j 3.x [Video] Exploring Graph Algorithms with Neo4j [Video] NoSQL databases are the future - but know when to use the right one for the job Although NoSQL databases will remain a fixture in the engineering world, SQL databases will always be around. This is an important point - when it comes to databases, using the right tool for the job is essential. It’s a valuable exercise to explore a range of options and get to know how they work - sometimes the difference might just be a personal preference about usability. And that’s fine - you need to be productive after all. But what’s ultimately most essential is having a clear sense of what you’re trying to accomplish, and choosing the database based on your fundamental needs.

0
0
22068

article-image-what-can-you-expect-at-neurips-2019

Sugandha Lahoti

06 Sep 2019

5 min read

What can you expect at NeurIPS 2019?

Sugandha Lahoti

06 Sep 2019

5 min read

0
0
4686

article-image-google-is-circumventing-gdpr-reveals-braves-investigation-for-the-authorized-buyers-ad-business-case

Bhagyashree R

06 Sep 2019

6 min read

Google is circumventing GDPR, reveals Brave's investigation for the Authorized Buyers ad business case

Bhagyashree R

06 Sep 2019

6 min read

Last year, Dr. Johnny Ryan, the Chief Policy & Industry Relations Officer at Brave, filed a complaint against Google’s DoubleClick/Authorized Buyers ad business with the Irish Data Protection Commission (DPC). New evidence produced by Brave reveals that Google is circumventing GDPR and also undermining its own data protection measures. Brave calls Google’s Push Pages a GDPR workaround Brave’s new evidence rebuts some of Google’s claims regarding its DoubleClick/Authorized Buyers system, the world’s largest real-time advertising auction house. Google says that it prohibits companies that use its real-time bidding (RTB) ad system “from joining data they receive from the Cookie Matching Service.” In September last year, Google announced that it has removed encrypted cookie IDs and list names from bid requests with buyers in its Authorized Buyers marketplace. Brave’s research, however, found otherwise, “Brave’s new evidence reveals that Google allowed not only one additional party, but many, to match with Google identifiers. The evidence further reveals that Google allowed multiple parties to match their identifiers for the data subject with each other.” When you visit a website that has Google ads embedded on its web pages, Google will run a real-time bidding ad auction to determine which advertiser will get to display its ads. For this, it uses Push Pages, which is the mechanism in question here. Brave hired Zach Edwards, the co-founder of digital analytics startup Victory Medium, and MetaX, a company that audits data supply chains, to investigate and analyze a log of Dr. Ryan’s web browsing. The research revealed that Google's Push Pages can essentially be used as a workaround for user IDs. Google shares a ‘google_push’ identifier with the participating companies to identify a user. Brave says that the problem here is that the identifier that was shared was common to multiple companies. This means that these companies could have cross-referenced what they learned about the user from Google with each other. Used by more than 8.4 million websites, Google's DoubleClick/Authorized Buyers broadcasts personal data of users to 2000+ companies. This data includes the category of what a user is reading, which can reveal their political views, sexual orientation, religious beliefs, as well as their locations. There are also unique ID codes that are specific to a user that can let companies uniquely identify a user. All this information can give these companies a way to keep tabs on what users are “reading, watching, and listening to online.” Brave calls Google’s RTB data protection policies “weak” as they ask these companies to self-regulate. Google does not have much control over what these companies do with the data once broadcast. “Its policy requires only that the thousands of companies that Google shares peoples’ sensitive data with monitor their own compliance, and judge for themselves what they should do,” Brave wrote. A Google spokesperson, as a response to this news, told Forbes, “We do not serve personalised ads or send bid requests to bidders without user consent. The Irish DPC — as Google's lead DPA — and the UK ICO are already looking into real-time bidding in order to assess its compliance with GDPR. We welcome that work and are co-operating in full." Users recommend starting an “information campaign” instead of a penalty that will hardly affect the big tech This news triggered a discussion on Hacker News where users talked about the implications of RTB and what strict actions the EU can take to protect user privacy. A user explained, "So, let's say you're an online retailer, and you have Google IDs for your customers. You probably have some useful and sensitive customer information, like names, emails, addresses, and purchase histories. In order to better target your ads, you could participate in one of these exchanges, so that you can use the information you receive to suggest products that are as relevant as possible to each customer. To participate, you send all this sensitive information, along with a Google ID, and receive similar information from other retailers, online services, video games, banks, credit card providers, insurers, mortgage brokers, service providers, and more! And now you know what sort of vehicles your customers drive, how much they make, whether they're married, how many kids they have, which websites they browse, etc. So useful! And not only do you get all these juicy private details, but you've also shared your customers sensitive purchase history with anyone else who is connected to the exchange." Others said that a penalty is not going to deter Google. "The whole penalty system is quite silly. The fines destroy small companies who are the ones struggling to comply, and do little more than offer extremely gentle pokes on the wrist for megacorps that have relatively unlimited resources available for complete compliance, if they actually wanted to comply." Users suggested that the EU should instead start an information campaign. "EU should ignore the fines this time and start an "information campaign" regarding behavior of Google and others. I bet that hurts Google 10 times more." Some also said that not just Google but the RTB participants should also be held responsible. "Because what Google is doing is not dissimilar to how any other RTB participant is acting, saying this is a Google workaround seems disingenuous." With this case, Brave has launched a full-fledged campaign that aims to “reform the multi-billion dollar RTB industry spans sixteen EU countries.” To achieve this goal it has collaborated with several privacy NGOs and academics including the Open Rights Group, Dr. Michael Veale of the Turing Institute, among others. In other news, a Bloomberg report reveals that Google and other internet companies have recently asked for an amendment to the California Consumer Privacy Act, which will be enacted in 2020. The law currently limits how digital advertising companies collect and make money from user data. The amendments proposed include approval for collecting user data for targeted advertising, using the collected data from websites for their own analysis, and many others. Read the Bloomberg report to know more in detail. Other news in Data Facebook content moderators work in filthy, stressful conditions and experience emotional trauma daily, reports The Verge GDPR complaint in EU claim billions of personal data leaked via online advertising bids European Union fined Google 1.49 billion euros for antitrust violations in online advertising

0
0
2659

article-image-key-skills-every-database-programmer-should-have

Sugandha Lahoti

05 Sep 2019

7 min read

Key skills every database programmer should have

Sugandha Lahoti

05 Sep 2019

7 min read

According to Robert Half Technology’s 2019 IT salary report, ‘Database programmer’ is one of the 13 most in-demand tech jobs for 2019. For an entry-level programmer, the average salary is $98,250 which goes up to $167,750 for a seasoned expert. A typical database programmer is responsible for designing, developing, testing, deploying, and maintaining databases. In this article, we will list down the top critical tech skills essential to database programmers. #1 Ability to perform Data Modelling The first step is to learn to model the data. In Data modeling, you create a conceptual model of how data items relate to each other. In order to efficiently plan a database design, you should know the organization you are designing the database from. This is because Data models describe real-world entities such as ‘customer’, ‘service’, ‘products’, and the relation between these entities. Data models provide an abstraction for the relations in the database. They aid programmers in modeling business requirements and in translating business requirements into relations. They are also used for exchanging information between the developers and business owners. During the design phase, the database developer should pay great attention to the underlying design principles, run a benchmark stack to ensure performance, and validate user requirements. They should also avoid pitfalls such as data redundancy, null saturation, and tight coupling. #2 Know a database programming language, preferably SQL Database programmers need to design, write and modify programs to improve their databases. SQL is one of the top languages that are used to manipulate the data in a database and to query the database. It's also used to define and change the structure of the data—in other words, to implement the data model. Therefore it is essential that you learn SQL. In general, SQL has three parts: Data Definition Language (DDL): used to create and manage the structure of the data Data Manipulation Language (DML): used to manage the data itself Data Control Language (DCL): controls access to the data Considering, data is constantly inserted into the database, changed, or retrieved DML is used more often in day-to-day operations than the DDL, so you should have a strong grasp on DML. If you plan to grow in a database architect role in the near future, then having a good grasp of DDL will go a long way. Another reason why you should learn SQL is that almost every modern relational database supports SQL. Although different databases might support different features and implement their own dialect of SQL, the basics of the language remain the same. If you know SQL, you can quickly adapt to MySQL, for example. At present, there are a number of categories of database models predominantly, relational, object-relational, and NoSQL databases. All of these are meant for different purposes. Relational databases often adhere to SQL. Object-relational databases (ORDs) are also similar to relational databases. NoSQL, which stands for "not only SQL," is an alternative to traditional relational databases useful for working with large sets of distributed data. They provide benefits such as availability, schema-free, and horizontal scaling, but also have limitations such as performance, data retrieval constraints, and learning time. For beginners, it is advisable to first start with experimenting on relational databases learning SQL, gradually transitioning to NoSQL DBMS. #3 Know how to Extract, Transform, Load various data types and sources A database programmer should have a good working knowledge of ETL (Extract, Transform Load) programming. ETL developers basically extract data from different databases, transform it and then load the data into the Data Warehouse system. A Data Warehouse provides a common data repository that is essential for business needs. A database programmer should know how to tune existing packages, tables, and queries for faster ETL processing. They should conduct unit tests before applying any change to the existing ETL process. Since ETL takes data from different data sources (SQL Server, CSV, and flat files), a database developer should have knowledge on how to deal with different data sources. #4 Design and test Database plans Database programmers o perform regular tests to identify ways to solve database usage concerns and malfunctions. As databases are usually found at the lowest level of the software architecture, testing is done in an extremely cautious fashion. This is because changes in the database schema affect many other software components. A database developer should make sure that when changing the database structure, they do not break existing applications and that they are using the new structures properly. You should be proficient in Unit testing your database. Unit tests are typically used to check if small units of code are functioning properly. For databases, unit testing can be difficult. So the easiest way to do all of that is by writing the tests as SQL scripts. You should also know about System Integration Testing which is done on the complete system after the hardware and software modules of that system have been integrated. SIT validates the behavior of the system and ensures that modules in the system are functioning suitably. #5 Secure your Database Data protection and security are essential for the continuity of business. Databases often store sensitive data, such as user information, email addresses, geographical addresses, and payment information. A robust security system to protect your database against any data breach is therefore necessary. While a database architect is responsible for designing and implementing secure design options, a database admin must ensure that the right security and privacy policies are in place and are being observed. However, this does not absolve database programmers from adopting secure coding practices. Database programmers need to ensure that data integrity is maintained over time and is secure from unauthorized changes or theft. They need to especially be careful about Table Permissions i.e who can read and write to what tables. You should be aware of who is allowed to perform the 4 basic operations of INSERT, UPDATE, DELETE and SELECT against which tables. Database programmers should also adopt authentication best practices depending on the infrastructure setup, the application's nature, the user's characteristics, and data sensitivity. If the database server is accessed from the outside world, it is beneficial to encrypt sessions using SSL certificates to avoid packet sniffing. Also, you should secure database servers that trust all localhost connections, as anyone who accesses the localhost can access the database server. #6 Optimize your database performance A database programmer should also be aware of how to optimize their database performance to achieve the best results. At the basic level, they should know how to rewrite SQL queries and maintain indexes. Other aspects of optimizing database performance, include hardware configuration, network settings, and database configuration. Generally speaking, tuning database performance requires knowledge about the system's nature. Once the database server is configured you should calculate the number of transactions per second (TPS) for the database server setup. Once the system is up and running, and you should set up a monitoring system or log analysis, which periodically finds slow queries, the most time-consuming queries, etc. #7 Develop your soft skills Apart from the above technical skills, a database programmer needs to be comfortable communicating with developers, testers and project managers while working on any software project. A keen eye for detail and critical thinking can often spot malfunctions and errors that may otherwise be overlooked. A database programmer should be able to quickly fix issues within the database and streamline the code. They should also possess quick-thinking to prioritize tasks and meet deadlines effectively. Often database programmers would be required to work on documentation and technical user guides so strong writing and technical skills are a must. Get started If you want to get started with becoming a Database programmer, Packt has a range of products. Here are some of the best: PostgreSQL 11 Administration Cookbook Learning PostgreSQL 11 - Third Edition PostgreSQL 11 in 7 days [ Video ] Using MySQL Databases With Python [ Video ] Basic Relational Database Design [ Video ] How to learn data science: from data mining to machine learning How to ace a data science interview 5 barriers to learning and technology training for small software development teams

0
0
14614

article-image-how-to-learn-data-science-from-data-mining-to-machine-learning

Richard Gall

04 Sep 2019

6 min read

How to learn data science: from data mining to machine learning

Richard Gall

04 Sep 2019

6 min read

Data science is a field that’s complex and diverse. If you’re trying to learn data science and become a data scientist it can be easy to fall down a rabbit hole of machine learning or data processing. To a certain extent, that’s good. To be an effective data scientist you need to be curious. You need to be prepared to take on a range of different tasks and challenges. But that’s not always that efficient: if you want to learn quickly and effectively, you need a clear structure - a curriculum - that you can follow. This post will show you what you need to learn and how to go about it. Statistics Statistics is arguably the cornerstone of data science. Nate Silver called data scientists “sexed up statisticians”, a comment that was perhaps unfair but still nevertheless contains a kernel of truth in it: that data scientists are always working in the domain of statistics. Once you understand this everything else you need to learn will follow easily. Machine learning, data manipulation, data visualization - these are all ultimately technological methods for performing statistical analysis really well. Best Packt books and videos content for learning statistics Statistics for Data Science R Statistics Cookbook Statistical Methods and Applied Mathematics in Data Science [Video] Before you go any deeper into data science, it’s critical that you gain a solid foundation in statistics. Data mining and wrangling This is an important element of data science that often gets overlooked with all the hype about machine learning. However, without effective data collection and cleaning, all your efforts elsewhere are going to be pointless at best. At worst they might even be misleading or problematic. Sometimes called data manipulation or data munging, it's really all about managing and cleaning data from different sources so it can be used for analytics projects. To do it well you need to have a clear sense of where you want to get to - do you need to restructure the data? Sort or remove certain parts of a data set? Once you understand this, it’s much easier to wrangle data effectively. Data mining and wrangling tools There are a number of different tools you can use for data wrangling. Python and R are the two key programming languages, and both have some useful tools for data mining and manipulation. Python in particular has a great range of tools for data mining and wrangling, such as pandas and NLTK (Natural Language Toolkit), but that isn’t to say R isn’t powerful in this domain. Other tools are available too - Weka and Apache Mahout, for example, are popular. Weka is written in Java so is a good option if you have experience with that programming language, while Mahout integrates well with the Hadoop ecosystem. Data mining and data wrangling books and videos If you need to learn data mining, wrangling and manipulation, Packt has a range of products. Here are some of the best: Data Wrangling with R Data Wrangling with Python Python Data Mining Quick Start Guide Machine Learning for Data Mining Machine learning and artificial intelligence Although Machine learning and artificial intelligence are huge trends in their own right, they are nevertheless closely aligned with data science. Indeed, you might even say that their prominence today has grown out of the excitement around data science that we first we witnessed just under a decade ago. It’s a data scientist’s job to use machine learning and artificial intelligence in a way that can drive business value. That could, for example, be to recommend products or services to customers, perhaps to gain a better understanding into existing products, or even to better manage strategic and financial risks through predictive modelling. So, while we can see machine learning in a massive range of digital products and platforms - all of which require smart development and design - for it to work successfully, it needs to be supported by a capable and creative data scientist. Machine learning and artificial intelligence books for data scientists Machine Learning Algorithms Machine Learning with R - Third Edition Machine Learning with Apache Spark Quick Start Guide Machine Learning with TensorFlow 1.x Keras Deep Learning Cookbook Data visualization A talented data scientist isn’t just a great statistician and engineer, they’re also a great communicator. This means so-called soft skills are highly valuable - the ability to communicate insights and ideas with key stakeholders is essential. But great communication isn’t just about soft skills, it’s also about data visualization. Data visualization is, at a fundamental level, about organizing and presenting data in a way that tells a story, clarifies a problem, or illustrates a solution. It’s essential that you don’t overlook this step. Indeed, spending time learning about effective data visualization can also help you to develop your soft skills. The principles behind storytelling and communication through visualization are, in truth, exactly the same when applied to other scenarios. Data visualization tools There are a huge range of data visualization tools available. As with machine learning, understanding the differences between them and working out what solution will work for you is actually an important part of the learning process. For that reason, don’t be afraid to spend a little bit of time with a range of data visualization tools. Many of the most popular data visualization tools are paid for products. Perhaps the best known of these is Tableau (which, incidentally was bought by Salesforce earlier this year). Tableau and its competitors are very user friendly, which means the barrier to entry is pretty low. They allow you to create some pretty sophisticated data visualizations fairly easily. However, sticking to these tools is not only expensive, it can also limit your abilities. We’d recommend trying a number of different data visualization tools, such as Seabor, D3.js, Matplotlib, and ggplot2. Data visualization books and videos for data scientists Applied Data Visualization with R and ggplot2 Tableau 2019.1 for Data Scientists [Video] D3.js Data Visualization Projects [Video] Tableau in 7 Steps [Video] Data Visualization with Python If you want to learn data science, just get started! As we've seen, data science requires a number of very different skills and takes in a huge breadth of tools. That means that if you're going to be a data scientist, you need to be prepared to commit to learning forver: you're never going to reach a point where you know everything. While that might sound intimidating, it's important to have confidence. With a sense of direction and purpose, and a learning structure that works for you, it's possible to develop and build your data science capabilities in a way that could unlock new opportunities and act as the basis for some really exciting projects.

0
0
5021

article-image-how-to-ace-a-data-science-interview

Richard Gall

02 Sep 2019

12 min read

How to ace a data science interview

Richard Gall

02 Sep 2019

12 min read

0
0
4922

How-To Tutorials - Data

Machine learning ethics: what you need to know and what you can do

Bad Metadata can get you in legal hot water

How to handle categorical data for machine learning algorithms

GitHub acquires Semmle to secure open-source supply chain; attains CVE Numbering Authority status

The best business intelligence tools 2019: when to use them and how much they cost

GQL (Graph Query Language) joins SQL as a Global Standards Project and will be the international standard declarative query language for graphs

4 important business intelligence considerations for the rest of 2019

How artificial intelligence and machine learning can help us tackle the climate change emergency

The CAP Theorem in practice: The consistency vs. availability trade-off in distributed databases

Different types of NoSQL databases and when to use them

Trending Topics

What can you expect at NeurIPS 2019?

Google is circumventing GDPR, reveals Brave's investigation for the Authorized Buyers ad business case

Key skills every database programmer should have

How to learn data science: from data mining to machine learning

How to ace a data science interview