The CodeSearchNet corpus contains around 6 million functions of open-source code across six programming languages: Go, Java, Python, JavaScript, PHP, and Ruby. To collect this large dataset of functions, the team used TreeSitter, GitHub's parser generator tool and incremental parsing library, and it is also releasing its data preprocessing pipeline so that others can use it as a starting point for applying machine learning to code. Most of this data is not directly usable for code search, but when paired with a related natural language description, it can be used to train models.
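As an illustration of this parsing step, the sketch below uses TreeSitter's classic Python bindings (py-tree-sitter) to pull function definitions out of a source file. It is a minimal sketch, not the team's actual pipeline: the grammar path and shared-library name are placeholders, and it assumes the tree-sitter-python grammar repository has been cloned locally.

```python
from tree_sitter import Language, Parser

# One-time step: compile the cloned grammar into a shared library.
# Paths here are placeholders.
Language.build_library("build/langs.so", ["vendor/tree-sitter-python"])
PY_LANGUAGE = Language("build/langs.so", "python")

parser = Parser()
parser.set_language(PY_LANGUAGE)

source = b'''
def add(a, b):
    """Return the sum of two numbers."""
    return a + b
'''

tree = parser.parse(source)

def iter_functions(node):
    """Yield every function_definition node in the syntax tree."""
    if node.type == "function_definition":
        yield node
    for child in node.children:
        yield from iter_functions(child)

for fn in iter_functions(tree.root_node):
    # Recover the raw source text of each function from byte offsets.
    print(source[fn.start_byte:fn.end_byte].decode("utf8"))
```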
The CodeSearchNet corpus contains automatically generated, query-like natural language descriptions for around 2 million functions. It also includes metadata indicating the original location where each function was found.
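Concretely, a single corpus entry might look something like the record below. The field names are illustrative, loosely following the released jsonl format, and the repository and URL are made up.

```python
# A hypothetical corpus record; all values here are invented.
example = {
    "repo": "user/project",
    "path": "src/utils.py",
    "func_name": "utils.add",
    "language": "python",
    "code": 'def add(a, b):\n    """Return the sum of two numbers."""\n    return a + b',
    "docstring": "Return the sum of two numbers.",
    # Metadata pointing back to the original location of the function.
    "url": "https://github.com/user/project/blob/main/src/utils.py",
}
```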
The team collects the corpus from publicly available, non-fork, open-source GitHub repositories, using libraries.io to identify all projects that are used by at least one other project. It then ranks these projects by ‘popularity’, measured by the number of stars and forks, and removes projects that do not have a license or whose license does not allow redistribution of parts of the project.
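This filtering and ranking step might be sketched as follows. The repository records, the dependent counts, and the ALLOWED_LICENSES set are all assumptions made for illustration, not the team's actual criteria.

```python
# Hypothetical repository records, e.g. assembled from libraries.io data.
repos = [
    {"name": "a/x", "stars": 1200, "forks": 310, "license": "mit",
     "fork": False, "dependents": 4},
    {"name": "b/y", "stars": 90, "forks": 12, "license": None,
     "fork": False, "dependents": 1},
]

# Licenses assumed (for illustration) to permit redistribution.
ALLOWED_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}

candidates = [
    r for r in repos
    if not r["fork"]                      # non-fork repositories only
    and r["dependents"] >= 1              # used by at least one other project
    and r["license"] in ALLOWED_LICENSES  # license allows redistribution
]

# Rank by 'popularity', approximated here as stars plus forks.
candidates.sort(key=lambda r: r["stars"] + r["forks"], reverse=True)
```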
The team has also tokenized all of the Go, JavaScript, Python, Java, PHP, and Ruby functions with the help of TreeSitter. To generate the training data for the CodeSearchNet Challenge, the team considers only those functions in the corpus that have documentation associated with them.
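To show how documented functions can be paired with query-like text, here is a small sketch that uses Python's own ast module rather than TreeSitter: it keeps only functions that carry a docstring and pairs each one with the first line of that docstring.

```python
import ast

def documented_pairs(source: str):
    """Return (docstring_summary, code) pairs for every documented
    function in `source`; undocumented functions are skipped."""
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            doc = ast.get_docstring(node)
            if doc:
                summary = doc.strip().splitlines()[0]  # query-like text
                pairs.append((summary, ast.unparse(node)))
    return pairs

pairs = documented_pairs('''
def add(a, b):
    """Return the sum of two numbers."""
    return a + b
''')
print(pairs[0][0])  # -> Return the sum of two numbers.
```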
The team collected an initial set of code search queries for evaluating code search models. They started with common search queries that had high click-through rates on Bing, combined these with queries from StaQC, and then manually filtered out queries that were clearly ‘technical keywords’ to obtain a set of 99 natural language queries.
The team then used a standard Elasticsearch installation and their baseline models to obtain 10 results per query from the CodeSearchNet Corpus, and asked data scientists, programmers, and machine learning researchers to annotate the results for relevance to the query. To be evaluated on the CodeSearchNet Challenge, a method should return a set of results from the CodeSearchNet Corpus for each of the 99 pre-defined natural language queries.
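A retrieval step like this might look as follows, using the elasticsearch Python client (8.x-style API). The index name "functions" and the "docstring" field are assumptions made for this sketch, not the team's actual schema.

```python
from elasticsearch import Elasticsearch

# Assumes functions have already been indexed under an index named
# "functions" with a "docstring" text field; both names are made up.
es = Elasticsearch("http://localhost:9200")

def search(query: str, k: int = 10):
    resp = es.search(
        index="functions",
        query={"match": {"docstring": query}},
        size=k,
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]

# 10 candidate results per natural-language query, as described above,
# which annotators would then grade for relevance.
hits = search("convert a string to a datetime")
```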