Secure Your Code Via AI
In this presentation Eliezer Kanal, a Technical Manager at CERT, discusses the possibilities involved in writing secure software that is not vulnerable to cyber-attacks. Developers have created many techniques to help write secure code, and while many of these techniques have existed for a while, applying machine learning can make them far more efficient to use.
Eliezer frames the solution to this problem from an NLP perspective. Natural Language Processing is a machine learning approach to understanding, categorizing, and predicting language. This is done in three steps. First, the data needs to be acquired, and as is typical of machine learning algorithms, the more data the better. This data consists of anything that is written and is combined with an algorithm to produce a model.
The next step involves processing new raw data with the model to produce a representation of what that raw data means. The final step is to generate new language. This last step can be thought of as the autocomplete function you might see when performing a Google search or writing a text message.
“You shall know a word by the company it keeps.”
How a machine learning algorithm attempts to dissect language can be thought of in terms of morphology, lexical analysis, and semantics. Morphology breaks words into their component parts. Lexical analysis converts a sequence of characters into a sequence of tokens: strings with an assigned, and thus identified, meaning. Finally, semantics tries to determine what you are supposed to do with the information gathered. All of this can become quite complicated when looking at normal speech and text. Fortunately, code is more structured than natural language, which actually makes NLP better suited to applications involving code.
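To make the lexical-analysis step concrete, here is a minimal sketch of a tokenizer that turns a raw character stream into categorized tokens. The token categories (names, numbers, operators) are illustrative assumptions, not any particular compiler's token set:

```python
import re

# Illustrative token categories: numbers, identifiers, and a few operators.
TOKEN_RE = re.compile(r"\s*(?:(?P<number>\d+)|(?P<name>[A-Za-z_]\w*)|(?P<op>[+\-*/=()]))")

def tokenize(source):
    """Convert a character sequence into (category, text) token pairs."""
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if not match:
            break  # stop at any character we don't recognize
        kind = match.lastgroup  # name of the category that matched
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    return tokens

print(tokenize("total = price + 2"))
# [('name', 'total'), ('op', '='), ('name', 'price'), ('op', '+'), ('number', '2')]
```

Each token now carries an identified meaning (it is a name, a number, or an operator), which is exactly the structure later stages such as semantics build on.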
One way that machine learning algorithms attempt to tackle these NLP problems is through n-grams. An n-gram model ignores the broader context of a body of text and analyzes only the last “n” words to try to predict the next word, where “n” is the variable. A bigram is a 2-gram: the algorithm looks at the last two adjacent words to predict the next. Essentially, these n-gram models give the probability of the next word given the previous “n” words. Again, since code is much more regular than natural language, Eliezer explains that looking back three tokens is usually enough for accurate predictions. An additional benefit for code is that it is very easy to get large data sets to train and test on through sites like GitHub.
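The idea can be sketched in a few lines: count which token follows each pair of tokens in a corpus, then predict the most frequent continuation. The tiny "corpus" of code lines below is made up purely for illustration:

```python
from collections import Counter, defaultdict

# Toy training corpus of tokenized code lines (illustrative only).
corpus = [
    "for i in range ( n ) :",
    "for j in range ( m ) :",
    "for k in range ( 10 ) :",
]

# Trigram counts: how often each token follows a given pair of tokens.
counts = defaultdict(Counter)
for line in corpus:
    tokens = line.split()
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1

def predict(prev2, prev1):
    """Most probable next token given the previous two tokens."""
    following = counts[(prev2, prev1)]
    return following.most_common(1)[0][0] if following else None

print(predict("in", "range"))  # -> "("
```

Because code is so regular, even this crude frequency model starts to behave like an autocomplete for source code once trained on a large corpus.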
Word2vec is a newer machine learning approach. Earlier ontology-based methods built a giant linked dictionary containing relationships between words; while such ontologies are difficult to make, they are accurate. Word2vec instead looks at the words around a given word to build a relational understanding between words, and it translates these relationships into a mathematical representation. A few examples of how these relationships can be expressed are:
- man + many = men
- king – man + woman = queen
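The analogies above can be demonstrated with vector arithmetic. The 2-D vectors below are hand-made toys chosen so the "king – man + woman = queen" example works out; real word2vec embeddings are learned from large corpora and have hundreds of dimensions:

```python
import numpy as np

# Toy vectors: first axis ~ "royalty", second axis ~ "gender".
# These are illustrative assumptions, not learned embeddings.
vectors = {
    "king":  np.array([1.0, 1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "queen": np.array([1.0, -1.0]),
}

def nearest(vec, exclude=()):
    """Word whose vector has the highest cosine similarity to `vec`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cos(candidates[w], vec))

result = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(result, exclude={"king", "man", "woman"}))  # -> queen
```

The point is that once relationships between words become arithmetic on vectors, analogies reduce to addition, subtraction, and a nearest-neighbor lookup.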
Eliezer goes on to give a few examples of how these NLP algorithms are specifically applied to code. A common convention within programming is writing clean code: code that is understandable by others and that allows different processes to be passed and shared among individuals and teams. Machine learning can look for similarities between an agreed-upon correct code base and new code, and give warnings about what is similar and what is not. Additionally, NLP algorithms can be written to try to find bugs within the code itself. This approach looks for code that is very similar to other code, with only small differences that might represent errors. If the tokens are almost identical, a warning pops up notifying the user of a potential error.
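That "almost identical" check can be sketched with a token-level similarity ratio: if a new snippet is nearly, but not exactly, the same as known code, the small difference may be a copy-paste bug worth flagging. The snippets and thresholds below are illustrative assumptions:

```python
import difflib

# Illustrative examples: a known-good line and a near-duplicate candidate.
known_good = "if user is not None and user.active:"
candidate  = "if user is not None and user.active :"

def near_duplicate_warning(good, new, low=0.75, high=1.0):
    """Warn when two token sequences are suspiciously similar but not identical."""
    ratio = difflib.SequenceMatcher(None, good.split(), new.split()).ratio()
    if low <= ratio < high:
        return f"warning: {ratio:.2f} similar to known code -- check the difference"
    return None  # either identical or clearly different

print(near_duplicate_warning(known_good, candidate))  # prints a warning
```

A real tool would compare properly tokenized code across a whole repository, but the principle is the same: near-duplicates with tiny differences are where copy-paste errors hide.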
NLP is a powerful tool that has been used for sentiment analysis, spell check, and voice-to-text messaging, and it underlies voice assistants like Siri and Alexa. The application base is broad and, as Eliezer points out in this presentation, seems well suited to cleaning and writing code. As these processes continue to be refined, the security they are able to provide increases as well.