|
MASPLAS
'09
Mid-Atlantic Student Workshop on Programming Languages and Systems Saturday, April 18th, 2009 host: Haverford College in cooperation with ACM SIGPLAN Haverford College
Haverford, PA 19041 |
||||||||
|
|
||||||||
Abstract: Bug-finding tools that are based on approximate analysis often report false positives. Burned enough by false positives, a skeptical programmer wants to see convincing evidence that a bug report is indeed feasible. Symbolic analysis can often help produce such evidence in the form of a concrete test case that can reproduce a feasible bug. In this talk, I will describe recent work my colleagues and I have done on symbolic analysis of Java programs. In much of the talk, I will focus on the technology needed to make symbolic analysis work effectively on real-world Java applications, without requiring hand-written specifications of procedures, and without inlining procedure calls away. I will report on our experience with building a bug report validator based on symbolic analysis, and discuss some open research questions. I will also demonstrate additional applications of our symbolic analysis engine.
Abstract: Distributed Shared Memory, such as that provided by Intel's Cluster OpenMP, lets programmers treat the combined memory systems of a cluster of workstations as a single large address space. This can greatly facilitate the creation of a correct program, by relieving the programmer of the burden of explicitly transferring data: a correct OpenMP program should still work with Cluster OpenMP. However, by hiding data transfers, such systems also hide a major performance factor: correct OpenMP programs with poor locality-of-reference become correct but intolerably slow Cluster OpenMP programs.
Scalable Locality describes the program property of locality that increases with problem size (just as Scalable Parallelism describes the property of parallelism that increases with problem size. Thus, the combination of an optimization that exposes scalable locality and a distributed shared memory system should, in principle, produce a system that combines the programmer-friendly shared-memory paradigm with good performance on a cluster.
In this article, we present the results of an experiment to test this hypothesis.
Abstract: Given the complexity of the metatheoretic reasoning involved with current programming languages and their type systems, techniques for mechanical formalization and checking of the metatheory have received much recent attention. In previous work, we advocated a combination of locally nameless representation and cofinite quantification as a lightweight style for carrying out such formalizations in the Coq proof assistant. As part of the presentation of that methodology, we described a number of operations associated with variable binding and listed a number of properties, called ``infrastructure lemmas,'' about those operations that needed to be shown. The proofs of these infrastructure lemmas are generally straightforward, given a specification of the binding structure of the language.
In this work, we present LNgen, a prototype tool for automatically generating these definitions, lemmas, and proofs from Ott-like language specifications. Furthermore, the tool also generates a recursion scheme for defining functions over syntax, which was not available in our previous work. We also show the soundness and completeness of our tool's output. For untyped lambda terms, we prove the adequacy of our representation with respect to a fully concrete representation, and we argue that the representation is complete---that we generate the right set of lemmas---with respect to Gordon and Melham's ``Five Axioms of Alpha-Conversion.'' Finally, we claim that our recursion scheme is simpler to work with than either Gordon and Melham's recursion scheme or the recursion scheme of Nominal Logic.
Abstract: Models of software systems enable efficient and effective automated analyses to support software construction and maintenance. Historically, models of software systems and their components focused on representing the program's static structure and dynamic behavior, ignoring the programmer knowledge implicitly contained in identifiers and comments. We argue that taking advantage of both structural relationships and linguistic relationships can improve software engineering tools. In this poster, we present a novel software word usage model (SWUM) that facilitates creation of software engineering tools that can exploit both structural and linguistic relationships using a single representation of the program. We present an instantiation of SWUM to automatically generate phrases derived from method declarations and invocations, as well as an evaluation demonstrating that our approach achieves 84% accuracy in extracting linguistic relationships and generating phrases.
Abstract: Currently, many software systems are large and complex. Maintaining these software systems requires that programmers comprehend scattered pieces of code that are often produced by more than one developer. Programmers comprehend code by identifying segments that are relevant to their particular maintenance task, comprehending the relevant segments, and then making modifications to perform the maintenance task. We hypothesize that software maintainers use both the programmers' words, which we call textual clues, and the knowledge of programming language structures, which we call program structure, to identify and understand relevant code segments. This work focuses on using data mining of large open-source software projects to obtain verbs that are used interchangeably as equivalent with the goal of learning what textual and structural clues programmers use in the identification and understanding phases of software maintenance.
Abstract: Identifiers play an essential role in helping engineers understand programs. This research is particularly interested in identifiers that appear as the names of fields in structures. Such identifiers are important because their usage can be distributed throughout a program, and in the case of an API, they are used by many programmers. Therefore, domain information is likely to be contained within the name. This work examines the structure of field names, since structure can provide additional information about meaning. In a study of about 50 programmers creating APIs from a specification, the structural patterns from natural language such as part-of-speach are examined to determine if universal structural conventions are followed and if the specifications effect the structure of the identifiers.
Good identifier names are essential in giving humans the ability to comprehend source code. Naming these identifiers can be a challenging task since the meaning of the identifier often needs to be conveyed in abbreviated form. Natural language plays a large part in the naming of these identifiers. Patterns found in natural language often are found in source code identifiers. The specifications given to programmers for a task can affect the way that they name their identifiers. We are interested in looking at the natural language patterns found in specifications and source code in order to examine the way that programmers name their identifiers.
Abstract: Good error recovery for compilers depends on accurate and complete diagnosis of errors. Improving error recovery requires comprehension of a large and complex code base, in order to locate the places where errors can be identified and places which handle them. This involves significant amounts of cognitive effort to identify these error-related static locations in the compiler, and the dynamic points in compilation where errors are recognized. Essentially, the error recovery concern is a global design issue which is entangled with many other functional concerns, and whose implementation is frequently scattered across different program elements. Current implementations do not explicitly identify error-related control dependencies and fail to separately characterize the actions to be taken.
In the context of the AspectJ compiler, we modularize this error concern as join points-and-advice aspects which provide improved modularity by explicitly declaring the loci of error detection, and providing extension points as advice to handle these errors. We apply this modularization to support a diverse set of examples of error recovery. As a result, the compiler writer no longer needs to navigate and understand the entire compiler source in order to replace or extend error recovery actions.
Abstract: Modern software systems are typically large and complex, making comprehension of these systems extremely difficult. Experienced programmers comprehend code by seamlessly processing synonyms and other word relations. Thus, we believe that automated comprehension and software tools can be significantly improved by leveraging word relations in software. In this paper, we perform a comparative study of six state of the art, English-based semantic similarity techniques and evaluate their effectiveness on words from the comments and identifiers in software. Our results suggest that applying English-based semantic similarity techniques to software without any customization could be detrimental to the performance of the client software tools. We propose strategies to customize the existing semantic similarity techniques to software, and describe how various program comprehension tools can benefit from word relation information.
Abstract: Automated software analysis tools and recommendation systems increasingly rely on natural language information from comments and identifiers in code. The first step in analyzing words from identifiers requires splitting identifiers into their constituent words. Unlike natural languages, where space and punctuation are used to delineate words, identifiers cannot contain spaces. One common way to split identifiers is to follow programming language naming conventions. For example, Java programmers often use camel case, where words are delineated by uppercase letters or non-alphabetic characters. However, programmers also create identifiers by concatenating sequences of words together with no discernible delineation, which pose challenges to automatic identifier splitting.
In this paper, we present an algorithm to automatically split identifiers into sequences of words by mining the frequency of potential substrings from source code. With these word frequencies, our identifier splitter uses a scoring technique to automatically select the most appropriate partitioning for an identifier. In an evaluation of over 8000 identifiers from open source Java programs, our Samurai approach outperforms the existing state of the art techniques.