Pages

Friday, November 30, 2012

New Version of PDFResurrect... Now with enhanced EOF locating.

The title, while any computer scientist will immediately think is a joke... is in-fact not a joke. The latest release (Version 0.12) is a bug fix release and is now available here. The main fix regards how the tool was locating the EOF token that PDF writing utilities place at the end of different versions of PDFs. Previously, if an EOF token was split across a 256-byte boundary, then Mr. Resurrect would not have found it. The new algorithm might be a tad slower; however, it is more precise.

-Matt

Monday, November 26, 2012

Answering a begging question... what does the heap look like?

So I was curious. What does a process' heap look like? If I were to take a hint from conservative garbage collectors, and scan the memory of a process, what would it look like? Well, to answer this, I threw together some software here, and ran a few GNU programs to see what they look like. I then try to relate memory allocation calls with the heap layout. The conservative garbage collector hint I used was merely the following. When my tool scans the heap, address by address, if the value at each address looks like another address residing in the heap, then my tool says the heap address points to the other address. Anyways the results are munged to crap out a .dot file. More details and such are here.

-Matt

Sunday, November 18, 2012

Syntax Party 2012

Wow. So I decided to make it down to the Syntax party 2012... a Melbourne demoscene party. It's nice to see the creativity that people put into their productions, and to see that a demoscene is still rocking.

-Matt

Monday, November 12, 2012

Go Go Gadget Now().Nanosecond()

Before I even get started, the "issue" I discuss below was resolved just three days ago. Of course, there will still be some time before this makes it to the next release of Go and also when GCC updates their libraries. Thanks Brad for pointing this out! I had no idea. Well, I don't want to delete this post... so for historical reasons it follows:

Time. It is such an unusual thing. I suppose it is "unusal" as it being such an abstract quantity. I cannot just reach out into the air and pull down some time. Yet, we humans try to quantize the abstract. After all, humans like to understand how things work. And we like to measure the things that do work, or measure things to figure out why they work the way they do. In other words: we quantize time, break it up into distinct bits and pieces, so that we can measure and understand it.

The more we chop-up time, the more accurate we can define something in all four dimensions (three dimensions of space, and one of time). For instance, I can tell a person to meet me at a particular location on the earth, which has 3 dimensions (e.g. meet me on floor 4 of the building at corner of Foo and Bar streets). But when should my acquaintance meet me? Well, I tell them a time. Possibly a really really precise time, at 7PM and 4 seconds and 3001 nanoseconds. I'm a robot. Anyways, I am digressing from the purpose of this post.

It is not uncommon for programming languages to provide a way (library) to access time. This time comes from the system that is executing the program. For instance, "gettimeofday()" and "clock_gettime()" are just two such functions that ask a POSIX (e.g. Linux, BSD...) system for what the time is. These two functions return a value that represents time. But how precise the result is, defines how fine you can measure an event. I want to look at the time routine provided by the Go programming language.

The following is my analysis of the Go 1 "time" package's "Now()" function, as provided by gcc 4.7.1 and 4.7.2 for a 64bit x86 architecture. We investigate the possible lack of precision from the resulting Nanosecond member. This is not an analysis of how the Google 6g compiler and provided Go 1 "time" package implements things. I did peer through the Go 1 "time" package provided by the 6g Google Go compiler version 1.0.3. It seems that their routine works similarly, so for all intensive purposes, we can assume what I write below for gcc is similar to the 6g compiler.

Firstly, I notice that the Go API provides a function for getting the nanoseconds from the current time of day. That's nifty. The following code should accomplish this.

package main
import "time"
func main() {
    println("The current nanosecond is:", time.Now().Nanosecond())
}

The above function obtains the current time, and then returns which nanosecond of the current second the time is. See, time, in this case the second, can be divided into smaller increments. The API says that the nanosecond returned is the nanosecond of the current second of time. Now, recall that a nanosecond represents just 1^-9 (0.000000001) of a second. That's a pretty darn small increment of time. This means that the "time" package (library) of Go should use a data structure large enough to contain 1^9 or 1000000000 values.

Well, a look at the Go runtime library (gcc and 6g) uses a 32-bit signed integer (int) to represent the current nanosecond of the current second. As any good data-structure-athiest-skeptic needs to ask is: "Are 32-bits large enough to hold a value of 1000000000?" Actually, is the "int" data type wide enough to contain the maximum value of a nanosecond? Remember that an "int" is signed, so one bit does not represent a time, rather that single bit is dedicated to representing a positive/negative (signedness) state of the value. I can't imagine a negative time, and zombie-Einstein might puke if you were to venture into the past.

The time library for Go provided by gcc uses the "gettimeofday()" system call, or virtual dynamic shared object vDSO, function provided by the operating system. A POSIX compliant operating system should provide the latter routine which obtains the time from the system. The Go "time" library calls this routine to obtain time. Go will then (convert) the returned value into a Go structure of time. I did not see in the Google 6g version of the library where they make an explicit call to "gettimeofday()." However, I did see where the library accesses members of the result, which are similar as to what "gettimeofday()" returns.

An integer can hold a range of values and has to be large enough to at least contain the values of such a range. 31-bits of value and one to represent signedness. Go implements "int" as being 32-bits, is is the data structure used for storing nanoseconds. So 2^31 - 1 or a maximum value of 2147483647. Which... YES... is large enough to contain enough nanoseconds for a second. (I had a typo in my original post and I thank the anonymous poster who cleared that up for me).

The POSIX standard defines that the "gettimeofday()" function returns seconds and MICROSECONDS... I'll say that again, microseconds, since the epoch of bell-bottoms and afro-sheen (January 1, 1970). In fact, the POSIX standard is a bit loose, and says it should return at least seconds and microseconds (leaving the doors open for other values). But, a browse of the code in gcc and 6g shows that they take a microsecond value and treat that as a base for nanoseconds.

So, what then is the Go "time" library doing? Well, when we call "Now()" we get a "Time" object. This object is initialized by an architecture specific routine which queries the system (using a system call) and sets-up the "Time" object, including the nanosecond member, from the result of that system call. What the x86 64-bit Linux implementations seems to be doing, for the "Now()" routine is effectively querying "gettimeofday()" and then taking the microsecond value and multiplying it by 1000 to represent nanoseconds. So, for those using the nanosecond value returned by a "Time" object created by "Now()", the last three digits will be zero. This "Time" object can be further manipulated (e.g. using "time.Add()", "time.Date(), or "time.Unix()") which might modify the nanosecond value, presenting a more precise representation of nanoseconds. This means that the initial return of time from "Now()" will have a nanosecond value that is always missing the lower three digits (and bits) of the value. Well, it turns out that 999999000, the maximum value of nanoseconds Go can return by the "Time" object created by "Now()" is still .99 of a second, but it's just not the high precision that might be desired. See, the lower three bits are missing... that latter value converted to binary: 0b111011100110101100011000011000

For most people this is not a big deal. But being aware is always a good thing. However, if you are using "Now()" for nanoseconds, you might in fact be looking for a high precision version of time.

TLDR; The Go runtime "time" package returns a scaled microsecond value to represent nanoseconds and not the actual nanoseconds to a high granularity, when time is created using the "Now()" library function. The lower three digits (1000 values) are always zero.

-Matt

Additionl Resources:
GCC
Go (and Google Go 6g Compiler)
gettimeofday()
C Standards Website

Sunday, November 11, 2012

Brian Cox

Last Friday night the amazing physicist from Manchester University, and once-music-star of D:Ream, Brian Cox gave a talk about the universe and life, from the infinitesimal to the very very large. Note that this was not a TED event, and the latter link is just a video from a TED talk he gave (I wanted to show the uninformed how bad-ass he is). To top it off, it was Carl Sagan's birthday. Truly an amazing evening. I have said it many times, but I think a sign of genius is having the ability to communicate and explain incredibly complicated subjects with the utmost elegance. Brian does this... like the greats of Feynman and Sagan.

-Matt

Friday, November 9, 2012

Devil Hands: The Wonders of a Loosely Typed Compiler

GCC has been something that has captured my attention for years. It seems there is a set of chosen developers whom have bowed to the darklord himself, to create a tool which is just mind-blowingly fascinating. gcc takes text and makes binaries! Sure, that's what most compilers do, but gcc just towered over me with as being, almost, a pinnacle of hackery. It's just shear black magic! And I like the dark arts.

For years I have enjoyed being a software developer; however, I wanted to get my hands dirty on a lower-level than just application development. For some reason, I wanted to work on tools that others use to construct their tools (and applications). Enter the realm of compilers. Since gcc is open source, and many people and organizations rely on it, I felt I would sell my soul. After all, idle hands do the devil's work. So, I decided to put my devil hands to work.

Well, part of my wanting to get into compilers eventually led me towards a PhD... still working on it. Well, I should be working on it, but I am writing this post instead. Anyways, during a PhD meeting the other day, one of my advisers mentioned that gcc has something like over 100 different representations for a certain data structure used inside the compiler. This structure is used to represent various things about the program that gcc is compiling. Things like variables, statements, declarations, etc. This adviser is right, in fact, he is never wrong. No, really, he knows all.

Well, I have been dipping my hands in the secret sauce that is gcc for a while now. In fact, one of the reasons I wanted to get a PhD was to hack on gcc. So my compiler driven region based memory management uses GCC-plugins to accomplish the research. Anyways, I have mentioned to my advisors many times that, yes, in fact the "tree" node data structure in gcc can be anything!

So what is the big deal if a data structure, used inside the compiler, can represent anything? Isn't that polymorphism at its finest? Well... no. It means that the internals of gcc are essentially untyped. Heck, a "tree" is defined as a union consisting of 41 structures. To top it off, there is another file (tree.def) that defines about 195 (in gcc 4.7.1) different "things" a tree can be! This is all fine and dandy, if you know what you are doing. But it means one must be an 31337 ninja when writing gcc code. The danger comes in referencing data inside the tree node. If accessed improperly, the data will not have the correct meaning. While there are a handful of data structure inside a "tree" node, there are over a hundred of possibilities which use certain fields or not. And if you do not ask the type, at runtime, what the type is, and perform your reading of the data properly then *boom* you get bad data. This is what I call a type bomb. The result of a lack of typing on a data structure. In other words, you have a grey box, and you basically have two things you can do, before pulling data out of it. It has structure, but certain parts are not always populated. Either you can ask the black box what is inside it, or you can assume what is inside it. A lack of type safety makes reading data can be dangerous. Since it permits the possibility that you can read the data incorrectly. But, I like this, it makes playing with gcc kinda fun, in a masochistic way.

To reiterate, how you read data in a "tree" instance can vastly change. Suppose that you want to check that whatever the "tree" represents is a global (e.g. global variable). Well, you can use the "is_global_var()" predicate. Just pass a tree instance to it, and it will return true or false. But, one must be very cautious. See, "is_global_var()" only works on declaration nodes, and not the handful of other "things" a "tree" can represent. I can pass a variable declaration "tree" node and the predicate will produce a proper truth value. However, if I pass it a particular instance of a variable, such as given by a SSA representation, then the result is not sound. Since the predicate will always return true-or-false, a value will be returned, it just might be wrong.

Consider the following block of C code:

void foo(void)
{
    int x;
    x = 1;
}

During a walk over the statements in this code, well actually, gcc will be walking a more verbose representation, its intermediate (GIMPLE) representation. Anyways, suppose your compiler pass wants to check if the 'x' variable in the declaration statement 'int x;' is global. The "is_global_var()" will return a proper truth value. However, if you pass the 'x' in the 'x = 1;' assignment statement, the value returned by the predicate cannot be trusted. Since a tree node can be thought of as an opaque type kinda (like a "void *") you can read a "variable" type as if it were a "declaration." And there is nothing wrong with that, no compiler warning. The "is_global_var()" does not check first what type of "tree" node it was passed, and will just report the result if the node were a declaration. This is dangerous, and is what causes the lack of sanity. The type enforcement comes in the comment above the "is_global_var()":
From gcc-4.7.1:

/* Return true if T (assumed to be a DECL) is a global variable.
      A variable is considered global if its storage is not automatic.  */

TLDR; Untyped data structures are convenient, but require that the programmer be an omniscient monkey-saurus and know what is up with the data they are flinging. Tread lightly young warriors. Oh, and to get an idea of what "things" a tree can be, check here. From gcc 4.7.1.

Resources: gcc-4.7.1

-Matt

Sunday, November 4, 2012

Pickin' up the Trash

I assume a few of y'all are wondering what I have been up to. Well, I am working less hours on the timing stuff and more hours on my PhD. In fact, I just wrote a pretty basic garbage collector. I want to weave it into my region memory allocator. In fact, after some more thought, this collector is not entirely complete. It does collect unused data, but it makes a few assumptions, mainly that a pointer does not point into the middle of an array/allocated chunk of memory. This is a horribly bad assumption, especially for what I want, a precise (as opposed to a conservative) collector. My collector works by collecting type and function call data at compile time (gcc plugin that I wrote). Then the collector runtime logic is linked into the resulting binary. The logic behind the allocator is a variation of Cheney's Copying Collector Algorithm. Anyways, I'm proud of the collector. The assumption I have made is part of the problem, and corresponding solution, that my advisers and myself want to write-up for a paper. I hope that goes well. And me? Well, I have a few more months (July is when my scholarships die). So I hope I can start thesis writing soon.

-Matt

Thursday, November 1, 2012

Meanwhile in Australia...

I found this in the bathroom at work earlier today. Really?! Are the upper-decker and pre-bowl really problems in this country? I surly hope most adults know how to handle themselves, and have been properly potty trained. I don't know if I should let this government poster be an insult to my intelligence or not.
-Matt