Monday 23 February 2015

Semantic errors

I'm overhauling my semantic error handling in general. Right now, every error just throws an exception, but if I want to implement nice error handling in an environment like an IDE, then it's going to have to be more intelligent than that. Right now, I think I'm going to classify errors into four kinds.

I have "Fuck, this operation failed" errors. These errors basically imply that the user attempted to do something they should not *and* that this affects the immediate semantic meaning of this operation- for example, an expression that tries to access a nonexistent member. You can't calculate the semantic properties of this expression in this circumstance. My current intention is that I will throw them as exceptions and catch them at AST analysis points. These points already support incremental re-analysis and such things so extending them to handle this situation would not be severely problematic.

I will also have "This operation failed, but not in a way that immediately affects the analysis outcome" errors. A simple example would be a function that is exported with an incompatible signature. When this occurs, it's obviously impossible for code generation to proceed, but I can continue analysing any use of that function as if there were no error. So far, I basically intend to just make a note of each one; then I need to somehow "tie" it to the originating condition.

Thirdly, I'm thinking about configuration errors. For example, right now, indexing an array out of bounds throws an exception, but generating that exception depends on typeid(), which depends on the <typeinfo> header. So if the typeinfo headers can't be found, that's the compiler being plain misconfigured- not a fault in the user's code.

Finally, we have our good old friend Internal Compiler Error. This would effectively be a bug in the analyzer. Currently I have various assertions scattered throughout the code but at least a few of these could be refactored as ICEs.
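
As a rough sketch of how these four kinds might be represented- all names here are hypothetical, nothing like this exists in the codebase yet:

    #include <stdexcept>
    #include <string>

    // 1. "This operation failed": thrown, caught at AST analysis points.
    struct SemanticError : std::runtime_error {
        using std::runtime_error::runtime_error;
    };

    // 2. "Failed, but analysis continues": recorded, then tied back to
    //    the originating condition somehow.
    struct DeferredError {
        std::string description;
        // ...plus some handle back to the originating declaration.
    };

    // 3. Configuration error: the compiler itself is set up wrong.
    struct ConfigurationError : std::runtime_error {
        using std::runtime_error::runtime_error;
    };

    // 4. Internal Compiler Error: a bug in the analyzer, not the user.
    struct InternalCompilerError : std::logic_error {
        using std::logic_error::logic_error;
    };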

Thursday 27 November 2014

Clang interop and VS extension

I've gotta get back cracking on Clang interop. This is one of the main things I needed to do before, but that got delayed due to circumstances. I need to present an interface they can use for IR interop behind the codegen layer, then ideally, provide something like an ExternalCodegenSource you can use.

From memory and according to Trello, I fixed Clang interop so I could lay out Wide types however I like. However, this still requires a direct pointer index to find all members, which means that ABI-indirected types can't be supported with a native interface in C++.

In addition, now that I'm finished refactoring my parser, it's time to take another swing at VS extensibility and the C API. That hasn't been maintained in a long time, but I have the power to offer many more capabilities now. For example, in the past, I had to push QuickInfo out for every location where I wanted QuickInfo to be available. Now I can offer the reverse- as in, the user provides a source location, and I tell them what's there. Not only is this more efficient, but it makes for a nicer interface, and I can also provide multiple results- for example, if the location is in a template function instantiated with many different arguments. Other functions, like automated refactoring, can also be supported. If I can demo some R#-style refactoring, that would be nice.
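
As a sketch of the shape that query might take- names invented here, not the actual C API:

    #include <string>
    #include <vector>

    // Hypothetical pull-based query: the environment asks about a
    // location, and the compiler answers with everything it knows
    // is there- possibly one result per template instantiation.
    struct QuickInfoResult {
        std::string display;  // tooltip text
        unsigned begin, end;  // source range the info applies to
    };

    std::vector<QuickInfoResult> QueryLocation(const std::string& file,
                                               unsigned offset);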

This also directly implies that I can offer incremental re-lexing and re-parsing, and it potentially opens a gateway into incremental re-analysis too.

Sunday 28 September 2014

Random-access into UTF-8

I've heard many people state that you can't random-access (that is, read a random codepoint in O(1)) into UTF-8, or other variable-length encodings. This, however, is simply not true. It's possible to generalize deque to provide this functionality.

For the purpose of this article, we consider deque to be an array of arrays- where each subarray has a constant maximum size. For simplicity, we'll consider it as vector<unique_ptr<array<T, N>>>. Thus, for some index i, we simply use i / N to find the subarray, and i % N to find the subarray index. This gives us the final element location in our array of arrays.
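
In code, the whole lookup is just this (a minimal sketch):

    #include <array>
    #include <cstddef>
    #include <memory>
    #include <vector>

    // i / N picks the subarray, i % N picks the element within it.
    // Both steps are O(1).
    template <typename T, std::size_t N>
    T& index(std::vector<std::unique_ptr<std::array<T, N>>>& chunks,
             std::size_t i) {
        return (*chunks[i / N])[i % N];
    }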

The key insight here is that since each subarray has a constant maximum size, it actually doesn't matter whether we use a linear-complexity algorithm to locate the final element inside the subarray- linear in a constant bound is still constant.

So imagine a specialization of deque<codepoint>, which is a vector<unique_ptr<vector<codeunit>>>. Each subarray can hold N codepoints, just like before. To find the subarray holding the codepoint at index i, we perform the same i / N step. Then we perform a linear scan, decoding each UTF-8 codepoint in that subarray until we reach codepoint i % N- but since the scan is bounded by our constant factor N, the lookup still has constant complexity.
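
Here's a minimal sketch of that lookup, with a hand-rolled decoder that assumes well-formed UTF-8 and skips validation:

    #include <cstddef>
    #include <memory>
    #include <vector>

    using codeunit = unsigned char;  // one UTF-8 byte
    using codepoint = char32_t;

    // Decode the codepoint starting at *it and advance it past it.
    codepoint decode_next(const codeunit*& it) {
        codepoint c = *it++;
        int extra = c < 0x80 ? 0 : c < 0xE0 ? 1 : c < 0xF0 ? 2 : 3;
        if (extra) c &= 0x3F >> extra;  // keep the lead byte's payload bits
        while (extra--) c = (c << 6) | (*it++ & 0x3F);
        return c;
    }

    // O(1) lookup: i / N finds the subarray, then a scan bounded by
    // the constant N finds codepoint i % N within it.
    template <std::size_t N>
    codepoint at(const std::vector<std::unique_ptr<std::vector<codeunit>>>& chunks,
                 std::size_t i) {
        const codeunit* it = chunks[i / N]->data();
        for (std::size_t skip = i % N; skip != 0; --skip)
            decode_next(it);  // skip whole codepoints
        return decode_next(it);
    }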

Trading away a little of that simplicity, we could go with unique_ptr<codeunit[]> instead, using the fact that all but the last subarray must be full of exactly N codepoints to find the end of each buffer. This gives us the exact same core structure as before (and as deque<int>); it's just that the raw byte size of the subarrays can shrink to accommodate smaller encodings.

Arguably, it's questionable whether this is superior to just using a deque of 32-bit codepoints directly: in theory it offers memory savings, but I'm not sure how that plays out in reality, and it's also questionable how useful random-accessing codepoints would be anyway.

Monday 22 September 2014

Employment

Welp, I found myself work, so I won't be spending all day dicking around on Wide anymore. Life's tough. And hopefully financially independent. For once.

Saturday 13 September 2014

Constants, variables, and laziness.

Today I finally got rid of that dumb "Every string literal has a unique type" thing. It's a holdover from pre-constant-expression days. I nuked the code paths that used to handle it- and discovered that they also handled my local-variable-as-reference solution.

So I decided to just whack it and change that.

Now I will use "var : type = value" for variables whose type needs to be explicitly specified. It will also be useful for members- contrast "var : type;" with "var := value;" as an NSDMI- and for function arguments. Speaking of which, I should look into defaulted function arguments and such again.

I've been thinking about using type & type as a tuple syntax- so you could write f() := int64 & int64 to denote a function returning a tuple. My current module export code REQUIRES that all types have a notation, and I'm not sure that's a bad thing- it's certainly motivational to fix such issues. As a pairing, I've been thinking about using | to denote a kind of language-implemented variant. One of the keys here is that the compiler can translate it into different run-time representations. For example, if you said
 
    f(arg : int32 | int64)

then the compiler has no obligation to differentiate them at runtime. It could also simply generate two different overloads and branch on call if necessary.
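
Here's a minimal C++ sketch of that second strategy- the lowering is hypothetical, this is just the shape of the generated code:

    #include <cstdint>
    #include <cstdio>
    #include <utility>
    #include <variant>

    // Hypothetical lowering of the single Wide function
    //     f(arg : int32 | int64)
    // into two concrete overloads.
    void f(std::int32_t arg) { std::printf("int32: %d\n", arg); }
    void f(std::int64_t arg) { std::printf("int64: %lld\n", (long long)arg); }

    // A call site that statically knows the alternative picks an
    // overload directly; only a genuinely dynamic argument branches
    // on the stored tag.
    void call(const std::variant<std::int32_t, std::int64_t>& arg) {
        if (auto p = std::get_if<std::int32_t>(&arg))
            f(*p);
        else
            f(std::get<std::int64_t>(arg));
    }

    int main() {
        f(std::int32_t{1});  // static: no branch, no tag
        call(std::variant<std::int32_t, std::int64_t>(
            std::in_place_type<std::int64_t>, 2));  // dynamic: branch
    }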

I'm also thinking about implementing something like tuple[i], as long as i is a constant expression.

Long story short, I'm a smidge burned out on modules. I've been faffing around with dependencies and implementations for too long.

Wednesday 10 September 2014

Dead Zone

I've been suffering a double whammy of 3RFTS and Planetary Annihilation. My brain is concrete. Using the same machine for both fun and work is problematic.

Friday 5 September 2014

Module dependency data

So today I successfully exported a user-defined type with opaque members (i.e., the data members were not exposed to the caller). There are three more key features I want to finish up for modules.

First is "virtual headers". How this will work is (somewhat) simple. All you do is nominate a directory as a "header" directory. I copy all the headers in that directory (recursively) and stick them in the module archive in a certain folder. When the consumer uses the interface, the headers in the archive can be accessed as their filepath relative to the original directory they came from- so effectively, import/headers is added as an include directory. So if you create, say, a "Boost" module, then you can add Boost headers right into the module and ship it, so the user doesn't need to crap around with getting the headers.

Second, I want to separate interface and implementation modules. Right now they are all the same thing, but I need to split them off so that you can create a module against an interface of another module, then link in various implementations later. As part of this feature, I also need to decide what information modules need to hold about their direct and indirect dependencies, and how to match implementations with interfaces.

I've been thinking about marking each interface and implementation with a UUID, and then referring to each of them that way. Then you could get implementations from a central database just by asking for an implementation of a given UUID. For direct and indirect dependencies, I could simply list them as UUIDs. If I kept the full interface around, then using a dependency directly would be simpler.
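
Something like this shape, maybe- all names hypothetical:

    #include <map>
    #include <string>
    #include <vector>

    using Uuid = std::string;  // stand-in for a real 128-bit UUID

    struct Interface {
        Uuid id;
        std::vector<Uuid> dependencies;  // direct and indirect, by UUID
    };

    struct Implementation {
        Uuid id;
        Uuid implements;  // the interface this module implements
    };

    // The central database: "give me an implementation of this
    // interface", keyed by the interface's UUID.
    const Implementation* find_implementation(
        const std::map<Uuid, Implementation>& db, const Uuid& iface) {
        auto it = db.find(iface);
        return it == db.end() ? nullptr : &it->second;
    }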

Thirdly, I need to implement one of the primary features- exporting an implementation against an existing interface. My existing hack for imports and exports (create valid Wide source with some hidden attributes for binary interfacing, then just include it directly) is probably not going to work here. I think I can still re-use the basic subcomponents of the lexer and parser, but the analyzer will likely require special support.

I think that #1 will be easy to finish up, the first part of #2 shouldn't be too hard (but the second part may be harder), and #3 will be moderately difficult. Fortunately, the Wide compiler is already extremely abstract in how it represents most things, so implementing them in a funky-tastic way shouldn't be too bad. I think it will make a clear case that Wide can do things better when I can add data members and change the size/alignment without breaking binary compatibility.