Wednesday, June 22, 2005

Here are some questions I was recently asked

1. Can you describe an example of when a calculated cell would be used?  What about an example having multiple passes?

In 2005, Scripts basically replace calculated cells (although calc cells are still supported). They are alternate mechanisms for accomplishing the same internal behavior. In 2000, one simple way in which I used calc cells was to have an IIF where value if true was a number and value if false was a string—I just made a calc cell for the subcube that needed the string and applied the string as the value. (In 2005, you can have different types for value if true than for value if false so this is no longer necessary, but you could definitely use a calc cell (or script command) to apply a special format (e.g., a background color) to a specific subcube (e.g., members of a heading level).

More typically scripts are intended for financial calculations—one example would be if you need a different formula for net income in the current year (where taxes are not known) than in the prior years where they are finalized. A script command could also apply a security layer, since scripts are applied before security. So, for example, you could use a script command to replace the actual salary at the leaf level with the average salary of the group (without affecting higher level totals), and end users would not be able to see the detail.

A classic example of multiple passes is when you need to calculate a tax amount based on net income—after tax. So you have a circularity that you need to iterate down to a negligible amount.

2. What does the flattened command in DMX do?  Why does DM data need to be flattened?

I don’t know. I don’t write DMX, because the DTS Data Mining Query task can create it for you. It happens to put FLATTENED in, so I let it—in the same way that it puts in PREDICTION JOIN. I don’t have to understand DMX syntax in order to use Data Mining. And to me that’s a good thing. (But you have to put up with my soapbox that to use Data Mining effectively, you do have to understand at least enough statistics to know what would make a good source data set.)

3. What is the meaning of native vs. managed code?  How is that relevant to SSIS work?

Managed code is .Net and uses the Common Language Runtime (CLR). It’s “managed” in the sense that code that uses the CLR exclusively can be marked as “safe”, which means it cannot damage other processes. Older code—VB6, C, C++, etc.—is compiled directly to the operating system machine code (at least indirectly), so it is “Native”. SSIS (and its libraries) was written in C++, which means the code generated is “native” or “not .Net”. In order to access the libraries from a .Net language, a “wrapper” is needed—and provided. In principle, “native” code is “faster” than “managed” code—because there’s a layer of overhead missing, and that’s the ostensible reason SSIS is written using native code. But Microsoft encourages managed code (partly for ease of writing and partly for security), and even if you write a custom transformation, you can do it using C# or VB.Net and it will be “almost” as fast as if you used native code.

My own opinion is that it’s easier to use the SSIS libraries if you keep in mind that they are not fundamentally .Net code, so you can set aside some expectations of how a typical .Net object model works when using the SSIS libraries. Most .Net developers I’ve talked with find the SSIS library functions confusing (more confusing, for example, than Analysis Services’ AMO libraries).

4. Why do you say that there is no SSIS server?  What runs the packages?

When you write a SSIS custom transformation and are testing it in debug mode, the SSIS designer runs the package. You can debug an SSIS package without having the service running. But you can’t attach a DLL (your custom transformation) until the package is running. That’s why you need to put a breakpoint into the package and get it started before you can attach your DLL to the debug process. This is different from SSAS, where even during design time, you can’t browse data until you deploy (perhaps to a test server), and it’s the msmdsrv process that serves the data, so on the one hand, you can’t debug your SSAS User Defined Function DLL without starting the SSAS Service, but on the other hand, you can attach your DLL to the msmdsrv process without having to pre-set a breakpoint in the cube somewhere.

It was my understanding (in the Beta 2 timeframe) that the SSIS Server service did not need to be running in order to run the package (using DTEXEC or DTEXECUI), and that the service allowed you to a) kill running jobs if necessary and b) manage storage of packages either in MSDB or a “relative” file system. But back then, I had a hard time getting DTEXECUI to keep from crashing at all anyway. In the more recent CTP builds, where DTEXEC actually works, it appears that the service does need to be running in order to execute a package, but I’m not 100% sure of that.

5. In DM slides you discuss how statistics creates the model then looks for the statistics and in DM it is the reverse.  Could you please provide a brief example so I can get a better sense of what this means.

In the statistics class I took as part of my MBA program, we had to build regression models. We had to first think of logical relationships, second create a mathematical model to represent the relationship, and finally test the model by running data through it. When we tried to shortcut the process and run regressions against the data in search of good correlation scores (and then made up a logical “reason” afterwards), the professor accused us of “data mining”. The problem with “data mining” was (and is) that you can find correlations in historical data that may not have predictive value. (As your mutual fund prospectus states: “Past performance is no indication of future results” or as the slide says, quoting my statistics professor “Correlation does not imply causation.”)

Data mining models today do exactly what we were told not to do back in the late Jurassic period when I was in B-school. They look for correlations in the data and create coefficients for the models based on that data. One big difference is that the models are themselves designed to look for specific types of correlation (note for example Association model for Market Basket analysis), but it is still important to test the model (perhaps by splitting the available historical data in half) and to think about the logical basis for the correlations.

6. In the old AS2K course we created ROLAP aggs and went out to the db to see them.  This was great for helping people to understand what an aggregation is.  I tried doing this in our AS Core demo in Yukon, but could not see the aggs.  Do you know how to do this?

I couldn’t get ROLAP aggregation tables either, but I haven’t tried since Beta2. I agree that they were useful for instructional purposes, but not for production—so I would have mixed feelings if it really is not possible in RTM.

7. When discussing the idea of the UDM, a key characteristic mentioned is that it can decide whether of not to pull data from the cubes or the relational db.  Does this decision process occur in any situations when proactive caching has not been enabled?

It really doesn’t have a lot to do with proactive caching. When you create a dimension or measure group, you can specify (in the “UDM”) whether storage should be ROLAP or MOLAP. That’s what makes the fundamental decision.

It is true that when you enable pro-active caching, it’s possible for even a query against a MOLAP measure group to get redirected to the relational source, but I can’t imagine that being desirable in very many situations. My fundamental belief about query speed is that consistency is even more important than speed. So if a query takes 2 seconds 95% of the time, but occasionally takes 20 minutes, that’s worse than taking 4 seconds 100% of the time. That’s one of the problems I have with (data driven) proactive caching on a measure group—if it increases the risk of fluctuation in query speed, I won’t want to use it (very often). If you can force it to keep using the old MOLAP cache until the new one gets completed, that would make it more acceptable to me. My informal surveys lead me to believe that it’s a relatively small percentage of applications that require the level of “real time” update that would require data driven proactive caching on measure groups.

8. Is the UDM really just a fancy\new term for a cube?  Or is there something different between a cube and a UDM?

My understanding is that the goal in 2005 is to be able to put the entire data model into one logical cube, in which case cube = UDM. If, for some reason, you split the data into multiple cubes, I would think that the combination of cubes would = UDM. I have been trying to find a case where it is truly impossible to put an entire data model into a single cube. The only examples I’ve seen have been cases where it could have been put into a single cube. In the April CTP, Adventure Works (the SSAS model) has a separate cube for some data mining models. I don’t know whether that is permanent and/or necessary.

[This part is a total tangent from your question, but whatever…] The real situation where I think you would want to separate the data model into multiple cubes is one where a large enterprise wants to have regional servers with data specific to each region, and then a corporate server with high-level aggregations from all the regional servers. I don’t think this would be possible to model with a single cube or even with multiple cubes in a single UDM—because you don’t want all the data on a single server; you want it distributed. I think you would need to create two logical models ( = UDM = Cube), one for the region and one for the corporate, and create several physical instances of one (one for each region) and one instance for the corporate summary. But this would not allow for seamless browsing from a single client PivotTable. It might be interesting to think through whether this scenario would be possible with a single UDM, using multiple distributed partitions for the regions—but I haven’t gotten that far yet.

9. Is there a way for an attribute hierarchy to be excluded from the "real" cube.  For some reason I thought I saw somewhere that there is, but I am not seeing how you could have an attribute hierarchy that is not in the cube.

I’m not quite sure what you mean. If you want to exclude a dimension from some measure groups while keeping it in other measure groups, you do that in Dimension Usage. If you want an attribute to be part of a dimension, but don’t want it to be a structural part (of any measure group), then you set AttributeHierarchyEnabled = False for that one attribute.

If a dimension is associated with a measure group, the granularity attribute and all attributes that are (directly or indirectly) member properties of the granularity attribute are associated with that measure group. You can make an attribute unavailable for physical storage (i.e, for inclusion in an aggregation), by setting the AggregationUsage property of the attribute (in the Cube Definition) to None—but that (unfortunately) applies to all instances of the attribute in the entire cube, not just a specific measure group (or better, a specific partition).

You could use Perspectives (or Security) to make an attribute hierarchy invisible, but that is logical, not physical, so it would still be part of the “real” cube. I think it’s better to think of dimensions  

10. If there are 30 attribute hierarchies in a dimension, is that like having 30 dimensions in the old AS2K?  Is there tremendous data explosion with something like that?

It’s more like having 30 levels in AS2K because (by default) in AS2K each level was included in the structure of the data partition and was available for inclusion in an aggregation. In AS2K, a cube with a lot of levels would require a lot of time for designing aggregations, even if the size of the final cube wasn’t that big, just because AS has to consider all the possible permutations when designing aggregations for a given amount of disk space. In 2005, it’s a lot easier to have tons of attributes than it was in 2K to have tons of levels, which is why so many attributes are left out of the right column of the aggregation design wizard—just to keep the number of permutations manageable. Data explosion is still limited by the aggregations you design (i.e., Max size of data limit in Aggregation Wizard). The place I can see a potential increase is in the size of a single data partition “row”. In AS2K, a data partition row included a key for each level in the cube. If the same is true in 2005, the row for a measure group data partition could be a lot bigger because there might be a lot more attribute hierarchies than there were levels in 2K. But there is no documentation (for either 2K or 2005) on the internal data structure, and I haven’t tried hacking the files to figure out the structure (yet) in 2005.

11. Why are min and max considered to be pseudo-additive measures?

My definition of pseudo-additive is that it has all the same benefits as Sum (i.e., can take advantage of strategic aggregations), but is not technically addition. Min and Max definitely can be included in strategic aggregation tables, but they are not technically addition. I just used to loosely talk about “additive measures” as including anything that could have strategic aggregations, but too many people would get confused about referring to Min, Max, weighted average, etc, as “additive” so I made up “pseudo-additive” as a way to say the same thing without being technically incorrect.

12. Can you please explain the purpose of the xml.defaults statement in the AMO code set.  What is the outcome of setting this to true vs. false.

Do you mean the CaptureXML property? Setting that to True means that an Update statement generates the XML (as when creating an AS project in Visual Studio), while setting it to False means that an Update statement makes the change on the deployed database (as when attaching to a database from Visual Studio—which I think is generally a very bad idea.)

13. How would you explain why AS was completely re-written.

In one sense, the underlying conception is the same—the way strategic aggregations work. So to say it’s completely re-architected is a slight overstatement.

But to answer your question, AS2K dimensions were completely founded on levels in a hierarchy. That is a very rigid structure. One example I like is Calendar vs. Fiscal hierarchies. While the Year and Quarter levels may be different, often the Month and Day levels are identical between the two hierarchies. In AS2K, this was simply impossible to properly model. Another example is with non-orthogonal attributes—such as Size vs Color: both are attributes of a Product, but they don’t form any kind of natural hierarchy, so in AS2K, there was no way to recognize any relationships that might actually exists (for example, that size 44 exists only for Red and White, but not for Black or Gray).

In 2005, a dimension is founded on a leaf-level key with additional attributes of that key. Those attributes may or may not form hierarchical relationships. This is a very flexible structure. You can easily model both the Calendar/Fiscal and the Color/Size problems I mentioned earlier. With that flexibility comes some additional complexity, but based on the changes in Apr CTP vs Beta 2, many of those complexities are likely to be handled by default, rather than by manual work.

 - Reed


No comments: