Minimum requirements
- OS: Windows 7 (64 bit) Service Pack 1
- Processor: Intel i5 Quad core
- 4GB RAM
- Video card capable of running OpenGL 3.3
- Storage: 10GB of free hard drive space available
Recommended requirements
- OS: Windows 10 (64 bit)
- Processor: Intel i7 Quad core
- 16GB RAM
- Video card capable of running OpenGL 4.0
- Storage: 30GB of free SSD space available
Licensing and Activation
When the application is launched for the first time, you will need to enter a license key.
To start the trial period, press the ‘Try’ button. The trial is fully functional and lasts for 14 days. A license can be purchased through the Pricing page. After purchasing, your license key will be sent to your email. You can use this key to activate the application. An internet connection is required to complete the activation process. If you encounter any issues during the licensing process, please contact us.
After activating the product it is possible to work without an internet connection, but periodically a connection with the licensing server will be established to validate the license. If connecting to this server continues to fail for a period of time, you will be asked to connect to the internet before continuing.
After an initial activation has been completed, it is possible to manage your license by selecting Tools/Licensing from the main menu.
Here you can see your current license key and you have the option to enter a new license key. This can, for instance, be used to upgrade an activated trial license into a purchased license, or to renew an existing license. In cases where you want to migrate your license to another machine, you can manually deactivate your license on your machine so that the key can be activated on another machine. After deactivation, the application will no longer be usable on the machine it was deactivated on, until a new key is entered.
Sampling and Instrumentation
Sampling and instrumentation each have their advantages and disadvantages. Advantages of sampling:
- You can hit the ground running without making code modifications
- You can spot problems that you did not anticipate. With instrumentation, you need to insert tags in places that you suspect could be problematic
- Sampling can give you kernel-level stacks or stacks from libraries that you do not have control over
However, instrumentation has advantages over sampling:
- Instrumentation is more precise. Despite the high sampling frequency, absolute precision is better achieved with instrumentation.
- Instrumentation can provide context. What file were you loading? What frame are you in? What state is your code currently in?
Unlike traditional profilers, the Superluminal profiler doesn't force you to make a tradeoff between sampling and instrumentation. Out of the box, the Superluminal profiler is a sampling profiler that runs at 8 kHz.
You can begin simply by starting a profiling session and incrementally add instrumentation events by using the Performance API as you discover where you want to place these events. The sampling data is then combined with the instrumentation events you add, giving you the best of both worlds. See the Instrumentation Timings View and Threads View sections for more information.
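To make the precision trade-off concrete, the following is a small arithmetic sketch rather than anything taken from the profiler itself. The 8 kHz rate comes from the text above; the function names and the example durations are hypothetical.

```cpp
#include <cassert>
#include <cstdio>

// Time between two consecutive samples, in microseconds, for a rate in Hz.
double SampleIntervalUs(double rateHz) {
    assert(rateHz > 0.0);
    return 1000000.0 / rateHz;
}

// Approximate number of samples that land inside a span of the given length.
double ExpectedSamples(double durationUs, double rateHz) {
    return durationUs / SampleIntervalUs(rateHz);
}

// Prints the interval and the expected sample counts for two example spans.
void PrintSamplingExample() {
    const double rateHz = 8000.0;  // default sampling rate stated in the text
    std::printf("interval: %.1f us\n", SampleIntervalUs(rateHz));
    std::printf("2 ms span: ~%.0f samples\n", ExpectedSamples(2000.0, rateHz));
    std::printf("50 us span: ~%.1f samples\n", ExpectedSamples(50.0, rateHz));
}
```

A span that is shorter than the sampling interval may not be hit by any sample in a given invocation, which is one case where an instrumentation event gives a more precise timing.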
Starting a new profiling session
To start a new profiling session, enter the New Session page. If the page is not already visible, you can click File/New Session in the menu bar to go to the New Session page.
Please be aware that in order to get symbol information, the application you're profiling must be built with certain settings. See Compiler & Linker settings for more information.
To get started with profiling, you can choose to launch a new process or to attach to a currently running process. To launch a new application, press ‘Run’.
To start measuring right from the beginning of the application, check ‘Start profiling immediately’. If you prefer to launch the application first, and then select a specific time to start profiling, uncheck this box. Enter an absolute application path and optionally a working directory and commandline arguments, and click ‘Run’ to launch the target application.
Note: a popup will appear asking whether the ProfilerCollectionService is allowed to run. This is the Superluminal service that will collect profiling data. Click ‘Yes’ to continue. If you chose to profile right from the start, the following screen will appear:
As soon as the timer is running, performance data of the target application is being captured. Press 'Stop profiling' to end the capture and start loading it. Press 'Cancel profiling' to return to the New Session page.
In case you decided not to start profiling from the start, the recording screen will be slightly different and look like this:
Here you can see that the timer is not running yet, which means the capture has not started yet, although the target application is already running. Click 'Start profiling' to start capturing performance statistics, or click 'Cancel profiling' to return to the New Session page.
Once a capture is stopped, it will first be written to disk, after which it will be loaded.
When this is run for the first time, symbols need to be downloaded and converted for use by Superluminal. Depending on the internet speed and the amount of symbols required, this may take some time. After they are downloaded and converted, a local cache will make sure that this only needs to happen once. For more information on configuring symbol resolve settings, please see Symbol Resolving.
Quick overview of the UI
The User Interface is divided into four major UI components:
- Instrumentation timings
- Threads view
- Callgraph, Flat and thread interaction views
- Source & disassembly
When doing performance analysis, people naturally go through a few stages of determining where bottlenecks lie. When a program is not running as expected, you'd want to find hotspots as soon as possible. As soon as the hotspot is found, you dig deeper to understand the context of the problem. And finally, you can drill down into the details by inspecting timings in source code or even on a per-assembly instruction level. This top-down flow is reflected in the UI as follows:
- The Instrumentation Timings view will be empty until instrumentation events are sent from within the application. Instrumentation events are optional and without them, it is still easily possible to get a great overview of bottlenecks through the Threads view. However, when some high-level events are sent from the application, such as a per-frame event, it becomes easy to spot performance spikes or to see what your average framerate is.
- The Threads view displays a full recording of your threads on a timeline. At the top of each thread, an overview of the thread's activity is displayed. When the thread is opened up, the full recording of thread activity is displayed. Traditional sampling profilers do not have this temporal view of data and are mostly centered around callgraphs. This view is incredibly powerful because it displays the full context around any hotspot.
- The CallGraph and Flat views are traditional callgraph and butterfly views, except that they can quickly filter on any time range.
- The Source & disassembly view can display per-line and per-instruction level timings.
To understand how the UI 'flows' even better, it's good to understand how time selections work. It is best explained by example:
- We selected an instrumentation event we would like to inspect from the events view. By clicking on it, the timerange for that event is selected in the Threads view.
- The Threads view highlights the selection.
- The CallGraph and Flat views respond to the selection and display the information for that time range only, allowing you to inspect just that particular piece of code.
- The source code view reflects only the timings for this time range, allowing you to inspect the area you are interested in.
Navigating the UI
The toolbar buttons on the left side of the graph allow for quick access of the various navigation modes. Clicking them will switch navigation modes. To use the selected mode, click and drag the graph to pan or zoom. To return to regular select mode, click the select toolbar button.
The navigation scrollbar underneath the graphs will let you both pan and zoom the graph. The center section of the scrollbar pans the graph, while the outer buttons control the zoom level. Click and drag them to zoom and pan.
Panning and zooming can also be controlled by using the configured keys from the settings dialog. This is the quickest way of navigating through the views. Clicking the left mouse button while holding the shortcut key and moving the mouse will let you zoom and pan the graph. Alternatively, for zooming, the mouse wheel can be used instead of dragging the mouse. To reconfigure the keys or to modify the sensitivity, select Tools/Settings in the menu, and then select the 'Controls' tab.
The Instrumentation Timings view
When instrumentation events are added to the target application, the Instrumentation Timings view can be used to plot all instances in a chart.
Inspecting instrumentation events
By default, the Instrumentation Timings view will not contain any data until instrumentation events are sent from the target application. This can be achieved using the Performance API. This is optional, as the default sampling engine already provides a great starting point for making the first steps into profiling. See also Sampling and Instrumentation. When events are sent to the profiler, the UI is populated with information:
In this example we sent a "Render" event each frame so that we can measure the framerate and get an overview of it. You can add anything that helps you organize and add context to your profiling session. We can select the instrumentation event type that was sent, as well as a thread. The chart itself can be controlled as explained in Navigating the UI. Because the chart can be zoomed in and out, a single bar can represent multiple instrumentation events. In such cases, a single bar in the chart is colored to represent the average and maximum length of the events that it represents. The lighter blue represents the average time, the darker blue represents the maximum time. When hovering over a bar, this information is displayed in a tooltip:
In this example, the average Render time (~0.6ms) is much shorter than the peak Render time (~2ms). When zooming in on this graph, the combined bar will split into separate bars until you reach the zoom level where each bar is drawn separately. At that zoom level, a bar is always light blue. The following image clearly displays the variation in framerate and why the average time is much lower than the peak time:
Note: if we had selected a bar that represents multiple instrumentation events, the entire timerange from the first event until the last event would have been selected in the threads view.
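The way a single bar can stand for several events, with an average and a maximum, can be sketched as follows. This is only an illustration of the idea described above, not the profiler's implementation; the Event and Bar types, the BuildBars function and the fixed-width bucketing are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One instrumentation event instance (hypothetical layout).
struct Event {
    double startMs;     // start time on the timeline
    double durationMs;  // how long the event lasted
};

// What one bar shows when it stands for one or more events.
struct Bar {
    std::size_t count = 0;
    double averageMs = 0.0;  // drawn in the lighter blue
    double maxMs = 0.0;      // drawn in the darker blue
};

// Groups events into barCount bars, each covering barWidthMs of the timeline.
std::vector<Bar> BuildBars(const std::vector<Event>& events,
                           double barWidthMs, std::size_t barCount) {
    assert(barWidthMs > 0.0);
    std::vector<Bar> bars(barCount);
    std::vector<double> totals(barCount, 0.0);
    for (const Event& e : events) {
        if (e.startMs < 0.0) continue;  // before the visible range
        const std::size_t index = static_cast<std::size_t>(e.startMs / barWidthMs);
        if (index >= barCount) continue;  // after the visible range
        bars[index].count += 1;
        totals[index] += e.durationMs;
        bars[index].maxMs = std::max(bars[index].maxMs, e.durationMs);
    }
    for (std::size_t i = 0; i < barCount; ++i) {
        if (bars[i].count > 0) {
            bars[i].averageMs = totals[i] / static_cast<double>(bars[i].count);
        }
    }
    return bars;
}
```

When the bars are narrow enough that each one holds a single event, the average equals the maximum, which corresponds to the single-colored bars seen when fully zoomed in.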
Controlling the chart
For an explanation of how to zoom in and out of the graph, see Navigating the UI. In addition, to quickly zoom to extents, use the 'zoom to extents' toolbar button. To control the height of the chart, you can either:
- Use the vertical scrollbar
- Press the 'normalize' toolbar button to normalize the height of the chart
- Press the 'average' toolbar button to set the height of the chart to the average of the entire chart
The Threads view
The Threads view contains the data for all threads, with one row per thread. Each row displays the thread ID and name for that particular thread, as well as an overview of the high-level thread activity. Each row can be expanded to inspect more detailed data about that thread's activity. For information on how to set the thread name, see the Thread names section.
Thread activity and interaction
Each row is initially in a collapsed state, giving you a high-level overview of thread activity and how threads interact with each other.
The green color in the overview means the thread is in an executing state. Any other colors are variations on wait states. When hovering over the various colors, a tooltip is displayed explaining what the thread was doing at that time.
Depending on the zoom level, arrows are visible that indicate how threads interact with each other: how a thread is unblocked by another thread. This is very convenient to see how threads interact with each other. When hovering over a wait state, the arrow for that wait state will become visible and will animate. In the following example, the Streaming thread was blocking and waiting to be unblocked. It eventually got unblocked by 'Job Scheduler 3'. From what we can deduce at this point, it appears that a streaming thread is waiting for some command to (possibly) read or write data. A job in the job scheduler eventually kicks it to perform that operation.
A thread can be unblocked by another thread, but it may not yet be scheduled in by the thread scheduler. The length of the horizontal part of the arrow indicates the duration between the thread being set into a ready state and the time it was actually scheduled in by the OS. When hovering over a wait state, or when hovering over an arrow, click the arrow or wait state to get more information about the blocking and unblocking callstacks. We can now determine more precisely what was going on in our example.
We can see that Job Scheduler 3 called RequestSave. By clicking on the function, we can see the source code for that function. The source code clearly shows that we unlock a condition, allowing the streaming thread to perform the write. If we want to navigate between the blocking and unblocking stack, we can click the toolbar buttons on top of the stacks to navigate to the various stacks in the Threads view quickly.
Examining thread execution
Each thread can be opened up by either clicking the arrow to the left of the thread ID or by dragging the thread separator down.
When opening a thread, a full recording of that thread is displayed.
In this example, we see our Streaming thread executing code. The light-blue bars are regular sampled functions, the darker blue bars are instrumentation bars. To add instrumentation events, use the Performance API. Notice that the instrumentation is merged into this view as if it was part of a regular callstack. Also notice that these events have additional information on them. The numbers in red are file sizes that we sent to the profiler. This is convenient for us to understand the ratio between write time, size and compression size. Another example is when we hover over one of the events:
The tooltip displays the length of the bar and, in the case of an instrumentation event, the context. In this particular case we can see the filename that we were writing to. This can all be accomplished by providing context to instrumentation events. When we click on a bar, we select it and the CallGraph and Flat views will respond by displaying the information that is related to this particular bar and time range. To understand more about time selections, see Selecting ranges on the timeline.
Selecting ranges on the timeline
When clicking on a bar in the Threads view, a time selection is made. This is visualized by dimming the area outside of the selection and by the vertical bars surrounding the selection. The selected bar highlights in yellow.
The callgraph and flat views will respond to the selection that was made and display only the information for this selection.
What we actually do when we click on a bar, is select the timerange for that bar within a thread. Other views like the CallGraph view, Flat view and the Find window can either filter on time range selection or the entire time range. We know we can create a time selection by clicking a bar, but we can alter the selection on the timeline in different ways as well:
- By simply clicking the left mouse button and dragging
- By dragging the vertical selection bars left and right with the mouse
To find a function within an arbitrary time range, drag-select the timerange, and filter on 'Selected time range' in the Find window. Similarly, if you want to filter the CallGraph or Flat view on an arbitrary time range, just make a selection and choose 'Selected time range' in the CallGraph/Flat view toolbar if not already selected.
It can be convenient to measure how long something takes. The measure function can be accessed in two ways. The first one is by clicking the measure toolbar button:
When in measure mode, click the left mouse button and drag the mouse. You will see the timing for that time range.
To exit the measure mode, click the Select button in the toolbar. A quicker way to measure the length is to use the shortcut for measuring. This is set to SHIFT by default, but can be altered by selecting Tools/Settings in the menu, and then selecting the 'Controls' tab.
The CallGraph view
The CallGraph view can either operate on the entire time range or on the selected time range. For information on how to control the timerange, see Selecting ranges on the timeline. Switching between the time selection and the entire time range can be done by using the combobox in the toolbar. By default, the entire range is selected, so when opening a session, you can browse the callgraph for a thread immediately. When clicking on a bar in the Threads view, the CallGraph view will switch to the thread for that bar, and the time range selection for that bar.
Here we can see statistics for each function:
- Inclusive time, the time of the function itself and all its children, recursively
- Exclusive time, the time spent only in the function itself
- Thread state, how much time was spent executing, waiting (and in what wait states)
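The relation between inclusive and exclusive time can be sketched with a small call tree. The CallNode type and the InclusiveMs function are hypothetical and only illustrate the definitions above; they say nothing about how the profiler stores its data. The sketch assumes C++17, where a std::vector member of the enclosing type is permitted.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A node in a call tree (hypothetical layout): one function on one code path.
struct CallNode {
    std::string name;
    double exclusiveMs;              // time spent only in the function itself
    std::vector<CallNode> children;  // functions called from this function
};

// Inclusive time: the function itself plus all of its children, recursively.
double InclusiveMs(const CallNode& node) {
    assert(node.exclusiveMs >= 0.0);
    double total = node.exclusiveMs;
    for (const CallNode& child : node.children) {
        total += InclusiveMs(child);
    }
    return total;
}
```

For a leaf function the two values are equal; for any other function the difference between them is the time spent in its callees.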
The pie chart on the right side of the CallGraph mirrors the statistics of the callgraph, but in a graphical way to make sure you can see the distribution of the timings at a glance. The pie chart can be navigated as well and remains in sync with the CallGraph on the left side. By hovering over the pie chart, the timings are displayed in a tooltip and the full name of the function is displayed in the header.
The pie chart can be navigated by hovering over a pie piece and clicking the left mouse button. To go back to the caller, right-click anywhere on the pie chart.
When navigating through the CallGraph either by selecting functions in the tree or in the pie chart, the Source and Disassembly view is updated to display timings based on the selection that was made.
The Flat view
The Flat view displays a flat, sorted list of all the functions within the entire or selected time range. The view can be sorted on inclusive time or exclusive time by clicking on the column headers:
- The Inclusive time is the time of the function itself and all its children, recursively
- The Exclusive time is the time spent only in the function itself
The CallGraph view is an excellent tool for understanding the performance costs of a single path in the code, but in a CallGraph view, it is more difficult to find the combined cost of multiple invocations of the same function from different code paths. The flat view is therefore very convenient for finding the combined time spent in a function, either inclusive or exclusive. In the following example you can see the Flat view sorted on exclusive time:
What stands out in this example are the functions for clearing and copying memory (qt_memfill32 and memcpy). This is typically something that happens from many different locations and is therefore typically harder to spot in the callgraph. The combined time, however, can be significant as shown in this example. Clicking on an item in the list updates the Source and Disassembly view. The source view will display all the time spent in all invocations of the function within the time range. Also, after an item is clicked, we can find out where the function was called from and what code paths were responsible for what portion of the total time spent. The trees on the right of the list are traditionally known as a 'butterfly' view: it shows the callers ('called by') and callees ('calls'):
- The Called by tree shows what functions called the selected function, and how much time was spent in that code path.
- The Calls tree shows all functions that are being called by the function, and how much time was spent in that code path.
Double-clicking nodes in the trees will center the function in the flat list.
In our example we see that RenderNineGridInternal is responsible for the largest portion of memcpy calls. We can open the subnodes to further investigate the paths leading to RenderNineGridInternal and memcpy. In our example it turns out that clearing UI backgrounds for many different UI elements adds up to quite a bit of total time, something that we can optimize.
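The combined cost of a function across code paths, and the 'called by' side of the butterfly, can be sketched as follows. This is a simplified illustration under stated assumptions: the Sample and FlatEntry types and both functions are hypothetical, each sample is assumed to carry a full callstack and a duration, and CalledBy assumes the function appears at most once per stack.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One sampled callstack, outermost caller first (hypothetical layout).
struct Sample {
    std::vector<std::string> stack;
    double durationMs;
};

struct FlatEntry {
    double inclusiveMs = 0.0;
    double exclusiveMs = 0.0;
};

// Sums time per function over all code paths, as a flat list would.
std::map<std::string, FlatEntry> BuildFlatList(const std::vector<Sample>& samples) {
    std::map<std::string, FlatEntry> flat;
    for (const Sample& s : samples) {
        assert(s.durationMs >= 0.0);
        if (s.stack.empty()) continue;
        // Exclusive time belongs to the innermost function only.
        flat[s.stack.back()].exclusiveMs += s.durationMs;
        // Inclusive time is credited once per distinct function on the stack,
        // so a recursive function is not counted twice for the same sample.
        std::vector<std::string> distinct = s.stack;
        std::sort(distinct.begin(), distinct.end());
        distinct.erase(std::unique(distinct.begin(), distinct.end()), distinct.end());
        for (const std::string& fn : distinct) {
            flat[fn].inclusiveMs += s.durationMs;
        }
    }
    return flat;
}

// 'Called by': time spent in or under `function`, split by its direct caller.
std::map<std::string, double> CalledBy(const std::vector<Sample>& samples,
                                       const std::string& function) {
    std::map<std::string, double> callers;
    for (const Sample& s : samples) {
        for (std::size_t i = 1; i < s.stack.size(); ++i) {
            if (s.stack[i] == function) {
                callers[s.stack[i - 1]] += s.durationMs;
            }
        }
    }
    return callers;
}
```

With samples in which memcpy is reached from several callers, BuildFlatList reports one combined memcpy entry, while CalledBy(samples, "memcpy") splits that same time per caller.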
It is also possible to search within the flat list. The text box above the list functions as a filter. In the following image you can see how the list is limited to functions that only contain the 'Render' substring:
Similar to how the CallGraph works, the Flat view operates on either the entire or selected time range. For information on how to control the timerange, see Selecting ranges on the timeline. Switching between the time selection and the entire time range can be done by using the combobox in the toolbar. By default, the entire range is selected, so when opening a session, you can browse the Flat list for a thread immediately. When clicking on a bar in the Threads view, the Flat view will switch to the thread for that bar, and the time range selection for that bar.
The Source & Disassembly view
Here you can see how much time was spent, and in what thread states, per source code line. When hovering over the thread state, more timing information is displayed.
If the source file could not be resolved but the image file (DLL, exe) is present on the disk, a disassembly view is displayed. For instance, when clicking on a Windows DLL function, the disassembly is displayed if the signatures of the DLL match and if the process has access to the file. Per-instruction timings will be available:
If the source file could be resolved and the image file is present on the disk, mixed-mode disassembly can be displayed by clicking the disassembly icon in the toolbar.
We can find text in the Source and Disassembly view either by pressing CTRL+F, or by clicking the find button in the toolbar. Like traditional Windows applications, F3 and SHIFT+F3 will go to the next and previous find results.
The Find window
The Find window is a window specifically for finding functions in the Threads view. There is also a local find window in the Source and Disassembly view that simply finds text. To find functions in the Threads view, press CTRL+F, or click the find button in the main toolbar. The Find window will be displayed in the top right of the Threads view.
When you start typing, a list of suggestions appears. Any sampled or instrumented function whose name partially matches the typed text is suggested:
After pressing ENTER, or selecting a function from the list, the entire session is searched for all functions that match the name, thread and time range. For information about time ranges, see Selecting ranges on the timeline. The window will display how many hits it found, and the Threads view will highlight all hits in yellow.
To browse through the results, click the next and previous buttons, or use F3/SHIFT+F3 to cycle through them.
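The search criteria described above (name, thread and time range) can be sketched as a simple filter. The types and the FindHits function are hypothetical, and two details are assumptions rather than documented behavior: the name match is a case-sensitive substring match, and an entry counts as inside the range when it overlaps it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One function instance on the timeline (hypothetical layout).
struct TimelineEntry {
    std::string function;
    int threadId;
    double startMs;
    double endMs;
};

// Time range filter; when `enabled` is false the entire session is searched.
struct TimeRange {
    bool enabled;
    double startMs;
    double endMs;
};

// Returns entries whose name contains `text`, on `threadId`, within `range`.
std::vector<TimelineEntry> FindHits(const std::vector<TimelineEntry>& entries,
                                    const std::string& text, int threadId,
                                    const TimeRange& range) {
    assert(!range.enabled || range.startMs <= range.endMs);
    std::vector<TimelineEntry> hits;
    for (const TimelineEntry& e : entries) {
        const bool nameMatches = e.function.find(text) != std::string::npos;
        const bool threadMatches = e.threadId == threadId;
        const bool inRange = !range.enabled ||
                             (e.endMs >= range.startMs && e.startMs <= range.endMs);
        if (nameMatches && threadMatches && inRange) {
            hits.push_back(e);
        }
    }
    return hits;
}
```

Stepping through the returned hits in order would correspond to cycling through the results with the next and previous buttons.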
Using the Performance API
The Performance API can be used to communicate with Superluminal Performance from the target application. It is used to mark up code with instrumentation events and to give names to threads.
The Performance API (DLL, lib and header) is located in the ‘API’ subfolder of the installation directory. Libraries are provided for both Visual Studio 2015 and 2017. All instrumentation events support UTF8 encoding. To send an instrumentation event, you use any of the following macros. The scope that you are measuring is determined by the scope that the macro is in.
- PERFORMANCEAPI_INSTRUMENT(Name). This sends an instrumentation event with a custom name.
- PERFORMANCEAPI_INSTRUMENT_CONTEXT(Name, Context). This sends an instrumentation event with a custom name and a context.
- PERFORMANCEAPI_INSTRUMENT_FUNCTION(). This sends an instrumentation event with the name of the current function.
- PERFORMANCEAPI_INSTRUMENT_FUNCTION_CONTEXT(Context). This sends an instrumentation event with the name of the current function, and a context.
The optional context parameter in these macros is a string that can differ per instance. This could be a filename, a specific section of the code like ‘startup’ or ‘processing’, it could be a counter, or anything that helps you add context to your profiling session.
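The following sketch shows how the macros listed above might be placed in code, and how the enclosing scope determines what is measured. So that it compiles without the SDK, it defines stand-in versions of the macros; these stand-ins (SketchEvent, SketchLog, SketchScope and the SKETCH_ helper macros) are hypothetical and are not the real implementation. Passing the name and context as C strings is also an assumption, so check the header in the ‘API’ subfolder for the exact argument types. In a real project you would include that header instead of the stand-in section.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <vector>

// --- Stand-ins so the sketch compiles without the SDK (NOT the real API) ---
#ifndef PERFORMANCEAPI_INSTRUMENT
struct SketchEvent {
    std::string name;
    std::string context;
    double durationMs;
};

// Events recorded by the stand-in, in the order in which their scopes ended.
inline std::vector<SketchEvent>& SketchLog() {
    static std::vector<SketchEvent> log;
    return log;
}

// Starts timing on construction and records the event when the scope ends.
class SketchScope {
public:
    SketchScope(const char* name, const char* context)
        : mName(name), mContext(context ? context : ""),
          mStart(std::chrono::steady_clock::now()) {
        assert(name != nullptr);
    }
    ~SketchScope() {
        const std::chrono::duration<double, std::milli> elapsed =
            std::chrono::steady_clock::now() - mStart;
        SketchLog().push_back({mName, mContext, elapsed.count()});
    }
    SketchScope(const SketchScope&) = delete;
    SketchScope& operator=(const SketchScope&) = delete;

private:
    std::string mName;
    std::string mContext;
    std::chrono::steady_clock::time_point mStart;
};

#define SKETCH_CONCAT2(a, b) a##b
#define SKETCH_CONCAT(a, b) SKETCH_CONCAT2(a, b)
#define SKETCH_SCOPE SketchScope SKETCH_CONCAT(sketchScope, __LINE__)
#define PERFORMANCEAPI_INSTRUMENT(Name) SKETCH_SCOPE(Name, nullptr)
#define PERFORMANCEAPI_INSTRUMENT_CONTEXT(Name, Context) SKETCH_SCOPE(Name, Context)
#define PERFORMANCEAPI_INSTRUMENT_FUNCTION() SKETCH_SCOPE(__func__, nullptr)
#define PERFORMANCEAPI_INSTRUMENT_FUNCTION_CONTEXT(Context) SKETCH_SCOPE(__func__, Context)
#endif

// --- Usage: the macro measures the scope it is placed in ---
void SaveFile(const char* path) {
    // Covers the whole function; the filename is sent as context.
    PERFORMANCEAPI_INSTRUMENT_FUNCTION_CONTEXT(path);
    {
        // Covers only this inner block.
        PERFORMANCEAPI_INSTRUMENT("Compress");
        // ... compression work would go here ...
    }
    // ... write work would go here ...
}

void RenderFrame() {
    // One event per call, for example once per frame.
    PERFORMANCEAPI_INSTRUMENT("Render");
    // ... rendering work would go here ...
}
```

Because each macro is tied to its scope, nesting a macro inside a block produces a shorter event inside the longer function-level event, similar to how nested bars appear in the Threads view.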
Thread names can be visualized in the profiler. If you are already using the Windows SetThreadDescription API, the names will automatically appear in the profiler. Note however, that this function is only available starting with Windows 10, build 1607 (Anniversary Update). The Performance API also has a function to set thread names. It will use SetThreadDescription internally when available, but falls back to a custom method that will send the names to the profiler manually. To use it, call PerformanceAPI::SetCurrentThreadName for the currently active thread.
In order to resolve symbols, the application you're profiling must be configured correctly. In addition, Superluminal Performance can be configured to retrieve symbols and source files from symbol and source servers, respectively.
Compiler & Linker settings
The following is a list of compiler & linker settings that need to be enabled in the configuration properties of the application (and related modules) you're profiling, in order to be able to correctly resolve symbols.
- C/C++ -> General -> Debug Information Format
  - C7 compatible (/Z7) or
  - Program Database (/Zi)
- Linker -> Debugging -> Generate Debug Info
  - VS2015 and earlier: Yes (/DEBUG)
  - VS2017 and newer: Generate Debug Information optimized for sharing and publishing (/DEBUG:FULL)
You can add or edit locations that should be searched when symbol (*.pdb) or image files (*.exe, *.dll) are needed during symbol resolving by going to the Tools/Settings menu and selecting the 'General' tab.
An arbitrary number of symbol locations can be added. Do keep in mind that a large number of symbol locations may slow down the symbol resolving process, as each location must be tried for any unmatched symbol or image file. When symbol & image files are retrieved from symbol locations, they are cached in the Symbol Cache directory, which can also be edited through the Settings menu. It is recommended to place the Symbol Cache directory on a drive with sufficient space, as it may grow quite large.
Finally, there are two ‘types’ of symbol locations that can be specified: symbol servers and local directories.
A Symbol Server is nothing more than a directory with a structure as produced by Microsoft’s SymStore tool. The Symbol Server can be accessed as a local directory, over HTTP(S), or through a network share. For example, the following are all valid Symbol Server locations:
Local directories & Network shares
A local directory or network share can be used to load symbols, even if it's not in the Symbol Store format. These directories should contain flat lists of PDBs and/or image files. Symbols or image files in this directory will only be used if they match the signature of the required PDB/image.
Source Server / Source Indexing
If your PDBs are correctly source indexed, Superluminal will retrieve source files through your source server when required by the Source view. Superluminal supports both regular source servers (i.e. retrieved through source control such as Perforce) and HTTP(S)-based source servers.
See the Windows documentation for more information about source indexing.
When the application is launched, an update check is performed to see if there are new versions available. This happens at most once a day. To manually search for updates, you can select 'Help/Check for updates' from the main menu. It is also possible to change the auto update settings by selecting Tools/Settings, and then selecting the 'Auto Update' tab:
You can disable automatic updates here. Early adopters can also opt in to the Insider releases. Those releases contain the latest features that have not been tested as extensively as the stable builds.
In case you want to switch back to a previous stable release of Superluminal, go to our downloads page. Any previous release can just be downloaded and installed over a newer release.
Third party software
Superluminal Performance uses various third party software. To view a list of all used third party software and their licenses, please select Help/About from the main menu.