|Version 34 (modified by simonmar, 11 months ago)|
Cabal and GHC do not support multiple instances of the same package version installed at the same time. If a second instance of a package version is installed it is overwritten on the file system as well as in the PackageDB. This causes packages that depended upon the overwritten instance to break. The idea is to never overwrite an installed package. As already discussed in http://hackage.haskell.org/trac/ghc/wiki/Commentary/Packages/MultiInstances the following changes need to be made:
- Cabal should install packages to a location that does not just depend on name and version,
- ghc-pkg should always add instances to the PackageDB and never overwrite them,
- ghc --make, ghci, and the configure phase of Cabal should select suitable instances according to some rule of thumb (similar to the current resolution technique),
- we want to be able to make more fine-grained distinctions between package instances than currently possible, for example by distinguishing different build flavours or "ways" (profiling, etc.),
- cabal-install should still find an InstallPlan, and still avoid unnecessarily rebuilding packages whenever it makes sense,
- some form of garbage collection should be offered to make it possible to reduce the number of installed packages.
Hashes and identifiers
There are three identifiers:
- XXXX: the identifier appended to the installation directory so that installed packages do not clash with each other
- YYYY: the InstalledPackageId, which is an identifier used to uniquely identify a package in the package database.
- ZZZZ: the ABI hash derived by GHC after compiling the package
The current situation:
- XXXX: is empty, which is bad (two instances of a package install in the same place)
- YYYY: is currently equal to ZZZZ, which is bad because we need to make more distinctions:
  - we need to distinguish between two packages that have identical ABIs but different behaviour (e.g. a bug was fixed)
  - we need to distinguish between two instances of a package that are compiled against different dependencies, or with different options, or compiled in a different way (profiling, dynamic)
- XXXX must be decided before we begin compiling, because we have to generate the Paths_P.hs file that is compiled along with the package, whereas ZZZZ is only available after we have compiled the package.
- ZZZZ is not uniquely determined by the compilation inputs (see #4012), although in the future we hope it will be
- It is desirable that when two packages have identical YYYY values, then they are compatible, even if they were built on separate systems. Note that this is not guaranteed even if YYYY is a deterministic function of the compilation inputs, because ZZZZ is non-deterministic (previous point). Hence YYYY must be dependent on ZZZZ.
- It is desirable that YYYY be as deterministic as possible, i.e. we would rather not use a GUID; instead, YYYY should be determined by the compilation inputs and ZZZZ. We know that ZZZZ is currently non-deterministic, but in the future it will be deterministic, and at that point YYYY will become deterministic too. In the meantime, YYYY should be no less deterministic than ZZZZ.
- We define a new Cabal Hash that hashes the compilation inputs (the LocalBuildInfo and the contents of the source files)
- XXXX is a GUID.
- Why not use the Cabal Hash? We could, but then there could conceivably be a clash. (Andres - please expand this point, I have forgotten the full rationale).
- YYYY is the combination of the Cabal Hash and ZZZZ (concatenated)
- ZZZZ is recorded in the package database as a new field abi-hash.
- When two packages have identical ZZZZs then they are interface-compatible, and the user might in the future want to change a particular dependency to use a different package with the same ZZZZ. We do not want to make this change automatically, because even when two packages have identical ZZZZs, they may have different behaviour (e.g. bugfixes).
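The relationship between the identifiers in this proposal can be sketched in a few lines of Haskell. This is only an illustration: the types and function names below are hypothetical and do not exist in Cabal or GHC, and the "-" separator used when concatenating the two hashes is an assumption.

```haskell
-- Hypothetical types for illustration; none of these names exist in Cabal or GHC.
newtype CabalHash = CabalHash String deriving (Eq, Show)
newtype AbiHash   = AbiHash String   deriving (Eq, Show)

data Instance = Instance
  { instPkgId :: String     -- name and version, e.g. "foo-1.0"
  , instCabal :: CabalHash  -- hash of the compilation inputs, known before the build
  , instAbi   :: AbiHash    -- ZZZZ, known only after the build
  } deriving (Eq, Show)

-- YYYY: the Cabal hash and ZZZZ, concatenated (the "-" separator is an assumption).
installedPackageId :: Instance -> String
installedPackageId inst = c ++ "-" ++ z
  where
    CabalHash c = instCabal inst
    AbiHash z   = instAbi inst

-- Identical ZZZZ: interface-compatible, although behaviour may still differ.
interfaceCompatible :: Instance -> Instance -> Bool
interfaceCompatible a b = instAbi a == instAbi b

-- Identical YYYY: same compilation inputs and same ABI.
sameInstance :: Instance -> Instance -> Bool
sameInstance a b = installedPackageId a == installedPackageId b
```

Under this sketch, a bugfix build that keeps the same interface is `interfaceCompatible` with the old build but is not the `sameInstance`, which is why the swap should stay a user decision.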
Install location of installed Cabal packages
Currently the library part of packages is installed to $prefix/lib/$pkgid/$compiler. For example the GLUT package of version 22.214.171.124 when compiled with GHC 7.4.1 when installed globally lands in /usr/local/lib/GLUT-126.96.36.199/ghc-7.4.1/. This is the default path. It is completely customizable by the user. In order to allow multiple instances of this package to coexist we need to change the install location to a path that is unique for each instance. Several ways to accomplish this have been discussed:
Use a hash to uniquely identify package instances and make the hash part of both the InstalledPackageId and the installation path.
The ABI hash currently being used by GHC is not suitable for unique identification of a package, because it is nondeterministic and not necessarily unique. In contrast, the proposed Cabal hash should be based on all the information needed to build a package.
This approach requires that we know the hash prior to building the package, because there is a data directory (per default under $prefix/share/$pkgid/) that is baked into Paths_foo.hs in preparation of the build process.
Use a unique number as part of the installation path.
A unique number could be the number of packages installed, or the number of instances of this package version already installed, or a random number. It is important that the numbers are guaranteed to be unique system-wide, so the counter-based approaches are somewhat tricky.
The advantage over using a hash is that this approach should be very simple to implement. On the other hand, identifying installed packages (see below) could possibly become more difficult, and migrating packages to other systems is only possible if the chance of collisions is reasonably low (for example, if random numbers are being used).
- The unique number is also part of the installed package id.
- We can use another unique identifier (for example, a Cabal hash) to identify installed packages. In this case, that identifier would be allowed to depend on the output of a package build.
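Whichever identifier is chosen (hash or unique number), its effect on the install location is the same. The sketch below contrasts the current default layout with one possible per-instance layout. The function names are hypothetical, and appending the identifier to the $pkgid component is an assumption; the page does not fix where in the path it goes.

```haskell
import Data.List (intercalate)

-- Current default layout: $prefix/lib/$pkgid/$compiler
currentLibDir :: FilePath -> String -> String -> FilePath
currentLibDir prefix pkgid compiler =
  intercalate "/" [prefix, "lib", pkgid, compiler]

-- Assumed layout: the instance identifier (XXXX) is appended to $pkgid,
-- so two instances of the same package version no longer share a directory.
instanceLibDir :: FilePath -> String -> String -> String -> FilePath
instanceLibDir prefix pkgid compiler xxxx =
  intercalate "/" [prefix, "lib", pkgid ++ "-" ++ xxxx, compiler]
```

Because the identifier has to appear in paths that are baked into the build, it must be known before compilation starts, which rules out anything derived from the build output.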
ghc-pkg currently identifies each package by means of an InstalledPackageId. At the moment, this id has to be unique per package DB, which limits the number of package instances that can be installed in a single package DB at any one time.
In the future, we want the InstalledPackageId to still uniquely identify installed packages, but in addition to be unique among all package instances that could possibly be installed on a system. There's still the option that one InstalledPackageId occurs in several package DBs at the same time, but in this case, the associated packages should really be completely interchangeable. [If we want to be strict about this, we'd have to include the ABI hash in the InstalledPackageId.]
Even though, as discussed above, the ABI hash is not suitable for use as the InstalledPackageId given these changed requirements, we will need to keep the ABI hash as an essential piece of information for ghc itself.
ghc-pkg is responsible for storing all information we have about installed packages. Depending on design decisions about the solver and the Cabal hash, further information may be required in ghc-pkg's description format (see below).
Simplistic dependency resolution
The best tool for determining suitable package instances to use as build inputs is cabal-install. However, in practice there will be many situations where users will probably not have the full cabal-install functionality available:
- invoking GHCi from the command line,
- invoking GHC directly from the command line,
- invoking the configure phase of Cabal (without using cabal-install).
In these cases, we have to come up with a suitable selection of package instances, and the only info we have available are the package DBs plus potential command line flags. Cabal will additionally take into account the local constraints of the package it is being invoked for, whereas GHC will only consider command-line flags, but not modules it has been invoked with.
Currently, if GHC is invoked by the user, it does some ad hoc form of dependency resolution. The most common case of this is using ghci. If there are multiple instances of the same package in the PackageDBStack, the policy used to select a single one prefers DBs higher in the stack. It then prefers packages with a higher version. Once we allow package instances with the same version within a single package DB, we need to refine the algorithm. Options are:
- pick a random / unspecified instance
- use the time of installation
- user-specified priorities
- use the order in the PackageDB
- look at the transitive closure of dependencies and their versions
- build a complex solver into GHC
Picking a random instance is a last resort. A combination of installation time and priorities seems rather feasible. It makes conflicts unlikely, and allows the priorities of installed packages to be changed persistently. Using the order in the package DB is difficult if directories are being used as DBs. Looking at the transitive closure of dependencies makes it hard to define a total ordering of package instances. Adding a complex solver is unattractive unless we find a way to reuse cabal-install's functionality within GHC, but probably we do not want to tie the two projects together in this way.
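A selection rule combining the existing policy with priorities and installation time could look roughly like the following. This is a sketch, not GHC's actual resolution code: the `Candidate` record and its fields are hypothetical, and ranking user priority ahead of installation time is an assumption.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Hypothetical view of one installed instance as seen by GHC.
data Candidate = Candidate
  { candId        :: String   -- InstalledPackageId
  , candDbLevel   :: Int      -- position in the PackageDBStack; larger means higher
  , candVersion   :: [Int]    -- package version, e.g. [1,0,2]
  , candPriority  :: Int      -- user-specified priority; larger wins
  , candInstalled :: Integer  -- installation time, e.g. seconds since the epoch
  } deriving (Eq, Show)

-- Prefer higher DBs, then higher versions (the current policy),
-- then user priority, then the most recent installation.
selectInstance :: [Candidate] -> Maybe Candidate
selectInstance [] = Nothing
selectInstance cs = Just (maximumBy (comparing rank) cs)
  where
    rank c = (candDbLevel c, candVersion c, candPriority c, candInstalled c)
```

Since the ranking is a lexicographic comparison on a tuple, it is a total preorder on instances, which is the property that the transitive-closure option makes hard to obtain.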
Once we distinguish several package instances with the same version, we have a design decision how precise we want that distinction to be.
The minimal approach would be to just take the transitive dependencies into account. However, we might also want to include additional information about builds such as Cabal flag settings, compiler options, profiling, documentation, build tool versions, external (OS) dependencies, and more.
These differences have to be tracked. The two options we discuss are to store information in the ghc-pkg format, or to incorporate them in a Cabal hash (which is then stored). Both options can be combined.
The Cabal hash
[A few notes about where to find suitable information in the source code:]
A build configuration consists of the following:
The Cabal hashes of all the package instances that are actually used for compilation. This is the environment. It is available in the installedPkgs field of LocalBuildInfo which is available in every step after configuration. It can also be extracted from an InstallPlan after dependency resolution.
The compiler, its version and its arguments, and the tools with their versions and their arguments. These are also available from LocalBuildInfo, more specifically: compiler, withPrograms, withVanillaLib, withProfLib, withSharedLib, withDynExe, withProfExe, withOptimization, withGHCiLib, splitObjs, stripExes, and a lot more. [Like what?]
The source code. This is necessary because if the source code changes the result of compilation changes. For released packages I would assume that the version number uniquely identifies the source code. A hash of the source code should be available from hackage to avoid downloading the source code. For an unreleased package we need to find all the source files that are needed for building it, including non-Haskell source files. One way is to ask for a source tarball to be built as if the package were released and then hash all the sources included in that.
OS dependencies are not taken into account because I think it would be very hard.
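To make the three ingredients concrete, here is a minimal sketch of how such a hash could be assembled. Everything in it is an assumption for illustration: `BuildConfig` is a hypothetical stand-in for the relevant parts of LocalBuildInfo, and the 64-bit FNV-1a function is used only because it fits in a few lines; a real implementation would presumably use a cryptographic hash.

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.List (foldl', sort)
import Data.Word (Word64)
import Numeric (showHex)

-- Stand-in hash: 64-bit FNV-1a over character codes (not cryptographic).
fnv1a :: String -> Word64
fnv1a = foldl' step 14695981039346656037
  where
    step h c = (h `xor` fromIntegral (ord c)) * 1099511628211

-- Hypothetical summary of a build configuration.
data BuildConfig = BuildConfig
  { depHashes  :: [String]  -- Cabal hashes of the instances in the environment
  , compilerId :: String    -- e.g. "ghc-7.4.1"
  , buildOpts  :: [String]  -- e.g. "withProfLib=True", "withOptimization=O1"
  , sourceHash :: String    -- hash of the package's source files
  }

-- Serialise the inputs in a canonical order, then hash the result.
-- Sorting makes the hash independent of the order in which
-- dependencies and options happen to be listed.
cabalHash :: BuildConfig -> String
cabalHash cfg = showHex (fnv1a (unlines fields)) ""
  where
    fields = concat
      [ "deps" : sort (depHashes cfg)
      , ["compiler", compilerId cfg]
      , "options" : sort (buildOpts cfg)
      , ["source", sourceHash cfg]
      ]
```

Because the dependency hashes are themselves Cabal hashes, a change deep in the dependency tree propagates upwards and yields a new hash for every package built on top of it.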
Released and Unreleased packages
If we cabal install a package that is released on hackage, we call this a clean install. If we cabal install an unreleased package, we call this a dirty install. Clean installs are mainly used to bring a package into scope for ghci and to install applications. While they can be used to satisfy dependencies, this is discouraged. For released packages the set of source files needed for compilation is known. For unreleased packages this is currently not the case.
Dependency resolution in cabal-install
There are two general options for communicating knowledge about build flavors to the solver:
- the direct way: i.e., all info is available to ghc-pkg and can be communicated back to Cabal and therefore the solver can figure out if a particular package is suitable to use or not, in advance;
- the agnostic way: this is based on the idea that the solver at first doesn't consider installed packages at all. It'll just do resolution on the source packages available. Then, taking all build parameters into account, Cabal hashes will be computed, which can then be compared to hashes of installed packages.
Reusing installed packages instead of rebuilding them is then an optimization of the install plan.
The agnostic way does not require ghc-pkg to be directly aware of all the build parameters, as long as the hash computation is robust.
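The reuse step of the agnostic way can be sketched as a simple post-processing pass over a source-only plan. The names below are hypothetical and unrelated to cabal-install's real InstallPlan type; the sketch assumes the Cabal hash of each planned package has already been computed from its build parameters.

```haskell
-- Hypothetical plan step: either reuse an installed instance or build from source.
data PlanStep
  = Reuse String  -- Cabal hash of the installed instance to reuse
  | Build String  -- package id to build, e.g. "foo-1.0"
  deriving (Eq, Show)

-- First argument: Cabal hashes recorded for installed instances.
-- Second argument: the source-only plan as (package id, computed Cabal hash) pairs.
optimisePlan :: [String] -> [(String, String)] -> [PlanStep]
optimisePlan installed planned =
  [ if h `elem` installed then Reuse h else Build pkg
  | (pkg, h) <- planned
  ]
```

The only thing this pass needs from the package database is the recorded hash, which is why ghc-pkg does not have to understand the individual build parameters.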
The options are to support either both by putting all info into InstalledPackageInfo or to support only the second option by just putting a hash into InstalledPackageInfo. The disadvantage of supporting both is that InstalledPackageInfo would have to change more often. This could be fixed by explicitly making the InstalledPackageInfo format extensible in a backwards-compatible way.
The advantage of having all info available, independently of the solver algorithm, is that the info might be useful for other tools and user feedback.
Possible disadvantages of the agnostic approach could be that it is a rather significant change and can probably not be supported in a similar way for other Haskell implementations. Also, in the direct approach, we could in principle allow more complex compatibility rules, such as allowing non-profiling libraries to depend on profiling libraries.
Also, even if we go for the agnostic approach, we still have to be able to handle packages such as base or ghc-prim which are in general not even available in source form.
On the other hand, the agnostic approach might lead to more predictable and reproducible solver results across many different systems.
The proposed changes will likely lead to a dramatic increase of the number of installed package instances on most systems. This is particularly relevant for package developers who will conduct lots of dirty builds that lead to new instances being installed all the time.
It should therefore be possible to have a garbage collection to remove unneeded packages. However, it is not possible for Cabal to see all potential reverse dependencies of a package, so automatic garbage collection would be extremely unsafe.
Options are to either offer an interactive process where packages that look unused are suggested for removal, or to integrate with a sandbox mechanism. If, for example, dirty builds are usually installed into a separate package DB, that package DB could just be removed completely by a user from time to time.
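For the interactive option, the list of packages that "look unused" could be computed as everything not reachable from a set of instances the user wants to keep. The sketch below uses a hypothetical, simplified view of a package DB as a list of (instance id, dependency ids) pairs. Its result is only a suggestion list, because reverse dependencies outside the DB are invisible to it.

```haskell
import Data.Maybe (fromMaybe)

-- Hypothetical simplified package DB: instance id paired with its dependencies.
type Db = [(String, [String])]

-- All instances reachable from the given roots by following dependencies.
reachable :: Db -> [String] -> [String]
reachable db = go []
  where
    go seen [] = seen
    go seen (x : xs)
      | x `elem` seen = go seen xs
      | otherwise     = go (x : seen) (fromMaybe [] (lookup x db) ++ xs)

-- Instances that nothing the user wants to keep depends on, in DB order.
removalCandidates :: [String] -> Db -> [String]
removalCandidates roots db = [ i | (i, _) <- db, i `notElem` keep ]
  where
    keep = reachable db roots
```

How the roots are chosen (explicitly installed applications, packages in the user's current environment, and so on) is left open here, as it is in the text above.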
Currently open design decisions
InstalledPackageId and install path
Options for uniquely identifying InstalledPackageId:
- Cabal hash only
- Cabal + ABI hash (truly unique)
- random number
Options for identifying install path:
- Cabal hash
- random number
ABI hash cannot be in install path because it's only available after build.
Handling of dirty builds
How should hash computation work for dirty builds?
- Use a random number even if we otherwise use hashes
- Hash the complete build directory
- Attempt to make a clean (sdist-like) copy or linked copy of the sources and hash and build from that.
- Use the Cabal file to determine the files that would end up in an sdist and hash those directly without copying.
The third option has the advantage(?) that the build is more guaranteed to use only files actually mentioned in the Cabal file.
To what degree should we distinguish package instances?
- Only package versions transitively
- Ways and Cabal flags
- All Haskell-specific info that we can query
- Even non-Haskell-specific inputs such as OS dependencies
InstalledPackageInfo and solver algorithm
Options for InstalledPackageInfo:
- Only add Cabal hash.
- Add (nearly) all information, but in an extensible format.
- Add all information in a way that ghc-pkg itself can use it.
[These aren't necessarily mutually exclusive.]
Options for the solver:
- Direct (see above): requires a certain amount of info in the InstalledPackageInfo.
- Agnostic (except for builtin packages): could be done with only the Cabal hash in InstalledPackageInfo.
Simplistic dependency resolution
Options (in order of preference):
- use the time of installation
- user-specified priorities
- pick a random / unspecified instance
- (build a complex solver into GHC)
A combination of the first two seems possible and useful.
In the following, we discuss some other issues which are related to the multi-instance problem, but not necessarily directly relevant in order to produce an implementation.
Separating storage and selection of packages
Currently the two concepts of storing package instances (cabal store) and selecting package instances for building (environment) are conflated into a PackageDB. Sandboxes are used as a workaround to create multiple different environments, but they also create multiple places to store installed packages. The disadvantages of this are disk usage and compilation time, and it becomes easy to lose track of what is installed where. Also, if the multi-instance restriction is not lifted, sandboxes will eventually suffer from the same unintended breakage of packages as non-sandboxed PackageDBs. There should be a separation between the set of all installed packages, called the cabal store, and a subset of these, called an environment. While the cabal store can contain multiple instances of the same package version, an environment needs to be consistent. An environment is consistent if for every package version it contains only one instance of that package version.
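The consistency condition on environments is easy to state as a predicate. The sketch below uses a hypothetical `Inst` record; it treats an environment as a plain list of instances and ignores duplicates of the very same instance.

```haskell
import Data.List (nub)

-- Hypothetical description of one installed instance.
data Inst = Inst
  { pkgVersion :: String  -- package name and version, e.g. "foo-1.0"
  , instanceId :: String  -- unique instance identifier
  } deriving (Eq, Show)

-- An environment is consistent if no package version occurs
-- with more than one distinct instance.
consistent :: [Inst] -> Bool
consistent env = length versions == length (nub versions)
  where
    versions = map pkgVersion (nub env)  -- the same instance listed twice counts once
```

The cabal store itself would not be expected to satisfy this predicate; only the subsets selected as environments would.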
First class environments
It would be nice if we had some explicit notion of an environment.
Questions to remember
What about builtin packages like ghc-prim, base, rts and so on?
Who has assumptions about the directory layout of installed packages?
Custom Builds and BuildHooks?
Other Compilers, backwards compatibility?
What is ComponentLocalBuildInfo for?