IJCAI 2018, second day

IJCAI 2018 is being held in Stockholm from July the 13th to July the 19th as a part of the Federated AI Meeting. In this and other posts I will give some short comments on interesting things I’ve heard. If anything is wrong or misrepresented, please don’t hesitate to get in touch with me.

On the fifth day of the Federated AI Meeting the main IJCAI conference continued. The conference has many parallel tracks; comments here are from sessions that I thought might be interesting to me.

The second day also included the Angry Birds competition.

Sister Conferences Best Papers: Machine Learning

Emergent Tangled Program Graphs in Multi-Task Learning

This paper plays Atari games using a learning approach that does not use deep learning, but instead learns something called tangled program graphs (TPGs). These are learned using genetic programming. The goal is to have “simpler” or “more natural” behaviors.

The learning is multi-task, in that learning is done across many games and not just one game.

A TPG is a program structure where nodes look at the current state and can make a choice on which node to go to. As a comparison with DQN/deep learning, DQN uses around a million weight parameters and trains on each game for 50M frames. A TPG starts very simple and grows, but remains several orders of magnitude smaller.
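As a rough illustration of the structure (my own sketch with made-up names, not the authors' code), a TPG can be thought of as teams of simple programs that bid on either atomic actions or pointers to other teams; evaluation follows the highest bid until an atomic action is reached:

```python
from dataclasses import dataclass, field

# Hypothetical minimal TPG-style structures; names and details are my own.

@dataclass
class Program:
    # A linear program is abstracted here as a fixed linear scoring function.
    weights: list

    def bid(self, state):
        return sum(w * s for w, s in zip(self.weights, state))

@dataclass
class Learner:
    program: Program
    action: object  # either an int (atomic action) or a Team to defer to

@dataclass
class Team:
    learners: list = field(default_factory=list)

def evaluate(team, state, visited=None):
    """Follow the highest-bidding learner; recurse when it points to a team."""
    visited = visited or set()
    visited.add(id(team))
    best = max(team.learners, key=lambda l: l.program.bid(state))
    if isinstance(best.action, Team) and id(best.action) not in visited:
        return evaluate(best.action, state, visited)
    return best.action if not isinstance(best.action, Team) else 0

# Tiny example: two teams, one deferring to the other.
inner = Team([Learner(Program([0.5, -0.2]), action=3)])
root = Team([Learner(Program([1.0, 1.0]), action=inner),
             Learner(Program([-1.0, 0.0]), action=1)])
print(evaluate(root, state=[0.3, 0.7]))  # follows root -> inner -> action 3
```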

Looking at the graphs produced, it is possible to see that certain nodes are only used in some specific games.

Make Evasion Harder: An Intelligent Android Malware Detection System

A phone infected with malware is catastrophic, since it can be used to steal credentials and even money directly. This paper extracts features and API calls from Android apps and groups them. This is then used with something called multi-kernel learning to determine whether an app contains malware.

Compared with standard methods on an industrial data set, the new method improves the detection rate.

Time Series Chains: A Novel Tool for Time Series Data Mining

The key idea here is to extract series of data points that are locally similar. Failures are detected when two extracted values that are further apart in time are no longer similar enough.

An example of failure detection that can use this method is monitoring a freezer's power consumption. It will have dips when the compressor does not need to run. As the freezer gradually fails, the dips become shorter and shorter, since the compressor needs to work more to keep the temperature, while any individual pair of consecutive dips remains similar enough.
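As a toy illustration of the drift idea (my own simplification, not the authors' chain-extraction algorithm), consecutive dip lengths stay similar while the endpoints of the series drift far apart:

```python
# Toy sketch of the drift idea behind time series chains (my own simplification):
# consecutive elements of a chain are similar, but the endpoints drift apart.

def gradual_drift(values, step_tol, drift_tol):
    """Return True if consecutive values are similar but the endpoints differ a lot."""
    consecutive_ok = all(abs(a - b) <= step_tol
                         for a, b in zip(values, values[1:]))
    endpoints_drifted = abs(values[-1] - values[0]) > drift_tol
    return consecutive_ok and endpoints_drifted

# Dip durations (minutes) of a slowly failing freezer compressor cycle.
dips = [30, 29, 28, 26, 25, 23, 21, 20, 18, 16]
print(gradual_drift(dips, step_tol=3, drift_tol=10))  # True: gradual failure
```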

TensorCast: Forecasting Time-Evolving Networks with Contextual Information

The example domain here is tweets by users at certain times with certain hashtags. Creating a multi-dimensional matrix (tensor) indexed by, for example, users, hashtags, and time, or users, users, and time, can be used to extract information about how things relate using factorizations. Such tensors can then be combined to make future predictions.

Lots of math later, it is also possible to rank the predictions by likelihood. The methods take time linear in the number of non-zeros in the input tensors.
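As a very simplified stand-in for the idea (a plain low-rank matrix factorization on synthetic data, not the paper's coupled tensor method), one can factorize a (user × hashtag) interaction matrix and use the reconstruction to score unobserved pairs as predictions:

```python
import numpy as np

# Simplified stand-in: low-rank factorization of observed interactions,
# then scoring unobserved (user, hashtag) pairs from the reconstruction.

rng = np.random.default_rng(0)
n_users, n_tags, rank = 50, 30, 5

# Synthetic interaction counts with low-rank structure; ~30% observed.
ground_truth = rng.random((n_users, rank)) @ rng.random((rank, n_tags))
observed = np.where(rng.random((n_users, n_tags)) < 0.3, ground_truth, 0.0)

# Truncated SVD as the factorization step.
u, s, vt = np.linalg.svd(observed, full_matrices=False)
reconstruction = u[:, :rank] @ np.diag(s[:rank]) @ vt[:rank, :]

# Rank unobserved (user, hashtag) pairs by predicted score.
unobserved = np.argwhere(observed == 0.0)
scores = reconstruction[unobserved[:, 0], unobserved[:, 1]]
top = unobserved[np.argsort(-scores)[:5]]
print("top predicted (user, hashtag) pairs:", top.tolist())
```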

Another example domain is predicting where authors will publish. The data is from DBLP, using co-authorship and conferences, and it is possible to make fairly good predictions.

It is also possible to predict novel hashtags and authors publishing in new conferences, although naturally with lower accuracy.

A Genetic Programming Approach to Designing Convolutional Neural Network Architectures

Designing Convolutional Neural Networks (CNNs) is very hard. The paper uses a graph that represents a CNN, with nodes representing typical building blocks such as convolutional layers and tensor operations, and then applies genetic programming to optimize the CNN.
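To make the search loop concrete, here is a toy sketch of a genetic-programming-style architecture search (my own illustration with made-up layer names; the paper uses a graph-based encoding and trains real CNNs as the fitness function):

```python
import random

# Toy architecture-search loop; the fitness function is a placeholder for
# "train the decoded CNN and return validation accuracy".

LAYER_CHOICES = ["conv3x3-32", "conv3x3-64", "conv5x5-32", "pool2x2", "sum", "concat"]

def random_genome(length=6):
    return [random.choice(LAYER_CHOICES) for _ in range(length)]

def mutate(genome, rate=0.2):
    return [random.choice(LAYER_CHOICES) if random.random() < rate else gene
            for gene in genome]

def fitness(genome):
    # Placeholder objective: prefer genomes with roughly three conv layers.
    return -abs(len([g for g in genome if g.startswith("conv")]) - 3)

random.seed(1)
population = [random_genome() for _ in range(10)]
for generation in range(20):
    population.sort(key=fitness, reverse=True)
    parents = population[:5]
    population = parents + [mutate(random.choice(parents)) for _ in range(5)]
print("best architecture sketch:", max(population, key=fitness))
```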

Distributing Frank-Wolfe via Map-Reduce

Frank-Wolfe is an algorithm for optimizing a function over a convex feasible set. The paper expresses this algorithm using map-reduce so that it is parallelizable, implements it in Spark, and tests it on a couple of hundred CPUs.
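For reference, the basic single-machine Frank-Wolfe iteration that the paper distributes looks as follows (my own minimal sketch); the example minimizes a least-squares objective over the probability simplex, where the linear minimization step simply picks the vertex with the smallest gradient coordinate:

```python
import numpy as np

# Minimal Frank-Wolfe sketch: minimize f(x) = ||A x - b||^2 over the simplex.

rng = np.random.default_rng(42)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
x = np.full(10, 1.0 / 10)  # feasible starting point

for t in range(200):
    grad = 2 * A.T @ (A @ x - b)          # gradient of the objective
    s = np.zeros_like(x)
    s[np.argmin(grad)] = 1.0              # linear minimization over the simplex
    gamma = 2.0 / (t + 2.0)               # standard step size
    x = (1 - gamma) * x + gamma * s       # move toward the chosen vertex

print("objective:", float(np.sum((A @ x - b) ** 2)))
```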

The Moral Machine Experiment by Jean-Francois Bonnefon

The moral machine experiment can be very depressing, since people are going to die. The goal is a rush hour scenario with self-driving cars.

On the way there accidents will happen.

Google

An accident with a pedestrian may be considered very undesirable and worse than an accident with another vehicle.

Ford

Whose lives are [the cars] going to save? There are a lot of ethical issues as a society we have to work through.

Mercedes

We will implement both the legal framework and what is deemed socially acceptable

Start with easy scenarios: kill 1 or 10 people on the road? Should the car kill 5 people on the road, or its 1 passenger? People say that the car should kill its passenger, even if they are the passenger. But they would not buy that car.

It becomes much harder to answer when adding differences between the people. What about an old couple jaywalking vs. a little girl behaving legally? Adding features makes the number of possible scenarios immense. And the culture of the people responding might influence the answer.

In order to get these answers, the Moral Machine web site was set up to elicit preferences from people around the world for very many different scenarios. A lot of work went into the web site to make it viral, getting people into the fun part of choosing whom to kill. One such trick is telling people what kind of person they are. When youtubers started posting videos of themselves playing the Moral Machine, it became very popular. Even Obama started talking about it.

Each run of the Moral Machine tests one random scenario and two scenarios each comparing:

  • Number
  • Species
  • Gender
  • Age
  • Health
  • Status

While species is not that interesting a question, it was very useful for making it more viral. In total, more than 40M choices were made, from millions of visitors, and 500K+ surveys were filled in.

The results show that children and pregnant women are worth most, and that a criminal is worth more than a cat but less than a dog. When comparing scenarios, changing an animal to a person makes it much more likely (60% more likely) to save that group (all else being equal). However, when just adding more people, it requires adding 3 people to get the same effect.

One scary finding is that being a homeless person has the same effect as jaywalking.

Comparing countries, Europe is very similar. However, it is more interesting to compare across the world. Clustering each country blindly using the moral decisions, three big clusters emerge: a western cluster with Europe, the US, and former British colonies; an eastern cluster with Asian and Muslim countries; and a southern cluster with the whole of South America together with France and former French colonies.

The western cluster is mostly “average”, but that is partly because it represents half of the respondents. In the eastern cluster, age is not as important (that is, they don’t kill old people as much), nor is the number of people. The southern cluster cares a lot about saving women and about social status (that is, they kill the homeless person more).

The results correlate quite well with social factors in the different countries. Social inequality correlates with caring about social status; missing female babies (perhaps due to selective abortion or infanticide) correlates with killing women more.

Germany has made standards that say that saving people is mandatory, using the number of people is optional, but using personal characteristics is forbidden (for example, preferring to save the young is not allowed).

Important caveat: the Moral Machine should not be used to vote on ethics, but as a starting point for discussion.

CSAT-SAT - Satisfiability

DMC: A Distributed Model Counter

Model counting is about finding the number of satisfying assignments of a SAT formula. This can be used for model checking, hardware testing, and other applications. However, it is very difficult (#P-complete), and it is hard even for some classes of SAT formulas that are easy to solve.

The class #P is hard according to Toda’s theorem, which says that the whole polynomial hierarchy is contained in P given an oracle that can solve #P problems.

There are both exact model counters (search-based and compilation-based) and approximate model counters, with heuristics and dedicated pre-processing. Still, many interesting cases are hard to solve.

Instead of trying to find better algorithms, the question here is how the work should be distributed among different computers. The approach uses a master node and many workers, with work-stealing.

Personally, I’m wondering if it would not be simpler to just generate k*n additional clauses partitioning the search space and to distribute these among the n computers. This draws on both XOR model counting and embarrassingly parallel search. The former says that XOR constraints are useful for partitioning the search space, and the latter says that with a value of k one to two orders of magnitude larger than n, the workload will be similar enough.
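A minimal sketch of what I mean (my own illustration, not anything from the paper): pick a few random XOR constraints over the variables, and let each combination of parities define one cell of the partition that a worker can count independently:

```python
import itertools
import random

# Pick j random XOR constraints; the 2^j choices of parities partition the set
# of all assignments into disjoint cells, each countable by a different worker.

def xor_partition_cells(num_vars, num_xors, seed=0):
    rng = random.Random(seed)
    subsets = [[v for v in range(1, num_vars + 1) if rng.random() < 0.5]
               for _ in range(num_xors)]
    for parities in itertools.product([0, 1], repeat=num_xors):
        # One cell: each XOR over `subset` must have the given parity.
        yield list(zip(subsets, parities))

for i, cell in enumerate(xor_partition_cells(num_vars=6, num_xors=2)):
    print("cell", i, cell)

# The total model count is the sum of the per-cell counts, since every
# assignment satisfies exactly one parity combination.
```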

Conflict Directed Clause Learning for Maximum Weighted Clique Problem

Maximum weighted clique is a common subproblem. In this work, the problem is attacked by writing a global constraint for the Minicsp solver, which is a hybrid SAT/CP solver.

The global constraint says that “X is a clique of weight larger than k”. An upper bound can be found using graph coloring in the unweighted case. Multi-coloring, where each node requires several colors, can be used for upper bounds on weighted cliques. This can be done using a method by Tavares that takes cubic time in the number of nodes, using shades of colors.
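As background, the plain coloring bound (which the multi-coloring bound with shades refines) can be sketched as follows; a clique contains at most one vertex per color class, so summing the heaviest vertex of each class bounds the best achievable clique weight (my own sketch, not the paper's code):

```python
# Basic coloring-based upper bound for maximum weighted clique.

def greedy_coloring(adjacency):
    colors = {}
    for v in adjacency:
        used = {colors[u] for u in adjacency[v] if u in colors}
        colors[v] = next(c for c in range(len(adjacency)) if c not in used)
    return colors

def clique_weight_upper_bound(adjacency, weight):
    colors = greedy_coloring(adjacency)
    best_in_class = {}
    for v, c in colors.items():
        best_in_class[c] = max(best_in_class.get(c, 0), weight[v])
    return sum(best_in_class.values())

# Small example: a triangle {a, b, c} plus a pendant vertex d.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
weights = {"a": 4, "b": 3, "c": 2, "d": 10}
print(clique_weight_upper_bound(graph, weights))  # 16 >= true optimum 14
```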

Pruning is done by counting colors depending on whether a node is in the clique or not. It is also possible to extract explanations from the graph based on the number of colors.

Compared with other clique solvers, it is better given enough time.

My reflection is that I hope the propagator would also be effective in a standard CP system such as Gecode, and not only in hybrid SAT/CP systems. That is, how much of the benefit comes from pruning and how much from explanations?

Divide and Conquer: Towards Faster Pseudo-Boolean Solving

Introduces a new pseudo-Boolean solver called RoundingSat that uses division instead of saturation for conflict analysis. This makes it possible to use only fixed-precision arithmetic instead of arbitrary-precision arithmetic.
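For intuition, here is a sketch of the cutting-planes division rule on a pseudo-Boolean constraint (my own illustration of the rule, not the solver's code): dividing all coefficients and the degree by d and rounding up keeps the constraint sound while keeping the numbers small:

```python
import math

# Division rule for a pseudo-Boolean constraint sum(a_i * l_i) >= k with
# 0/1 literals l_i and nonnegative coefficients a_i: divide by d, round up.

def divide_constraint(coeffs, degree, d):
    new_coeffs = {lit: math.ceil(a / d) for lit, a in coeffs.items()}
    return new_coeffs, math.ceil(degree / d)

# Example: 6*x1 + 6*x2 + 3*x3 >= 7, divided by 3, becomes 2*x1 + 2*x2 + x3 >= 3.
print(divide_constraint({"x1": 6, "x2": 6, "x3": 3}, degree=7, d=3))
```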

In competitions, the new solver was the best one in the tracks where it was supposed to work well. In particular, RoundingSat is best when cutting-planes reasoning is needed, while for pure CDCL-style reasoning it is still competitive.

My comment is that the title is a great pun!

Seeking Practical CDCL Insights from Theoretical SAT Benchmarks

In theory SAT should be exponentially hard, but practitioners didn’t care and implemented surprisingly effective solvers. How do we understand what the implementation tricks actually mean?

The applied approach is to run examples and try to analyze them. This turns out to be hard since instances are very diverse. Theoretical approaches using proof complexity are also hard, since proof complexity only cares about the existence of proofs, not how they are found.

The combined approach is to take interesting formulas from the proof complexity literature and run these through instrumented solvers. The gain here is that it is possible to test the full matrix of solver options against families of benchmarks. In total, more than 1k instances from 27 families were tested against more than 650 solver configurations.

Lots of very interesting results (too hard to summarize here); the paper seems well worth studying.

For example, the LBD-based clause purging heuristic seems to have an actual theoretical advantage over classical activity-based purging. Branching is very important, and interestingly all the heuristics are better than random.

Complexity of n-Queens Completion

The paper actually started as a discussion on Facebook about a StackExchange question. The problem studied is n-queens completion, where k queens are already placed and the task is to place the remaining n-k queens.

This is a problem first proposed in 1850, and now we know that it is NP-complete.
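Just to make the problem statement concrete, here is a small brute-force completion solver (my own illustration; the paper is about the complexity of the problem, not about solving it this way):

```python
# Brute-force n-queens completion: extend a partial placement {col: row}.

def complete(n, placed):
    cols = [c for c in range(n) if c not in placed]

    def ok(col, row, board):
        # No shared row and no shared diagonal with the queens on the board.
        return all(row != r and abs(row - r) != abs(col - c)
                   for c, r in board.items())

    def search(board, remaining):
        if not remaining:
            return board
        col = remaining[0]
        for row in range(n):
            if ok(col, row, board):
                result = search({**board, col: row}, remaining[1:])
                if result is not None:
                    return result
        return None

    if any(not ok(c, r, {k: v for k, v in placed.items() if k != c})
           for c, r in placed.items()):
        return None  # pre-placed queens already attack each other
    return search(dict(placed), cols)

# Complete a 6-queens board where queens are already fixed in columns 0 and 3.
print(complete(6, {0: 1, 3: 0}))
```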

One interesting aspect of the construction is that a constraint solver was used to find the minimum configuration of boards embedded in a larger board, finding a minimal 19×19 multi-board.

Model-free, Model-based, and General Intelligence

Hector Geffner gave an invited talk on integrating planning and learning.

We are not in the singularity yet, at least not until we can solve Blocks World

CSAT-ML - Constraints, Satisfiability and Learning

Descriptive Clustering: ILP and CP Formulations with Applications

In contrast to normal clustering, where clusters are first constructed and explanations are then generated for them, this paper tries to construct clusters alongside their explanations.

Given a data set with both features and descriptions, the objective is to create clusters based on the features that are also reasonable given the tags. Three different objectives are defined, with different levels of requirements.

As an example, animal pictures are clustered based on SIFT features and described using tags such as “has tail”, “fuzzy”, and so on. The results give very natural clusters.

The implementation uses both ILP (Gurobi) and CP (Gecode).

Machine Learning and Constraint Programming for Relational-To-Ontology Schema Mapping

The goal of this application paper is to find a common standard ontology for a variety of data sources, and then to map those data sources to the common ontology. This is normally done using machine learning, but here it is done using a combination of machine learning and CP.

The CP part is essentially solving a Steiner tree problem (published in AAAI 2016) combined with frequent pattern mining and further side constraints. The nice feature is that it was very easy to experiment with the model.

Faster Training Algorithms for Structured Sparsity-Inducing Norm

Interestingly, this was a no-show.

Learning SMT(LRA) Constraints using SMT Solvers

The running example is inheriting an ice cream factory and only knowing the order book (amount of ice cream, type, produced), and learning a model of how the factory works.

The main idea here is that the learning problem for the SMT(LRA) model is itself encoded as an SMT problem. LRA is the theory of linear real arithmetic.

In order to make this effective, only a few examples are encoded at first, and the found model is then checked against all the others. Failing examples are added to the encoding, and solving is done again to find a new model.
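Here is a hedged sketch of that loop for the simplest possible case, learning a single linear inequality with the z3 SMT solver (my own toy example; the paper learns much richer SMT(LRA) formulas and uses its own encoding):

```python
from fractions import Fraction
from z3 import Real, Solver, sat

# Learn one LRA constraint w1*x1 + w2*x2 + b >= 0 consistent with labeled
# points: encode a few examples, solve, check against all, add violators.

w1, w2, b = Real("w1"), Real("w2"), Real("b")

def constraint_for(point, label):
    x1, x2 = point
    lhs = w1 * x1 + w2 * x2 + b
    return lhs >= 0 if label else lhs < 0

def value(model, var):
    v = model[var]
    return Fraction(0) if v is None else v.as_fraction()

def classify(model, point):
    x1, x2 = point
    return value(model, w1) * x1 + value(model, w2) * x2 + value(model, b) >= 0

data = [((0, 0), False), ((2, 2), True), ((3, 1), True), ((0, 3), True),
        ((1, 0), False), ((-1, -1), False), ((4, 4), True)]

solver = Solver()
for (point, label) in data[:2]:        # start with only a few examples
    solver.add(constraint_for(point, label))

while True:
    assert solver.check() == sat
    model = solver.model()
    violated = [(p, l) for (p, l) in data if classify(model, p) != l]
    if not violated:
        break
    solver.add(constraint_for(*violated[0]))  # add a failing example, re-solve

print("learned:", value(model, w1), value(model, w2), value(model, b))
```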

Learning Optimal Decision Trees with SAT

Just using neural networks can be very effective, but the results are not explainable. Decision trees, on the other hand, are very explainable, since they directly encode their decisions.

The problem here is to minimize the size of decision trees in order to hopefully get as understandable a decision tree as possible. The approach is to encode the construction of the decision tree as a SAT problem. The constraints ensure that what is constructed is a) a tree, and b) a correct classifier.

The variables are:

  • Node i is a leaf
  • Node i is the left child of node j
  • Node i is the right child of node j
  • Node i is a parent of node j

Over these the natural constraints are added.

To ensure that it is a classifier, each example that is positive (negative) must be discriminated on a path from the root to a negative (positive) leaf node.

Lots of implied constraints are added to speed up solving.

The new encoding is linear in the number of examples, compared to previous encodings for the problem, which were quadratic in the number of examples.

Search and Planning

The FastMap Algorithm for Shortest Path Computations

The FastMap algorithm takes objects with all pairwise distances and embeds the objects in a Euclidean space. The problem here is to adapt this to a sparse graph. A heuristic algorithm for this is shown in the paper. The algorithm runs in time O(E + V log V), which is quite fast.

Using this embedding gives distances that, while not the true graph distances, approximate them reasonably well. How well depends on how well the graph can be represented in Euclidean space. The distances between embedded points can then be used as a heuristic for the A* algorithm.
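A sketch of the consumer side (my own illustration): A* on a graph where the heuristic is the Euclidean distance between precomputed embedding points, here hand-made stand-ins for a FastMap-style embedding, whose actual construction is in the paper:

```python
import heapq
import math

def astar(graph, embedding, start, goal):
    def h(v):
        (x1, y1), (x2, y2) = embedding[v], embedding[goal]
        return math.hypot(x1 - x2, y1 - y2)  # embedding-based heuristic

    frontier = [(h(start), 0.0, start)]
    best = {start: 0.0}
    while frontier:
        _, g, v = heapq.heappop(frontier)
        if v == goal:
            return g
        if g > best.get(v, math.inf):
            continue  # stale queue entry
        for u, cost in graph[v]:
            ng = g + cost
            if ng < best.get(u, math.inf):
                best[u] = ng
                heapq.heappush(frontier, (ng + h(u), ng, u))
    return math.inf

# Tiny example graph with hand-made embedding coordinates (illustrative only).
graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 1), ("d", 5)],
         "c": [("d", 1)], "d": []}
embedding = {"a": (0, 0), "b": (1, 0), "c": (2, 0), "d": (3, 0)}
print(astar(graph, embedding, "a", "d"))  # 3 via a-b-c-d
```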

Experiments are made on grids with some cells blocked.

A Fast Algorithm for Optimally Finding Partially Disjoint Shortest Paths

The problem is to find a set of k paths that are pairwise disjoint. The paper presents a reasonably efficient algorithm for this; the details are best read in the paper.

Search Progress and Potentially Expanded States in Greedy Best-First Search

This paper presents an interesting way to segment search expansions in GBFS using high-water marks and benches, which are defined over all the possible paths to a node. These can be found by exploring the search space and then using a dynamic programming algorithm that is polynomial in the size of the state space.

The follow-up work Best-Case and Worst-Case Behavior of Greedy Best-First Search is a paper at IJCAI 2018.

This paper, the best robotics paper from ICAPS, attacks planning for robots in a continuous domain while taking into account the actual dynamics of robot motion, which is a differential system.

Problems include narrow corridors where robots need to move away to make way for other robots. The initial solutions are not optimized.

A motion tree is created using search to find places a robot can go. To make this tractable, a discrete abstraction is made from a sampling of the motion trees, and multi-agent search is used in that abstract domain. From the solution in the abstract domain, the paths are transferred back to the motion tree, and collisions are checked for. If any would occur, a re-plan is done.