QuerySet Extensions
===================

MySQL-specific Model and QuerySet extensions. To add these to your
``Model``/``Manager``/``QuerySet`` trifecta, see :doc:`installation`. Methods
below are all ``QuerySet`` methods; where standalone forms are referred to,
they can be imported from ``django_mysql.models``.

.. currentmodule:: django_mysql.models

.. _approximate-counting:

Approximate Counting
--------------------

.. method:: approx_count(fall_back=True, \
                         return_approx_int=True, \
                         min_size=1000)

    By default a QuerySet's `count()` method runs `SELECT COUNT(*)` on a table.
    Whilst this is fast for ``MyISAM`` tables, for ``InnoDB`` it involves a
    full table scan to produce a consistent number, due to MVCC keeping several
    copies of rows when under transaction. If you have lots of rows, you will
    notice this as a slow query - `Percona have some more details
    <https://www.percona.com/blog/2006/12/01/count-for-innodb-tables/>`_.

    This method returns the approximate count found by running ``EXPLAIN SELECT
    COUNT(*) ...``. It can be out by 30-50% in the worst case, but in many
    applications it is closer, and is good enough, such as when presenting many
    pages of results but users will only practically scroll through the first
    few. For example:

    .. code-block:: pycon

        >>> Author.objects.count()  # slow
        509741
        >>> Author.objects.approx_count()  # fast, with some error
        531140

    Three arguments are accepted:

    .. attribute:: fall_back=True

        If ``True`` and the approximate count cannot be calculated, ``count()``
        will be called and returned instead, otherwise ``ValueError`` will be
        raised.

        The approximation can only be found for ``objects.all()``, with no
        filters, ``distinct()`` calls, etc., so it's reasonable to fall back.

    .. attribute:: return_approx_int=True

        When ``True``, an ``int`` is not returned (excpet when falling back),
        but instead a subclass called ``ApproximateInt``. This is for all
        intents and purposes an ``int``, apart from when cast to ``str``, it
        renders as e.g. **'Approximately 12345'** (internationalization
        ready). Useful for templates you can't edit (e.g. the admin) and you
        want to communicate that the number is not 100% accurate. For example:

        .. code-block:: pycon

            >>> print(Author.objects.approx_count())  # ApproximateInt
            Approximately 531140
            >>> print(Author.objects.approx_count() + 0)  # plain int
            531140
            >>> print(Author.objects.approx_count(return_approx_int=False))  # plain int
            531140

    .. attribute:: min_size=1000

        The threshold at which to use the approximate algorithm; if the
        approximate count comes back as less that this number, ``count()`` will
        be called and returned instead, since it should be so small as to not
        bother your database. Set to ``0`` to disable this behaviour and always
        return the approximation.

        The default of ``1000`` is a bit pessimistic - most tables won't take
        long when calling ``COUNT(*)`` on tens of thousands of rows, but it
        *could* be slow for very wide tables.

.. method:: count_tries_approx(activate=True, fall_back=True, \
                               return_approx_int=True, min_size=1000)

        This is the 'magic' method to make pre-existing code, such as Django's
        admin, work with ``approx_count``. Calling ``count_tries_approx`` sets
        the QuerySet up such that then calling ``count`` will call
        ``approx_count`` instead, with the given arguments.

        To unset this, call ``count_tries_approx`` with ``activate=False``.

        To 'fix' an Admin class with this, simply do the following (assuming
        ``Author`` inherits from ``django_mysql``'s ``Model``):

        .. code-block:: python

            class AuthorAdmin(ModelAdmin):
                def get_queryset(self, request):
                    qs = super(AuthorAdmin, self).get_queryset(request)
                    return qs.count_tries_approx()

        You'll be able to see this is working on the pagination due to the word
        **'Approximately'** appearing:

        .. figure::  images/approx_count_admin.png
           :align:   center

        You can do this at a base class for all your ``ModelAdmin`` subclasses
        to apply the magical speed increase across your admin interface.


.. _query_hints:

Query Hints
-----------

The following methods add extra features to the ORM which allow you to access
some MySQL-specific syntax. They do this by inserting special comments which
pass through Django’s ORM layer and get re-written by a function that wraps the
lower-level ``cursor.execute()``.

Because not every user wants these features and there is a (small) overhead to
every query, you must activate this feature by adding to your settings:

.. code-block:: python

    DJANGO_MYSQL_REWRITE_QUERIES = True

Once you’ve done this, the following methods will work.

.. method:: label(comment)

    Allows you to add an arbitrary comment to the start of the query, as the
    second thing after the keyword. This can be used to 'tag' queries so that
    when they show in the `slow_log` or another monitoring tool, you can easily
    back track to the python code generating the query. For example, imagine
    constructing a QuerySet like this:

    .. code-block:: python

        qs = Author.objects.label("AuthorListView").all()

    When executed, this will have SQL starting:

    .. code-block:: mysql

        SELECT /*AuthorListView*/ ...

    You can add arbitrary labels, and as many of them as you wish - they will
    appear in the order added. They will work in ``SELECT`` and ``UPDATE``
    statements, but not in ``DELETE`` statements due to limitations in the way
    Django performs deletes.

    You should not pass user-supplied data in for the comment. As a basic
    protection against accidental SQL injection, passing a comment featuring
    ``*/`` will raise a ``ValueError``, since that would prematurely end the
    comment. However due to `executable comments
    <https://mariadb.com/kb/en/mariadb/comment-syntax/>`_, the comment is still
    prone to some forms of injection.

    However this is a feature - by not including spaces around your string,
    you may use this injection to use `executable comments
    <https://mariadb.com/kb/en/mariadb/comment-syntax/>`_ to add hints that are
    otherwise not supported, or to use `MySQL 5.7+ optimizer hints
    <https://www.percona.com/blog/2015/04/30/optimizer-hints-mysql-5-7-7-missed-manual/>`_.

.. method:: straight_join()

    Adds the ``STRAIGHT_JOIN`` hint, which forces the join order during a
    ``SELECT``. Note that you can’t force Django’s join order, but it tends to
    be in the order that the tables get mentioned in the query.

    Example usage:

    .. code-block:: python

        # Note from Adam: sometimes the optimizer joined books -> author, which
        # is slow. Force it to do author -> books.
        Author.objects.distinct().straight_join().filter(books__age=12)[:10]

    Docs:
    `MySQL <https://dev.mysql.com/doc/refman/en/select.html>`__ /
    `MariaDB <https://mariadb.com/kb/en/mariadb/select/#straight_join>`__.

    The MariaDB docs also have a good page `Index Hints: How to Force Query
    Plans”
    <https://mariadb.com/kb/en/mariadb/index-hints-how-to-force-query-plans/>`_
    which covers some cases when you might want to use ``STRAIGHT_JOIN``.

.. method:: sql_small_result()

    Adds the ``SQL_SMALL_RESULT`` hint, which avoids using a temporary table in
    the case of a ``GROUP BY`` or ``DISTINCT``.

    Example usage:

    .. code-block:: python

        # Note from Adam: we have very few distinct birthdays, so using a
        # temporary table is slower
        Author.objects.values("birthday").distinct().sql_small_result()

    Docs:
    `MySQL <https://dev.mysql.com/doc/refman/en/select.html>`__ /
    `MariaDB
    <https://mariadb.com/kb/en/mariadb/select/#sql_small_result-sql_big_result>`__.

.. method:: sql_big_result()

    Adds the ``SQL_BIG_RESULT`` hint, which forces using a temporary table in
    the case of a ``GROUP BY`` or ``DISTINCT``.

    Example usage:

    .. code-block:: python

        # Note from Adam: for some reason the optimizer didn’t use a temporary
        # table for this, so we force it
        Author.objects.distinct().sql_big_result()

    Docs:
    `MySQL <https://dev.mysql.com/doc/refman/en/select.html>`__ /
    `MariaDB
    <https://mariadb.com/kb/en/mariadb/select/#sql_small_result-sql_big_result>`__.

.. method:: sql_buffer_result()

    Adds the ``SQL_BUFFER_RESULT`` hint, which forces the optimizer to use a
    temporary table to process the result. This is useful to free locks as soon
    as possible.

    Example usage:

    .. code-block:: python

        # Note from Adam: seeing a lot of throughput on this table. Buffering
        # the results makes the queries less contentious.
        HighThroughputModel.objects.filter(x=y).sql_buffer_result()

    Docs:
    `MySQL <https://dev.mysql.com/doc/refman/en/select.html>`__ /
    `MariaDB
    <https://mariadb.com/kb/en/mariadb/select/#sql_buffer_result>`__.

.. method:: sql_cache()

    Adds the ``SQL_CACHE`` hint, which means the result set will be stored in
    the `Query Cache
    <https://dev.mysql.com/doc/refman/en/query-cache.html>`_. This only has
    an effect when the MySQL system variable ``query_cache_type`` is set to
    ``2`` or ``DEMAND``.

    .. warning::

        The query cache was removed in MySQL 8.0, and is disabled by default
        from MariaDB 10.1.7.

    Example usage:

    .. code-block:: python

        # Fetch recent posts, cached in MySQL for speed
        recent_posts = BlogPost.objects.sql_cache().order_by("-created")[:5]

    Docs:
    `MariaDB
    <https://mariadb.com/kb/en/mariadb/select/#sql_cache-sql_no_cache>`__.

.. method:: sql_no_cache()

    Adds the ``SQL_NO_CACHE`` hint, which means the result set will not be
    fetched from or stored in the `Query Cache
    <https://dev.mysql.com/doc/refman/en/query-cache.html>`_. This only has
    an effect when the MySQL system variable ``query_cache_type`` is set to
    ``1`` or ``ON``.

    .. warning::

        The query cache was removed in MySQL 8.0, and is disabled by default
        from MariaDB 10.1.7.

    Example usage:

    .. code-block:: python

        # Avoid caching all the expired sessions, since we’re about to delete
        # them
        deletable_session_ids = (
            Session.objects.sql_no_cache().filter(expiry__lt=now()).values_list("id", flat=True)
        )

    Docs:
    `MariaDB
    <https://mariadb.com/kb/en/mariadb/select/#sql_cache-sql_no_cache>`__.

.. method:: sql_calc_found_rows()

    Adds the ``SQL_CALC_FOUND_ROWS`` hint, which means the total count of
    matching rows will be calculated when you only take a slice. You can access
    this count with the ``found_rows`` attribute of the ``QuerySet`` after
    filling its result cache, by e.g. iterating it.

    This can be faster than taking the slice and then again calling
    ``.count()`` to get the total count.

    .. warning::

        This is deprecated in MySQL 8.0.17+.

    Example usage:

    .. code-block:: pycon

        >>> can_drive = Customer.objects.filter(age=21).sql_calc_found_rows()[:10]
        >>> len(can_drive)  # Fetches the first 10 from the database
        10
        >>> can_drive.found_rows  # The total number of 21 year old customers
        1942

    Docs:
    `MySQL <https://dev.mysql.com/doc/refman/en/select.html>`__ /
    `MariaDB
    <https://mariadb.com/kb/en/mariadb/select/#sql_calc_found_rows>`__.

.. method:: use_index(*index_names, for_=None, table_name=None)

    Adds a ``USE INDEX`` hint, which affects the index choice made by MySQL's
    query optimizer for resolving the query.

    Note that index names on your tables will normally have been generated by
    Django and contain a hash fragment. You will have to check your database
    schema to determine the index name.

    If you pass any non-existent index names, MySQL will raise an error. This
    means index hints are especially important to test in the face of
    future schema changes.

    ``for_`` restricts the scope that the index hint applies to. By default it
    applies to all potential index uses during the query; you may supply one of
    ``'JOIN'``, ``'ORDER BY'``, or ``'GROUP BY'`` to restrict the index hint to
    only be used by MySQL for index selection in their respective stages of
    query execution. For more information see the MySQL/MariaDB docs (link
    below).

    ``table_name`` is the name of the table that the hints are for. By default,
    this will be the name of the table of the model that the ``QuerySet`` is
    for, however you can supply any other table that may be joined into the
    query (from e.g. ``select_related()``). Be careful - there is no validation
    on the table name, and if it does not exist in the final query it will be
    ignored. Also it is injected raw into the resultant SQL, so you should not
    use user data otherwise it may open the potential for SQL injection.

    Note that ``USE INDEX`` accepts no index names to mean 'use no indexes',
    i.e. table scans only.

    Example usage:

    .. code-block:: pycon

        # SELECT ... FROM `author` USE INDEX (`name_12345`) WHERE ...
        >>> Author.objects.use_index("name_12345").filter(name="John")
        # SELECT ... FROM `author` USE INDEX (`name_12345`, `name_age_678`) WHERE ...
        >>> Author.objects.use_index("name_12345", "name_age_678").filter(name="John")
        # SELECT ... FROM `author` USE INDEX FOR ORDER BY (`name_12345`) ... ORDER BY `name`
        >>> Author.objects.use_index("name_12345", for_="ORDER BY").order_by("name")
        # SELECT ... FROM `book` INNER JOIN `author` USE INDEX (`authbook`) ...
        >>> Book.objects.select_related("author").use_index("authbook", table_name="author")

    Docs:
    `MySQL <https://dev.mysql.com/doc/refman/en/index-hints.html>`__ /
    `MariaDB
    <https://mariadb.com/kb/en/mariadb/index-hints-how-to-force-query-plans/>`__.


.. method:: force_index(*index_names, for_=None)

    Similar to the above ``use_index()``, but adds a ``FORCE INDEX`` hint. Note
    that unlike ``use_index()`` you must supply at least one index name. For
    more information, see the MySQL/MariaDB docs.


.. method:: ignore_index(*index_names, for_=None)

    Similar to the above ``use_index()``, but adds an ``IGNORE INDEX`` hint.
    Note that unlike ``use_index()`` you must supply at least one index name.
    For more information, see the MySQL/MariaDB docs.


.. _smart-iteration:

'Smart' Iteration
-----------------

Here's a situation we've all been in - we screwed up,
and now we need to fix the data. Let's say we accidentally set the address of all
authors without an address to "Nowhere", rather than the blank string. How can
we fix them??

The simplest way would be to run the following:

.. code-block:: python

    Author.objects.filter(address="Nowhere").update(address="")

Unfortunately with a lot of rows ('a lot' being dependent on your database
server and level of traffic) this will stall other access to the table, since
it will require MySQL to read all the rows and to hold write locks on them in
a single query.

To solve this, we could try updating a chunk of authors at a time; such code
tends to get ugly/complicated pretty quickly:

.. code-block:: python

    min_id = 0
    max_id = 1000
    biggest_author_id = Author.objects.order_by("-id")[0].id
    while True:
        Author.objects.filter(id__gte=min_id, id__lte=...)
        # I'm not even going to type this all out, it's so much code

Here's the solution to this boilerplate with added safety features - 'smart'
iteration! There are two classes; one yields chunks of the given ``QuerySet``,
and the other yields the objects inside those chunks. Nearly every data update
can be thought of in one of these two methods.

.. class:: SmartChunkedIterator(queryset, atomically=True, \
                                status_thresholds=None, pk_range=None, \
                                chunk_time=0.5, chunk_size=2, \
                                chunk_min=1, chunk_max=10000, \
                                report_progress=False, total=None)

    Implements a smart iteration strategy over the given ``queryset``. There is
    a method ``iter_smart_chunks`` that takes the same arguments on the
    ``QuerySetMixin`` so you can just:

    .. code-block:: python

        bad_authors = Author.objects.filter(address="Nowhere")
        for author_chunk in bad_authors.iter_smart_chunks():
            author_chunk.update(address="")

    Iteration proceeds by yielding primary-key based slices of the queryset,
    and dynamically adjusting the size of the chunk to try and take
    ``chunk_time`` seconds. In between chunks, the
    :func:`~django_mysql.status.GlobalStatus.wait_until_load_low` method of
    :class:`~django_mysql.status.GlobalStatus` is called to ensure the database
    is not under high load.

    .. warning::

        Because of the slicing by primary key, there are restrictions on what
        ``QuerySet``\s you can use, and a ``ValueError`` will be raised if the
        queryset doesn't meet that. Specifically, only ``QuerySet``\s on models
        with integer-based primary keys, which are unsliced, and have no
        ``order_by`` will work.

    There are a lot of arguments and the defaults have been picked hopefully
    sensibly, but please check for your case though!

    .. attribute:: queryset

        The queryset to iterate over; if you're calling via
        ``.iter_smart_chunks`` then you don't need to set this since it's the
        queryset you called it on.

    .. attribute:: atomically=True

        If true, wraps each chunk in a transaction via django's
        :func:`transaction.atomic() <django:django.db.transaction.atomic>`.
        Recommended for any write processing.

    .. attribute:: status_thresholds=None

        A dict of status variables and their maximum tolerated values to be
        checked against after each chunk with
        :func:`~django_mysql.status.GlobalStatus.wait_until_load_low`.

        When set to ``None``, the default,
        :class:`~django_mysql.status.GlobalStatus` will use its default of
        ``{"Threads_running": 10}``. Set to an empty dict to disable status
        checking - but this is not really recommended, as it can save you from
        locking up your site with an overly aggressive migration.

        Using ``Threads_running`` is the most recommended variable to check
        against, and is copeid from the default behaviour of
        ``pt-online-schema-change``. The default value of 10 threads is
        deliberately conservative to avoid locking small database servers.
        You should tweak it up based upon the live activity of your server -
        check the running thread count during normal traffic and add some
        overhead.

    .. attribute:: pk_range=None

        Controls the primary key range to iterate over with slices. By default, with
        ``pk_range=None``, the QuerySet will be searched for its minimum and
        maximum ``pk`` values before starting. On QuerySets that match few
        rows, or whose rows aren't evenly distributed, this can still execute a
        long blocking table scan to find these two rows.
        You can remedy this by giving a value for ``pk_range``:

        * If set to ``'all'``, the range will be the minimum and maximum PK
          values of the entire table, excluding any filters you have set up -
          that is, for ``Model.objects.all()`` for the given ``QuerySet``'s
          model.

        * If set to a 2-tuple, it will be unpacked and used as the minimum and
          maximum values respectively.

        .. note::

            The iterator determines the minimum and maximum at the start of
            iteration and does not update them whilst iterating, which is
            normally a safe assumption, since if you're "fixing things" you
            probably aren't creating any more bad data. If you do need to
            process *every* row then set ``pk_range`` to have a maximum far
            greater than what you expect would be reached by inserts that occur
            during iteration.

    .. attribute:: chunk_time=0.5

        The time in seconds to aim for each chunk to take. The chunk size is
        dynamically adjusted to try and match this time, via a weighted average
        of the past and current speed of processing. The default and algorithm
        is taken from the analogous ``pt-online-schema-change`` flag
        `--chunk-time <https://www.percona.com/doc/percona-toolkit/2.1/pt-online-schema-change.html#cmdoption-pt-online-schema-change--chunk-time>`_.

    .. attribute:: chunk_size=2

        The initial size of the chunk that will be used. As this will be
        dynamically scaled and can grow fairly quickly, the initial size of 2
        should be appropriate for most use cases.

    .. attribute:: chunk_min=1

        The minimum number of objects in a chunk. You do not normally need to
        tweak this since the dynamic scaling works very well, however it might
        be useful if your data has a lot of "holes" or if there are other
        constraints on your application.

    .. attribute:: chunk_max=10000

        The maximum number of objects in a chunk, a kind of sanity bound. Acts
        to prevent harm in the case of iterating over a model with a large
        'hole' in its primary key values, e.g. if only ids 1-10k and 100k-110k
        exist, then the chunk 'slices' could grow very large in between 10k and
        100k since you'd be "processing" the non-existent objects 10k-100k very
        quickly.


    .. attribute:: report_progress=False

        If set to true, display out a running counter and summary on
        ``sys.stdout``. Useful for interactive use. The message looks like
        this:

        .. code-block:: pycon

            AuthorSmartChunkedIterator processed 0/100000 objects (0.00%) in 0 chunks

        And uses ``\r`` to erase itself when re-printing to avoid spamming your
        screen.  At the end ``Finished!`` is printed on a new line.

    .. attribute:: total=None

        By default the total number of objects to process will be calculated
        with :func:`~django_mysql.models.QuerySetMixin.approx_count`, with
        ``fall_back`` set to ``True``. This ``count()`` query could potentially
        be big and slow.

        ``total`` allows you to pass in the total number of objects for
        processing, if you can calculate in a cheaper way, for example if you
        have a read-replica to use.


.. class:: SmartIterator

    A convenience subclass of ``SmartChunkedIterator`` that simply unpacks the
    chunks for you. Can be accessed via the ``iter_smart`` method of
    ``QuerySetMixin``.

    For example, rather than doing this:

    .. code-block:: python

        bad_authors = Author.objects.filter(address="Nowhere")
        for authors_chunk in bad_authors.iter_smart_chunks():
            for author in authors_chunk:
                author.send_apology_email()

    You can do this:

    .. code-block:: python

        bad_authors = Author.objects.filter(address="Nowhere")
        for author in bad_authors.iter_smart():
            author.send_apology_email()

    All the same arguments as ``SmartChunkedIterator`` are accepted.


.. class:: SmartPKRangeIterator

    A subclass of ``SmartChunkedIterator`` that doesn't return the chunk's
    ``QuerySet`` but instead returns the start and end primary keys for the
    chunk. This may be useful when you want to iterate but the slices need to
    be used in a raw SQL query. Can be accessed via the
    ``iter_smart_pk_ranges`` method of ``QuerySetMixin``.

    For example, rather than doing this:

    .. code-block:: python

        for authors_chunk in Author.objects.iter_smart_chunks():
            limits = author_chunk.aggregate(min_pk=Min("pk"), max_pk=Max("pk"))
            authors = Author.objects.raw(
                """
                SELECT name from app_author
                WHERE id >= %s AND id <= %s
            """,
                (limits["min_pk"], limits["max_pk"]),
            )
            # etc...

    ...you can do this:

    .. code-block:: python

        for start_pk, end_pk in Author.objects.iter_smart_pk_ranges():
            authors = Author.objects.raw(
                """
                SELECT name from app_author
                WHERE id >= %s AND id < %s
            """,
                (start_pk, end_pk),
            )
            # etc...

    In the first format we were forced to perform a dumb query to determine the
    primary key limits set by ``SmartChunkedIterator``, due to the ``QuerySet``
    not otherwise exposing this information.

    .. note::
         There is a **subtle** difference between the two versions. In the
         first the end boundary, ``max_pk``, is a closed bound, whereas in the
         second, the ``end_pk`` from ``iter_smart_pk_ranges`` is an open bound.
         Thus the ``<=`` changes to a ``<``.

    All the same arguments as ``SmartChunkedIterator`` are accepted.

.. _pt-visual-explain:

Integration with pt-visual-explain
----------------------------------

How does MySQL *really* execute a query? The ``EXPLAIN`` statement
(docs: `MySQL <https://dev.mysql.com/doc/refman/8.0/en/explain.html>`__ /
`MariaDB <https://mariadb.com/kb/en/mariadb/explain/>`__),
gives a description of the execution plan, and the ``pt-visual-explain``
`tool <https://www.percona.com/doc/percona-toolkit/2.2/pt-visual-explain.html>`_
can format this in an understandable tree.

This function is a shortcut to turn a ``QuerySet`` into its visual explanation,
making it easy to gain a better understanding of what your queries really end
up doing.

.. method:: pt_visual_explain(display=True)

    Call on a ``QuerySet`` to print its visual explanation, or with
    ``display=False`` to return it as a string. It prepends the SQL of the
    query with 'EXPLAIN' and passes it through the ``mysql`` and
    ``pt-visual-explain`` commands to get the output. You therefore need the
    MySQL client and Percona Toolkit installed where you run this.

    Example:

    .. code-block:: pycon

        >>> Author.objects.all().pt_visual_explain()
        Table scan
        rows           1020
        +- Table
           table          myapp_author

    Can also be imported as a standalone function if you want to use it on a
    ``QuerySet`` that does not have the ``QuerySetMixin`` added, e.g. for
    built-in Django models:

    .. code-block:: pycon

        >>> from django_mysql.models import pt_visual_explain
        >>> pt_visual_explain(User.objects.all())
        Table scan
        rows           1
        +- Table
           table          auth_user