Features

Metadata Format & Structure

All dataframe columns are cast to the following types:

    def _assign_types(self, df: pd.DataFrame) -> pd.DataFrame:
        """Casts df columns to specified types.

        tmdb_id is shared among all metadata files but only listed for mapping.csv.

        Args:
          df: dataframe whose columns are to be cast
        Returns:
          df with cast columns.
        """
        types = {
            # mapping.csv
            "tmdb_id": "int32",
            "tmdb_id_man": "int32",
            "input": object,  # can be both a string and a path!
            "canonical_input": str,
            # cast.csv
            "cast.adult": bool,
            "cast.gender": "int8",
            "cast.id": int,
            "cast.known_for_department": "category",
            "cast.name": str,
            "cast.original_name": str,
            "cast.popularity": float,
            "cast.profile_path": str,
            "cast.cast_id": "int8",
            "cast.character": str,
            "cast.credit_id": str,
            "cast.order": "int8",
            # collections.csv
            "collection.id": int,
            "collection.name": str,
            "collection.poster_path": str,
            "collection.backdrop_path": str,
            # crew.csv
            "crew.adult": bool,
            "crew.gender": "int8",
            "crew.id": int,
            "crew.known_for_department": "category",
            "crew.name": str,
            "crew.original_name": str,
            "crew.popularity": float,
            "crew.profile_path": str,
            "crew.credit_id": str,
            "crew.department": "category",
            "crew.job": str,
            # genres.csv
            "genres.id": "int8",
            "genres.name": str,
            # production_companies.csv
            "production_companies.id": "int32",
            "production_companies.logo_path": str,
            "production_companies.name": "category",
            "production_companies.origin_country": "category",
            "production_countries.iso_3166_1": "category",
            "production_countries.name": str,
            # spoken_languages.csv
            "spoken_languages.english_name": "category",
            "spoken_languages.iso_3166_1": "category",
            "spoken_languages.name": str,
            # details.csv
            "adult": bool,
            "backdrop_path": str,
            "budget": int,
            "homepage": str,
            "imdb_id": str,
            "original_language": "category",
            "original_title": str,
            "overview": str,
            "popularity": float,
            "poster_path": str,
            "release_date": "datetime64[ns]",
            "revenue": int,
            "runtime": "int16",
            "status": "category",
            "tagline": str,
            "title": str,
            "video": bool,
            "vote_average": float,
            "vote_count": "int16",
        }

        for k, v in types.items():
            if k in df.columns:
                df[k] = df[k].astype(v)
        return df

In the mapping above, tmdb_id is listed only under mapping.csv, but the column is present in all other dataframes as well.
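
As a rough illustration of the casting loop, the sketch below applies a small subset of the mapping to a made-up dataframe. The names TYPES, assign_types and unlisted_column, as well as the sample values, are assumptions for this example and are not part of movieparse.

```python
import pandas as pd

# Illustrative subset of the type mapping above; the frame below is made up.
TYPES = {
    "tmdb_id": "int32",
    "runtime": "int16",
    "original_language": "category",
    "title": str,
    "vote_average": float,
}


def assign_types(df: pd.DataFrame) -> pd.DataFrame:
    """Cast every column that has an entry in TYPES; leave the rest untouched."""
    for column, dtype in TYPES.items():
        # Keys without a matching column (here: title, vote_average) are skipped.
        if column in df.columns:
            df[column] = df[column].astype(dtype)
    return df


details = pd.DataFrame(
    {
        "tmdb_id": [603],
        "runtime": [136],
        "original_language": ["en"],
        "unlisted_column": ["kept as-is"],
    }
)
details = assign_types(details)
print(details.dtypes)
```

Because the loop only touches columns that exist in the dataframe, one mapping can serve every metadata file. Note that pandas raises an error when a column containing missing values is cast to an integer type, so such columns need to be filled or dropped first.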

Manual ID Mapping

Because movieparse picks the most popular match for a given title and release year from the TMDB API, the result might not always be the movie you’re looking for.

If you notice that a tmdb_id doesn’t match the movie you expected, you can specify a tmdb_id_man yourself in mapping.csv:

tmdb_id	tmdb_id_man	input	canonical_input
123	603	1999 The Matrix	1999 The Matrix

The next time you run movieparse, metadata will be looked up for both ids 123 and 603.
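
To make the effect of a manual override concrete, here is a minimal sketch that collects the ids from a mapping file. It assumes mapping.csv is comma-separated with the four columns shown above; the helper ids_to_look_up and the handling of empty cells are hypothetical and may differ from what movieparse does internally.

```python
import csv
import io

# Assumed on-disk layout of mapping.csv (comma-separated); the row is the example above.
MAPPING_CSV = """tmdb_id,tmdb_id_man,input,canonical_input
123,603,1999 The Matrix,1999 The Matrix
"""


def ids_to_look_up(text: str) -> list[int]:
    """Collect every id found in the tmdb_id and tmdb_id_man columns."""
    ids = set()
    for row in csv.DictReader(io.StringIO(text)):
        for column in ("tmdb_id", "tmdb_id_man"):
            value = (row.get(column) or "").strip()
            if value:  # hypothetical: treat an empty cell as "no id given"
                ids.add(int(value))
    return sorted(ids)


print(ids_to_look_up(MAPPING_CSV))
```

For the example row this yields both the automatically matched id and the manual one, which mirrors the statement that metadata is looked up for both.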

Parsing Styles

Supported Patterns & Examples

Several patterns are provided for extracting the release year and title. They are illustrated with the following data:

year	title
1999	The Matrix

Pattern #	Input
0	1999 The Matrix
1	1999 - The Matrix
2	The Matrix - 1999

Tip

You can click the link in the second column to get a visual representation on regex-vis. You can also test if your examples match!

To get a list of valid styles, you can run Movieparse.get_parsing_patterns():

    @staticmethod
    def get_parsing_patterns() -> dict[int, re.Pattern[str]]:
        """Lists all valid patterns for extracting title and (optionally) release year from input.

        Returns:
          A dict mapping integer keys to their regex pattern.
        """
        return {
            0: re.compile(r"^(?P<disk_year>\d{4})\s{1}(?P<disk_title>.+)$"),
            1: re.compile(r"^(?P<disk_year>\d{4})\s-\s(?P<disk_title>.+)$"),
            2: re.compile(r"^(?P<disk_title>.+)\s(?P<disk_year>\d{4})$"),
        }
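
The snippet below is a small sketch of how these patterns can be tried against the example inputs from the table. The regexes are copied from get_parsing_patterns(); the names PATTERNS, EXAMPLES and extract are made up for this illustration.

```python
import re
from typing import Optional

# Patterns copied from get_parsing_patterns() above.
PATTERNS = {
    0: re.compile(r"^(?P<disk_year>\d{4})\s{1}(?P<disk_title>.+)$"),
    1: re.compile(r"^(?P<disk_year>\d{4})\s-\s(?P<disk_title>.+)$"),
    2: re.compile(r"^(?P<disk_title>.+)\s(?P<disk_year>\d{4})$"),
}

# Example inputs from the table above, keyed by pattern number.
EXAMPLES = {
    0: "1999 The Matrix",
    1: "1999 - The Matrix",
    2: "The Matrix - 1999",
}


def extract(style: int, text: str) -> Optional[dict]:
    """Return the named groups for `text`, or None if the pattern does not match."""
    match = PATTERNS[style].match(text)
    return match.groupdict() if match else None


results = {style: extract(style, text) for style, text in EXAMPLES.items()}
for style, groups in results.items():
    print(style, groups)
```

Each match exposes the named groups disk_year and disk_title. An input that does not fit the pattern, for example extract(1, "1999 The Matrix"), returns None.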

Unsupported Patterns

If you feel like a pattern is missing, feel free to create a Pull Request!

Automatic Choice of Patterns

If you leave the parsing_style argument empty, it defaults to -1, which means the best pattern for your input is estimated automatically.

Every pattern is applied to your input, and the pattern with the highest accuracy is used for the lookups. No lookups are made for input that doesn’t comply with the chosen pattern.

If you’re using strict: True, a fallback lookup may be made with title only.
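
One possible way to picture the automatic choice is sketched below. It assumes that "accuracy" means the share of inputs a pattern matches, which may not be exactly how movieparse scores patterns; the names match_rates, best_style and skipped, as well as the sample inputs, are hypothetical.

```python
import re

# Patterns copied from get_parsing_patterns().
PATTERNS = {
    0: re.compile(r"^(?P<disk_year>\d{4})\s{1}(?P<disk_title>.+)$"),
    1: re.compile(r"^(?P<disk_year>\d{4})\s-\s(?P<disk_title>.+)$"),
    2: re.compile(r"^(?P<disk_title>.+)\s(?P<disk_year>\d{4})$"),
}


def match_rates(inputs: list) -> dict:
    """Share of inputs matched by each pattern (assumed meaning of 'accuracy')."""
    return {
        style: sum(1 for text in inputs if pattern.match(text)) / len(inputs)
        for style, pattern in PATTERNS.items()
    }


# Made-up inputs: two follow pattern 0, one follows pattern 2.
inputs = [
    "1999 The Matrix",
    "2003 The Matrix Reloaded",
    "The Matrix Revolutions 2003",
]

rates = match_rates(inputs)
best_style = max(rates, key=rates.get)

# Inputs that do not comply with the chosen pattern would get no lookup.
skipped = [text for text in inputs if not PATTERNS[best_style].match(text)]

print(rates)
print(best_style, skipped)
```

With a plain match rate, pattern 0 also matches everything that pattern 1 matches, so the two can tie. The actual estimate presumably uses a finer criterion to tell them apart.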

Error Codes / Invalid Ids

Movieparse.default_codes

    default_codes = {
        "DEFAULT": 0,
        "NO_RESULT": -1,
        "NO_EXTRACT": -2,
        "BAD_RESPONSE": -3,