Features
Metadata Format & Structure
All dataframe columns are cast to the following types:
def _assign_types(self, df: pd.DataFrame) -> pd.DataFrame:
    """Casts df columns to specified types.

    tmdb_id is shared among all metadata files but only listed for mapping.csv.

    Args:
        df: dataframe to be cast

    Returns:
        df with cast columns.
    """
    types = {
        # mapping.csv
        "tmdb_id": "int32",
        "tmdb_id_man": "int32",
        "input": object,  # can be both a string and a path!
        "canonical_input": str,
        # cast.csv
        "cast.adult": bool,
        "cast.gender": "int8",
        "cast.id": int,
        "cast.known_for_department": "category",
        "cast.name": str,
        "cast.original_name": str,
        "cast.popularity": float,
        "cast.profile_path": str,
        "cast.cast_id": "int8",
        "cast.character": str,
        "cast.credit_id": str,
        "cast.order": "int8",
        # collections.csv
        "collection.id": int,
        "collection.name": str,
        "collection.poster_path": str,
        "collection.backdrop_path": str,
        # crew.csv
        "crew.adult": bool,
        "crew.gender": "int8",
        "crew.id": int,
        "crew.known_for_department": "category",
        "crew.name": str,
        "crew.original_name": str,
        "crew.popularity": float,
        "crew.profile_path": str,
        "crew.credit_id": str,
        "crew.department": "category",
        "crew.job": str,
        # genres.csv
        "genres.id": "int8",
        "genres.name": str,
        # production_companies.csv
        "production_companies.id": "int32",
        "production_companies.logo_path": str,
        "production_companies.name": "category",
        "production_companies.origin_country": "category",
        "production_countries.iso_3166_1": "category",
        "production_countries.name": str,
        # spoken_languages.csv
        "spoken_languages.english_name": "category",
        "spoken_languages.iso_3166_1": "category",
        "spoken_languages.name": str,
        # details.csv
        "adult": bool,
        "backdrop_path": str,
        "budget": int,
        "homepage": str,
        "imdb_id": str,
        "original_language": "category",
        "original_title": str,
        "overview": str,
        "popularity": float,
        "poster_path": str,
        "release_date": "datetime64[ns]",
        "revenue": int,
        "runtime": "int16",
        "status": "category",
        "tagline": str,
        "title": str,
        "video": bool,
        "vote_average": float,
        "vote_count": "int16",
    }
    # Only cast the columns that are actually present in this dataframe.
    for k, v in types.items():
        if k in df.columns:
            df[k] = df[k].astype(v)
    return df
tmdb_id is listed under mapping.csv only, but it is also present in all other dataframes.
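To see what the cast does in practice, here is a minimal sketch that applies the same loop to a small, made-up genres dataframe. The standalone `assign_types` function and the `TYPES` subset are illustrative assumptions, not part of movieparse itself.

```python
import pandas as pd

# Assumption: a small subset of the type mapping shown above.
TYPES = {
    "tmdb_id": "int32",
    "genres.id": "int8",
    "genres.name": str,
}


def assign_types(df: pd.DataFrame, types: dict) -> pd.DataFrame:
    """Cast only the columns of df that appear in the mapping."""
    for column, dtype in types.items():
        if column in df.columns:
            df[column] = df[column].astype(dtype)
    return df


# Made-up rows for illustration; pandas reads these integers as int64 by default.
genres = pd.DataFrame(
    {
        "tmdb_id": [603, 603],
        "genres.id": [28, 878],
        "genres.name": ["Action", "Science Fiction"],
    }
)
genres = assign_types(genres, TYPES)
print(genres.dtypes)
```

Columns missing from the dataframe are skipped, which is why a single mapping can serve every metadata file.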
Manual ID Mapping
Because movieparse picks the most popular match for a given title and release year from the TMDB API, the result might not always be the movie you’re looking for.
If you notice that tmdb_id doesn’t match your expected movie, you can specify a tmdb_id_man yourself in mapping.csv:
| tmdb_id | tmdb_id_man | input | canonical_input |
|---|---|---|---|
| 123 | 603 | 1999 The Matrix | 1999 The Matrix |
The next time you run movieparse, metadata will be looked up for both ids 123 and 603.
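If you prefer to script this edit, the following sketch sets tmdb_id_man for one row using only the standard library. It works on an in-memory stand-in for mapping.csv; the helper name `set_manual_id` and the placeholder value `0` for "no manual id" are assumptions made for illustration.

```python
import csv
import io

# In-memory stand-in for mapping.csv.
# Assumption: 0 in tmdb_id_man means that no manual id has been set.
MAPPING_CSV = """tmdb_id,tmdb_id_man,input,canonical_input
123,0,1999 The Matrix,1999 The Matrix
"""


def set_manual_id(csv_text: str, canonical_input: str, tmdb_id_man: int) -> str:
    """Return the CSV text with tmdb_id_man set for the matching row(s)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    for row in rows:
        if row["canonical_input"] == canonical_input:
            row["tmdb_id_man"] = str(tmdb_id_man)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


updated = set_manual_id(MAPPING_CSV, "1999 The Matrix", 603)
print(updated)
```

To apply this to a real file, you would read mapping.csv into a string, pass it through the helper, and write the result back.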
Parsing Styles
Supported Patterns & Examples
Multiple patterns are provided for extracting the release year and title. They are showcased with this data:
| year | title |
|---|---|
| 1999 | The Matrix |
| Pattern # | Input |
|---|---|
| 0 | 1999 The Matrix |
| 1 | 1999 - The Matrix |
| 2 | The Matrix 1999 |
Tip
You can click the link in the second column to get a visual representation on regex-vis. You can also test if your examples match!
@staticmethod
def get_parsing_patterns() -> dict[int, re.Pattern[str]]:
    """Lists all valid patterns for extracting title and (optionally) release year from input.

    Returns:
        A dict mapping integer keys to their regex pattern.
    """
    return {
        0: re.compile(r"^(?P<disk_year>\d{4})\s{1}(?P<disk_title>.+)$"),
        1: re.compile(r"^(?P<disk_year>\d{4})\s-\s(?P<disk_title>.+)$"),
        2: re.compile(r"^(?P<disk_title>.+)\s(?P<disk_year>\d{4})$"),
    }
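To check how the three patterns behave, the sketch below copies them into a module-level dict and runs each against an input in its own style. The `extract` helper and the `EXAMPLES` dict are hypothetical names used only for this illustration.

```python
import re

# The three patterns from get_parsing_patterns(), copied verbatim.
PATTERNS = {
    0: re.compile(r"^(?P<disk_year>\d{4})\s{1}(?P<disk_title>.+)$"),
    1: re.compile(r"^(?P<disk_year>\d{4})\s-\s(?P<disk_title>.+)$"),
    2: re.compile(r"^(?P<disk_title>.+)\s(?P<disk_year>\d{4})$"),
}

# One example input per pattern, built from year 1999 and title "The Matrix".
EXAMPLES = {
    0: "1999 The Matrix",
    1: "1999 - The Matrix",
    2: "The Matrix 1999",
}


def extract(pattern_id: int, text: str):
    """Return (year, title) if the pattern matches, otherwise None."""
    match = PATTERNS[pattern_id].match(text)
    if match is None:
        return None
    return int(match["disk_year"]), match["disk_title"]


for pattern_id, example in EXAMPLES.items():
    print(pattern_id, repr(example), extract(pattern_id, example))
```

Note that pattern 0 also matches input written in the style of pattern 1: `extract(0, "1999 - The Matrix")` yields the title `"- The Matrix"`, so choosing the right pattern for your naming scheme matters.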
Unsupported Patterns
If you feel like a pattern is missing, feel free to create a Pull Request!
Automatic Choice of Patterns
If you leave the parsing_style argument empty, it defaults to -1, and the best pattern for your input is estimated.
Your input is run through all patterns, and the pattern with the highest accuracy is used for the lookups. No lookups are made for input that doesn’t comply with the chosen pattern.
If you’re using strict: True, a fallback lookup may be made with title only.
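The estimation step can be sketched roughly as follows. This is not movieparse's actual implementation: it assumes that "accuracy" means the share of inputs a pattern matches, and `estimate_best_pattern` is a hypothetical helper name.

```python
import re

# The three patterns from get_parsing_patterns(), copied verbatim.
PATTERNS = {
    0: re.compile(r"^(?P<disk_year>\d{4})\s{1}(?P<disk_title>.+)$"),
    1: re.compile(r"^(?P<disk_year>\d{4})\s-\s(?P<disk_title>.+)$"),
    2: re.compile(r"^(?P<disk_title>.+)\s(?P<disk_year>\d{4})$"),
}


def estimate_best_pattern(inputs, patterns=PATTERNS):
    """Return (pattern_id, share) for the pattern that matches the most inputs.

    Assumption: accuracy is the fraction of inputs matched by a pattern.
    Ties are resolved in favor of the lowest pattern id.
    """
    if not inputs:
        raise ValueError("at least one input is required")
    scores = {
        pattern_id: sum(pattern.match(text) is not None for text in inputs) / len(inputs)
        for pattern_id, pattern in patterns.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up folder names: two follow pattern 2, one follows pattern 0.
titles = ["The Matrix 1999", "Heat 1995", "1999 Fight Club"]
best, share = estimate_best_pattern(titles)
print(best, share)
```

In this example pattern 2 wins because it matches two of the three inputs; under the behavior described above, "1999 Fight Club" would then receive no lookup.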
Error Codes / Invalid Ids
default_codes = {
    "DEFAULT": 0,
    "NO_RESULT": -1,
    "NO_EXTRACT": -2,
    "BAD_RESPONSE": -3,